Awesome tools for data engineers
In the sea of new tools and libraries coming out daily, there are some you don’t want to miss.
It’s difficult not to be overwhelmed by the sheer number of libraries, frameworks, and cloud solutions available to engineers for extracting insights from data and building data pipelines. I would like to point out a few tools that stand out. They can be beneficial to your work and are certainly worth investing some time into learning. And I know, everything is AI nowadays, but these are not AI tools.
dbt
What is it: SQL-based framework for structured data transformation.
Website: https://www.getdbt.com/product/what-is-dbt
Why use it:
- define data transformations in a clear and modular way
- store data transformations as code under version control
- test transformations before applying changes
- validate source data before applying transformations
- transparently deal with scheduling.
When dealing with structured data, i.e. something that resembles a table and is stored in a database, warehouse, or a data lake, the need often arises to transform and reshape it into a different structure, granularity, or dimension. One can always write some custom code for that, but dbt handles routine operations and leaves one to define inputs, expectations, and outputs, avoiding a lot of boilerplate code in between.
With dbt, the engineer is expected to define the source the data is coming from, i.e. the database, tables, and columns. The next step is writing models, i.e. the logic for selecting and reshaping the source data. Models can be written in SQL or Python. To validate expected inputs and outputs, users may write tests. After that, you run dbt commands to apply the transformations. If you use dbt Cloud, a schedule to run transformations periodically can be created right in the UI; otherwise, you will need other tools (e.g. cron or scripts) to handle that.
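To give a taste of what a model looks like, here is a minimal sketch of a dbt Python model. The source, table, and column names are hypothetical, and I’m assuming the dbt-duckdb adapter, where dbt.source() returns a relation that .df() converts to a pandas DataFrame:

```python
# models/daily_orders.py -- a hypothetical dbt Python model
def model(dbt, session):
    dbt.config(materialized="table")

    # "shop" and "raw_orders" would be declared in a sources .yml file
    orders = dbt.source("shop", "raw_orders").df()

    # Reshape the raw orders into one row per customer per day
    daily = (
        orders.groupby(["customer_id", "order_date"], as_index=False)
              .agg(total_amount=("amount", "sum"))
    )
    return daily  # dbt materializes the returned dataframe as a table
```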
If you’re a data engineer or a data scientist looking for a job, you may have encountered dbt in many, many open job descriptions.
SQLMesh
What is it: an open source data transformation framework written in Python.
Website: https://sqlmesh.readthedocs.io/en/stable/
Why use it:
- full compatibility with dbt, plus
- a built-in scheduler
- column-level lineage
- SQL dialect translator
- virtual data environments (production / development)
- partition-based incremental models
- no need for Jinja templates, just SQL and Python.
SQLMesh took dbt and made it better. It does not just automate away the routine parts of data transformation; it sets out to be a single tool to build, test, deploy, and operate a data pipeline. It integrates with tools such as Airflow, dbt, and GitHub Actions, and ships with a CI/CD bot.
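As a sketch of what a SQLMesh Python model looks like (following the shape of the Python model API in the SQLMesh docs; the model, table, and column names are made up):

```python
from datetime import datetime
import typing as t

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "analytics.daily_orders",  # hypothetical model name
    columns={"customer_id": "int", "order_date": "date", "total": "double"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # fetchdf runs SQL on the connected engine, returning a pandas DataFrame
    return context.fetchdf(
        f"""
        SELECT customer_id, order_date, SUM(amount) AS total
        FROM raw.orders
        WHERE order_date BETWEEN '{start:%Y-%m-%d}' AND '{end:%Y-%m-%d}'
        GROUP BY customer_id, order_date
        """
    )
```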
Ibis
What is it: An open source dataframe library that works with multiple backends.
Website: https://ibis-project.org/
Why use it:
- familiar dataframe API inspired by pandas and polars
- 20+ backends supported
- no code changes whether dealing with a local DuckDB file or a production Oracle warehouse
- can translate dataframe operations to SQL
- visualizations with Altair, GraphViz, marimo, matplotlib, plotly, plotnine, seaborn, Streamlit.
There are many dataframe libraries to choose from. However, you would usually need a separate connectivity layer, such as SQLAlchemy, to connect the dataframe library to a database backend. In addition, since the dataframe library materializes the queried data, you have to be mindful to limit and subset the data before exploration. With Ibis, SQL is a first-class citizen: dataframe transformations are translated into SQL, which delegates the heavy lifting to the database engine and reduces the need to hold a lot of data in memory. With Ibis, there’s no need to choose between the dataframe API and SQL; they complement one another.
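A minimal sketch, assuming a local DuckDB file and an orders table (both hypothetical):

```python
import ibis

# Connect to a local DuckDB file; the same code works against other backends
con = ibis.duckdb.connect("warehouse.db")
orders = con.table("orders")

# Build the expression lazily; nothing runs until .execute()
daily = (
    orders.group_by(["customer_id", "order_date"])
          .aggregate(total=orders.amount.sum())
)
print(ibis.to_sql(daily))  # inspect the SQL Ibis generates
df = daily.execute()       # run on the backend, get back a pandas DataFrame
```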
Ibis is written and supported by Voltron Data.
Polars
What is it: DataFrame library for manipulating structured data.
Website: https://docs.pola.rs/
Why use it:
- very fast, written in Rust
- familiar dataframe API
- parallel data processing across all CPU cores
- vectorized query engine
- GPU support.
The Polars library was created as a faster alternative to pandas. In contrast to Ibis, Polars translates SQL into dataframe expressions, which are then optimized by its vectorized query engine. To represent data in memory, Polars uses the Apache Arrow columnar format.
Polars is able to handle datasets of up to 200 GB on a single machine when using the lazy API, since it does not have to load all the data into memory. For data exploration, either Polars or Ibis is a better alternative to pandas. With Polars, it’s important to learn the expression API, lazy evaluation, and the differences in syntax between pandas and Polars that can affect performance.
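A small sketch of the lazy API (the Parquet file and column names are hypothetical):

```python
import polars as pl

# scan_parquet is lazy: nothing is read until .collect(),
# so filters and projections can be pushed down into the scan
lf = (
    pl.scan_parquet("orders.parquet")
      .filter(pl.col("amount") > 0)
      .group_by("customer_id")
      .agg(pl.col("amount").sum().alias("total"))
)
df = lf.collect()  # the query plan is optimized, then executed in parallel
```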
SQLGlot
What is it: SQL parser, transpiler, optimizer, and engine.
Website: https://github.com/tobymao/sqlglot
Why use it:
- parse SQL into an AST
- optimize SQL
- transpile between dialects
- dynamically build an AST and generate SQL from it.
To handle translation between SQL dialects, both SQLMesh and Ibis mentioned above use the excellent SQLGlot library. As a data engineer, you may or may not use it directly; either way, you should know about it, because it can come in handy in more cases than just SQL dialect translation.
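A quick sketch of transpiling and parsing with SQLGlot (the queries are trivial placeholders):

```python
import sqlglot

# Transpile DuckDB SQL into the Spark dialect
print(sqlglot.transpile(
    "SELECT EPOCH_MS(1618088028295)", read="duckdb", write="spark"
)[0])

# Parse SQL into an AST and walk it
tree = sqlglot.parse_one("SELECT a + 1 AS b FROM t")
for column in tree.find_all(sqlglot.exp.Column):
    print(column.name)
```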
Airbyte
What is it: an open-source data integration engine.
Website: https://airbyte.com/
Why use it:
- move data from a source to a destination
- over 350 pre-built connectors
- custom connector builder.
Airbyte enables building data pipelines with no code. Data can be replicated from a source to a destination with a few clicks in the UI. You can configure how frequently the data is synced, which data is replicated, and how the data is written to the destination. Pipelines can be hosted in Airbyte Cloud or self-hosted. The payment model is based on MAR, i.e. monthly active (synced) records. Airbyte is like Fivetran (below) for the middle class.
Fivetran
What is it: a powerful and feature-rich data movement engine.
Website: https://www.fivetran.com/
Why use it:
- you have a lot of data in a lot of different sources, and complex requirements as to how it should be moved and synced
- speed and scalability are extremely important
- you don’t want to write any code
- you have a lot of money to pay for it
- over 500 connectors out of the box.
Fivetran is reliable, easy to use, feature-rich, and much, much more expensive than Airbyte.
Apache Arrow
What is it: library providing columnar data structures for in-memory computing.
Website: https://arrow.apache.org/
Why use it:
- Efficient data transfer between processes on the same machine
- Data structures for in-memory computing
- The columnar file format can be mapped directly from disk into in-memory structures, enabling analytics without the overhead of deserialization.
If you’re thinking “all of that just went over my head”, I don’t blame you. That’s how I felt too when I first learned about Arrow. The best way I can explain it is this: imagine you have table-like data on disk, but when you need to run a query on it, the query engine works directly with the bytes on disk, without first reading them into an intermediate structure in memory. This makes the process both faster and more memory-efficient, since the data is never copied.
A common use case for Arrow is transferring a large table between processes on the same machine, or between machines connected by a very fast network.
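A minimal sketch with PyArrow, using a tiny made-up table. Feather v2 files use the Arrow IPC format, so they can be memory-mapped and read without copying:

```python
import pyarrow as pa
import pyarrow.feather as feather

# Build a small table and persist it in the Arrow IPC (Feather v2) format
table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
feather.write_feather(table, "example.arrow")

# Memory-map the file: the OS pages bytes in on demand instead of
# reading and deserializing the whole file into RAM
with pa.memory_map("example.arrow") as source:
    mapped = pa.ipc.open_file(source).read_all()
print(mapped.num_rows)
```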
DuckDB
What is it: an in-process database with a rich SQL engine that doesn’t require running a server.
Website: https://duckdb.org/
Why use it:
- Portable, no external dependencies
- Feature-rich SQL dialect
- Vectorized columnar query engine
- Able to analyze medium-to-large datasets.
DuckDB is a perfect fit when you need a “test database” alongside a “production database”. In the past, developers would often use SQLite, but DuckDB is a lot more powerful than SQLite for analytics, while being just as easy to use.
Another use case for DuckDB is data exploration. It can load CSV or Parquet files from disk and expose them as tables queryable with its SQL engine.
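For example, a Parquet file can be queried in place (the file and column names are hypothetical):

```python
import duckdb

con = duckdb.connect()  # in-memory; pass a file path to persist the database
df = con.execute(
    """
    SELECT customer_id, SUM(amount) AS total
    FROM 'orders.parquet'  -- DuckDB scans the file directly
    GROUP BY customer_id
    """
).df()  # fetch the result as a pandas DataFrame
```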
Keep in mind that DuckDB is not intended to be a shared database. The whole database is stored in a single file, and while it supports multiple reader processes, there can be only one writer process.
Summary
These are the tools that I have found useful, powerful, and worth knowing for a data engineer (with Airbyte and Fivetran being perhaps a better fit for a data scientist). But there are so many more that I didn’t mention, and probably even more that I have never heard of or explored. Comment and let me know what’s on your “must learn” list.