Lessons Learnt While Building an ETL Pipeline for MongoDB & Amazon Redshift Using Apache Airflow
Recently, I was involved in building an ETL(Extract-Transform-Load) pipeline. It included extracting data from MongoDB collections, perform transformations and then loading it into Redshift tables. Many ETL solutions are available in the market which kind-of solves the issue, but the key part of an ETL process lies in its ability to transform or process raw data before it is pushed to its destination.
Each ETL pipeline comes with a specific business requirement around processing data which is hard to be achieved using off-the-shelf ETL solutions. This is why a majority of ETL solutions are custom built manually, from scratch. In this blog, I am going to talk about my learnings around building a custom ETL solution which involved moving data from MongoDB to Redshift using Apache Airflow.
I began by writing a Python-based command line tool which supported different phases of ETL, like extracting data from MongoDB, processing extracted data locally, uploading the processed data to S3, loading data from S3 to Redshift, post-processing, and cleanup. I used the PyMongo library to interact with MongoDB and the Boto library for interacting with Redshift and S3.
I kept each operation atomic so that multiple instances of each operation can run independently of each other, which will help to achieve parallelism. One of the major challenges was to achieve parallelism while running the ETL tasks. One option was to develop our own framework based on threads or developing a distributed task scheduler tool using a message broker tool like Celery combined with RabbitMQ. After doing some research I settled for Apache Airflow. Airflow is a Python-based scheduler where you can define DAGs (Directed Acyclic Graphs), which would run as per the given schedule and run tasks in parallel in each phase of your ETL. You can define DAG as Python code and it also enables you to handle the state of your DAG run using environment variables. Features like task retries on failure handling are a plus.
We faced several challenges while getting the above ETL workflow to be near real-time and fault tolerant. We discuss the challenges faced and the solutions below:
Keeping your ETL code changes in sync with Redshift schema
While you are building the ETL tool, you may end up fetching a new field from MongoDB, but at the same time, you have to add that column to the corresponding Redshift table. If you fail to do so the ETL pipeline will start failing. In order to tackle this, I created a database migration tool which would become the first step in my ETL workflow.
The migration tool would:
- keep the migration status in a Redshift table and
- would track all migration scripts in a code directory.
In each ETL run, it would get the most recently ran migrations from Redshift and would search for any new migration script available in the code directory. If found it would run the newly found migration script after which the regular ETL tasks would run. This adds the onus on the developer to add a migration script if he is making any changes like addition or removal of a field that he is fetching from MongoDB.
Maintaining data consistency
While extracting data from MongoDB, one needs to ensure all the collections are extracted at a specific point in time else there can be data inconsistency issues. We need to solve this problem at multiple levels:
- While extracting data from MongoDB define parameters like modified date and extract data from different collections with a filter as records less than or equal to that date. This will ensure you fetch point in time data from MongoDB.
- While loading data into Redshift tables, don’t load directly to master table, instead load it to some staging table. Once you are done loading data in staging for all related collections, load it to master from staging within a single transaction. This way data is either updated in all related tables or in none of the tables.
A single bad record can break your ETL
While moving data across the ETL pipeline into Redshift, one needs to take care of field formats. For example, the Date field in the incoming data can be different than that in the Redshift schema design. Another example can be that the incoming data can exceed the length of the field in the schema. Redshift’s COPY command which is used to load data from files to redshift tables is very vulnerable to such changes in data types. Even a single incorrectly formatted record will lead to all your data getting rejected and effectively breaking the ETL pipeline.
There are multiple ways in which we can solve this problem. Either handle it in one of the transform jobs in the pipeline. Alternately we put the onus on Redshift to handle these variances. Redshift’s COPY command has many options which can help you solve these problems. Some of the very useful options are
- ACCEPTANYDATE: Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error.
- ACCEPTINVCHARS: Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters.
- TRUNCATECOLUMNS: Truncates data in columns to the appropriate number of characters so that it fits the column specification.
Redshift going out of storage
Redshift is based on PostgreSQL and one of the common problems is when you delete records from Redshift tables it does not actually free up space. So if your ETL process is deleting and creating new records frequently, then you may run out of Redshift storage space. VACUUM operation for Redshift is the solution to this problem. Instead of making VACUUM operation a part of your main ETL flow, define a different workflow which runs on a different schedule to run VACUUM operation. VACUUM operation reclaims space and resorts rows in either a specified table or all tables in the current database. VACUUM operation can be FULL, SORT ONLY, DELETE ONLY & REINDEX. More information on VACUUM can be found here.
ETL instance going out of storage
Your ETL will be generating a lot of files by extracting data from MongoDB onto your ETL instance. It is very important to periodically delete those files otherwise you are very likely to go out of storage on your ETL server. If your data from MongoDB is huge, you might end up creating large files on your ETL server. Again, I would recommend defining a different workflow which runs on a different schedule to run a cleanup operation.
Making ETL Near Real Time
Processing only the delta rather than doing a full load in each ETL run
ETL would be faster if you keep track of the already processed data and process only the new data. If you are doing a full load of data in each ETL run, then the solution would not scale as your data scales. As a solution to this, we made it mandatory for the collection in our MongoDB to have a created and modified date. Our ETL would check the maximum value of the modified date for the given collection from the Redshift table. It will then generate the filter query to fetch only those records from MongoDB which have modified date greater than that of the maximum value. It may be difficult for you to make changes in your product, but it’s worth the effort!
Compressing and splitting files while loading
A good approach is to write files in some compressed format. It saves your storage space on ETL server and also helps when you load data to Redshift. Redshift COPY command suggests that you provide compressed files as input. Also instead of a single huge file, you should split your files into parts and give all files to a single COPY command. This will enable Redshift to use it’s computing resources across the cluster to do the copy in parallel, leading to faster loads.
Streaming mongo data directly to S3 instead of writing it to ETL server
One of the major overhead in the ETL process is to write data first to ETL server and then uploading it to S3. In order to reduce disk IO, you should not store data to ETL server. Instead, use MongoDB’s handy stream API. For MongoDB Node driver, both the collection.find() and the collection.aggregate() function return cursors. The stream method also accepts a transform function as a parameter. All your custom transform logic could go into the transform function. AWS S3’s node library’s upload() function, also accepts readable streams. Use the stream from the MongoDB Node stream method, pipe it into zlib to gzip it, then feed the readable stream into AWS S3’s Node library. Simple! You will see a large improvement in your ETL process by this simple but important change.
Optimizing Redshift Queries
Optimizing Redshift Queries helps in making the ETL system highly scalable, efficient and also reduce the cost. Let's look at some of the approaches:
Add a distribution key
Redshift database is clustered, meaning your data is stored across cluster nodes. When you query for certain set of records, Redshift has to search for those records in each node, leading to slow queries. A distribution key is a single metric, which will decide the data distribution of all data records across your tables. If you have a single metric which is available for all your data, you can specify it as a distribution key. When loading data into Redshift, all data for a certain value of distribution key will be placed on a single node of Redshift cluster. So when you query for certain records Redshift knows exactly where to search for your data. This is only useful when you are also using the distribution key to query the data.
Generating a numeric primary key for string primary key
In MongoDB, you can have any type of field as your primary key. If your Mongo collections are having a non-numeric primary key and you are using those same keys in Redshift, your joins will end up being on string keys which are slower. Instead, generate numeric keys for your string keys and joining on it which will make queries run much faster. Redshift supports specifying a column with an attribute as IDENTITY which will auto-generate numeric unique value for the column which you can use as your primary key.
In this blog, I have covered the best practices around building ETL pipelines for Redshift based on my learnings. There are many more recommended practices which can be easily found in Redshift and MongoDB documentation.
Do let us know if you have any questions, queries and additional thoughts in the comments section below.
Velotio Technologies is an outsourced software product development partner for technology startups and enterprises. We specialize in enterprise B2B and SaaS product development with a focus on artificial intelligence and machine learning, DevOps, and test engineering. We combine innovative ideas with business expertise and cutting-edge technology to drive business success for our customers.