An NLP Data Pipeline for… $5? Sure!

Bartłomiej Poniecki-Klotz
weles.ai
8 min read · Aug 4, 2021

In this article, we will explore how to create a data pipeline for just $5/month.

Data pipeline — it sounds proud. And expensive. This does not have to be true! Quite the opposite, it can be cheap and powerful.

Photo by Darya Jum on Unsplash

To start, we will sketch a generic architecture for NLP projects and explain the difference between batch and stream processing, which will let you select the right services for your data pipeline. Later we will split the data preparation process into steps and dive into each of them. Last but not least, we will calculate the cost of the architecture and propose some improvements for the future.

After reading, try building the data pipeline yourself using AWS services. There is nothing better than getting your hands dirty. Use the free tier to avoid costs.

NLP

Our data pipeline will be the heart of the NLP project, so let's see what NLP is. NLP stands for Natural Language Processing. It is about understanding the information in a text or generating new text. Sounds like talking with Alexa or Siri? You are right. A voice assistant has a Machine Learning model responsible for extracting information from questions and generating output text as the answer. Developing such models can be the output of an NLP project.

NLP projects rely heavily on Machine Learning to get new insights from textual data. One key point is that most models work best for English, while most companies have multi-language texts. The simplest way to solve this challenge is translation, and that is what we will do in our pipeline.

Project

Now we know what NLP is and what kind of data it uses. Before we jump into the world of cloud architectures, let's think about a problem our project can solve. Imagine we work for an e-commerce company that collects satisfaction feedback from clients after they buy a product. The company is present in many markets, so comments can be written in many languages. The data volume is not large, because not everyone leaves a comment after buying. During daily meetings, we want to present a dashboard with the sentiment and the key topics mentioned. We have high hopes, a tiny budget, and the AWS cloud behind us. What can stop us?!

Serverless data pipeline for NLP project

Data pipeline

We are building the data pipeline to present our data on a dashboard. The pipeline will support automated data acquisition, transformation, enrichment and storage. Data sources can deliver data in the form of batches or streams.

Source — Batch or Stream

The type of data source will tell us whether we are working with batch or streaming data. But first, we need to know the difference. Let's use a mining site to explain. If you use a big truck to move coal to the storage area every hour, you do batch processing. If you use a conveyor belt that constantly moves coal to the storage area, you do stream processing.

Data acquisition and storage

All the comments are stored in an AWS S3 bucket for analytics purposes by the survey microservice. The data is extracted from the database daily, which indicates that we will be working with batches of data. In general, batch processing is cheaper than stream processing. If you do not want to incur additional costs, keep processing and storage in the same region. Otherwise, you will have to pay a data transfer fee.
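As an illustration, here is a minimal sketch of what the daily export step could look like as a Lambda function. The bucket name, the key layout and the fetch_comments helper are all hypothetical; the real survey microservice will have its own way of querying yesterday's comments.

```python
import json
from datetime import date, timedelta

import boto3

s3 = boto3.client("s3")

def fetch_comments(day):
    """Placeholder for the survey microservice's database query."""
    return [{"comment": "Great product!", "language": "en", "day": day.isoformat()}]

def export_daily_comments(event, context):
    """Export yesterday's survey comments to S3 as a single daily batch."""
    day = date.today() - timedelta(days=1)
    comments = fetch_comments(day)
    key = f"raw/comments/{day.isoformat()}.json"  # hypothetical key layout
    s3.put_object(
        Bucket="nlp-comments-bucket",  # hypothetical bucket name
        Key=key,
        Body=json.dumps(comments).encode("utf-8"),
    )
    return {"exported": len(comments), "key": key}
```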

One thing to remember: an ML project is a joint venture with data scientists, and they cannot do any work without the data. As soon as the data source is identified, extract a sample and provide it to them. A data engineer who thinks a step ahead is a valuable asset for the team.

Data for ML purposes needs to be stored cheaply and be conveniently accessible from a Jupyter notebook. AWS S3 is great for this. Considering the file format, JSON is not optimal for storing analytical data. Parquet is much better because it is a binary, column-oriented format: it takes less storage space and performs great with operations like aggregations or statistics. Pandas, one of data scientists' favourite libraries, supports reading files directly from an S3 bucket.
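For example, loading such a Parquet file from a notebook is a one-liner with pandas. This sketch assumes the pyarrow and s3fs packages are installed; the bucket path and the column name are placeholders.

```python
import pandas as pd

# Read the Parquet file straight from S3; pandas delegates the S3 access to s3fs.
df = pd.read_parquet("s3://nlp-comments-bucket/curated/comments.parquet")

print(df.head())
print(df["language"].value_counts())  # assumes a 'language' column exists
```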

Data transformation

Before you start any transformations, Exploratory Data Analysis (EDA) is needed. In short, you need to understand the data. The main finding is that our texts are in multiple languages, so looking for the most frequently used words will not be effective. Let's translate the text into English, which will allow us to do EDA and use many pre-trained models. Writing our own translation Machine Learning model would take too much time and money, so let's use the ML service provided by AWS: AWS Translate. Translating the historical data is one of the biggest upfront costs in an NLP project, but it is done only once. AWS Translate is a good translator, but you need to be prepared for the rare case where it cannot translate a text. Create a process to handle this; manual intervention is also an option, and your orchestration service should support it.
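A minimal sketch of the translation step with boto3 could look like the snippet below. The fallback behaviour, returning the original text so it can be routed to manual intervention, is one possible choice, not the only one.

```python
import boto3
from botocore.exceptions import ClientError

translate = boto3.client("translate")

def translate_comment(text: str) -> str:
    """Translate a single comment to English, falling back to the original text."""
    try:
        response = translate.translate_text(
            Text=text,
            SourceLanguageCode="auto",  # let AWS Translate detect the source language
            TargetLanguageCode="en",
        )
        return response["TranslatedText"]
    except ClientError:
        # Rare case: the text could not be translated. Keep the original so the
        # orchestration layer can flag it for manual intervention.
        return text
```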

Data scientists use the translated data to experiment, while the data pipeline uses it for downstream processing like ML model inference. Data scientists are interested in the translated data as well as raw data samples, so keep that in mind and give them access ahead of time.

Models hosting and inference

Data scientists create new models using the translated data. A model saved in a binary format is called an artefact, and we will use these artefacts during inference. Inference is the process in which models solve problems they did not see during training: the real work!

Our models are lightweight and require few resources. The cheapest way to host them is AWS Lambda functions. We have already discussed how to deploy models on AWS Lambda efficiently in a previous article.

We will deploy a single spaCy pipeline to extract sentiment, part-of-speech (POS) tags and named entities (NER) with custom entity types. The configured model will be wrapped in a Docker image, ready to be deployed as an AWS Lambda function.
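A rough sketch of the Lambda handler is below. It assumes POS tags and entities come from the packaged pipeline, while sentiment comes from a custom Doc extension added during training, since spaCy does not ship a sentiment component by itself; the model name is a placeholder.

```python
import spacy

# Load the pipeline once per Lambda container, not once per request.
nlp = spacy.load("en_core_web_sm")  # placeholder; the real image bundles the trained pipeline

def handler(event, context):
    doc = nlp(event["text"])
    return {
        "tokens": [{"text": t.text, "pos": t.pos_} for t in doc],
        "entities": [{"text": e.text, "label": e.label_} for e in doc.ents],
        # Sentiment would be set by a custom Doc extension in the trained pipeline.
        "sentiment": doc._.get("sentiment") if doc._.has("sentiment") else None,
    }
```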

If your model is big or requires a GPU for inference, check out AWS Batch. It works with Spot Instances, which makes inference much cheaper.

Data access layer

The data will be used in a report built in one of the visualization tools, refreshed once per day. The process is asynchronous, so the time needed to get a result can be seconds rather than milliseconds. In AWS, we can run SQL queries against the data stored in the S3 bucket using AWS Athena. It is a Presto-based tool that lets you write complex queries and execute them directly on data stored in files, and you pay only for the data scanned by each query. It's a good choice for our use case. If you want to optimize the cost further, look at the AWS blog for more details.
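For illustration, a daily sentiment breakdown could be pulled with a query like the one below. The database, table and output location are placeholders for this sketch.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT sentiment, COUNT(*) AS comments
        FROM comments_enriched
        WHERE day = DATE '2021-08-03'
        GROUP BY sentiment
    """,
    QueryExecutionContext={"Database": "nlp_pipeline"},
    ResultConfiguration={"OutputLocation": "s3://nlp-comments-bucket/athena-results/"},
)
query_id = response["QueryExecutionId"]
# Poll get_query_execution(QueryExecutionId=query_id) until the state is
# SUCCEEDED, then fetch rows with get_query_results().
```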

If you need quick access to data using complex SQL queries, a data warehouse is what you should look for. Both AWS Redshift and Snowflake are valid options. Be aware of data warehouse pricing, though: it is a tradeoff between access time and cost.

Process orchestration

Now let's put it all together in the form of a scheduled process. There are many orchestration frameworks to choose from; some of the most frequently used are Airflow, Argo, Step Functions and Luigi. To keep the budget around $5, we have to drop all self-managed solutions. What's left from the list is AWS-managed Airflow and AWS Step Functions. For the latter, we pay only for executions, with no fixed cost, and the integration with other AWS services is seamless. We will go with AWS Step Functions.
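As a sketch, the state machine below chains the translation and inference Lambdas. The ARNs, account ID and IAM role are placeholders, and a daily EventBridge rule would trigger start_execution on a schedule.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: translate, then run inference.
definition = {
    "StartAt": "Translate",
    "States": {
        "Translate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:translate-comments",
            "Next": "Inference",
        },
        "Inference": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:nlp-inference",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="nlp-daily-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/nlp-pipeline-role",  # placeholder role
)
```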

Cost calculation

Let's explore the costs of our data pipeline. We aim to keep the monthly usage cost under $5.

Our assumptions for the cost calculation were provided by the data scientists while performing EDA. We expect to support 50 comments per day, each 150 characters long on average.

Cost calculation for the infrastructure at 50 comments per day, 150 characters each: AWS Translate is the biggest item ($3.38); all services sum to $3.55.
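As a sanity check on the figure above, the AWS Translate line item can be reproduced with a quick back-of-the-envelope calculation. It assumes the standard on-demand price of $15 per million characters, which may differ by region or change over time.

```python
comments_per_day = 50
chars_per_comment = 150
days_per_month = 30

chars_per_month = comments_per_day * chars_per_comment * days_per_month  # 225,000 characters
translate_cost = chars_per_month / 1_000_000 * 15.00  # assumed $15 per million characters
print(f"AWS Translate: ${translate_cost:.2f} per month")  # ~$3.38
```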

We assumed that 1 GB is used for historical storage, which is a lot more than required. Keeping old records can be further optimized using S3 storage tiers. See AWS's documentation on S3 storage classes for more hints on cost-effective storage.

AWS Translate is a significant part of the infrastructure cost, and the bill scales proportionally with the number of requests and the length of comments. For the PoC's purposes we are within budget, and leveraging the translation service provided by AWS allowed us to deliver a working solution quickly.

Now it is time to improve our cost-effectiveness, which we can do thanks to the loosely coupled architecture. Let's modify the "Translation Lambda" to use a pre-trained open-source translation model and save the results. This spares quite a bit of the infrastructure cost, but it increases the solution's development cost. Additionally, AWS Translate provides slightly better translation quality and supports more languages. There is always a trade-off.

Cost calculation for the infrastructure at 1500 comments per day, 150 characters each, after using Lambda for translation: all services sum to $4.56.

How many more comments can we support within the budget? The introduced change allows us to handle 1500 comments per day within the $5 budget. That's a 30-fold increase!
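For comparison, a rough estimate of what AWS Translate alone would cost at that volume, under the same $15 per million characters assumption as above, shows why moving translation into Lambda pays off.

```python
comments_per_day = 1500
chars_per_month = comments_per_day * 150 * 30  # 6,750,000 characters
translate_cost = chars_per_month / 1_000_000 * 15.00
print(f"AWS Translate alone at this volume: ${translate_cost:.2f} per month")  # ~$101.25
```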

Summary

The data pipeline can be further improved by adding process monitoring and centralized logging. New steps can easily be introduced by attaching them to the AWS Step Functions workflow. One significant cost that is not accounted for is the data engineer's working time. Automation will minimize the time we spend maintaining the solution in the future, but it requires an upfront investment.

We walked through the whole process of creating a data pipeline for an NLP project. We helped the data scientists get their hands on the data quickly, and we understand how all the parts of the data pipeline fit together. In the end, we estimated the price of the solution and improved the system's cost-effectiveness within the budget. Great job!

Now roll up your sleeves and get to work!

For more MLOps hands-on guides, tutorials and code examples, follow me on Medium and contact me via social media.
