Data Pipelines

Store → Process → Consume

OMAR ELGABRY
May 1 · 8 min read

Most data processing applications look like a pipeline. A pipeline is a flow, a pipe-and-filter architecture, where data gets stored, processed, and finally consumed. These are the three main stages of a pipeline.

One of the key design principles here is that each stage should be independent of the others. For example, the place where the data gets stored is independent of who is processing it. Similarly, whoever is processing the data doesn’t need to know who is going to consume it.

Another characteristic is that at any given point we should be able to replace an existing tool or a service without having to worry about changing other parts in the pipeline. For example, one can decide to switch from one visualization tool to another to reduce costs.

Finally, what each stage actually means and does will be discussed below, but the trade-offs and which tool to use won’t. It is important to choose the right tool based on various aspects such as data format, latency, throughput, reliability, availability, security, no/low administration, and cost.

A Pipeline (Store → Process → Consume)

Store

The first stage is where the data is collected and stored. It can essentially be seen as a two-step stage. Data can be transactional (customers, orders, etc.), files, or streaming logs and sensor data.

While there are many tools one can use to collect and store the data, the decision depends on the project requirements and the data type.

Transactional data can be captured by backend APIs, while sensor data generated from sensor devices is transmitted off to, say, AWS IoT.

On the other hand, AWS S3, for example, is ideal for storing media and log files, while a NoSQL database is a better fit for transactional data.

This stage is solely responsible for storing the raw data. There is almost no transformation or parsing. It is just saving the data, which can then be parsed, cleaned, and analyzed.

Input — Output

  • Input: The data in any format.
  • Output: The raw data has been saved.

Workflow

  1. Collect. A backend API accepts data coming from users through the UI. Or, perhaps, application logs that are sent off to AWS CloudWatch.
  2. Store the incoming data. As shown above, storage can vary from in-memory stores and NoSQL databases to streaming storage like Kafka.
  3. The last step is a transition step from this stage to the next one.

One common way of doing so is to push an event into a queue saying “Hey, a new record/file has been stored”, and whoever is interested will capture that event. This step is important to kick off the next stage: Process. We’ll dive a bit into this step.
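As a rough sketch of that idea (the in-memory queue below stands in for a managed service such as SQS or Kafka, and the message format is made up for illustration), the store stage saves the raw data and then announces it:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal sketch of the transition step: once the raw data is saved,
// push a small event so whoever is interested can kick off the next stage.
class StoreStage {
    // In production this would be a managed queue (SQS, Kafka, ...);
    // an in-memory queue stands in for it here.
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    public void save(String recordId, byte[] rawData) throws InterruptedException {
        // 1. Persist the raw data as-is (details omitted).
        // 2. Announce it, so the Process stage knows there is work to do.
        events.put("record-stored:" + recordId);
    }
}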


Transition step

Generally, if you have one component that needs to kick off another, what would you do?

Transition step (notice the arrows).

A method call. If the application code is simple and all in one place, plain method calls are possible.

// ComponentA kicks off ComponentB directly with a method call.
class ComponentA {
    public void execute() {
        ComponentB componentB = new ComponentB();
        componentB.execute();
    }
}

Request-Response. A common approach where one service sends a request to another service, either asking for information or delegating a command, and waits until it gets back a response. Each of these components can be an API, a lambda function, etc.
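For illustration, here is a minimal request-response sketch using Java’s built-in HTTP client; the endpoint URL is a placeholder for whichever API or lambda function sits on the other end:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Request-response: one component calls another over HTTP and waits for the reply.
class ReportClient {
    private final HttpClient client = HttpClient.newHttpClient();

    public String fetchDailyReport() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://example.com/reports/daily")) // hypothetical endpoint
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();                                  // the caller blocks until this returns
    }
}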

Events. In event-driven systems, instead of components talking to each other directly, they emit events signaling that something has happened. Other components can then listen to these events if needed and react accordingly.
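A toy in-memory event bus shows the shape of this; a real system would use a broker or a pub/sub service (SNS, Kafka, etc.) instead:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Event-driven: components don't call each other directly; they publish events
// and whoever cares subscribes.
class EventBus {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    public void subscribe(Consumer<String> listener) {
        listeners.add(listener);
    }

    public void publish(String event) {
        listeners.forEach(listener -> listener.accept(event));
    }
}

// Usage: the Process stage reacts whenever the Store stage announces new data.
// bus.subscribe(event -> System.out.println("processing " + event));
// bus.publish("record-stored:42");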

Data storage. Instead of having one component talk to another, the result data produced by one component can be stored somewhere, in a database or object storage. Other components are free to pull out this data and use it.

Queues. Using queues to distribute work among workers. A component pushes a task into the queue, and that queue is responsible for delegating the task to the corresponding worker to carry out the work.

This can be done synchronously or asynchronously, and the tasks can either be serialized or run concurrently.
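The following sketch captures the queue idea with plain Java: producers push tasks, and a small pool of workers pulls them off and runs them concurrently. A managed queue service would replace the in-memory queue in a real pipeline:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Queue-based work distribution: components push tasks, workers carry them out.
class TaskQueue {
    private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    public void start() {
        // Each worker loops forever, taking the next task off the queue.
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        tasks.take().run();   // blocks until a task arrives
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    public void push(Runnable task) {
        tasks.offer(task);                    // producers never wait for the result
    }
}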

Lambda. Similar to the request-response approach, but instead of sending the request to a particular component, it is sent to a single, generic lambda function that acts as the entry point and decides where the request should be forwarded.
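A minimal sketch of that single entry point might look like this; the route names and the handlers behind them are hypothetical:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// One generic entry point that decides where each request goes.
class EntryPoint {
    private final Map<String, Consumer<String>> routes = new HashMap<>();

    EntryPoint() {
        routes.put("new-file",   payload -> System.out.println("forward to file processor: " + payload));
        routes.put("new-record", payload -> System.out.println("forward to record processor: " + payload));
    }

    public void handle(String type, String payload) {
        routes.getOrDefault(type, p -> System.out.println("unknown request type: " + type))
              .accept(payload);
    }
}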

Of course, there might be an overlap. A publish/subscribe event system might use a queue to keep track of, and distribute, the emitted events.

Moreover, one can use a hybrid approach: a lambda function that sits in front of a queue, so that any component that needs to push a task just pings that lambda function with it, instead of worrying about the queue itself.

Or, a lambda function that listens to, or periodically pulls out, logs being inserted into an event log (as opposed to using a queue) and notifies the subsequent stage.

Having said that, we’ll now move on to the next stage: Process.


Process

Where the “Store” stage can be seen as a two-step stage, the Process stage can be seen as a pipe-and-filter: a step-by-step chain of small processors where data is converted or parsed, cleaned, and analyzed. Each step feeds the next with its result data to carry on the journey.

That’s not always the case, though. We might have multiple independent processors, each analyzing the initial data for a different purpose.

While the flow of the data seems sequential, from Store → Process, the processed data is then stored back to be consumed, say, by a visualization tool or a web UI.

Which tools can be used?

  • Batch processing on a large amount of data? “Amazon EMR provides a managed Hadoop framework” including Spark, Pig, and Hive. Examples: daily insights about incoming traffic, daily/monthly reports, etc.
  • Analyzing a stream of real-time data? Spark Streaming, Apache Storm, Flink, Kinesis Analytics, or even Lambda functions. Immediate recommendations and alert messages (upon failures or high spikes) are common examples.
  • Adding a bit of AI on batch or real-time data? TensorFlow, Torch, Amazon Lex (voice and text recognition), Spark ML, and the list goes on.

Again, the list of tools is endless. Choosing the one that fits your current needs is crucial.

Input — Output

  • Input: This stage is kicked off by one of the transition steps discussed above.
  • Output: The data is saved after being analyzed.

Workflow

  1. The data gets pulled out from storage like AWS S3 or sent directly via an HTTP request.
  2. The data will then be parsed, cleaned, and analyzed. These operations can each be split onto their own server or lambda function.
  3. When done, it stores the processed, analyzed data in the database or object storage to be visualized and queried by other applications.
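As a concrete (if simplified) sketch of this workflow, here is a small pipe-and-filter processor that parses, cleans, and analyzes a batch of log lines and then hands off the result; the log format and the stored metric name are made up:

import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// A minimal pipe-and-filter sketch of the Process stage: each step feeds the
// next, and the final result is handed to whatever storage the Consume stage reads.
class LogProcessor {
    public void process(List<String> rawLines) {
        List<String> cleaned = rawLines.stream()
                .map(String::trim)                          // parse / normalize
                .filter(line -> !line.isEmpty())            // clean
                .map(line -> line.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());

        long errors = cleaned.stream()                      // analyze
                .filter(line -> line.contains("error"))
                .count();

        store("errors_last_batch", errors);                 // hand off the result
    }

    private void store(String metric, long value) {
        // In a real pipeline this would write to a database or object storage.
        System.out.println(metric + " = " + value);
    }
}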

Additionally, we can do things like sending a notification to a lambda function, which in turn decides where to send the data. Or, a lambda function can regularly scan the table and forward a message somewhere, say, to Slack.

That database, object storage, or lambda function acts as the glue: a transition step between the current stage, “Process”, and the next one.

Consume

The final stage, where data can be queried, visualized, or explored using notebooks for data science work. Not to mention, alerts or messages can be sent to email, Slack, Jira, GitHub, etc.

Consuming the same data with multiple consumers or tools can be useful for different purposes: sending monthly reports by email, using visualization tools to get real-time insights, or querying the data from a web application.

Input — Output

  • Input: Again, it is kicked off by one of the transition steps discussed above.
  • Output: The data is visualized or queried, or an alert has been sent.

Workflow

  1. The data gets pulled out from storage like AWS S3 or a database. Or, sent directly, say, from a lambda function.
  2. Now the data is ready to be consumed by whatever service or tool we choose. A common consumer is our web UI reading the results stored in the database. Or, perhaps, a visualization tool that links to the database.

Case Studies

Because these three stages are completely generic, not specific to one case, they become clearer when demonstrated with real examples.

Alerts / Log analysis

Sending alerts upon detecting failures, fraud, exceeding a certain CPU usage threshold, or whatever the reason may be.

  1. The data is coming from our live web applications: users are committing transactions, ordering products, modifying their shopping lists, etc.
  2. This stream of data is fed to AWS Kinesis, where the data is stored for a certain time. Kinesis also sends the data to a lambda function, where all the processing kicks off.
  3. The lambda function is responsible for parsing, classifying, and normalizing these logs. It then passes the result data to the next lambda function.
  4. The next lambda function detects potential fraud. If detected, an alert is committed to a DynamoDB table. We could also use AWS SQS (an alert queue).
  5. A final lambda function periodically scans the alerts DynamoDB table, sends each alert to Slack, and writes it to an AWS S3 bucket for historical retention and searching via AWS Athena. The alerts in the DynamoDB table can then be deleted once consumed.
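A rough sketch of that final step might look like the following, with an in-memory alert store and a printed message standing in for DynamoDB and the Slack webhook:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically scan pending alerts, forward them, and delete them once consumed.
class AlertScanner {
    private final Queue<String> pendingAlerts = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            String alert;
            while ((alert = pendingAlerts.poll()) != null) {  // consume and delete
                sendToSlack(alert);
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    public void raise(String alert) {
        pendingAlerts.offer(alert);
    }

    private void sendToSlack(String alert) {
        System.out.println("ALERT -> Slack: " + alert);       // placeholder for a webhook call
    }
}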

File uploads

When working with institutions like banks or hospitals, they are likely to have the data in CSV format. Tapping into their internal systems is not possible for security reasons. So, we’ll end up extracting the information from these CSV files and displaying it on our web UI application.

  1. The files are uploaded to AWS S3, either directly or through a web application that provides a file upload utility.
  2. A lambda function gets triggered when a file is created in AWS S3. This function extracts the data and saves it to an AWS RDS database. If this operation is time- and CPU-consuming, it is better to use a dedicated server instance with enough hardware resources. Or, perhaps, we can use Spark for big data workloads.
  3. A summary can be queried and displayed on the web application. In addition, visualization tools can hook into the AWS RDS database to create interactive charts.
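Step 2 could look roughly like this; the CSV layout (a header row followed by name,amount rows), the table name, and the JDBC connection details are assumptions for the sake of the example:

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Read a CSV file and insert each row into a relational database.
class CsvImporter {
    public void importFile(Path csvFile, String jdbcUrl) throws Exception {
        List<String> lines = Files.readAllLines(csvFile);
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO transactions (name, amount) VALUES (?, ?)")) {
            for (String line : lines.subList(1, lines.size())) {  // skip the header row
                String[] cols = line.split(",");
                insert.setString(1, cols[0]);
                insert.setBigDecimal(2, new java.math.BigDecimal(cols[1]));
                insert.executeUpdate();
            }
        }
    }
}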

Sensor data

Sensor data comes from devices that monitor our health, traffic, or weather conditions. These devices emit messages, each usually small and timestamped, but they come in at high frequency.

Not only do we need to store this time-series data, but we also want to run some ML/DL algorithms and visualize the readings as they roll into our application.

  1. Sensor data floods in through Apache Kafka from the sensor devices.
  2. Kafka then sends the data to Spark Streaming, which detects any unusual event across a 10-minute interval. The raw data is also stored in a time-series database like InfluxDB.
  3. A visualization tool like Grafana integrates easily with InfluxDB. One can also query the data as if it were a relational database.
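To make the “10-minute interval” idea concrete without pulling in Spark, here is a plain-Java sliding-window check that flags a reading as unusual when it is far above the window average; the threshold is arbitrary:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

// Keep the last 10 minutes of readings and flag a spike relative to the window average.
class SensorWindow {
    private static final Duration WINDOW = Duration.ofMinutes(10);
    private final Deque<double[]> readings = new ArrayDeque<>(); // [epochMillis, value]

    public boolean add(Instant timestamp, double value) {
        readings.addLast(new double[]{timestamp.toEpochMilli(), value});

        // Drop readings that fell out of the 10-minute window.
        long cutoff = timestamp.minus(WINDOW).toEpochMilli();
        while (!readings.isEmpty() && readings.peekFirst()[0] < cutoff) {
            readings.removeFirst();
        }

        double avg = readings.stream().mapToDouble(r -> r[1]).average().orElse(value);
        return value > avg * 2;  // "unusual": more than twice the window average
    }
}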


Thank you for reading!

Feel free to reach out on LinkedIn or Medium.
