Our unlimited, constant access to the internet generates vast amounts of data from all over the world. We use social media, watch movies, read books and shop online for entertainment, but we don’t see the epic data streams and infrastructure that power and manage all that activity. It all happens in milliseconds, in data centres far away from where we live.
But how does it work? How does Google show me relevant adverts alongside my search results? How does Netflix recommend movies that match my taste? What makes these processes so fast and so accurate?
In this article, I will give an overview of how data processing works. It is not an advanced tutorial on how to build a platform, but you will find explanations of common buzzwords and some practical examples of data processing from the real world. Think of it as a primer based on the business and engineering experience that I have gained during a few years of working at Azimo.
Azimo as a data-driven company
Azimo is a fintech company. Millions of financial transactions are made using our platform, which creates a lot of data. In the beginning, we couldn’t make data-driven decisions because we weren’t tracking and storing that data correctly. Thanks to better data processing tools and practices, we can now improve our product based on the data insights that we have.
Over time we’ve built many tools to help us improve our data processing. We built an automated financial reporting system called Ebeneezer, which is responsible for collecting financial data. A tool called Bravo helps us to distribute user data to other tools used by internal teams like customer service, operations and marketing.
These functions of data processing are incredibly important to our business. Making real-time customer data available to customer service, for instance, allows a customer service agent to diagnose issues with a user’s transfer or account and resolve them quickly.
What is data?
We talk a lot about data, but what is it? Why do people spend so much money creating and maintaining systems to collect and manage it?
Data in this context can be any kind of information about a person: gender, eye colour, hobbies, activities, job and so on. Everything you do on the internet is theoretically useful to a company somewhere. If you search for holiday destinations, for instance, then you will probably start seeing adverts for flights and hotels.
Data isn’t just generated by a human being clicking things online. It can be created by anything connected to the internet, from your smartphone to your car GPS, and even your fridge.
All this data can be used by companies to make better business decisions. Data is processed and analysed repeatedly to draw various insights. A marketing team, for instance, will use customer data to manage their advertising budget or to find potential customers.
Before we can start using data effectively, however, we need to know how data collection works and where to start.
Data processing is the act of taking data from one place, analysing it and/or moving it to another place where it can be useful.
A simple example of data processing is when car manufacturers collect sales information to find out which models and colours are the most popular. This data can be used to make better production decisions.
But data processing is not possible without a sufficiently large data set. In theory, a car manufacturer could do the whole thing manually by surveying every car dealership in every city. In practice, that’s impossible, which is why we created software to do it for us.
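To make the car manufacturer example concrete, here is a minimal sketch of that kind of aggregation using Java streams. The data and names are illustrative, not from a real manufacturer:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CarSales {
    public static void main(String[] args) {
        // Each entry is the colour of one car sold (illustrative data).
        List<String> soldColours = List.of("red", "blue", "red", "white", "red", "blue");

        // Count how many cars of each colour were sold.
        Map<String, Long> salesByColour = soldColours.stream()
                .collect(Collectors.groupingBy(colour -> colour, Collectors.counting()));

        System.out.println(salesByColour);
    }
}
```

In a real system the input would come from thousands of dealerships rather than a hard-coded list, but the shape of the computation — collect, group, count — is the same.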
Let’s look at a more technical example:
The image above shows some data (the colourful blocks) flowing into an application. The yellow mesh represents a filter, which only allows relevant data through.
The yellow mesh is a part of the software which implements business logic using code written in Java, Scala or another programming language. In this case, the business logic says that all colours except yellow should be filtered out.
Example in Java 8:
In this example, all colours which are not yellow are filtered out. When we iterate through the result and print it, we get something like this:
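A self-contained sketch of that filter, with illustrative colour values, could look like this in Java 8:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ColourFilter {
    public static void main(String[] args) {
        // The incoming "blocks" of data from the diagram, as simple strings.
        List<String> colours = List.of("red", "yellow", "green", "yellow", "blue");

        // The business logic of the yellow mesh: let only yellow blocks through.
        List<String> onlyYellow = colours.stream()
                .filter("yellow"::equals)
                .collect(Collectors.toList());

        onlyYellow.forEach(System.out::println); // prints "yellow" twice
    }
}
```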
A more complex example can be seen below:
The code above calculates total net production from a moulding machine. You can find the whole code for this project here.
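As a rough sketch of that kind of calculation — assuming each machine reading carries a produced count and a rejected count, which is my simplification rather than the project's actual data model:

```java
import java.util.List;

public class NetProduction {
    // One reading from the moulding machine (illustrative model).
    static class Reading {
        final int produced;
        final int rejected;

        Reading(int produced, int rejected) {
            this.produced = produced;
            this.rejected = rejected;
        }
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
                new Reading(120, 5),
                new Reading(98, 2),
                new Reading(110, 7));

        // Net production = produced minus rejected, summed over all readings.
        int totalNet = readings.stream()
                .mapToInt(r -> r.produced - r.rejected)
                .sum();

        System.out.println("Total net production: " + totalNet); // 314
    }
}
```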
Data processing is not the preserve of advanced programmers at top tech companies using the most complex technologies. It can be explained and presented using simple Java 8 streams.
That said, you should be aware that if you intend to process petabytes of data, creating streams in Java and processing them on a single machine is not the best idea. For larger projects, you need a more advanced solution:
Spring Cloud Stream is a framework for building highly scalable event-driven microservices connected with shared messaging systems.
The framework provides a flexible programming model built on already-established Spring idioms and best practices, including support for persistent pub/sub semantics, consumer groups, and stateful partitions.
Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. Beam pipelines are defined using one of the provided SDKs and executed in one of Beam’s supported runners (distributed processing back-ends), including Apache Apex, Apache Flink, Apache Gearpump (incubating), Apache Samza, Apache Spark and Google Cloud Dataflow.
…and many, many other technologies dedicated to data processing. Frameworks are great but you will also need runners — those applications/streams/pipelines need somewhere to run, after all. You can use a variety of runners with Apache Beam: Apache Flink, Google Cloud Dataflow, Apache Spark etc. These will give you an environment to run your code. Apache Beam’s documentation explains this well:
When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform. The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs, and provide:
1) A fully managed service
2) Autoscaling of the number of workers throughout the lifetime of the job
3) Dynamic work rebalancing
Spring Cloud Stream uses Spring Cloud Data Flow as the orchestrator for running jobs. It provides mechanisms to create complex topologies for streaming and batch data pipelines, a GUI to create pipelines manually, a REST API to manage resources, and many other features helpful in data processing. Spring Cloud Data Flow does a similar job to, for example, Google Cloud Dataflow, but is less advanced and suited to somewhat different use cases.
We’ve talked about data and data processing but there is another crucial piece of the data puzzle: data streams. According to Wikipedia:
… a data stream is a sequence of digitally encoded coherent signals (packets of data or data packets) used to transmit or receive information that is in the process of being transmitted. A data stream is a set of extracted information from a data provider. It contains raw data that was gathered out of users’ browser behaviour from websites, where a dedicated pixel is placed. Data streams are useful for data scientists who work with Big Data and AI algorithms.
The first thing that comes to most people’s minds when they hear the word “stream” is just a stream…yes, with water. I Googled a picture for this word and this is what I found:
But after many years as a software engineer, the first thing that comes to my mind when I hear “stream” is this:
The word “stream” is used in many contexts in the programming world, but all are similar. Just like a real stream, a stream in software is something that takes data from one place (a source) to a destination while transforming it along the way, often generating energy on its journey that is used to power other things.
The picture above shows something called an “event stream”. Events are things like user actions, such as logging into the Azimo app or making a transfer. A group of these events in sequence is a stream. Some of these event streams have an end point (bounded), others do not (unbounded). Apache Flink’s documentation explains the difference:
Unbounded streams have a start but no defined end. They do not terminate and provide data as it is generated. Unbounded streams must be continuously processed, i.e., events must be promptly handled after they have been ingested. It is not possible to wait for all input data to arrive because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events are ingested in a specific order, such as the order in which events occurred, to be able to reason about result completeness.
Bounded streams have a defined start and end. Bounded streams can be processed by ingesting all data before performing any computations. Ordered ingestion is not required to process bounded streams because a bounded data set can always be sorted. Processing of bounded streams is also known as batch processing.
Bounded streams are mainly used in analysis where the results do not have to be immediate. An example would be daily financial reports, where the input data has an end result. Data is processed and the results are saved to a database. Jobs in this case are run periodically (batch processing).
Unbounded streams are more useful for processing data in real time, such as when collecting data about a user’s behaviour on our mobile apps. The incoming data is theoretically endless because we cannot assume when users will stop using the app. Data is transformed and saved to storage (as with a bounded stream), but an unbounded stream runs all the time. The stream is connected to a data source, such as Kafka.
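Java’s own streams can illustrate the difference between the two, at least as an analogy (this is not a distributed stream processor, and the event names are made up):

```java
import java.util.List;
import java.util.stream.Stream;

public class BoundedVsUnbounded {
    public static void main(String[] args) {
        // Bounded: a fixed batch of events that we can fully ingest
        // before computing anything over it.
        List<String> batch = List.of("login", "transfer", "logout");
        long transfers = batch.stream().filter("transfer"::equals).count();
        System.out.println("Transfers in batch: " + transfers);

        // Unbounded: Stream.generate never terminates on its own; in a real
        // system the source would be something like a Kafka topic. Here we
        // cut it off artificially with limit() so the example finishes.
        Stream.generate(() -> "transfer")
                .limit(3) // without this limit, the stream would run forever
                .forEach(event -> System.out.println("Processing event: " + event));
    }
}
```

The key difference mirrors the Flink quote above: the bounded batch can be counted only because it is complete, while the generated stream has to be processed element by element as data arrives.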
I am often asked how to start programming data processing streams. I always reply that it’s worth starting from the very beginning. In this article, I tried to introduce the basic concepts using comparisons to everyday systems.
It seems to me that there is a kind of invisible barrier: novice programmers think that Big Data is very difficult and unattainable knowledge, which is not the case. In the next article, we’ll start looking at more advanced concepts.