Nowadays, being able to handle huge amounts of data is a valuable skill. Analytics, user profiling, statistics: virtually any business that needs to extract information from its data relies, in one way or another, on big data tools or platforms.
One of the most interesting tools is Apache Beam, a framework that gives us the instruments to build pipelines that transform, process, aggregate, and manipulate data to fit our needs.
Let’s see how we can use it in a very simple scenario.
The context
Imagine we have a database of records describing users’ visits to a website, each record containing (see the sketch after this list):
- country of the visiting user
- duration of the visit
- user name
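As a rough illustration, one such record could be modeled in Python like this (the field names and types are our own assumptions, not a fixed schema):

```python
from typing import NamedTuple

# Hypothetical shape of a visit record; names and types are illustrative.
class Visit(NamedTuple):
    country: str     # country of the visiting user, e.g. "IT"
    duration: float  # duration of the visit, e.g. in seconds
    user_name: str   # name of the visiting user

sample = Visit(country="IT", duration=120.0, user_name="alice")
```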
We want to create some reports containing (a pipeline sketch follows the list):
- for each country, the number of users visiting the website
- for each country, the average visit time
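To make this concrete, here is a minimal sketch of such a pipeline using the Beam Python SDK. It replaces the database with a handful of in-memory records via `beam.Create` (the sample data and step labels are made up for illustration):

```python
import apache_beam as beam
from apache_beam.transforms import combiners

# Hypothetical in-memory visit records: (country, duration, user_name).
VISITS = [
    ('IT', 120.0, 'alice'),
    ('US', 45.0, 'bob'),
    ('IT', 60.0, 'carol'),
    ('US', 30.0, 'dave'),
]

with beam.Pipeline() as pipeline:
    visits = pipeline | 'ReadVisits' >> beam.Create(VISITS)

    # Report 1: for each country, the number of users visiting the website
    # (assuming one record per visiting user).
    (visits
     | 'KeyUserByCountry' >> beam.Map(lambda v: (v[0], v[2]))
     | 'CountPerCountry' >> combiners.Count.PerKey()
     | 'PrintCounts' >> beam.Map(lambda kv: print('visits:', kv)))

    # Report 2: for each country, the average visit time.
    (visits
     | 'KeyDurationByCountry' >> beam.Map(lambda v: (v[0], v[1]))
     | 'MeanPerCountry' >> combiners.Mean.PerKey()
     | 'PrintMeans' >> beam.Map(lambda kv: print('average:', kv)))
```

Run locally, this prints one (country, value) pair per country for each report; in a real job the `Create` step would become a read from the database and the `print` a proper sink.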
We will use Apache Beam, a Google SDK (previously called the Dataflow SDK) that provides a programming model designed to simplify large-scale data processing. It has since been donated to the Apache Software Foundation, and it’s called Beam because it can process data in whatever form you need: Batch and strEAM…