Apache Beam: a Python example

Bruno Ripa
6 min read · Jan 31, 2018

Being able to handle huge amounts of data is a valuable skill nowadays: analytics, user profiling, statistics. Virtually any business that needs to extract information from its data is, in one way or another, using some big data tool or platform.

One of the most interesting tools is Apache Beam, a framework that gives us the instruments to build pipelines that transform, process, aggregate and manipulate data for our needs.

Let’s try it and see how we can use it in a very simple scenario.

The context

Imagine we have a database of records describing visits to a website, each record containing:

  • country of the visiting user
  • duration of the visit
  • user name

We want to create some reports containing:

  1. for each country, the number of users visiting the website
  2. for each country, the average visit time
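Before bringing in any framework, it can help to pin down what the two reports compute. The following is a minimal plain-Python sketch, not the approach the article builds with Beam. The sample records, the tuple layout (country, duration in seconds, user name) and the function name `report_by_country` are all assumptions made for illustration, and every record is counted as one visiting user (counting distinct users would need a deduplication step).

```python
from collections import defaultdict

# Hypothetical sample records: (country, visit duration in seconds, user name).
sample_visits = [
    ("IT", 30, "alice"),
    ("IT", 90, "bob"),
    ("UK", 45, "carol"),
]


def report_by_country(records):
    """Return {country: (number_of_visits, average_visit_duration)}."""
    # For each country keep a running [visit count, total duration].
    totals = defaultdict(lambda: [0, 0.0])
    for country, duration, _user in records:
        totals[country][0] += 1
        totals[country][1] += duration
    # Turn the running totals into (count, average) pairs.
    return {
        country: (count, total / count)
        for country, (count, total) in totals.items()
    }


if __name__ == "__main__":
    for country, (count, average) in sorted(report_by_country(sample_visits).items()):
        print(country, count, average)
```

This works while the data fits in memory on one machine; the point of a framework like Beam is to express the same grouping and aggregation so that it can be distributed.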

We will use Apache Beam, an SDK that originated at Google (as the Dataflow SDK) and implements a programming model aimed at simplifying large-scale data processing. Google donated it to the Apache Software Foundation, and it is called Beam because it can process data in whatever form you need: batches and streams (Batch + strEAM).
