DATA PROCESSING IN “REAL TIME” WITH APACHE SPARK: PART 1

Published in Sinch Blog · 8 min read · Oct 27, 2020

In addition to delving into the new streams API, Structured Streaming, Eiti Kimura, Development Coordinator here at Wavy, gives more details of the architecture and shows the implementation of our solution to process high volumes of data at Wavy.

Written by Eiti Kimura, IT Manager at Wavy

This post is part of a series of 3 articles on the Apache Spark use case for “real time” data processing. This content was also replicated on InfoQ.

Introduction

This year I had the opportunity to speak on the theme of this article at the 10-year commemorative edition of QCon São Paulo (see my review of the event). I believe the topic is relevant enough to deserve a more detailed post explaining the solution.

When we talk about big data we think of volume, variety and the velocity at which that data is processed. Data volumes have been growing exponentially for decades due to the popularization of the Internet and the increasing use of mobile devices.

In the beginning, the challenge was simply to be able to store large volumes of data. Then, with large masses of data in hand, the challenge became: how am I going to process all of this? Now the challenges are, let's say, even more challenging: how do we extract relevant information from the data, and how do we do it in real time?

The focus of this series of articles is on processing data as it arrives on the company's systems and platforms, that is, so-called real time. But what is real time?

Note that I put the term real time in quotes in the title of this article, because its interpretation is relative and conceptual. To give an example: for an IoT system, processing information in a few tens of seconds would certainly be too slow; real time in IoT is probably on the order of a few milliseconds. For web applications and commercial platforms, however, a few seconds or even minutes would be acceptable as real time. In other words, the term is very relative, so it is worth putting it in context for the case presented here.

One of Wavy’s lines of business is the sale of content through applications and services to telephone operators, in which the user subscribes to these services and is charged on a recurring basis.

Before the solution, which I will address later in this series, these operator transactions took between 30 and 90 minutes to be consolidated and displayed on our data visualization platforms. That is a long time, and it delayed the actions we could take in our business. Our daily volume was over 110 million operator transactions spread throughout the day.

The need: process millions of billing transactions in real time. That is where Apache Spark comes to the rescue.

Apache Spark

Apache Spark is a high-performance distributed processing engine. It was developed in 2009 at the AMPLab at the University of California, Berkeley; its code was open-sourced in 2010 and later donated to the Apache Software Foundation.

The tool was designed to be highly scalable and to process large volumes of data. Apache Spark is written in Scala, but it exposes APIs that can be used from other languages as well. In practice we can write programs in Scala, Java, Python and even R (check the availability of operations); whichever language is used, the code goes through the same execution planning on top of the framework's common APIs, transparently to the developer.

The architecture of a Spark cluster can be seen in Figure 1. The program written by the user is known as the Driver Program; it connects to the Cluster Manager, which allocates resources on the processing nodes known as Worker Nodes, and the driver then plans and distributes the work across them. In order to run programs with Apache Spark, it is important that all nodes are able to exchange information by reading and writing files on a distributed file system (DFS), such as HDFS or S3.
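To make the role of the Driver Program a bit more concrete, here is a minimal sketch of a Spark application in Scala that could be packaged and submitted to a cluster. The object name, file path and master URL are illustrative assumptions, not part of the original talk:

import org.apache.spark.sql.SparkSession

// Hypothetical driver program: this is the code that runs on the driver;
// Spark distributes the actual work to executors on the Worker Nodes.
object SimpleDriverApp {
  def main(args: Array[String]): Unit = {
    // In a real cluster the master URL comes from spark-submit / the Cluster
    // Manager; local[*] is used here only so the sketch also runs on a laptop.
    val spark = SparkSession.builder()
      .appName("simple-driver-app")
      .master("local[*]")
      .getOrCreate()

    // Read from a location every node can reach (HDFS, S3, ...), as noted above.
    val df = spark.read.option("header", true).csv("s3a://some-bucket/some-data.csv")
    println(s"Rows read: ${df.count()}")

    spark.stop()
  }
}

Packaged as a jar, a program like this would typically be handed to the cluster with spark-submit, letting the Cluster Manager allocate the executors.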

Figure 1. Architecture of a program running on an Apache Spark cluster.

For more details on how Spark works in a cluster, see the documentation.

If you are interested in the topic, download the framework, start the spark-shell and take a tour of the tool locally, without worrying about the complexity of setting up the environment.

The Spark shell provides an easy way to learn the API and to do exploratory data analysis. To start it, simply run the following command from the Apache Spark installation directory:

./bin/spark-shell

Spark context available as 'sc' (master = local[*], app id = 345).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12

scala> _

In this example I use, as a data source, a file with fictitious, generated information about people; the file can be found in this repository.

Let’s load the file in CSV format, setting the file separator, which in this case is a semicolon:

scala> val df = spark.read.option("delimiter", ";").option("header", true).csv("file:/YOUR_PATH/user-record.1.csv")

Use the show instruction to display the content that was read from the file:

df.show()

At this point, the data read into the Dataframe is organized as follows:

+------------+----------+--------+--------------+---+-------+--------------+
|          id|   country|    city|         phone|age|carrier|marital_status|
+------------+----------+--------+--------------+---+-------+--------------+
|       Quinn|     Syria|Holywell|55(89)76207443| 54|   VIVO|        Single|
| Emi E. Leon|  Pitcairn|Clermont|55(72)84497623| 52|   VIVO|       Married|
|    Adele I.| Greenland|  Darwin|55(09)68020238| 21| NEXTEL|      Divorced|
|  Preston R.|   Uruguay|   Borås|55(01)60250601| 28|     OI|       Married|
|William Buck|Montenegro| Casalvi|55(95)36737883| 56|     OI|       Married|
+------------+----------+--------+--------------+---+-------+--------------+

scala>

At this point, it is possible to check the schema that Spark inferred when reading the data by invoking the printSchema() method.

df.printSchema()

root
 |-- id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- age: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- marital_status: string (nullable = true)

scala>

To correct a data type, for example the age field, which was read as a string when it should really be an integer, we can execute the following instruction:

val dframe = df.withColumn("age", df("age").cast(org.apache.spark.sql.types.IntegerType))

When checking the schema again, we can see that the field now has the correct type:

scala> dframe.printSchema

root
 |-- id: string (nullable = true)
 |-- country: string (nullable = true)
 |-- city: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- carrier: string (nullable = true)
 |-- marital_status: string (nullable = true)
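As a side note (not part of the original walkthrough), the manual cast can often be avoided by asking Spark to infer the column types while reading the CSV; a minimal variation of the earlier read, using the same placeholder path:

val dfInferred = spark.read.option("delimiter", ";").option("header", true).option("inferSchema", true).csv("file:/YOUR_PATH/user-record.1.csv")

With inferSchema enabled, age comes back as an integer automatically. Inference requires an extra pass over the data, though, so for very large files an explicit cast or a user-defined schema is usually preferable.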

Let’s do a simple aggregation on the carrier field to see how the users are distributed by operator:

df.groupBy("carrier").count().show()

+-------+-----+
|carrier|count|
+-------+-----+
|   VIVO|   23|
|     OI|   17|
| NEXTEL|   15|
|    TIM|   19|
|  CLARO|   26|
+-------+-----+

scala>

We can check the average age of the people in the dataset, grouped by marital status, like this:

dframe.groupBy("marital_status").avg("age").show()

generating this result:

+--------------+------------------+
|marital_status|          avg(age)|
+--------------+------------------+
|       Married| 37.54545454545455|
|      Divorced| 38.43478260869565|
|        Single|41.666666666666664|
+--------------+------------------+

scala>
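The Dataframe API can also combine several aggregations in a single pass with agg; here is a small sketch (not from the original example) over the same data:

import org.apache.spark.sql.functions.{avg, min, max}

dframe.groupBy("marital_status").agg(avg("age"), min("age"), max("age")).show()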

Instead of obtaining this information through calls to the Dataframe API, we can use the SQL language to perform operations on the data. To do that, we first need to register the Dataframe as a temporary view:

dframe.createTempView("vw_people")
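A small practical note (not in the original example): if these commands are re-run in the same shell session, createTempView fails because the view already exists; createOrReplaceTempView avoids that:

dframe.createOrReplaceTempView("vw_people")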

With the view defined we can now simply execute SQL commands on it, for example:

spark.sql("SELECT COUNT(1) FROM vw_people").show()

The result of a simple record count:

+--------+
|count(1)|
+--------+
|     100|
+--------+

scala>

Or even more complex queries, such as this one, grouping the number of users over 30 years old by operator:

spark.sql("SELECT carrier, COUNT(1) FROM vw_people WHERE age > 30 GROUP BY carrier").show()

See the result below:

+-------+--------+
|carrier|count(1)|
+-------+--------+
|   VIVO|      13|
|     OI|       8|
| NEXTEL|      11|
|    TIM|      13|
|  CLARO|      20|
+-------+--------+

With Spark we have the option of either invoking methods on Dataframes or simply writing instructions directly in SQL. In practice, regardless of how the instructions are written, the execution plan carried out by the Worker Nodes will be the same.
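One way to verify that claim is to compare the plans Spark produces for the two styles; a small sketch using explain(), with the plan output omitted here:

// Dataframe API version of the last query
dframe.filter("age > 30").groupBy("carrier").count().explain()

// Equivalent query in SQL over the temporary view
spark.sql("SELECT carrier, COUNT(1) FROM vw_people WHERE age > 30 GROUP BY carrier").explain()

Apart from the generated column name, both plans describe the same scan, filter and aggregation.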

I hope this little tour of the tool has piqued your interest even more. The idea here was not to give a complete picture of what the tool can do, but just to “scratch the surface”, showing some of the things that can be done.

Remember that Apache Spark was designed to scale out and process large volumes of information.

Conclusion

This article served as an introduction to the talk given at QCon São Paulo 2019, presenting the problem we want to solve: real-time data processing. We also gave a brief introduction to Apache Spark, the tool used in this journey, along with some examples and tips to get started.

Follow our series of articles here on the blog! In the next one we will take an in-depth look at Structured Streaming, Apache Spark's stream processing API, which will be the next step toward “real-time” processing.

About the author

Eiti Kimura is an IT Coordinator and architect of high-performance distributed systems at Wavy, with 17 years of experience in software development. He is an enthusiast of open-source technologies, an Apache Cassandra MVP since 2014, and has extensive experience with back-end systems, in particular billing and messaging platforms for the main telephone operators in Brazil.
