Complete Stream Environment (Go + Kafka + Spark + Delta Lake)

Bruno Cosso
3 min read · Jun 23, 2020


Due to the increasing adoption of microservices and IoT scenarios, we often come across the need for a complete streaming environment to perform analytics.

The use cases are numerous: from mobile devices and self-driving vehicles to microservices logging and tracing. A simple way to picture a stream is as a sequence of tiny messages, usually on the order of kilobytes each, with no determined end.

There is very good literature out there explaining what streaming is in greater detail and how Kafka and Spark work and help handle streams. My objective with this article is to show a good architectural design, using each of them for what it is best suited. This is also an environment-agnostic architecture: it can run either in the cloud or on-prem, and these building blocks can easily be substituted by the PaaS offerings of your favorite cloud provider.

The reference architecture considers microservices written in Go, and I am assuming this must be a highly performant service, writing directly into the Kafka stream. For the purposes of this article we are going to skip the orchestration of the microservice containers, which could be added on top of this architecture with Kubernetes.

The Go program writes some dummy messages to a Kafka stream, using the segmentio/kafka-go client:

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	topic := "my-topic"
	partition := 0

	for {
		// Connect directly to the leader of the topic partition.
		conn, err := kafka.DialLeader(context.Background(), "tcp", "localhost:9092", topic, partition)
		if err != nil {
			fmt.Println("dial error:", err)
			time.Sleep(10 * time.Second)
			continue
		}

		// Abort the write if it does not complete within 10 seconds.
		conn.SetWriteDeadline(time.Now().Add(10 * time.Second))

		_, err = conn.WriteMessages(
			kafka.Message{Key: []byte("Key1"), Value: []byte("One!")},
			kafka.Message{Key: []byte("Key2"), Value: []byte("Two!")},
			kafka.Message{Key: []byte("Key3"), Value: []byte("Three!")},
			kafka.Message{Key: []byte("Key4"), Value: []byte(time.Now().String())},
		)
		if err != nil {
			fmt.Println("write error:", err)
		}

		conn.Close()

		fmt.Println("Infinite Loop")
		time.Sleep(10 * time.Second)
	}
}
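To try it out, fetch the client library with go get github.com/segmentio/kafka-go and make sure a Kafka broker is listening on localhost:9092.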

Kafka is a high-performance and fault-tolerant platform that persists those streams to disk. If something fails, it is possible to continue the execution, as the data will not be lost. The retention time can vary according to the configuration; the default is 7 days.
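As an illustration of a per-topic retention setting, here is a minimal sketch, assuming the kafka-python admin client (which is not part of this architecture; it is used here only to show the configuration):

from kafka.admin import KafkaAdminClient, NewTopic

# Create the topic with an explicit retention of 7 days, expressed in milliseconds.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="my-topic",
        num_partitions=1,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])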

So Kafka, at this point, centralizes the streams coming from different sources. From there we will probably want to do something “smart” with these streams, so let’s perform some analytics on them. This is where Spark comes into the scene.

In the next code block, we retrieve the Kafka messages and process them into an initial table that can later be explored further. The transformation is done in PySpark and can add intelligence ranging from simple calculations to Machine Learning.

from pyspark.sql import SparkSession

# Entry point for the Spark application.
spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

# Subscribe to the Kafka topic as a streaming DataFrame.
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "127.0.0.1:9092") \
    .option("subscribe", "my-topic") \
    .load()

# Kafka delivers key and value as binary, so cast both to strings.
messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Print each micro-batch to the console, which is useful for debugging.
messages.writeStream.format("console").start()
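For this to run, the Spark session needs the Kafka source package (org.apache.spark:spark-sql-kafka-0-10) and, for the Delta steps below, the Delta Lake package (io.delta:delta-core); both are typically passed with the --packages option of spark-submit, and the exact artifact names and versions depend on your Spark and Scala versions.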

As the last step, we persist the stream into a Delta Lake as follows.

The Delta Lake adds a transactional layer on top of our semi-structured data, and it adds reliability and “time travel” capabilities (which is a cool name for data versioning).

# Persist the casted stream as a Delta table on the file system.
stream = messages.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/home/bruno/data-files/kafka_topic/checkpoint/") \
    .option("basePath", "/home/bruno/data-files/kafka_topic/") \
    .start("/home/bruno/data-files/kafka_topic/")
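To give an idea of the time travel mentioned above, here is a minimal sketch that reads the table as it was at an earlier version, using the versionAsOf read option of the Delta Lake API (version 0 is just an example):

# Read the table as of its first committed version ("time travel").
first_version = spark.read \
    .format("delta") \
    .option("versionAsOf", 0) \
    .load("/home/bruno/data-files/kafka_topic/")
first_version.show()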

Once the data is in the Delta Lake, we can transform it as needed and create value from it.
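For example, here is a minimal sketch of a batch query over the persisted table (the aggregation itself is arbitrary, just to show the idea):

# Load the Delta table as a regular batch DataFrame and aggregate it.
table = spark.read.format("delta").load("/home/bruno/data-files/kafka_topic/")
table.groupBy("key").count().show()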

Please note that, for this article, I created the table directly on the OS file system. In a real-life scenario we would use HDFS (or a cloud storage service) instead, but for learning purposes our own file system works fine and saves us from installing HDFS alongside Spark on a single development machine.

[Thanks to fellow programmer Lucas Araujo (https://www.linkedin.com/in/lucaslra/) for the review!]

About the Author

Bruno is a Brazilian programmer who has been coding since he was 14 years old and has worked at companies such as Ericsson, Fujitsu and Credit Suisse. He now works as a Solution Architect @ Itility, in the Netherlands. In his spare time he plays video games and tries not to be too annoying to his friends and family.

https://www.linkedin.com/in/bruno-marcondes-a60a424/
