Written by Vaibhav Puranik on August 24, 2017
This is the first post in the series of three posts on Kafka Connect Implementation at GumGum. In this post, I have going to explain why we chose to implement Kafka Connect and what impact it had on GumGum’s data processing architecture.
GumGum produces approximately 70 TB of new clickstream data every day. The data has been growing very rapidly. The following graph can show you our data growth since the beginning of this year.
At GumGum we have two data pipelines which process different kinds of clickstream data. For the sake of simplicity, let’s call them pipeline A and pipeline B. In pipeline A the data volume is comparatively smaller — say about 10 TB per day. All of the remaining data is processed through pipeline B.
Even though pipeline A processes less amount of data, it’s very important to GumGum’s business and all monetary transactions are processed based on the data in this pipeline. And hence losing data is not an option. All the billing processing is done on batch data only. That is why we maintain two sources of batch data. If one of the source suffers loss of data, we will always have a backup. This strategy allows our engineers to have a sound sleep at night and allows us to perform maintenance on our pipelines without fearing data loss. First source is through Fluentd. Ad Server which generate this data sends data to Fluentd cluster in real time. Fluentd then writes this data into files and uploads it to Amazon S3.We call this the primary source of batch data. Ad Server in parallel also write this data on local disk in log files and these log files are uploaded to Amazon S3 every 60 minutes. We call this the secondary source of data. We also have a tertiary source of data — and that’s the data through Kafka — the real time data. Real time data is only used for real time reports. This data gets replaced by primary batch data source every 24 hours to ‘correct’ the data. In case primary source has a problem, this data gets replaced by secondary source of data — through file upload.
At GumGum we have long debated whether we really need three sources of data. Our experience shows that real time data is very seldom inaccurate and can definitely serve as a backup data source or secondary data source if not primary. The ideal situation is to maintain only two data sources — the primary (using fluentd) and real time through Kafka. It would be the best if we can somehow get data in Kafka to S3 allowing the batch processing system to take this data when the primary data is not available. In that case we wouldn’t need our file upload — the secondary source. Tertiary source can become the secondary source. Furthermore, uploading files to S3 periodically from local disk was problematic because of autoscaling. Autoscaling shuts down hundreds of server every day and in spite of special handling to upload this data before the server is completely shut down, we always kept losing data. Because of this data our Ad Servers became stateful. Hence our engineers were eager to get rid of the file upload and make the servers stateless.
In the past we have tried to get rid of file upload by implementing Camus. People in the Kafka community are aware of Camus from LinkedIn. Camus allows you to take data in Kafka and store it in HDFS or S3. We found that the software wasn’t really reliable. It did not reliably save all the data in Kafka to S3. Even LinkedIn seems to have abandoned Camus now. We also contemplated moving to Amazon’s Kinesis since Kinesis can directly upload the data to S3 without us doing anything. But there were some timezone related issues involved in moving to Kinesis. It would have been a bigger project too since we were using Kafka for 5 years and embracing completely new technology would be difficult. That’s why when we heard about Kafka Connect, we decided to give it a try. If it worked as advertised then it would be the least cost, least disruptive, obvious choice for us. Here is how pipeline A architecture would look after Kafka Connect Implementation:
The second pipeline that processes most of our data was bit different. As you can see in the architecture diagram given ahead, we only maintained two sources of data instead of three. This data is not used for any kind of billing. The data is only used for internal analysis and few hours of hole wouldn’t make a big difference.
So that data loss requirement here is not stringent and in spite of that we were maintaining two data sources. What would be ideal here, is to get rid of the fluentd source of data altogether and implement Kafka Connect to store the data in S3. This data then can be used for batch processing. In case something happens and we lose data, no big. We made a choice that it would be okay. Here is how the architecture of pipeline B would look after Kafka Connect Implementation:
These diagrams (Thanks to Karim Lamouri) make it amply clear that Kafka Connect would make a big impact on architecture and save us a lot of money and effort in long run. In pipeline B we would get rid of one fluentd cluster and replace it by Kafka Connect cluster. Our Ad servers will have to send data to only one sink and hence saving cross AZ bandwidth cost. Now that the motivation is clear, the challenge was to implement it. We would find out later that Kafka Connect wouldn’t work as shipped because of some of our architectural choices. I will narrate the story of our implementation and the issues faced in the next post.