Introducing Gobblin Extended

Picking an ingestion tool that is stable and meets your requirements is not an easy journey. At Airyrooms we use Kafka for our internal tracking, so we faced this very problem. There are several tools out there, but the standouts are Kafka Connect and LinkedIn's Gobblin. Both are awesome, but neither matched all of our needs.

Our need is to ingest JSON bytes from Kafka into Amazon S3. We chose Amazon S3 over HDFS because it is scalable, durable, and, most importantly, a managed service from AWS, which significantly cuts the cost of maintaining additional data storage infrastructure. We also need to store this data in a time-partitioned manner, meaning the files should be grouped by a time field into paths in our S3 buckets.
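To give a feel for what we mean by time-based grouping, here is a minimal sketch (not our actual implementation; the day-based pattern and the topic-prefixed key layout are just assumptions for illustration) of how a record's time field can be mapped to an S3 key prefix:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimePartitioner {

    // Hypothetical pattern for grouping files by day; in practice this would be configurable.
    private static final DateTimeFormatter PARTITION_FORMAT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    /**
     * Derives an S3 key prefix from a record's epoch-millis time field,
     * e.g. 1489363200000 -> "tracking-events/2017/03/13".
     */
    public static String partitionKey(String topic, long timestampMillis) {
        return topic + "/" + PARTITION_FORMAT.format(Instant.ofEpochMilli(timestampMillis));
    }

    public static void main(String[] args) {
        System.out.println(partitionKey("tracking-events", 1489363200000L));
    }
}
```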

We use the Confluent Platform for our message queuing at Airyrooms, so we first considered Kafka Connect as our ingestion tool to S3. Unfortunately, the open-source connectors available on GitHub did not really fit our needs: there are some S3 sinks forked from kafka-connect-hdfs (which is maintained by the Confluent team), but they do not seem to be well maintained. So we turned to Gobblin. In its current state, Gobblin has the flexibility we need, such as defining the time format and storing the current job and task states. It also has good documentation that clearly explains Gobblin's internal mechanics. Yet it is still tightly coupled to HDFS storage.

So we decided to extend Gobblin's capabilities, since its documentation helps a lot in understanding how it works. So far we have successfully implemented two extensions:

  1. Config storage on AWS DynamoDB (sketched below).
  2. State storage on a MySQL database (see the second sketch below).
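
To give a feel for the DynamoDB-backed config storage, here is a minimal sketch using the AWS SDK for Java document API. The table name, the `jobName` hash key, and the flat key-value item layout are assumptions for illustration, not necessarily the exact schema we use:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;

import java.util.Map;
import java.util.Properties;

public class DynamoDbConfigStore {

    private final Table table;

    public DynamoDbConfigStore(String tableName) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        this.table = new DynamoDB(client).getTable(tableName);
    }

    /** Loads one job's key-value config item into a Properties object. */
    public Properties load(String jobName) {
        // Assumes a table with "jobName" as the hash key and config entries as attributes.
        Item item = table.getItem("jobName", jobName);
        Properties props = new Properties();
        if (item != null) {
            for (Map.Entry<String, Object> entry : item.asMap().entrySet()) {
                props.setProperty(entry.getKey(), String.valueOf(entry.getValue()));
            }
        }
        return props;
    }
}
```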
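The MySQL state storage follows the same idea as Gobblin's filesystem state store: serialize job and task states and persist them under a composite key. A real implementation plugs into Gobblin's state-store abstraction; this standalone JDBC sketch only shows the persistence idea, and the `gobblin_task_states` table and its columns are hypothetical (it also assumes the MySQL JDBC driver is on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class MysqlStateStore {

    private final String jdbcUrl;
    private final String user;
    private final String password;

    public MysqlStateStore(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** Upserts a serialized task state keyed by job id and task id. */
    public void put(String jobId, String taskId, byte[] serializedState) throws SQLException {
        String sql = "INSERT INTO gobblin_task_states (job_id, task_id, state) VALUES (?, ?, ?) "
                + "ON DUPLICATE KEY UPDATE state = VALUES(state)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, jobId);
            stmt.setString(2, taskId);
            stmt.setBytes(3, serializedState);
            stmt.executeUpdate();
        }
    }

    /** Reads a serialized task state back, or null if absent. */
    public byte[] get(String jobId, String taskId) throws SQLException {
        String sql = "SELECT state FROM gobblin_task_states WHERE job_id = ? AND task_id = ?";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, jobId);
            stmt.setString(2, taskId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getBytes("state") : null;
            }
        }
    }
}
```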

Our work can be found here. There is still plenty of room for improvement, and contributions are welcome; even just filing issues on this extended version would be a great help!