GumGum Tech Blog

Replacing Apache Druid with Snowflake Snowpipe


However, real-time reporting is one of the most difficult infrastructures for a Data Engineering team to manage. It is costly, maintenance-heavy, and resource-hungry, and as data volume grows at the ingestion point, more distributed clusters must be built. It is harder still when your data is semi-structured: semi-structured data needs extra transformation at the streaming layer before it is stored in the database and becomes available for reporting. Any system you choose for real-time reporting will face these challenges, so it is important that the tool is flexible and scalable.

Many data warehouses support real-time reporting, but meeting reporting SLAs is just a small part of the full picture. Data accuracy, cost efficiency, scalability, and, most importantly for us, maintainability all seemed to matter much more than the difference between 5- and 15-second latency measurements.

Photo by Hunter Harritt on Unsplash


But managing this reporting system became difficult. As reporting requests grew more feature-rich, we needed joins to new datasets and support for ad hoc queries, both of which are difficult to accommodate in Druid. And as our data volume scaled, there was continuous pressure just to keep the cluster up and running. Even with a Lambda architecture that replaced streaming data with offline batch data, we still faced the issues described below. The technical SLAs were still real-time, but nothing about feature development or maintenance felt fast.

Common issues we saw on a weekly basis:

  • Schema changes to the source and target had to be applied manually.
  • When a historical node is destroyed by AWS, all in-flight queries and sub-queries on it fail. Brokers continue to route to the destroyed node until ZooKeeper picks up the change.
  • The MiddleManager throws OOM errors when its threads and cores are not tuned correctly, causing loss of real-time data.
  • Some queries throw OutOfMemoryError in the JVM process on historical nodes, requiring a node restart.
  • Although Druid is a massively parallel query system, a single query can end up being processed on hundreds of historical nodes, and there is no fault tolerance on the query execution path.
  • Network failures between Druid components.
  • When a broker sends sub-queries to historicals, all of those sub-queries must complete successfully before it can return anything to the client; if any sub-query fails, the whole query fails.

Druid is a fantastic tool, but its complex architecture with many moving pieces makes it a large task to manage or automate. We were looking for a fully managed solution that would give us near-zero maintenance as we rapidly scale our business and dataset volume.

Druid Architecture


Lambda Architecture

Snowpipe is a data ingestion service that loads real-time data into Snowflake tables. When new data lands in an S3 bucket, Snowpipe runs a COPY statement to load it.

At GumGum, we have Spark Streaming jobs running on Databricks that consume data from Kafka for different data structures. We sink that output data to S3 in small batches; Snowpipe then loads the data into tables, where it is ready for reporting from Snowflake.
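The small-batch sink pattern can be illustrated with a minimal pure-Python sketch (the real pipeline uses Spark Structured Streaming on Databricks; the `MicroBatchSink` class, the `flush_records` callback, and the batch size here are illustrative, not our production code):

```python
class MicroBatchSink:
    """Buffers streaming records and flushes them in small batches,
    mimicking how a streaming job writes small files to S3."""

    def __init__(self, flush_records, max_batch=3):
        self.flush_records = flush_records  # callback that persists one batch
        self.max_batch = max_batch
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        # Emit whatever is buffered as one "file", then reset.
        if self.buffer:
            self.flush_records(list(self.buffer))
            self.buffer.clear()


batches = []
sink = MicroBatchSink(batches.append, max_batch=3)
for event in ["a", "b", "c", "d"]:
    sink.write(event)
sink.flush()  # flush the final partial batch
# batches == [["a", "b", "c"], ["d"]]
```

In production, each flushed batch becomes one small S3 object, which is exactly the granularity Snowpipe's auto-ingest reacts to.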

Streaming Pipeline

Our engineering task was to start sinking the Spark Streaming output data to S3 (instead of to the Druid Tranquility node, a.k.a. the MiddleManager). To push the S3 data to Snowflake, we created a table matching the output data, a Snowpipe definition, a stage pointing at the S3 bucket path, and a file format.


dataFrame.withColumn("datetime", date_format(col("timestamp"), "yyyy-MM-dd-HH"))
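The `datetime` column above buckets events by hour, which keeps each S3 partition small. A plain-Python sketch of the same key (the `hourly_partition` helper is illustrative) looks like:

```python
from datetime import datetime, timezone

def hourly_partition(ts: datetime) -> str:
    """Produces the same key as Spark's
    date_format(col("timestamp"), "yyyy-MM-dd-HH")."""
    return ts.strftime("%Y-%m-%d-%H")

print(hourly_partition(datetime(2021, 7, 4, 15, 30, tzinfo=timezone.utc)))
# 2021-07-04-15
```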

Snowflake Streaming Data Ingestion

The Snowpipe Auto Ingest process flow

Now the Snowflake work starts. First, we create a TABLE matching the S3 data and its data types. We also added several metadata columns (LOAD_TIME, FILE_NAME, and FILE_ROW_NUMBER) so we can trace each row back to its source file.

create or replace TABLE <streaming_table> (
    timestamp TIMESTAMP_NTZ,
    column_1 VARCHAR(8),
    column_2 VARCHAR(256),
    column_3 NUMBER(38,0),
    load_time TIMESTAMP_TZ,
    file_name VARCHAR,
    file_row_number NUMBER
);

Next, we create a STAGE pointing at the S3 location; the Snowpipe will fetch data from this stage.

create or replace STAGE <stage_name>
  url='s3://<bucket_name>/<streaming_table_path>/'
  storage_integration=<storage_integration_name>;

Before creating the Snowpipe, we have to create a FILE FORMAT, which defines the field delimiter (TAB in our case) and the file extension we use when storing the files in S3.

create or replace FILE FORMAT S3_TSV_FORMAT
  TYPE=csv field_delimiter='\t' NULL_IF=('null') FILE_EXTENSION='.gz';

Now it is time to create a PIPE:

create or replace pipe <streaming_snowpipe_name> auto_ingest=true as
COPY into <streaming_table>
  (timestamp, column_1, column_2, column_3, load_time, file_name, file_row_number)
from (
  select $1, $2, $3, try_to_number($4),
         current_timestamp::timestamp_tz, metadata$filename, metadata$file_row_number
  from @<stage_name>
)
file_format = S3_TSV_FORMAT
on_error = skip_file;

By specifying the file_format when creating the pipe, Snowpipe splits each row on the delimiter so we can select individual columns while inserting the data. try_to_number() converts a column value to a number, returning NULL if the conversion fails. on_error=skip_file skips the whole file if an error occurs while loading it. The auto_ingest=true parameter tells the pipe to read event notifications sent from the S3 bucket to an SQS queue when new data is ready to load.
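The forgiving behavior of try_to_number() is worth internalizing, since it is what keeps one malformed field from failing a load. A Python sketch of its semantics (the `try_to_number` function here is an illustrative stand-in, not Snowflake's implementation):

```python
from typing import Optional

def try_to_number(value) -> Optional[float]:
    """Mimics Snowflake's TRY_TO_NUMBER: returns the numeric value,
    or None (SQL NULL) when the input cannot be parsed as a number."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(try_to_number("42"))    # 42.0
print(try_to_number("oops"))  # None
```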

To start auto-ingest, we need to grant permission on the S3 bucket and configure event notifications so the bucket notifies Snowpipe when new data is available to load. The auto-ingest feature relies on SQS queues to deliver these event notifications from S3 to Snowpipe; we can get the queue ARN once the Snowpipe is created.

show pipes;

The SQS queue ARN appears in the notification_channel column of the output; this is the ARN to enter when creating the S3 event notification.
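The same bucket notification can also be set up outside the console, as the NotificationConfiguration document accepted by `aws s3api put-bucket-notification-configuration`. The Id and prefix below are illustrative; the QueueArn placeholder is the value taken from `show pipes;`:

```json
{
  "QueueConfigurations": [
    {
      "Id": "snowpipe-auto-ingest",
      "QueueArn": "<sqs_queue_arn_from_show_pipes>",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": {
        "Key": {
          "FilterRules": [
            { "Name": "prefix", "Value": "<streaming_table_path>/" }
          ]
        }
      }
    }
  ]
}
```

The prefix filter keeps the queue from receiving events for unrelated objects in the same bucket.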

After configuring the event notification and creating the Snowpipe, data is auto-ingested from S3 into the Snowflake table. The following commands can be used to monitor the pipe. The first returns the execution state, pendingFileCount, lastIngestedTimestamp, and many more fields describing the pipeline.

select SYSTEM$PIPE_STATUS('<streaming_snowpipe_name>');

The next command returns errors that occurred in the pipeline, such as parsing errors when the data does not match the table schema.

select * from table(validate_pipe_load(
  pipe_name=>'<streaming_snowpipe_name>',
  start_time=>dateadd(hour, -1, current_timestamp())));


Shutting down all of our Druid clusters saved significant cost and maintenance time. We reduced ingestion cost by 10x. Additionally, where we used to do weekly maintenance, in the last six months we have performed maintenance only once (due to an unexpected schema change). In the migration we lost Druid's main feature, low-latency aggregation of time-series data, but for us Snowpipe performance is excellent and still allows us to make business decisions in real time.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | LinkedIn | Instagram


