Making queries 100x faster with Snowflake

Why and how we migrated our product usage data from PostgreSQL to Snowflake

Context

At Doctrine we provide access to legal information for professionals. This includes legislation, case law, a lawyer directory and company legal statistics. The main service is a web application (other services include, for instance, a customized legal newsletter).

In our previous article about database architecture, we presented how we had previously improved our database infrastructure to solve two issues:

  1. Our data analysis scripts heavily use the database server resources (CPU, I/O), slowing down the web application used by our customers.
  2. When our team wanted to query the product usage data, the queries took a very long time to run.

To better solve the first issue, we have recently improved our database architecture using logical replication as explained in this article. But here we will talk about what we have done to solve the second issue.

What is product usage data?

The product usage data we are talking about consists of events recorded when users interact with our platform, Doctrine.fr. Events are actions like “perform a search”, “read a document”, “create a folder”, etc.

Recording them allows us to understand how our customers use our product, and to focus our development efforts on the right features. This data is then manually queried with SQL by product managers, data scientists or developers.

The order of magnitude is around 5 million events recorded per day, for a total of around 500 GB of data.

The previous solution

In our previous architecture, we had converged on this solution for storing the real-time product usage data produced by our application:

Our previous architecture

This architecture was already an evolution from directly writing to tables from our web application, which we had done at first.

When the database is heavily used, writes can be delayed, and we want to avoid increasing response times when users interact with our platform. The architecture shown above solves this issue by sending the SQL write queries (typically INSERTs but also a few UPDATEs) to a Redis queue. Adding requests to the queue is instantaneous. It is then the job of an independent daemon to pick up the SQL queries from the queue and run them. Even if running the queries is slow, it does not matter, since it happens independently from the application.

However, we were not entirely satisfied with this solution. The main issue was that running queries on this data could be very slow. For a while, we worked around this by creating indexes which allowed the data to be queried much faster. But the way data scientists want to query the data can be very different depending on the use case, so very specific indexes need to be created. Creating an index on a big table takes a very long time, so if you want to optimize a 3-hour query but first need to spend 10 hours creating an index, your analysis ends up taking just as long.

Setting up Snowflake and Amazon Athena

It all started when we were contacted by a sales representative from Snowflake. Snowflake Inc. was founded in California in 2012 and has raised $1.4 billion. Its main revenue comes from selling the Snowflake data warehouse as a Software-as-a-Service product. After a few exchanges with them, and a few reference checks, it appeared that their solution was a relevant candidate to solve our issues.

However, we wanted to test a competing solution. After collecting some information, we found that Amazon Athena was a solution worth trying, especially since we are entirely hosted on AWS.

To test the two products for real, we decided to build an entire data pipeline in a way that would allow us to benchmark them and find which one best fits our needs. Here is a diagram of what we set up:

The infrastructure during the Snowflake vs Athena benchmark

The common part: Amazon Kinesis

Amazon Kinesis Data Streams and Kinesis Data Firehose were services we had not used before, so we had to set them up and learn how to use them.

A Kinesis data stream is basically a publish/subscribe system in which you can send arbitrary binary data. In our case, we chose to use newline-separated JSON encoded in UTF-8. The instances running the backend of our web application send their data to the Kinesis data stream, and any client which has subscribed to the stream then receives the data in almost real time.

Instead of using a program to subscribe to the data stream, we use an Amazon service called “Kinesis Firehose”. Kinesis Data Streams and Firehose are designed to work well together, so when you create a Firehose, you can tell it to use an already created Data Stream as input. Firehose performs two tasks here:

  1. It converts the data format from JSON to Parquet, according to a schema declaration in AWS Glue.
  2. It packs the data in batches which it writes to an Amazon S3 bucket.

Here we decided to use Parquet because we had heard that Amazon Athena is more efficient at querying Parquet data than raw JSON, but we could have stored the data directly as raw JSON files, sparing the need to use AWS Glue. Athena and Snowflake both support JSON and Parquet, and we have in fact since used raw JSON successfully in another data pipeline. In that case, the only job of Firehose is to batch and write the data to S3, without performing any format conversion.

Setting up Athena

Amazon Athena is able to query the data from S3 directly. Athena provides the illusion that the data you are querying is in a regular database table, while it is in fact reading the files from S3 on the fly. To do that, you have to create a schema declaration in AWS Glue, which basically says which “columns” exist and what their data types are. We re-used the same schema declaration we provided to Firehose. Here is what a schema declaration looks like:

A few columns of our schema declaration in AWS Glue
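For reference, the same kind of schema can also be expressed as Athena DDL over the S3 location. The sketch below is only an illustration: the table, column and bucket names are made up, not our actual schema.

    -- Sketch only: an external table over the Parquet files written by Firehose
    -- (table, column and bucket names are illustrative).
    CREATE EXTERNAL TABLE product_events (
      event_id    string,
      event_type  string,
      user_id     bigint,
      created_at  timestamp,
      data        string       -- raw JSON payload kept as a string
    )
    STORED AS PARQUET
    LOCATION 's3://example-bucket/product-events/';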

Setting up Snowflake

Snowflake required a bit of extra work, because the data needed to be copied into the Snowflake database.

We first declared a “stage” in Snowflake, which is basically an external data source declaration. The stage we used is our S3 bucket, and we provided Snowflake with an IAM user key so it could read data from the bucket.
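As a rough idea, declaring such a stage could look like the sketch below; the bucket path and credentials are placeholders, not our real configuration.

    -- Sketch only: an external stage pointing at the S3 bucket written by Firehose.
    CREATE STAGE product_events_stage
      URL = 's3://example-bucket/product-events/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = PARQUET);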

We could then copy the existing stock of data using the COPY INTO instruction, which copies from a stage to a table. Here is a complete example with the table creation and the data loading (I have kept only 5 columns to make the example simpler):
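A minimal sketch of what this looks like (the table and column names are illustrative, not our actual schema):

    -- Sketch only: create the destination table...
    CREATE TABLE product_events (
      event_id    STRING,
      event_type  STRING,
      user_id     NUMBER,
      created_at  TIMESTAMP_NTZ,
      data        VARIANT        -- unstructured JSON payload
    );

    -- ...then load the existing files from the stage into it.
    COPY INTO product_events
    FROM (
      SELECT
        $1:event_id::STRING,
        $1:event_type::STRING,
        $1:user_id::NUMBER,
        $1:created_at::TIMESTAMP_NTZ,
        parse_json($1:data::STRING)  -- transform the raw JSON string on the fly
      FROM @product_events_stage
    );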

Special care is taken for the “data” column where the input is unstructured JSON data. We store it in a column of type VARIANT (basically the equivalent of JSONB in Postgres).

Note that we call parse_json on the fly. A nice feature here is that we can transform data on the fly to adapt it to the destination table structure.

The above procedure will load the data once, but we wanted the stream of new data to be automatically added to the table. To do that, we used a native feature in Snowflake called Snowpipe. Declaring a Snowpipe is almost like performing a COPY INTO:
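A sketch of such a pipe, reusing the illustrative names from the previous snippets:

    -- Sketch only: a pipe that automatically loads new files from the stage.
    CREATE PIPE product_events_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO product_events
      FROM (
        SELECT
          $1:event_id::STRING,
          $1:event_type::STRING,
          $1:user_id::NUMBER,
          $1:created_at::TIMESTAMP_NTZ,
          parse_json($1:data::STRING)
        FROM @product_events_stage
      );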

An extra step was needed to notify Snowpipe when new files are written to S3. This was just a setting to add to the S3 bucket.

From that point, any new file written to the S3 bucket was automatically picked up by Snowpipe and the data was added to the table. It makes the data available in the final table only a few minutes after it is generated by our web application backend.

The benchmark

Similarities between Athena and Snowflake

To assess the performance of both products, we ran a bunch of queries corresponding to some of our data scientists’ use cases. Here is an extract of this benchmark:

A few use cases and the time taken to run the queries

The overall result is that both Snowflake and Athena are at least 100x faster than Postgres when a large amount of data has to be read.

The only cases where Postgres performs better are when we run a query for which a specifically tailored index is present. For instance, the “get last 100 events” query uses an index on creation time which can immediately find the correct 100 rows. This still makes Postgres an ideal database for the small read/write queries performed in real time by our application, where low latency is very important (Snowflake and Athena are not designed for this use case).
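For illustration, that kind of query and the index serving it could look like this in Postgres (table and column names are made up):

    -- Sketch only: an index on creation time lets Postgres find the 100 rows directly.
    CREATE INDEX events_created_at_idx ON events (created_at);

    SELECT *
    FROM events
    ORDER BY created_at DESC
    LIMIT 100;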

One feature common to both products is a way to see the progress of your query. A big issue with Postgres was that when a query had been running for more than a few minutes, you had no idea whether it was going to take 10 minutes or 10 hours, so our engineers started their queries and cancelled them if they took too long, without knowing whether they were 5% or 95% of the way through.

Progression of a query in Snowflake

Another good aspect of both Athena and Snowflake is how costs are computed. With both products, you don’t pay for processing power when you are not using it. Athena bills only for the queries you run, while Snowflake puts servers in sleep mode after a few minutes of inactivity and auto-scales the number of CPUs. This is a key advantage compared to Redshift or ELK, where the servers run and cost money around the clock.

Differences between Athena and Snowflake

The main difference between the two products is their support for writes.

When a user deletes their account, we anonymize all their data. Basically, this means we perform an UPDATE to “blank” the fields which contain references to the user id. This is possible in Snowflake but not in Athena. However, we could write a custom script which anonymizes the data in S3 directly, so this advantage of Snowflake is not enough to make a significant difference.
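A sketch of what such an anonymization could look like in Snowflake (the table, column and id value are illustrative):

    -- Sketch only: blank the columns that reference the deleted user.
    UPDATE product_events
    SET user_id = NULL
    WHERE user_id = 12345;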

A more important use case is when data scientists want to perform a “CREATE TABLE AS SELECT”, for example to extract a sample of data on which to perform further analysis. In one of our tests, we created a sample of 10 million rows: Postgres had not finished after 3 hours (we cancelled), Snowflake ran the query in 10 minutes, and Athena crashed with “Query exhausted resources at this scale factor”.
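A sketch of such a sampling query in Snowflake (table names are illustrative; SAMPLE (n ROWS) is the Snowflake syntax for a fixed-size sample):

    -- Sketch only: materialize a 10 million row sample for further analysis.
    CREATE TABLE product_events_sample AS
    SELECT *
    FROM product_events
    SAMPLE (10000000 ROWS);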

The final benchmark result

Our different needs and the results with the different products are presented in this table:

Summary of our benchmark

Snowflake clearly fits our needs best. It is more expensive than Amazon Athena, because you have to pay for Snowpipe while Athena reads directly from S3. But we thought this extra cost was worth it and decided to adopt Snowflake.

One year after deployment

Since we chose Snowflake in the summer of 2019, we have been adding more data to the pipeline. We have taken advantage of the Kinesis setup to remove the previous Redis-based system, and we have also used it to send some events to Mixpanel, a tool we use to easily analyze simple event data. The final architecture looks like this:

The final product usage data architecture

We also added a Snowpipe to gather Amazon Application Load Balancer logs, which is very easy since the load balancer can write its logs to S3 in CSV format, which Snowpipe can parse directly.
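As a rough idea, such a pipe mostly differs in its source and file format; the names and CSV options below are illustrative.

    -- Sketch only: load balancer logs loaded from S3 as delimited text.
    CREATE PIPE alb_logs_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO alb_logs
      FROM @alb_logs_stage
      FILE_FORMAT = (TYPE = CSV);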

We also added some application JSON logs (table NODEJS_JSON_LOGS). These are our 3 biggest tables:

Our biggest Snowflake tables (sizes shown are compressed, around 1/10 of the raw size)

One year later, our monthly Snowflake cost is around 3% of our total server cost, compared to 28% for all our Aurora Postgres databases, so the cost is very reasonable for us.

Conclusion

After a benchmark, we chose to deploy the Snowflake data warehouse. We have been fully satisfied with this product, which is now used by product managers, developers, data scientists and financial analysts. It allows us to run queries very fast and store several terabytes of data while keeping the cost low.