0 to Production in 90 secs with Snowflake (Simplicity of Scalability)

After reading & commenting on many posts about what makes a great enterprise-wide data platform for a business, I decided to write about this topic to share my thoughts.

The ideas ranged from how cheap or how fast the platform was, to which had the most knobs to tune each workload, to which had the most open-source backbone, all offered as reasons for being chosen as the de facto enterprise single source of truth for all things analytics in an organization.

IMO, these are all valid reasons to consider, but they should not be the sole drivers for choosing the right platform that will potentially drive all of the analytics in your business. The reason is that these topics (speed, cost, tunability & open source) matter most to the people who run & maintain the data pipelines, not to the business.

So what does business really care about?

They just want all of the data they need in a reliable & timely manner in the most cost-effective way.

It is a simple concept if you know what you are trying to purchase & for whom. I say this because, the majority of the time, we forget that the only reason IT & Data teams exist is to support the business in making more money. Their customers are the business folks, so all of their decisions should be driven by what is best for the business.

However, many times the platform decisions are made by the IT/Data departments based on their own wants & needs, which do not always line up with those of their business counterparts. They want the cheapest, fastest & coolest toys with enough knobs to play with, as techies & DBAs have a tendency to want to control everything in minute detail within their platform. Again, that's how things have worked since the inception of databases & data warehouses. They had to control everything because they had finite resources to juggle across all the business users. However, technology has evolved, and the amount of control they want/need with cloud-based platforms may actually end up hurting the business instead of helping it.

Why?… Because business needs are different from techies’ needs. Business looks for a cost-effective solution, but not necessarily the cheapest one. They do care about accessing all the data they need quickly & in a timely manner, which means they do care about speed, but they don’t really care if it is the fastest as long as it is fast ENOUGH. (Queries running 3 secs vs. 5 secs don’t usually make or break any business outcomes.)

And the product being an open-source solution?… Businesses couldn’t care less. That is a criterion some IT/Data folks have a personal or emotional attachment to as an added purchasing factor, but it rarely, if ever, provides any tangible business value. It sure does make things much more complicated than they ought to be, though.

Does having more knobs to tune a given platform matter…? Not for a business user! If it works and is fast ENOUGH, they don’t care how many or how few knobs it has for techies to play with. Well… that’s not entirely accurate. They do care if it takes 3 months to add a new dataset because of the man-hours required to test & tune each new dataset with all those knobs, which then creates a backlog of workloads they have to wait for. The business will totally care about that, & not in a good way. They would simply rather have all of the data they need operating at 80-90% of possible peak performance than have access to only a few datasets that are 100% optimized & tuned. So it is important to be able to onboard data quickly and for that data to perform fast ENOUGH for the business w/o any additional work by IT.

The business also really cares about scalability and agility. Their customers & competitors don’t sit around waiting for months for proposed additions & changes to be implemented by the techies. For example, gaining access to weather or gasoline pricing data 3 months after a request has ZERO value for them. By the time they gain access, the data is either no longer relevant or the opportunity that data could have provided has long gone bye-bye.

As things like supply, demand, competitors, M&As & products evolve very quickly in a business, they really need a platform that can keep up with anything they can throw at it, without worrying about whether the solution can handle it or IT spending a ton of time/effort to make it work.

So in my opinion, scalability & simplicity should be the major driving factors for any analytics platform that is going to serve the business. The idea is very simple: the easier it can scale to new business demands, and the simpler it is to add new workloads or maintain existing ones (freeing up IT/DBA time to provide new datasets instead of babysitting existing stuff), the better it is for the business.

This is where Snowflake truly shines.

We are talking about the time & effort it takes to go from a business request to full production with hundreds of users. Any DBA knows that when the business asks for a new dataset with hundreds of millions of rows and potentially hundreds of users, it would normally take months of testing & capacity planning before it could go into production. That is, if your current platform can handle the new workload at all, and much longer if it needs to be upgraded. So I decided to show you how Snowflake handles this type of business request.

Let’s assume we just got a request from the VP of Sales to provide his team with a simple sales data model where the fact table contains over 200M rows. What is worse, they will be using Tableau and have close to 200 users during peak hours. This means hundreds of ad-hoc queries hitting a large set of tables. Usually, this would ring alarm bells on any traditional data warehouse, & the DBA is responsible for keeping it from crashing under extreme workloads. But have no fear, we’ve got Snowflake to the rescue. Let’s see how it is done.

First, the last thing we need is any negative impact on another business unit or on the current ELT data pipelines. This means we need to spin up a new compute cluster specifically for this use case to isolate its compute resources from the rest of the company’s workloads. Easy peasy. Use the UI or use SQL.

Let’s create a cluster specifically to ingest the new data. We’ll initially choose a MEDIUM cluster size (4 servers with 8 cores) but can always change the size at any time to fit our needs.
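If you prefer SQL over the UI, the statement looks roughly like this (the warehouse name INGEST_WH and the suspend settings are my own picks, not anything mandated by the platform):

```sql
-- Hypothetical warehouse name; MEDIUM is the 4-server size mentioned above
CREATE WAREHOUSE INGEST_WH
  WAREHOUSE_SIZE      = 'MEDIUM'
  AUTO_SUSPEND        = 60     -- pause itself after 60 secs of inactivity
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = FALSE; -- start running right away
```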

1. TIME TO FULLY RUNNING INGEST CLUSTER: < 2.0 secs

2. Then create a pointer (an external stage) to where the data is on Azure ADLS Gen2.
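Something along these lines would do it. The stage name, container URL & SAS token below are placeholders for your own Azure details:

```sql
-- Placeholder account/container/token values
CREATE OR REPLACE STAGE SALES_STAGE
  URL         = 'azure://myaccount.blob.core.windows.net/sales-data/'
  CREDENTIALS = (AZURE_SAS_TOKEN = '<your-sas-token>')
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```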

3. Create the target tables
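The exact column lists aren’t the point of the story, so here is just an illustrative shape of the two tables (names & columns are made up):

```sql
-- Illustrative definitions only; real columns would match the source CSV files
CREATE OR REPLACE TABLE CUSTOMERS (
  CUSTOMER_ID   NUMBER,
  CUSTOMER_NAME VARCHAR,
  REGION        VARCHAR
);

CREATE OR REPLACE TABLE SALES_FACT (
  ORDER_ID    NUMBER,
  CUSTOMER_ID NUMBER,
  ORDER_DATE  DATE,
  QUANTITY    NUMBER,
  AMOUNT      NUMBER(18,2)
);
```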

4. Now we ingest the smaller customers table with 4371 rows:
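A single COPY command does the job (assuming the customer files sit under a customers/ prefix in the stage sketched above):

```sql
COPY INTO CUSTOMERS
  FROM @SALES_STAGE/customers/          -- hypothetical path within the stage
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```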

Well… that took 2.1 secs, but it was only 4K rows.

5. Now let's do something bigger. How about ingesting 207M rows?

We will do this in a way that you can’t do with any other platform: increase the node count from 4 servers to 8 to speed up ingestion.
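Since Snowflake sizes map to node counts, going from 4 to 8 servers is simply a one-size bump, roughly like this (using the warehouse name assumed earlier):

```sql
-- MEDIUM -> LARGE doubles the node count from 4 to 8
ALTER WAREHOUSE INGEST_WH SET WAREHOUSE_SIZE = 'LARGE';
```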

Done! Time: 0.2 secs, or 205 ms to be exact.

6. Rapidly Ingest 207M rows of data.
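Again a single COPY, this time pointed at the folder holding all the fact-table CSV files (path & pattern are placeholders):

```sql
COPY INTO SALES_FACT
  FROM @SALES_STAGE/sales/       -- hypothetical path within the stage
  PATTERN     = '.*[.]csv'       -- pick up every CSV file in the folder
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);
```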

Time: 27.0 secs for 267 CSV files totaling 3.3GB in size

7. Downsize the compute to 2 nodes immediately after ingestion so as not to waste resources.
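A sketch of what that looks like in SQL, again with the assumed warehouse name (SMALL being the 2-node size):

```sql
ALTER WAREHOUSE INGEST_WH SET WAREHOUSE_SIZE = 'SMALL';  -- back down to 2 nodes
```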

Time: only 112 ms!

Now we have the 2 tables we need for our business. The larger one has 207 million rows, and we have done 0 optimizations on it. This means zero extra IT/DBA effort has been wasted on making things run fast so far.

Rolling it to Prod with 200 users

Ok, now that we have the data, we can start rolling it out to production and run a concurrency test with 200 users to make sure it can serve our business users well even under peak usage.

1. Create a new compute cluster to handle 200 ad-hoc BI users, specifically for the Sales Department. Let’s do this via the UI & make sure to set the max cluster count to a higher number like 5. (The SQL equivalent is sketched below.)

There are three simple options to consider, and we can change any of them at any point in production if we need different behavior or performance w/o affecting currently running processes & users. Again, super simple & quick, with very little technical work if any at all. (Three because the 4th, Auto-Resume, is a no-brainer option.)
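For reference, the SQL equivalent of those UI choices would look something like this. COMMERCIAL_WH is the warehouse name referenced later in the JMeter setup; the size & policy values are simply the ones chosen for this test:

```sql
CREATE WAREHOUSE COMMERCIAL_WH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5          -- allow up to 5 clusters for concurrency spikes
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60
  AUTO_RESUME       = TRUE;      -- the no-brainer 4th option
```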

2. Replicate the entire Staging schema to PROD, with 2 tables & over 207 million rows, using the 0-COPY CLONE feature (an awesome capability that lets you instantly clone entire databases, schemas & tables w/o added storage costs).
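The whole thing is a one-liner (the schema names STAGING & PROD are illustrative):

```sql
-- Zero-copy clone: an instant, metadata-only copy of the schema and its tables
CREATE SCHEMA PROD CLONE STAGING;
```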

TIME: 2.2 secs to have an isolated PROD schema w/o paying extra storage costs.

Yep, all 207 million rows are there.

3. Concurrency test with 200 users & 400 queries in 45 secs. Now that we have a separate copy of the data and a separate compute cluster dedicated to Sales BI users, we can proceed with the concurrency test. For this, we will use a popular open-source tool called Apache JMeter to simulate 200 users running 2 queries each within 45 secs to see how the platform handles this large instant spike in demand.

First, we specify 200 users and 2 queries each (the loop count) and ramp the whole thing up over 45 secs.

Next, create a connection & specify an initial set of queries to run prior to the test. These 3 statements tag each test query so we can track them in Snowflake afterward, disable the results cache to make sure each query actually uses compute, & set our default compute cluster for this test to COMMERCIAL_WH.
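Those 3 setup statements look roughly like this (the tag value is just an example; COMMERCIAL_WH is the warehouse created above):

```sql
ALTER SESSION SET QUERY_TAG = 'JMETER_CONCURRENCY_TEST'; -- tag every test query
ALTER SESSION SET USE_CACHED_RESULT = FALSE;             -- force real compute, no result cache
USE WAREHOUSE COMMERCIAL_WH;                             -- default cluster for the test
```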

Then specify a query that joins the 2 tables and does some calculation, aggregation, and sorting. We use a variable ${v_custid} in the query that swaps in distinct CustomerId values from a text file, to make sure that every single query result is unique. All in the name of trying to make the platform sweat a little.
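Here is the flavor of the test query. The table & column names are the illustrative ones from the staging sketch earlier, and ${v_custid} is the JMeter variable fed from the text file:

```sql
SELECT c.CUSTOMER_NAME,
       COUNT(*)                 AS ORDER_COUNT,
       SUM(f.AMOUNT)            AS TOTAL_SALES,
       SUM(f.AMOUNT) / COUNT(*) AS AVG_ORDER_VALUE
FROM   PROD.SALES_FACT f
JOIN   PROD.CUSTOMERS  c ON c.CUSTOMER_ID = f.CUSTOMER_ID
WHERE  f.CUSTOMER_ID >= ${v_custid}   -- swapped per execution so each result set is unique
GROUP  BY c.CUSTOMER_NAME
ORDER  BY TOTAL_SALES DESC;
```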

NOW HIT THE RUN BUTTON & WATCH ALL THE FUN!

With the compute cluster in a paused state, all 400 queries are sent to Snowflake within the first 45 secs. Snowflake instantly wakes up the cluster and starts executing each query. At one point, it realizes that one MEDIUM cluster will not be enough to provide consistent results, so it automatically spins up a second cluster to handle the concurrency. As the queries & users taper off, the additional cluster is removed, the original cluster goes back to its paused state, and you as the customer stop incurring charges minutes after the last query executes.

Let’s look at some of the impressive results that we achieved literally with ZERO tuning & tweaking.

Out of all 400 queries sent, the avg response was 2.0 secs. But the main thing to focus on is the 0.00% error rate. This means every single query executed successfully and nothing failed due to resource or scalability problems. (This is where you would normally see non-zero error rates with most other cloud platforms, because they either can’t scale at all or can’t scale fast enough.)

Now let's look at the results in a visual chart. This is not something you see every day with 200 users & 400 concurrent queries within 45 secs on other platforms, unless you spend hours optimizing things and make sure the test runs while the system is not busy doing other things.

The average & median (blue & purple) times are actually going down as the concurrency is going up. This means Snowflake is providing faster & faster results as the user count increases. Totally not your average Joe results!

The red line, which is the deviation, is also going down. That means execution times still deviate from previous runs, but the new numbers are even lower. You really want this line to be flat, indicating consistent results as the user count increases, so seeing it actually point down is something special.

Finally, the green line indicating system throughput (amount of queries & data served) is going up as the concurrency builds up. Again, a big testament to the power of Snowflake in terms of scalability and handling concurrency in a fully automated manner with ZERO input from administrators.

Set it & forget it to get great results. That’s what you want from your platform! Not manual knob turning & tweaking just to get OK results.

Let’s finally look at the query history in Snowflake to see individual results and concurrency handling.
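One easy way to pull those test queries back out is to filter Snowflake’s query history by the tag we set in the JMeter connection (same assumed tag value as in the setup sketch above):

```sql
SELECT QUERY_TEXT,
       TOTAL_ELAPSED_TIME,   -- in ms
       ROWS_PRODUCED,
       CLUSTER_NUMBER        -- which cluster of the multi-cluster warehouse ran it
FROM   TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 1000))
WHERE  QUERY_TAG = 'JMETER_CONCURRENCY_TEST'
ORDER  BY START_TIME;
```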

You can see most queries were executed in less than 500 ms (the slowest was 2.1 secs), and the queries consistently return over 1,000 rows, meaning the results look more like Analytics/BI workloads and not simple spot queries that return only a few rows.

You can also see CLUSTER NO in the 5th column. This shows the horizontal scaling process: essentially, how many total MEDIUM clusters were needed to handle the concurrency. Using a MEDIUM size cluster, all it needed was one more cluster to handle that instant rush of 200 users, so 2 was the maximum it needed.

Now, if we reduced the cluster size to a SMALL (2 nodes), the execution time for each query would roughly double, and in that case you would likely see more clusters added to handle the same amount of concurrency. Snowflake handles this in a completely automated way.

All you need to know is that the cluster size (XS, S, M, L … 4XL) lets you adjust the speed of individual queries based on the business use case, and that turning on the auto-concurrency option simply allows the platform to keep up that same performance level with a higher number of users if/when they show up.
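And if the business use case changes later, turning either knob is a one-line change, for example (illustrative values, same assumed warehouse name):

```sql
ALTER WAREHOUSE COMMERCIAL_WH SET
  WAREHOUSE_SIZE    = 'LARGE'   -- faster individual queries
  MAX_CLUSTER_COUNT = 10;       -- headroom for more concurrent users
```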

In the end, clusters will automatically start, stop and scale based on current user demand within seconds, providing users with great performance in a very consistent manner while saving you money at the same time.

So let's review the full test scenario:

1. A compute cluster for Ingest/ELT: < 5.0 secs

2. Increasing the compute & ingesting 207 million rows: 27.5 secs

3. Replicating all the data to PROD with 0-Copy Clone: 2.1 secs

4. A new compute cluster for BI users to handle concurrency: < 5.0 secs

5. Running a test against this data with 200 users / 400 queries: < 60 secs

Total run time of less than 3 mins to ingest 207 million rows of new data into a staging table using a newly created ELT compute cluster, replicate the large tables into a prod schema for BI users, create a new compute cluster for PROD users that can scale for concurrency, and test over 400 queries in 45 secs.

Being able to roll out a new large dataset to 200 users quicker than you could finish your coffee, without affecting any other existing workloads or doing any manual optimization or tweaking, and having it perform at these levels: PRICELESS.

Performance & speed do matter to the business, but not in the way that most techies think of speed & performance. Running a few queries faster than anything else on the market does not matter because it doesn't change business outcomes. What really matters to the business is how fast your team can deliver the data they need in a consistent manner, as long as the performance is within the acceptable levels dictated by each business use case.

If achieving decent performance & scalability requires a lot of work on the IT side, it will have a compounding effect as more & more new datasets are requested. The end result will be only a small number of datasets available to the business, with only a few of them highly performant, while the data team tries to juggle onboarding new datasets with maintaining the performance of existing ones.

What happens when it takes too long to provide the requested datasets to the business? Business users will only wait so long before going off on their own and building their own data marts & databases, and then you end up with the same problem you were trying to fix in the first place with an enterprise data platform: data silos & shadow IT!

This is why SIMPLICITY & SCALABILITY should be on top of your list when it comes to choosing an analytics platform that will be used across your business.
