#1 Test to Perform before Picking Your Next Analytics Platform (Concurrency)
After years of helping many customers pick their next modern cloud data platform, it is still shocking to me to see new prospects make one of the most critical decisions, picking their new data platform, without this crucial test: "Concurrency".
Why is it so important?
The real question is why are you modernizing your data platform? It is not because of data engineers, ETL folks, or just for the sake of moving to the cloud. It is because of your business users. They want to access more data sources, more frequently and at a larger volume than ever. They want to run good, bad & ugly queries using their favorite BI tools all at the same time and not have to deal with failed queries & slow response times in their dashboards.
When you factor in that an average BI dashboard can easily have 10 visuals (charts, tables, KPIs, dropdowns, etc.), which can trigger 10 simultaneous SQL queries with every single click on the dashboard, it is easy to get to 100 concurrent queries with just 10 users clicking away in their dashboards.
So if your new shiny data platform can’t deliver just 100 concurrent queries from your Tableau or PowerBI server, do it fast, and do it without errors, you are guaranteed to be in a ton of deep #doo-doo when your workloads go into production and users start sending angry emails about errors & slowness on the new platform.
In the end, it doesn’t really matter how fast you can ingest or process the data, how you store it, what languages you end up using or whether you can run ML at lightning speeds if you can’t deliver all that data to business users efficiently.
This really surfaces with ad-hoc BI & dashboards, which tend to trigger multiple queries with each click & can easily generate ~100 concurrent queries with only 10 or 15 users clicking around in their BI dashboards. So if the new platform fails this test, no amount of cool tech on the data engineering side will help you, and you will definitely be spending a ton of time managing, optimizing & explaining to your business users why the new platform doesn’t work.
So how do you test this?
You just need some sample data & a tool that can simulate high-concurrency queries & record the results.
- The tool is the open-source Apache JMeter. Download the zip from here.
- Data can be anything but should be selected carefully so it simulates enough stress on the platform for a typical user query.
- A sample query where you can pass a variable into the WHERE clause, so that each query searches for unique values and pulls in a different set of data.
For the data & query:
- Make sure to use 2 or more tables (fact + dimension)
- Make sure the fact table is large enough (at least 50M+ rows; the bigger the better)
- The query should join at least 2 tables and do some form of aggregation & calculation (avg, sum, etc.) so there is some real complexity, not just scanning a single table & pulling rows.
Testing this via JMeter
Luckily JMeter already has most of this built in. One thing we need is an additional plugin for concurrency that can ramp up usage and sustain it for some amount of time. (The default thread group lets you see how fast you can run X queries. Instead, we want to see how many queries we can run in X minutes, which is closer to what business users will experience.)
1. Use my GitHub page to download the resource files.
2. Install JMeter. Download here.
3. Install the BZM Concurrency Plugin. Download here. To install, simply unzip the file and copy the contents into the Jmeter_InstallPath\lib\ folder. The jmeter-plugins..0.6.jar goes directly into the \lib\ folder. The files in the \ext\ folder need to be copied into the Jmeter_InstallPath\lib\ext\ folder.
4. Download the latest JDBC drivers for the platforms you are planning to test, including Snowflake & others. Here is the link for Snowflake JDBC. Once downloaded, the JDBC jar files need to be copied into the Jmeter_InstallPath\lib\ folder.
5. For data, use the included test file (Concurrency_Test_Snowflake_Public.jmx). This file uses the sample SNOWFLAKE_SAMPLE_DATA.TPCH_SF10 database in Snowflake (60M & 15M row tables). You can also use any other dataset by simply changing the query in the test plan.
Whatever data tables you decide to use, make sure the data in those tables is optimized properly for each platform, as performance on some platforms can be highly dependent on how much work you put into optimizing tables. Snowflake is very simple & not much is needed, as most table optimization is done automatically. The data simply needs to be ingested/inserted in the natural order the records were created, which in this case means sorted by order creation date during ingestion.
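If you want to replicate that natural ordering on your own copy of the sample tables (the SNOWFLAKE_SAMPLE_DATA database itself is read-only), a minimal sketch could look like the statement below; MY_DB.MY_SCHEMA is a hypothetical target, and you would do the same for the fact table:

-- Hypothetical copy of the sample ORDERS table, loaded in order-date order
CREATE OR REPLACE TABLE MY_DB.MY_SCHEMA.ORDERS AS
SELECT *
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.ORDERS
ORDER BY O_ORDERDATE;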
Concurrency:
The test should simulate real-world BI/reporting usage during a Monday-morning or month-end rush: a very quick ramp to peak usage (0 to 100+ queries within 1 min), then sustaining that peak for X minutes.
In this example, the BZM concurrency load will be ramped up to 100 queries in 1 min, in 6 steps of roughly 16 queries at a time. We will then hold this peak load for 1 min (the corresponding thread group settings are sketched after the list below) and monitor two things:
- % of errors: Did any of the queries fail due to the high workload? (A pretty bad thing to happen if users are getting errors in their BI tool.) This leads to service tickets & unhappy users & BI developers.
- Total # of queries: Total queries executed & results delivered to the JMeter client during this 2 min run. This shows the throughput, responsiveness & resultset delivery capabilities of the platform during peak times, basically how fast it can execute each query & deliver its results to a BI tool. Slow times will mean frustrated BI users.
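In the BZM Concurrency Thread Group, that ramp-and-hold pattern maps to a handful of settings. The values below match the test described above, and the field labels are approximately as they appear in the plugin UI:

Target Concurrency:     100
Ramp Up Time:           1    (Time Unit: minutes)
Ramp-Up Steps Count:    6
Hold Target Rate Time:  1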
Your Test Query:
The query you test should:
- Use a large fact table (50M+ rows) and join it to one or more dimension and/or other fact tables. This is what almost every BI use case does in the real world. Don’t run a simple SELECT … FROM SOME_TABLE, as this is not a valid stress test.
- Don’t run the same query over & over again. Make sure to use the Search_Filters.csv file to add dynamic filters so each query is unique and filters on different values.
- Add a calculated column to your SELECT list that returns a random value, such as RANDOM(). This stops platforms from serving cached query results and skipping execution; each query has to be executed even if the query text is identical.
The query used in this test for Snowflake:
In this case, I am joining the ORDERS (15M), LINEITEM (60M), and PART (2M) tables. The query does an aggregation & has a few calcs (sum, avg & count) to make sure it is not just displaying existing values but doing some extra math. RANDOM() is added as a column to prevent any result caching and make sure each query is actually executed.
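The original post shows this query as a screenshot; a minimal sketch of a query matching that description might look like the one below. The column choices are illustrative, and ${MYFILTER} is the JMeter variable that injects a different WHERE clause into each execution, as explained next:

-- Illustrative BI test query over ORDERS, LINEITEM & PART
SELECT
    P.P_BRAND,
    YEAR(O.O_ORDERDATE)                          AS ORDER_YEAR,
    COUNT(*)                                     AS LINE_COUNT,
    SUM(L.L_EXTENDEDPRICE)                       AS TOTAL_PRICE,
    AVG(L.L_EXTENDEDPRICE * (1 - L.L_DISCOUNT))  AS AVG_NET_PRICE,
    RANDOM()                                     AS CACHE_BUSTER  -- defeats the query result cache
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.ORDERS   O
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM L ON L.L_ORDERKEY = O.O_ORDERKEY
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.PART     P ON P.P_PARTKEY  = L.L_PARTKEY
WHERE ${MYFILTER}
GROUP BY 1, 2;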
The trick is to make sure we filter on different & valid values for each query by injecting a different filter clause after WHERE.
So how do you get a list of valid filter combinations?
To find the combinations that will return actual values, you can easily construct & generate this list with a query against the same tables.
Here is an example of how. This query generates valid combinations that return 30 to 150 rows, to make sure the resultsets are valid & not tiny (just a few rows). It then picks 2000 random combinations so they are not clustered around similar suppliers or dates.
WITH T AS
(
    SELECT
        'L_SUPPKEY = ''' || L_SUPPKEY || ''' AND YEAR( O_ORDERDATE ) =' || YEAR(O_ORDERDATE) AS myfilter,
        COUNT(*) AS RecCount
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.ORDERS O
    LEFT JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.LINEITEM L ON O_ORDERKEY = L_ORDERKEY
    GROUP BY 1
    HAVING RecCount BETWEEN 30 AND 150
)
SELECT myfilter FROM T SAMPLE(2000 ROWS);
You can now download this list as a CSV file and use it in your Jmeter test plan to swap the WHERE clause values.
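Each row of that CSV is one complete filter clause exactly as generated by the query above; the file contents look roughly like this (the specific values are illustrative, not taken from the actual file):

L_SUPPKEY = '4921' AND YEAR( O_ORDERDATE ) =1995
L_SUPPKEY = '18307' AND YEAR( O_ORDERDATE ) =1997
L_SUPPKEY = '77012' AND YEAR( O_ORDERDATE ) =1993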
HOW TO PERFORM THIS TEST WITH YOUR TARGET PLATFORMS?
Modify the Sample Test Template to match your environment.
1. Start JMeter → Jmeter_InstallPath\bin\Jmeter (executable)
2. Open the test file: File → Open → Concurrency_Test_Snowflake_Public.jmx. Download it here
3. Repoint the FILTERS -> Filename parameter to your filters .csv file. If you are using this specific Snowflake sample data, download the ready-made file here; otherwise use your own.
4. If you plan to use your own custom query, edit the BI Query Daily section. Make sure to insert ${MYFILTER} after the WHERE clause so it gets replaced with unique values.
5. Modify the JDBC connection string to point to your target platform and make sure the credentials/tokens you use to authenticate have proper access to the tables needed for this test. This is done in the JDBC Connection Config section of the template.
Snowflake JDBC Config looks like this:
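The original post shows this as a screenshot of JMeter’s JDBC Connection Configuration element; the key fields for Snowflake look roughly like this, where the account, warehouse & credentials are placeholders you replace with your own:

Database URL:       jdbc:snowflake://<your_account>.snowflakecomputing.com/?warehouse=<YOUR_WH>&db=SNOWFLAKE_SAMPLE_DATA&schema=TPCH_SF10
JDBC Driver class:  net.snowflake.client.jdbc.SnowflakeDriver
Username:           <your_user>
Password:           <your_password>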
Examples for other platforms:
6. Save the template file, then clear the existing logs & metrics before the test.
7. Make sure the platform is configured to scale out for high concurrency if it has that capability. For Snowflake, you just have to make sure the Multi-Cluster option is enabled with Auto-Scale up to a max of 10 clusters (see the SQL sketch after this list).
8. Start the test.
9. Switch to the SUMMARY REPORT section to view the results. The numbers should stop changing & the final results should appear in about 2 mins.
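If you prefer SQL to the UI check mark for step 7, a minimal sketch of that multi-cluster auto-scale setting could look like this; MY_TEST_WH is a hypothetical warehouse name:

-- Enable multi-cluster auto-scale on the (hypothetical) test warehouse
ALTER WAREHOUSE MY_TEST_WH SET
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 10
    SCALING_POLICY    = 'STANDARD';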
OUR SNOWFLAKE TEST RESULTS
For Snowflake, I used the smallest cluster, which is XS with a single node/server. It was configured to auto-scale horizontally up to 10 additional clusters to automatically handle spikes in concurrency. The test was from a cold start: the warehouse was in a paused state, and all data was initially accessed from cloud object storage.
With XS, I was able to run 1677 queries in 2 mins with NO ERRORS (0.00%), an avg query time of 5.8 secs & a throughput of ~14 queries per sec.
With SMALL (2 nodes) instead, total queries increased to 2350 (+673), throughput rose to 19 queries per sec (+5), and the avg query time dropped to ~4.0 secs (-1.5 secs).
With XS, Snowflake ended up scaling out to 7 clusters to handle this instant peak, then reduced back to 1 cluster and finally paused the warehouse shortly after the test ended.
If we use the log file to look at the number of queries performed within each 5 sec time slot, we can see Snowflake was able to scale out the compute to handle all 100+ queries within about 30 secs of the ramp-up (0–100 in 60 secs) completing; at the 95 sec mark we are already running 100+ queries.
With the SMALL warehouse (2 nodes), we hit 100 queries almost exactly in line with the ramp-up time and were running ~120-160 queries every 5 secs.
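That per-5-sec-slot breakdown can be reproduced from the JMeter results file. A minimal sketch, assuming you saved the results as a .jtl/CSV file (whose default timeStamp column is epoch milliseconds) and loaded it into a hypothetical Snowflake table called JMETER_RESULTS:

-- Hypothetical table loaded from the .jtl results CSV:
--   TIMESTAMP_MS = JMeter timeStamp (epoch millis), SUCCESS = JMeter success flag
WITH R AS (
    SELECT TIMESTAMP_MS,
           SUCCESS,
           MIN(TIMESTAMP_MS) OVER () AS TEST_START_MS
    FROM JMETER_RESULTS
)
SELECT
    FLOOR((TIMESTAMP_MS - TEST_START_MS) / 5000) AS SLOT_5S,         -- 5-sec slot since test start
    COUNT(*)                                     AS QUERIES_IN_SLOT,
    SUM(IFF(SUCCESS, 0, 1))                      AS ERRORS_IN_SLOT
FROM R
GROUP BY 1
ORDER BY 1;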
This clearly demonstrates that Snowflake can easily meet & exceed these business concurrency requirements with almost no additional effort beyond a single check mark on the warehouse to enable Auto-Scale.
What to look for in the results:
- High Throughput (Total # of queries)
- Lower Avg Query Times (Performance)
- No errors (if you start getting connection or execution errors, that is an indication of bigger problems in the future with more data & more users)
FINAL THOUGHTS
This test does not use big data, just 60M rows, which is fairly typical of most business BI use cases, and simulates a fairly basic morning rush hour (10–15 users clicking around) that your Tableau, PowerBI, ThoughtSpot, or other BI servers may experience. If the platform is struggling with this amount of data and this many users, you can expect a ton more problems in the future as data & user counts go up.
As an analytics practitioner, your #1 priority should always be a great user experience for your business consumers; they are the reason these data platforms are needed in the first place, so they can get their work done.
We all have our biases or comfort levels in terms of company, tech stack, language, commercial vs. open-source, etc. when it comes to all things data, but don’t let those biases dictate your decision when picking a new platform. After all, a platform that works well for data engineers and/or can crunch through billions of rows in your favorite language is practically useless if it can’t deliver what you built to the masses in a performant & secure manner, w/o errors and tons of extra work.
