How Fixing my Typo Improved Cribl Search Query Performance by 20x
One great perk about working for a company that builds software? You get to use all of it for free!
In the software development world, the term “dogfooding” (or, as we like to call it at Cribl, “goatfooding”) describes the practice of using your own software before it’s released to the public. It’s a way for a company to gain insight into actual usage patterns, identify bugs and usability issues, and even capture ideas for new features.
The Technical Marketing Team at Cribl spends a lot of time using our suite of products. We’re responsible for making sure the Sandbox and Demo environments are up and running. We build and update all of the courses available to our internal and external customers. We help our customers find new ways to use the software.
The Cribl Sandbox is a truly unique service: self-paced training with pre-configured scenarios, available on demand. Sandbox courses are typically ready for end users in 2–3 minutes.
Unfortunately, sometimes things break. Hardware fails. The Internet has a bad day. Because we run a complex and heavily-utilized platform, all of the “digital exhaust” in the form of logs needs to go somewhere when things do break. We leverage Cribl Edge to capture all of the logs from our Kubernetes cluster. The log data is written directly into an S3 bucket for long-term archival purposes. Cribl Search is our analytics and dashboarding platform. I wrote a blog recently detailing how we leverage Cribl Search to gain better visibility into our AWS environment.
We also like to show our leadership team how heavily the Sandbox and Demo environments are used. We’re averaging approximately 750–1000 Sandbox and Demo starts per week. With this data, we look at what kind of content people are interacting with so we can identify trends and develop new Sandbox course content.
I recently received a comment from one of our leaders who likes to look at one of the dashboards we put together.
This seems to load slower and slower every time I open the dashboard. Can you look at the Search query?
Happy to help! I had also wondered the same thing the prior week…
In order to get data into Cribl Search, a Dataset Provider and a Dataset must be configured. Since we’re writing our Kubernetes logs to Amazon S3, it was trivial to set up the Provider for our account and then configure the Dataset.
The logs are organized neatly into a prefix structure inside the S3 bucket. The environment variable represents the Kubernetes cluster the logs originated from. The logs are further subdivided by the year, month, then day. Finally, the last prefix is the namespace the logs originated from. More metadata is included inside the log events themselves which can be used for further filtering.
For example, this is what logs from the “sandbox” namespace would look like stored inside the S3 bucket:
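As a rough illustration (the environment name and object names here are hypothetical, not our actual bucket contents), a key following the structure described above could be built like this:

```python
# Hypothetical key matching the described prefix structure:
# <environment>/<year>/<month>/<day>/<namespace>/<object>
env, year, month, day, ns = "prod", "2024", "05", "17", "sandbox"
key = f"{env}/{year}/{month}/{day}/{ns}/events.ndjson.gz"
print(key)  # prod/2024/05/17/sandbox/events.ndjson.gz
```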
And this is how a single log message would be displayed inside Cribl Search:
Note the “kube_namespace” field inside the message. The fields starting with “kube_” are automatically added by Cribl Edge.
By adding the “kube_namespace” field to the search query filters, only logs from this namespace are selected. Or so I thought…
Running this query to find Sandbox course starts took approximately 360 seconds. Not ideal, and I can understand the comment about sub-optimal performance.
And note the approximately 17.5 GB of data that took over 18,000 CPU seconds to search. Ouch.
But I thought we were organizing data by namespace?? We are, just not by that field name…
If you look at the Dataset path configuration, we configured the folder name token as “namespace”. Since that token isn’t included in the query, Cribl Search has to scan every namespace to find the matching values of “kube_namespace”, because that field only exists inside each event. Oops. Now I see the issue.
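To make the difference concrete, here’s a minimal Python sketch of the idea. The keys and the `objects_to_scan` helper are made up for illustration (this is not Cribl’s internals): a filter on the path token can prune objects before they’re ever opened, while a filter on an in-event field forces every object to be downloaded and parsed first.

```python
# Hypothetical keys laid out as <env>/<year>/<month>/<day>/<namespace>/<object>
keys = [
    "prod/2024/05/17/sandbox/a.ndjson",
    "prod/2024/05/17/demo/b.ndjson",
    "prod/2024/05/17/logstream/c.ndjson",
]

def objects_to_scan(keys, namespace=None):
    """With a path-token filter, only matching prefixes are opened.
    Without it, every object must be read so the in-event
    kube_namespace field can be checked."""
    if namespace is None:
        return keys  # no pruning: full scan of every object
    return [k for k in keys if k.split("/")[4] == namespace]

print(len(objects_to_scan(keys)))             # 3 objects scanned
print(len(objects_to_scan(keys, "sandbox")))  # 1 object scanned
```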
Changing the query to filter on “namespace” instead and running it again results in significant performance gains. The same search with the correct field name runs in 20 seconds. That’s a 94% reduction in search time, or 18x faster!
The Search job metrics also give some insight into the performance. For this query, Cribl Search only needed to scan ~200 MB of data and used 821 CPU seconds.
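The headline numbers are easy to verify with a quick back-of-the-envelope check, using only the figures quoted above:

```python
# Back-of-the-envelope check of the reported speedup.
before_s, after_s = 360, 20          # wall-clock search time, seconds
speedup = before_s / after_s         # 18.0 -> "18x faster"
reduction = 1 - after_s / before_s   # ~0.944 -> "94% reduction"
print(f"{speedup:.0f}x faster, {reduction:.0%} less time")
```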
So remember: filters are important when searching! It’s all about scanning only the set of data you actually need. By using S3 partitioning, you can organize data for fast search results from Cribl Search.
I’d encourage you to go watch one of my colleague’s videos on Cribl S3 Better Practices:
And for more information, the Cribl Docs are a great resource for understanding how to correctly partition data in S3:
Need a more cost-effective way to search your data without vendor lock-in? Go try out Cribl Search by signing up for a free Cribl.Cloud account today!