A Developer’s Perspective: Why Monitoring Metrics Matters
Off to the Races!
Every year on the first weekend in November, The Breeders’ Cup takes place. It is a two-day horse racing event that reliably brings a period of high, environment-breaking traffic to the TVG website. For those unfamiliar with the world of horse racing, TVG is the preeminent online hub for horse race wagering.
As a software developer, my focus is heightened when an event like The Breeders’ Cup comes along, as it brings a significant spike in traffic on the site and apps. Essentially, The Breeders’ Cup is a big deal for the business and product side of TVG…and is historically a tense and busy couple of days for the DevOps folks and on-call engineers.
On the dev side, our preparation for The Breeders’ Cup is sophisticated, wide-ranging, and reliant on Statful, our go-to tool for analysis and vigilance. Statful lets us create concise, clear, and comprehensive reports on the many performance tests we run, and build dashboards that act as real-time warning systems for the Ops and IT personnel.
Let me break down how all this is done
All told, TVG has nearly 45 backend and frontend services in use. Because traffic is not distributed equally among them, the first step in determining which services need heavy load tests is to understand how much real traffic each service received during previous years’ races.
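As a rough illustration of that first step, the sketch below ranks services by their historical peak requests per second and keeps the busiest ones until they cover a chosen share of total traffic. The service names, the numbers, the 90% coverage cutoff, and the function name are all invented for the example. This is not how Statful or TVG actually makes the selection, just one simple way to reason about it.

```python
from typing import Dict, List


def pick_services_for_load_test(peak_rps: Dict[str, float], coverage: float = 0.9) -> List[str]:
    """Return the busiest services that together reach `coverage` of total peak traffic."""
    total = sum(peak_rps.values())
    if total <= 0:
        return []
    selected: List[str] = []
    running = 0.0
    # Walk the services from busiest to quietest.
    for name, rps in sorted(peak_rps.items(), key=lambda kv: kv[1], reverse=True):
        if running >= coverage * total:
            break
        selected.append(name)
        running += rps
    return selected


# Hypothetical peak RPS per service, as read off last year's dashboards.
historical_peak_rps = {
    "wagering-api": 1200.0,
    "race-feed": 800.0,
    "accounts": 150.0,
    "promos": 40.0,
    "help-center": 10.0,
}

shortlist = pick_services_for_load_test(historical_peak_rps, coverage=0.9)
print(shortlist)
```

With these made-up numbers, two services carry more than 90% of the peak traffic, so they would be the first candidates for heavy load testing.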
To do that, we choose a date range from Statful’s date picker. Statful quickly visualizes a service’s requests per second (RPS) and Kubernetes pod CPU usage over the time range we define, which lets us narrow down the services eligible for testing.
Next comes the process of writing realistic user journey tests to be run against the service at increasing rates. We write the tests using the Gatling library in Scala. While a test is running, Gatling’s out-of-the-box reporting is scant and confined to the terminal. Our extensive instrumentation of services via Statful easily replaces this level of reporting, showing us requests per second, response times, CPU, memory usage, database call execution times, and more. All of it is in real time and separated neatly by environment and endpoint.
Below are a few screenshots from Statful showing this in action.
Anyone who runs user journey tests will recognize these uniform steps in RPS, and can visually confirm that things are running as they should.
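To make the idea of uniform steps concrete, here is a small sketch that computes a staircase load schedule: the target rate rises by a fixed amount and is held for a fixed time at each level. It is plain Python rather than Gatling’s Scala DSL, and the function name, starting rate, step size, and hold time are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def stepped_profile(start_rps: int, step_rps: int, steps: int, hold_seconds: int) -> List[Tuple[int, int]]:
    """Return (start_second, target_rps) pairs for a staircase load profile."""
    if steps < 1 or hold_seconds < 1:
        raise ValueError("steps and hold_seconds must be at least 1")
    # Each level starts when the previous one has been held for hold_seconds.
    return [(i * hold_seconds, start_rps + i * step_rps) for i in range(steps)]


# Hypothetical schedule: start at 50 RPS, add 50 RPS every 5 minutes, 4 levels.
profile = stepped_profile(start_rps=50, step_rps=50, steps=4, hold_seconds=300)
for start, rps in profile:
    print(f"t={start:>4}s  target={rps} rps")
```

If the RPS graph for the service under test shows the same staircase shape as the schedule, the test is injecting load as intended.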
Statful’s ability to split data into different streams allows us to separate the signal from the noise.
Once these tests are completed for each crucial service, we can gauge how much traffic it would take to trigger a Kubernetes autoscale or a pager alert to the on-call engineer. We combine our results with the real data from previous years to create widgets and dashboards that are designed to be monitored in real time.
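The arithmetic behind such thresholds can be sketched as follows. Assume a load test showed how many requests per second a single pod can sustain, and that we know the maximum number of pods the service may scale to. The 70% and 90% fractions, the numbers, and every name below are hypothetical. This is not Kubernetes autoscaler configuration or Statful’s alerting API, only a way to reason about where the lines on a dashboard might be drawn.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    autoscale_rps_per_pod: float  # per-pod rate at which we expect a scale-up
    pager_total_rps: float        # total rate at which on-call should be paged


def derive_thresholds(max_rps_per_pod: float, max_pods: int,
                      autoscale_fraction: float = 0.7,
                      pager_fraction: float = 0.9) -> Thresholds:
    """Turn load-test results into dashboard thresholds (illustrative fractions)."""
    if not 0 < autoscale_fraction < pager_fraction <= 1:
        raise ValueError("need 0 < autoscale_fraction < pager_fraction <= 1")
    return Thresholds(
        autoscale_rps_per_pod=max_rps_per_pod * autoscale_fraction,
        pager_total_rps=max_rps_per_pod * max_pods * pager_fraction,
    )


def classify(total_rps: float, pods: int, t: Thresholds) -> str:
    """Label the current traffic level as 'ok', 'autoscale', or 'page'."""
    if total_rps >= t.pager_total_rps:
        return "page"
    if total_rps / pods >= t.autoscale_rps_per_pod:
        return "autoscale"
    return "ok"


# Hypothetical load-test result: one pod sustains 200 RPS, at most 10 pods.
thresholds = derive_thresholds(max_rps_per_pod=200, max_pods=10)
print(classify(600, 5, thresholds))    # 120 RPS per pod
print(classify(800, 5, thresholds))    # 160 RPS per pod
print(classify(1900, 10, thresholds))  # close to total capacity
```

Comparing these thresholds with the peaks seen in previous years shows how much headroom a service has before the big day.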
Thanks to these straightforward steps, The Breeders’ Cup 2018 came and went without a hitch for TVG. Ops folks were able to detect when traffic was creeping up to critical levels, pinpoint where the failure point would come, and even anticipate which endpoint might cause an incident.
This made for no downtime on the day of the Big Race, very happy Slack messages from the business end of things, and very bored on-call engineers.
A great day overall, powered by Statful.
Barak is currently a developer at Mindera Software Craft.