Load Testing in Production — Preparing for the Big Bang

Slava Guzenko · Published in SEEK blog · Apr 1, 2024 · 9 min read


Photo by Alonso Reyes on Unsplash

Imagine this: You’re the captain of a large ship sailing on calm waters. Suddenly, a freak weather system spawns a massive wave headed straight for your vessel. The huge wave rapidly approaches, threatening to capsize your ship if you don’t take immediate action. This is the stuff of nightmares for any seasoned sailor.

Now picture this same scenario, but instead of battling a literal tsunami, you’re at SEEK, bracing for a massive influx of online traffic that could potentially sink your digital platform.

In early 2024, SEEK successfully navigated its own “traffic tsunami” by merging employment marketplaces across the Asia-Pacific region into a single, unified platform now serving over 50 million users.

The platform that has reliably served Australia and New Zealand for years expanded its reach by adding job seekers and employers from six additional markets served by SEEK’s Jobstreet and Jobsdb brands: the Philippines, Hong Kong SAR, Malaysia, Singapore, Thailand and Indonesia.

The technical details behind this initiative are truly remarkable. Planning and execution took nearly three years. Hundreds of SEEKers dedicated countless hours to preparing business systems and workflows, developing new user experiences for web and mobile, ensuring compliance with local legal and privacy requirements, and finally migrating all customer data to the APAC platform.

The company made a Herculean effort to ensure all services and processes were ready for the critical moment — the “big bang cutover” — when user traffic would literally triple overnight.

Why Load Testing?

Ever built a website or mobile app? Scaling the backend for a massive user jump (like 3x!) can be a challenge, especially with millions already on board. Remember some online retailers during the COVID pandemic? Their platforms couldn’t cope with the extra load.

Photo by Magdalena Kula Manchee on Unsplash

Making a popular online platform truly scalable takes time — months, sometimes years of re-platforming work. While SEEK’s platform was built with scaling in mind, and individual teams felt confident, overall system performance remained a question mark. Without proof, optimism could only go so far.

So, how do you know your system won’t buckle under heavy pressure? How can you ensure a traffic surge won’t take down your website, apps and services? That’s where load testing comes in.

Now, for slow, gradual user growth, load testing might not be needed. Bottlenecks would likely appear naturally, giving you the chance to adjust your system for the increasing traffic.

But what if your user base is expected to triple overnight? You need to be confident your systems can handle it. You want a smooth launch when all the stakeholders are eagerly watching. In that case, load testing months beforehand is crucial. But the next question is: what environment do you run these tests on?

Why Load Testing in Production?

Unlike functional tests, load testing is tricky to pull off in a non-production environment. Large online platforms, like the one at SEEK, are complex ecosystems. They’re built over years and involve a mix of modern microservices, legacy services, serverless functions, queues, batches and integrations with external systems like CRM, search engines and CMS. It’s hard to create an environment that perfectly mirrors the real world. In a staging environment, for example, these components might not be sized or connected exactly the way they are in production. And if your test environment isn’t truly representative, the value of the test itself is questionable.

Photo by Infralist.com on Unsplash

So, why not just test each piece of the platform individually? You definitely can (and should!) load test individual services or components. But that doesn’t guarantee smooth operation when hundreds of these parts work together in different ways. A system is more than just the sum of its parts. Each service might function well on its own, but when they’re all connected and interacting, you need to ensure they can handle everything that the real world throws at them. Testing services in isolation simply didn’t give SEEK the level of confidence it needed.

This led to an unconventional but logical solution: load testing directly in production! This might not be suitable for real-time, mission-critical systems, but for a typical online platform, it’s a viable approach. It’s a calculated risk — you might experience some downtime during the test, but in return, you gain valuable insights into how the platform will handle increased traffic. The key is to prepare your systems and processes thoroughly to run tests safely and reliably.

Preparation is Key

Load testing in a production environment is no small task. It requires a flexible system architecture and a crack engineering team. You need to understand what you’re testing and its potential impact, and have a plan to avoid any unintended consequences.

Photo by Scott Graham on Unsplash

SEEK tackled this challenge by establishing a dedicated Scalability Team. This team orchestrated the entire process, from devising the testing strategy and getting management buy-in, to running the tests alongside real user traffic. They also assisted other teams in resolving any problems that arose during the test runs. All this happened while everyone else was focused on getting the platform ready for launch — a truly high-pressure situation!

These were the key steps involved:

Strategy is Essential: The Scalability Team prepared a strategy to get all systems tested, handle test data, and introduce safeguards to prevent test activities from leaking into public view or messing with AI models, analytics, and financial reports.

Test Data Handling: Test data was flagged with simple tags attached to all test entities and requests. Each tag carries a flag identifying the record as test data, along with details about the test’s scope and behaviour. Test behaviours allow control over system actions in specific scenarios. For example, a “createInCrm” test behaviour could add a test user to the CRM system, but only when a specific test run needed it. In scenarios involving anonymous traffic, test tags were added directly into HTTP headers, with only internal services permitted to do so for security reasons.
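
As a rough illustration, a test tag and its header-based propagation might look something like the sketch below. The field names and the x-test-tag header are invented for this example, not SEEK’s actual contract.

```typescript
// Hypothetical shape of a test tag attached to test entities and requests.
interface TestTag {
  isTest: true;           // flag identifying the record as test data
  scope: string;          // which test run or journey the entity belongs to
  behaviours: string[];   // e.g. ["createInCrm"] to opt into specific side effects
}

// Sketch of a service-side guard: only trust a test-tag header when the
// request comes from an internal caller.
function parseTestTag(
  headers: Record<string, string | undefined>,
  isInternalCaller: boolean
): TestTag | undefined {
  const raw = headers["x-test-tag"];
  if (!raw || !isInternalCaller) return undefined;
  return JSON.parse(raw) as TestTag;
}

// Downstream code can then branch on the tag: skip analytics events for test
// records, or create a CRM user only when "createInCrm" is present.
```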

Choosing the Right Tools: The team elected to use k6 Cloud for API-based tests and Flood Element for UI-based tests. API-based tests execute HTTP requests mimicking those sent by web or mobile apps. This allows for fast iterations but has some limitations. UI-based tests involve loading a web page in a headless browser and simulating real user actions by clicking links and buttons. While slower and more fragile, these tests validate server-rendered parts of the website and some user interactions. Roughly 80% of the effort was invested in the faster and more robust API-based tests. To avoid triggering internal DDoS protection mechanisms, load testing agents had their IP addresses added to an allowlist.
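
For a flavour of the API-based side, a minimal k6 script looks roughly like the sketch below. The endpoint and the x-test-tag header are placeholders, not SEEK’s actual APIs, and the snippet sticks to syntax that is valid in both JavaScript and TypeScript (k6 test scripts are plain JavaScript modules).

```typescript
// Minimal k6 API-based test: one virtual-user iteration per call of the default function.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 50,          // virtual users generating traffic in parallel
  duration: "10m",  // how long this test run lasts
};

export default function () {
  // Placeholder endpoint; real tests mimic the requests sent by the web and mobile apps.
  const res = http.get("https://example.com/api/jobs?keywords=developer", {
    headers: { "x-test-tag": JSON.stringify({ isTest: true, scope: "job-search" }) },
  });

  check(res, {
    "status is 200": (r) => r.status === 200,
  });

  sleep(1); // pause between iterations, like a real user would
}
```

UI-based Flood Element tests follow a similar script-driven model, but drive a headless browser through page loads and clicks instead of sending raw HTTP requests.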

Big Red Button: In case a load test caused any issues, there was a mechanism to abort the test run at any point.
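
k6 Cloud lets you stop a running test from its side, and a similar “big red button” can also be wired into the script itself. A sketch, with a purely illustrative kill-switch endpoint:

```typescript
// Sketch of an in-script abort: poll a kill-switch flag and stop the whole
// test run if it has been tripped. The endpoint below is illustrative only.
import http from "k6/http";
import exec from "k6/execution";

export default function () {
  const killSwitch = http.get("https://example.com/internal/load-test-kill-switch");
  if (killSwitch.status === 200 && killSwitch.body === "STOP") {
    exec.test.abort("Big red button pressed - aborting the load test");
  }
  // ...regular test traffic continues here while the switch is off...
}
```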

Monitoring: The team also created dashboards to help monitor and troubleshoot any issues during test runs.

Running the Load Tests: A Pragmatic Approach

Here are some of the guiding principles the Scalability Team used while running the load tests.

Finding the Sweet Spot: Balance is essential. You need to find the right combination of load size, the number of virtual users simulating traffic, and how long the tests will run. Remember, these tests will be run regularly, so make sure they’re repeatable and monitor for any unexpected changes in results.
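
In k6 terms, that combination lives in the script’s options: a load profile (how much traffic, from how many virtual users, for how long) plus thresholds that flag unexpected changes between runs. The numbers below are placeholders rather than SEEK’s actual targets.

```typescript
// Illustrative k6 options: a fixed request rate for a fixed duration, plus
// thresholds that fail the run if results drift from the expected baseline.
export const options = {
  scenarios: {
    job_search_api: {
      executor: "constant-arrival-rate",
      rate: 200,             // requests per timeUnit (the load size)
      timeUnit: "1s",
      duration: "30m",       // how long the test runs
      preAllocatedVUs: 100,  // virtual users available to sustain the rate
      maxVUs: 300,
    },
  },
  thresholds: {
    http_req_duration: ["p(95)<800"], // catch unexpected latency regressions
    http_req_failed: ["rate<0.01"],   // catch unexpected error-rate changes
  },
};
```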

Focus on What Matters: Prioritise your testing efforts. Don’t waste time on rarely used areas of the platform. Focus on the most heavily used features.

Be Realistic: Don’t push your luck. If you suspect a system can’t handle a certain amount of load without breaking, don’t try to force it! There’s no point in causing a real outage just to prove a point.

Schedule It In: Deciding when to run the tests is crucial. After hours, when there’s less natural traffic, might seem ideal. However, running the tests during business hours offers a key advantage: all the engineering teams are available to investigate issues and make fixes immediately. The team opted for this approach to handle potential disruptions faster, but it did require buy-in from leadership and all teams.

Stress Test Your System: Simulate real-world pressure! Plan soak tests where systems handle increased traffic for extended periods, like 12 hours. Separate the test results from any natural traffic spikes the platform might experience. As your systems evolve, you’ll also need to update your test cases and scripts to reflect those changes.
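
In k6, for example, a soak profile can be expressed as ramping stages that hold the elevated load for the whole window; the shape below is a sketch with made-up numbers.

```typescript
// Illustrative soak-test profile: ramp up, hold the elevated load for an
// extended period, then ramp back down.
export const options = {
  stages: [
    { duration: "30m", target: 400 }, // ramp up to the target number of virtual users
    { duration: "12h", target: 400 }, // hold the load for the soak window
    { duration: "15m", target: 0 },   // ramp back down
  ],
};
```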

Start Slow: Don’t overload your system all at once. The Scalability Team started by gradually increasing traffic — 1%, then 5%, and so on — fixing any problems found along the way. The goal was 100% of anticipated unified traffic, but they didn’t stop there. They pushed it to 150% to account for future user growth. For months, they continuously tested critical user journeys until the real traffic surge arrived.
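
Expressed as a back-of-the-envelope calculation, each ramp step is just a percentage of the anticipated unified traffic for a given journey; the figures below are made up for illustration.

```typescript
// Illustrative ramp plan across successive test runs, from 1% up to 150% of
// the anticipated post-merge traffic for one user journey.
const anticipatedRps = 1000; // placeholder: anticipated requests/sec after the merge
const rampSteps = [0.01, 0.05, 0.25, 0.5, 1.0, 1.5];
const targetRates = rampSteps.map((p) => Math.round(anticipatedRps * p));
// => [10, 50, 250, 500, 1000, 1500] requests/sec
```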

Keeping an Eye on It

It’s important to track all the tests you run and the issues they uncover. This creates a continuous feedback loop, ensuring any changes introduced to the platform don’t degrade its overall performance.

Photo by Luke Chesser on Unsplash

During load testing, the custodian teams closely monitored all their services and asynchronous processes. This allowed them to promptly catch issues like increased response times, overflowing queues, or failing batches.

Overall, SEEK’s load testing uncovered over 50 issues, including:

  • Timeouts or unexpected increases in P95 response times
  • Internal server errors due to timeouts
  • Incorrectly set limits (e.g., database write throughput or worker concurrency)
  • Overlaps in long-running batch operations
  • Suboptimal rate limiting

While all issues were rectified, some required weeks of work. For those, specific branches of the load tests were paused, giving the responsible teams time to properly address performance and scalability issues in their services. Imagine how much harder it would have been to identify and fix these issues if they had only surfaced when real traffic hit.

The Outcome

SEEK’s bold decision to perform load testing in production paid off! The APAC platform was fully equipped to handle a massive traffic increase, thanks to the incredible effort from all teams involved.

While the load testing was successful, it wasn’t without its challenges. One hurdle involved test data handling. Although the code changes to add test tags were themselves relatively small, the sheer number of services involved made this a substantial effort. These small changes needed to be applied across hundreds of services, impacting API contracts, database schemas and more, which required close coordination across multiple teams.

Ironically, during the testing phase, k6 Cloud, the chosen tool for API-based tests, itself struggled to handle SEEK’s high-load demands. Despite these challenges, the goals were achieved and the APAC platform was hardened for the merge.

Special recognition goes to the Scalability Team for their meticulous planning and execution. They achieved this impressive feat without causing any major downtime, proving the success of their strategy. All SEEK systems were in top shape for the cutover, resulting in zero performance-related incidents after the switch. Bravo!

Photo by krakenimages on Unsplash

Thinking about load testing in production? While it can cause some disruption, don’t let that discourage you. It’s far better to identify and fix potential issues on your own terms than to face them head-on during a real traffic surge.
