Automation: A Failure Story

Carina Gerry
Salesloft Engineering
May 21, 2020

I’ve often had new engineers ask me why we aren’t doing automated browser-based regression testing. They start asking questions — why don’t we have this? Could I try to write it? I’ve done it before! …Oh young child, sit down, and let me tell you a story.

Setting the Stage — Fall 2015.

What did SalesLoft Engineering look like when I joined? The entire product and engineering team was around 16 people. The development team was made up of 12 full stack Ruby & Angular engineers. I was hired as QA, and we had one designer, one product person, and the director of engineering.

Our application was made up of 3 services. Gandalf was the authentication service, crm-repo was the interface to Salesforce requests, and melody contained everything else — all the front and backend code and business logic.

Part of the role description I was hired for was to create an automated regression test suite within 90 days. I started reading up on automation and went to the Software Testing in Atlanta Conference that year (2015). I saw two different presentations about automation there, and one stuck in my head — Automating the Monolith. That team had used Ruby and Watir, a Ruby gem that drives the browser through Selenium. I came back and read up on both Watir and Capybara, and based on the general feedback on the internet, chose Capybara.

Initial Plans

Our initial plans were to start small and see where we could get. Ben, one of the software architects, wrote a prototype to grab the login, store it, and then skip login for the rest of the tests; the stored login was generated once per person running the suite. He also made a couple of example tests. He created the initial database cleaning strategy — each test would start from a fresh database, we’d use FactoryGirl to create the data needed for the test, run the test, then wipe the database. Each test would be able to run independently and in parallel (if we needed that in the future). This strategy mirrors a unit testing strategy, and is a great place to start to keep data manageable.
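For readers who haven’t built a suite like this, the pattern looks roughly like the sketch below — the gem names are real, but the configuration, the factory, and the sign_in_as helper are illustrative, not our actual code.

# spec/support/regression_helper.rb — minimal sketch of the per-test
# clean-database pattern (DatabaseCleaner + FactoryGirl + Capybara)
require 'capybara/rspec'
require 'database_cleaner'
require 'factory_girl_rails'   # this gem has since been renamed factory_bot_rails

RSpec.configure do |config|
  config.include FactoryGirl::Syntax::Methods

  config.before(:suite) do
    DatabaseCleaner.clean_with(:truncation)    # start from a known-empty database
  end

  config.around(:each) do |example|
    DatabaseCleaner.strategy = :truncation     # browser tests hit a separate app server,
    DatabaseCleaner.cleaning { example.run }   # so transactional cleanup won't work
  end
end

# spec/features/dashboard_spec.rb — each test creates only the data it needs
RSpec.feature 'Dashboard' do
  let(:user) { create(:user) }     # FactoryGirl factory defined elsewhere

  scenario 'shows the signed-in user' do
    sign_in_as(user)               # hypothetical helper that reuses the stored login
    visit '/dashboard'
    expect(page).to have_content(user.email)
  end
end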

Making Progress

Carmen, my fellow QA engineer, and I wrote some tests. We both had technical engineering degrees, although not in Computer Science. We had also both done some amount of programming prior to SalesLoft, so we weren’t totally lost. We wrote tests; we made progress. We started out practicing — writing tests in areas that seemed easy to write tests against. After a month or two, we realized we needed a better plan for what a valuable test would be, and came up with a Top 50 tests list. We attempted to start tackling it. We wrote about 300 tests, though some of them were easy iterations on each other — and we still didn’t finish the Top 50 list. We also started to understand that it would be very hard to check some of the end-to-end scenarios. How could we check if an email actually sent? Could we monitor a standalone inbox? How could we test that a phone call was made via our app? Did we want to run a test suite that would cost us money in dialer minutes each time it ran? And we didn’t have a great understanding of what the unit and integration tests already covered.

The regression test suite only ran locally, and a full run took about 45 minutes. The tests were fragile too — we mainly used CSS selectors to identify buttons and fields. We were fortunate to be able to add custom CSS classes to the pages specifically for test identification, so we made a lot of ‘qa-special-widget’ classes. It was a great decision, but the tests were still fragile. Our company and dev team moved really fast, and new fields, flows, and pages appeared constantly. Every once in a while an engineer doing a refactor would delete our CSS classes without knowing it.
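The idea was to give the tests their own hooks in the markup so that copy or styling changes wouldn’t break our selectors. Roughly, it looked like this (class names here are made up, not our real ones):

# The page markup carried a test-only class alongside its real ones, e.g.
#   <button class="btn btn-primary qa-save-cadence">Save</button>
# and the Capybara tests targeted the qa- class instead of styling classes or copy:
find('.qa-save-cadence').click
expect(page).to have_css('.qa-cadence-row', count: 3)

# A data attribute serves the same purpose and is harder to delete by
# accident during a style refactor:
#   <button data-qa="save-cadence">Save</button>
find('[data-qa="save-cadence"]').click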

We did have some real success in early 2016 when the engineering team took on a complete UI overhaul of the application. We were moving from a purple/gradient theme to a more subdued navy and gray theme. As some pages got redesigned, and we ran the tests locally during the dev process, we found plenty of errors that could be fixed early on.

One year in. Fall 2016.

My 90-day goal had long since passed, and, accepting that my skills alone were not going to get us the results we needed, we hired Bubba, a senior automation engineer. By this time, the QA team had grown from 2 to 5 testers. The dev team was over 20 now and included specialized UI engineers working only in JavaScript. Bubba came in and tackled our main issue — how did we get these tests running anywhere besides our local environments?

This is Dumbledore. Bubba made Dumbledoor. Three levels of puns.

He created one of the best-named microservices at SalesLoft to this day — Dumbledoor. It mocked the login so that we could remove the manual login-and-save-the-token steps entirely. Dumbledoor was the fake Gandalf.

He began work on LARS — the (L) Automated Regression Test Suite. This was a service running on AWS EC2 that we interacted with via Slack. The first iteration ran nightly against master. This was good, but when we had major bugs, they were caught by our customers during the day first. So we enhanced it to run continuously against master: it ran, took a 20-minute break, then spun up again. It posted the results (with screenshots) to a Slack channel that the QA team monitored. We enhanced the process so that developers or QA could submit a melody PR to it via Slack, and it would prioritize that run next instead of running master. This was a little better — we had a chance of catching something prior to it going to production.

The big question — did it find bugs? Overall I think LARS caught about 3 or 4 bugs in the time we had it running, which was close to a year. It could only find bugs for the tests that we wrote, and we definitely didn’t have full coverage of the application.

We started implementing a Page Object model using SitePrism, refactoring some of the tests we already had. It was a beautiful thing — so clean — but it was another layer of abstraction that had to be understood by the teams. This would also come back to bite us: when you’ve modeled page objects against the front end and then change out that front end, you’re left with a mess. We had page objects shared across multiple pages, and then the engineering team started moving some of those pages — but not all — from Angular to React. The web got very tangled.
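For anyone who hasn’t used SitePrism, the pattern looks roughly like this — the page, URL, and selectors below are invented for illustration:

# A page object gathers the URL and selectors for one page in one place...
class CadencePage < SitePrism::Page
  set_url '/cadences/{id}'

  element  :name_field,  '.qa-cadence-name'
  element  :save_button, '.qa-save-cadence'
  sections :steps, '.qa-cadence-step' do
    element :subject, '.qa-step-subject'
  end
end

# ...so the test reads in business terms instead of raw selectors:
cadence_page = CadencePage.new
cadence_page.load(id: cadence.id)          # `cadence` record created via FactoryGirl
cadence_page.name_field.set('Q3 Outbound')
cadence_page.save_button.click
expect(cadence_page.steps.first.subject.text).to include('Intro')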

We still had flaky tests, and the dev team had grown to about 40. We tried to fix the flaky tests, but when we couldn’t spend the time on them, we’d flag them so they wouldn’t run. We still hadn’t added all that many more tests, even though we were adding functionality to our product at record speed.
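The flagging was essentially a skip list. In RSpec terms it can be as simple as a metadata tag plus an exclusion filter (the :flaky tag name is illustrative):

# Tag a known-flaky scenario so it can be excluded without deleting it
RSpec.feature 'Dialer', :flaky do
  scenario 'logs a completed call' do
    # ...
  end
end

# spec_helper.rb — both LARS and local runs would skip anything tagged :flaky
RSpec.configure do |config|
  config.filter_run_excluding :flaky
end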

One last chance at a redirect. Fall 2017.

Another year later and we still hadn’t found “success”. What did success even look like? Why did it take two years to ask that question? Ideally, we were catching bugs before they hit production. It happened every once in a while… but in total we had found 8 bugs thanks to the test suite. We were also still struggling to add more tests; our QA team wasn’t finding time to write automated tests, and we had vastly different technical skills on the team. It was hard to prioritize writing automated tests when our manual testing efforts were so valuable and successful. We had a very uncomfortable “come to Jesus” leadership meeting where I asked for help. The end result: we narrowed the focus of the tests we were trying to write to Cadences & Email in order to show some success.

At this point I had fully moved into the manager role, stepping out completely from any individual contributor work. Carmen had moved on to the development team as a full stack engineer. Along with Fora, another QA Engineer, Bubba picked up the test case creation work — now we had our automation engineer writing tests, but without the same business context that a manual QA would have. They worked together and we added about 200 more tests around cadences.

We started having a really difficult time keeping up with the test maintenance of our existing incomplete test suite, let alone having any time to write new tests. Some of the dev teams attempted to take on some of the work themselves. We had a weekly metric — for our bugs that we fixed, did we cover them with some type of automated test (unit, integration, regression) prior to pushing out the code again to production? It helped increase the visibility and importance of the tests. But at the same time, the tests were still hard to write and maintain, and the entire test suite took over an hour to run.

We started moving from Angular to React. The React components made our selectors — especially CSS selectors — very difficult to rely on. More tests became defunct.

Nail in the coffin.

We had started moving to microservices, and our test suite still only ran with one of our original services, melody, as the base. In August 2018 there was an engineering initiative to split all the front-end code out from the backend code. Rhapsody, the front-end microservice, was born. LARS died.

We spent some time trying to fix it, but it was like putting your finger in a hole in a dam to hold back a flood. It was a slow and painful death for those of us watching. There were a lot of people not watching, which was also part of the problem. The last run of LARS was January 23, 2019.

It was a very anticlimactic ending for a very unremarkable project. Bubba moved on to a new QA tooling project and the QA team continued providing high quality manual testing.

Retrospective: How I/We Failed

Let me count the ways. In no particular order.

Team Buy-In

I had viewed this as a one-off thing that we could hopefully put into place without disturbing the rest of the team. One of my guidelines early on was “this team has never had QA before — tread lightly and try not to slow anyone down.” I should have stopped treading so lightly much earlier on.

I would have pushed for whole-team ownership of the tests a lot sooner. A closer relationship between QA and dev would also have ensured the right type of test was written at the right level, without duplicating a test’s purpose. We probably wrote too many Selenium tests that were duplicates of unit tests; we should have known exactly which scenarios warranted a browser-level regression test. If we could have had the developers writing the tests, we would likely have had more success, since their technical skills more closely matched the need.

Test Strategy

We never solved how to fit this into CI. We never solved the parallelization problem, so we were never able to get the suite down to a “reasonable” time for blocking PRs in CI. Even three years ago SalesLoft was releasing code multiple times per day, and deployment velocity has only increased since. We couldn’t put a suite that took over an hour into CI.

“We built LARS using docker-compose to import all the necessary components for the test suite. This worked well, but the static nature of a docker-compose file was limiting. We could have created a worker model with LARS and run the suite asynchronously with a dynamic number of workers to achieve the time duration desired to be acceptable for CI.” — Bubba
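We never built that worker model, but a rough sketch of the idea is just sharding the spec files across N processes and giving each its own database. The TEST_ENV_NUMBER convention below comes from the parallel_tests gem; everything else is hypothetical:

#!/usr/bin/env ruby
# Fan-out runner sketch: shard the feature specs across N workers and run
# each shard as its own rspec process.
workers = Integer(ENV.fetch('WORKERS', 4))
specs   = Dir['spec/features/**/*_spec.rb'].sort

shards = Array.new(workers) { [] }
specs.each_with_index { |spec, i| shards[i % workers] << spec }

pids = shards.each_with_index.map do |shard, n|
  next if shard.empty?
  fork do
    # parallel_tests uses TEST_ENV_NUMBER to point each worker at its own
    # database (first worker '', then '2', '3', ...)
    exec({ 'TEST_ENV_NUMBER' => n.zero? ? '' : (n + 1).to_s },
         'bundle', 'exec', 'rspec', *shard)
  end
end.compact

statuses = pids.map { |pid| Process.wait2(pid).last }
exit(statuses.all?(&:success?) ? 0 : 1)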

Showing Value, Reporting, & Measuring Success

This project started out as an experiment, but stayed in experiment mode far too long. Though we had tests written, they may not have been the best tests to write. We should have been able to run an end-to-end test much earlier in the process, and hopefully start catching errors. Our true measure of success was hard to capture — did we catch the bug before a customer found it? We didn’t have the reporting in place to tell us.

LARS running outside of CI also posed an issue with getting developer buy-in to these results. If it didn’t block them, it wasn’t urgent to prioritize.

“Once you have a set of test results, what do you do with those outside of CI? ‘Now you have two places you want me to look?’ It’s not really an ideal approach. One also has to solve the issue of what the results look like. Had we solved the parallel/asynchronous nature of the tests, we would have also had to solve how to stitch the independent test results back together in a readable format. Looking back on it now, having a log ingestion tool (like an ELK stack) combined with the concept of a ‘Run ID’ would have been fantastic. Devs would have been able to search for a TestID and then check its history across previous RunIDs to see exactly when something started failing and at what git SHA.” — Bubba

Instead we were posting results to Slack in a human readable format that was never meant to get persisted anywhere. You couldn’t analyze the output.
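The missing piece was structured, persisted results. Even emitting one JSON line per test, keyed by a run ID and git SHA, would have made the history searchable in a log tool. A sketch of what that hook might look like (the field names and output file are hypothetical):

# spec/support/result_logger.rb — append one JSON line per example so a log
# tool (e.g. an ELK stack) can answer "when did this test start failing?"
require 'json'
require 'time'
require 'securerandom'

RUN_ID  = ENV.fetch('RUN_ID') { SecureRandom.uuid }
GIT_SHA = ENV.fetch('GIT_SHA', 'unknown')

RSpec.configure do |config|
  config.after(:each) do |example|
    File.open('regression_results.jsonl', 'a') do |f|
      f.puts({
        run_id:    RUN_ID,
        git_sha:   GIT_SHA,
        test_id:   example.id,   # e.g. ./spec/features/cadence_spec.rb[1:2]
        status:    example.exception ? 'failed' : 'passed',
        duration:  Time.now - example.execution_result.started_at,
        timestamp: Time.now.utc.iso8601
      }.to_json)
    end
  end
end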

Language

Our initial choice of Ruby seemed like a good one. Everyone knew Ruby, and we modeled the tests after our unit tests — cleaning the database in between, using FactoryGirl to generate data. However, as our team grew, we added UI engineers whose expertise was only in JavaScript. They were the ones who would break the tests with UI changes, but they weren’t equipped to fix the tests in Ruby. Could I have foreseen this? I bet someone would have, but I hadn’t asked enough people for advice.

Skills

We hired people on to the QA team with a desire to learn how to write automated tests, but without technical backgrounds. We stayed in experiment mode too long, so I didn’t prioritize increasing team members’ skills — I prioritized the manual testing work, where we were finding bugs and providing a lot of immediate and measurable value. Since manual testing tends to be an activity that will fill up the time you give it, it was very difficult to find any extra time to learn and develop automation skills.

Selenium

No automated browser-based testing failure story would be complete without mentioning Selenium. Browser test flakiness is the Achilles’ heel of every pre-2018 test suite: regardless of whether you are running a real browser or a headless/in-memory implementation, there are always flaky transitions in a regression suite.

“When you start building a suite, you pick a webdriver based on the language you know. Maybe it’s to integrate with another tool you want to use, or maybe you’re preventing language sprawl by choosing the same language as your tech stack. Even though you are writing in the language you want, you are dependent on Selenium’s ability to control and assess the browser’s state. This is the disconnect. Selenium does a fantastic job of this, but it’s not always correct. Sometimes the browser page will have completed loading, but there are still animations happening on the screen. Some element isn’t completely visible while a side panel transitions in from the side. A piece of information you are trying to scrape and validate isn’t ready because the modal hasn’t fully populated. It could be anything. And all of these things have to be solved, but in using Selenium, most approaches feel half baked and often result in clever forms of sleep(1).” — Bubba
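In Capybara terms, those “clever forms of sleep(1)” usually meant reaching for an explicit pause when the better fix was leaning on the library’s built-in retrying matchers. A small illustration (selectors made up):

# Flaky: the side panel animates in, so the click sometimes fires too early
sleep 1
find('.qa-side-panel .qa-save').click

# Better: Capybara matchers and finders retry until the default wait time
# elapses, so assert on the state you actually need before acting on it
expect(page).to have_css('.qa-side-panel', visible: true)
find('.qa-side-panel .qa-save').click

# For slow modals, a longer per-call wait still beats a global sleep
find('.qa-modal .qa-subject', wait: 10).assert_text('Re: Intro')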

Current State of Testing at SalesLoft

We have an incredibly talented team of 17 manual exploratory QA Engineers and every fix, change, and feature goes through a QA step on its way out the door. We catch bugs daily prior to them hitting production. We’ve also got amazing Ruby and React unit test coverage and a CI/CD pipeline. For major site-wide changes, we do a “swarm” — a version of a focused regression test over 1–2 days. We don’t have an automated regression testing tool yet, but we are also able to use that time to find bugs instead of struggling with a tool or with Selenium.

What Could Success Look Like in the Future?

It’s pretty certain that we’re not going to try to reimplement another home-grown Selenium solution.

It’s also pretty certain that we need some type of solution or delivery will slow to a crawl as we try to manually test everything. As soon as we expand our browser support beyond Chrome, we’re in for a world of hurt.

We’ve looked at Cypress.io. They claim to have solved the flakiness by skipping Selenium and running JavaScript directly in the browser, which should eliminate most (if not all) of the intermittent wait issues on page transitions. That separation from Selenium is what the browser test community needed. However, we’d need huge team buy-in and commitment in order to use Cypress for regression testing. If we went this way, we could also solve some of the other problems: dynamic workers to parallelize tests and hit CI-friendly run times, and a third-party logging and log-parsing solution for persistence and a query front end. We’d lean heavily on our UI engineers to write tests.

We’re also looking at other software providers — ones that offer stability and maintainability, and that help us figure out what our coverage should be. We want to be able to write tests easily, with little code knowledge required, and we want a tool that takes a lot of the workload off our team while still providing value without the maintenance headache.

What comes next? I hope to have a sequel soon. Automation: A Success Story.
