Almost two years ago, shortly after I joined The Sun, we had a bug that was missed by the Engineering team and got deployed to production. The image below is a true representation of how I felt when I saw that this bug was on our article pages…
The bug caused a major CSS styling issue on The Sun’s web article pages and was missed on lower test environments. To make matters worse, this was deployed around December, which is probably one of the busiest months for The Sun! To resolve the issue, we had to switch back to the old platform, which meant that new features were not visible to users. Not long after this, another visual bug got deployed to production, which caused the site’s layout to break every time a takeover advert was displayed. Again, this was missed on lower test environments because the test takeover advert was working as expected. Long story short, during this time the team had to focus on getting bug fixes out instead of working on new features, which caused major disruption to everyone’s workflow.
According to IBM’s Systems Sciences Institute, fixing a bug in production costs four to five times as much as fixing it when it is found at the requirements or design stage, as detailed here. On top of worrying about what our end users would see, the cost also includes service desk escalation, retesting and redeploying the change.
When these visual bugs on The Sun website got deployed to production, we had people coming in from different areas of the business (Editorial team, Commercial Team, Digital Operations) all asking the same question — why was this bug missed and when will the bugs get resolved?
Fortunately, we did recover from these issues. However, it took the team two weeks to get the bug fixes out, which delayed the release of new features to the business. I reviewed our quality and release process at the time to see where our issues lay and found the following:
1. The team’s existing automated tests lacked coverage and only exercised minimal areas of the website.
2. When new features got deployed, no automated tests were written, since the QA team’s time was spent manually testing these features on different browsers and mobile devices.
3. A manual regression pack was also non-existent, so all the testing performed was ad hoc and unstructured.
4. Apart from the test build, deployment builds were also taking a long time to run, so our Continuous Integration pipeline was not optimised to get bug fixes deployed as soon as possible. For some context, getting a ticket merged in and deployed to production would take 4–5 hours (on a good day!). By the time we were ready to deploy, we had missed our release window and would have to reschedule for the next day. Staging and Production deployments were taking almost an hour each 😢
5. There was no automated visual testing in place.
So, how did we tackle these problems?
One of my responsibilities as a QA Lead was to put a test strategy in place that would help the team deploy new features quickly while ensuring that existing functionality still worked as expected. I decided to refactor the team’s existing test automation framework to make it more reusable and easier to extend, using automation libraries such as WebdriverIO with CucumberJS and following guidelines such as the page object model. I worked with our Head of Engineering to hire more SDETs in order to speed up automating the regression pack. To track our work and see which features we needed to automate, I created a separate JIRA board for The Sun QA team and ran daily stand-ups. This allowed me to see, on a regular basis, which tickets were being worked on and which were blocked. Week after week, the team made good progress on automating the regression pack, which helped us tremendously in catching issues that we could have missed manually.
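For readers who haven’t met the page object model before, here is a minimal sketch of the idea in TypeScript. It is not our actual framework code: the `Driver` interface, the `ArticlePage` class, the selectors and the URL path are all hypothetical, and a small in-memory fake stands in for a real browser so that the example runs on its own. In a real suite, a WebdriverIO browser session would sit behind the page object instead.

```typescript
// Hypothetical stand-in for a browser driver; a real suite would wrap
// a WebdriverIO session here instead.
interface Driver {
  open(path: string): void;
  getText(selector: string): string;
  isDisplayed(selector: string): boolean;
}

// Page object: the only place that knows the article page's selectors.
// Tests call these methods and never touch selectors directly.
class ArticlePage {
  private readonly driver: Driver;
  private readonly headline = "h1.article__headline"; // hypothetical selector
  private readonly takeoverAd = "#takeover-ad"; // hypothetical selector

  constructor(driver: Driver) {
    this.driver = driver;
  }

  open(slug: string): void {
    this.driver.open(`/news/${slug}`); // hypothetical URL scheme
  }

  headlineText(): string {
    return this.driver.getText(this.headline).trim();
  }

  hasTakeoverAd(): boolean {
    return this.driver.isDisplayed(this.takeoverAd);
  }
}

// In-memory fake so the sketch runs without a browser.
class FakeDriver implements Driver {
  visited: string[] = [];
  private readonly dom: Record<string, string>;

  constructor(dom: Record<string, string>) {
    this.dom = dom;
  }

  open(path: string): void {
    this.visited.push(path);
  }

  getText(selector: string): string {
    return this.dom[selector] ?? "";
  }

  isDisplayed(selector: string): boolean {
    return selector in this.dom;
  }
}

const fakeDriver = new FakeDriver({ "h1.article__headline": "  Example headline " });
const articlePage = new ArticlePage(fakeDriver);
articlePage.open("example-article");
console.log(articlePage.headlineText());
console.log(articlePage.hasTakeoverAd());
```

The benefit is that when a selector changes, only the page object needs updating, and every step definition that uses it keeps working.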
In order to run our tests in parallel, we also created our own Selenium Grid, using Docker to pull down the different node images, and hosted it on AWS. To save costs, the grid spins down after 8 pm and spins back up at 8 am, Monday to Friday. Our end goal was always to speed up the tests so we could report issues back to the team faster. We also integrated Allure to show a report of our test runs. I’m happy to say that a run of the regression pack now takes 5–6 minutes on average.
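The cost-saving schedule boils down to a simple rule, sketched below in TypeScript. This is an illustration only, not the scheduler we run on AWS: the function name `gridShouldBeUp` is hypothetical, and it assumes that the grid stays down all weekend and that times are in the machine’s local time zone.

```typescript
// Returns true when the grid should be running at the given local time.
// Assumption: up on weekdays from 08:00 until 20:00, down otherwise.
function gridShouldBeUp(now: Date): boolean {
  const day = now.getDay(); // 0 = Sunday ... 6 = Saturday
  const hour = now.getHours(); // 0-23, local time
  const isWeekday = day >= 1 && day <= 5;
  return isWeekday && hour >= 8 && hour < 20;
}

// 8 January 2024 was a Monday; 6 January 2024 was a Saturday.
const mondayMorning = new Date(2024, 0, 8, 9, 0);
const mondayNight = new Date(2024, 0, 8, 20, 30);
const saturdayNoon = new Date(2024, 0, 6, 12, 0);

console.log(gridShouldBeUp(mondayMorning)); // true
console.log(gridShouldBeUp(mondayNight)); // false
console.log(gridShouldBeUp(saturdayNoon)); // false
```

A scheduled job could evaluate a rule like this and start or stop the grid’s containers accordingly.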
With problem number 1 solved, we still noticed that bugs were being caught only after the feature branch was merged to master. Ideally, most (if not all) bugs should be caught while testing on the pull request (PR) branches. To solve this, we worked with our team to run the same automated tests every time a developer creates a PR. Only if the build passes are we allowed to merge the pull request. This strategy has proven effective, as it catches most of the bugs that we would have missed before.
To solve problem number 2, we decided to pair with developers to write end-to-end tests whenever we release a new feature. This ensures that the feature is covered by tests before we deploy it to production. It has also given us more time for manual exploratory testing to find bugs that were missed at the requirements stage.
Moving on, we optimised the way we did manual regression testing, which solved problem number 3. We still have a couple of scenarios that we need to check manually. To speed up this process, we make sure that all the manual regression scenarios are documented well enough that anyone can follow them without trouble. We set up a random rota for who from the QA team performs the manual regression testing, so the rest can continue with their work. Since most of our scenarios are now automated, our manual regression takes only about 10 minutes on average to execute.
Apart from improving the automated functional tests, I also decided to introduce visual testing as part of our test strategy to stop unwanted visual bugs from getting deployed to production, which solved problem number 5. We initially settled on BackstopJS, which worked well, but we wanted our tests to run on real browsers instead, so we ended up using an in-house tool called AyeSpy. These tools helped us speed up our visual regression tests, as they allowed us to check our application on different viewports.
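At its core, visual regression testing compares a fresh screenshot against an approved baseline and fails when too many pixels differ. The TypeScript sketch below illustrates that idea only. It does not show how BackstopJS or AyeSpy work internally, and the `Screenshot` type, the `mismatchRatio` function, the `solid` helper and the 1% threshold are all hypothetical.

```typescript
// One RGBA screenshot: 4 bytes per pixel, row by row.
interface Screenshot {
  width: number;
  height: number;
  pixels: Uint8Array;
}

// Fraction of pixels (0 to 1) where any channel differs by more than `tolerance`.
function mismatchRatio(baseline: Screenshot, candidate: Screenshot, tolerance = 0): number {
  if (baseline.width !== candidate.width || baseline.height !== candidate.height) {
    return 1; // different dimensions: treat as a full mismatch
  }
  const totalPixels = baseline.width * baseline.height;
  if (totalPixels === 0) return 0;

  let changed = 0;
  for (let p = 0; p < totalPixels; p++) {
    const offset = p * 4;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(baseline.pixels[offset + c] - candidate.pixels[offset + c]) > tolerance) {
        changed++;
        break; // count each pixel at most once
      }
    }
  }
  return changed / totalPixels;
}

// Helper that builds a single-colour image for the demo.
function solid(width: number, height: number, rgba: [number, number, number, number]): Screenshot {
  const pixels = new Uint8Array(width * height * 4);
  for (let i = 0; i < pixels.length; i += 4) pixels.set(rgba, i);
  return { width, height, pixels };
}

const baselineShot = solid(2, 2, [255, 255, 255, 255]); // all white
const candidateShot = solid(2, 2, [255, 255, 255, 255]);
candidateShot.pixels.set([255, 0, 0, 255], 0); // first pixel turns red

const ratio = mismatchRatio(baselineShot, candidateShot);
const threshold = 0.01; // hypothetical: fail when more than 1% of pixels change
console.log(ratio); // 0.25
console.log(ratio > threshold ? "visual regression" : "ok"); // visual regression
```

Running the same comparison once per viewport, each with its own baseline, is what lets a layout break at one screen size show up before release.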
Lastly, the team made optimisations to speed up our build pipelines, solving problem number 4. Nowadays, a deployment to the Staging environment takes 2–3 minutes and a deployment to the Production environment takes 15–20 minutes. This improvement, paired with the introduction of automated tests, helped the team deploy tickets faster. We went from releasing once or twice a week to releasing multiple times a day 🎉.
We are still continuously trying to improve our QA process at The Sun. Most recently, we automated our booking release process to help teams release to production even faster. Looking back at where we were before, I can honestly say that the changes we put in place helped us get to where we are now. In turn, this has allowed us to deliver critical features on time.
What’s the future for The Sun?
The long-term goal is to keep building cross-functional teams in which developers have a testing mindset as well. That is why we are always on the lookout for new testing tools that will make it easy for developers to write tests (watch this space for upcoming posts!) and release their code to production easily. The main thing to remember is to make sure that the QA process we have put in place is standardised and maintained, regardless of what technology or testing tools we decide on in the future.