2024 Summer of Learning — Streamlining QA Automation and Diagnostic Reporting

Dominick Dellecave · Published in mainframe-careers · 7 min read · Aug 9, 2024

This past summer, I had the privilege of learning and growing through Broadcom’s Next Generation Mainframer program. The program not only let me explore a plethora of new technical areas and the tools associated with them, but also advanced my broader professional skill set by having me work across large teams within the organization. Over the course of the summer, I dove headfirst into a number of projects and made an immediate impact. Perhaps none was more significant to my team, however, than my work to comprehensively overhaul and streamline the Quality Assurance (QA) regression test suite and its corresponding reporting mechanisms.

The Challenge

When I first arrived at Broadcom, I was introduced to the QA team for the OPS/MVS product, a critical event management and automation tool for mainframe systems. Streamlining their automated test suite and improving its reporting mechanisms, I was told, had long been a high priority. After hearing the problem and recognizing the importance of further modernizing the test suite, I set to work breaking it down and fully grasping what I was up against. My initial diagnosis revealed three main issues.

First, the team’s many Python test scripts ran as “freestyle” jobs within an internal deployment of Jenkins, an automation server that supports many CI/CD initiatives. Jenkins is a well-known, well-respected tool in the industry; however, by using freestyle jobs the team couldn’t take advantage of much of the functionality the server provides. Another problem with these freestyle jobs was that, while they all performed identical runtime setup (environment variable initialization, package installation, etc.), each ran only a single test script. Compounded by the fact that the team had over 100 of these jobs running nightly, this meant a large, unnecessary overhead in both time and processing resources.

The second issue I found was on the diagnostic side of the system. The reporting mechanism was simple: each time a job ran, if the run failed for any reason, the job’s logging output was emailed to the team. Simple as it was, diagnosing an issue quickly became laborious for the QA team, especially when a major problem cropped up and brought down many tests at once. The team relayed that they were unhappy with this process; it was not unusual to be flooded with emails and have to sift through them, painstakingly digging into the output of each failing test to determine what the issue was, how severe it was, and who or what might have caused it.

Finally, I was tasked with automating additional REXX tests that had previously been driven manually. These tests relied on hardcoded values in an Excel spreadsheet that was not tracked in a Source Code Management (SCM) tool like GitHub, making them labor-intensive to run and update.

My Solution

The first issue I tackled was converting the Jenkins freestyle jobs into Jenkins pipelines. Leveraging Infrastructure as Code (IaC) principles, I wrote the pipeline structure in Groovy, which made these pipelines fully customizable, stage-based procedures that are version-controlled. Applying those same IaC principles, I was also able to parameterize many common values that had previously been hard-coded into the freestyle jobs. For example, if the team wants to update the Python version used for testing, rather than changing the value in every single job as before, we can now change one property value through the Jenkins UI, and it is dynamically distributed to the pipelines at runtime:

The place in the Jenkins UI where these dynamically distributed properties are changed; they reside in the folder containing all the jobs that should have access to them.
Using the “Folder Properties Plugin” for Jenkins, you can dynamically distribute values to your jobs at runtime.

These pipelines eliminate the costly, repetitive setup that freestyle jobs suffer from: setup happens once, and that environment is reused for the remainder of the pipeline. In our case, I worked closely with the QA team to group tests logically into pipelines by the category of product functionality they exercise.
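To give a concrete feel for this structure, here is a minimal sketch of what a declarative pipeline of this shape can look like. It is not the team’s actual Jenkinsfile: the stage names, test paths, email address, and the PYTHON_VERSION folder property are hypothetical placeholders, and the withFolderProperties wrapper comes from the Folder Properties Plugin mentioned above.

```groovy
// A minimal sketch of the declarative pipeline structure (not the team's
// actual Jenkinsfile). Stage names, test paths, and the PYTHON_VERSION
// folder property are hypothetical placeholders.
pipeline {
    agent any

    stages {
        stage('Setup') {
            steps {
                // withFolderProperties (Folder Properties Plugin) exposes
                // folder-level properties as environment variables, so the
                // Python version is set once in the Jenkins UI.
                withFolderProperties {
                    sh """
                        python${env.PYTHON_VERSION} -m venv .venv
                        . .venv/bin/activate
                        pip install -r requirements.txt
                    """
                }
            }
        }

        // Setup runs once; each later stage reuses the same environment and
        // covers one functional grouping of tests.
        stage('Event Rule Tests') {
            steps {
                sh '. .venv/bin/activate && python run_tests.py --group event-rules'
            }
        }
        stage('Command Tests') {
            steps {
                sh '. .venv/bin/activate && python run_tests.py --group commands'
            }
        }
    }

    post {
        failure {
            // The original email report is kept as a backup alert channel.
            mail to: 'qa-team@example.com',
                 subject: "Regression pipeline failed: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                 body: "See ${env.BUILD_URL} for the stage view and console output."
        }
    }
}
```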

Jenkins also provides great visual pipeline representations and reporting out of the box. Simply clicking on a pipeline within the Jenkins UI pulls up this “stage view”:

A picture containing the “stage view”, which shows each build of a pipeline segmented into its different stages. Lots of information is surfaced on this screen.
By clicking on a pipeline in the Jenkins UI, you are automatically brought to this stage view.

Each row is a specific run (“build”) of the pipeline, while the columns are the stages within the pipeline. Lots of valuable information is surfaced on this screen, such as whether the run passed or failed (the red box on the left means it failed), the stage names, the average runtime of each stage, the actual stage runtime for a build, and whether a stage passed (green), failed (yellow), or was skipped (gray) due to disablement. By clicking on a particular stage of a run (often one that failed), Jenkins also automatically surfaces a pop-up holding an excerpt of the test’s output, containing whatever error was found:

The excerpt of a stage’s console output holding the error that occurred.
This excerpt of the stage console output can show you exactly where the error occurred in a process.

The ability to disable individual tests was something the QA team strongly pushed for, in case a test needed maintenance or was corrupting files that could affect other tests. Once again, I implemented this so that the team can change a single value in the Jenkins UI to disable a test of our choosing; the same mechanism works for the suite as a whole, in case every pipeline must be disabled at once.
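As a sketch of how such a switch can work in a declarative pipeline, a when condition on a test stage can read a flag set in the Jenkins UI; the RUN_COMMAND_TESTS property name below is a hypothetical placeholder, not the team’s actual flag, and a similar pipeline-level flag can gate the whole suite.

```groovy
// Hypothetical disable switch: flipping RUN_COMMAND_TESTS to "false" in the
// Jenkins UI (e.g., via a folder property) skips the stage, which shows up
// as gray in the stage view instead of failing the build.
stage('Command Tests') {
    when {
        expression { (env.RUN_COMMAND_TESTS ?: 'true').toBoolean() }
    }
    steps {
        sh '. .venv/bin/activate && python run_tests.py --group commands'
    }
}
```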

For even more visual reporting power beyond this out-of-the-box feature, I set up a real-time monitoring dashboard using the Jenkins Build Monitor View plugin:

A monitoring dashboard for Jenkins jobs that provides real-time updates and information at a glance.
This dashboard is a great triage tool that improves our team’s initial diagnostic process.

This screen is a tremendous improvement over sifting through emails, and it surfaces additional information that helps us diagnose key aspects of an issue. Each box is clickable, bringing you directly to the pipeline’s corresponding stage view shown above. On top of the pass/fail/disabled status of each pipeline, valuable insights are front and center: how many builds have failed in a row, known problems that we can tell Jenkins to search for in failing builds, and, if someone changed the code immediately before the failure, the name of the committer who might be responsible:

An excerpt from the monitoring dashboard that shows who committed changes right before the pipeline broke, and how many builds failed since that change.
Providing the name of the person who might be responsible for breaking the test(s) lets us quickly know who to follow up with on the issue.

While I left in the email reports from failing pipelines as a backup, these more modern, visual reporting mechanisms have already revolutionized the way the QA engineers approach diagnosing a failure in our test suite.

The last reform I made to the QA system was to fully automate the manual portion of our test suite. To bring these tests up to speed with the rest of the suite, I wrote Python scripts to automate the Excel-to-CSV conversion the tests required, along with a larger test driver program. I also refactored the old codebase to be more robust, readable, and maintainable, allowing engineers to pass parameters into the program at runtime rather than hardcoding values into the code itself. Lastly, I again worked with my team to break the tests down into logical functional groups and implemented a pipeline to automate the nightly schedule and take advantage of all the features described above. All of the files necessary for this portion of the test suite are now stored alongside our other tests, version-controlled in Git/GitHub. This automation removed the large time burden the manual process had placed on our QA team, freeing them to work on higher-priority matters.
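As an illustration of the approach rather than the team’s actual scripts, the sketch below shows an Excel-to-CSV conversion plus a parameterized entry point; the file names, sheet name, and CLI flags are hypothetical, and it assumes the openpyxl package is available.

```python
"""A minimal sketch of the Excel-to-CSV conversion and parameterized test
driver -- not the team's actual scripts. File names, the sheet name, and the
CLI flags are hypothetical placeholders."""
import argparse
import csv

from openpyxl import load_workbook  # assumes openpyxl is installed


def excel_to_csv(xlsx_path: str, csv_path: str, sheet: str = "TestData") -> None:
    """Flatten one worksheet of test inputs into a CSV the driver can consume."""
    workbook = load_workbook(xlsx_path, read_only=True, data_only=True)
    worksheet = workbook[sheet]
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in worksheet.iter_rows(values_only=True):
            writer.writerow(row)


def main() -> None:
    # Values that used to be hardcoded now arrive as runtime parameters,
    # so the nightly pipeline can supply them per build.
    parser = argparse.ArgumentParser(description="REXX regression test driver (sketch)")
    parser.add_argument("--input-xlsx", required=True, help="spreadsheet of test cases")
    parser.add_argument("--output-csv", default="testdata.csv")
    parser.add_argument("--system", required=True, help="target test system")
    args = parser.parse_args()

    excel_to_csv(args.input_xlsx, args.output_csv)
    # ...the driver would then read the CSV and run each REXX test case
    # against args.system, reporting pass/fail back to the pipeline.


if __name__ == "__main__":
    main()
```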

All in all, I took 133 freestyle jobs, plus all of the manual tests, and slimmed them down into 37 logically grouped pipelines, all while improving upon the reporting mechanisms and maintenance of the test suite.

A Look Back

Making a positive, lasting impact in any way I can is an important driving force in my life, and my time at Broadcom this summer has allowed me to do just that. My project allowed me to make an immediate impact on the productivity of the QA team here and on the reliability of our testing as a whole. Furthermore, I accomplished all this while also making diagnostics and maintenance of the test suite easier, improving the day-to-day experience of our QA engineering team.

While this post highlights my main project from this past summer, I also had the opportunity to engage in a variety of other activities that have enriched my learning and growth. My time at Broadcom has truly provided me with a wealth of new knowledge and experiences across various tools and disciplines. I can’t express enough how grateful I am to my team, others in my office, and to the company as a whole for this opportunity, knowing that I have been primed for my future career.

This summer may have passed in the blink of an eye, but its impact will surely be lasting.
