Continuous Process Improvements

Kev Jackson · THG Tech Blog · Aug 8, 2019

In the THG Warehouse Management Systems team, we have grown our software engineering processes as our team size and warehousing responsibilities have increased: from a small, tight-knit team of engineers and operations staff based in a single location to a larger team distributed across three time zones and continents.

Measuring processes & organisational performance

If you have studied software engineering, you may have seen or heard of the Capability Maturity Model (CMM) from the SEI. An alternative (less scholarly but perhaps more immediately useful to most) model is Joel’s 12 steps. Both are attempts to define metrics for measuring an organisation’s capability to complete a software project successfully.

The aim of most organisations — become ‘optimizing’

The CMM defines the goal state for an organisation: ‘optimizing’. To get there, a team should always be looking for ways to improve its processes.

Identifying what to improve, and more importantly how to improve it, is more of a challenge; however, Joel’s 12 tests are specific and much easier to get a grip on:

Joel Spolsky’s 12 tests

So first of all, how does the WMS team stack up against Joel’s 12 tests?

  1. Yes, we use GitHub
  2. Yes, we have build scripts coordinated via Jenkins
  3. Yes, we build and release daily
  4. Yes, we track issues in Jira
  5. Sometimes; we don’t always address bugs before writing new code, as it depends on deadlines and the priority of the new feature versus the severity of the bugs found
  6. Yes, we keep our project milestones and schedules updated
  7. Yes, we have dedicated process engineers who help the software engineers define the scope of work and the specifications of the features before we start developing
  8. No, we have a shared office space which can get noisy. Software engineers wear headphones (some use fancy noise-cancelling headphones)
  9. No, we have good tools (IDEs, hardware etc), but not the best
  10. Yes, we have a dedicated QA team
  11. Yes, short coding exercises form part of the interview process for new candidates
  12. No, we don’t do hallway usability testing

Overall, not too bad from a software development practices point of view. I would say that most development teams I’ve worked with have had between 8 and 10 of the 12 steps sorted.

However, Joel’s tests date from 2000 and predate cloud infrastructure, DevOps, SRE and other more recent developments in software engineering best practice.

Developer surveys

Each year Stack Overflow (which Joel co-founded) publishes the results of its Developer Survey. This collates the experiences of a wide variety of developers across different domains and helps identify trends in the software development industry.

In a similar fashion, Nicole Forsgren, Jez Humble and Gene Kim (of DevOps Research & Assessment, DORA) publish the “Accelerate State of DevOps Report”.

Moving on from blog posts (like Joel’s original 12 steps) to a more research-oriented, data-driven assessment of the industry-wide state of software engineering processes shows how much the software engineering community, and its surrounding processes, have matured.

Obviously these kinds of surveys and analyses have been conducted in academia since the distinction between software and hardware was first made; the difference now is that the wider (industrial) community is more aware of the data and more willing to engage with what the results of these surveys mean.

Software Delivery Performance

The annual Accelerate State of DevOps Report defines a different measurement: ‘Software Delivery Performance’. This is broken down into five metrics that can be used to categorise your organisation’s performance:

  1. Lead time for changes
  2. Deployment frequency
  3. Change failure rate
  4. Time to restore service
  5. Availability

The report defines a scale for each of these metrics, and organisations fall into a Low, Medium, High or Elite band on each one.

When we consider our team against these metrics, we find that we would currently be considered a High performer with some characteristics of an Elite performer on certain scales.

Continuously Improving & Learning

After evaluating our current status, the next obvious question is “What can we do to improve?”. We considered the areas where we hadn’t met the ‘Elite’ standard on the Software Delivery Performance scale and how we could improve in those areas. We decided to address the following as an initial programme of process improvement:

  1. Code reviews
  2. Game days
  3. On-call & support
  4. Reducing the impact of production failures and incidents
  5. Improving the overall code quality

Code reviews

We have (what I would consider) a fairly standard branch-based development workflow. We use git and GitHub to host our private repositories. We use GitHub pull requests to enable some form of code review prior to merging our branches back to master.

This workflow is common and well-understood; however, how to conduct code reviews is an active area of research at large software organisations. Code review comments can vary in quality substantially and having a review process that emphasises just the changes to the code in a limited context can be detrimental to the overall structure of your system.

There are various ways to improve on pure GitHub pull requests. The first thing we did as a team was to look at bugs that had made it through code review and classify them in some fashion. We codified these particularly pernicious bugs first as a checklist for all reviewers and then as a GitHub pull request template.

The GitHub pull request template we settled on, partially completed.
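A checklist like this can also be verified mechanically as part of the build. The sketch below is illustrative only: the checklist items, script name and file handling are hypothetical rather than our actual template or tooling, and it assumes the CI job has exported the pull request description to a text file.

```python
# check_pr_checklist.py - illustrative sketch only; the items below are
# hypothetical examples, not the actual WMS review checklist.
import sys

# Hypothetical reviewer checklist items derived from past escaped bugs.
REQUIRED_ITEMS = [
    "Tests added or updated for the change",
    "Database migrations are backwards compatible",
    "Logging/monitoring updated for new failure modes",
]

def unchecked_items(pr_body: str, required: list[str]) -> list[str]:
    """Return required checklist items that are not ticked ('[x]') in the PR body."""
    ticked = [line for line in pr_body.splitlines() if "[x]" in line.lower()]
    return [item for item in required if not any(item in line for line in ticked)]

if __name__ == "__main__":
    # Assumes the PR description has been written to a file by the CI job.
    body = open(sys.argv[1], encoding="utf-8").read()
    missing = unchecked_items(body, REQUIRED_ITEMS)
    if missing:
        print("Unchecked review items:")
        for item in missing:
            print(f" - {item}")
        sys.exit(1)
    print("All review checklist items ticked.")
```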

We also decided that some code reviews required more than simply comments on a Pull Request. These reviews are necessary when there are larger architectural concerns in play and for tying together work from multiple teams. These allow the developers to review not just the code, but also the technical design and to discuss the decisions and context in a higher bandwidth fashion.

This obviously would have to occur via video conferencing for the remote team.

A typical ‘code review’ session in the WMS team — see how happy the team are (they’re just shy)

The issue with a traditional code review is that it encourages ‘sunk cost fallacy’ thinking: the review occurs once the engineer has already committed code and invested significant thought and time in one particular solution.

Rejecting a change at this stage due to problems in the technical design forces an engineer either to accept that they need to completely redo the work (unlikely) or to try to justify their current solution; and (in my experience) all engineers are egotistical and love to argue and debate minutiae.

To mitigate this issue of only getting feedback on technical design and code style late in the process, we created a Slack channel, #coding_practices, to discuss code and propose alternatives.

Example of engineer conversations about different methods of defining tests

This channel allows for ad hoc discussions around snippets of code found in the system or areas an engineer is currently working on. These discussions are informal and behave more as in-flight validation of techniques and choices instead of the more formal post-commit review that a Pull Request entails.

Our coding practices channel is one mitigation that has helped engineers discuss their code outside of a code review. Another method the team uses is to tag a branch in GitHub as WIP (work in progress), which acknowledges that the code is far from complete but is still open for discussion and feedback.

The jury is still out on the success of both of these mitigations — of course a ‘non-agile’ approach could be to turn the clock back a few decades and devote time up-front in the process for some technical design ahead of an implementation phase.

When discussed amongst the team, this “big design up-front” was seen as the least palatable option, as it just doesn’t match the rest of the agile practices the team is experienced with. However, the current process of reviewing code in isolation from the design/architecture is, in my opinion, a missed opportunity and a deficiency of the Pull Request model.

Game Days

Game days are typically used to investigate a team’s response to a ‘disaster’ and how to recover business operations. We have previously conducted a game day around security issues. The plan is to have another game day with a different focus this summer.

On-call Handover

https://spectrum.ieee.org/consumer-electronics/gadgets/the-consumer-electronics-hall-of-fame-motorola-advisor-pager

To support the WMS, the team uses an on-call rota where the developers take turns acting as second-line support for any software-related issues raised by the warehouse operations team.

We use the excellent VictorOps to manage the on-call rota and to let engineers trade on-call days, and its ability to route alerts from our monitoring systems to the current on-call engineer (it genuinely does make “on-call suck less”) makes it one of the key pieces of DevOps tooling the team has. We also use VictorOps to derive a status page for our external stakeholders.

The WMS metrics, alerting and paging architecture — VictorOps handles paging the on-call engineers
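As an illustration of the paging step in the diagram above, the snippet below sketches how a monitoring check could push an alert to VictorOps via its generic REST integration. The endpoint URL, API key, routing key and entity name are placeholders, and the payload fields should be checked against the VictorOps documentation rather than taken from this sketch.

```python
# Sketch only: push a CRITICAL alert to VictorOps via its REST integration.
# VICTOROPS_URL is a placeholder; the real endpoint, API key and routing key
# come from your own VictorOps integration settings.
import requests

VICTOROPS_URL = "https://alert.victorops.com/integrations/generic/<version>/alert/<api-key>/<routing-key>"

def page_on_call(entity_id: str, message: str) -> None:
    """Send an alert that VictorOps will route to the current on-call engineer."""
    payload = {
        "message_type": "CRITICAL",      # e.g. WARNING, INFO or RECOVERY for other states
        "entity_id": entity_id,          # stable ID so a later RECOVERY resolves the same alert
        "entity_display_name": entity_id,
        "state_message": message,
    }
    response = requests.post(VICTOROPS_URL, json=payload, timeout=5)
    response.raise_for_status()

if __name__ == "__main__":
    # Hypothetical example alert for illustration only.
    page_on_call("wms.pick-service.queue-depth", "Pick queue depth above threshold")
```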

After a year or so of simply passing along the support phone (the modern-day equivalent of the on-call pager), we realised there were frequent cases of an engineer being called to deal with an issue that the previous on-call engineer understood well, having just dealt with it. This led to grumpy engineers whose on-call hours were effectively extended when the new on-call engineer called them for advice; not great.

To combat this (and to try to ensure that the on-call period stays as defined, rather than being extended by a day or two each time), the team instituted a short “on-call handover” meeting. This meeting has a simple format:

  1. Discuss and document any ‘incidents’ or out-of-hours calls from the previous week
  2. Discuss and decide if current alerting / monitoring is sufficient or needs adjusting to be more or less sensitive
  3. Discuss any notable changes to be deployed in the upcoming week so that the new on-call engineers won’t be surprised by a sudden change in system behaviour

This short handover meeting has had a positive impact on the ability of on-call engineers to be up to speed and informed about the current state of the system when handling a call, and it has improved our mean time to recovery (alongside enforcing better documentation).

Post-incident Reviews

Even the best-performing software teams have service outages and bad days. The DevOps and SRE movements are a reaction to the practice of splitting the responsibility for developing a service from the responsibility for operating it.

In the THG warehouses we develop, maintain and operate a software system that, like most complex systems, has defects. When a previously unknown defect causes an issue that impacts operations, where activities in the warehouse slow or stop entirely, the engineers respond immediately to recover service availability. The resolution steps are usually one of (or a combination of):

  • Rolling back the service to a previous known-good state
  • Resolving the issue and deploying a hotfix
  • Restarting a failed VM or service

Upon recovering the system, the engineers organise a Post-Incident Review to determine:

  • What led to the outage or disruption?
  • How could the incident have been resolved more quickly?
  • What changes are needed to software, documentation, service SLAs, monitoring, or operational processes to prevent a repeat of this incident?

The review is conducted as a meeting between the engineers involved, the incident manager (usually the first engineer to respond takes this role) and other stakeholders from warehouse operations.

The output from a review should always be complete documentation of the incident, including the following (see the sketch after this list):

  • A timeline of how the incident started, was reported to engineers and was handled by the team
  • Root cause analysis
  • Links to relevant reported bugs (Jira issues in our case)
  • Actions to take to prevent a similar incident occurring in the future
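As a concrete, purely illustrative example, the documentation produced by a review could be captured in a simple structured record like the one below; the field names and example values are a sketch, not the team’s actual post-incident template.

```python
# Illustrative sketch of a post-incident review record; field names and the
# example Jira key format are hypothetical, not the WMS team's actual template.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TimelineEntry:
    timestamp: datetime
    description: str            # e.g. "ops reported stalled picking", "hotfix deployed"

@dataclass
class PostIncidentReview:
    incident_id: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    root_cause: str = ""
    related_issues: list[str] = field(default_factory=list)     # Jira keys, e.g. "WMS-1234"
    follow_up_actions: list[str] = field(default_factory=list)  # preventative actions agreed
```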

The ultimate goal of a post-incident review is for the team to learn from the failure and to improve the overall quality of the system (as measured by uptime or service availability).

Bug Bash (or Bug Squash) Days

On long-term software projects, technical debt builds up over time. Its causes are many, and it is associated with “software rot”, where dark corners of a large project that aren’t getting active attention and maintenance become difficult to change (dependencies and libraries not kept up to date in a timely fashion are a frequent cause).

There are also minor bugs and defects that are never blockers or critical to fix in the current sprint and are thus de-prioritised in favour of the key deliverables for the sprint and other higher-priority defect fixes.

To address both technical debt and long-standing minor bugs, the WMS team decided to organise a “Bug Bash”. There are two competing definitions:

Bash the product until the bugs fall out vs Bash the actual bugs

As our current goal is to increase software and system quality, and we already have a list of minor defects and technical debt in our Jira backlogs, the team decided to focus on bashing the bugs instead of finding more to add to the list.
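For illustration, a list like this can be pulled programmatically from Jira with its REST API (here via the jira Python client); the server URL, credentials, project key and JQL filter below are placeholders, not our actual backlog query.

```python
# Sketch: pull long-standing, low-priority bugs as candidates for a bug bash.
# Server URL, credentials, project key and the JQL filter are placeholders.
from jira import JIRA

jira = JIRA(server="https://jira.example.com", basic_auth=("bot-user", "api-token"))

jql = (
    "project = WMS AND type = Bug "
    "AND priority in (Low, Minor) AND status != Done "
    "ORDER BY created ASC"
)

# Print the oldest open minor bugs so the team can divide them up.
for issue in jira.search_issues(jql, maxResults=50):
    print(issue.key, issue.fields.summary)
```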

Initial organisation included:

  • Defining our terms — finding vs fixing bugs
  • Deciding on gamification — rewards for bugs closed etc.
  • Deciding how our geographically spread sub-teams will work together — do we make each location compete against each other or do we use this as an opportunity to strengthen remote-working via pairing engineers across locations?
  • Deciding how to measure success of the activity

Setting Baselines

The key to starting a programme of software engineering process improvements is to establish a baseline against which any improvements can be measured. We started by looking at the key metrics from the Accelerate State of DevOps Report:

  1. Lead time for changes
  2. Deployment frequency
  3. Change failure rate
  4. Time to restore service
  5. Availability

We then spent time analysing our internal team data to establish either a raw baseline or what could be considered a proxy value: not the actual value itself, but one that varies in relation to the underlying (hard-to-measure) value.

For 2, 3, 4 and 5, THG software teams keep data about all software released to internal systems, so these were relatively easy to capture. However, the lead time took a little more effort to calculate.
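As an example of what such calculations might look like, the sketch below derives deployment frequency, change failure rate, time to restore and a commit-to-deploy lead-time proxy from lists of deployment and incident records. The record shapes are hypothetical, standing in for whatever internal release data a team actually keeps.

```python
# Sketch of baseline metric calculations over hypothetical internal release data.
# Each deploy record holds when it shipped, the earliest commit it contained and
# whether it caused a production failure; each incident holds start/recovery times.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Deploy:
    deployed_at: datetime
    earliest_commit_at: datetime
    caused_failure: bool = False

@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime

def deployment_frequency(deploys: list[Deploy], days: int) -> float:
    """Average deploys per day over the observation window."""
    return len(deploys) / days

def change_failure_rate(deploys: list[Deploy]) -> float:
    """Fraction of deploys that led to a production failure."""
    return sum(d.caused_failure for d in deploys) / len(deploys)

def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average time from an incident starting to service being restored."""
    return timedelta(seconds=mean((i.restored_at - i.started_at).total_seconds() for i in incidents))

def lead_time_proxy(deploys: list[Deploy]) -> timedelta:
    """Commit-to-deploy time, used here as a proxy for lead time for changes."""
    return timedelta(seconds=mean((d.deployed_at - d.earliest_commit_at).total_seconds() for d in deploys))
```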

After we have completed this first set of improvement tasks, the plan is to track these five metrics and report back to our stakeholders on any measurable improvements over time.

We may need to adjust our programme of process improvements over time based on the feedback we get from measuring and tracking these metrics; we may need to discard one (or more) of the metrics, or add further metrics once we have completed a cycle. Right now we’re in the initial stages of using this data to guide further improvements to our processes.

We’re recruiting

Find out about the exciting opportunities at THG here: thg.com/careers
