We’ve embarked on work to move off our legacy monolithic Oracle Commerce platform and are creating a microservice-based architecture hosted in Google Cloud. As part of this we have created a Platform (John Lewis Digital Platform, JLDP) which enables teams to get started quickly by providing the key capabilities all teams need, e.g. a Google Cloud environment, code repository, path to production and telemetry. We call this our “paved road”. We started out on this journey because it was observed that we were not delivering business value quickly enough, and technology was believed to be a key impediment.
One of the key initiatives we undertook as part of building the platform was to measure the progress of teams working on the platform through their approach to Continuous Delivery. Previously we had lacked this information and could not make data-driven decisions about our service/application health. We wanted to avoid repeating this, so we built measurement into the platform from the start.
We started by capturing much of the data in spreadsheets, based on data visible in Google’s Stackdriver logs. We’ve iterated on this and have developed an application we call the “Service Catalogue”, which brings together the various metrics we’ve collected on the health of a Service (one or more microservices that achieve a particular outcome), along with key information on each service/application e.g. Search.
Lead Time to Value
Our initial focus was to show that access to the Platform was no longer a constraint in getting a service live to customer, and we wanted to capture information that proved it. We were quickly able to show that our service creation lead time now took hours, not the months it had taken previously.
We then looked at capturing the lead time of the first service deployment to production, which we called our onboarding lead time. For the platform this evidenced that a team had managed to navigate our paved road and deployed at least a “hello world” to production. This confirmation also reduced the platform’s reactive workload and reinforced a key message to teams when onboarding them: get to production early.
The final addition was to measure lead time to first value. This was our final measure of success and confirmed the team had made their service live to customers.
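As a minimal sketch (not the actual Service Catalogue code), the three lead times above can be derived from a handful of timestamps per service. The `ServiceMilestones` class, its field names and the choice to measure every lead time from the moment the service was requested are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class ServiceMilestones:
    """Hypothetical record of the key dates for one service."""
    requested: datetime                                # team asked for the service
    created: datetime                                  # platform finished creating it
    first_production_deploy: Optional[datetime] = None
    first_live_to_customer: Optional[datetime] = None


def _elapsed(start: datetime, end: Optional[datetime]) -> Optional[timedelta]:
    # A milestone that has not happened yet has no lead time.
    return None if end is None else end - start


def lead_times(m: ServiceMilestones) -> Dict[str, Optional[timedelta]]:
    """Return the three lead times, all measured from the request (an assumption)."""
    return {
        "service_creation": m.created - m.requested,
        "onboarding": _elapsed(m.requested, m.first_production_deploy),
        "first_value": _elapsed(m.requested, m.first_live_to_customer),
    }
```

A service that has reached production but is not yet live to customers would report an onboarding lead time and `None` for first value, which is the kind of gap that prompts a conversation with the team.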
Once a team had deployed and were live to customer, we wanted to continue to understand their continuous delivery on the platform. We set out to capture more information: deployment intervals, deployment lead time and deployment throughput. We have again used systems data to capture this information, looking at git commit history through to container deployment to derive the data. We are looking to observe teams’ approach to agile principles and lean methodology through their continuous delivery practice.
The deployment interval is the time between deployments to our production cluster. We want to see a short and consistent interval, which suggests a team is deploying frequently and continuing to work on the Service.
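To illustrate, the interval can be computed from nothing more than a list of production deployment timestamps. This is a sketch under assumptions: the function names are hypothetical, and using the median for "how short" and the population standard deviation for "how consistent" is one reasonable choice rather than the measure we necessarily use.

```python
from datetime import datetime, timedelta
from statistics import median, pstdev
from typing import Dict, List, Optional


def deployment_intervals(deploy_times: List[datetime]) -> List[timedelta]:
    """Gaps between consecutive production deployments, oldest first."""
    ordered = sorted(deploy_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def interval_summary(deploy_times: List[datetime]) -> Optional[Dict[str, float]]:
    """Median gap and its spread in hours; None with fewer than two deployments."""
    hours = [gap.total_seconds() / 3600 for gap in deployment_intervals(deploy_times)]
    if not hours:
        return None
    return {
        "median_hours": median(hours),   # how frequently the team deploys
        "spread_hours": pstdev(hours),   # how consistent that frequency is
    }
```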
The deployment lead time is how quickly a code change gets through the pipeline to production. Again we look for consistency and a short time from commit to production, which tells us that a team’s pipeline and ways of working let them get change live quickly.
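A minimal sketch of this calculation follows. It assumes each production deployment can be linked back to the commits it shipped for the first time; the `Deployment` class and its fields are hypothetical and stand in for whatever is derived from git history and container deployment data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass(frozen=True)
class Deployment:
    """Hypothetical link between a production deployment and its commits."""
    deployed_at: datetime
    commit_times: Tuple[datetime, ...]   # commits first shipped by this deployment


def deployment_lead_times(deployments: List[Deployment]) -> List[timedelta]:
    """One lead time per commit: commit timestamp to production deployment."""
    return [
        d.deployed_at - committed
        for d in deployments
        for committed in d.commit_times
    ]
```

Measuring per commit rather than per deployment means a large batch of old commits shows up as several long lead times, which makes batching visible.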
Deployment throughput then helps us confirm overall deployment volumes across the platform, helping us to evidence deployment trends within teams and services. We have the data for this but have yet to fully integrate it into our Service Catalogue, though we can present it on an on-demand basis.
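Throughput is essentially a count of deployments per service over a period. The sketch below groups by ISO week, which is an assumption made for illustration; the function name and the `(service, deployed_at)` input shape are hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Tuple


def weekly_throughput(deployments: Iterable[Tuple[str, datetime]]) -> Counter:
    """Count production deployments per (service, ISO year, ISO week)."""
    counts: Counter = Counter()
    for service, deployed_at in deployments:
        iso = deployed_at.isocalendar()      # (ISO year, ISO week, ISO weekday)
        counts[(service, iso[0], iso[1])] += 1
    return counts
```

Summing the counts across services for each week gives the platform-wide trend.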
What did we learn?
These data points were critical for our initial conversations with teams, which we called platform research. We used the data to talk to teams that were taking time to get to their first customer and to understand what the constraints might be, in either the platform or the wider delivery process. We are also trying to build a product, not just a platform, so the capabilities we offer will be informed by the customers (service teams) running on our platform.
We learnt quickly that we’d started to unblock some key constraints.
- Access to the platform has reduced the time to provision infrastructure for teams from months to hours.
- Our overall deployment throughput was increasing substantially, from tens of deployments a year to thousands.
Through further ongoing research we learnt that there were common service constraints:
- Teams were struggling with the security process which was designed for large waterfall projects and hadn’t caught up with a more agile, microservice-based architecture.
- Teams had too much work in progress. We had teams which were requesting multiple services without getting their first service live to customer.
- Teams had external dependencies, usually around integrations outside the platform ingesting data from other external systems.
- Teams were attempting to get to feature parity before going live to customer.
- Teams were still attempting to undertake end-to-end functional or performance testing with external dependencies.
- Change freeze still plays a part in reducing consistency and increasing batch size.
These were all things we could now start initiatives on to improve lead time. For example, we have created a Platform Security team to help teams engage with our current security process, with the goal of changing it and developing a pattern more consistent with our delivery approach.
The research also helped inform the direction for the platform. We increment the version of our platform quarterly (although we deploy daily). In each increment we aim to deliver new capabilities such as improved service creation, monitoring and logging capabilities, and resilience, all focused on building a product that helps the teams running on our platform go faster or improves their stability.
So far we have focused on measuring our continuous delivery indicators to show throughput to production. Operability is the next key focus. Teams are now starting to support their services, and the traditional operational structures which supported many of the operability aspects are now transferring their knowledge on processes to teams.
- We will continue to develop the platform with operability as a focal point. We have already provided teams with indicators of health, e.g. golden signal dashboards based on operability patterns outlined in Google’s Site Reliability Engineering books.
- We want to combine this with alerting focused on Service Levels and error budgets.
- We will look to expand our Service Catalogue to include deployment stability indicators: Deployment Failure Rate (number of failed production deployments and the duration between them) and Recovery Time from Failure (time taken to recover the service after a failure).
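The deployment stability indicators in the last point could be derived along the following lines. This is a sketch only: the `ProdDeployment` class, the `failed` flag and the `recovered_at` timestamp are hypothetical, and it assumes each failure can be attributed to a specific production deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ProdDeployment:
    """Hypothetical production deployment record."""
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None   # when service was restored, if it failed


def failure_rate(deployments: List[ProdDeployment]) -> float:
    """Fraction of production deployments that failed (0.0 when there are none)."""
    if not deployments:
        return 0.0
    return sum(d.failed for d in deployments) / len(deployments)


def time_between_failures(deployments: List[ProdDeployment]) -> List[timedelta]:
    """Duration between consecutive failed deployments."""
    failures = sorted(d.deployed_at for d in deployments if d.failed)
    return [later - earlier for earlier, later in zip(failures, failures[1:])]


def recovery_times(deployments: List[ProdDeployment]) -> List[timedelta]:
    """Time from each failed deployment to recovery, skipping unrecovered ones."""
    return [
        d.recovered_at - d.deployed_at
        for d in deployments
        if d.failed and d.recovered_at is not None
    ]
```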
The aim is to bring our continuous delivery and operability metrics together in one place. The next step will then be to tie this data to business data to truly show how deployment frequency relates to value. Our ability to keep evolving the platform as a product will depend on our ability to evidence and communicate the value it can add.
For more information we recommend reading Measuring Continuous Delivery.