Path to Production - Test Environments as a Product

It could be easier. We could have a team for each application we run, and those teams could treat their production and non-production application environments the same. They would have fully automated, full-stack continuous deployment pipelines, and the software would be so well designed that there is no dependency between application owners. Teams would deliver code to production by the hour and use advanced deployment techniques such as blue-green deployment, coupled with highly effective monitoring, alerting and automated rollback, to identify and resolve issues in production within minutes.

We do have “agile” application teams working within a cloud environment, and they have growing autonomy as they build independent features and functions. That’s awesome. The problem is that we have not been working that way since we started building technology. We have a sprawling on-premises environment (we own two data centres) where non-production environments account for 60%+ of the overall infrastructure. These are made up of dependent applications running together in an environment.

We have followed a traditional programme structure over the last 10 years, with multi-year investments and waterfall-based delivery. These programmes have sought to reduce risk through separation rather than integration, which has resulted in multiple non-production environments. Our applications (hundreds of them) have been built tightly coupled and need multiple other applications to work (commerce systems through to legacy mainframe systems, all meshed together), increasing the size of the basic/core environment. The programmes have not invested in automation, leaving a challenging, sprawling estate to manage.

Path to Production (Variety of routes followed by teams ensuring lack of consistency in versions)

This has created a number of problems. The estate is costly to maintain. There is hidden cost in each application change or infrastructure upgrade. Our people are linked to projects and programmes, so when those finish, the applications in non-production fall into disrepair, and they are costly and time-consuming to get working again when the next team comes along. A continued desire for separation results in projects hunting down that unused application or environment that nobody else is working in. The sprawling nature and lack of ownership leave a small band of “Environment Management” people working to bridge the communication and knowledge gaps across the estate. These people work for Projects and Programmes and, through “best endeavours”, get Delivery, Operations and Infrastructure teams to stitch the environment back together each time and retain the illusion that these environments are good to test in. We effectively have 10 mini-lives beating away with no real Service Management to deal with them.

That long introduction was needed just to give me time to breathe and reflect. The organisation is changing for the better, which is great. We are reorganising and recognising the mistakes of the past. We are now building most new systems in a cloud environment, are focused on micro-service design principles and have teams that own and run the service (well, almost). These are small individual successes which show that the organisation is changing. Our online teams have gone from 10 deployments a year to over 1000 in just over 18 months. It should be a factor of 10 greater by next year. However, there is a long way to go until we are fully working in this way, and maybe we will never get there. This is where Environment as a Product comes in. How do we change our existing non-production environment landscape for the better?

We have started with two key initiatives. Initiative 1 is to get support to introduce an “enabling constraint” by removing the environments that least support delivery. They are the most difficult to maintain, are off the beaten path to production and are the most costly to re-enable. If the project or programme doesn’t have them, it can’t use them. We know they are costly to build, which is part of the reason they have been maintained. With the remaining environments we are going to create a paved road of sorts. One of the key things is not to construct the path to production based on your current organisational or delivery construct. We have been very waterfall in nature and have therefore not made changes to production often.

This results in us having multiple hops to production for a single change, based on the type of testing to be undertaken at each stage: Development (unit, system), Integration, Release (end to end), Performance (performance, operational), Pre-Production (incident and support) and Production. You introduce delay just through deploying to each of these environments and fixing the issues that come up as a result. That’s before you think of data, types of testing and handoff implications. So initially, let’s get down to one of each type to support all teams, projects and programmes. Let’s draw it up so it looks like some are optional: Development, Integration, Pre-Production, Production (even this is too many).
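To make the difference between the two paths concrete, here is a minimal sketch that models each path as an ordered list of stages and counts the hops a change makes before it reaches Production. The stage names and testing types come from the paragraph above; the `Stage` class, the function names and the per-hop delay figure are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One environment on the path to production (illustrative model only)."""
    name: str
    testing: tuple = ()  # types of testing carried out in this environment


# The hops as listed in the text. An empty tuple means the text names no
# specific testing type for that stage.
CURRENT_PATH = [
    Stage("Development", ("unit", "system")),
    Stage("Integration"),
    Stage("Release", ("end to end",)),
    Stage("Performance", ("performance", "operational")),
    Stage("Pre-Production", ("incident", "support")),
    Stage("Production"),
]

# The revised path from the text. Which stages are optional is not stated,
# so that is not modelled here.
SIMPLIFIED_PATH = [
    Stage("Development"),
    Stage("Integration"),
    Stage("Pre-Production"),
    Stage("Production"),
]


def hops_before_production(path):
    """Count the environments a change is deployed to before Production."""
    return sum(1 for stage in path if stage.name != "Production")


def estimated_delay_days(path, days_per_hop):
    """Rough delay estimate; days_per_hop is an assumed, caller-supplied figure."""
    return hops_before_production(path) * days_per_hop


if __name__ == "__main__":
    for label, path in (("current", CURRENT_PATH), ("simplified", SIMPLIFIED_PATH)):
        print(label, hops_before_production(path), "hops before Production")
```

A flat per-hop delay is a deliberate simplification. In practice each environment has its own deployment and defect-fixing time, and those would need to be measured rather than assumed.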

This was difficult to get going. We had to convince people that we might be able to go faster by having fewer environments, and that although there might be more contention initially, that contention might force us to look at the organisational shape and how we design software. Whether the full extent of the idea has sunk in is yet to be seen. However, the cost avoidance case, with maintenance running into millions of pounds a year, was hard to ignore and gave us the initial green light.

Revised simplified Path to Production

Initiative 2 is to consolidate the team we have and build a paved road (common environments, tools and approach). We have taken everyone responsible for non-production “Environment-Release Management” and centralised them into a single Product team. This wasn’t easy. You have to convince the leadership team that they want to pay for this as a separate service, that the cost is neutral, value adding even, and that in doing so you can look at how the team works and find efficiencies, as well as improve the path to production. What this doesn’t do is centralise the technical people responsible for (not owning) those environments. They remain fragmented across Operations, Infrastructure and Delivery teams, because the number of different technologies, teams and ways of working makes centralising them impractical at a small team level.

It does allow you to put a very clear Service Management approach in place for non-production and, with leadership support, to be more robust in enforcing a better approach. You can start to work with individual teams to help them approach their delivery differently, focusing on automation, repeatable process and decoupling of change. The aim is that teams do this themselves and that the Environment-Release team can identify teams that need support. We are also backed by organisational principles which have been established to give teams direction in how they should deliver.

Environment-Release Product Team

Centralising the team has also allowed us to look at how we approach managing the environments. A single Product team with longer-term goals, where the value is what we do to the environments rather than the end customer outcome that the Project or Programme was focused on, has a very different priority and outlook. What were several ways of working can now be focused into a single way of working. We have started by making sense of what people were doing daily. This has been done through several value stream mapping workshops, and also by getting everyone’s work up on a board to get a feel for reactive versus planned work. Having done this, we can start to build a very clear backlog of priorities for our own ways of working: how we track issues, whether we manage problems, how teams onboard onto our path to production, whether we can pre-empt issues, what we do with data, what skills we have and, most importantly, how we communicate clearly.

Value is now a clear driver, as is ownership. We can now mandate that changes to production are part of non-production thinking. We can also start to look at cost of delay and the overall costs of supporting teams, and drive our own value outcomes. Lead time is also a key measure, whether that is how quickly teams move through our environments, how long it takes us to respond to an incident, or downtime within environments. We can also start to measure teams’ frequency of deployment or release and show the organisational pace of change. Traditional service management is very much focused on controls, but here we want to make it much more about value outcomes and enabling flow within teams.
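As a rough illustration of the two measures mentioned above, the sketch below computes a median lead time through the environments and a deployment frequency from timestamps. The function names and the shape of the input data are assumptions made for this example, not a description of tooling we actually run.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(entered, deployed):
    """Elapsed time from a change entering the first environment to production."""
    return deployed - entered


def median_lead_time(changes):
    """Median lead time over an iterable of (entered, deployed) datetime pairs."""
    return median(lead_time(entered, deployed) for entered, deployed in changes)


def deployments_per_week(deploy_times, start, end):
    """Average number of deployments per week within the window [start, end)."""
    count = sum(1 for t in deploy_times if start <= t < end)
    weeks = (end - start) / timedelta(weeks=1)  # timedelta / timedelta gives a float
    return count / weeks


if __name__ == "__main__":
    # Made-up sample data, purely to show the calls.
    changes = [
        (datetime(2024, 1, 1), datetime(2024, 1, 4)),
        (datetime(2024, 1, 2), datetime(2024, 1, 3)),
        (datetime(2024, 1, 5), datetime(2024, 1, 10)),
    ]
    print("median lead time:", median_lead_time(changes))
    print(
        "deployments per week:",
        deployments_per_week(
            [deployed for _, deployed in changes],
            datetime(2024, 1, 1),
            datetime(2024, 1, 15),
        ),
    )
```

The median is used rather than the mean so that a single change stuck in an environment for weeks does not dominate the figure. Which statistic best reflects flow is a judgment call for the team.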

We also have one foot firmly in the future. Over time the need for this team will lessen. Greater ownership, more teams owning their services end to end and “DevOps” will all reduce the need for an Environment-Release Product team. Our lifespan is based on the value we can show that we generate. However, we need to develop our skills in other areas, whether that is Platform management within our Cloud platform or more “DevOps” skillsets which can work within teams to help them deliver. For now we are focused on our Product and value. It would be interesting to hear how other companies and people manage non-production environments, especially where a variety of approaches has been developing for many years.

--

Rob Hornby
John Lewis Partnership Software Engineering

Lead Engineer within our Technical Profession & Platform Product Lead for John Lewis with a background in retail technologies, software testing and platforms.