Learnings from a Microservices Migration Journey

I’m Lima, a Product Engineer at John Lewis & Partners, focusing currently on building backend microservices for the johnlewis.com website.

My team has been building a product catalogue microservice that runs on Google Cloud and serves the needs of the John Lewis website. Our API now has ~30 consumers, including micro-frontends and other backend microservices, and it replaces a similar API hosted by a monolithic packaged application.

We provide both GraphQL and REST endpoints for use. Here’s what we learnt during our journey of building this API.

Start small

Deciding what to do first when starting off with replacing an API that serves almost all product information needs on a website is a daunting task.

You don’t have to tackle a difficult or big thing when you are starting off. Starting small helps you prove your design choices and integration patterns without a major impact to the existing system landscape/website.

It is critical for you to know if what you’ve put in place works and can give you the performance you need. Doing a small thing first gives you much-needed feedback on your choices and helps you learn from them. It helps to de-risk your work.

The best choices are made collaboratively. The Product Manager/Delivery Lead may take the lead, but it is key to take input from the techies as well, as they bring valuable insight into the discussions.

Our team chose the recently viewed items panel (RVI) as the first consumer as it required very little data, but helped us to gain confidence in what we were doing. It was also a piece that wasn’t absolutely critical and wouldn’t bring the website down if it broke. It also took ~13% of traffic away from the API we were trying to replace.

Having something small running well in production is much better than a lot of non-productionised functionality.

Always ask why

Consumer-driven development helped us to focus on the needs of the consumer, but really understanding why a feature was required enabled us to provide the best solutions for their needs.

Sometimes fields on existing APIs may be used just because they are there rather than because they are the best solution. A bit of prodding may reveal better ways to achieve the consumer need.

Asking questions may help to uncover and acknowledge gaps in overall understanding of the system landscape and help provide architectural clarity.

Document the why

While it is important to know what a system or piece of logic is doing, it is even more important to know why.

As human beings it is impossible to remember all the reasons for choices or decisions we make all the time. Therefore, it is essential to have a place to refer back to when trying to remind ourselves of why something is the way it is.

This is not just useful for the team itself, but also for consumers and other stakeholders who may want to know why things are the way they are, especially if a long time has passed since the systems were built. Such documentation also helps to avoid incorrect assumptions around the design of your systems.

KDR template

We use Key Decision Records (KDRs) to document why we chose to do something in a certain way and what the consequences are. They can be as detailed or concise as you want them to be. We document our KDRs using the template in the image above and have them in Confluence under our project workspace.

Understand your bounded context

One of the things that comes up often is what the boundary of your domain is. Our bounded context from the outset was to provide core product data for the selling domain.

What is my domain

The information one needs to sell a product is quite different from what one needs to buy it into the business. The API we were trying to replace also provided some product buying information along with its selling information, which various consumers had come to rely on.

There is also non-core product data that specialist selling services use for selling complex products like made-to-measure curtains and blinds. The old API provided this data, but it was not within the bounded context of our API.

These requirements meant we had to work with consumers to come up with a solution that worked for them as well as us, both in the short and long term. The process also helped identify and acknowledge the need for additional microservices in the architectural landscape.

This approach has helped us to keep our API clean and avoid unnecessary complexity.

Own what you expose on your API

It is important to be able to answer questions from future/existing consumers about the evolution of the API or accept additional feature requests. You may also need to field challenges to design decisions you’ve made for your API, for which the KDRs discussed above are quite useful.

These are part of the core responsibilities of your team and the discussions result in better overall appreciation of the design and architecture.

One of the key things that help in answering questions around availability of features is the team roadmap. It is always handy to have a link to the latest version of it, or know where to find it!

Another important thing is not to clutter your API with everything that is requested by consumers. Carefully consider whether a piece of information belongs on your API (bounded context) and if so, how it should be made available. It is important to expose just the right amount of information and only what is required.

There have been occasions on our team when API modelling has taken more time and energy than the actual implementation.

Know your consumers

It is important to know who your consumers are and how they use your API. One of the most useful ways for us to do this has been via Consumer Driven Contract Testing.

Pact consumer network graph

The consumer network graph that Pact provides gives an idea of all the consumers using a particular provider.

Having consumer-side Pacts helps us to assess the impact of any changes we intend to make and to communicate with the impacted parties proactively.
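To illustrate the idea, here is a minimal sketch in Python. It is not Pact’s API: the consumer names and fields are hypothetical, and in practice the information would come from the contracts your consumers publish. It simply shows how knowing which fields each consumer relies on lets you work out who is affected by a change.

```python
# Hypothetical record of which response fields each consumer's contract relies on.
# In a real setup this would be derived from the published consumer contracts.
CONSUMER_CONTRACTS = {
    "recently-viewed-panel": {"id", "title", "price"},
    "search-service": {"id", "title", "brand"},
    "basket-service": {"id", "price"},
}


def impacted_consumers(contracts, changed_fields):
    """Return the sorted names of consumers whose contract uses any changed field."""
    changed = set(changed_fields)
    return sorted(name for name, fields in contracts.items() if fields & changed)


if __name__ == "__main__":
    # Which consumers should we talk to before changing the `price` field?
    print(impacted_consumers(CONSUMER_CONTRACTS, ["price"]))
```

A field that no contract mentions returns an empty list, which is a useful signal that it may be safe to change or retire.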

The other important thing is to be available to your consumers to answer questions and clear doubts. We have a Slack channel that acts as our front door for any questions/suggestions/feedback/requests anyone has about our API. We monitor it during working hours with a dedicated rota.

Whilst various consumers were moving over to our API, we had regular consumer drop-ins where we were available to answer questions and help with better understanding of our API.

Design for change

It is important to be able to make changes quickly to your API or underlying microservices. Quite often it is the unseen parts of the logic that are the most complicated.

Design the internal application landscape such that loosely coupled, simpler and smaller, well-defined microservices are leveraged to provide the final exposed API. This helps in making smaller changes to the internal microservices without impacting the exposed API. It also helps with internal architectural evolution without impacting any consumers.

Employ an expand and contract pattern for changes on the API. This avoids breaking changes for consumers and gives them enough time to switch over.
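As a rough sketch of what expand and contract can look like, assume a hypothetical rename of a `price` field to `current_price` (both names are made up for illustration). During the expand phase the response carries both fields; once consumers have switched over, the contract phase drops the old one.

```python
def product_response(product, phase):
    """Build a response for a hypothetical rename of `price` to `current_price`.

    phase "expand":   serve both the old and new field so consumers can migrate.
    phase "contract": serve only the new field once nobody reads the old one.
    """
    if phase not in ("expand", "contract"):
        raise ValueError(f"unknown phase: {phase}")
    response = {"id": product["id"], "current_price": product["price"]}
    if phase == "expand":
        response["price"] = product["price"]  # deprecated field, kept for now
    return response


if __name__ == "__main__":
    item = {"id": "p1", "price": 10}
    print(product_response(item, "expand"))
    print(product_response(item, "contract"))
```

The switch from expand to contract is the point at which contract tests and consumer communication matter most, since they tell you whether anyone still depends on the old field.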

Deploy small changes often to production. This helps build confidence in the code changes.

It is also important to set the expectation with consumers that there are bound to be changes as the API evolves.

Have a way of communicating changes made to all consumers. We have a Slack group set up with all our consumers on it. Anyone who is interested in being notified of changes to the API can add themselves to it too. We notify the Slack group when we know there are changes already deployed/to be deployed that will affect most consumers.

Our analysis of who is impacted is greatly helped by contract tests as they tell us exactly which consumers are impacted.

Build in resilience from the start

It is important to anticipate things that could go wrong and put in measures to protect your API.

Consider:

  • Short timeouts to fail fast
  • Retry mechanisms
  • Rate limiting
  • Circuit breakers
  • Auto-scaling when required
  • Disaster recovery considerations
  • Seeding/re-seeding mechanisms

Use libraries relevant to the language and framework you are using to set up retry and rate limiting mechanisms and circuit breakers.
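A library is the right choice in production, but it helps to understand what these mechanisms do. The following is a simplified Python sketch of a retry with exponential backoff and a basic circuit breaker. The class and function names, thresholds and delays are all illustrative assumptions, not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Reject calls for `reset_after` seconds once `max_failures` calls fail in a row."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the circuit opened, or None if closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset period elapsed: allow a trial call (half-open state).
            # One more failure will re-open the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `func`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The clock and sleep functions are passed in so that the behaviour can be tested without waiting in real time.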

Part of our monitoring dashboards

Consider horizontal pod autoscaling to scale microservices based on pod CPU usage, and KEDA’s ScaledObject to scale microservices based on the number of messages waiting to be processed on GCP Pub/Sub topic subscriptions.
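To get a feel for how such scaling decisions are made, the Kubernetes Horizontal Pod Autoscaler documentation describes the core calculation as the current replica count multiplied by the ratio of the current metric to the target metric, rounded up. The Python sketch below reproduces only that core formula plus min/max clamping; the real autoscaler also applies tolerances and stabilisation. The backlog function is a simplified assumption about queue-based scaling, and all parameter names and values are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Core HPA-style calculation: ceil(current * currentMetric / targetMetric), clamped."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


def replicas_for_backlog(waiting_messages, messages_per_pod,
                         min_replicas=1, max_replicas=10):
    """Simplified backlog-based scaling: enough pods for each to take its target share."""
    raw = math.ceil(waiting_messages / messages_per_pod)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
    # 250 messages waiting, aiming for 100 per pod.
    print(replicas_for_backlog(250, 100))
```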

If you have consumers who hold a local cache of your data (like some of our consumers), provide a way for them to seed/reseed all your data if needed.

If you have many upstream dependencies consider caching their data locally to minimise your dependence on their services and protect your service.
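One way to picture this is a small cache that refreshes from the upstream service when an entry expires, but falls back to the stale copy if the upstream call fails. This Python sketch is an illustration under assumptions: the class name, the `fetch` callable and the TTL are hypothetical, and a real implementation would also need eviction and thread safety.

```python
import time


class StaleOnErrorCache:
    """Cache upstream data locally and serve the stale copy if a refresh fails."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self.fetch = fetch      # callable that loads a value from the upstream service
        self.ttl = ttl          # seconds before an entry is considered expired
        self.clock = clock
        self._entries = {}      # key -> (value, fetched_at)

    def get(self, key):
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]  # fresh enough: no upstream call needed
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # upstream is down: serve the stale value
            raise  # nothing cached yet, so the failure must surface
        self._entries[key] = (value, now)
        return value
```

Whether serving stale data is acceptable depends on the data. It tends to suit slowly changing reference data better than prices or stock levels.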

Also, consider backing up your data stores regularly to enable you to recover from any DR situations.

Quality is everyone’s concern

We have built in tests from the very start and follow a Test Driven Development (TDD) approach to development. There is no specific QA-only role on our team. It is everyone’s responsibility.

We have different types of tests built into our pipeline. Unit tests test specific classes, while integration tests check that the various components of the app work together.

Where external APIs are involved, we use Wiremock or relevant mocks for writing automated tests. Contract tests using the Pact framework ensure we have correctly understood the response from the API we call.
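The essence of a consumer-side contract is a statement of which fields you rely on and what shape you expect them to have. The Python sketch below is a hand-rolled illustration of that idea and is not how Pact itself is used; the contract format and field names are hypothetical.

```python
def contract_violations(contract, response):
    """List fields the consumer expects that are missing or of the wrong type."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


if __name__ == "__main__":
    # Hypothetical expectations of one consumer; extra fields in the response are fine.
    expected = {"id": str, "price": float}
    print(contract_violations(expected, {"id": "p1", "price": 9.99, "extra": 1}))
    print(contract_violations(expected, {"id": 1}))
```

Note that extra fields in the response are not violations. This is what allows a provider to add fields without breaking existing consumers.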

We also have performance tests built into our pipelines to give us confidence that our code change hasn’t adversely affected performance. Over time we have realised that we are interested more in the performance trend of our API and so have graphed the performance test results using Google BigQuery and Data Studio.
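Looking at the trend rather than a single result can be as simple as comparing the latest run with a baseline of recent runs. The following Python sketch shows one possible way to do that; the function name, window size and sample numbers are assumptions for illustration and do not describe our actual BigQuery setup.

```python
from statistics import mean


def regression_ratio(history, window=5):
    """Ratio of the latest result to the mean of up to `window` preceding results.

    `history` is a list of response times (for example p95 in ms), oldest first.
    A ratio above 1.0 means the latest run was slower than the recent baseline.
    """
    if len(history) < 2:
        raise ValueError("need at least one earlier result to compare against")
    baseline = history[-(window + 1):-1]
    return history[-1] / mean(baseline)


if __name__ == "__main__":
    # Hypothetical p95 latencies in ms from consecutive pipeline runs.
    print(regression_ratio([100, 100, 100, 100, 150], window=4))
```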

performance trend

Operability, monitoring and alerting

It is key to know what’s going on inside your microservices landscape and react to any adverse changes.

dashboard

We use the Micrometer library to capture metrics in our app and expose them in the Prometheus format for collection by VictoriaMetrics. We use many of these exposed metrics to configure alerts to notify us when things go wrong and use Grafana for dashboarding.
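For readers who have not seen it, the Prometheus text format is a plain-text list of metric samples with optional labels. The Python sketch below hand-renders a counter in that format purely for illustration. It is not Micrometer’s API, and the metric name and labels are hypothetical.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"endpoint": "/products", "status": "200"}, 42)],
    ))
```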

The key questions to ask when capturing metrics and setting up alerts are: who wants to know and why, what needs to be captured and what is the action?

non-urgent slack alert

If you are a critical service and provide 24x7 support, it is important to think about urgent and non-urgent alerts when setting them up. Key questions to ask would be: can this issue wait till the morning? Can I do anything to fix it overnight? It is important to tune alerts to avoid false alarms.

On our team, we have non-urgent alerts going only to Slack while urgent ones are set up in PagerDuty and we are called out overnight for them.
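The routing decision can be expressed as a small rule. The Python sketch below is one interpretation of the questions above, not our actual alerting configuration: the alert fields are hypothetical, and it pages someone only when the issue cannot wait and there is something they could do about it straight away.

```python
def route_alert(alert):
    """Decide where an alert goes, based on urgency and whether it is actionable now.

    `alert` is a dict with two hypothetical boolean fields:
      urgent         - the issue cannot wait until the morning
      actionable_now - the person called out could do something to fix it overnight
    """
    if alert["urgent"] and alert["actionable_now"]:
        return "pagerduty"
    return "slack"


if __name__ == "__main__":
    print(route_alert({"urgent": True, "actionable_now": True}))
    print(route_alert({"urgent": False, "actionable_now": True}))
```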

Runbooks are the first port of call for solving production issues (especially when called out at night!). Keep them up to date and easy to use.

It is also important to have a healthy mix of operability and functionality (feature) tickets when planning your work. Tuning, refactoring, simplifying and improving the code keeps your API supportable and performant.

Provide good documentation

When you serve multiple consumers, it is extremely important to have good documentation. In addition to standard API specifications, it is valuable to have any additional on-boarding documentation related to what can be expected from the API, what the API evolution strategy is, good practices around resilience and performance etc.

consumer on-boarding

Working Swagger specs for REST endpoints and a GraphiQL playground for GraphQL endpoints are an absolute must for consumers to familiarise themselves with your API.

Consider using a library in your code that will automatically update your API specifications as you change your API so that you don’t have to keep track of it separately.

swagger spec
GraphiQL playground

Voyager is another extremely useful tool to navigate the type system on the GraphQL API.

All our consumer on-boarding documentation and specs are pinned to our front door Slack channel for our consumers to find them easily.

Share share share

Software development is a team sport and one of the most important drivers for success is sharing knowledge.

We use extra time at the end of our stand-up to go through anything that anyone on the team wants to talk about. We do a lot of pair programming and then showcase to the full team to go through what was done.

It is extremely important to have an open, honest and sharing culture within and outside the team. Not everyone will agree with everything, but it is important to have a safe place to share feelings, ideas and opinions without the fear of being judged. Some examples of things that have come up on our team are:

  • Help! We are going down this rabbit hole! Can we talk it through with someone?
  • Too many options to choose from! What should we do?
  • I need a change! Can we swap pairs?
  • I’d like to tell you about how my weekend went…
  • We need feedback — can we do this better?

Sharing ideas/solutions/tips and tricks with other teams including consumer teams is also a great way of building knowledge and capability within the wider organisation.

Final Thoughts

  • Starting small and building in learnings from early implementations can provide strong and resilient APIs.
  • Small, loosely-coupled microservices speed up change, shrinking the path to live.
  • Deploying small changes to production can give confidence in code changes.
  • Expose just the right amount of information to avoid cluttering your API and ensure there is a consumer for each piece of information.
  • Think about quality and resilience from the start.
  • Have the optimum level of monitoring and alerting in place. Too much or too little could be detrimental.
  • Software is always evolving so it is important to set expectations with consumers and other stakeholders that changes are bound to happen.
  • Software development is a team game and it requires all kinds of people (the thinkers, the doers and the doubters) to get things right. You win it only if you work together.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.
