Tech Modernization at scale — Blog Series

Chandra Ramalingam
Walmart Global Tech Blog
6 min read · Jul 1, 2020

Platform refactoring initiatives go by various grandiose names — platform modernization, tech refresh, Agile transformation, and DevOps transformation. The goal of these initiatives is primarily to refactor systems and to reorganize the teams that build and operate them. The interdependence between the two — teams and systems — is captured most succinctly by what has become popular as Conway’s law, based on Melvin Conway’s article “How Do Committees Invent?”:

“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.”

— Melvin Conway, http://www.melconway.com/Home/Conways_Law.html

The converse is also true — teams evolve with systems and are enabled or limited by them. Take the example of a team developing a monolithic application. A monolith perhaps makes sense when the team first sets out to build a new application for a new and complex problem. As the application matures, scales up, and starts to host multiple complex use cases, teams grapple with the monolith and try to organize themselves around it in different ways. The communication lines between such teams multiply (with n teams, the number of potential paths grows as n(n−1)/2) as they struggle to coordinate their work and avoid stepping on each other’s toes. If those teams are in different geographies, they have only a tiny sliver of time every day to meet and collaborate.


Any “transformation” initiative that underestimates — or doesn’t take into account — the interdependence between teams and systems is doomed from the beginning. Most companies, like ours, pursue these modernization initiatives aiming for agility: improving throughput without compromising on quality. Thus, eliminating complex communication paths and empowering teams to run as independently as possible are front and center in this effort.

Going back to the problem of monoliths, we were grappling with many such applications that were buckling under their own weight. Scalability and stability degraded with every change, and maintaining development and code hygiene became nearly impossible with teams spread across geographies. More importantly, the teams struggled to add even small features.

Thus began our venture into “microservices” — the natural and most obvious choice for breaking up monoliths. After all, if we break a massive application down into smaller applications and distribute them, we should be free of the communication mayhem described earlier, shouldn’t we? And the agility of the teams should improve as a result? Unfortunately, few terms and architectural styles are as poorly understood as microservices. For starters:

1. How “micro” should a service be to be deemed a microservice — is it one operation per service? Or one business entity per service?

2. Should the services be broken down based on technical capabilities or business processes?

3. How would you handle data consistency in the world of distributed services? And is data consistency even necessary at all times?

4. How would we observe failures in a massive web of interdependent services?

5. How would we prevent the unavailability or failure of one service from affecting a host of other services?

6. Can services differ from one another in language and stack? How do we handle developer experience in a polyglot environment?

As we started, there were more questions than answers, and we made a lot of mistakes. The pitfalls we fell into helped us refine our approach as we chipped away at the monolith. For instance, one of the first services we extracted out of the monolith still shared the same domain boundary as the monolith. Two applications were working on the same domain entity concurrently. Compensating transactions and sagas, which are usually the cure for distributed updates, didn’t help much here at all because the domain boundaries were wrong. We hadn’t quite built a distributed monolith, but we were well on our way to creating one.
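
For readers unfamiliar with the pattern, the sketch below shows saga-style orchestration with compensating actions in its most minimal form; the interface and the flow are illustrative assumptions, not our production code. The pitfall above is precisely the case where this machinery cannot help: the steps being coordinated were really two services fighting over the same domain entity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// A saga step pairs a forward action with a compensating action that undoes it.
interface SagaStep {
    void execute();
    void compensate();
}

final class Saga {
    // Run the steps in order; if one fails, compensate the completed steps in reverse.
    static void run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException failure) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                throw failure;
            }
        }
    }
}
```

Each step would typically wrap a call to a different service (reserve inventory, charge a payment, and so on), with the compensations undoing those calls in reverse order. Compensations cannot repair a boundary that was drawn incorrectly in the first place.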

Fortunately, though, there’s help: Domain-Driven Design (DDD) is a key tool for identifying clear system boundaries — not only inside a monolith, but in any complex business domain. There are excellent books and articles — Sam Newman’s Building Microservices and Martin Fowler’s articles on microservices, to name the most popular — and countless tech talks and blog posts from organizations that have walked the same path. We adopted some of these lessons to redesign our applications and teams.

We wanted to share some of our experiences refactoring our systems and to introduce guiding principles and concepts that should interest anyone pursuing this path. We intend to write a series of blogs over the coming weeks on the topics listed below. These topics are by no means exhaustive; we aim to cover the critical things that mattered to us, with the assumption that readers will explore them further. We will also gloss over some of the well-documented concepts of building and managing distributed systems — for example, it’s fair to say that the days of logging into individual machines to look at logs are long gone, and without centralized logging and alerting it would be pointless to build distributed systems. We won’t dwell on such well-documented concepts, though we will highlight if and how we diverged from the pack. We do, however, intend to cover some of the contentious topics: what a proper CI/CD pipeline looks like, which standards and practices are worth mandating, the boundaries of testing in a world of microservices and continuous delivery, and so on.

Building Domain Driven microservices — In this first blog of the series, we’ll talk about microservices in general and the first steps to take when starting to build them. We’ll discuss how domain-driven design helps in carving out service boundaries and enables low coupling between services and the teams that work on them. We’ll also cover some important design heuristics, methodologies, and pitfalls to watch out for when designing microservices.
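
To make the idea of boundaries a little more concrete ahead of that post, here is a tiny, hypothetical illustration (the contexts and fields are invented for this example): the same business concept, a product, is modeled independently in two bounded contexts, and the two services share nothing but a stable identifier.

```java
// Catalog context: how a product is described and presented to customers.
record CatalogProduct(String sku, String title, String description, String imageUrl) {}

// Inventory context: how much of a product is on hand, and where.
record StockItem(String sku, String warehouseId, int quantityOnHand, int reorderThreshold) {}
```

Because each context owns its own model, a change to how a product is described never forces a change, or a coordinated deployment, in the service that tracks stock. That is the low coupling we were after.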

Designing applications for agility — Part 1 and Part 2: We live in a world where change is an everyday occurrence, and our applications need to evolve at the pace of that change. Refactorability, modularity, and testability should be at the core of our application design. We’ll cover some concepts of domain-driven design — note that DDD existed before microservices became a thing — that come in handy here, such as ubiquitous language, intention-revealing interfaces, anti-corruption layers, and non-anemic domain models. We’ll also explore these in the context of another useful application layering approach, the Ports and Adapters pattern.
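
As a small preview, here is a minimal Ports and Adapters sketch; the names are hypothetical and the persistence is deliberately simplistic. The domain owns an intention-revealing port, and the adapter at the edge keeps infrastructure details out of the domain, the same role an anti-corruption layer plays when the thing on the other side is someone else’s model.

```java
import java.util.Optional;

// Domain model: intention-revealing, no infrastructure concerns.
record Order(String orderId, long totalCents) {}

// Port: an interface the domain owns, expressed in domain terms.
interface OrderRepository {
    Optional<Order> findById(String orderId);
    void save(Order order);
}

// Domain service depends only on the port, never on a database or vendor SDK.
final class OrderService {
    private final OrderRepository orders;

    OrderService(OrderRepository orders) {
        this.orders = orders;
    }

    void applyDiscount(String orderId, long discountCents) {
        Order order = orders.findById(orderId)
                .orElseThrow(() -> new IllegalArgumentException("Unknown order " + orderId));
        orders.save(new Order(order.orderId(), order.totalCents() - discountCents));
    }
}

// Adapter: lives at the edge and translates between the port and a concrete store.
final class InMemoryOrderRepository implements OrderRepository {
    private final java.util.Map<String, Order> store = new java.util.concurrent.ConcurrentHashMap<>();

    public Optional<Order> findById(String orderId) { return Optional.ofNullable(store.get(orderId)); }
    public void save(Order order) { store.put(order.orderId(), order); }
}
```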

Accelerating Software Delivery and Organizational Performance: We’ll explore aligning teams and systems in ways that improve agility — also known as the Inverse Conway Maneuver. We’ll discuss how adopting a DevOps culture transforms functions such as Architecture and SRE, and we’ll talk about important metrics — code quality, mean time to detect failures, and mean time to recover, to name a few — that we may want to measure and track to ensure we are not straying too far from the path.
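
To give a flavor of how lightweight these measurements can start out, here is a hypothetical sketch that computes mean time to recover from a list of incidents; in practice the data would come from an incident-management or observability system rather than being assembled by hand.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// An incident, from the moment it was detected to the moment service was restored.
record Incident(Instant detectedAt, Instant resolvedAt) {}

final class DeliveryMetrics {
    // Mean time to recover: average of (resolved - detected) across incidents.
    static Duration meanTimeToRecover(List<Incident> incidents) {
        if (incidents.isEmpty()) return Duration.ZERO;
        Duration total = incidents.stream()
                .map(i -> Duration.between(i.detectedAt(), i.resolvedAt()))
                .reduce(Duration.ZERO, Duration::plus);
        return total.dividedBy(incidents.size());
    }
}
```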

CI/CD practices of modern applications: Modern applications must provide a way for businesses to realize value faster without compromising on quality — move fast, but don’t break things. Such agility requires continuous delivery processes and tools that not only deliver changes securely — running all the necessary tests automatically — but also provide safety nets that reduce the blast radius when unexpected failures happen in production. Ideally, we would like to deploy every commit to production, passing through all quality gates and measuring and safeguarding important hygiene metrics — both development and deployment hygiene — along the way.
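
One example of such a safety net, sketched here purely as an illustration rather than a description of our pipeline, is a canary gate: the newly deployed version serves a small slice of traffic, and the rollout proceeds only if its error rate stays within a tolerance of the stable baseline.

```java
// Hypothetical canary gate: block the rollout (and trigger a rollback) when the
// canary's error rate drifts too far above the stable baseline.
final class CanaryGate {
    private final double allowedDelta; // e.g. 0.01 == one percentage point

    CanaryGate(double allowedDelta) {
        this.allowedDelta = allowedDelta;
    }

    boolean shouldRollBack(double baselineErrorRate, double canaryErrorRate) {
        return canaryErrorRate - baselineErrorRate > allowedDelta;
    }
}
```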

Testing Strategies and tools to enable Continuous Delivery:

Continuously deploying every change-set to production requires just enough tests to ensure that the service under test works as expected, but how much is enough? In other words, what is the test boundary of a service? What about the dependencies of this service, and the consumers who depend on it? It’s also important to get feedback early, preferably on developer machines with mocks and stubs — Shift-Left testing is quite the rage these days — but we may also want to run some of the more advanced tests against real infrastructure: end-to-end tests, performance tests, and resiliency tests. Sometimes, testing in production may be the only option. We’ll also talk about how golden datasets can help us avoid flaky and unreliable tests.
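
As a taste of where that boundary can sit, here is a hypothetical example (it assumes JUnit 5 on the classpath; the service and client names are made up): the service under test receives a hand-rolled stub in place of its downstream dependency, so the test exercises our logic without ever crossing the network.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Downstream dependency, reached over the network in production.
interface TaxClient {
    double taxRateFor(String zipCode);
}

// The service under test.
class PriceService {
    private final TaxClient taxClient;

    PriceService(TaxClient taxClient) {
        this.taxClient = taxClient;
    }

    double totalWithTax(double subtotal, String zipCode) {
        return subtotal * (1 + taxClient.taxRateFor(zipCode));
    }
}

class PriceServiceTest {
    @Test
    void addsTaxUsingStubbedRate() {
        // Stub the dependency: the test boundary stops at the TaxClient interface.
        TaxClient stub = zipCode -> 0.10;
        PriceService service = new PriceService(stub);

        assertEquals(110.0, service.totalWithTax(100.0, "94111"), 0.0001);
    }
}
```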

Designing for high availability:

As applications become more and more distributed, it’s easy for a runaway application to bring down the entire platform; hence, fault tolerance and resilience are critical when we design microservices and their interactions. We’ll also talk about various traffic-shaping patterns and safety measures that help a service and its consumers stay resilient to unexpected failure modes.
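
One of the best-known of those safety measures is the circuit breaker. The sketch below is a deliberately minimal, hand-rolled version for illustration; a production service would lean on a mature resilience library and handle half-open probing, metrics, and concurrency more carefully. After a run of consecutive failures the breaker opens and fails fast for a cool-down period, protecting both the struggling dependency and its callers.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker: fail fast while "open", allow a trial call after a cool-down.
final class CircuitBreaker {
    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    CircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    synchronized <T> T call(Supplier<T> remoteCall) {
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                throw new IllegalStateException("Circuit open: failing fast");
            }
            openedAt = null;           // cool-down elapsed, allow a trial call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;   // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();
            }
            throw e;
        }
    }
}
```

A caller wraps each remote invocation in breaker.call(...), so that when the downstream service is unhealthy, requests fail immediately instead of piling up behind timeouts.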


Chandra Ramalingam
Walmart Global Tech Blog

Distinguished Software Engineer @ WalmartLabs, based in San Francisco, California, working on building large-scale grocery eCommerce platforms