Overheard at QCon: Designing and delivering effective microservices

McKinsey Digital
McKinsey Digital Insights
Dec 7, 2022

By: Tassio Abreu — Digital Specialist, McKinsey & Company

Anyone who has ever attended QCon will no doubt agree that it truly is one of the highlights of the year for software engineers, and 2022 was no different. Microservices have always been an area of interest for me, and one that our clients regularly want to discuss with us, so I made sure microservices were a key focus of my time at QCon.

As you'll have guessed from the title, this article shares my key takeaways from the conference on how to design and deliver effective microservices. Before doing so, however, it's important to understand the benefits of microservices and how to secure the most value from the approach.

Starting your microservices journey

Everything starts with design. When new teams or new products are being conceived, many jump straight to microservices as this style of architecture is very much on trend currently. It’s important to differentiate between architectures you’re interested in and those that you really need, though.

During his QCon session, Chris Richardson, software architect, serial entrepreneur and founder of Microservices.io, used dark energy and dark matter as metaphors for a memorable framework to help with exactly this, built around repulsion forces and attraction forces.

He defined the repulsion forces, in this context, as:

  • Simple components: when services are segregated, it’s much easier to analyze a codebase from scratch and understand it in its entirety
  • Increased team autonomy: teams can focus on small pieces or business functions and take end-to-end responsibility for their success, from implementation to customer impact
  • Fast deployment pipeline: smaller systems deploy faster, DORA lead time for changes is often shorter, and code-to-deployment benchmarks tend to be closer to elite levels
  • Support for multiple technology stacks: you can always use the best technology at hand, whether to increase performance, to use a more advanced framework that improves ergonomics and saves engineering time, or for compliance purposes
  • Cost-effective scaling: you can scale pieces of your application separately, scaling through container density, which in turn diminishes the chances of having to overprovision to known peak capacity just to protect uptime
  • Segregated highly available components: you can allocate different services to different infrastructures, varying both cost and availability per component

On the other hand, he defined attraction forces as:

  • Simple interactions: importing a class or file and using it directly will always be simpler, cheaper and faster than connecting to another component over a network
  • Prefer ACID over BASE: avoiding eventual consistency allows for simpler, always-true system behavior; there’s no risk that information being hot-read is inconsistent
  • Minimize runtime coupling: service availability will depend less on other moving pieces. The more distributed a system is, the more expensive it is to make resilient, as more moving parts often mean increased complexity and a higher risk of failure
  • Efficient inter-service communication: communication often happens in-memory, ensuring it’s fast and predictable
  • Minimize design-time coupling: your service will need to change less frequently, especially in response to external forces. Imagine having to redeploy your system regularly because it depends on an external REST API that keeps changing; it’s far from ideal
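To make the runtime-coupling trade-off above concrete, here's a minimal sketch (the service names, URL and prices are illustrative, not from the talk) contrasting an in-process call with the same lookup over a network, which suddenly needs a timeout and a fallback:

```python
import urllib.error
import urllib.request

# In-process: a direct function call is fast, predictable, and cannot
# partially fail the way a network hop can.
def local_price(item_id: str) -> float:
    prices = {"sku-1": 9.99, "sku-2": 4.50}  # illustrative data
    return prices[item_id]

# Remote: the same lookup over the network now needs a timeout and a
# fallback, because the (hypothetical) pricing service may be slow or down.
def remote_price(item_id: str, base_url: str = "http://pricing.internal") -> float:
    try:
        with urllib.request.urlopen(f"{base_url}/prices/{item_id}", timeout=2) as resp:
            return float(resp.read())
    except (urllib.error.URLError, TimeoutError, ValueError):
        return local_price(item_id)  # degrade to a local/cached value
```

The extra error-handling code in `remote_price` is exactly the cost the attraction forces describe: every network boundary buys flexibility at the price of failure modes the in-process call simply doesn't have.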

As a rule of thumb, it’s recommended to begin by favoring the attraction forces: keeping components together is often easier and gives engineering teams a stronger platform for eventual success. It’s also important to keep in mind that you don’t need a distributed system at this stage, unless it’s project-critical.

After starting small, keep challenging your architecture with real, proven hypotheses for repulsion. It’s here that distributed systems begin to offer value, as although they’re more complex to maintain, they have better end-game potential for scalability and resilience. This does come at a price, but Chris defined and shared multiple patterns to help here.

It’s important to understand that system design is an iterative process, so constantly reevaluate how to evolve your system. It can even prove beneficial to decompose it into multiple sections for decoupling, scalability, enhanced team topology and enabling two-speed IT. The latter is a McKinsey framework describing how organizations can run fast-moving, customer-facing delivery alongside the stability of core transactional systems.

So how should you iterate on your system design? The following is derived from Chris’ pattern language:

  • Define the context of your system, e.g. an eCommerce system being evolved to increase conversion rates
  • Enumerate the problems being tackled, e.g. significant churn before checkout due to lengthy user log-in times
  • Note the forces that help prioritize each problem, e.g. new team members must quickly become productive, the checkout service is critical, and new deployments can’t happen regularly due to the associated risk
  • Design the solution, e.g. the checkout functionality will be broken out into a new microservice, decoupling its lifecycle from components that can and should change more often to adapt to users’ behavior
  • Implement the solution: the code commits and effective code changes that bring your new system to life
  • Produce the resulting artifacts, e.g. C4 diagrams, new ADRs and other content that illustrates how the system will behave and interact
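As a sketch of how a team might keep these iterations honest (the field names and the example values are my own, not Chris's), each pass through the loop can be captured as a lightweight record:

```python
from dataclasses import dataclass, field

@dataclass
class DesignIteration:
    """One pass through the iterative design loop described above."""
    context: str                       # the system and its business goal
    problems: list[str]                # problems being tackled
    forces: list[str]                  # forces that prioritize each problem
    solution: str                      # the proposed design change
    artifacts: list[str] = field(default_factory=list)  # C4 diagrams, ADRs, ...

checkout = DesignIteration(
    context="eCommerce system evolved to increase conversion rates",
    problems=["Significant churn before checkout due to slow log-in"],
    forces=["New team members must become productive quickly",
            "Checkout is critical; risky deployments must be rare"],
    solution="Extract checkout into its own microservice",
    artifacts=["C4 container diagram", "ADR: extract checkout service"],
)
```

Keeping the record this small is deliberate: the point of the pattern language is to force the context, problems and forces to be written down before a solution is chosen, not to produce heavyweight documentation.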

Taking care of overcomplicated architecture early

In the second session I attended at QCon, Cassandra Shum, enterprise architect and former member of the Technical Advisory Board at Thoughtworks, discussed the importance of avoiding overcomplicated architecture, which complemented Chris’ session perfectly. She strongly believes that driving a culture of engineering excellence means tackling tech debt early in any project, and I have to agree, especially having now seen her framework, which tracks the growing complexity of an architecture in terms of symptoms versus remedies.

The following are the areas she suggested engineers focus on to diagnose an overcomplicated architecture:

  • Distributed monoliths: this is the worst possible scenario. Teams try to dive into microservices, but end up with replicated code, dysfunctional patterns and outputs which don’t reap any benefits of either architecture
  • Tracing, logging, monitoring and visibility without benefit: nobody knows what’s going on and logs become an expensive clone of the famous green screen in The Matrix which rarely leads to problem solving
  • Too much abstraction: indirection can lead to delays and unnecessary moving parts that bring additional risk and needless complexity
  • High cognitive load: the system is so complex to understand that onboarding new developers, irrespective of how good they are, is always a burden. Changing a single line of code has so many potential side effects that teams simply prefer to stagnate
  • Difficulty making technical decisions: too many cooks spoil the broth, and this proves it. When too many people need to be brought into discussions prior to making changes, teams can’t evolve their business functions
  • A ‘build in-house’ mindset: The perfect example is a fintech implementing its own authentication system instead of focusing on innovation that will set them apart from the competition. Every time a business builds something in-house which could have been bought “off-the-shelf”, that’s one less team potentially steering the company forward

Remedies on the other hand do exactly what they say on the tin, providing solutions to the aforementioned symptoms:

  • Have clear domain boundaries: align your business with your code. This can be achieved with a good, yet lightweight, understanding of domain-driven design, as well as spending time defining a ubiquitous language that is spoken seamlessly from directors down to engineers
  • Observability and monitoring: find a process which allows real-time inspection of the inner workings of your system, from the outside. Don’t ignore anomalies, be alarmed by them. Ensure people around you understand the system and the importance of cooperating with product teams on debugging
  • Lightweight architecture decision records (ADRs): with decisions documented, there shouldn’t be excessive or unknown side effects, so changing the system shouldn’t be a daunting task
  • Cross-functional requirements: leverage patterns like the microservice chassis to reduce the time teams spend on work which could be done just once, then centralized and reused. The likes of sidecars, proprietary libraries and packages really help at this point
  • Remove the hero culture: if critical knowledge is owned by one person, you have a very low truck factor; if that person is unavailable or leaves, a project can come to a quick halt
  • Controlled technical debt: your team should know how to measure technical debt, be aware of issues with code quality and react to such issues to ensure best practice
  • Developer effectiveness: code can be changed by a new engineer without risk, as it’s protected by automation and has controlled, known side effects
  • Platform thinking: product teams are incentivized to contribute, donate and eventually platformize team-specific tech assets to the company as a whole
  • Commodity vs differentiation: teams should spend as much time as possible working on new features and potential revenue streams or end-user perceived quality, instead of unnecessary heavy lifting against the system’s own weight and inertia
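To illustrate the chassis idea behind the cross-functional requirements remedy (this decorator is a hypothetical sketch, not a specific library), cross-cutting concerns such as request logging and latency timing can be written once and reused by every service handler:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("chassis")

def chassis(handler):
    """Shared wrapper for cross-cutting concerns: logging and latency timing.
    In a real chassis, retries, auth and metrics would also live here."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("%s took %.1fms", handler.__name__, elapsed_ms)
    return wrapper

@chassis
def get_order(order_id: str) -> dict:
    # Illustrative handler: business logic stays free of plumbing concerns.
    return {"id": order_id, "status": "shipped"}
```

The value is organizational as much as technical: each team writes only the business logic, while the plumbing is built once and maintained centrally, which is exactly the "done just once, then reused" point above.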

Effectively operating your distributed system

With your microservices journey now well under way and a clear framework in mind for tackling overly complex architecture, attention should turn to effectively operating your distributed system. During his talk at QCon, Wanderlei Souza, a distinguished engineer from Build by McKinsey, put together a collection of best practices that feels like the missing piece of the designing-and-delivering-microservices puzzle.

However, before introducing a distributed system, there are three crucial steps that should be taken:

  • Have design patterns as a handy problem-solving resource: common solutions to common problems are your best bet to keep your system manageable and evolvable over time. Make sure engineers are fluent on a pattern language for distributed architectures
  • Create a culture of parsimony: focus on simple solutions. Use a framework capable of taming overengineered architectures to avoid unnecessary — and usually accidental — complexity
  • Implement observability and alarming: use whichever platform best suits your team’s preferences, needs and budget
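As a minimal sketch of the observability-and-alarming step (the threshold, sample size and class name are illustrative assumptions, not from the talk), even a tiny error-rate alarm is better than none:

```python
from collections import Counter

class ErrorRateAlarm:
    """Tracks request outcomes and fires when the error rate crosses a threshold."""
    def __init__(self, threshold: float = 0.05, min_samples: int = 100):
        self.threshold = threshold      # e.g. alarm above 5% errors
        self.min_samples = min_samples  # avoid alarming on tiny samples
        self.counts = Counter()

    def record(self, ok: bool) -> None:
        self.counts["ok" if ok else "error"] += 1

    def firing(self) -> bool:
        total = sum(self.counts.values())
        if total < self.min_samples:
            return False  # not enough data to judge yet
        return self.counts["error"] / total > self.threshold

alarm = ErrorRateAlarm(threshold=0.05, min_samples=10)
for _ in range(9):
    alarm.record(ok=True)
alarm.record(ok=False)  # 10% error rate over 10 samples: above threshold
```

In practice this logic lives inside a monitoring platform rather than application code; the sketch just shows that the alarming decision itself, rate over window against threshold, is simple enough that there's no excuse to ship without it.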

Once these steps have been taken, it’s vital that engineering teams work together with the wider business stakeholders to align on objectives and outcomes. With that in mind, the following considerations will make your distributed system considerably more effective:

  • Don’t forget about the humans: make sure that engineering teams and product/business-focused colleagues are aligned. Avoid scenarios in which a new Black Friday push-notification campaign generates a traffic surge that degrades the service or even causes downtime
  • Master patterns and trade-offs for service communication: favor orchestration for complex workflows, and choreography for scalability and looser coupling
  • Design for resilience: consider concepts such as:
      • Graceful degradation
      • Fault tolerance
      • Circuit breakers
      • Bulkheads
      • Business-aware health checking
      • Distributed tracing for end-to-end debugging
      • Log aggregation
      • Anomaly detection on application performance metrics
  • Disliking TDD but not letting go of high testability: make sure you can easily simulate real-world situations in test environments; record the operations of a given day and replay them deliberately. It’s also key that you’re notified of changes to external integration contracts and that you can reproduce composite scenarios when lag or system failure occurs
  • Measuring your success: it’s vital you can evaluate the overall quality of your system against a well-respected industry standard, such as ISO 25010
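Of the resilience concepts listed above, the circuit breaker is small enough to sketch in full (the thresholds and in-memory state here are illustrative; half-open behavior is simplified): after repeated failures the breaker opens and callers fail fast, instead of piling load onto a struggling dependency.

```python
import time

class CircuitBreaker:
    """Fails fast once a dependency has failed `max_failures` times in a row;
    allows a trial call again after `reset_after` seconds."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # window elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Production systems would reach for an established library (resilience4j on the JVM, for example) rather than hand-rolling this, but the sketch shows why the pattern matters: the fast `RuntimeError` protects both the caller's latency and the failing dependency's recovery.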

Microservices aren’t a silver bullet

Operating microservices isn’t a walk in the park, and it certainly isn’t the silver bullet for achieving engineering excellence that many paint it as. Despite that complexity, being aware of and understanding these concepts can help any software engineer or architect in any sector, whether that’s a cloud-native, written-from-scratch startup or an incumbent company that needs to be brought into the 21st century.

Eight years after Chris’ ground-breaking first talk about the microservices architecture, tools and patterns, community knowledge on the topic has grown exponentially. It’s interesting to see how widespread and well-adopted the architecture is and how its applicability suits the needs of modern businesses in the era of agile, team topologies and subscription services, especially as, when managed well, microservices can be a lever for increased engineering agility.

So while it’s a complex area of focus, if your team relies on patterns with a proven track record, allows for observability and takes the time to implement controllability early in the microservices journey, success will be within touching distance.
