Overheard at QCon: Accelerating teams and continuous deployment

McKinsey Digital Insights
Nov 21, 2022

by Rishi Markenday — Expert, McKinsey & Company

For those who have never experienced the wonders of QCon, it’s one of the leading events in the software engineering calendar, focusing on the latest software trends and cutting-edge innovations. The conference took place in San Francisco late last month, and I was fortunate enough to attend and witness first-hand discussions about the future of the software development industry. Some of the talks certainly opened my eyes.

As the title of this article suggests, I’ll be focusing on two areas of inspiration following a couple of insightful sessions I attended during my time at the conference.

Accelerating and scaling up engineering teams

One such session was hosted by the Argentinian eCommerce business Mercado Libre, which describes itself as a “marriage between Amazon and PayPal.” The discussion focused on scaling up engineering teams, with one key motto hammered home: “accelerating teams, controlling organizational entropy.”

The business has been on quite a journey since launching its architecture as a monolith in 2000. Today it runs the Fury platform to enhance developer experience and has been scaling up the number of engineering teams since 2020. On an average day, Mercado Libre is responsible for 10,000 deploys across 26,000 microservices and trains 700 machine learning models.

This journey was a key discussion point during the session: 83% of Mercado Libre’s developers have been with the organization for less than two years. When it had a more modest team, it could get away with providing a single career path in which seniority aligned directly with managerial responsibilities. As Mercado Libre increased its headcount, however, it created a parallel career path where employees could grow their technical skillset without managing other developers. This is built on the premise that experts lead experts, similar to the approach Harvard Business Review suggests is a primary reason for Apple’s success.

In terms of team structure, 85% of developers are positioned within delivery teams responsible for building customer-facing products, while the remaining 15% are part of platform teams. These delivery teams are cross-functional and consist of:

  • Front-end developers
  • Back-end developers
  • Machine learning engineers
  • Native developers
  • Data engineers
  • User experience experts

Meanwhile, Mercado Libre’s platform teams are split into:

  • Cloud engineers
  • Security engineers
  • Machine learning engineers
  • Quality engineers
  • Data analytics engineers
  • Site reliability engineers

The positioning of product managers is particularly interesting, as Mercado Libre doesn’t see the need for a product manager per team. Instead, it prefers all members of its delivery teams to act as “product managers.” The way Mercado Libre approaches this makes sense: there is a UX designer on every team, coaching colleagues to see the customer’s perspective. On average, this means the company has one product manager for every 25–30 engineers, a structure that has served it wonderfully throughout its growth story.

Its management team also strove to ensure that 90% of decisions were made by those closest to the work, with only 10% escalated to leadership and 1% of those treated as “golden nuggets”: strategic pieces of decision-making that executives scrutinize closely because they set the direction of travel for engineering teams.

Alongside the team structure, Mercado Libre has a robust platform that accelerates teams, facilitates scaling up and drives innovation. It takes care of risk scoring, machine learning and cloud computing while reducing the cognitive load on product engineering teams.

Mercado Libre’s Fury platform does exactly that. It is available to all the company’s developers, who can instantiate it with a few commands. The platform prides itself on providing a consistent developer experience, removing complexity, and offering a low-cost service that allows product engineers to innovate at pace. It does so by taking care of the following behind the scenes:

  • Software management through GitHub, Jenkins and Docker
  • Compute services through GCP and AWS
  • Data services through GCP, AWS and Oracle Cloud
  • Service management through New Relic, Opsgenie, Jupyter and Bugsnag
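
Fury’s internals weren’t shown at the session, but to make the “reduced cognitive load” idea concrete, here is a minimal Go sketch of how a platform like it can hide provider choices behind one small interface. Every name below is hypothetical, not Mercado Libre’s actual API.

```go
package main

import "fmt"

// ComputeProvider abstracts whichever cloud actually runs the workload.
// Product engineers never touch GCP or AWS APIs directly; the platform
// team maintains the implementations behind this interface.
type ComputeProvider interface {
	Deploy(service string, replicas int) error
}

// gcpCompute and awsCompute are hypothetical stand-ins for real
// provider-specific implementations owned by the platform team.
type gcpCompute struct{}

func (gcpCompute) Deploy(service string, replicas int) error {
	fmt.Printf("deploying %s (%d replicas) to GCP\n", service, replicas)
	return nil
}

type awsCompute struct{}

func (awsCompute) Deploy(service string, replicas int) error {
	fmt.Printf("deploying %s (%d replicas) to AWS\n", service, replicas)
	return nil
}

// Platform is the single entry point a product engineer sees.
type Platform struct {
	compute ComputeProvider
}

// NewService provisions a service without the engineer choosing a cloud.
func (p Platform) NewService(name string) error {
	return p.compute.Deploy(name, 3) // sensible default owned by the platform
}

func main() {
	// The platform team, not the product team, decides the provider.
	platform := Platform{compute: gcpCompute{}}
	if err := platform.NewService("checkout-api"); err != nil {
		fmt.Println("provisioning failed:", err)
	}
}
```

The point of the sketch is the shape, not the detail: product engineers call one method, and swapping GCP for AWS is a one-line change owned by the platform team.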

One key to the Fury platform’s success outlined during the QCon session was that it followed Mercado Libre’s transition to a microservices-based architecture: the company invested significant time and effort to decouple its architecture before building the developer platform.

We’ve seen a trend towards clients attempting to build ‘platforms’ without paying the required attention to the coupling of their architecture. Unfortunately, it’s a recipe for failure which takes serious work to undo. Instead, organizations should approach decoupling by understanding the domains and bounded contexts within the system architecture.

Fury is a perfect case study of how it should be done, with a modern technology firm considering leadership, teaming, talent and architecture, and understanding that each is key to delivering success. It will be interesting to see whether more businesses take such an approach having witnessed the session.

Is continuous deployment beneficial?

Another hugely popular session I attended at QCon this year was a fantastic talk by representatives from Lyft on its journey to continuous deployment and the struggles it faced along the way to ensuring developers could deploy code effortlessly within minutes.

At this stage it’s of course important to differentiate between continuous deployment and continuous delivery. On this, Jez Humble, author of the seminal work Continuous Delivery, said: “…when we can release on demand at the push of a button, during normal business hours, we are doing continuous delivery.” Simply put, continuous delivery makes every change releasable on demand, while continuous deployment goes a step further and pushes every change that passes the automated checks to production without manual intervention. Jez also believes that “the key thing we should care about is not the form, but the outcomes: deployments should be low-risk, push-button events we can perform on demand.”

Many organizations would envy Lyft’s starting point: its developers could already deploy from staging to canary to production in half a day. It’s commendable that the business continued improving regardless, outlining the following four pain points:

  1. Deploying took half a developer’s day
  2. Many commits were deployed at once
  3. Those commits came from many different authors, each changing different components
  4. So many uncoordinated changes often made it difficult to figure out what was happening during incidents

Lyft started tackling this by creating a Deploys Team, solely accountable for reaching continuous deployment and reclaiming developers’ time. The key here was empowering a multi-disciplinary team to own the entire deployment thought process. This aligns perfectly with our advice to clients on how to motivate talented engineers: provide autonomy and a clear sense of purpose.

Alongside the need to reclaim its developers’ time, Lyft supplemented the Deploys Team objective with three design tenets:

  1. Automated — remove as much manual intervention as possible
  2. Scalable — allow teams to add integration checks and alerts to govern deployments efficiently
  3. Responsive — the system should act quickly, while developers still have the mental context of their changes

This approach gave Lyft’s developers strong foundations to build on, and keeping deploy sizes small became a key metric. Lead time to production is one of the famous DORA metrics we strongly advise most clients to keep an eye on; however, because roadmaps and feature complexity varied by team, and because Lyft started from a stronger launch pad than most, it could afford not to monitor its lead time to deploy that closely.

It was also revealed that Lyft’s Deploys Team built its deploy system in-house, using the following tech stack:

  • A Go backend and a React.js frontend
  • PostgreSQL for both online storage and analytics queries
  • REST APIs:
    • DeployAPI tracked the deployed state
    • DeployView let developers interact with that state
    • AutoDeployer continuously deployed when it was safe to do so
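
Lyft didn’t share implementation details beyond this stack, but as a rough, hedged sketch of what a Go service like DeployAPI might look like when exposing deploy state over REST, consider the following. The endpoint, struct and field names are my own assumptions, not Lyft’s schema.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Deploy records which build of a service is live in which environment.
// Field names are illustrative; Lyft's actual schema wasn't shown.
type Deploy struct {
	Service string `json:"service"`
	Build   string `json:"build"`
	Env     string `json:"env"` // staging, canary or production
	State   string `json:"state"`
}

func main() {
	// In the real system this state would live in PostgreSQL;
	// an in-memory slice keeps the sketch self-contained.
	deploys := []Deploy{
		{Service: "rides", Build: "abc123", Env: "production", State: "success"},
	}

	// GET /deploys returns the currently tracked deploy state as JSON.
	http.HandleFunc("/deploys", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(deploys)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```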

When questioned on this approach, Lyft stated that building a tool in-house worked well because it allowed full customization. It also highlights the level of empowerment offered to the Deploys Team and proves that a team with autonomy and a sense of purpose can produce incredible results.

The fundamental data model of Lyft’s deployment system is the job. A few jobs come together to form a pipeline, and each pipeline goes through several states.
Lyft has no rollbacks built into the model and no rollback state in its state machine. Instead, a developer can trigger a new deployment with an earlier build, which avoids concurrency issues; if Lyft has to roll back, it pauses the pipeline first.
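
To make that model concrete, here is a minimal Go sketch of jobs, pipelines and states, including a “rollback” that is really just a fresh deployment of an older build. Lyft’s actual types weren’t published; every name below is an assumption.

```go
package main

import "fmt"

// State models a pipeline's lifecycle. The exact set of states in
// Lyft's system wasn't listed; these are plausible placeholders.
type State string

const (
	Waiting State = "waiting"
	Running State = "running"
	Success State = "success"
	Failed  State = "failed"
	Paused  State = "paused"
	// Deliberately no Rollback state: rolling back is just a new
	// deployment of an earlier build, which avoids concurrency issues.
)

// Job is the fundamental unit; a few jobs form a Pipeline.
type Job struct {
	Name  string
	Build string
}

type Pipeline struct {
	Jobs  []Job
	State State
}

// RollBack pauses the current pipeline and returns a new pipeline that
// deploys the earlier build, rather than mutating the current one.
func RollBack(current *Pipeline, earlierBuild string) Pipeline {
	current.State = Paused
	jobs := make([]Job, len(current.Jobs))
	for i, j := range current.Jobs {
		jobs[i] = Job{Name: j.Name, Build: earlierBuild}
	}
	return Pipeline{Jobs: jobs, State: Waiting}
}

func main() {
	p := Pipeline{Jobs: []Job{{Name: "deploy-canary", Build: "v42"}}, State: Running}
	replacement := RollBack(&p, "v41")
	fmt.Printf("old pipeline: %s, replacement deploys %s\n", p.State, replacement.Jobs[0].Build)
}
```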

Lyft also revealed a sample checklist that helps it determine whether a pipeline can pass from a waiting state to a running state:

  • Has the revision passed continuous integration tests?
  • Have business metrics been affected?
  • Have integration tests passed?
  • Has bake time passed?
  • Are any service alerts firing?
  • Would this cause an unintended rollback?
  • Is the owning team present and working?
  • Is the job more than 14 days old?

This is an excellent checklist for moving from one state to another, although if you are considering a similar approach, I recommend adding as many non-functional requirement tests as you need, along with code linting and coding-standard checks.
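
As a minimal sketch of how such gates might be evaluated before a pipeline moves from waiting to running: the check names mirror the list above, but the structure and stubbed results are my assumptions, not Lyft’s code.

```go
package main

import "fmt"

// Check is one gate a pipeline must clear to move from waiting to running.
type Check struct {
	Name string
	Pass func() bool
}

// CanRun reports whether every gate passes, returning the first failure.
func CanRun(checks []Check) (bool, string) {
	for _, c := range checks {
		if !c.Pass() {
			return false, c.Name
		}
	}
	return true, ""
}

func main() {
	// Stubbed results stand in for real CI, metrics and alerting queries.
	checks := []Check{
		{"CI tests passed", func() bool { return true }},
		{"business metrics unaffected", func() bool { return true }},
		{"integration tests passed", func() bool { return true }},
		{"bake time elapsed", func() bool { return true }},
		{"no service alerts firing", func() bool { return false }},
		{"no unintended rollback", func() bool { return true }},
		{"owning team present and working", func() bool { return true }},
	}

	if ok, failed := CanRun(checks); !ok {
		fmt.Println("pipeline stays in waiting; failed gate:", failed)
	} else {
		fmt.Println("pipeline can move to running")
	}
}
```

One design benefit of modelling gates this way is that teams can register their own checks without touching the deployer itself, which matches the “scalable” tenet above.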

Lyft closed the session by outlining key lessons it learned about continuous deployment along its journey:

  • A less delicate rollout would have been fine: its developers were excited by the prospect of continuous deployment, and Lyft was working to make their lives easier
  • Not having automated rollbacks worked well for Lyft
  • Previously, it was easy for ineffective changes to ship to production; continuous deployment has significantly reduced such errors
  • Continuous deployment has made developers’ lives easier and given them more confidence to make changes across projects
  • Continuous deployment reduced Lyft’s exposure to CVEs: a CVE announcement used to result in a scramble, but now fixes flow through the continuous deployment pipeline with little manual work

At McKinsey, we regularly tell clients that continuous deployment is time-intensive to introduce but worth the investment. Lyft’s journey lasted a few years, but it has resulted in empowered engineers with the latitude and space to innovate at pace. With backing from senior leadership teams, it pays off in the long run.

Those are my key takeaways from just two brilliant sessions I attended at QCon 2022 on behalf of McKinsey. My colleagues will be sharing their own insights from other QCon talks in the coming weeks, so watch this space.
