#2 Making a Tech platform… Present.

9 min read · Dec 21, 2022

What we wanted to build and how we did it.

I’m Clement Hussenot, a passionate technophile. I was a Site Reliability Engineer, then Platform Engineering Manager at ManoMano. Here I’ll share a little of what happened backstage and what I was able to do with all the talented people I met. I’ll also explain how I became a manager in order to create and lead a team that guides the whole tech organization (~500 engineers) in real time on the state of production and of our platform.

This article is a sequel to a previous one, which explains how we switched to a modern, cloud-based infrastructure.

We migrated… what do we do now?

…We suspect that not all things work like they used to… there are certainly some failures and not-so-cool things to come… but we don’t know where to start to fix it and keep going…

Hear the heartbeat of the platform

Platform engineering is a complex and multidisciplinary field that requires a wide range of skills and expertise. At ManoMano, roughly a hundred people have been involved in designing, building, and maintaining our platform.
The platform is the playground for approximately four hundred engineers who deploy the services that make up our websites and mobile applications. That’s a lot of change in production, every day. There are many risks of breakage, in addition to the migration we just did…

How do we keep the platform stable and safe?

Building the best operations team of experts with a common mission…

One way to ensure the stability and performance of the platform in a constantly changing organization is to focus on building a strong and effective team of platform engineers. This team should be composed of talented and experienced individuals who have a strong understanding of the technical, business, and organizational aspects of platform engineering.

Experts to help engineering teams.
This type of team can bring together a diverse set of perspectives and skills, and can provide the depth of knowledge and experience needed to effectively manage the life of the technical platform. It is this team that learns on the front line and then passes that knowledge on to the other engineering teams, so that they are self-sufficient and in control of their applications in production. This team is in charge of finding ways to observe the behavior of the platform and to track operating costs, performance, and service level objectives.

Additionally, this cross-functional team can be more responsive than a larger team, and can more easily adapt to changing requirements and priorities. This can be especially important in the fast-paced and dynamic world of startups, where business needs and the technology landscape can change rapidly, but also of course when managing production incidents.

Strong expertise that differs but works together!
This small cross-functional team needs to foster closer collaboration and communication across the organization, and promote a culture of continuous learning and improvement by organizing Dojos and other learning sessions. It is essential to promote a positive and supportive work environment that encourages team members to share ideas and knowledge, and to learn from each other.

How to build this critical team?

To build the best operations teams for platform engineering, it is important to focus on several key areas:

Hire and retain talented and experienced individuals who have a strong understanding of the technical and organizational aspects of platform engineering.
Provide them with the training and support they need to develop their skills and knowledge. This can include training on technical tools and techniques, as well as on soft skills such as communication, collaboration, and problem-solving.
Establish clear roles and responsibilities within the team, and provide clear guidance and direction on how to prioritize and manage work. This can help to ensure that team members are aligned on the goals and objectives of the team, and can work together effectively to achieve them.

Starting in 2020, I found myself gradually building this team of staff- and senior-level engineers around the observability chapter.

Strong expertise… divided into five chapters!

Currently, this ManoMano team (the Pulse team) is half composed of site reliability engineers, who respond to incidents and manage on-call, cloud capacity, and costs. It is also composed of software engineers, who provide high-quality in-house solutions that meet the needs of the platform users. In two years the team has grown and become the owner of five chapters, all aimed at guaranteeing the stability and safety of the platform.

Observability
Chaos Engineering
Incident Management
FinOps
Web performance

Having already talked about observability in a previous article, I will simply describe below what each of the other chapters brings…

Chaos engineering

Chaos engineering is the practice of intentionally introducing failures and disruptions into a system in order to test its resilience and identify potential weaknesses. By simulating real-world failures and disasters, chaos engineering allows teams to evaluate how their systems respond to and recover from such events, and to identify and address any issues or vulnerabilities that may be present. This practice is only possible if careful observability of the platform has been set up. Because chaos engineering without observability is just… CHAOS!

One of the main benefits of chaos engineering is that it can help teams to build more resilient and reliable platforms. By identifying and addressing potential failures and vulnerabilities, chaos engineering can help teams to prevent outages and downtime, and to ensure that their systems continue to function properly even in the face of unexpected challenges. It is possible to integrate resilience tests in your continuous integration chain, in order to target your different environments or deployments. We often couple this type of experimentation with load tests. This allows us to learn and understand the limits of the systems we have built. For example, what happens if we have more than 1000 people on a vendor’s backoffice dashboard and we remove a replica from the database?
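The replica question above can be framed as a small experiment: state a steady-state hypothesis, inject the failure, and check whether the hypothesis still holds. The sketch below is a toy in-memory simulation, not the tooling we used at ManoMano. The `Replica` class, the replica names, the capacity figures, and the 1000-user load are all hypothetical values chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Replica:
    """A hypothetical read replica with a fixed capacity (concurrent users)."""
    name: str
    capacity: int


def steady_state_holds(replicas: list[Replica], concurrent_users: int) -> bool:
    """Steady-state hypothesis: the available replicas can absorb the load."""
    return sum(r.capacity for r in replicas) >= concurrent_users


def run_experiment(replicas: list[Replica], victim: str, concurrent_users: int) -> dict:
    """Check the hypothesis before and after removing one replica."""
    before = steady_state_holds(replicas, concurrent_users)
    survivors = [r for r in replicas if r.name != victim]
    after = steady_state_holds(survivors, concurrent_users)
    return {
        "before": before,
        "after": after,
        "survivors": [r.name for r in survivors],
    }


# Illustrative setup: three replicas of 400 users each, 1000 users on the dashboard.
replicas = [
    Replica("db-replica-1", 400),
    Replica("db-replica-2", 400),
    Replica("db-replica-3", 400),
]
result = run_experiment(replicas, victim="db-replica-2", concurrent_users=1000)
print(result)
```

In this made-up setup the hypothesis holds with three replicas and breaks with two, which is exactly the kind of limit a real experiment, coupled with a load test, is meant to reveal before production does.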

Additionally, chaos engineering can help to foster a culture of resilience and continuous improvement within teams. By encouraging teams to regularly test and challenge their systems, chaos engineering can help to build a sense of ownership and responsibility for the platform, and can motivate teams to proactively identify and address potential issues before they become major problems. To that end, the ManoMano Pulse team I was in charge of regularly ran playful events called GameDays to gather colleagues around chaos experiments. If you want to understand how to organize one, it is explained here.

Incident Management

Incident management refers to the processes and practices used to respond to and resolve unexpected events or failures that impact the availability or performance of a system. This typically involves identifying the cause of the incident, implementing a plan to resolve the issue, and communicating with stakeholders about the incident and its resolution.

Good incident management practices can help to prevent outages and financial damages in several ways.

First, by having a well-defined and rehearsed incident response plan in place, teams can quickly and effectively respond to incidents and minimize their impact. This can help to reduce the duration and severity of outages, and can prevent minor issues from escalating into major problems. There are many solutions on the market now, and Netflix even has an open source solution that provides a clear process for your employees. We opted for an in-house solution that perfectly meets our needs and works with Slack.

Second, by conducting thorough post-incident reviews, or "post-mortems", teams can learn from past incidents and identify ways to prevent similar issues from occurring in the future. This can help to improve the overall reliability and stability of the system, and can reduce the likelihood of future outages. We write post-mortems for serious incidents, and also for chaos engineering experiments when the lessons learned are worth sharing with our colleagues.

Third, effective incident management can help to minimize the financial impact of outages by reducing the amount of downtime and ensuring that the system is restored to full functionality as quickly as possible. This can help to minimize lost revenue and other financial damages, and can help to protect the reputation and credibility of the organization. Everyone in the industry knows about the Atlassian outage this year.
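To give an order of magnitude for the financial impact, a rough estimate multiplies the downtime by the revenue normally earned per minute and by the share of traffic affected. This is a back-of-the-envelope sketch: the function name and every figure in it are hypothetical, and it ignores indirect costs such as reputation.

```python
def estimated_lost_revenue(downtime_minutes, revenue_per_minute, affected_share=1.0):
    """Rough lost-revenue estimate for an outage.

    affected_share is the fraction of traffic impacted (0.0 to 1.0).
    """
    if not 0.0 <= affected_share <= 1.0:
        raise ValueError("affected_share must be between 0 and 1")
    return downtime_minutes * revenue_per_minute * affected_share


# Hypothetical incident: 45 minutes, half of the traffic affected,
# 200 (in any currency) of revenue per minute in normal conditions.
loss = estimated_lost_revenue(45, 200, affected_share=0.5)
print(loss)
```

Even with invented numbers, the formula shows why shortening the time to recovery pays off linearly: halving the downtime halves the estimated loss.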

Overall, good incident management practices and post mortems are essential for preventing outages and minimizing the financial impact of failures, and are critical for building and maintaining reliable and resilient systems. If you are not already doing it, I invite you to start doing it with your teams ❤.

Financial operations

Financial operations, also known as FinOps, is the practice of managing and optimizing the financial performance of a business: https://www.finops.org/introduction/what-is-finops/

In the context of platform engineering, finops practices can help teams to spot issues in their technical platform by providing visibility into the financial impact of the platform and its operations. The practice makes sense in a tech industry that now uses mostly the cloud… With a pay-as-you-go pricing model… the dollars can quickly fly away!

By tracking key metrics such as revenue, costs, and profitability, finops can help teams to identify areas where the platform is under-performing or generating unexpected expenses (high costs of outgoing traffic, storage, short TTL on communication between services…). For example, finops can help teams to spot instances where the platform is consuming excessive resources or generating unplanned costs, and can provide the data needed to understand the root cause of the issue and take corrective action.
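One simple way to spot unplanned costs is to compare each day's spend with the average of the preceding days. Here is a minimal sketch, assuming the daily cost figures have already been exported from the cloud provider's billing data. The function name, the 7-day window, and the 1.5x threshold are illustrative choices, not a recommendation.

```python
from statistics import mean


def find_cost_spikes(daily_costs, window=7, threshold=1.5):
    """Return the indices of days whose cost exceeds `threshold` times
    the mean cost of the previous `window` days."""
    spikes = []
    for day in range(window, len(daily_costs)):
        baseline = mean(daily_costs[day - window:day])
        if baseline > 0 and daily_costs[day] > threshold * baseline:
            spikes.append(day)
    return spikes


# Hypothetical daily spend in dollars; the day at index 8 contains a sudden jump.
costs = [100, 102, 98, 101, 99, 103, 100, 104, 180, 105]
print(find_cost_spikes(costs))
```

A flagged day is only a starting point: the root cause (outgoing traffic, storage, a chatty service) still has to be found in the detailed billing and observability data.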

Additionally, finops can help teams to identify opportunities for improving the financial performance of the platform. Platforms like AWS regularly offer improvements, with prices that drop year after year… you have to keep up. One example: change your volumes from gp2 to gp3. By analyzing the cost-benefit tradeoffs of different technologies or deployment strategies, finops can help teams to make informed decisions that optimize the financial performance of the platform while still meeting the needs of the business.
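To make the gp2 to gp3 example concrete, the storage part of the saving can be estimated with simple arithmetic. The per-GB prices below are placeholders passed in as parameters, not quoted AWS prices: check the current EBS pricing for your region. The sketch also ignores any extra IOPS or throughput you might provision on gp3, and the volume sizes are invented.

```python
def migration_saving(volume_sizes_gb, gp2_price_per_gb, gp3_price_per_gb):
    """Estimate monthly storage cost before and after moving volumes
    from gp2 to gp3. Returns (gp2_cost, gp3_cost, saving)."""
    total_gb = sum(volume_sizes_gb)
    gp2_cost = round(total_gb * gp2_price_per_gb, 2)
    gp3_cost = round(total_gb * gp3_price_per_gb, 2)
    return gp2_cost, gp3_cost, round(gp2_cost - gp3_cost, 2)


# Hypothetical fleet of three volumes and placeholder prices per GB-month.
volumes = [500, 1000, 250]
gp2_cost, gp3_cost, saving = migration_saving(
    volumes, gp2_price_per_gb=0.10, gp3_price_per_gb=0.08
)
print(gp2_cost, gp3_cost, saving)
```

Running this kind of estimate across the whole fleet turns a vague "gp3 is cheaper" into a number that can be prioritized against other work.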

Overall, finops practices can provide valuable insights and data that can help teams to spot and address issues with their technical platform, and to optimize its financial performance. In 2023 it is critical to adopt this kind of behavior.

Web performance

Web performance is the practice of optimizing the speed, reliability, and efficiency of web-based applications and services. In the context of platform engineering, web performance is important because it directly impacts the user experience and the overall effectiveness of the platform.

By improving web performance, teams can better serve their users by providing faster and more responsive web-based applications and services. This can improve user satisfaction, increase engagement, and drive business results. Tracking web vitals (https://web.dev/vitals/) is essential, especially for a marketplace like ours that wants its entire catalog to be found on search engines like Google.
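Web vitals are usually assessed at the 75th percentile of real-user measurements and compared against published thresholds. The sketch below rates Largest Contentful Paint (LCP, in seconds) and Cumulative Layout Shift (CLS) samples using the thresholds documented on web.dev: LCP is good up to 2.5 s and poor above 4 s, CLS is good up to 0.1 and poor above 0.25. The function names and the sample values are hypothetical, and the percentile uses a simple nearest-rank method.

```python
import math

# (good_limit, poor_limit) per metric, as documented on web.dev.
THRESHOLDS = {
    "LCP": (2.5, 4.0),   # seconds
    "CLS": (0.1, 0.25),  # unitless score
}


def percentile_75(samples):
    """75th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(0.75 * len(ordered))
    return ordered[rank - 1]


def rate(metric, samples):
    """Rate a metric as 'good', 'needs improvement' or 'poor' at p75."""
    good_limit, poor_limit = THRESHOLDS[metric]
    p75 = percentile_75(samples)
    if p75 <= good_limit:
        return "good"
    if p75 <= poor_limit:
        return "needs improvement"
    return "poor"


# Hypothetical real-user samples for one page.
lcp_samples = [1.8, 2.1, 2.4, 3.0, 2.2, 1.9, 2.6, 2.0]
cls_samples = [0.05, 0.3, 0.12, 0.2]
print(rate("LCP", lcp_samples), rate("CLS", cls_samples))
```

Rating at the 75th percentile rather than the average keeps a handful of very fast visits from hiding a slow experience for a quarter of the users.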

Additionally, improving web performance can help teams to save money by reducing the cost of operating the platform. We avoid wasting resources and improve user satisfaction by making sure that users get a fast experience with minimal waiting on the website. For example, by optimizing the use of resources such as CPU, memory, and bandwidth, teams can reduce the amount of resources required to serve a given number of users, which naturally leads to cost savings.

Finally, improving web performance can also help teams to challenge the architecture of their technical platform. By identifying and addressing performance bottlenecks and other inefficiencies, teams can design and build more efficient and scalable systems that can support the growing demands of their users and business. Profiling tools are key here: they make fine-tuning possible once a problem is detected.

Overall, web performance is a key discipline for teams building and maintaining technical platforms, as it can help to better serve users, save money, and improve the overall architecture of the platform.

Let’s conclude this long article…

Implement good practices and tools for incident management, observability, chaos engineering, and other key disciplines that are essential for building and maintaining high-quality technical platforms. This can help to ensure that the team has the tools and processes in place to respond quickly and effectively to any issues that may arise, and to continuously improve the reliability of the platform.

Overall, building the best operations teams for platform engineering requires a combination of hiring the right people, establishing clear roles and responsibilities, fostering a positive and collaborative culture, and implementing tools for managing and improving the technical platform.

If you found this article helpful or interesting, please consider sharing it with your friends or colleagues, or contacting the author to learn more. I would love to hear from you! — Clement Hussenot