Metrics Matter: How Data Shaped Our Patch Manager Product

Mesut Saran
Published in Trendyol Tech
6 min read · Jun 27, 2024

I’d like to share how we at Trendyol used data-driven insights to transform Patch Manager, our product for managing patching on Linux machines.

The Platform Core team, responsible for Infrastructure as Code (IaC), maintains critical Terraform modules for the OpenStack, vCloud, and Alibaba Cloud providers. Beyond this, the team builds core solutions for the needs of other teams. One of these is a patch management mechanism for Linux systems.

Background

We manage over 40K VMs in Trendyol’s infrastructure, running various application modules such as Kubernetes, Couchbase, Elasticsearch, and PostgreSQL.

Teams within the infrastructure are responsible for patching the products they manage. Since patch management concerns all teams, the Platform Core team worked on a patch manager product that meets the needs of all teams.

2022, Integrating Uyuni

We began integrating Uyuni into our infrastructure. Uyuni is an open-source project supported by openSUSE that builds on SaltStack’s configuration management and automation capabilities.

Throughout 2022, the Platform Core team completed the necessary implementations and infrastructure setups.

Early 2023: Challenges and Realizations

At the beginning of 2023, we were concerned about the product’s release readiness. Our concerns were:

  • There was no role and user management mechanism, which we needed in order to reduce the risks of letting other teams use the product.
  • We still had to complete complex automation and rollback capabilities.

Status From The Product Management Perspective

The initial data showed the following:

  • Very few users were using the product, even though it had been in development for over six months.
  • The team had been hesitant to announce the product because of its shortcomings.
  • The product’s roadmap contained many items, but priorities were set by the team rather than the customers.

Roadmap Creation

To properly follow the Uyuni product flow, we created a roadmap focused on three main areas:

  • Customer Acquisition: We believed the product had reached its MVP stage. Opening it up to customers would let us evaluate what we had built, prioritize new features, and address shortcomings based on feedback.
  • Infrastructure Expansion: Expanding the product to other data centers and infrastructure environments would allow us to reach more customers.
  • Feature Development: Enhancing the product with features such as authx, automation, and an API to create a reliable and controlled product.

Metrics for Success

To track the product’s progress, we began monitoring the following metrics:

  • Number of Organizations: How many teams have adopted the product.
  • Total Patch Count: The number of patches applied via Uyuni. The fundamental metric showing product usage.
  • Number of VMs with Uyuni Installed: Measures the product’s spread across environments.
  • Number of VMs: Helps calculate potential patch numbers.
  • VM Penetration: The ratio of VMs with Uyuni installed to total VMs.
  • Federations (where Uyuni is available): The product’s growth across separate environments.
  • Availability (SLI) (Last 7 days): The service health over the last 7 days.
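
The two ratio metrics in this list can be computed from raw counts. The following is a minimal Python sketch, not our actual monitoring code: the class, field, and function names are hypothetical, the availability formula (successful health checks divided by total checks) is an assumption about how such an SLI could be defined, and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class QuarterSnapshot:
    """Raw counts collected for one quarter (all field names are hypothetical)."""
    organizations: int
    patches_applied: int
    vms_with_uyuni: int
    total_vms: int


def vm_penetration(snapshot: QuarterSnapshot) -> float:
    """Ratio of VMs with Uyuni installed to total VMs, in the range 0..1."""
    if snapshot.total_vms == 0:
        return 0.0
    return snapshot.vms_with_uyuni / snapshot.total_vms


def availability_sli(successful_checks: int, total_checks: int) -> float:
    """Share of successful health checks in a window (assumed SLI definition)."""
    if total_checks == 0:
        return 0.0
    return successful_checks / total_checks


# Illustrative numbers only, not real production data.
snapshot = QuarterSnapshot(
    organizations=4,
    patches_applied=70_000,
    vms_with_uyuni=19_000,
    total_vms=40_000,
)
print(f"VM penetration: {vm_penetration(snapshot):.1%}")
print(f"Availability (last 7 days): {availability_sli(10_070, 10_080):.2%}")
```

Keeping the raw counts in one snapshot per quarter makes it easy to compare quarters later without recomputing anything.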

The Security Vulnerability team assigns remediation projects quarterly to each team according to their VM responsibility. Accordingly, each team performs patching based on the importance of vulnerabilities. Therefore, we decided to track the above metrics “quarterly”.

Achievements and Improvements

During this process, we:

  • Conducted controlled product demos and presentations for customer acquisition.
  • Shared best practices and infrastructure needs for using the system.
  • Prepared written and visual information resources, creating knowledge base pages on the wiki in addition to Zoom recordings.
  • Held regular monthly meetings with teams to share updates and receive feedback.

By the end of Q2 2023:

  • Four organizations consisting of seven teams were using the product.
  • Over 70K patches were applied through Uyuni.
  • With over 19K VMs available for Uyuni, we were on a good path toward 300K potential patches, but more work was needed to increase usage.

Focus in Q3 and Developing Pacman

In Q3, we focused on making the product work in other environments and on version upgrades.

Although regular meetings with customers did not indicate major issues, our metrics showed a different picture by the end of Q3.

Even though another team had joined, only slightly more than 4K patches were applied through the Uyuni UI in Q3. That was below expectations and far less than both our Q2 figure and the potential patch count.

Product usage had a serious problem.
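
The size of this drop is easy to quantify. The sketch below is a hypothetical Python helper, not part of our tooling: it treats the roughly 70K patches reported by the end of Q2 as the baseline and the roughly 4K patches in Q3 as the current value, and the 30% alert threshold is an arbitrary assumption.

```python
def quarter_over_quarter_change(previous: int, current: int) -> float:
    """Relative change between two quarters; -0.5 means a 50% drop."""
    if previous == 0:
        raise ValueError("previous quarter must be non-zero")
    return (current - previous) / previous


def usage_alert(previous: int, current: int, threshold: float = -0.30) -> bool:
    """True when usage fell by more than the (assumed) threshold."""
    return quarter_over_quarter_change(previous, current) < threshold


# Rounded figures from the text: about 70K patches as the Q2 baseline, about 4K in Q3.
q2_patches, q3_patches = 70_000, 4_000
change = quarter_over_quarter_change(q2_patches, q3_patches)
print(f"Q3 vs Q2: {change:.0%}")
print("Investigate usage drop:", usage_alert(q2_patches, q3_patches))
```

A check like this surfaces a problem from the usage data itself, even when customer meetings report nothing unusual.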

Identifying and Addressing Issues

We paused all roadmap work to focus on identifying the root cause of the problem.

Through interviews and anonymous feedback forms, we gathered insights on the difficulties teams faced with the product, the alternative methods or products they used, and their patching frequency.

Key Issues:

  • Batch patch operations in Uyuni took a long time, and their status was difficult to track.
  • Teams achieved faster operations with alternative systems such as Ansible or SaltStack after a few optimizations.
  • Uyuni performed well with a few hundred servers but struggled with thousands.

Decision to Separate Management

Because Uyuni’s management shortcomings directly affected users, we decided to separate the management layer from Uyuni.

This allowed us to address user needs more easily in the background.

Pacman: A New Solution

We decided to provide the frontend outside of Uyuni while using Uyuni as the backend system.

This way, we could solve Uyuni-specific issues as a team without involving users.

We developed a new solution in about a month, naming it “Pacman” to reflect its role in managing packages and patching vulnerabilities.

Terminal User Interface of Pacman

Pacman incorporates:

  • A terminal user interface (TUI) that gives users a single point of access across environments, so they no longer need to log into different systems for different environments.
  • A pooling mechanism to handle bulk patch operations efficiently. That enabled thousands of patches to be applied at once.
  • Authorization and filtering capabilities to ensure teams only managed their assets, preventing potential errors.
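
To illustrate how the last two items can fit together, here is a minimal Python sketch. It is not Pacman’s actual implementation: it reads “pooling” as a bounded worker pool, and the `VM` class, the `apply_patch` function, and the team names are all hypothetical stand-ins. A real version would call the patching backend instead of returning a string.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    owner_team: str


def filter_owned(vms: list[VM], team: str) -> list[VM]:
    """Authorization filter: keep only the VMs that belong to the requesting team."""
    return [vm for vm in vms if vm.owner_team == team]


def apply_patch(vm: VM) -> str:
    """Hypothetical stand-in for a call to the patching backend."""
    return f"patched {vm.name}"


def patch_in_pool(vms: list[VM], team: str, max_workers: int = 8) -> dict[str, str]:
    """Patch a team's VMs through a bounded worker pool and collect per-VM status."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(apply_patch, vm): vm for vm in filter_owned(vms, team)}
        for future in as_completed(futures):
            vm = futures[future]
            try:
                results[vm.name] = future.result()
            except Exception as exc:  # one failed VM should not stop the batch
                results[vm.name] = f"failed: {exc}"
    return results


inventory = [VM("db-01", "data"), VM("db-02", "data"), VM("web-01", "frontend")]
print(patch_in_pool(inventory, team="data"))
```

Filtering before submitting work means a team can never patch another team’s VMs by mistake, and the per-VM result map addresses the status tracking difficulty that users reported.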

We piloted Pacman with users, collecting feedback and making updates.

The response was overwhelmingly positive.

Results

  • Over 120K patches were applied within two weeks during the pilot. By comparison, about 90K patches were applied via Uyuni in all of 2023.
  • With the addition of more teams, the total quickly surpassed 305K patches.
  • We aim to reach 1M patches with Pacman by the end of the year.

To enhance real-time monitoring of the product, we’ve added new metrics and track them on a Grafana dashboard.

[Figure: Grafana dashboard. Some data has been anonymized as it contains company-specific information.]

Lessons Learned

  1. Adaptability: Product needs constantly evolve. Defining metrics and updating objectives are crucial.
  2. Data-Driven Decisions: Customers might take time to reveal the truth, but real-time data can reveal the true status of your product. Addressing specific issues directly with customers helps solve problems.
  3. Pivoting the Product: Successful product management means being willing to partially or completely shut down the product if necessary. It goes hand in hand with adaptability.
  4. Collaboration: Involving customers and stakeholders in every process leads to better outcomes.
  5. Quality over Speed: Sustainable success comes from growing correctly, not quickly.

Through this journey, we’ve transformed our patch manager product and learned invaluable lessons that will guide our future.

Special Thanks 🙏

I want to extend my heartfelt gratitude to Enis Kollugil for his exceptional contributions and customer-oriented mindset throughout the development process.

Additionally, huge thanks to the Platform Core team🥇. Their expertise, hard work, and collaborative spirit have been instrumental in shaping and refining this product.

About Us

Want to be a part of our growing company? We’re hiring! Check out our open positions and other media pages from the links below.

Mesut Saran
Trendyol Tech

product manager, technology enthusiast, and mentor