FinOps at SMPD

Olivier D'hose
Published in T-Mobile Tech
6 min read · Jul 6, 2020

Cloud resources are free… until they are not.

FinOps — Joel Marchand

Cloud computing is amazing. Resources of all sizes and shapes can be provisioned with a few commands. But at the risk of using a tired trope, with great power comes great responsibility, or at least great cost. By adopting FinOps principles built around four tenets, my team cut cloud costs by 40% while doubling our infrastructure. In this post, I will introduce our process.

Some Background:

The Social and Messaging Product Development (SMPD) team lives within T-Mobile’s contact center organization. Our mission is to develop tools and experiences empowering customers to have the best and most effective interactions with T-Mobile over asynchronous communication modes (such as messaging or social networks). We build tools and a platform used by Customer Care experts to accelerate and optimize every contact with our customers.

Our application landscape is organized around a growing number of microservices, running in containers orchestrated by Kubernetes and augmented by a collection of Amazon Web Services products for data persistence, stream management, encryption, and so on.

Within SMPD, my team focuses on creating and maintaining the platform, tools and guardrails that allow the product teams to innovate, disrupt and optimize business flows with increasing speed and quality. This team is known as the Engineering Efficiency team, or E2. As the custodian of the infrastructure and platform, the E2 team took on the responsibility of introducing FinOps principles as part of the quality markers for each SMPD product. So even as our portfolio of products and services continues to grow and change, the E2 team focuses on maximizing resource utilization by identifying and reducing waste while remaining on the leading edge of technology innovation.

These principles were put to the test during a recent project to achieve multi-region resilience. More on that later…

To bootstrap our FinOps responsibilities, we organized four focus areas:

1. Increase cost awareness across the whole organization (and make it fun!)

2. Provide automated guardrails for resource creation and management.

3. Reduce waste by adopting ephemeral environment principles.

4. Create cost modeling tools at design time.

1. Making cost awareness fun

The first step in our FinOps journey is to make the actual cost of services available to all. True to the spirit of the DevOps model, the operational responsibilities for a product lie fully with the team that creates the product. As such, it is important to make the cost of running a product completely transparent. The T-Mobile Cloud Center of Excellence (CCoE) provides each team using cloud resources with a detailed report on the cost of running every aspect of its applications. From CPUs and memory to network connection costs, the information is available in nearly real time. This kind of invaluable information is too often available only to comptrollers and management. We decided to provide it to everybody on the team, because awareness is the greatest tool for rightsizing resources. Cost is now part of the operational review of our applications. Every month, the E2 team reviews the numbers provided by the CCoE and presents them to the different product teams with suggestions on potential optimization opportunities.

To increase the participation of the product teams, we have also taken a page out of the gamification playbook. We regularly create contests between teams to achieve platform-wide goals. For example, we needed the teams to right-size the resources provisioned for their Kubernetes pods. Our goal was for each container to use 60% of its assigned memory as a baseline for provisioning. Creating a leaderboard showing how each team's portfolio met that goal increased awareness while tickling their competitive streak. It is a simple but very effective way to ensure the teams' participation.

Gamified Resources Utilization Dashboard
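
As a rough illustration of how such a leaderboard could be computed, here is a minimal Python sketch. It is not the dashboard we run: the `ContainerSample` fields, the sample figures, and the scoring rule (share of a team's containers at or above the 60% target) are all assumptions made for the example.

```python
from dataclasses import dataclass

TARGET_UTILIZATION = 0.60  # the 60% memory goal described above


@dataclass
class ContainerSample:
    """One observation of a container (hypothetical input shape)."""
    team: str
    container: str
    memory_used_mib: float      # observed usage
    memory_assigned_mib: float  # provisioned amount


def leaderboard(samples, target=TARGET_UTILIZATION):
    """Rank teams by the share of their containers at or above the target.

    The scoring rule is an assumption for illustration; a real dashboard
    could just as well penalize over-utilization or use averages.
    """
    totals = {}  # team -> (containers meeting target, containers seen)
    for sample in samples:
        met, count = totals.get(sample.team, (0, 0))
        utilization = sample.memory_used_mib / sample.memory_assigned_mib
        totals[sample.team] = (met + int(utilization >= target), count + 1)
    scores = [(team, met / count) for team, (met, count) in totals.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up sample data, only to show the call.
samples = [
    ContainerSample("team-a", "api", 160, 256),
    ContainerSample("team-a", "worker", 100, 512),
    ContainerSample("team-b", "chat", 400, 512),
]
for team, score in leaderboard(samples):
    print(f"{team}: {score:.0%} of containers at or above target")
```

Scoring per container rather than per team average keeps one heavily loaded service from hiding several over-provisioned ones.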

2. Guardrails for resource creation and management

In SMPD, we strongly believe in bringing the decision-making process as close to the individual as possible. However, there are some requirements that are larger than the team. Security standards are a good example. To achieve adherence to organizational standards, the E2 team is designing automated guardrails that implement the patterns necessary for compliance without additional burdens on developers. Our CI/CD pipeline is a constantly evolving product designed to implement standards when they are needed. For example, the pipeline enforces naming conventions and tagging requirements on resources. And when automation is not possible, documentation starting with the “why” of a pattern is available for constant reference. We have affectionately named our documentation system E2Pedia. If the code does not meet our documented standards, the pipeline fails the build and prevents the deployment.
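
To make the idea of a naming and tagging guardrail concrete, here is a small Python sketch of the kind of check a pipeline step could run. The `smpd-` naming pattern, the required tag keys, and the dictionary shape of a resource declaration are hypothetical; they are not our actual standards or pipeline code.

```python
import re

# Hypothetical standards, for illustration only. Real rules would come
# from the team's documented conventions.
NAME_PATTERN = re.compile(r"^smpd-[a-z0-9]+(-[a-z0-9]+)*$")
REQUIRED_TAGS = {"team", "product", "environment"}


def find_violations(resources):
    """Return human-readable violations for a list of resource declarations."""
    violations = []
    for resource in resources:
        name = resource.get("name", "")
        label = name or "<unnamed>"
        if not NAME_PATTERN.match(name):
            violations.append(f"{label}: name does not match convention")
        missing = REQUIRED_TAGS - set(resource.get("tags", {}))
        if missing:
            violations.append(f"{label}: missing tags {sorted(missing)}")
    return violations


def build_exit_code(resources):
    """Return 1 (fail the build) if any violation exists, otherwise 0."""
    return 1 if find_violations(resources) else 0
```

A pipeline step could pass the returned code to the build system, so that a non-compliant declaration stops the deployment before any resource is created.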

3. Ephemeral environments

With the resource creation and management guardrails in place, we move a step closer to ephemeral environments. One of the promises of the Cloud is the ability to create and destroy resources as needed. The ability to create, let's say, a Redis cluster with a few mouse clicks is truly a game changer when it comes to designing disruptive products. Destroying that Redis cluster when we are done with it, however, does not seem to generate the same level of glee. It is also not in the Cloud providers' business model to encourage us to clean up after ourselves. My parents would be proud. Our team is moving towards a completely ephemeral provisioning model. Every Cloud resource required is created on request by the CI/CD pipeline and destroyed either on schedule or with the click of a button. This ensures that resources exist only for as long as they are needed. This seems like an obvious habit to adopt, but it does require us to declare all the infrastructure needed by an application alongside the application itself. Adopting the principles of "Infrastructure as Code" and empowering every developer to declare which resources are needed is a critical step in managing our cloud resources efficiently.
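
The "destroyed on schedule" part can be sketched as a time-to-live check. The Python example below is only an illustration under stated assumptions: the `created_at` and `ttl_hours` fields are a hypothetical record format, and the actual teardown call is left out because it depends on the provisioning tool.

```python
from datetime import datetime, timedelta, timezone


def expired_resources(resources, now=None):
    """Return the names of resources whose time-to-live has elapsed.

    Each resource is a dict with hypothetical keys:
    'name', 'created_at' (timezone-aware datetime) and 'ttl_hours'.
    """
    now = now or datetime.now(timezone.utc)
    expired = []
    for resource in resources:
        deadline = resource["created_at"] + timedelta(hours=resource["ttl_hours"])
        if now >= deadline:
            expired.append(resource["name"])
    return expired


# Made-up inventory, only to show the call.
inventory = [
    {"name": "redis-feature-x",
     "created_at": datetime(2020, 7, 1, 8, 0, tzinfo=timezone.utc),
     "ttl_hours": 24},
    {"name": "redis-feature-y",
     "created_at": datetime(2020, 7, 2, 8, 0, tzinfo=timezone.utc),
     "ttl_hours": 72},
]
check_time = datetime(2020, 7, 3, 8, 0, tzinfo=timezone.utc)
print(expired_resources(inventory, now=check_time))
```

A scheduled job could run a check like this and hand the resulting names to whatever tool created the resources in the first place.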

4. Cost modeling tool at design time

The next step in our FinOps initiative is to provide a cost modeling tool at design time. While we do not advocate for cost to be the sole or even a major design consideration, knowing the cost impact of design choices leads to better decisions. Through a short survey, the tool models the cost of the infrastructure required to run the service. For example, a new application might require 4 new microservices deployed over 35 Kubernetes pods. Each pod is provisioned with 500 millicores of CPU (500m) and 256 MiB of memory. The application also uses a couple of DynamoDB tables and Kafka for streaming. Based on the information collected, we provide a strawman of the costs associated with running the application. The results can be saved for side-by-side comparison with different configurations. This information can also be compared with the results of more rigorous performance testing to validate the design assumptions. This tool is still a work in progress. The feedback we gather from the different teams will make it more accurate over time. Data-driven decisions make for better decisions.
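
The arithmetic behind such a strawman can be sketched in a few lines of Python. This is not our tool: the unit prices below are placeholders rather than real cloud rates, the 500 CPU figure from the example is treated as Kubernetes millicores (an assumption), and DynamoDB and Kafka costs are left out to keep the example short.

```python
HOURS_PER_MONTH = 730  # common approximation of the hours in a month

# Placeholder unit prices, NOT real cloud rates. Replace them with the
# figures that apply to your own account before drawing conclusions.
PRICE_PER_VCPU_HOUR = 0.04
PRICE_PER_GIB_HOUR = 0.005


def monthly_compute_cost(pods, cpu_millicores, memory_mib,
                         vcpu_hour_price=PRICE_PER_VCPU_HOUR,
                         gib_hour_price=PRICE_PER_GIB_HOUR,
                         hours=HOURS_PER_MONTH):
    """Estimate the monthly compute cost of a set of identical pods."""
    vcpus = pods * cpu_millicores / 1000   # 1000 millicores = 1 vCPU
    memory_gib = pods * memory_mib / 1024  # 1024 MiB = 1 GiB
    return (vcpus * vcpu_hour_price + memory_gib * gib_hour_price) * hours


# Side-by-side comparison of the example design and a leaner variant.
baseline = monthly_compute_cost(pods=35, cpu_millicores=500, memory_mib=256)
leaner = monthly_compute_cost(pods=35, cpu_millicores=250, memory_mib=256)
print(f"baseline: {baseline:.2f} per month")
print(f"leaner:   {leaner:.2f} per month")
```

Keeping the prices as parameters makes it easy to rerun the same design against updated rates or against measurements from performance testing.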

Walking the talk:

One of our team's initiatives for 2020 is to improve the resilience of our applications by distributing our services across multiple cloud regions and following an active/active traffic distribution pattern. This meant duplicating our infrastructure and supporting cloud services. To validate our FinOps principles, we set a goal to be fully geo-redundant without increasing our cloud budget for the year. Six months into the year, the current forecast has us achieving that goal!

Conclusions:

Managing the costs associated with an application portfolio is rarely a preoccupation of a traditional development team. We believe, however, that drawing the whole team's attention to the financial aspects of software development empowers us to make better, more conscious decisions about the design, management and maintenance of our products. FinOps is a natural complement to any good DevOps practice.
