How to measure CO2 emissions of a Deep Learning model.

Maria Slanova
Schibsted engineering
Jan 28, 2022 · 9 min read

Nils Törnblom, Maria Slanova

What

In this blog post we want to share the work we’ve done on estimating the carbon footprint of one of our machine learning pipelines. This project started as a small investigation that grew into a series of experiments and collaborations between different teams across Schibsted.

We’re writing this post to share our learnings and results and to encourage fellow engineers in the machine learning community to explore ways of building more environmentally friendly AI solutions.

Our findings show that analysing the carbon footprint and putting forward relevant improvements can lead both to decreased environmental impact and reduced infrastructure costs.

Why

We started looking into this problem after reading a series of articles analysing the carbon emissions of some well-known AI solutions and describing the negative carbon impact of building and running big AI systems. Some of these articles mention actionable recommendations for reducing the carbon footprint, to name a few: switching to greener regions and more efficient hardware¹, using efficient algorithms, performing a cost-benefit analysis of AI solutions, and reporting the training times of models².

With all these ideas in mind, we set out to estimate the current CO2-equivalent emissions of our pipelines and find out whether they should and could be reduced.

For the sake of this article, we limited the scope of the experiment to estimating the carbon footprint of a single run of one of our deep learning pipelines.

Both the code and the data used for this experiment can be found in our open GitHub repository³.

How

Img.1 Pipeline stages

Our Deep Learning pipeline models and predicts user age based on behavioural data collected across the majority of Schibsted brands. The pipeline consists of multiple stages, as shown in Img. 1. Each stage is responsible for a specific pipeline task and requires certain compute resources. The pipeline is built using multiple programming languages and frameworks. Most stages run on AWS General purpose instances (like m5.8xlarge and m4.4xlarge), while the model training stage runs on an AWS Accelerated computing instance (g4dn.4xlarge). We run the pipeline in parallel for 3 different countries — Norway, Sweden and Finland. Each country comes with a different data size, resulting in a varying pace of pipeline execution.

To estimate carbon emission equivalent for our Deep Learning pipeline, we use multiple data sources:

  1. CPU and RAM usage of our Kubernetes workloads per EC2 instance type.
  2. Estimated power consumption profiles for EC2 instance types from Teads Engineering Dataset⁴, including datacenter PUE and Scope 3 CO2 emission equivalent. The Scope 3 estimate is based on limited information about the hardware manufacturing emissions, amortized over 4 years⁴.
  3. GPU power usage (for the training stage only), estimated using the CodeCarbon⁵ package.

Step 1. Utilization metrics (CPU and RAM)

Each stage of the pipeline runs within a Docker container, and in our pipeline each pod runs only one container. Every pod is scheduled onto a specific Kubernetes node, which corresponds to one EC2 instance. The utilization metrics (CPU, RAM) are extracted from Datadog (our service monitoring solution) via a query over the relevant time period, followed by an export of the data to CSV.

Our Datadog query is based on two metrics from the Datadog Kubernetes agent⁶: kubernetes.memory.rss and kubernetes.cpu.usage.total. Since pods from other workloads can run on the same instance, we only keep the pods from our deep learning pipeline by applying label filters in the Datadog query. We get the vast majority of usage metrics at a 2-minute sample period for every pod that runs within a specific pipeline stage. Metrics with gaps of more than 2 minutes are converted back to the initial frequency by propagating the last valid data point forward to the next one.
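
As a rough illustration of the gap-filling step, here is a minimal pandas sketch. The CSV file name and column names are hypothetical stand-ins for the actual Datadog export, not our production code.

```python
import pandas as pd

# Hypothetical Datadog export: one row per (timestamp, pod) with the
# kubernetes.cpu.usage.total and kubernetes.memory.rss values.
usage = pd.read_csv("datadog_export.csv", parse_dates=["timestamp"])

def resample_pod(pod_df: pd.DataFrame) -> pd.DataFrame:
    """Bring one pod's metrics back to a 2-minute frequency,
    forward-filling the last valid data point over any gaps."""
    return (
        pod_df.set_index("timestamp")
        .sort_index()[["cpu_usage_vcpus", "memory_rss_gb"]]
        .resample("2min")
        .ffill()
    )

usage_2min = usage.groupby("pod_name", group_keys=True).apply(resample_pod)
```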

CPU usage is measured in vCPUs, while RAM is measured in GB. To convert the utilization metrics into EC2 instance utilization percentages, we divide the raw usage values by the total number of vCPUs and the total memory of the corresponding EC2 instance type, respectively.

This allows us to measure the load level for each EC2 instance type during all pipeline stages.
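
Below is a minimal sketch of this conversion, assuming a small lookup table with the published vCPU and memory specs of the instance types mentioned above; the helper is our own illustration rather than the actual pipeline code.

```python
# Published specs for the EC2 instance types used by the pipeline.
INSTANCE_SPECS = {
    "m5.8xlarge": {"vcpus": 32, "memory_gb": 128},
    "m4.4xlarge": {"vcpus": 16, "memory_gb": 64},
    "g4dn.4xlarge": {"vcpus": 16, "memory_gb": 64},
}

def load_fractions(cpu_usage_vcpus: float, ram_usage_gb: float,
                   instance_type: str) -> tuple[float, float]:
    """Convert raw CPU (vCPUs) and RAM (GB) usage into per-instance
    utilization fractions in the range [0, 1]."""
    spec = INSTANCE_SPECS[instance_type]
    return cpu_usage_vcpus / spec["vcpus"], ram_usage_gb / spec["memory_gb"]

# Example: 4.8 vCPUs and 20 GB used on an m5.8xlarge
print(load_fractions(4.8, 20.0, "m5.8xlarge"))  # (0.15, 0.15625)
```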

Img. 2 Average load levels per instance type

Step 2. Power consumption

The total power consumption is calculated as the sum of four power factors: CPU power, RAM power, GPU power and delta power (the estimated power consumption of the remaining hardware components on the machine).

After Step 1 we have CPU and RAM load values per pod and EC2 instance type throughout the whole pipeline runtime. This is where we refer again to the Teads Engineering Dataset⁴. The AWS EC2 Emission Dataset provides power consumption estimates for four different instance load levels: idle, 10%, 50% and 100%.

We use this information to interpolate the power consumption of our instances at the calculated load values.
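
The interpolation itself can be a simple piecewise-linear lookup over the four load points. The wattage figures in the sketch below are made up for illustration; the real per-instance-type values come from the Teads dataset⁴.

```python
import numpy as np

# Load levels for which the AWS EC2 Emission Dataset reports power figures.
LOAD_POINTS = np.array([0.0, 0.10, 0.50, 1.0])

# Illustrative (made-up) CPU power values in watts for one instance type,
# at idle, 10%, 50% and 100% load.
CPU_WATTS = np.array([25.0, 60.0, 150.0, 220.0])

def interpolate_power(load: float, watts_at_points: np.ndarray) -> float:
    """Piecewise-linear interpolation of power (W) at a load in [0, 1]."""
    return float(np.interp(load, LOAD_POINTS, watts_at_points))

# Example: estimated CPU power at 30% load for this instance type
print(interpolate_power(0.30, CPU_WATTS))  # 105.0 W
```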

Img. 3 Utilization vs Power

As we mentioned before, the model training stage of our pipeline runs on a GPU-powered EC2 instance type. GPU utilization measurements are not included in our Datadog export, so we use the average GPU power consumption reported by CodeCarbon from our previous experiment.

Delta power is taken from the AWS EC2 Emission Dataset based on the EC2 instance type.

Img. 4 shows the total power consumption over all workloads within our Deep Learning pipeline throughout a single run.

Img.4 Total power

If we decompose the power consumption by pipeline stage (Img. 5), we see that the User Tensors stage contributes the most, followed by the Events Preprocessor stage.

Img. 5 Power usage by each pipeline stage

Step 3. Emissions

Once we have power estimates per instance type, we can use the AWS EC2 Emission Dataset to calculate the emissions from running each instance at a specific load. Currently, all pipeline workloads run in one region: Europe (Ireland).

We first calculate the energy used per time step t from the interpolated power consumption and the duration of the time step.

Based on the energy per time step, we compute the emissions per time step using the carbon intensity of the region’s electricity grid.

In the end, the emissions per time step are summed up over the whole time period to calculate the total carbon footprint of the pipeline.
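
As a sketch of this calculation, the snippet below only covers the use-phase emissions; the sample interval, PUE and carbon intensity values are placeholders for illustration, not the figures from the dataset, and the amortized scope 3 emissions are added separately.

```python
STEP_SECONDS = 120           # 2-minute sample period
PUE = 1.2                    # placeholder datacenter PUE
GRID_GCO2_PER_KWH = 350.0    # placeholder grid carbon intensity for the region

def step_emissions(power_watts: float) -> float:
    """Use-phase emissions (gCO2eq) for one time step at the given power."""
    energy_kwh = power_watts * STEP_SECONDS / 3_600_000  # W*s -> kWh
    return energy_kwh * PUE * GRID_GCO2_PER_KWH

# Total footprint: sum the per-step emissions over the whole pipeline run.
power_series_watts = [180.0, 210.0, 195.0]  # placeholder per-step power values
total_gco2eq = sum(step_emissions(p) for p in power_series_watts)
```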

Before we share the results of our estimation, we would like to briefly describe a few other approaches we tried for estimating the carbon footprint of our Deep Learning pipeline, together with their pros and cons.

CodeCarbon. This was our first tool for experimenting with carbon footprint estimation. CodeCarbon is a Python package built to let developers track the carbon footprint of their machine learning experiments by measuring CPU and GPU power based on onboard sensors. A big advantage of this solution is that it’s straightforward and easy to use. In addition, it provides dashboard tooling with nice visual insights into the emission numbers. Some of the limitations: we could only use this package for the Python stages of our pipeline, meaning we could not run the estimation for the Spark/Scala stages. The measurements provided by CodeCarbon don’t include PUE, scope 3 emissions, RAM power or delta power. That being said, we still ended up using the information about GPU power usage from the CodeCarbon experiment.
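
For reference, here is a minimal usage sketch based on CodeCarbon’s documented EmissionsTracker interface; train_model and the project name are placeholders for the actual training code.

```python
from codecarbon import EmissionsTracker

def train_model():
    ...  # placeholder for the actual model training code

tracker = EmissionsTracker(project_name="age-prediction-training")
tracker.start()
try:
    train_model()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq
    print(f"Estimated training emissions: {emissions_kg:.3f} kgCO2eq")
```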

AWS Cost Explorer and AWS CloudWatch. Our second attempt to estimate CO2 emissions used two AWS services: Cost Explorer⁷ and CloudWatch⁸. The idea was to extract, for a specific period:
- hourly CPU utilization per EC2 instance type from the CloudWatch service;
- the number of running instances per hour for each EC2 instance type from Cost Explorer.
We then combine the extracted information with the EC2 instance type power consumption profiles from the first Teads Engineering blog post⁹ to calculate total energy levels. Cons of this approach: it is difficult to filter workloads for a specific pipeline, utilization information is limited for EC2 instance types where CloudWatch was not enabled, and the service responses are not easy to navigate. In addition, the available data periods differed between Cost Explorer and CloudWatch.
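
To give an idea of the kind of queries involved, here is a hedged boto3 sketch; the time range and instance ID are placeholders, and hourly granularity needs to be enabled in Cost Explorer for the HOURLY query to work.

```python
import boto3
from datetime import datetime, timedelta

end = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
start = end - timedelta(days=1)

# Hourly usage per EC2 instance type from Cost Explorer.
ce = boto3.client("ce")
usage = ce.get_cost_and_usage(
    TimePeriod={"Start": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "End": end.strftime("%Y-%m-%dT%H:%M:%SZ")},
    Granularity="HOURLY",
    Metrics=["UsageQuantity"],
    GroupBy=[{"Type": "DIMENSION", "Key": "INSTANCE_TYPE"}],
)

# Hourly average CPU utilization for one instance from CloudWatch.
cw = boto3.client("cloudwatch")
cpu = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)
```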

Results

The estimated total carbon footprint of a single run of our Deep Learning pipeline is ~12000 gCO2eq.

This is equivalent to the CO2 emissions from burning around 6 kg of coal, or to the greenhouse gas emissions from driving around 50 km in an average passenger vehicle.

Img. 6 Emissions by pipeline stage

When we look at the emission numbers by factor, we see that the highest and the lowest emissions come from CPU and GPU utilization respectively, as illustrated in Img. 7. The scope 3 emissions (manufacturing) account for around 25% of all emissions:

Img. 7 Emissions by factor

If we drill down into the emissions by pipeline stage (Img. 6) and by factor (Img. 8), we can see that the User Tensors stage together with the Events Preprocessing stage accounts for the greatest share of all emissions, leaving the Model Training stage far behind. This gives us an indication of which areas of the pipeline to look into if we want to reduce the emissions.

Img. 8 Emissions by pipeline stage and factor

Out of curiosity, we looked into how our Deep Learning pipeline compares to some well-known models in terms of carbon footprint, taking the numbers from the CodeCarbon documentation¹⁰. A single run of our pipeline results in much lower emissions than, for instance, training ELMo or a BERT base model.

Img. 9 Carbon footprint of multiple models

It is important to mention that the comparison above, and in general the whole experiment, was performed for a single run of our Deep Learning pipeline. If we take into consideration the frequency of retraining and re-running the pipeline, the picture looks quite different.

Summary and Next Steps

We started this experiment with the goal of estimating the current CO2-eq emissions of our Deep Learning pipeline and finding out whether there is room for improvement.

We’ve learned that a single run of the pipeline results in ~12 kgCO2eq, with two pipeline stages as the main contributors: Events Preprocessing and User Tensors. This number can be reduced by ~70% just by switching from Europe (Ireland) to a greener region like Europe (Stockholm). That would be the easiest first improvement.

In addition, now that we know which pipeline stages produce most of the emissions, we could make sure to run them on more efficient hardware (e.g. AWS instances powered by AWS Graviton2 processors¹¹) and to use more efficient algorithms for data processing and transformations.

Another improvement would be to reduce the number of pipeline re-runs. Right now our models are retrained on a daily basis, meaning that the actual emissions (monthly, yearly, bi-yearly) are much higher. We could analyse the necessity of such frequent pipeline runs and find ways to run the pipeline less often. This would also help us significantly reduce our AWS costs.

The current experiment was performed with some limitations: we didn’t take into account other scope 3 factors, and we didn’t include measurements of some additional services (S3, energy used for networking inside the data center, etc.). We could further refine the experiment by extending the monitoring, collecting CPU and GPU power measurements directly, and including other products and pipelines.

PS: We’re hiring and have exciting positions in all our locations across the Nordics and Poland. Check out our open positions at https://schibsted.com/career/.

[1] Dhar, P. The carbon impact of artificial intelligence. Nat Mach Intell 2, 423–425 (2020). https://doi.org/10.1038/s42256-020-0219-9
[2] Anthony, Lasse F. Wolff, Benjamin Kanding, and Raghavendra Selvan. “Carbontracker: Tracking and predicting the carbon footprint of training deep learning models.” https://arxiv.org/pdf/2007.03051.pdf
[3] Source code repository. https://github.com/schibsted/carbon-tracking
[4] Teads Engineering. Building an AWS EC2 Carbon Emissions Dataset.
[5] CodeCarbon. Documentation.
[6] Kubernetes agent for Datadog. https://docs.datadoghq.com/agent/kubernetes/?tab=helm
[7] Boto3 CostExplorer. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ce.html
[8] Boto3 CloudWatch. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch.html
[9] Teads Engineering. Estimating AWS EC2 instances power consumption. https://medium.com/teads-engineering/estimating-aws-ec2-instances-power-consumption-c9745e347959
[10] CodeCarbon. Model examples. https://mlco2.github.io/codecarbon/model_examples.html
[11] AWS Graviton2-based instances. https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-emr-provides-lower-cost-improved-performance/
