Energy Star rating for software

Niket Raut
11 min read · May 31, 2024


Currently there is a lot of focus on reducing the carbon footprint of software applications and services. Tools are coming up that help measure the carbon impact, and thus the greenness, of an application, and there is also a good amount of information available on programming best practices that make code more carbon efficient.

All major hyperscalers like AWS, Azure and Google provide tools that help measure the carbon footprint of application code running inside their environments. Static code scanners like CAST help locate inefficient code, SonarQube has plugins like EcoCode to check the greenness of code, and Kepler works within the Kubernetes ecosystem to measure energy consumption.

However, a single number indicating the carbon emission does not provide the complete picture.

For example, if we say that one car emits 100 grams of CO2 and another emits 120 grams, we cannot say that the first car is better on the basis of that single number alone. We need to consider the larger context, like fuel type (petrol, diesel, CNG etc.) and consumption, engine size and car weight (with or without passengers), to derive a meaningful metric like grams of CO2 per km per kg of vehicle weight. This enables us to compare the CO2 emissions of one car with those of another car of a different size and category. To stretch it further, we can also meaningfully compare the emissions of a bike with those of a car or a truck.

The need for standardization of the greenness of software

There is always a need for standardization when things are compared to each other. This sets a base set of parameters and configurations that allow meaningful comparison between the entities in terms of relative efficiencies.

In the same fashion, there is a need for some common, standardized way to meaningfully compare the carbon efficiencies of different software of different sizes, written in different languages and running in different types of environments.

In the case of software, the primary source of Scope 1 emissions is the stream of instructions sent to the CPU, which consumes power to execute them. The process also involves other elements like RAM, GPU and NIC, which consume power during their operation, along with factors like the use of encryption, compression and network traffic (wired as well as wireless). All this activity generates heat, which necessitates cooling infrastructure that consumes further power.

On the surface we may see a single line of code, but depending on the complexity of the environment, the libraries and frameworks used, and the programming language itself, far more instructions may reach the CPU than those purely corresponding to the intent of that one line.
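As a rough illustration, Python's built-in dis module shows how even a one-line function body expands into a whole sequence of interpreter instructions, each of which in turn costs many machine instructions:

```python
import dis

def one_liner():
    # A single line of source code...
    return sum(x * x for x in range(10))

# ...disassembles into a sequence of bytecode instructions, and each
# bytecode op triggers many machine instructions inside the interpreter
# loop, plus allocator and library activity invisible in the source.
dis.dis(one_liner)
```

This only shows the bytecode level; JIT compilers, system calls and hardware effects add further layers between the source line and the silicon.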

The complete ecosystem involved in executing the codebase of an application can be broadly classified into three layers, as shown in the image below. The layers can be treated independently of each other and thus leveraged further for the purpose of standardization.

Layers That Influence Greenness

The Codebase Layer

The first layer comprises the codebase, the language runtime, the framework, the libraries and all the dependencies required to successfully execute the codebase. This is the primary set of instructions that gets sent to the operating system, which is part of the second layer. This grouping helps in tackling the differences between implementations of the same algorithm in different languages and brings them onto level ground.

A simplified syntax does not necessarily mean an efficient syntax. It definitely offers developer convenience and helps speed up the development process. However, when we focus on efficiency of execution, layer 1, by clubbing together everything up to the language runtime sitting just above the OS layer, takes away any advantage or disadvantage of using a specific language. More about this can be read in this article.

The OS Layer

The second layer typically comprises the operating system and anything in between the operating system and the hardware. It can be categorized into four commonly observed flows:

1. The OS runs directly on top of the hardware, as in a typical PC or laptop.

2. The OS is installed on top of a hypervisor that adds one more software layer between the OS and the hardware.

3. The OS is part of a container system like Docker or Podman that itself runs inside a host OS, which may again be a VM on a hypervisor.

4. The OS is part of a container as in flow 3, but the container is also managed by a larger orchestration mechanism like Kubernetes, whose node layer again runs on top of an OS running on top of a hypervisor.

By clubbing all these different systems together into a single logical layer, we unify the mechanisms whose purpose is to help the code's output reach the CPU. As part of this logistics, these systems also add more instructions to the existing set generated by layer 1.

While the operating system is the fundamental element, all the other elements are optional and are added to bring ease of management and operations to the overall ecosystem. Every additional layer from flow 1 to flow 4 adds extra instructions to be executed by the CPU.

At one extreme we find pure machine code; as we move toward the other end, more and more layers of abstraction are added, resulting in more and more CPU instructions for a given feature or functionality of a program.

Once again, this layer gives up some efficiency in exchange for ease of operations and management.

The Hardware Layer

This layer represents the real hardware that actually executes the instructions, consumes the power and, in a practical sense, emits the carbon. We will focus on the data center setup, since that is where most enterprise software runs.

The fundamental differences here start with the various CPU architectures like x86, x64 and ARM, as well as different generations and speeds of GPU units and RAM. Just like in the above two layers, a lot of combinations are possible here.

This is the lowest layer, so more thought is needed about what to include in Scope 1 here. The boundary can be limited to the server unit itself, to the rack, or to the data center in its entirety. If we consider the rack or the data center, then we also need to take into account the additional power consumed by the supporting components, such as cooling and lighting.
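One common way to fold that facility overhead into a server-level number is the data center's Power Usage Effectiveness (PUE) ratio; the figures below are illustrative, not measurements:

```python
# Attribute facility overhead (cooling, lighting, power distribution)
# to a single server using PUE. All numbers here are made up.
server_power_w = 300.0  # measured draw of the server itself
pue = 1.5               # PUE = total facility power / IT equipment power

attributable_power_w = server_power_w * pue
print(f"Attributable power: {attributable_power_w:.0f} W")  # 450 W
```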

Even for hyperscalers like AWS, Azure and GCP, it boils down to the same basics: a hardware layer managed by some type of hypervisor.

It is at this lowermost layer that the metrics for power, carbon, greenness etc. will need to be defined and measured.

The standardization aspects

With so many different types and architectures of software applications around, we cannot simply have one size fits all. There needs to be a meaningful categorization, say web applications, desktop applications, databases, web/app servers, middleware etc., and maybe some size-based categories like S, M, L, XL and XXL.

It is also important that we meaningfully define the system boundary for test purposes. For example, while we consider the database connection drivers part of the language runtime, the database itself should be treated as a remote system: a common service provider to many such applications and an independent system with its own green rating.

Standardizing the first layer

While the first layer defines the workload, the two layers below it impact the execution and hence need to be standardized.

Even though it is easy to say that layer one defines the workload, bundling it together can be tricky: different languages have different ways to set up, run and compile; there are different types of files; sometimes there are interpreters and sometimes not; and the final build artefact of each language comes in a different format. Cloud hyperscalers do a good job of standardizing this top layer through their managed services, but that standardization is specific to a given platform. We need to define something more universal in nature.

Standardizing the second layer

For the second layer, it is mostly about the choice of operating system and the corresponding selection of the components below it.

Using a container platform could simplify things considerably. It helps abstract away the entire language runtime setup along with any other libraries, frameworks, dependencies and plugins needed to run the application. It also helps get past the complexities introduced by chains of library dependencies and frameworks, or by the efficiency of the programming language itself. The container holds everything together as a single unit of execution.

We could then fix the layers below that are not absolutely necessary, such as an orchestration layer like Kubernetes. This would give us a middle layer comprising the container runtime, with the OS and hypervisor below it. The hypervisor can also be done away with, depending on what we select as our third layer of hardware.

Standardizing the third layer

The third-layer standardization would primarily consist of selecting a hardware platform capable of running the workload, and since there are not as many players in this space as in the top two layers, the setup could perhaps be standardized with less effort. This can also impact the OS layer above it; for example, if we use a single multi-CPU unit with large RAM and GPU as the standardized setup, we can eliminate the hypervisor component from the middle layer.

The key factor here is how we measure the consumption. Hardware manufacturers typically provide details about power consumption, so we can always know the rated figure. Alternatively, we can isolate a certain set of servers, say a blade chassis or a rack, measure its power consumption specifically, and derive the actual reading from the duration of the test. Hyperscalers do publish details about how sustainable or carbon efficient their services are, but not much about the power consumption of the underlying infrastructure.
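The arithmetic for turning such an isolated measurement into energy, and energy into carbon, is simple; the grid intensity below is a placeholder, since the real value depends on the local grid mix:

```python
# Convert a measured average power draw over a test run into energy and CO2.
# All numbers are illustrative placeholders.
avg_power_w = 450.0               # average draw during the standardized test
test_hours = 2.0                  # duration of the test run
grid_intensity_g_per_kwh = 400.0  # grid carbon intensity (gCO2 per kWh)

energy_kwh = avg_power_w * test_hours / 1000.0     # 0.90 kWh
co2_grams = energy_kwh * grid_intensity_g_per_kwh  # 360 g
print(f"{energy_kwh:.2f} kWh -> {co2_grams:.0f} g CO2")
```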

Some Approaches

We could standardize the carbon rating or greenness of a software product on the basis of carbon emission in mg per MB of application size per hour of execution in the standardized environment, say "5 mg/MB/Hr", and then assign a rating of 1, 2, 3, 4 or 5 stars based on certain ranges of this number. When we use a container, we know the initial size of the container, and the final size after installation of the program gives us the "complete execution unit size" in MB. We then run the container for a standard duration under a standard load, measure the power consumption, and convert it into carbon using standard formulas.
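A minimal sketch of such a rating calculation follows; the star thresholds are invented for illustration and would need to be set per category by whoever governs the standard:

```python
def carbon_rating(co2_mg: float, size_mb: float, hours: float) -> float:
    """Normalize measured emissions to the proposed mg/MB/Hr unit."""
    return co2_mg / size_mb / hours

def stars(rating: float) -> int:
    """Map a rating to 1-5 stars. The band limits are hypothetical."""
    for limit, star in [(2.0, 5), (5.0, 4), (10.0, 3), (20.0, 2)]:
        if rating <= limit:
            return star
    return 1

# Example: 4500 mg of CO2, a 300 MB execution unit, a 2-hour standard run.
r = carbon_rating(4500.0, 300.0, 2.0)
print(f"{r:.1f} mg/MB/Hr -> {stars(r)} stars")  # 7.5 mg/MB/Hr -> 3 stars
```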

We could also stick to power consumption instead of carbon, say Watt/MB/Hr, and use that number to define the rating.

Standard-hardware-based approach with known power consumption and measurement

In this approach we simply enforce a standard hardware setup across categories and measure how much power is consumed for a standard load over a specific period.

Category-based standardized setup and load definitions

Define a standard way of testing for a given category of application and its load; for example, for a web application we could specify a standard load of 5 requests per second running on top of a certain OS with a certain CPU/RAM configuration, or X number of concurrent application users. These definitions can differ per category and be decided by SMEs; say, for a database, x reads and y commits per second, and so on.
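Such definitions could live in a small machine-readable registry; the categories and numbers below are made up for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LoadProfile:
    """Standard test load for one application category."""
    category: str
    metric: str             # what the load is measured in
    rate_per_second: float  # target rate of that metric
    duration_hours: float   # how long the standardized run lasts

# Hypothetical SME-defined profiles; real values would come from the standard.
PROFILES = {
    "web_app":  LoadProfile("web_app", "http_requests", 5.0, 1.0),
    "database": LoadProfile("database", "reads_and_commits", 100.0, 1.0),
}
```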

Cross comparison and benchmarking of platforms

We can first get the rating or consumption numbers for a standard piece of software of known size, category and complexity, then run the same piece across multiple platforms like Azure, AWS etc. and get the numbers there from their internal calculators.

Based on this we can establish conversion factors between the platforms and then leverage those for further tests.
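In its simplest form this is a ratio: run the same reference workload in both places and divide the readings. A sketch, with invented numbers standing in for the platforms' calculator outputs:

```python
# Derive a conversion factor between the standardized lab and a cloud
# platform by running the same reference workload in both.
# Numbers are illustrative, not real measurements.
lab_co2_g = 120.0      # measured in the standardized environment
platform_co2_g = 90.0  # reported by the platform's carbon calculator

factor = lab_co2_g / platform_co2_g

# Later, translate another platform-reported figure into lab-equivalent units.
reported = 60.0
print(f"{reported * factor:.1f} g CO2 (lab-equivalent)")  # 80.0 g
```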

Some key challenges

1. How to standardize the unit of measurement of code size or application size.

Since we consider everything, just counting lines of code is not meaningful; neither is the size of the build artefact.

One option is to define a standard Docker base image of known size containing only the bare OS, without any language runtime installed. The team then installs everything, and after the installation we check the size of the new image; the difference is the application installation size. This is similar to the industrial approach where a truck or container is weighed before and after loading. Here we focus purely on the program at hand and ignore the carbon consumed during the setup process or by the team members working on it.
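With Docker, this before-and-after weighing can be automated using the Docker SDK for Python; the image tags below are placeholders for whatever the standard would prescribe:

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical tags: the standard bare-OS base image, and the image after
# the team has installed the application and all its runtime dependencies.
base = client.images.get("rating-base:1.0")
loaded = client.images.get("app-under-test:1.0")

base_mb = base.attrs["Size"] / (1024 * 1024)
loaded_mb = loaded.attrs["Size"] / (1024 * 1024)

# The difference is the "complete execution unit size" used for the rating.
print(f"Application installation size: {loaded_mb - base_mb:.1f} MB")
```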

2. Handling the support systems that are essential but not part of the application.

These include databases, caching, other necessary APIs etc. For these kinds of systems a standard setup procedure can be adopted, say installing and exposing them from the same local network at a specific network speed, but outside the measurement environment where the application under test is set up.

3. What if the app depends on systems that are available only within the organization's internal network?

This may require the organization to set up an equivalent testing environment on its own and self-certify by following the defined standard procedures. However, here we need to focus on the impact of the program's network interaction with these external services, not on the carbon efficiency of the services themselves; those can be evaluated as independent entities with their own efficiency measurements.

4. How do we account for cloud native services if the application cannot function without them?

Consider applications deployed on serverless functions like AWS Lambda or Azure Functions, or on services like AWS IoT or Azure IoT. These are cloud-specific and cannot be set up outside the cloud in a meaningful manner. In such situations we may have to use the cross-comparison approach and translate the numbers provided by the cloud. Cloud providers generally publish numbers around the carbon emissions, greenness or sustainability of their services; we need to build a conversion matrix that helps translate these numbers into useful data for us.

5. How do we ensure that all the features of the application are executed in a manner that represents its normal or peak usage? How do we define the standard load?

A regression test suite of the application can be handy here. We may have to mandate a certain minimum code coverage so that we are sure at least all the code is executed; otherwise an inefficient piece of software could get a good rating by simply not executing all its functions.
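A sketch of such a gate using the coverage.py API, assuming the regression suite is a pytest suite; the 80% threshold is an arbitrary example:

```python
import coverage
import pytest

MIN_COVERAGE = 80.0  # hypothetical mandated minimum, in percent

cov = coverage.Coverage()
cov.start()
pytest.main(["tests/"])  # run the application's regression suite
cov.stop()
cov.save()

total = cov.report()  # prints a coverage report, returns the total percent
if total < MIN_COVERAGE:
    raise SystemExit(f"Coverage {total:.1f}% below the {MIN_COVERAGE}% mandate")
```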

6. How do we meaningfully categorize the applications such that there are neither too many nor too few categories?

This can be left to the core committee managing the initiative or the team managing the high-level process. It might well become self-governing, because too many categories would make the whole thing difficult to manage, and too few may be challenged by the industry itself.

Conclusion

Given the all-around push for sustainability and greenness, there is a growing need for this kind of process standardization. It will bring parity, help in defining processes, and may in future even feed into legal requirements. It will also help consumers choose the right software or platform by providing a common basis for comparison.

PS: The Green Software Foundation initiated a new project on this front in January 2024 with a similar aim (Standards / Software Carbon Efficiency Rating — Green Software Foundation (atlassian.net)).
