Carbon Accounting Revolution: Utilising Machine Learning to upgrade Sage Earth

Claire Watson
Published in Sage Ai
11 min read · Mar 13, 2024
This image was generated using Gencraft

Introduction

The climate crisis has become the focus of global concern in recent years, igniting a sense of urgency and responsibility to take action. For most organisations, the vast majority of their carbon emissions occur in their supply chain. As sustainability continues to emerge as a key priority, manual carbon accounting is no longer a viable option for understanding carbon impact.

This is where Sage Earth comes in — Sage’s carbon accounting solution that automates this process for you. By integrating with accounting software and ingesting data such as invoice information and transaction spend history, Sage Earth can estimate an organisation’s carbon footprint, empowering it on its journey to reach net zero emissions in its operations and supply chain. Sage Earth’s initial solution relied on a rule-based text-matching approach to map this spend data to carbon emissions.

Here at Sage Ai, we recently collaborated with Sage Earth to provide a new service which employs machine learning to perform the prediction part of this carbon estimation process. Being part of the solution is not just a professional endeavour for us — it’s a personal commitment driven by a shared passion and excitement for sustainability and a deep-rooted desire to make a positive impact on the world using our skillset as AI Engineers.

Since our integration was released, we have seen the number of transactions the product can provide carbon emissions for double, while actually improving accuracy. In this article, we’ll delve into our collaboration, detailing how we are using machine learning not only to streamline this process and boost efficiency, but to revolutionise carbon accounting and pave the way for a more sustainable future.

Background

Carbon accounting involves quantifying carbon emissions using existing accounting data. Sage Earth is built on the Greenhouse Gas Protocol, a globally recognised framework for measuring and managing greenhouse gas emissions. Essentially, this spend-based methodology relies on an equation which multiplies the financial value of a product or service by an environmental impact factor (EIF). The resulting number provides an estimate of the emissions produced by that product or service.
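
To make the methodology concrete, here is a minimal sketch of the spend-based calculation; the EIF value is made up for illustration and is not a real factor.

```python
# Illustrative only: the EIF value below is invented, not a real factor.
def estimate_emissions(spend_amount: float, eif_kg_co2e_per_unit_spend: float) -> float:
    """Spend-based methodology: emissions = financial value of the spend x EIF."""
    return spend_amount * eif_kg_co2e_per_unit_spend

# A hypothetical £500 spend against a hypothetical EIF of 0.21 kgCO2e per £
print(estimate_emissions(500.0, 0.21))  # -> 105.0 kgCO2e
```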

Sage Earth climate scientists developed a hierarchy of user-friendly ‘carbon categories’ which map to the EIFs. When a user views their transaction data on Sage Earth, they will see each transaction mapped to a carbon category. This process of mapping transactions to categories is what enables the product to estimate carbon emissions.

Spend data can be assigned to a tier 2 category, which is a more general category (think ‘vehicle fuel’), or to a tier 3 category, which is the most granular category type (think ‘vehicle fuel — biofuels’). The difference between these tiers is that tier 2 categories tend to map to multiple EIFs and split the carbon emissions across them, whereas tier 3 categories map to a single EIF, thus providing a more accurate carbon emission estimation.
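
As a rough illustration of that difference, a tier 2 category might split a spend across several EIFs using assumed weights, while a tier 3 category applies a single, specific EIF. The category names, EIF values and weights below are hypothetical.

```python
# Hypothetical category-to-EIF mappings and weights, for illustration only.
TIER_3_EIF = {"vehicle fuel - biofuels": 0.19}  # tier 3: a single, specific EIF
TIER_2_EIFS = {"vehicle fuel": {"petrol": 0.24, "diesel": 0.25, "biofuels": 0.19}}
TIER_2_WEIGHTS = {"vehicle fuel": {"petrol": 0.5, "diesel": 0.4, "biofuels": 0.1}}

def tier_2_emissions(category: str, spend: float) -> float:
    """Tier 2: split the spend across several EIFs using assumed weights."""
    return sum(spend * weight * TIER_2_EIFS[category][fuel]
               for fuel, weight in TIER_2_WEIGHTS[category].items())

def tier_3_emissions(category: str, spend: float) -> float:
    """Tier 3: apply one specific EIF, giving a more accurate estimate."""
    return spend * TIER_3_EIF[category]

print(tier_2_emissions("vehicle fuel", 100.0))             # blended estimate
print(tier_3_emissions("vehicle fuel - biofuels", 100.0))  # specific estimate
```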

Because the original solution to map transactions relied on rule-based text-processing logic, it was unable to automatically adapt to new textual inputs without new rules being implemented, and it tended to choose the more general tier 2 categories. By implementing a machine learning solution, we can now automatically learn patterns, rules and relationships from data, enabling the classification of new instances without the need to manually create new rules. Our models also achieve a higher level of accuracy: they continuously learn and update their knowledge as new data becomes available, and they predict solely at the tier 3 level, meaning spends are more effectively mapped to EIFs.

Defining the Problem

As with all successful machine learning projects, we began by clearly defining the problem we set out to solve. The hypothesis we were testing was as follows:

Using a machine learning approach, can we match more spend items to carbon categories than the existing solution, without negatively impacting accuracy, and thereby provide a more complete estimation of a company’s carbon impact?

A Look at the Data

We were provided with a dataset of over 1 million rows, where each row represents a unique input request. Approximately 125K of these rows were labelled (i.e. had a ground-truth target variable); however, only 40% of these had a baseline prediction to benchmark against. A further ~150K rows had a baseline prediction but no label.

Of the labelled data, the target variable was split 75% tier 3 labels / 25% tier 2 labels.

From this we created four datasets:

  • Dataset 0: approx. 100K rows labelled at tier 3 level.
  • Dataset A: approx. 23K rows labelled at tier 2 level. We added an additional column to this dataset, where the majority of these rows were given a tier 3 level label*.
  • Dataset B: approx. 500 rows selected randomly from the data that had a baseline prediction but no label. These were semi-manually labelled*.
  • Dataset C: approx. 500 rows selected randomly from the data that had neither a baseline prediction nor a label. These were semi-manually labelled*.

* The method to manually label the data is detailed later in this article.

The input features shared with us included a textual description of the spend (e.g. a bank transaction description or an invoice line item description), the merchant name, the spend amount and currency, the spend date, and a user-defined description of the spend (e.g. ‘Travel Expenses’).

Experimentation Approach

In the general sense, there are three possible types of model: global (one model trained on all of the data), local (a model per company) and hybrid (a combination of the two). Due to the importance of a consistent approach for emission estimation and the need for a solution that does not suffer from the cold start problem, we chose a global model approach.

Dataset 0 was used for model training experimentation. We used a 3-fold cross-validation approach with a time series split. The key product metrics are automation (the proportion of predictions returned) and accuracy at a global level. The dataset was imbalanced, i.e. a number of the carbon categories had low support in the dataset, so we also considered precision, recall and the F1 score (all macro-averaged).
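
For concreteness, here is a minimal sketch of that evaluation setup using scikit-learn’s TimeSeriesSplit and macro-averaged metrics; the data and classifier below are synthetic stand-ins, not our real features or chosen model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data: in the real experiment X is the encoded spend features and
# y the tier 3 carbon categories, with rows ordered by spend date.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = rng.integers(0, 5, size=600)

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    acc = accuracy_score(y[val_idx], preds)
    p, r, f1, _ = precision_recall_fscore_support(
        y[val_idx], preds, average="macro", zero_division=0)
    print(f"fold {fold}: acc={acc:.3f} macro-P={p:.3f} macro-R={r:.3f} macro-F1={f1:.3f}")
```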

To get a better understanding of potentially useful features, and to understand the best approach for encoding the textual features, we conducted an experiment using combinations of different classifiers and encoders. The classifiers included Logistic Regression, LightGBM, kNN, Decision Trees, Random Forests, Extra Trees, SVM and SGD. The encoders tested included binary, base N, hash, one-hot, TF-IDF and count vectoriser. For TF-IDF and count vectoriser, we experimented with n-gram encoding at both character and word level. The intuition here was that n-grams, n>1, can capture contextual information compared to just unigrams and, due to the nature of free-text inputs, there may be minor variations in the text we would like to capture.
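
As a sketch of one encoder/classifier combination from this grid, the snippet below pairs character-level TF-IDF n-grams with logistic regression; the spend descriptions and tier 3 labels are toy examples, not real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy spend descriptions and hypothetical tier 3 labels, for illustration only.
texts = ["SHELL FUEL STATION 0423", "AWS EMEA monthly invoice", "TRAINLINE TICKET LDN-MAN"]
labels = ["vehicle fuel - biofuels", "cloud computing services", "rail travel"]

pipeline = Pipeline([
    # char_wb n-grams are robust to the minor spelling variations seen in free-text spends
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["SHELL PETROL STN 0424"]))
```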

It is worth noting that we also experimented with pre-trained embeddings on the text-based features to enrich the feature representation; intuitively they might be useful because there are many words with similar meanings that embeddings can capture. We considered pre-trained embeddings from fastText, GloVe and sentence-transformers. However, the simpler encoders outperformed these approaches, and including the embeddings as additional features did not provide any uplift in model performance. What we tend to see in the data is company- or merchant-specific language in the textual fields of the spend. We believe this is why pre-trained embeddings don’t help here; the language is essentially domain-specific, and the simpler encoders have the advantage of being fitted directly on it. When we have more data in the future, we may consider fine-tuning embeddings for our own use case.
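
For completeness, this is roughly what the sentence-transformers experiment looked like; the model name and inputs here are illustrative choices rather than the exact ones we used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # an illustrative pre-trained model choice
descriptions = ["SHELL FUEL STATION 0423", "AWS EMEA monthly invoice"]
embeddings = model.encode(descriptions)  # dense vectors, shape (n_spends, embedding_dim)

# These can be concatenated with the sparse encoder features, although in our
# experiments this gave no uplift over the simpler encoders alone.
print(np.asarray(embeddings).shape)
```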

To avoid turning this article into a mini-novella, I will briefly mention that we also included different data cleaning, feature-selection and class-imbalance handling approaches in our experimentation. If anyone is interested in more detail or has any questions on this, please feel free to ask them in the questions section.

Semi-Manual Labelling Approach

I mentioned above that we semi-manually labelled data in datasets A, B and C. In order to do this in a more automated and less resource-intensive way (and to preserve our sanity), we used two approaches to group ‘similar’ inputs. These groups were then assigned tier 3 labels.

The first approach was to combine the encoded representation of the textual data with a clustering algorithm. We used agglomerative clustering, which is a hierarchical clustering method, to identify the different levels of groups, and then chose a level with a manageable number of groups to label.
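
A minimal sketch of this first approach, assuming TF-IDF-encoded text and scikit-learn’s AgglomerativeClustering; the example texts and distance threshold are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["SHELL FUEL 0423", "SHELL FUEL 0611", "AWS EMEA invoice", "AWS invoice March"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts).toarray()

# distance_threshold controls where we 'cut' the hierarchy; we picked a level
# that produced a manageable number of groups to assign tier 3 labels to.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=1.2)
groups = clusterer.fit_predict(X)
print(groups)  # group ids, e.g. one group per merchant to label
```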

The second approach leveraged the FuzzyWuzzy library. This library uses the Levenshtein distance to measure the similarity of two strings, returning a score out of 100, where the higher the score, the higher the similarity. This was used more as a sanity check on the first approach, as it is time-consuming to run on large datasets. It was run on a random sample of dataset A, and on the entirety of both datasets B and C.
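
A small sketch of that check; the strings are toy examples.

```python
from fuzzywuzzy import fuzz

print(fuzz.ratio("SHELL FUEL 0423", "SHELL FUEL 0611"))                 # high score: likely the same group
print(fuzz.token_sort_ratio("AWS invoice March", "March AWS invoice"))  # word-order-insensitive variant
print(fuzz.ratio("SHELL FUEL 0423", "AWS EMEA invoice"))                # low score: different groups
```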

After we had identified groups and assigned them tier 3 labels, Sage Earth’s team of climate scientists reviewed the final datasets. It’s important to note that these datasets were not used for any model training. The benefit they provided was to allow us to better understand how the model would perform on a more representative distribution of the data Sage Earth consumed.

Understanding the Model Generalisation

After selecting our best-performing candidate models, we analysed their performance on our three hold-out validation datasets — A, B and C. The goal of this step was to limit the bias introduced by the training dataset in our final analysis, and to give us a better understanding of how the models would generalise so that we could pick the best candidate.

The reason we talk about bias being introduced by the training dataset is that a small number of users may have provided all of the user labels, or the inputs that were mapped to a carbon category may have been the more straightforward examples to map, and so on. It is vital to consider the data as a whole, and to understand whether your training data is representative of the real-world data distribution.

Evaluating the model in this way allowed us to directly compare three scenarios: how the model performs on the examples that the baseline solution could predict for, on the examples that it could not predict for, and on the more difficult examples where we have no ground-truth label and no baseline prediction. Based on this evaluation, we then worked with Sage Earth to set a prediction confidence threshold that met their metrics requirements; the main goal for Sage Earth was to increase automation, but they wanted to ensure that we didn’t return irrelevant carbon categories that would lower user trust in the product. If a prediction does not pass the threshold, it is not shown to the user and is not included in their carbon emissions reports.
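
Conceptually, the thresholding works something like the sketch below; the threshold value, function name and categories are illustrative, not the production implementation.

```python
# Illustrative values only: the real threshold was agreed with Sage Earth
# against their automation and accuracy requirements.
CONFIDENCE_THRESHOLD = 0.80

def gate_prediction(category: str, confidence: float):
    """Return the predicted category only if it clears the threshold; otherwise
    return None so the spend is left uncategorised rather than shown to the user."""
    return category if confidence >= CONFIDENCE_THRESHOLD else None

print(gate_prediction("vehicle fuel - biofuels", 0.93))  # shown to the user
print(gate_prediction("rail travel", 0.41))              # suppressed -> None
```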

From this, we could provide Sage Earth with trustworthy estimates for our automation and accuracy metrics, calculated by weighting performance across the three datasets and incorporating the confidence threshold. I’m happy to report that we have exceeded our own expectations: we had initially estimated that we would be able to provide 50% more categorisations, but what we are seeing in production, on average, is that we have actually doubled the number of categorisations compared to the baseline solution! This is incredibly important to us, as it fosters trust between our team and the products we work with. We want them to know that Sage Ai delivers what it promises, or even more.
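
To show what we mean by weighting performance across the datasets, here is a hedged sketch; the per-dataset numbers and weights are hypothetical, not our real results.

```python
# Hypothetical per-dataset results and weights, for illustration only.
accuracy_after_threshold = {"A": 0.90, "B": 0.85, "C": 0.75}
dataset_weights = {"A": 0.5, "B": 0.3, "C": 0.2}  # e.g. each dataset's assumed share of real traffic

weighted_accuracy = sum(accuracy_after_threshold[d] * dataset_weights[d] for d in dataset_weights)
print(f"estimated production accuracy: {weighted_accuracy:.3f}")
```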

A Look at the Production Service

The resulting service, CASA (Carbon Accounting Solution Automation; and yes, this is a play on the Spanish word ‘casa’, which can mean ‘home’), is a FastAPI-based service created to host this solution, and it falls under our AI Orchestration services. All product requests for these services are routed to their entitled services via our central gateway and orchestration system. I’m not going to go into detail about what these different services are, but you can check out this link to learn more about some of the services we provide.

What this means for us is that Sage Earth forwards batch requests of spend data to our orchestration system, which processes each request into our standardised data format; the batch of data is then forwarded to our CASA API for categorisation.
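
To give a feel for this, here is a minimal sketch of what a batch categorisation endpoint could look like in FastAPI; the route, schema and stubbed prediction are illustrative and not the actual CASA API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="CASA sketch")

class SpendItem(BaseModel):
    description: str
    merchant: str | None = None
    amount: float | None = None
    currency: str | None = None

class Categorisation(BaseModel):
    category: str | None  # None when the prediction does not pass the confidence threshold
    confidence: float

@app.post("/categorise", response_model=list[Categorisation])
def categorise(batch: list[SpendItem]) -> list[Categorisation]:
    # In production the model and threshold are loaded at startup; here we stub the prediction.
    return [Categorisation(category="vehicle fuel - biofuels", confidence=0.9) for _ in batch]
```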

The service is now live in production for all Sage Earth V1 customers! It was initially rolled out to a smaller subset of these customers for testing purposes, but over the last month it has been turned on for the entire customer base.

Considerations & Future Work

We collect feedback from users if they confirm or change our predictions, or if they choose carbon categories for spends that were not assigned a category. Usually you would incorporate this data into re-training of the model. However, we need to be careful and consider how this might bias our training dataset.

Carbon emissions estimation should be consistent across all users, and not biased by user-specific category choices. While we know users will have more information about the nature of their spends, we don’t want this to result in a model that adopts a specific user’s behaviour. For example, we have seen inputs which are clearly related to vehicle fuel, but which are re-categorised to a more general tier 2 carbon category or to other tier 3 categories more related to employee expenses. This may be correct for that specific user, but it wouldn’t always make sense for us to predict these categories from day one for other users.

We also need to consider the impact on estimation and user experience if we start to predict different categories for the same data — the training dataset has been carefully curated by people who are experts in the climate science field. Incorporating user feedback is essential for user experience, however we need to do so in a way that doesn’t lead to inconsistent carbon estimations and is auditable. Two users buying the same exact product from the same place should have the same carbon emissions estimate. These are just some of the challenges of training and deploying a global model.

Further work will involve deciding the best way to incorporate user specific feedback. The Sage Earth team is in the process of defining more carbon categories which will help with this and will also help with the class imbalance we see in the data.

In the future we also hope to move to a more granular categorisation approach; not just predicting a category but also predicting the exact product or service, so that we can provide the most accurate carbon estimation. This would involve us incorporating more third-party data sources into our training process; data which maps products and services from specific merchants to exact estimations.

The future of Carbon Accounting is exciting and we’re delighted to be a part of it!

A special thank you to the Sage Earth team for bringing this problem our way and for the support they provided, and still provide, in this collaboration.
