Can we predict ESG ratings from publicly available data?

Max Stieber
Published in ELCA IT · 13 min read · Dec 2, 2022

Do companies that discuss ESG topics during their earnings calls score better ESG ratings? We use different NLP techniques to give an answer.

The growing threat of climate change has caused the financial markets to increasingly target more sustainable investments. Companies are no longer assessed by their profit alone; their environmental and social footprint is taken into account as well. This shift is accompanied by the question of how to measure this footprint and thus make different companies comparable. Environment, Social, and Governance (ESG) ratings have established themselves as an instrument for measuring this footprint, and rating agencies were founded to address this need and to give investors the best possible recommendations for sustainable investments. During my internship at ELCA, we were able to collect different sources of unstructured data, extract information, and discover links between this data and the ESG ratings.


Rating agencies collect and evaluate a variety of information sources. Both the choice of information sources and the methodology used to build the ratings differ from agency to agency. It is therefore not surprising that the correlations between the ratings of different ESG rating agencies remain low (cf. [Berg et al. | 2019] and [Gibson et al. | 2019]). Our data supports this insight, as you can see in the correlation matrix between three agencies that make their data available to the greater public.

Correlation matrix between the ratings of the three rating agencies; the pairwise correlations range from 0.31 to 0.54. (figure by author)

The fact that the methodologies are only partially disclosed by the rating agencies has motivated us to analyze the ESG ratings by using open-source data. As ESG scores cover a wide range of topics, there are many relevant text sources such as news sources, employer ratings, or sustainability reports. For this experiment, we have decided to analyze earnings calls.

Earnings calls are convened every quarter by listed companies and serve as communication channels between investors and analysts. Increasingly, they discuss ESG-related issues such as how to deal with a pandemic, how to tackle various forms of discrimination, or what efforts they have undertaken to cut their greenhouse gas (GHG) emissions.

Now, we will give a short answer to the question of whether we can establish a link between a company’s earnings calls and its ESG rating.

Short answer

Overview of the framework: Features are extracted from earnings calls and used for ESG rating prediction. (figure by author)

We analyze the earnings calls by extracting the number of different ESG mentions over time for more than 3000 companies. To this end, we train an unsupervised classifier to identify ESG-relevant text by leveraging sustainability reports. We further classify the ESG-relevant paragraphs into 26 descriptive ESG categories. We then create features by aggregating, for each company, the information gathered from its earnings calls, and analyze these features with the help of linear models.

In our preliminary analysis, we establish a relationship between a company’s average number of ESG mentions in earnings calls and its ESG rating. To do so, we fit a linear model on three descriptive variables: the company’s industry (there are 42 industries in total), its “mean_total_mentions”, and its “mean_controversy”. “mean_total_mentions” counts the number of paragraphs in which a company discussed ESG topics during its earnings calls. With “mean_controversy”, we try to capture topics that companies try to avoid but that are brought up during the Q&A session of an earnings call.

We can see that the industry indicator variables all have positive slope coefficients: “Oil & Gas Producers” and “Industrial Conglomerates” carry the highest ESG risk, whereas “Textiles & Apparel” and “Media” carry a relatively small ESG risk.

Does the same apply to “mean_controversy”? Does more controversy lead to a higher associated risk? It turns out that its slope is not statistically significantly different from 0.

“mean_total_mentions”, on the other hand, has a significantly negative slope on the outcome variable: companies that discuss more ESG topics during their earnings calls have better ESG scores.

You can see some of the slope coefficients and confidence intervals of our linear model. (figure by author)

In the following, you can dive into the dataset, explore the machine learning pipeline for feature extraction and look at the linear models that we use to investigate the relationship between the constructed features and the ratings.

ESG ratings

ESG ratings are provided to investors by several ESG rating agencies, each of which has developed its own methodology to assess the ESG performance of companies. As the name suggests, the ESG performance of a corporation is evaluated by identifying and weighting indicators in three areas: environmental impact, social impact, and the quality of its governance. There are three sources of divergence in the ESG rating assessment:

  1. Divergence of scope: The three categories are subdivided into several subcategories that are deemed relevant. The choice of these subcategories is subjective and depends on cultural and personal backgrounds. Moreover, the rating agencies determine a set of “material issues” for different industries.
  2. Divergence of measurement: Within these subcategories, the rating agencies identify the most suitable indicators to assess the performance of a company. The choice of indicators and the methods used to assess them (e.g. the choice of data source) can vary between agencies. RepRisk, for instance, doesn’t consider self-reported data sources, as it judges them to be unreliable and biased.
  3. Divergence of weights: The different measurements need to be aggregated into the subcategories and finally into one ESG rating.

For more details and results, have a look at [Berg et al. | 2019].

The divergence between the different rating agencies poses an interesting case: we can ask ourselves which documents are essential for predicting each of the different ratings.

Earnings calls

During the earnings calls, corporate management presents the quarterly earnings and discusses the factors that have significantly influenced their business. The prepared remarks of company officials are followed by a Q&A session where analysts and investors can ask questions about the company’s decision processes and their results. These sessions can be particularly valuable for uncovering flaws in a company’s ESG strategy if critical questions are asked.

In recent years, as the ESG performance of a company has become more significant for business, ESG topics have been addressed more frequently during earnings calls.

Based on this public data, we can analyze which company executives bring up ESG-relevant subjects and discuss them during their earnings calls. We can also analyze the type of ESG issue and whether it arises during the prepared remarks or rather during the Q&A session.

Diving into the data

In the following, we show our approach to extracting information from earnings calls and preprocessing them for a regression task. We transform the unstructured data into tabular data and investigate a potential link between the extracted data and the ESG ratings.

How to extract information from the earnings calls

Our dataset is composed of ~43'000 earnings call transcripts of around 3'000 companies collected from different openly accessible sources. We subdivided the text into the “Prepared remarks” and “Q&A” sections and separated them into paragraphs. After these preprocessing steps, we extracted features in three steps:

1) Filter relevant paragraphs

We need to identify the paragraphs that contain relevant discussions of ESG topics. But how do we define ESG relevance across the 26 categories of ESG-relevant topics defined by the Sustainability Accounting Standards Board (SASB)? SASB categories include “Greenhouse Gas (GHG) Emissions”, “Employee Health and Safety”, and “Management of the Legal and Regulatory Environment”.

Many ESG topics are not easy to identify in heaps of text data, especially when your dataset consists of approximately 4 million paragraphs. Domain knowledge is required to solve this task properly. But what if you don’t have access to a domain expert?

We leverage sustainability reports to identify ESG-relevant topics. Sustainability reports are company-produced documents that discuss the company’s material ESG issues and explain how it deals with them. They allow us to learn about the relevant ESG topics and their language.

We formulate the task of identifying relevant ESG paragraphs as an unsupervised learning problem. We sample 1 million paragraphs, half of which come from earnings calls and the other half from sustainability reports. We use “all-mpnet-base-v2” as the sentence embedding model and reduce the embedding space from 768 to 10 dimensions using UMAP. The dimensionality reduction is important to avoid the “curse of dimensionality” in the subsequent clustering step. We then use HDBSCAN to group similar paragraphs into clusters.

We recommend the use of “BERTopic”, as it implements the pipeline in an easy-to-use package. Moreover, it provides a class-based TF-IDF method to extract the most salient keywords of a cluster.
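
Below is a minimal sketch of such a pipeline with BERTopic. The overall structure (sentence embeddings → UMAP down to 10 dimensions → HDBSCAN) follows the setup described above, but the clustering hyperparameters and the `paragraphs` variable are illustrative assumptions rather than our exact configuration.

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# paragraphs: list of strings sampled from earnings calls and sustainability reports
embedding_model = SentenceTransformer("all-mpnet-base-v2")          # 768-dim sentence embeddings
umap_model = UMAP(n_components=10, n_neighbors=15, min_dist=0.0,
                  metric="cosine", random_state=42)                 # reduce 768 -> 10 dimensions
hdbscan_model = HDBSCAN(min_cluster_size=100, metric="euclidean")   # density-based clustering

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    verbose=True,
)
topics, _ = topic_model.fit_transform(paragraphs)   # cluster id per paragraph (-1 = outlier)
print(topic_model.get_topic_info().head())          # c-TF-IDF keywords per cluster
```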

The pipeline that depicts our method for unsupervised relevance classification (figure by author)

To assign an ESG-relevant or non-ESG-relevant label to the ~141 clusters, we use the fact that earnings call paragraphs are dominated by non-ESG language, while sustainability reports contain mostly ESG-related topics. Therefore, we classify clusters dominated by paragraphs from sustainability reports as relevant and those containing mostly earnings call paragraphs as non-relevant. We then end up with 500'000 earnings call paragraphs classified into ESG-relevant and non-ESG-relevant categories.

Illustration of the voting scheme that determines the ESG relevance of a cluster. (figure by author)
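
A sketch of how this voting can be implemented, assuming we keep track of the source of every paragraph; the column names and the 50% threshold are illustrative:

```python
import pandas as pd

# One row per paragraph: the HDBSCAN cluster id and the document source
# ("sustainability_report" or "earnings_call").
df = pd.DataFrame({"cluster": topics, "source": sources})

share_report = (
    df[df["cluster"] != -1]                 # drop HDBSCAN outliers
    .groupby("cluster")["source"]
    .apply(lambda s: (s == "sustainability_report").mean())
)

# A cluster counts as ESG-relevant if sustainability report paragraphs dominate it.
relevant_clusters = set(share_report[share_report > 0.5].index)
df["esg_relevant"] = df["cluster"].isin(relevant_clusters)
```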

After correcting some obviously misclassified clusters by hand, we obtain a labelled dataset with relatively little noise, produced by automatically identifying the relevant ESG topics across different industries. The unsupervised classification method is illustrated in the figure above. This approach short-circuits the painful process of identifying the relevant ESG topics for each industry and manually labelling earnings call paragraphs, which contain relatively little ESG-relevant data. In the following, we use this dataset to train a supervised model.

For the evaluation of the different classification models, we create a gold standard dataset of hand-annotated paragraphs. We use a simple keyword-based approach, based on the work of Evan Tylenda and others, as a baseline for comparing our supervised models.
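
Such a keyword baseline can be as simple as a regular-expression match over a hand-picked vocabulary. The keyword list below is a tiny illustrative subset, not the list from the cited work:

```python
import re

# Tiny illustrative subset of ESG keywords; a real list would be much longer.
ESG_KEYWORDS = ["greenhouse gas", "ghg emissions", "renewable energy", "diversity",
                "health and safety", "human rights", "data privacy", "corporate governance"]
KEYWORD_RE = re.compile("|".join(re.escape(k) for k in ESG_KEYWORDS), re.IGNORECASE)

def keyword_baseline(paragraph: str) -> bool:
    """Label a paragraph as ESG-relevant if it contains any of the keywords."""
    return KEYWORD_RE.search(paragraph) is not None
```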

We evaluate different classification methods on two types of text representations. On the one hand, we use TF-IDF embeddings to train a model that identifies the most discerning keywords for classifying the paragraphs correctly. On the other hand, we explore the use of BERT embeddings (based on ESGBert) that were trained on ESG data.
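
The TF-IDF variant boils down to a standard scikit-learn pipeline, for instance with a logistic regression on top (the hyperparameters and the choice of classifier are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# texts: list of paragraphs; labels: 0/1 relevance labels from the clustering step
tfidf_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5, sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
tfidf_clf.fit(texts, labels)
# The largest logistic regression coefficients point to the most discerning ESG keywords.
```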

The ESGBert model fine-tuned on the classification task ends up being our model of choice, most likely because it is already pre-trained on ESG language. This transformer model thus solves our first problem: identifying ESG-relevant paragraphs.
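
As an illustration, fine-tuning such a model for the binary relevant/non-relevant task could look roughly like this with the Hugging Face transformers library. The checkpoint id, the hyperparameters, and the `train_ds`/`eval_ds` dataset objects are assumptions, not our exact setup:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_ID = "nbroad/ESG-BERT"   # illustrative ESG-domain checkpoint from the Hugging Face hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, ignore_mismatched_sizes=True   # new head: relevant vs. non-relevant
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

# train_ds / eval_ds: datasets.Dataset objects with "text" and "label" columns, built from
# the cluster-labelled paragraphs and the hand-annotated gold standard, respectively.
train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esg-relevance", num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,   # enables dynamic padding via the default data collator
)
trainer.train()
```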

2) ESG topic classification

The relevant paragraphs are classified into one of 26 ESG categories, such as “Product Quality and Safety”, “GHG emissions”, “Energy Management”, or “Waste and Hazardous Materials Management” (here is a list of all the ESG categories according to SASB). ESGBert has been specifically developed for this task. Thus, we reuse this pre-trained model to classify the ESG-relevant paragraphs into 26 different categories.
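
A hedged sketch of this step with the transformers pipeline API; the hub id is an illustrative ESGBert checkpoint, and the label names depend on the exact checkpoint:

```python
from transformers import pipeline

# Illustrative checkpoint; any ESGBert model fine-tuned on the 26 SASB categories works here.
esg_topics = pipeline("text-classification", model="nbroad/ESG-BERT")

paragraph = ("We reduced our Scope 1 and Scope 2 greenhouse gas emissions by 12% "
             "compared to the previous fiscal year.")
print(esg_topics(paragraph))
# e.g. [{'label': 'GHG_Emissions', 'score': 0.97}]  (label names depend on the checkpoint)
```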

3) Data aggregation

At this point, we have split each company’s earnings calls into paragraphs, kept only the relevant ones, and assigned each of them to an ESG category. How can we transform this information into a form that lets us investigate a correlation between a company’s earnings calls and its ESG rating?

Our goal is to leverage the fact that earnings call transcripts can be separated into a prepared remarks part and a Q&A session. While company officials can prepare to shine in the former, they typically have to face the latter unprepared. We attempt to use this circumstance to assess whether a company tries to avoid difficult ESG topics in the prepared part that are then brought up by analysts or investors during the Q&A session.

In pure math, this approach can be formulated as follows. First, we count, for each company i and each of its earnings calls k, the number of paragraphs assigned to ESG category j:

$$c^{\,j,\text{overall}}_{i,k} = \#\{\text{paragraphs of earnings call } k \text{ of company } i \text{ classified into ESG category } j\}$$

Then, we introduce the distinction between the counts from the prepared remarks and from the Q&A session:

$$c^{\,j,\text{overall}}_{i,k} = c^{\,j,\text{prepared remarks}}_{i,k} + c^{\,j,\text{Q\&A}}_{i,k}$$

Finally, we define the topics that are mentioned in the Q&A session, but not in the prepared remarks, as potentially controversial:

$$c^{\,j,\text{controversy}}_{i,k} = c^{\,j,\text{Q\&A}}_{i,k} \cdot \mathbf{1}\!\left[\,c^{\,j,\text{prepared remarks}}_{i,k} = 0\,\right]$$

To aggregate these counts for each company, we average over its earnings calls:

$$\bar{c}^{\,j,\text{overall}}_{i} = \frac{1}{K} \sum_{k=1}^{K} c^{\,j,\text{overall}}_{i,k}$$

with K being the total number of earnings calls per company that we collected; the prepared remarks, Q&A, and controversy counts are averaged in the same way.

We now end up with average count variables cᵖʳᵉᵖᵃʳᵉᵈ⁻ʳᵉᵐᵃʳᵏˢ, c^{Q&A}, cᵒᵛᵉʳᵃˡˡ, and cᶜᵒⁿᵗʳᵒᵛᵉʳˢʸ per ESG category. cᵖʳᵉᵖᵃʳᵉᵈ⁻ʳᵉᵐᵃʳᵏˢ and c^{Q&A} are highly correlated and should therefore not both be used as variables in a linear regression. Hence, we use cᵒᵛᵉʳᵃˡˡ and cᶜᵒⁿᵗʳᵒᵛᵉʳˢʸ as features to describe each company.

We experiment with a simple unweighted mean and a weighted mean that allows us to put more emphasis on the ESG mentions in recent earnings calls. We don’t add the weights to the formulas as they would make the indexing confusing.

At this point, we have the (un)weighted average number of mentions for each category per company for the prepared remarks and the Q&A session.
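
The aggregation can be sketched with pandas as follows. The table layout and column names are assumptions; the recency-weighted variant would additionally weight each call before averaging:

```python
import numpy as np
import pandas as pd

# relevant: one row per ESG-relevant paragraph with (illustrative) columns
# company, call_date, section ("prepared_remarks" or "qa"), and esg_category.
counts = (
    relevant.groupby(["company", "call_date", "esg_category", "section"])
    .size()
    .unstack("section", fill_value=0)
    .rename(columns={"prepared_remarks": "c_prepared", "qa": "c_qa"})
)
counts["c_overall"] = counts["c_prepared"] + counts["c_qa"]
# Topics that only appear in the Q&A session count as potentially controversial.
counts["c_controversy"] = np.where(counts["c_prepared"] == 0, counts["c_qa"], 0)

# Unweighted average over each company's earnings calls, per ESG category.
# (For brevity, calls with zero mentions of a category are ignored here; a full
# implementation would reindex them as zeros before averaging.)
features = counts.groupby(["company", "esg_category"])[
    ["c_prepared", "c_qa", "c_overall", "c_controversy"]
].mean()
```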

ESG trends over time

We analyze the extracted features over time to sanity-check our pipeline and discover potential problems. Below, we plot the share of earnings calls with at least one ESG topic mention over time. We can see that, over the years, ESG topics are discussed more and more frequently during earnings calls. Nevertheless, many earnings calls still don’t contain any ESG mentions. We also identify a peak in the first quarter of 2020.

The share of earnings calls with at least one ESG mention over the years. (figure by author)
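
The quantity plotted above can be computed in a few lines, assuming a paragraph-level table that contains all paragraphs (relevant or not) together with the relevance flag from the classifier and a datetime call date; the variable and column names are illustrative:

```python
# paragraph_table: one row per paragraph with columns company, call_date (datetime),
# and esg_relevant (bool).
calls = (
    paragraph_table.groupby(["company", "call_date"])["esg_relevant"]
    .any()                 # does the call contain at least one ESG-relevant paragraph?
    .reset_index()
)

share = (
    calls.set_index("call_date")["esg_relevant"]
    .resample("Q")         # quarterly share of calls with at least one ESG mention
    .mean()
)
ax = share.plot()
ax.set_ylabel("share of calls with at least one ESG mention")
```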

In the figure below, we look at the average number of mentions per ESG category. This figure allows us to explain the peak we observed before: with the pandemic, which started at the end of 2019, many companies had to address the health and safety of their employees, and we see a strong peak in that category in the first quarter of 2020. Other ESG issues like “Employee engagement, inclusion, and diversity” have also risen sharply. With the “Black Lives Matter” and “LGBTQ+” movements after the killing of George Floyd in May 2020, these topics have become more important to many companies.

The average number of mentions of different ESG categories over the years. The graph was uncluttered by removing some categories that changed less over time. (figure by author)

Correlating ESG mentions with ESG ratings

After extracting and pre-validating the features, we run some experiments to see if we can establish a relationship between the extracted features and the ratings. We combine the earnings calls’ features with the ratings and end up with 3222 data points. The ratings measure the ESG risk of a company and higher values correspond to a worse ESG performance.

We conduct hypothesis testing of the regression slope to evaluate a possible linear relationship between the extracted features and the ESG ratings. We test with a significance level of 5%. Our ratings are approximately normally distributed.

To simplify the test, we compute the total number of mentions per company by summing over the ESG categories:

$$t^{\text{overall}}_{i} = \sum_{j} \bar{c}^{\,j,\text{overall}}_{i}, \qquad t^{\text{controversy}}_{i} = \sum_{j} \bar{c}^{\,j,\text{controversy}}_{i}$$

We run a linear regression with

$$\text{rating}_{i} = \beta_{0} + \beta_{\text{industry}(i)} + \beta_{1} \log\!\left(t^{\text{overall}}_{i}\right) + \beta_{2}\, t^{\text{controversy}}_{i} + \varepsilon_{i}$$

where tᵒᵛᵉʳᵃˡˡ is the average number of ESG mentions per earnings call and tᶜᵒⁿᵗʳᵒᵛᵉʳˢʸ is a metric for the average number of potentially controversial mentions, i.e. ESG topics that are brought up during the Q&A session but not addressed in the prepared remarks.

We use the “industry group” as an indicator variable, as the mean ratings differ considerably from industry to industry. We log-scale tᵒᵛᵉʳᵃˡˡ, as its distribution seems to follow a power law, with most companies having only a few ESG mentions and some companies having a very high number of them. To be able to take the logarithm, we replace zero mentions with min(number_of_mentions) / 2.
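
A sketch of this regression with statsmodels; the column names of the company-level table `data` are illustrative:

```python
import numpy as np
import statsmodels.formula.api as smf

# data: one row per company with columns rating (ESG risk), industry_group,
# t_overall, and t_controversy (names are illustrative).
eps = data.loc[data["t_overall"] > 0, "t_overall"].min() / 2   # replacement for zero mentions
data["log_t_overall"] = np.log(data["t_overall"].replace(0, eps))

model = smf.ols("rating ~ C(industry_group) + log_t_overall + t_controversy",
                data=data).fit()
print(model.summary())   # slopes, t-tests, R-squared and adjusted R-squared
```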

Summary of the statistics of the linear model. The variables can explain a reasonable part of the variance as can be seen by the R-squared score.

The R-squared value of the linear model indicates that our variables are able to explain a good amount of variance in the ratings. Most of the variance is explained by the industry indicator variables. The addition of our two features tᵒᵛᵉʳᵃˡˡ and tᶜᵒⁿᵗʳᵒᵛᵉʳˢʸ improves the R-squared score from 0.435 to 0.461. The adjusted R-squared score, which corrects for the additional degrees of freedom, is improved from 0.428 to 0.454.

We are also interested in the regression slopes and notice the following:

  • We can reject the null hypothesis and find a negative linear relationship between the number of mentions tᵒᵛᵉʳᵃˡˡ and the ESG risk: more ESG mentions are associated with a lower ESG risk. A negative slope matches our intuition that more ESG mentions should go hand in hand with a decreased risk (better ESG performance).
  • We cannot reject the null hypothesis for the mean controversy score (tᶜᵒⁿᵗʳᵒᵛᵉʳˢʸ) at the 5% significance level. It seems that this variable was not as useful as we had hoped.
  • The slopes of the indicator variables of the industry groups are all significant at the 5% level.

These results show a relationship between the extracted features from the earnings call and the ratings. Especially tᵒᵛᵉʳᵃˡˡ, the overall count of ESG mentions per earnings call, seems to help in predicting the ratings.

Summary of the linear model with the most important parameters e.g. slope, standard error, and the corresponding t-test.

Conclusion

We were able to establish a link between the average number of ESG mentions in earnings calls and a company’s rating. We hope to squeeze more information out of the per-category ESG mentions, although strong correlations between these features and the limited number of samples make this a challenging task.

In our further work, we will investigate the relationships between the features and the ratings on a more fine-grained level. Investigating potential interaction terms between industry groups and ESG mentions might also be helpful. Furthermore, we will try to improve the predictions by using signals derived from other document sources.

I would like to thank my supervisors Simon Häfeli and Luc Seiler for their valuable input and discussion during the internship and Nicolas Hubacher and Antoine Hue for the detailed review of the article. Many thanks :)

Max Stieber — Data Science master’s student at EPFL, intern at ELCA, and passionate about gaining insights into data that’s relevant to society.