Analyzing ESG with AI and NLP (Tutorial#2): Large-Scale Analyses of Companies’ Environmental Actions

Tobias Schimanski
7 min read · Nov 26, 2023


ESG, AI, and Natural Language Processing (NLP): these are easily among the most-used words to signal the hottest emerging topics that should play a role in shaping our future. This is the second in a series of tutorials showing you how to use AI to analyze ESG, even if you have no prior knowledge. Tutorial 1 is here.

Follow me on LinkedIn to stay updated on all developments.

All tutorials are based on this paper. Today, we look at how to create large-scale analyses of companies’ environmental actions in disclosures using GPUs. GPUs greatly facilitate and speed up the process of analyzing multiple reports.

Let’s get right into it:

For beginners (or practical people): Besides this article, you can run the tutorial code in this Google Colab notebook. There, you can simply run everything without any problems.

Our first step is very easy. We just load the models and take a quick look at what the results look like. For this tutorial, it is important that you have a GPU in your system. Fortunately, Google Colab offers some free resources. So, try that out or rent a server online (I might make a tutorial about this in the future).
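If you are unsure whether a GPU is actually available in your runtime, a quick sanity check with PyTorch (which the transformers models run on here) can save you some debugging. This small snippet is not part of the original tutorial code, just an optional check:

# Optional check: verify that a GPU is visible before using device=0 below.
import torch
print(torch.cuda.is_available())        # should print True on a GPU runtime
# print(torch.cuda.get_device_name(0))  # optionally show which GPU you got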

# In this tutorial, we make use of the following libraries.
# Install transformers and tika:
# ! pip install transformers tika

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline # for using the models

import spacy # for sentence extraction
from tika import parser # for the report extraction
### Load the models (takes ca. 1 min)
# Environmental model.
name = "ESGBERT/EnvironmentalBERT-environmental" # path to download from HuggingFace
# In simple words, the tokenizer prepares the text for the model and the model classifies the text.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
# The pipeline combines tokenizer and model to one process.
pipe_env = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0) # set device=0 to use GPU

# Action model.
name = "ESGBERT/EnvironmentalBERT-action"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_act = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0) # set device=0 to use GPU

### IMPORTANT: SET RUNTIME TO GPU (see Google Colab Tutorial)

Play around with the models a bit and get a feeling for them.

# Try out the models.
print(pipe_env("We are really relying on people improving their consumer decisions to fight climate change."))
print(pipe_act("We planted 7.500 trees in the last 5 years."))

Well done. Now, we want to analyze a couple of reports. For this tutorial, we use the 2022 reports of Audi, BlackRock, Saudi Aramco, and Telekom. In the first step, we need to process the PDFs into actual sentences. We take the code from Tutorial 1 and encapsulate it in a function.

# Encapsulate the code from Tutorial 1 in a function.
def PDFtoSentence(path):
    print(f"\nParsing {path}")
    # The from_file() function of tika helps us to load the content of the document. (takes ca. 30 sec)
    print("- PDF to txt")
    report = parser.from_file(path)
    # To split the text into sentences, we use the nlp() function from spacy. (takes ca. 20 sec)
    print("- txt to sentences")
    nlp = spacy.load('en_core_web_sm')
    about_doc = nlp(report["content"][:1000000])
    # One downside of spacy is that it can only parse 1,000,000 characters at a time.
    # You can write a for-loop around report["content"] or use other tools (see the sketch after this function).
    # For simplicity, we only use the first 1,000,000 characters.

    # We transfer the sequences ("about_doc.sents") to a list of raw strings.
    sequences = list(map(str, about_doc.sents))
    # "\n" signals a new line. We remove this so that the output looks better.
    sentences = [x.replace("\n", "") for x in sequences]
    # Remove all empty strings, i.e. entries where the value is "".
    sentences = [x for x in sentences if x != ""]
    # A sentence should start with an upper-case letter.
    sentences = [x for x in sentences if x[0].isupper()]
    return sentences
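If a report is longer than spaCy's default limit of 1,000,000 characters, one simple workaround is to process the text in chunks and collect the sentences from each chunk. This is only a rough sketch and not part of the original tutorial; note that splitting at fixed character offsets can cut a sentence in half at a chunk boundary:

# Sketch: process long texts in chunks of at most 1,000,000 characters.
def text_to_sentences_chunked(text, nlp, chunk_size=1000000):
    sentences = []
    for start in range(0, len(text), chunk_size):
        # A sentence may be split at the chunk boundary; good enough for a rough analysis.
        doc = nlp(text[start:start + chunk_size])
        sentences.extend(str(s) for s in doc.sents)
    return sentences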

Feel free to try it with one of the following reports.

# Example reports.
audi22 = "https://www.audi.com/content/dam/gbp2/downloads/report/annual-reports/2022/en/audi-report-2022.pdf"
blackrock22 = "https://www.blackrock.com/us/individual/literature/annual-report/ar-retail-br-exchange-portfolio.pdf"
saudiaramco22 = "https://www.aramco.com/-/media/publications/corporate-reports/saudi-aramco-ara-2022-english.pdf?la=en&hash=6BC0409B50ECFF4A4C743307DF2FF7BDBCEC8B43"
telecom22 = "https://report.telekom.com/annual-report-2022/_assets/downloads/entire-dtag-ar22.pdf"
reports = [audi22, blackrock22, saudiaramco22, telecom22]
# Run PDF to sentence.
audi22_sent = PDFtoSentence(audi22)

Look at the results. They are quite reasonable. Of course, this method can be improved, for example with additional filters (see the sketch below).

# Look at the sentences.
print(audi22_sent)
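One straightforward improvement is to add a few more heuristic filters, for example dropping very short fragments or sentences that do not end with sentence-final punctuation. The helper and thresholds below are my own arbitrary assumptions, just to illustrate the idea:

# Sketch: additional heuristic filters on the extracted sentences.
def filter_sentences(sentences, min_words=5):
    # Drop very short fragments (threshold is arbitrary).
    filtered = [s for s in sentences if len(s.split()) >= min_words]
    # Keep only strings that end like a complete sentence.
    filtered = [s for s in filtered if s.strip().endswith((".", "!", "?"))]
    return filtered

audi22_sent_filtered = filter_sentences(audi22_sent)
print(len(audi22_sent), "->", len(audi22_sent_filtered), "sentences after filtering")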

Cool, the first major step is done. The second step is to classify all sentences with the “environmental” model. Then, we classify all “environmental” sentences with the “action” model. We encapsulate these steps in a function; the individual steps are explained in Tutorial 1.

# Function that takes a report, performs PDF to sentence, and then classifies first "environmental", then "action" on the "environmental" sentences.
def classify(path, pipe_env, pipe_act):
    # Get sentences.
    print(f"\nSentence Extraction for {path}")
    sentences = PDFtoSentence(path)
    # Classify environmental.
    print(f"\nClassify environmental for {path}")
    # The batch size helps to handle the texts in parallel. If there are "out of memory" errors, decrease the batch size.
    classifications = pipe_env(sentences, padding=True, truncation=True, batch_size=16)
    # We only want the labels, so we take the "label" entry of the output dicts.
    labels_only = [x["label"] for x in classifications]
    # Create a DataFrame with sentence and label.
    df = pd.DataFrame({"text": sentences, "environmental": labels_only})

    # Take only environmental sentences and classify them.
    print(f"\nClassify action for {path}")
    df_env = df.loc[df["environmental"] == "environmental"].copy()
    # The batch size helps to handle the texts in parallel. If there are "out of memory" errors, decrease the batch size.
    classifications_act = pipe_act(df_env.text.to_list(), padding=True, truncation=True, batch_size=16)
    df_env["action"] = [x["label"] for x in classifications_act]

    # Combine action with all data (the join matches on the index).
    # Only take the "action" column of df_env so that "text" and "environmental" are not duplicated.
    df_all = df.join(df_env[["action"]])

    return df_all

Again, try it out and see the results. It looks quite cool already.

# Look at AUDI 22. (takes around a minute)
df_audi = classify(audi22, pipe_env, pipe_act)
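To get a feeling for the output, you can inspect the resulting DataFrame, for example by looking at the first rows and the label distributions:

# Inspect the combined DataFrame.
print(df_audi.head())
# How often does each label occur? (NaN in "action" means the sentence was not classified as environmental.)
print(df_audi["environmental"].value_counts())
print(df_audi["action"].value_counts(dropna=False))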

Two interesting questions would be: How much does the company talk about “environmental” topics in general? How much does it talk about “environmental actions”? Let’s see.

# How many sentences are about environmental topics?
env_pct = df_audi[df_audi["environmental"] == "environmental"].shape[0] / df_audi.shape[0]
print(f"{env_pct:.2%} of the sentences address environmental topics.")
# How many sentences are about environmental action?
envact_pct = df_audi[df_audi["action"] == "action"].shape[0] / df_audi.shape[0]
print(f"{envact_pct:.2%} of the sentences address environmental actions.")

Interesting output.

Let’s compare this with the other companies. For this, let’s create a pipeline where we can run all reports. I’ll do it very naively with a for-loop.

# Run all reports (takes around 5 min).
# Store environmental and environmental action percentages.
env_pcts, envact_pcts, dfs = [], [], []
for rep in reports:
    df = classify(rep, pipe_env, pipe_act)
    df["path"] = rep
    # Calculate and store pcts.
    env_pct = df[df["environmental"] == "environmental"].shape[0] / df.shape[0]
    envact_pct = df[df["action"] == "action"].shape[0] / df.shape[0]
    env_pcts.append(env_pct)
    envact_pcts.append(envact_pct)
    # Store all outputs in the dfs list.
    dfs.append(df)
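Since the full run takes a few minutes, it can be handy to persist the sentence-level results so you do not have to re-run everything for further analyses. A minimal sketch, assuming a single CSV file is fine for you (the filename is made up):

# Sketch: combine all sentence-level results and save them to disk.
df_combined = pd.concat(dfs, ignore_index=True)
df_combined.to_csv("esg_classifications.csv", index=False)  # hypothetical filename
# Later, reload with: pd.read_csv("esg_classifications.csv")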

Let’s create a visualization for this. To be honest with you, I would always use ChatGPT to create these nowadays. It just works well, and there is no need to learn complicated matplotlib details. :)

# Visualize the ratios for the companies.
# Sample data
categories = ['Audi22', 'BlackRock', 'Aramco', 'Telekom']
values1 = env_pcts
values2 = envact_pcts

# Set the bar width
bar_width = 0.35

# Set the positions of the bars on the x-axis
r1 = np.arange(len(categories))
r2 = [x + bar_width for x in r1]

# Create the grouped bar plot
plt.bar(r1, values1, color='darkgreen', width=bar_width, edgecolor='darkgreen', label='Environmental')
plt.bar(r2, values2, color='red', width=bar_width, edgecolor='grey', label='Environmental Action')

# Add labels, title, and legend
#plt.xlabel('Nature categories')
plt.ylabel('Share of sentences')
plt.xticks([r + bar_width/2 for r in range(len(categories))], categories)
#plt.title('Grouped Bar Plot')
plt.legend()

# Show the plot
plt.show()

The output looks something like this:

The interpretation of the results and their underlying reasons lie beyond the scope of this tutorial. I think it’s very interesting to see some patterns emerge. Feel free to try out more.

Caveat: If you take a closer look at the results, you might find that some sentences are not actually about the environment or about actions. This may be due to biases or shortcomings in the training data. No system is perfect. Specifically, the “action” dataset only contains 500 entries. Thus, fine-tuning your own models on extended datasets might be a good strategy. Since I open-source all datasets on HuggingFace, you can use mine as a basis for further extensions.
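One simple way to spot questionable classifications is to look at the model’s confidence scores and manually review the sentences it is least sure about. This is just a small sketch on top of the pipelines we already built; the 0.6 threshold is an arbitrary assumption:

# Sketch: find environmental classifications with low confidence for manual review.
env_sentences = df_audi.loc[df_audi["environmental"] == "environmental", "text"].to_list()
scored = pipe_env(env_sentences, padding=True, truncation=True, batch_size=16)
for sent, res in zip(env_sentences, scored):
    if res["score"] < 0.6:  # arbitrary threshold
        print(f"{res['score']:.2f} | {sent}")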

I will address fine-tuning of models in the next tutorial. It’s actually quite straightforward. Until then, feel free to read the paper on the ESG BERTs or follow me on LinkedIn to not miss out on new developments.

The full executable code is also in this notebook.

Thanks for reading. Tobi :)
