Analyzing ESG with AI and NLP (Tutorial#1): Report Analysis Towards ESG Risks and Opportunities

Tobias Schimanski
Nov 11, 2023


ESG, AI, and Natural Language Processing (NLP) — these are easily among the most used words to signal the hottest emerging topics that should shape our future. This is the first in a series of tutorials showing you how you can use AI to analyze ESG — even if you have no prior knowledge. Tutorial 2 is here.

Follow me on LinkedIn to stay updated on all developments.

Today, we are looking at how we can analyze annual reports for their ESG communication and whether a company perceives ESG as a risk or an opportunity.

I’m Tobias Schimanski (Tobi for you), a doctoral researcher at the University of Zurich and an affiliated researcher at the University of Oxford. I develop NLP systems for sustainable finance, like the ESG models we discuss today (the paper explaining everything in detail is here). These systems are very easy to use, and that is why I want to share them with you.

Let’s get right into it:

For Beginners (or practical people): Besides this article, you can run the tutorial code in this Google Colab notebook. There, you can simply run everything without any problems.

Our first target is to analyze an annual report on its ESG communication. Thus, we start by loading the models: one each for environmental, social, and governance. There’s nothing much to understand here besides that you simply download these models from HuggingFace.

### MAKE SURE TO INSTALL THIS LIB: !pip install transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline # for using the models

### Load the models (takes ca. 1 min)
# Environmental model.
name = "ESGBERT/EnvironmentalBERT-environmental" # path to download from HuggingFace
# In simple words, the tokenizer prepares the text for the model and the model classifies the text.
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
# The pipeline combines tokenizer and model to one process.
pipe_env = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Also load the social and governance model.
# Social model.
name = "ESGBERT/SocialBERT-social"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_soc = pipeline("text-classification", model=model, tokenizer=tokenizer)

# Governance model.
name = "ESGBERT/GovernanceBERT-governance"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
pipe_gov = pipeline("text-classification", model=model, tokenizer=tokenizer)

The pipeline objects allow us to just plug in one or multiple sentences and see the results. Try it out yourself:

# You can input single sentences or arrays of sentences into the pipeline.
sentences_test = ["Besides financial considerations, we also consider harms to the biodiversity and broader ecosystem impacts.",
"Tokenization is used in natural language processing to split paragraphs and sentences into smaller units that can be more easily assigned meaning."]
test = pipe_env(sentences_test)
print(test)
# [{'label': 'environmental', 'score': 0.994878888130188},
# {'label': 'none', 'score': 0.997612714767456}]

As you can see, you will get a label as well as a model confidence for every input sentence. Nice! You can classify any sentence to be E, S, or G — simple as that.
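For example, here is a minimal sketch (assuming the three pipelines loaded above) that sends the same sentence through all three models; the example sentence is made up for illustration:

# Run one sentence through all three pipelines (pipe_env, pipe_soc, pipe_gov from above).
sentence = "Our board of directors oversees the reduction of our carbon footprint."
for dimension, pipe in [("E", pipe_env), ("S", pipe_soc), ("G", pipe_gov)]:
    print(dimension, pipe(sentence))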

Next step: load an annual report and split it into single sentences. We use the parser module from the Python package “tika” as a PDF-to-text converter.

### MAKE SURE TO USE: !pip install tika
from tika import parser

# We use the Audi annual report to analyze in this example.
path = "https://www.audi.com/content/dam/gbp2/downloads/report/annual-reports/2022/en/audi-report-2022.pdf"

# The from_file() function of tika helps us to load the content of the document.
# (takes ca. 30 sec)
report = parser.from_file(path)

If you want to look at the raw text, you can use the following code. You can see that it’s still very messy.

# Have a look at the raw content extracted from the PDF.
print(report["content"])

To split this text into sentences, we could use some rules (like splitting at a “.”), but fortunately, people already thought about this! Specifically, the people behind “spacy”.

### MAKE SURE TO INSTALL THIS LIB: !pip install spacy
# In Colab, the English model is usually pre-installed; otherwise run:
# !python -m spacy download en_core_web_sm
import spacy

# Load spacy's small English pipeline and run it over the report text. (takes ca. 20 secs)
nlp = spacy.load('en_core_web_sm')
about_doc = nlp(report["content"])

# We transfer the sequences ("about_doc.sents") to a list of raw strings.
sequences = list(map(str, about_doc.sents))
# Look at the first 10 text sequences.
print(sequences[:10])

# ['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nre\np\n\no\nrt\n\nCombined Annual and \nSustainability Report\n\n2022\n\n\n\n2 Audi Report 2022\n\nProducts & \nServices\n\nValue Creation & \nProduction\n\nOperations &\nIntegrity\n\nStrategy\nEmployees & \nSociety\n\nIntroduction Appendix\n\nWithout a question: 2022 was a challenging year.',
# 'A \nyear shaped by a difficult situation in the global econ-\nomy, sharply rising energy prices and continued supply \nshortages.',
# ...]

You will discover that this doesn’t look too good at the moment. But we already have sentences. So let’s use some simple techniques to get cleaner output.

# "\n" signals a new line. We remove this so that the output looks better.
sentences = [x.replace("\n", "") for x in sequences]

# Remove all empty strings, i.e. values that are "".
sentences = [x for x in sentences if x != ""]

# A sentence should start with upper case.
sentences = [x for x in sentences if x[0].isupper()]

print(sentences[:10])

# ['A year shaped by a difficult situation in the global econ-omy, sharply rising energy prices and continued supply shortages.',
# ...]

Better. Well, actually, really nice! These few steps are enough to get a clean set of almost all (roughly 95% of) the sentences in the report. Of course, better techniques should be used if you really want every single sentence.
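One simple extra heuristic, as a sketch (not part of the pipeline above; the 5-word threshold is an arbitrary choice), is to drop very short fragments, which are often left-over headings or page artifacts:

# Optional: drop very short fragments (often headings or page artifacts).
# The 5-word threshold is an arbitrary choice; tune it for your documents.
sentences_clean = [x for x in sentences if len(x.split()) >= 5]
print(len(sentences), "->", len(sentences_clean))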

We have the models, we have the sentences, so let’s run the models on the sentences! But before we get all too hyped, I have to admit something: I did not yet introduce GPUs. These are basically processing units that allow us to analyze a report in seconds instead of minutes. However, this is a topic for the next tutorial. Thus, we only use the first 100 sentences of the report here.
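(If you already have a GPU at hand, here is a tiny teaser, just a sketch: the transformers pipeline accepts a device argument. The name pipe_env_gpu is only for illustration; below, we stick to the CPU pipeline.)

# Teaser for the next tutorial: run the pipeline on a GPU if one is available.
import torch
device = 0 if torch.cuda.is_available() else -1  # -1 = CPU, 0 = first GPU
pipe_env_gpu = pipeline("text-classification", model="ESGBERT/EnvironmentalBERT-environmental", device=device)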

# Classify a subset of the sentences.
# The padding and truncation parameters help us with classifying texts of different length.
sub_sentences = sentences[:100] # takes around 20 seconds
# The full set of sentences takes around 5 min WITHOUT a GPU (see next tutorial for speed-ups).
env = pipe_env(sub_sentences, padding=True, truncation=True)

# You might only want the labels.
env_labels = [x["label"] for x in env]

To look at the data in a convenient way, we use pandas DataFrames (one of the most popular tools in Python).

import pandas as pd

# Let's look at the results. We use a dataframe for this purpose.
data_env = pd.DataFrame({"sentence": sub_sentences, "environmental": env_labels})
# Which sentences are labeled as environmental?
data_env[data_env["environmental"] == "environmental"]

This allows us to simply create cool visualizations.

# It could also be interesting to look at the proportion of environmental sentences.
print(data_env.groupby("environmental").count())
data_env.groupby("environmental").count().plot(kind="bar")

Cool! We analyzed (parts of) the report towards E(SG)! The logic displayed here extends to all dimensions of E, S, and G (as sketched below) and to multiple reports (see the next tutorial).
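A sketch of what that extension could look like, assuming the pipe_soc and pipe_gov pipelines from above: classify the same subset with the social and governance models and collect all labels in one DataFrame.

# Sketch: add social and governance labels for the same subset of sentences.
soc = pipe_soc(sub_sentences, padding=True, truncation=True)
gov = pipe_gov(sub_sentences, padding=True, truncation=True)
data_esg = pd.DataFrame({"sentence": sub_sentences,
                         "environmental": env_labels,
                         "social": [x["label"] for x in soc],
                         "governance": [x["label"] for x in gov]})
data_esg.head()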

Knowing the E-communication of a company is nice, but we can do more. Does the company perceive E as a risk, an opportunity, or neutral?

That’s just as simple to find out. There are models for this in the ClimateBERT project, and we use the same logic as above:

Load the model.

# To load the model, we use the exact same steps as above.
model_name = "climatebert/distilroberta-base-climate-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, max_len=512)
pipe_sent = pipeline("text-classification", model=model, tokenizer=tokenizer)

Classify the data.

# Use the model on the dataset.
sentiment = pipe_sent(sub_sentences, padding=True, truncation=True)
# Add the sentiment to the DataFrame.
data_env["sentiment"] = [x["label"] for x in sentiment]

Have a look.

# Again, have a look at the outcome.
data_env[data_env["environmental"] == "environmental"]
# Audi seems to display an enthusiastic, pioneering spirit towards sustainability. This positive attitude is reflected as "opportunity".

And visualize (how many environmental sentences are labeled as risk, neutral, or opportunity?).

# Let's plot the distribution again.
print(data_env[data_env["environmental"] == "environmental"].groupby("sentiment").count()["environmental"])
data_env[data_env["environmental"] == "environmental"].groupby("sentiment").count()["environmental"].plot(kind="bar")

Looking at the text and diagrams, we see that Audi perceives environmental issues as an opportunity and not so much as a risk. This is mainly driven by their aim to pioneer electric vehicles and reduce emissions. Funny enough, I used to work for Audi, and I think they shouldn’t be quite as optimistic (but that’s another story).

Conclusion: Today, we have seen how to analyze any annual report towards ESG and whether a company perceives it as a risk or an opportunity. Next time, we will look at speeding up the process and producing these results at scale (the link will be here and on my LinkedIn).

The full executable code is also in this notebook.

The full paper with details on training and links to data and models is here.

Thanks for reading. Tobi :)
