Sustainable Development Goals Country Scores

Jonathan Legrand
LSEG Developer Community
11 min read · Jun 8, 2020

Environmental, Social and Governance (ESG) data is difficult to come by. It is also becoming critical for effective investment analysis. It helps you assess the risks — and opportunities — posed by companies’ performance in critical areas such as climate change, executive remuneration, and diversity and inclusion. But a definite lack of transparency and standardization in such reporting presents major challenges for investors.

This article attempts to lay a framework to allow any investor/agent to collect, analyse and gather insight into countries’ ESG metrics at granular and macro-levels. It reflects the DataStream Sustainable Development Goals Country Scores Excel capability.

The file ‘ESG-DS.csv’ will be needed to collect individual country series codes from our ESG database. It is available on the Refinitiv Portal.

At the end of this exercise, we will get an interactive graph ranking countries by their ESG scores, looking like this:

Pre-Requisites

Required Python Packages: DataStream Web Services (DSWS), Numpy, Pandas, and Plotly.

Supplementary

pickle: If one wishes to copy and manipulate this code, ‘pickling’ data along the way helps ensure that no data is lost in case of kernel issues.

Get Coding

Import Libraries

The math, statistics, datetime, numpy and pandas libraries are needed to manipulate dates and datasets and to perform statistical and mathematical operations on them.

import math
import statistics
from datetime import date
import numpy as np
import pandas as pd
# This line will ensure that all columns of our data-frames are always shown:
pd.set_option('display.max_columns', None)
# openpyxl is needed behind the scenes to export data-frames to Excel files.
# It doesn't need to be used directly and can be installed via ' pip install openpyxl '

pickle is only used here to record data so that it doesn’t have to be collected every time this code is run (if it is run several times):

# pickle is part of Python's standard library, so no installation is needed
import pickle

Plotly allows us to create graphs.

import plotly.graph_objects as go

Refinitiv’s DataStream Web Services (DSWS) gives access to 242 detailed indicators that feed into each of the 17 Sustainable Development Goals (SDGs). Their code numbers may be found and detailed in the attached zip document. A bottom-up approach was taken to map each of the 242 indicators outlined in the framework against the 8.5 million active economic indicators available in Datastream. The 17 SDGs are:

GOAL 1: No Poverty

GOAL 2: Zero Hunger

GOAL 3: Good Health and Well-being

GOAL 4: Quality Education

GOAL 5: Gender Equality

GOAL 6: Clean Water and Sanitation

GOAL 7: Affordable and Clean Energy

GOAL 8: Decent Work and Economic Growth

GOAL 9: Industry, Innovation and Infrastructure

GOAL 10: Reduced Inequality

GOAL 11: Sustainable Cities and Communities

GOAL 12: Responsible Consumption and Production

GOAL 13: Climate Action

GOAL 14: Life Below Water

GOAL 15: Life on Land

GOAL 16: Peace, Justice and Strong Institutions

GOAL 17: Partnerships to achieve the Goal

We can access DSWS via the Python library “DatastreamDSWS” that can be installed simply by using pip install.

import DatastreamDSWS as DSWS

# We can use our Refinitiv DataStream Web Services (DSWS) API credentials, which allow us to be
# identified by Refinitiv's back-end services and enable us to request (and fetch) data.
# Credentials are placed in text files so that they may be used in this code without being shown in it.
DSWS_username = open("Datastream_username.txt", "r")
DSWS_password = open("Datastream_password.txt", "r")
ds = DSWS.Datastream(username = str(DSWS_username.read()), password = str(DSWS_password.read()))

# It is best to close the files we opened in order to make sure that we don't stop any other
# services/programs from accessing them if they need to.
DSWS_username.close()
DSWS_password.close()
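As an aside, the same can be achieved with Python’s ‘with’ context managers, which close the files automatically even if an error occurs. This is a minimal alternative sketch, not part of the original workflow:

with open("Datastream_username.txt", "r") as username_file:
    DSWS_username_str = username_file.read() # the file is closed automatically when the block exits
with open("Datastream_password.txt", "r") as password_file:
    DSWS_password_str = password_file.read()
ds = DSWS.Datastream(username = DSWS_username_str, password = DSWS_password_str)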

For full replication, note the versions of the libraries used:

import sys # ' sys ' is only needed to display our Python version
print("This code is running on Python version " + sys.version[0:5])

This code is running on Python version 3.7.4

import openpyxl # imported here only so that we can display its version number

for i, j in zip(["np", "pd", "openpyxl"], [np, pd, openpyxl]):
    print("The " + str(i) + " library imported in this code is version: " + j.__version__)

The ‘datetime’ library is built into Python, so it does not have a specific version number.

from datetime import date

Setup Functions

We rate each ESG category on a scale from 1 to 10 as per the function defined below:

def Point(value, # ' value ' as an integer or float between 0 and 1 (inclusive)
          polarity = "Positive"): # ' polarity ' informs us if we are grading the value as 'higher is better' (i.e.: positively polarised) or 'lower is better'

    if math.isnan(value):
        # This branch captures the eventuality when we don't have a value passed through this function
        result = np.nan
    elif value >= 0.9:
        result = 10
    elif value >= 0.8:
        result = 9
    elif value >= 0.7:
        result = 8
    elif value >= 0.6:
        result = 7
    elif value >= 0.5:
        result = 6
    elif value >= 0.4:
        result = 5
    elif value >= 0.3:
        result = 4
    elif value >= 0.2:
        result = 3
    elif value >= 0.1:
        result = 2
    elif value >= 0.0:
        result = 1

    if polarity == "Positive":
        # The function this far assumes positive polarity
        return result
    elif polarity == "Negative":
        # We now can look into negatively polarised data
        if math.isnan(value):
            return result
        elif result <= 2:
            return 10
        elif value >= 1:
            return 1
        else:
            return 12 - result
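As a quick illustration of the function’s behaviour: a value of 0.25 lands in the ‘>= 0.2’ bracket and scores 3 under positive polarity, while under negative polarity it maps to 12 - 3 = 9. The outputs below follow directly from the rules above:

print(Point(0.25))                         # 3: 0.25 falls in the ' >= 0.2 ' bracket
print(Point(0.25, polarity = "Negative"))  # 9: negatively polarised, i.e.: 12 - 3
print(Point(0.05, polarity = "Negative"))  # 10: result <= 2, so best possible score
print(Point(float("nan")))                 # nan: missing values are passed through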

Collect Data

We first need to collect the individual country series codes with which to ping DataStream. Since there are a great many of them, we collect them from the comma-separated values (csv) file ‘ESG-DS.csv’, which can be found on GitHub.

# This file includes polarity information as well as series codes; the ' .iloc[:,6:] ' below ensures that we only collect the latter.
df = pd.read_csv("ESG-DS.csv", header = 1, index_col = 6).iloc[:,6:]
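As a quick sanity check (illustrative; it assumes the csv layout described above), the resulting data-frame should hold one row per country and one column per ESG category series:

print(df.shape) # (number of countries, number of ESG category series)
df.head()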

Certain metrics have to be manipulated for them to be comparable to all others. We thus add columns for these additional metrics in the list of columns ‘Country_Values_Table_Columns’

Country_Values_Table_Columns = ['1.1.1', '1.1.1.1', '1.2.1', '1.2.2', '1.3.1', '1.5.3', '1.A.1', '1.A.2', '1.A.2.1', '1.B.1',
'2.1.1', '2.1.2', '2.2.1', '2.2.2', '2.2.2.1', '2.A.1', '2.A.2',
'3.1.1', '3.1.2', '3.2.1', '3.2.2', '3.3.1', '3.3.1.1', '3.3.1.2', '3.3.2', '3.7.2', '3.8.2', '3.9.2', '3.A.1', '3.A.1.1', '3.C.1',
'4.1.1', '4.1.1.1', '4.1.1.2', '4.1.1.3', '4.1.1.4', '4.1.1.5', '4.4.1', '4.5.1', '4.6.1', '4.6.1.1', '4.6.1.2', '4.6.1.3', '4.6.1.4', '4.6.1.5', '4.A.1',
'5.1.1', '5.5.1', '5.6.1', '5.6.1.1', '5.B.1', '5.B.1.1', '5.B.1.2',
'6.1.1', '6.2.1', '6.4.2',
'7.1.1', '7.1.2', '7.2.1', '7.2.1.1', '7.3.1', '7.3.1.1', '7.3.1.2',
'8.1.1', '8.2.1', '8.2.1.1', '8.2.1.2', '8.2.1.3', '8.5.2', '8.5.2.1', '8.5.2.2', '8.5.2.3', '8.5.2.4', '8.5.2.5', '8.6.1', '8.6.1.1', '8.6.1.2', '8.7.1', '8.7.1.1', '8.7.1.2', '8.7.1.3', '8.7.1.4', '8.7.1.5', '8.7.1.6', '8.7.1.7', '8.7.1.8', '8.8.2', '8.10.1', '8.10.1.1', '8.B.1',
'9.1.2.', '9.1.2..1', '9.1.2..2', '9.2.1.', '9.2.1..1', '9.2.1..2', '9.2.1..3', '9.2.2.', '9.4.1', '9.5.1', '9.5.2', '9.C.1',
'10.1.1', '10.1.1.1', '10.2.1', '10.3.1', '10.3.1.1', '10.3.1.2', '10.3.1.3', '10.4.1', '10.5.1', '10.5.1.1', '10.5.1.2',
'11.6.2', '11.6.2.1',
'12.7.1', '12.C.1',
'13.2.1', '13.2.1.1', # There indeed is no 14th category
'15.1.1', '15.1.2', '15.1.2.1', '15.1.2.2', '15.2.1', '15.3.1', '15.5.1', '15.9.1', '15.9.1.1', '15.9.1.2', '15.9.1.3', '15.A.1', '15.A.1.1', '15.B.1', '15.B.1.1',
'16.1.1', '16.1.2', '16.1.2.1', '16.1.2.2', '16.2.2', '16.4.1', '16.4.2', '16.5.1', '16.5.1.1', '16.5.2', '16.5.2.1', '16.6.2', '16.9.1', '16.10.2', '16.A.1', '16.B.1', '16.B.1.1',
'17.1.1', '17.1.1.1', '17.1.1.2', '17.1.1.3', '17.3.1', '17.4.1', '17.6.2', '17.8.1']

Below we can see the discrepancies in columns:

# yields the elements in `Country_Values_Table_Columns` that are NOT in `df.columns`.
columns_in_Country_Values_Table_not_in_df = list(np.setdiff1d(Country_Values_Table_Columns, df.columns))
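For reference, ‘np.setdiff1d’ returns the sorted unique values of its first argument that are missing from its second; a toy example:

print(np.setdiff1d(["1.1.1", "5.B.1.2", "7.3.1.2"], ["1.1.1"])) # ['5.B.1.2' '7.3.1.2']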

Now we can collect our data from DSWS

We could call for data for each of the (211 × 153 =) 32,283 codes individually, but that would be extremely time consuming. Instead, we will call for data in batches of 50 codes, the maximum batch size that DSWS accepts per request.

# First: create a pandas data-frame to be populated with our data.
Country_Values_Table = pd.DataFrame(index = df.index, columns = Country_Values_Table_Columns)

# Second: collect data in batches:
for i in range(len(df.columns)):

    # List and placeholder to be populated:
    data_points = [] # List to be populated
    count = 0 # Dynamic placeholder

    # For each set of 50 countries (or fewer, for the last batch).
    # The ' math.ceil( ' is there to make sure we run through and collect DSWS data for the last batch of codes even if it sums up to fewer than 50 codes.
    for times in range(math.ceil(len(df.index) / 50)):

        # Create a string of the 50 (or fewer) codes to pull from DSWS in one go:
        codes = str( # Change the following list into a string
            list( # Change the following pandas data-frame into a list
                df.iloc[count:(count + 50), # From the data-frame ' df ', choose the rows that start at the multiple of 50 we left off at and end 50 rows later
                        i] # From the same ' df ', choose only column ' i ', i.e.: go from ESG category to ESG category
            )).replace("[", "").replace("]", "").replace("'", "") # Finally: remove the square brackets and (single) inverted commas left over from the list 'grammar'

        sum_placeholder = len(data_points) + len(codes.split(","))

        # Collect data from DSWS for the ' codes ' list of country ESG category codes
        batch = ds.get_data(tickers = codes, fields = "X", start = '2000-01-01', freq = 'Y')

        # Now we can collect the last populated value for each country's ESG category called from DSWS
        for k in batch.columns:
            try:
                data_points.append(float(batch[k].dropna().iloc[-1]))
            except:
                data_points.append(np.nan)

        count += 50

        # Not all countries report for all ESG categories.
        # The following if statement accounts for when some of the countries encapsulated
        # in ' data_points ' do not report for the ESG category ' i ' in our loop.
        if sum_placeholder != len(data_points):
            for times in range(sum_placeholder - len(data_points)):
                data_points.append(np.nan)

    Country_Values_Table[df.columns[i]] = data_points

Pickle

It is quite time consuming to request all these codes from DSWS. Let’s save our progress thus far using pickle:

pickle_out = open("ESG-DS.pickle","wb")
pickl = (df, Country_Values_Table_Columns,
columns_in_Country_Values_Table_not_in_df,
Country_Values_Table)
pickle.dump(pickl, pickle_out)
pickle_out.close()

The cell below can be run to load these variables back into the kernel:

# pickle_in = open("ESG-DS.pickle","rb")
# df, Country_Values_Table_Columns, columns_in_Country_Values_Table_not_in_df, Country_Values_Table = pickle.load(pickle_in)
# pickle_in.close() # We ought to close the file we opened to allow any other programs access if they need it.

Sorting Out Our Data

The use of ‘ Country_Values_Table2 ‘ is only there to delineate between before and after our ‘pickling’:

Country_Values_Table2 = Country_Values_Table.copy()

Certain metrics have to be manipulated for them to be comparable to all others:

# Going through the mathematical manipulation:
Country_Values_Table2["5.B.1.2"] = Country_Values_Table2["5.B.1"] / Country_Values_Table2["5.B.1.1"]
# Removing unnecessary columns:
Country_Values_Table2 = Country_Values_Table2.drop(columns = ["5.B.1", "5.B.1.1"])
Country_Values_Table2["7.3.1.2"] = Country_Values_Table2["7.3.1"] / Country_Values_Table2["7.3.1.1"]
Country_Values_Table2 = Country_Values_Table2.drop(columns = ["7.3.1", "7.3.1.1"])
Country_Values_Table2["8.2.1.3"] = Country_Values_Table2["8.2.1.2"] / (Country_Values_Table2["8.2.1"] * Country_Values_Table2["8.2.1.1"])
Country_Values_Table2 = Country_Values_Table2.drop(columns = ["8.2.1.2", "8.2.1" , "8.2.1.1"])
Country_Values_Table2["9.2.1..3"] = Country_Values_Table2["9.2.1..1"] / Country_Values_Table2["9.2.1..2"]
Country_Values_Table2 = Country_Values_Table2.drop(columns = ["9.2.1..1", "9.2.1..2"])
Country_Values_Table2["16.1.2.2"] = Country_Values_Table2["16.1.2"] / Country_Values_Table2["16.1.2.1"]
Country_Values_Table2 = Country_Values_Table2.drop(columns = ["16.1.2", "16.1.2.1"])
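Most of the manipulations above follow the same ‘divide, then drop the sources’ pattern; a hypothetical helper like the sketch below (the names are ours, not the original article’s) could express that pattern once:

def ratio_and_drop(table, numerator, denominator, target):
    # Hypothetical helper: write ' numerator / denominator ' into the ' target '
    # column, then drop the two source columns.
    table[target] = table[numerator] / table[denominator]
    return table.drop(columns = [numerator, denominator])

# e.g.: Country_Values_Table2 = ratio_and_drop(Country_Values_Table2, "16.1.2", "16.1.2.1", "16.1.2.2")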

Country Points Table

We can now apply the points system defined above to our data in ‘ Country_Values_Table2 ‘:

Country_Points_Table1 = pd.DataFrame(index = df.index)
# NOTE THAT THIS IS NOT THE SAME RANKING AS IN THE EXCEL SHEET, BUT IT IS A MORE MATHEMATICALLY COMMON AND LESS AMBIGUOUS ONE
for j in range(len(Country_Values_Table2.columns)):
    Country_Points_Table1[Country_Values_Table2.columns[j]] = list(Country_Values_Table2.iloc[:,j].rank(method = "dense",
                                                                                                        na_option = "keep",
                                                                                                        pct = True))
Country_Points_Table1.head()

Country_Points_Table2 = pd.DataFrame(index = Country_Values_Table2.index)
for j in range(len(Country_Values_Table2.columns)):
    Country_Points_Table2_column_j = [] # Create a list to be populated
    for i in range(len(Country_Values_Table2.index)):

        if math.isnan(Country_Values_Table2.iloc[i,j]):
            # Accounting for the possibility that we did not collect a value
            val = np.nan

        else:
            # The following 3 lines recreate Excel's ' PERCENTRANK(...) ' function, which is different to
            # median percentage, percentile, ' scipy.stats.percentileofscore ' and ' pd.DataFrame.rank(pct = True) '.
            # Note also that values might differ slightly due to (back-end) rounding errors in Excel.
            array_of_lower_vals = Country_Values_Table2.iloc[:,j][Country_Values_Table2.iloc[:,j] < Country_Values_Table2.iloc[i,j]]
            column_len_no_na = len(Country_Values_Table2.iloc[:,j].dropna()) - 1
            val = len(array_of_lower_vals) / column_len_no_na

        Country_Points_Table2_column_j.append(val)

    Country_Points_Table2[Country_Values_Table2.columns[j]] = Country_Points_Table2_column_j
Country_Points_Table2.head()
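A quick toy check of this PERCENTRANK recreation (illustrative only): in a column whose non-NaN values are 0.1, 0.2, 0.3 and 0.4, the value 0.3 has two strictly lower entries out of a possible three, giving 2/3:

toy = pd.Series([0.1, 0.2, 0.3, 0.4, np.nan])
lower = toy[toy < 0.3]      # 2 strictly lower values
n = len(toy.dropna()) - 1   # 3: the 4 non-NaN values, minus one
print(len(lower) / n)       # 0.666..., matching Excel's PERCENTRANK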

Polarity

Certain data-points are better when lower (e.g.: poverty levels), others are better when higher (e.g.: availability of affordable and clean energy). We thus use a polarity array to denote which rule applies to each series.

# You will need the ' ESG-DS.csv ' file
Polarity = pd.read_csv("ESG-DS.csv", header = 1).iloc[:,1:3].dropna()
Polarity

# Lists to be populated.
Negatives, Positives = [], []

# Copy of the ' Polarity ' table to delineate 'before' and 'after' this point in the code
Polarity_temp = Polarity.copy()
for i in range(len(Country_Points_Table2.columns)):
    for j in range(len(Polarity_temp["Country Points Table full"])):
        if Country_Points_Table2.columns[i].split(".")[0:3] == Polarity_temp["Country Points Table full"][j].split(".")[0:3]:
            # For each ESG category in ' Country_Points_Table2 ', find this category in ' Polarity_temp '
            # and save the placement of the specified polarity of that category.
            if Polarity_temp.polar[j] == "Negative":
                Negatives.append(j)
            if Polarity_temp.polar[j] == "Positive":
                Positives.append(j)
            # Once that placement is saved, replace the ' Polarity_temp ' entry so that it isn't mistakenly used again
            Polarity_temp["Country Points Table full"][j] = "u.s.e.d"
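As an illustrative sanity check, the two lists together should account for every column that found a polarity match:

# Every matched category should appear in exactly one of the two lists:
print("Negative:", len(Negatives), "| Positive:", len(Positives),
      "| Columns:", len(Country_Points_Table2.columns))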

Polarised Table

We may now apply polarity rules to our data-points

# Create a data-frame to be populated
Country_Polarised_Points_Table1 = pd.DataFrame(index = Country_Points_Table2.index,
                                               columns = Country_Points_Table2.columns)
for k in [[Negatives, "Negative"], [Positives, "Positive"]]:
    for j in k[0]:
        Country_Polarised_Points_Table1_column_j = []
        for i in range(len(Country_Points_Table2.index)):
            Country_Polarised_Points_Table1_column_j.append(Point(Country_Points_Table2.iloc[i,j], polarity = k[1]))
        Country_Polarised_Points_Table1[Country_Points_Table2.columns[j]] = Country_Polarised_Points_Table1_column_j

Note that not all the individual country series codes are retrievable (e.g.: ‘AAGFORVO’)

SDG Aggregate Ranks

# Create a data-frame to be populated
SDG_Aggregate_Ranks = pd.DataFrame({("No Poverty", "Scores Available") : list(np.full(len(Country_Polarised_Points_Table1.index), np.nan))},
                                   index = Country_Polarised_Points_Table1.index)
for j,k,m in zip(range(1,18),
                 ["No Poverty", "Zero Hunger", "Good Healthcare and Wellbeing", "Quality of Education",
                  "Gender Equality", "Clean Water and Sanitation", "Affordable and Clean Energy", "Decent Work and Economic Growth",
                  "Industry, Innovation and Infrastructure", "Reduced Inequalities", "Sustainable Cities and Communities", "Responsible Consumption",
                  "Climate Action", "Life Below Water", "Life on Land", "Peace, Justice and Strong Institutions", "Partnerships for the Goals"],
                 [3,3,3,5,
                  2,1,1,7,
                  3,4,1,0,
                  0,0,5,5,3]):

    # Create lists to be populated:
    col, SDG_Aggregate_Ranks_col1_j, SDG_Aggregate_Ranks_col2_j = [], [], []
    for i in range(len(Country_Polarised_Points_Table1.columns)):
        if Country_Polarised_Points_Table1.columns[i].split(".")[0:3][0] == str(j):
            # If the first part of the category code (delimited by a full stop '.') matches
            # the goal number ' j ', the column belongs to that SDG. Here we collect the columns for each goal in turn.
            col.append(i)

    for i in range(len(Country_Polarised_Points_Table1.iloc[:,col])):
        # For each goal, we tally up the number of observations we have
        SDG_Aggregate_Ranks_col1_j.append(str(len(Country_Polarised_Points_Table1.iloc[i,col].dropna())) + "/" +
                                          str(len(Country_Polarised_Points_Table1.iloc[i,col])))

        # It was decided that only if enough records are found should we consider
        # the median score for a country's ESG category to contain significant insight:
        if len(Country_Polarised_Points_Table1.iloc[i,col].dropna()) > m:
            SDG_Aggregate_Ranks_col2_j.append(Country_Polarised_Points_Table1.iloc[i,col].median())
        else:
            SDG_Aggregate_Ranks_col2_j.append("insufficient scores (<" + str(m+1) + ")")

    SDG_Aggregate_Ranks[(k, "Scores Available")] = SDG_Aggregate_Ranks_col1_j
    SDG_Aggregate_Ranks[(k, "Median Points (Higher is better)")] = SDG_Aggregate_Ranks_col2_j

Now we tally up the scores:

# Create lists to be populated
list_of_scores, country_median = [], []
for j in range(len(SDG_Aggregate_Ranks.index)):

    # Create lists to be populated
    scores, scores_max, country_median_i = [], [], []

    for i in range(0, len(SDG_Aggregate_Ranks.columns), 2):
        scores.append(int(SDG_Aggregate_Ranks.iloc[j,i].split("/")[0]))
        scores_max.append(int(SDG_Aggregate_Ranks.iloc[j,i].split("/")[-1]))
    list_of_scores.append(str(sum(scores)) + "/" + str(sum(scores_max)))

    for i in range(1, len(SDG_Aggregate_Ranks.columns), 2):
        try:
            # The try block here allows us to account for the lack of sufficient data-points
            country_median_i.append(float(SDG_Aggregate_Ranks.iloc[j,i]))
        except:
            pass

    country_median.append(statistics.median(country_median_i))

SDG_Aggregate_Ranks[("Overall", "Scores Available")] = list_of_scores
SDG_Aggregate_Ranks[("Overall", "Median Points (Higher is better)")] = country_median

Overall Results Chart

We can now proceed to create a horizontal bar graph to overview results on a country level:

# Create a data-frame from the list of our results this far
Overall_Results_Chart = pd.DataFrame(country_median,
index = Country_Polarised_Points_Table1.index,
columns = ["Overall Results"])
Overall_Results_Chart["Countries"] = list(Country_Polarised_Points_Table1.index)

Let’s view our results in ascending order:

Overall_Results_Chart.dropna().sort_values(ascending = True, by = ["Overall Results"])

Plots

ax1 = Overall_Results_Chart.dropna().sort_values(ascending = True, by = ["Overall Results"]).plot.barh(
    x = "Countries", y = "Overall Results", figsize = (10, 40),
    title = "Overall ESG Scores (out of 10)\n", grid = True)
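Note that pandas’ ‘.plot.barh’ draws with matplotlib behind the scenes; if the chart does not render in your environment (an assumption about your setup, not something covered above), importing pyplot and calling its show function explicitly usually helps:

import matplotlib.pyplot as plt
plt.show() # forces the pandas/matplotlib chart above to render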

The cell below produces the same chart via Plotly. Plotly offers a range of advantages and is more dynamic than what has been shown thus far. Note that in order to use Plotly in JupyterLab you will need the jupyterlab-chart-editor and @jupyterlab/plotly-extension extensions.

import plotly.graph_objects as go

dataf = Overall_Results_Chart.dropna().sort_values(ascending = True, by = ["Overall Results"])

fig = go.Figure(go.Bar(x = dataf["Overall Results"],
                       y = dataf["Countries"],
                       orientation = "h"))
fig.show()
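Optionally, the figure can be titled and stretched so that all country labels stay legible; the values below are illustrative only:

fig.update_layout(title = "Overall ESG Scores (out of 10)", # illustrative styling
                  height = 2000)                            # tall enough for the country labels
fig.show()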

Saving Our Data In An Excel Workbook

# ' ExcelWriter ' is needed to create an Excel Workbook with multiple tabs/sheets
with pd.ExcelWriter("ESG-DS_output.xlsx") as writer:
    Overall_Results_Chart.to_excel(writer, sheet_name = "Overall_Results_Chart")
    SDG_Aggregate_Ranks.to_excel(writer, sheet_name = "SDG_aggregate_ranks")
    Country_Polarised_Points_Table1.to_excel(writer, sheet_name = "country_polarised_points_table1")
    Polarity_temp.to_excel(writer, sheet_name = "polarity.u.s.e.d")
    Polarity.to_excel(writer, sheet_name = "polarity")
    Country_Points_Table2.to_excel(writer, sheet_name = "country_points_table2")
    Country_Points_Table1.to_excel(writer, sheet_name = "country_points_table1")
    Country_Values_Table2.to_excel(writer, sheet_name = "country_values_table2")
    Country_Values_Table.to_excel(writer, sheet_name = "country_values_table1")
    df.to_excel(writer, sheet_name = "individual_country_series_codes")
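If a plain csv of the headline numbers is also wanted (the file name below is our own choice, not the original article’s), pandas can write one directly:

Overall_Results_Chart.to_csv("ESG-DS_Overall_Results.csv") # hypothetical file name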
