Using word embeddings to improve screening for Sustainable Development Goals in financial news articles
Wouldn’t you agree that one of the most legitimate claims of the last decade can be summarised as “NLP is the future!”? Well, there is substantial evidence in both academia and industry that this may indeed be the case on many fronts: business intelligence, translation, healthcare, recruitment, marketing, advertising, and more. In this article, we will focus on the business intelligence front where data-driven approaches scan for information on the world wide web to help users gather insights.
Such data-driven approaches have found their way into game-changing applications, also referred to as monitoring (or radar) applications, that allow businesses to keep informed on specific events of their choosing. Monitoring applications can scan millions of news articles on the web and provide structured insights from various angles depending on the reporting source.
In today’s capital markets where investors seek to do good or avoid harm alongside financial returns, such applications are in high demand. According to the IFC (International Finance Corporation), “investors increasingly recognise the need to follow international norms and principles designed to address Environmental, Social and Governance (ESG) risks.” Within the scope of ESG efforts, for this article, we consider a monitoring application that addresses the United Nations’ Sustainable Development Goals (SDGs), which are targets for global development set to be achieved by 2030. According to the survey of 207 investors from 28 countries conducted by LGT Capital Partners, SDGs provide focus on tangible outcomes as they allow investors to measure impact towards achieving quantitatively defined targets that have been globally agreed upon.
In an ever-expanding web, the main challenge is to search for, identify and retrieve what is relevant in a reliable and timely manner. Web documents are unstructured and there are no established reporting standards on ESG/SDG, so matching (and preferably exceeding) business needs on scope, relevance, quality and timeliness requires intelligent and autonomous tools that employ adequate and cost-efficient query expansion methods for information retrieval.
In the literature, there exist several techniques to tackle the query expansion problem. According to a recent survey¹, core approaches include (a) linguistic-based, (b) corpus-based, (c) search log-based, (d) web-based, (e) relevance feedback and (f) pseudo-relevance feedback. Approaches (e) and (f) are categorised as local analysis, which selects expansion terms from the collection of documents retrieved in response to the user’s initial (unmodified) query, whereas (a)-(d) are referred to as global analysis, which selects expansion terms from hand-built knowledge resources or from large corpora for reformulating the initial query. Under the global analysis category, linguistic-based approaches use thesauruses, dictionaries, ontologies, the Linked Open Data (LOD) cloud or other similar knowledge resources such as lexical databases (e.g. WordNet) to find synonyms, hyponyms, etc., while corpus-based approaches examine the contents of the whole text corpus. Herein, we will focus on a corpus-based approach using context-predicting vector-space models, which have shown considerable improvement over the classic term overlap-based retrieval approach and are widely adopted for query expansion.
Within the last decade, the fields of Natural Language Processing (NLP), Machine Learning (ML) and Information Retrieval (IR) have experienced a surge in the development of vector-space models of words, particularly those that frame the vector estimation problem directly as a supervised task. Such models represent words as points in an N-dimensional Euclidean space where words with similar meanings are expected to be close together. Compared to the traditional localist representation, in which each word in the vocabulary is an array of zeros with a single one, distributed representations, where each word is represented by many shared dimensions, have allowed for capturing useful associations between words in large corpora.
The utilisation of such semantic associations has been shown to be effective in tackling the query-document vocabulary mismatch problem. By enriching the query with additional semantically related words, the aim is to improve retrieval performance. In practice, expansion of the original query with candidate terms that are semantically related is anticipated to boost the relevance and hence the quality of the search results as well as to enable the retrieval of documents that might have been missed using the original query.
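To make the contrast between localist and distributed representations concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the dense vector values are invented purely for illustration and do not come from any of the models discussed in this article.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocabulary = ["equality", "diversity", "telescope"]

# Localist (one-hot) representation: one dedicated dimension per word.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocabulary))]
    for i, word in enumerate(vocabulary)
}

# Distributed representation: all words share the same few dimensions.
# ASSUMPTION: these values are made up, not taken from a trained model.
dense = {
    "equality": [0.9, 0.8, 0.1],
    "diversity": [0.8, 0.9, 0.2],
    "telescope": [0.1, 0.0, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so similarity is zero.
one_hot_sim = cosine(one_hot["equality"], one_hot["diversity"])

# Dense vectors can express graded similarity between words.
dense_sim_related = cosine(dense["equality"], dense["diversity"])
dense_sim_unrelated = cosine(dense["equality"], dense["telescope"])
```

With one-hot vectors every pair of distinct words looks equally unrelated, whereas the shared dimensions of the dense vectors let related words end up close together.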
In the rest of the article, we focus on the use of word embeddings in automatic query expansion from a practical perspective. We limit the scope of our analysis to expansion of keywords, where we regard the query as a seed term, being either a single word (e.g. sustainability) or a single phrase of multiple words (e.g. gender equality). By comparing the behaviours of different locally trained word embedding models, we aim to develop an in-depth understanding of how to build an efficient automated dictionary, where the keys are the seed words (i.e. the user’s query) and the values are the semantically close neighbours (i.e. expanded terms) of the corresponding seed words.
Among several methods used for estimating word embeddings, we will focus on word2vec⁶ (with the skip-gram algorithm) and its variation np2vec (noun phrase to vector) using NLP Architect’s implementation. The idea behind np2vec is that noun phrases provide good approximations for candidate terms that belong to the same semantic class as the input seed terms. In their complete solution for term set expansion, Mamou et al. use term representations with arbitrary context embeddings trained using the generic word2vecf toolkit, which produces dependency-based word embeddings.
By applying dependency parsing (i.e. extraction of grammatical relationships between words), we can tackle a shortcoming of the original word2vec, which treats every word inside the window as context regardless of its grammatical relation to the target word. Such context independence may lead to either missing important context or learning coincidental context. An example adapted from the literature demonstrates a coincidental and an important context within a context window size of 2. Using the original word2vec with window size set to 2, we would be extracting the following training samples for the input word “discovers”: (discovers, Australian), (discovers, scientist), (discovers, star) and (discovers, with). Generally, depending on the end task, we might opt to configure the window size to be small in order to allow the model to learn focused information about the target word. Given such a constraint, one could argue that representing the word “discovers” within the context of “telescope” is more valuable than associating the action of discovery with a nationality.
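The automated dictionary described above can be sketched as a small helper. This is an illustrative sketch rather than our production code: `build_expansion_dictionary`, `toy_lookup` and the `min_score` cut-off are hypothetical names, and the scores in the toy table are invented. The only assumption about the model is that a lookup returns a list of (term, score) pairs and raises `KeyError` for an unknown seed, which is how Gensim’s `most_similar` behaves.

```python
def build_expansion_dictionary(seeds, lookup, topn=16, min_score=0.0):
    """Map each seed term to its list of expansion terms.

    `lookup(seed, topn)` must return [(term, score), ...] sorted by score
    and raise KeyError when the seed is out of vocabulary.
    """
    dictionary = {}
    for seed in seeds:
        try:
            neighbours = lookup(seed, topn)
        except KeyError:
            # Out-of-vocabulary seed: keep the key, with no expansions.
            neighbours = []
        dictionary[seed] = [
            term for term, score in neighbours if score >= min_score
        ]
    return dictionary


def toy_lookup(seed, topn):
    # ASSUMPTION: stand-in for a trained model; scores are invented.
    table = {
        "gender equality": [
            ("equal pay", 0.81),
            ("wage equality", 0.78),
            ("diversity", 0.74),
            ("empowerment", 0.42),
        ],
    }
    return table[seed][:topn]


expansion = build_expansion_dictionary(
    ["gender equality", "no poverty"], toy_lookup, topn=3, min_score=0.5
)
```

With a trained Gensim model, the lookup could be written as `lambda seed, topn: model.wv.most_similar(seed, topn=topn)`.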
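The window-based extraction above can be reproduced with a few lines of Python. The example sentence is an assumption on our side: it is reconstructed from the four pairs listed above and the mention of “telescope”, and `window_pairs` is a hypothetical helper that ignores word2vec details such as sub-sampling.

```python
def window_pairs(tokens, target, window=2):
    """Return (target, context) pairs within a symmetric window."""
    pairs = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# ASSUMPTION: sentence reconstructed from the pairs listed in the text.
sentence = "Australian scientist discovers star with telescope".split()
pairs = window_pairs(sentence, "discovers", window=2)
```

The word “telescope” sits three positions away from “discovers”, so it never appears in a pair at window size 2, while the coincidental “Australian” does. A dependency-based context would instead link “discovers” to “telescope” through the preposition.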
Apart from investigating the impact of type of word embeddings on the utility value of an automatically extracted keyword, we also experiment with different querying techniques. Herein, querying technique refers to the way we compose the initial set of keywords that are sent to the model. For the scope of this article, we consider three categories of such composition (note that this is not an exhaustive list):
- Seed only
- Seed and context
- Seed and manually extracted keyword list
The practical difference between the second and third points above is an important one. Context can be seen as a set of umbrella terms that we manually impose on the seed term with the objective of getting more specific candidate terms, whereas a manually extracted keyword list is the result of a tedious curation phase conforming to a human’s perspective and may be either biased or limited to one’s knowledge of the subject. It is possible to crowd-source this task, but doing so is time- and cost-inefficient.
To illustrate our point, let’s put things in the perspective of the previously introduced SDG use case, where the first goal to seek information about is ‘no poverty’. On one hand, it is possible that we might get an OOV (Out of Vocabulary) error when we use the phrase ‘no poverty’ as a seed term and query the model to explore its associations. (Let’s come back to this case later and focus on the operationalisation of the different querying techniques for now.) On the other hand, it is a relatively generic term (perhaps not as abstract as SDG 9: Industry, Innovation and Infrastructure) that has the potential to appear in articles talking about poverty, yet not necessarily pertaining to the UN SDGs. To tackle this problem, we manually provide context, which could be one or more terms from a curated list including, but not limited to, ‘sustainability’, ‘sustainable development’ and ‘sdg’.
Going back to the case where we get an OOV error for a phrase, we can tokenise the term further and query the model for a list of tokens. This workaround is relevant because we are using word2vec, which has no vector for a term it has not seen during training. The ‘FastText’ variant of word2vec can cope with this problem by representing words through character n-gram substrings, some of which act like morphemes hinting at meaning. For this post, we do not consider the ‘FastText’ variant.
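The OOV fallback described above could look roughly as follows. `resolve_query_terms` is a hypothetical helper and the toy vocabulary is invented. The function only needs the vocabulary to support the `in` operator, which a plain set does and, to our understanding, Gensim’s `model.wv` does as well.

```python
def resolve_query_terms(term, vocabulary):
    """Return the terms to query the model with for a given seed.

    If the full phrase is known, use it as is. Otherwise fall back to
    the individual tokens that the vocabulary does contain.
    """
    if term in vocabulary:
        return [term]
    return [token for token in term.split() if token in vocabulary]


# ASSUMPTION: toy vocabulary standing in for a trained model's vocabulary.
toy_vocabulary = {"poverty", "sustainability", "gender equality"}

known_phrase = resolve_query_terms("gender equality", toy_vocabulary)
oov_phrase = resolve_query_terms("no poverty", toy_vocabulary)
unknown = resolve_query_terms("blockchain", toy_vocabulary)
```

The resulting list can then be passed to the model as positive examples. An empty list signals that the seed cannot be expanded with this model at all.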
Having clarified the second and the third querying techniques, we continue with the experiments and observations.
For our experiments, we use data from Webhose, a leading data collection provider that turns unstructured web content into machine-readable data feeds. The dataset consists of slightly more than 850k articles in the financial news domain from two time periods: July-October 2015 and September 2017-June 2019. The former period is available for free download on the website.
We use locally trained word embeddings that model semantic associations within our dataset. In other words, we use topically-constrained corpora where the constraint is financial news. The motivation to train our own embeddings stems from the empirical evidence gathered from both the existing literature such as  and our own observations. For brevity, the comparisons reported in this article will not involve globally trained (using topically-unconstrained corpora) models (e.g. Google News pre-trained word vectors).
Since we are interested in understanding how to build an efficient automated expansion method, the utility value of an expansion term given by different models and querying techniques is our focal point. Regarding creation of different word embedding models, the pre-processing of the input text together with the hyper-parameter configuration can be considered as a task-dependent exercise and based on previous experience, we configure and train three models as follows:
1) For a sample of the dataset, in which the data is temporally sampled down to the size of around 50k articles, we perform pre-processing to lowercase the text, tokenise, and structure the file to be one sentence per line. Then, we train a trigram word2vec model that has window size set to 16 and is compressed to a dimension of 300. We will abbreviate the settings and refer to this model as W2V-TS-TRI-D300-WS16. The rest of the hyper-parameters are configured as presented in the code block below (using python 3.7 and gensim 3.7.3). The code snippet demonstrates a method we use for configuring, training and saving a trigram model.
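import multiprocessing

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import LineSentence

def train_trigram_and_save(model_prefix, input_file_path, bigram_phraser=None):
    # The input file holds one pre-processed sentence per line.
    sentences = LineSentence(input_file_path)
    if bigram_phraser is None:
        # Learn two-word phrases first, joined with a space.
        bigram = Phrases(sentences, min_count=5, delimiter=b' ')
        bigram_phraser = Phraser(bigram)
    # Learn three-word phrases on top of the bigram-transformed corpus.
    trigram = Phrases(bigram_phraser[sentences], min_count=10, delimiter=b' ')
    trigram_phraser = Phraser(trigram)
    # ASSUMPTION: the original Word2Vec arguments did not survive in this
    # snippet; size/window follow the file name below, sg=1 the skip-gram choice.
    model_tri = Word2Vec(
        trigram_phraser[bigram_phraser[sentences]],
        size=100, window=10, sg=1, workers=multiprocessing.cpu_count(),
    )
    tri_model_name = model_prefix + '_tri_w2v_s100_w10.model'
    model_tri.save(tri_model_name)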
2) For the entire dataset, which has a size of around 850k articles, we apply the same pre-processing steps as above. Then, we train a trigram word2vec model that has window size set to 10 and is compressed to a dimension of 100. We will abbreviate the settings and refer to this model as W2V-PRCS-TRI-D100-WS10. We keep the rest of the hyper-parameters the same as above.
3) Using the original formatting of the input data, we perform dependency parsing using spaCy to mark the noun phrases (for more information on np2vec, see the NLP Architect documentation) and pre-process for the purpose of structuring the file to be one sentence per line. Then, we train an np2vec model that has window size set to 10 and is compressed to a dimension of 100, as suggested by the authors of the NLP Architect library for the set expansion task. We will abbreviate the settings and refer to this model as NP2V-ONP-W2V-D100-WS10 (ONP stands for Only Noun Phrases; it is a choice to prune the non-NP words from the vocabulary). We keep the rest of the hyper-parameters the same as above.
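The pre-processing shared by the first two models (lowercasing, tokenising and writing one sentence per line) can be sketched with the standard library alone. This is a deliberately naive stand-in: the regular expressions and the `to_sentence_lines` helper are assumptions for illustration, and the sketch does not reproduce whichever tokeniser was used for the actual experiments.

```python
import re

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a run of word characters, optionally joined by '-' or "'".
_TOKEN = re.compile(r"\w+(?:[-']\w+)*")


def to_sentence_lines(text):
    """Lowercase, tokenise and return one space-joined sentence per item."""
    lines = []
    for sentence in _SENTENCE_END.split(text.strip()):
        tokens = _TOKEN.findall(sentence.lower())
        if tokens:
            lines.append(" ".join(tokens))
    return lines


sample = "Investors seek ESG data. Gender equality matters!"
lines = to_sentence_lines(sample)
```

Writing `"\n".join(lines)` to a file gives the one-sentence-per-line layout that `LineSentence` expects.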
Observations on querying with seed only
We query the models using the most_similar method from the Gensim library and analyse the top 16 results. We note that increasing the number of candidate terms may result in noisy expansion and hence decreased retrieval performance; that analysis is left out of the scope of this article.
The following figure lists the candidate terms sorted by their similarity score in a descending order when the models are queried using the seed word ‘gender equality’.
Let’s start with the model on the left (i.e. W2V-TS-TRI-D300-WS16), which is trained on a smaller number of documents and has a larger window size and vector dimension to capture more topic/domain information. We observe that the candidate terms in bold font relate strongly to the seed term we used in our query. One named entity emerges as a candidate: TIME’S UP, which is an organisation that insists on safe, fair and dignified work for women of all kinds. For our document retrieval task, we prefer not to use named entities as expansion candidates of the seed term, in order to ensure a generalisable-yet-specific solution for the problem.
From a data-driven perspective, it is expected that the model will uncover associations that only exist in the corpora. Hence, the most important caveat we should highlight is that all the models we consider here are bounded by language, time period and sources. When using the results of these models, we should acknowledge the limitations and seek tangible achievements.
When we look at the rest of the candidate terms (those not in bold font) given by the first model, we can deduce that although they are somewhat related to the seed term, they are generic enough to be prone to retrieving irrelevant content and hence to increasing the false positives for the retrieval task.
The model in the middle (i.e. W2V-PRCS-TRI-D100-WS10) seems to perform better than the first one, judging by the number of candidate terms that are in bold. Apart from the candidate term ‘equality’, which is a hypernym of the seed term, the model retrieves two highly related named entities: ‘fawcett society’ and ‘male champions’. The Fawcett Society is the UK’s leading charity campaigning for gender equality and women’s rights, whereas the Male Champions of Change (MCC) Institute works with influential leaders to redefine men’s role in taking action on gender inequality. Although the model has the potential to teach new things (if you have never heard of these organisations due to, for example, geographical differences), the named entities should not be regarded as expansion terms for our case study. (We intend to deal with the named entities as part of future work.) Furthermore, some candidate terms such as ‘equal pay’, ‘wage equality’ and ‘diversity’ are good examples of the capability of the model to reason. This may be attributed to the fact that the second model has been trained on a larger corpus, which tends to yield better semantic relationships. By decreasing the window size, we also aim to capture more about the word itself. Empirically speaking, the inference from gender equality to wage equality makes a lot of sense in financial news corpora.
Lastly, the np2vec model seems to provide very relevant candidate terms apart from a couple of generic ones on an individual level such as ‘empowerment’, ‘women’, and ‘the workplace’. A more interesting candidate term in this example is ‘gender-based violence’, which is a valuable inference of the model that has not been picked up by the others (at least in the top 16 most similar words/phrases).
Observations on querying with seed and context
As mentioned previously, context for our case acts as an umbrella term that gathers the expansion terms closer towards the provided context. While this can also be visually observed as a harmonisation in the clusters (see how the coloured dots mingle towards each other in the right plot compared to the left plot in the figure below), we are interested in assessing the semantic changes in the vector-space for the IR task.
For this article, we choose the context to be ‘sustainability’. We have experimented with other term sets such as [‘peace’, ‘future’], [‘sdg’], [‘sustainable development’] and [‘sustainability’, ‘development’]. With these, however, the candidate terms became significantly more generic, pertaining to all of the sustainable development goals, which is expected to some extent but not desired for the deliverable (i.e. the automated dictionary) of the retrieval task. By querying the model for the seed term and giving the context as a positive example, we retrieve the following candidate terms for each of the three models:
While interesting candidate terms (e.g. ‘civic engagement’, ‘inclusiveness’) extracted using W2V-TS-TRI-D300-WS16 can be inferred from the seed term ‘gender equality’, compared to the previous output of the same model, the context term ‘sustainability’ does not seem to have improved the automated dictionary in the way we require. Looking at the candidate terms of W2V-PRCS-TRI-D100-WS10, we observe a tendency in the model to shift the focus toward the context term more notably (e.g. candidate terms such as ‘united nations sustainable development’, ‘sdgs’). For the np2vec model, however, we observe robustness in the expanded set when we compare the candidate terms given by the two different querying techniques. There is a subtle introduction of terms related to environmental issues, but the majority of the candidate terms are likely to be valuable for the IR task considering the particular seed ‘gender equality’.
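To see why adding a context term pulls the candidates toward it, it helps to spell out what a query with several positive examples does. As far as we understand Gensim’s `most_similar`, it averages the unit-normalised vectors of the positive terms and ranks the remaining vocabulary by cosine similarity to that average. The sketch below re-implements this idea in plain Python. The two-dimensional vectors are invented for illustration, and the local `most_similar` function is a simplified stand-in, not the library code.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive, topn=16):
    """Rank terms by cosine similarity to the mean of the positive terms."""
    normed = {term: unit(vec) for term, vec in vectors.items()}
    dims = len(next(iter(normed.values())))
    mean = [
        sum(normed[term][d] for term in positive) / len(positive)
        for d in range(dims)
    ]
    query = unit(mean)
    scored = [
        (term, sum(a * b for a, b in zip(query, vec)))
        for term, vec in normed.items()
        if term not in positive  # input terms are never returned
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# ASSUMPTION: toy 2-d vectors, not taken from any trained model.
toy_vectors = {
    "gender equality": [1.0, 0.0],
    "sustainability": [0.0, 1.0],
    "equal pay": [0.9, 0.1],
    "sdgs": [0.1, 0.9],
    "inclusiveness": [0.7, 0.7],
}

seed_only = most_similar(toy_vectors, ["gender equality"])
with_context = most_similar(toy_vectors, ["gender equality", "sustainability"])
```

In this toy space, the seed-only query ranks ‘equal pay’ first, while adding the context moves the query point halfway toward ‘sustainability’ and ‘inclusiveness’ takes the top spot. The same mechanism applies when a whole keyword list is passed as positive examples.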
Observations on querying with seed and manually extracted keyword list
For the final querying technique, we follow a similar approach as above. By querying the model for the seed term and giving the manually extracted keyword list as positive examples, we retrieve the following candidate terms for each of the three models:
All models seem to have inferred some semantic relationships with the addition of these keywords as highlighted in the figure below:
What is quite interesting in this figure is that, for example, ‘paternity leave’, ‘unpaid care work’ and ‘patriarchy’ are extracted as candidate expansion keywords, one from each model respectively. The extent of the topical inference with the help of a human-driven keyword list is noteworthy. Overall, using a manually curated keyword list may help guide the expansion process, which could also be done iteratively to get to further levels of conceptual depth, and inject profound inferences that perhaps the model itself would not be able to present within the top results we manually checked here. Further experiments are needed to draw firm conclusions on the collaboration between the notions of ‘data-driven’ and ‘human-driven’.
In this article, we have demonstrated how mining semantic word-word relationships embedded in locally trained context-predicting vector-spaces can help improve the coverage of various aspects of the topic we are interested in retrieving. Despite its simple and exploratory nature, this approach offers insight into determining the utility value of keywords extracted using a word-embedding based automated dictionary. A greater focus on the integration of this dictionary with the retrieval task could produce additional actionable insight into choosing the right model as well as automated post-processing techniques that we leave out of scope here. As part of future work, we plan to improve the automated dictionary by using retrieval performance as feedback.
 Azad, H.K. and Deepak, A., 2019. Query expansion techniques for information retrieval: a survey. Information Processing & Management, 56(5), pp.1698–1735.
 Baroni, M., Dinu, G. and Kruszewski, G., 2014, June. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 238–247).
 Nematzadeh, A., Meylan, S.C. and Griffiths, T.L., 2017. Evaluating Vector-Space Models of Word Representation, or, The Unreasonable Effectiveness of Counting Words Near Other Words. In CogSci.
 Fernández-Reyes, F.C., Hermosillo-Valadez, J. and Montes-y-Gómez, M., 2018. A Prospect-Guided global query expansion strategy using word embeddings. Information Processing & Management, 54(1), pp.1–13.
 Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
 Mamou, J., Pereg, O., Wasserblat, M., Dagan, I., Goldberg, Y., Eirew, A., Green, Y., Guskin, S., Izsak, P. and Korat, D., 2018. Term Set Expansion based on Multi-Context Term Embeddings: an End-to-end Workflow. arXiv preprint arXiv:1807.10104.
 Diaz, F., Mitra, B. and Craswell, N., 2016. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891.