Heuristic NLP-based pipeline for a Horizon 2020 database thematic study. Part 1: Data pre-processing
A demonstration of thematic analysis of Horizon 2020 energy projects using simple NLP tricks, taking your research beyond a plain query search.
This post is the first of two publications dedicated to the study of energy practices directed at communities, and to the allocation and generalisation of practices and methods addressed within the projects. The study deals with the Horizon 2020 (H2020) database, CORDIS, the primary public repository and portal for disseminating information on all EU-funded research projects and their results in the broadest sense.
It highlights a path for beginners and Python enthusiasts to explore an open dataset beyond a keyword query.
In brief, this post is the author’s reflection on a heuristic approach as a kick-off point for a thematic study. The first step described here is a data pre-processing procedure that includes stop-word removal, tokenization, stemming and n-gram generation. It will be followed by selecting n-grams of a certain length as terms (i.e. meaningful semantic expressions that can be related to some category), manual tagging of the generated n-grams within classification categories, and labelling projects according to the tagged terms, ending with further allocation and visualisation. In the end, the obtained results will be an interpretative input for further analysis of Horizon 2020 practices and for understanding energy project trends themselves.
Our study is based on the question: how can we look at the EU’s highly competitive H2020 energy projects to allocate them within their addressed ambitions, activities, and approaches to decarbonisation, for further analysis?
The following steps will be presented here:
- Data export and query filtering
- Tokenization, Lemmatization or Stemming
- N-grams Generation
- Terms Frequency and Next Stage Frontiers
1. Data export and query filtering
At the time this work was performed, the database held 35,326 projects funded under the Horizon 2020 programme from 2014 to January 2022. It has a web interface for query searches, as well as a web application for some exploration and visualisation of Horizon 2020 data. However, our motivation is to explore the open data a little bit “above and beyond”, and that is where natural language processing (NLP) as a support tool comes to help.
The dataset can be easily downloaded in CSV format from the Horizon 2020 website. All of the work was done in Python in a Jupyter notebook. To begin with, let's load the open data CSV file as a pandas DataFrame.
import pandas as pd

# load the CORDIS H2020 projects export; on pandas >= 1.3 use on_bad_lines='skip' instead of error_bad_lines
cordish2020 = pd.read_csv('../data/cordis-h2020projects-csv/project.csv', sep=';', error_bad_lines=False)

# catalogue of keywords used for the filtering query
categories = pd.read_excel('../data/terms_catalogue_query.xls')
categories = categories.fillna(method='ffill')   # forward-fill missing values in the catalogue
categories.columns = ['keywords']
keywords_catalogue = list(categories['keywords'])
The resulting dataset is a table with 19 columns of parameters, where each row represents a project. Among those, the primary ones are: status, title, objective, start date, end date, and budget (“ecMaxContribution”).
In addition, as an input file, we have our keywords_catalogue list specified for the filtering query. Following that, simple data preparation was performed.
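The exact preparation code is not reproduced here; a minimal sketch of the kind of steps meant, assuming the column names of the CORDIS project.csv export (id, startDate, endDate, ecMaxContribution), could look like this:
# assumed preparation steps, not the original code
cordish2020 = cordish2020.dropna(subset=['objective'])          # drop projects without an objective text
cordish2020 = cordish2020[['id', 'status', 'title', 'objective',
                           'startDate', 'endDate', 'ecMaxContribution']]
cordish2020['title'] = cordish2020['title'].str.strip()
cordish2020['objective'] = cordish2020['objective'].str.strip()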
Now it is time to implement our keyword filtering, and for that we will use the flashtext Python library, so add from flashtext import KeywordProcessor. First, we write a short function that produces as output a list (i.e. a column vector) of matched items from the terms catalogue. The block below takes the specified catalogue of terms as a dictionary and, with the help of flashtext's KeywordProcessor(), iterates over the project objective and title columns, passing each cell of the respective column to our extract() function, which returns any match with the dictionary as an array of matched terms.
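The original extract() helper is not reproduced here; a minimal sketch of how it might look with flashtext, producing the results_keywordSearch DataFrame used further below (the match-column names are illustrative), is:
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
keyword_processor.add_keywords_from_list(keywords_catalogue)    # terms catalogue acts as the dictionary

def extract(text):
    # return the list of catalogue terms found in one text cell
    return keyword_processor.extract_keywords(str(text))

# scan the objective and title columns and keep projects with at least one match
cordish2020['objective_matches'] = cordish2020['objective'].apply(extract)
cordish2020['title_matches'] = cordish2020['title'].apply(extract)
results_keywordSearch = cordish2020[(cordish2020['objective_matches'].str.len() > 0) |
                                    (cordish2020['title_matches'].str.len() > 0)].copy()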
In the end, we can save the resulting DataFrame with the columns of interest and move on to tokenization, stemming and n-gram generation.
2. Tokenization, Lemmatization or Stemming
First, we import the libraries, the main one being nltk.
The tokenization procedure is described in various levels of detail elsewhere and is a basic part of any corpus processing. As a rule, it includes steps such as text cleaning and stop-word removal, so the code block below provides a function where this is implemented together with the stemming procedure, applying PorterStemmer, the most common stemmer and one quite often considered effective enough. At this point, stemming (reducing words to morphological variants of a root/base word) was chosen over lemmatization (which considers a language's full vocabulary to apply a morphological analysis to words) and was deemed sufficient within the heuristic approach to the problem we are working on (a short comparison of stemmer and lemmatizer output is shown right after the function).
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
# nltk.download('punkt')  # needed once for word_tokenize

porter_stemmer = PorterStemmer()

def only_tokenizer(series, stemmer=True, stop_words=[]):
    tokenized_list = []
    if isinstance(series, pd.Series):
        text = series.str.cat(sep=' ')  # concatenate all cells into one string
    else:
        text = series
    text = re.sub(r"http\S+", "", text)                 # remove urls
    text = re.sub(r'\S+\.com\S+', '', text)             # remove urls
    text = re.sub(r'\@\w+', '', text)                   # remove mentions
    text = re.sub(r'\#\w+', '', text)                   # remove hashtags
    text = re.sub(r"""['’"`«»]""", '', text)            # remove quotation marks
    text = re.sub(r"""([0-9])([\u0400-\u04FF]|[A-z])""", r"\1 \2", text)  # split digit-letter sequences
    text = re.sub(r"""([\u0400-\u04FF]|[A-z])([0-9])""", r"\1 \2", text)  # split letter-digit sequences
    text = re.sub(r"""[\-–.,!:+*/_]""", ' ', text)      # replace punctuation with spaces
    for word in word_tokenize(text):
        if word.isalpha():
            word = word.lower()
        else:
            continue
        if stemmer is True:
            word = porter_stemmer.stem(word)
        if word not in stop_words and len(word) > 1:
            tokenized_list.append(word)
    return tokenized_list
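For reference, since the stemmer-versus-lemmatizer choice was made heuristically, here is a quick comparison of the two on a few sample words (the word list is arbitrary; the lemmatizer needs the nltk wordnet data downloaded once):
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # required once for WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for w in ['communities', 'studies', 'energies', 'technologies']:
    print(w, '->', stemmer.stem(w), '|', lemmatizer.lemmatize(w))
# stemming gives truncated stems such as 'studi', lemmatization returns dictionary forms such as 'study'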
3. N-grams Generation
So, we have the only_tokenizer function, which takes 3 input parameters: the corpus (as a string or pandas Series), a stemmer flag and a stop-words list. We can create an objective_words list which will hold the whole corpus of our dataset's objective field in one list.
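The stop-words list sw used in the next snippet is not defined in the post's visible code; a minimal way to build it, assuming nltk's standard English stop-words plus a few purely illustrative domain-specific additions, is:
from nltk.corpus import stopwords
# nltk.download('stopwords')  # required once

sw = stopwords.words('english')
# illustrative additions - frequent words in H2020 objectives that carry little meaning
sw += ['project', 'european', 'also', 'within', 'new']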
objective_words = only_tokenizer(results_keywordSearch['objective'], True, sw)
print('Length of the objective_words list: {}'.format(len(objective_words)))

objective_words_unique = list(set(objective_words))  # keep each token only once
print('Length of the objective_words_unique list: {}'.format(len(objective_words_unique)))

>>> Length of the objective_words list: 211730
>>> Length of the objective_words_unique list: 9626
With a few lines of code, we can find out which n-grams occur the most in this particular sample and plot the top 20 n-grams as a bar chart.
import matplotlib.pyplot as plt

# count bigram occurrences over the whole corpus and keep the 20 most frequent
bigrams_series = (pd.Series(nltk.ngrams(objective_words, 2)).value_counts())[:20]
bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('20 Most Frequently Occurring Bigrams')
plt.ylabel('Bigrams')
plt.xlabel('# of Occurrences')
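The author's trigram chart code is not shown, but the direct analogue of the snippet above, changing only the n-gram length, is:
trigrams_series = (pd.Series(nltk.ngrams(objective_words, 3)).value_counts())[:20]
trigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('20 Most Frequently Occurring Trigrams')
plt.ylabel('Trigrams')
plt.xlabel('# of Occurrences')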
As a result, the graphical output for the bigrams and trigrams is presented in the figures below.
To finalise, we also check the length of the corpus and of our generated n-grams. As seen below, the total corpus length is 211,728 elements, the number of unique unigram items is 9,626, unique bigrams 139,133 and unique trigrams 199,793.
objective_ngrams1 = pd.Series(nltk.ngrams(objective_words, 1))
print('\nWhole number of unigram items (terms): ', len(objective_ngrams1))

objective_ngrams2 = pd.Series(nltk.ngrams(objective_words, 2)).unique()  # leave only unique bigrams
print('Unique 2ngrams terms: ', len(objective_ngrams2))

objective_ngrams3 = pd.Series(nltk.ngrams(objective_words, 3)).unique()  # leave only unique trigrams
print('Unique 3ngrams terms: ', len(objective_ngrams3))

>>> Whole number of unigram items (terms):  211728
>>> Unique 2ngrams terms:  139133
>>> Unique 3ngrams terms:  199793
The number of n-grams for the selected corpus is voluminous. After all, there are 9,626 unique unigrams, and the count grows steeply for bigrams.
Bigram and trigram terms are considered the most adequate for the further manual classification steps, since unigrams lack the neighbouring word needed to understand the context, while 4-grams provide terms that are understandable but whose number of combinations increases too dramatically to treat them as terms.
The following bullet points were outlined:
- there is a need to allocate terms by their frequency for further manual categorisation; for that, TF and TF-IDF approaches will be applied;
- the bigrams and trigrams will be considered as supplementary for the categorization of terms.
4. Terms Frequency and Next Stage Frontiers
The general idea of the desired output is as follows: since the whole-sample frequency distribution alone doesn't give a clear picture, a term frequency distribution will be created for each word, considering it within each project, as follows:
After that, the created DataFrame will be transposed, and the frequency sum of each term will be presented as a separate column:
To generate term frequencies, we will use the sklearn library with two well-known methods: the words' frequency counts (the bag-of-words approach) via CountVectorizer() for term frequency (TF), and TfidfVectorizer() for term frequency-inverse document frequency (TF-IDF).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
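Before running this on the full corpus, a tiny toy example may help show what the two vectorizers return and why the resulting matrix is transposed afterwards; the two mini "objectives" below are made up for illustration (get_feature_names_out assumes sklearn >= 1.0):
toy_docs = ['smart energi grid commun', 'energi storag commun project']   # two made-up, already stemmed texts

cv = CountVectorizer(ngram_range=(1, 1))
df_tf = pd.DataFrame(cv.fit_transform(toy_docs).toarray(), columns=cv.get_feature_names_out())
df_tf_T = df_tf.T                           # rows become terms, columns become documents (projects)
df_tf_T['term_freq'] = df_tf_T.sum(axis=1)  # total count of each term across the sample
print(df_tf_T)

tv = TfidfVectorizer(ngram_range=(1, 1))
df_tfidf = pd.DataFrame(tv.fit_transform(toy_docs).toarray(), columns=tv.get_feature_names_out())
print(df_tfidf)                             # same terms, weighted down when they appear in every document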
To create the abovementioned tables of term frequency and TF-IDF, we will put the data into a list, where each element holds the tokenized objective of one project, converted to a single string.
objective_terms_tokenized = []
for j in results_keywordSearch['objective_tokenized']:
    objective_terms_tokenized.append(str(j))   # each project's token list becomes one string for the vectorizers
Term Frequency (TF)
BOW_bigram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2))
BOW_matrix_bigram = BOW_bigram_vectorizer.fit_transform(objective_terms_tokenized)
BOW_terms_bigram = BOW_bigram_vectorizer.get_feature_names_out()  # use get_feature_names() on sklearn < 1.0
terms_bigram_freq = BOW_matrix_bigram.toarray()

# rows: projects, columns: bigrams -> transpose so that each row is a bigram
df_BOW_bigram = pd.DataFrame(index=results_keywordSearch.index, data=terms_bigram_freq, columns=BOW_terms_bigram)
df_BOW_bigram_transp = df_BOW_bigram.T
df_BOW_bigram_transp['term_freq'] = df_BOW_bigram_transp.sum(axis=1)  # total frequency of each bigram
term_bigram_freq = df_BOW_bigram_transp[['term_freq']]
term_bigram_freq = term_bigram_freq.sort_values(by=['term_freq'], ascending=False)
bigrams = pd.DataFrame(index=term_bigram_freq.index, data=term_bigram_freq)
bigrams.to_csv('../data/ngrams/bigrams.csv')
The TF output for bigrams is:
Term Frequency — Inverse Document Frequency (TF-IDF)
tfidf_bigram_vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_bigram = tfidf_bigram_vectorizer.fit_transform(objective_terms_tokenized)
tfidf_terms_name_bigram = tfidf_bigram_vectorizer.get_feature_names_out()  # use get_feature_names() on sklearn < 1.0
tfidf_bigram_matrix = tfidf_bigram.toarray()
print(tfidf_terms_name_bigram[:5])

tfidf_bigram = pd.DataFrame(index=results_keywordSearch.index, data=tfidf_bigram_matrix, columns=tfidf_terms_name_bigram)
tfidf_bigram_transp = tfidf_bigram.T
tfidf_bigram_transp['tf-idf'] = tfidf_bigram_transp.sum(axis=1)  # summed TF-IDF score of each bigram over all projects
tfidf_bigram_term_freq = tfidf_bigram_transp[['tf-idf']]
tfidf_bigram_term_freq = tfidf_bigram_term_freq[tfidf_bigram_term_freq['tf-idf'] > 0.01]  # drop near-zero scores
tfidf_bigram_term_freq.sort_values(by=['tf-idf'], ascending=False)
To sum up, a popular word cloud visualisation can be built for presentation, using the bigrams as an example:
from wordcloud import WordCloud

bigrams_cloud = pd.read_csv('../data/ngrams/bigrams.csv')

# build a {bigram: frequency} dictionary from the two-column csv
d = {}
for a, x in bigrams_cloud.values:
    d[a] = x

wordcloud = WordCloud(background_color="white", width=800, height=400)
wordcloud.generate_from_frequencies(frequencies=d)
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
The output is:
As a result, we obtain a table with the bigrams and their TF and TF-IDF scores, plus supplementary trigrams and/or unigrams and 4-grams with the corresponding scores, ready for further manual labelling. This output will be used as the input data for the Part 2 publication on the manual heuristic classification.
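A possible final step, not shown in the original notebook, is to join the two scores into a single table for the manual tagging stage; the bigram strings in both indexes align directly:
# hypothetical closing step: combine TF and TF-IDF scores into one labelling table
bigram_scores = bigrams.join(tfidf_bigram_term_freq, how='left')
bigram_scores = bigram_scores.sort_values(by=['term_freq'], ascending=False)
bigram_scores.to_csv('../data/ngrams/bigrams_scored.csv')   # illustrative output path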