ATS-optimization Part 2 - NLP-based topic extraction - A beginner's guide

syrom
19 min read · Oct 24, 2022


In part 1 of this article, we obtained an Excel-format file with job offers from LinkedIn, corresponding to a job search filtered on

  • a set of thematic search words
  • a geoID and
  • further filter criteria as offered by the current LinkedIn job offer search interface

The search word(s) are the most important filter, as they define the scope of the result. If you are looking specifically for jobs requiring e.g. Python know-how, this is where to put the "Python" keyword in the code of part 1. If you are interested in more generic job descriptions, e.g. data scientist or data engineer, you put these more generic keywords. The geoID gives you a sort of geo-fencing, producing only search results from (more or less) the area you define via the geoID. Currently the best way to obtain the geoID is to do a manual search: write the desired city/region/country into the corresponding filter field on LinkedIn and copy & paste the resulting geoID from the search result URL as a variable value into the code from part 1. The same works for additional filters (e.g. hierarchy level or legal status of the jobs you look for), as described in part 1.

In any case, the code from part 1 produces an Excel-readable file with all job positions that are shown as results for your search specification. This is already interesting by itself. But as the code can easily return hundreds of results, it may be worthwhile to unleash the power of NLP (natural language processing) on this data to get a concise overview. This is the motivation for this part 2.

In NLP theory, the total of all results is referred to as the "corpus", the single results as "documents". A single document is, in our case, the job description that we scraped for a particular job offer from the search result list. And the sum of all job descriptions constitutes the corpus.
If you read part 1, you will remember that we also scraped more structural data from the results (e.g. direct link, location, company size, industry etc.). But for the following analysis part, only the (very long) textual job description of every single job offer returned as a search result is relevant.
The real goal of our efforts is to get a feel for the things these job offers have in common. This may have different motivations. It can help to optimize CVs, so as to "check all the right boxes" and get your application past the target company's or recruitment agency's ATS (a sort of automated screening process for CVs). Or it can serve to give you an overview of trends and tendencies in the field of your interest. Or you can study what companies comparable to yours (aka: competitors) are apparently up to by identifying the skills they recruit for.

After a bit of research, I opted for 3 ways of extracting this desired information from the corpus. The methodology is generally called "topic extraction". You may think of topic extraction as the process of extracting the keywords that best describe the gist of specific documents within the corpus. If a topic extraction algorithm produces as output a list of keywords like [NASA, booster, space, astronaut], it is easy to infer from this list that the document likely treats the subject of space travel. Or if you think about a collection of recipes, both for vegetarians and barbecue enthusiasts, a list with the result words [lettuce, peppers, onions, olive oil] is likely to belong to a vegetarian recipe, a result list [T-Bone, medium, smoker] rather to a barbecue recipe.

Or, in other words: the algorithms for topic extraction do NOT produce ready-made topics as outcome. They rather present the user with a list of words — and the “human in the loop” must still infer the true meaning of these “salient terms”, aka the words that stick out as particularly informative about the meaning of the document they are taken from.

The 3 different approaches I chose to further investigate the scraping result were:

  • Word Cloud
  • Latent Dirichlet Allocation (LDA) analysis
  • BERTopic topic extraction

Word Cloud

That's the easiest one to explain, and you have probably already seen more than one word cloud. It is a very visual approach: the word cloud is a sort of quilt or patchwork of the words that occur in the corpus. There is no specific order to them, but the words can be oriented horizontally or vertically (which easily differentiates a word cloud from "normal" text). Most importantly, the size of each word corresponds to the frequency of that word in the corpus. Aka: the more often a word appears, the larger it will be printed.

Word cloud from the job posts scraping:

The word cloud is a powerful means for a first assessment. It is both easy to produce and visually powerful (e.g. you do not need any further explanation of what "salient term" means once you have seen a word cloud).
The apparent drawback of the word cloud: it works on the entire corpus, showing no differentiation between documents or document groups. Taking the previous example of a mixed vegetarian & barbecue recipe collection: both the terms lettuce and T-bone would appear side by side in a word cloud, and there is no telling that they belong to two different classes / topics of documents.
A further drawback relates to an important concept in NLP, the "stopword list": a lot of older NLP approaches work with pure statistics applied to a "bag of words". You can think of a "bag of words" as a list of all words in a document, with the number of occurrences of each word written next to it. This number can serve e.g. to determine the size of a word in a word cloud (aka: the more often the word occurs, the bigger it will be printed).
There are, of course, very frequent "stopwords" that occur much more often than others. They provide the structural backbone of human language, but do little to transfer true meaning. Articles like "the" and "a" or words like "and" or "or" are the best and most obvious examples.
Those are the words you want to eliminate from your "bag of words" prior to any analysis, or from a visualization like the word cloud. They would only drown out the true meaning with their noise, so you want to eliminate this noise. And you do this by using stopword lists.
And in order to produce informative word clouds, you very often need to manually adjust and extend pre-configured stopword lists.
In our given case, otherwise potentially informative words like "company" or "position" are very likely to appear in every job description, so it is a good idea to add them to the stopword list in this particular case, even though they do not belong to a standard stopword list in any language.
Or, to make a long story short: tweaking the stopword list in order to obtain a truly informative word cloud can be time-consuming, and it can inflate the stopword list used quite a lot.
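To make the "bag of words" and stopword idea concrete, here is a minimal, self-contained sketch. The sample sentence and the tiny stop-word list are made up purely for illustration:

from collections import Counter

text = "the controller prepares the report and the forecast for the company"
stop_words = {"the", "and", "for", "a"}      # a tiny generic stop-word list
stop_words |= {"company"}                    # domain-specific extension, as discussed above

bag_of_words = Counter(text.split())         # raw bag of words: "the" dominates the counts
bag_filtered = Counter(w for w in text.split() if w not in stop_words)

print(bag_of_words)                          # e.g. Counter({'the': 3, ...})
print(bag_filtered)                          # only the informative words remain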

Latent Dirichlet Allocation

LDA analysis is a statistical method for topic modelling. To cite the linked Wikipedia article: "Latent Dirichlet Allocation (LDA) is a generative statistical model that explains a set of observations through unobserved groups, and each group explains why some parts of the data are similar."

Sounds both complicated and fascinating! And it does not only work, there is also a Python package combination for applying this method to standard data frames and visualizing the result (gensim in combination with pyLDAvis).

The drawback of this method: it is hard to wrap one's head around the theory, so for me the outcome is more or less as much of a black box as in the neural network case. If you are good at statistics, the result is certainly more explainable than the weights and biases of a trained neural network. But, unfortunately, I am not exactly brilliant at statistics. So you must basically trust the package.

However, there was another very interesting aspect for me: thanks to reading about LDA, I accidentally discovered a lot of interesting historical details about my region. More on this in the "Results & Findings" section, for this was pure serendipity at work.

BERTopic

Transformer models are currently all the rage. After the first amazing purely text-based models, text-to-image transformers like DALL-E or Stable Diffusion are currently stealing the show. But the pure language models paved the way, and BERT (Bidirectional Encoder Representations from Transformers) was one of the earliest of this new generation of language models.
BERT has been somewhat eclipsed by newer models like GPT-2 and GPT-3. But it is open source and produces very good results when applied to tasks like topic extraction, while still being manageable in terms of model size (the larger transformer models have become so large that they are seemingly hard to integrate into open source packages).
And a single, dedicated topic extraction enthusiast, Maarten Grootendorst (https://twitter.com/MaartenGr), a trained psychologist turned data scientist, has fused the BERT model into a Python package specifically conceived for topic extraction: BERTopic. The NLP community owes him a lot for this.
My less than perfect understanding of transformer models: the words of a document are "transformed" into vectors. In each single document, words can be linked or associated in different ways, and the transformer model is trained on A LOT of documents and encapsulates how specific words (aka vectors) are "normally" related to other words (aka other vectors), and also whether specific words are similar to one another. An example often given is "king" and "queen": both tend to appear in the same texts and to be related to the same subjects and hence words, because they are essentially the same, only one denoting a male, the other a female ruler. And the transformer model can "see" this similarity (or closeness) by comparing the vector representations of "king" and "queen" and finding that both are unusually "close" to one another. If you want to learn more about the details, the terms to look up are "distance measure" and "cosine similarity", but these details are beyond the scope of this article.
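To make the notion of "closeness" a little more tangible, here is a tiny, purely illustrative sketch. The vectors are made-up toy numbers, not real BERT embeddings:

import numpy as np

v_king = np.array([0.70, 0.90, 0.10])     # made-up toy embedding for "king"
v_queen = np.array([0.68, 0.88, 0.15])    # made-up toy embedding for "queen"
v_lettuce = np.array([0.05, 0.10, 0.95])  # made-up toy embedding for "lettuce"

def cosine_similarity(a, b):
    # cosine similarity = dot product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(v_king, v_queen))    # close to 1.0 -> "near" in vector space
print(cosine_similarity(v_king, v_lettuce))  # clearly smaller -> semantically far apart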
What’s important for this article:

  • transformer models are better at semantics than “bag-of-words” approaches. Because they encapsulate relationships between words instead of only counting them. And topic extraction is about semantics in the end.
  • BERTopic is an open source package by Maarten Grootendorst that leverages the power of the BERT model for the task of topic extraction and combines it with the ease of Python

The Code

First a warning: you can access the code from this article in two Jupyter Notebooks stored on GitHub. The details are down below in the "Sources" section. However, these Notebooks are not active, and hence they do NOT show the visual outcome. You will need to install and run the Notebook(s) in order to reproduce the graphical output of both the LDA and the BERTopic analysis. In the meantime, screenshots must do for a first impression.

Second warning: the indents so important in Python are somehow NOT properly shown on Medium. The spaces should be there, but the visual impact in terms of indentation is, to say the least, limited. Please mind the correct indents when copying code that includes loops, if/then/else statements or anything else that requires indentation to work properly.

The word cloud and the LDA analysis are done in the same notebook. The second notebook does the BERTopic extraction. Both notebooks first perform the typical NLP data munging, that is, "cleaning up" the text of the original job descriptions with the standard steps prior to NLP analysis:

  • Tokenization (aka: "cutting out" the single words from the text)
  • Lemmatization (aka: reducing words to their base form so that very similar variations are merged)
  • Cleaning up "funny" characters
  • Eliminating the mentioned stopwords

The goal is to get a clean list of words that the algorithms can work with efficiently. These data munging tasks are performed by applying functions to the data frame column of interest, that is, the job description column.

Word Cloud

1) Package import

# The standard packages

import pandas as pd
import time
import random as rdm
import matplotlib.pyplot as plt
import numpy as np
import string

# The visualization packages

import matplotlib.pyplot as plt
# to show visualization directly in the Notebook
%matplotlib inline
import seaborn as sbn

# The WordCloud package !

import wordcloud
from wordcloud import WordCloud

# Natural Language Toolkit for text data munging

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# The packages necessary for the LDA analysis

import gensim
from gensim import corpora
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# one-time only data downloads with working data for NLP-packages: needs to be done once and repeated occasionally

# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('omw-1.4')

2) Variable declarations (global)

s_separator = ' '  # define separator string to be used in case of need

# For text manipulation: special characters to be replaced and their replacement
s_2b_replaced = [',', ';', '.', '!', '&', ':', '?', '-', '(', ')']
s_replacement = ' '

############################
# Name of main data frame column to be analysed:
############################
c_category = 'JobDescDetails'

3) Function definitions

Cleaner function: replaces the special characters defined above in a string

def f_clean_txt_characters(d):
    for elem in s_2b_replaced:  # see definition of s_2b_replaced in the variable declaration section
        if elem in d:           # replace any occurrence
            d = d.replace(elem, s_replacement)
    return d

Transform text into a list of single words WITHOUT stop-words

def f_stopword_filter(d):
    t = word_tokenize(d)                                # word-tokenize document > going from string to list
    t = [w for w in t if not w.lower() in stop_words]   # actual stop-word elimination !!!
    return t

Reassemble a single text string from the tokenized, stop-word-filtered list

def f_filtered_txt_strg(l):
    s = ' '
    for n in l:
        s = s + n.lower() + ' '
    return s  # returns string of filtered text instead of list

Lemmatize the words in the input text

def f_lematize(l):
    lematized = ' '.join(lemma.lemmatize(word) for word in l.split())
    return lematized

4) Actual Program

4a) Data Import

Read IMPORT file > result of LinkedIn scraping with Selenium. Don’t forget to adjust the code if the file and result worksheet were named differently

xlsx_imp = pd.ExcelFile('LinkedIn_Input.xlsx')
df_imp = pd.read_excel(xlsx_imp, 'Sheet1')

4b) Data munging

Extracting only the relevant column FROM THE IMPORT for the further data preparation / manipulation
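The notebook reduces the imported data frame to the job description column at this point. A minimal sketch of what this step might look like (df_w1 is the working frame name used by the code below, c_category the column name defined above):

df_w1 = df_imp[[c_category]].copy()        # keep only the job description column
df_w1.reset_index(drop=True, inplace=True) # clean integer index for the checks below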

Check for non-strings in the column to be evaluated >>> these must be corrected if any are found

for k in range(len(df_w1)):
    if type(df_w1.loc[k, c_category]) != str:
        print(k)

Cleaning the text: stripping punctuation, most special characters and "exotic" UTF-8 text. Further text cleaning: eliminating all characters that are not part of the standard ASCII set. This should e.g. eliminate Chinese text.

df_w2 = df_w1.copy()
df_w2[c_category] = df_w2[c_category].apply(f_clean_txt_characters)

Stop-word elimination and tokenization of the text: as this particular search yielded job descriptions in English, Dutch and German, all three languages needed to be considered for the stop-word list:

stop_words = list(set(stopwords.words('english')) | set(stopwords.words('german')) | set(stopwords.words('dutch')))

#######################################################
# add more stop-words to eliminate frequent common words
# -> adjust this list in order to eliminate non-informative words
#    both from the word cloud and the LDA analysis.
# This adjustment depends highly on the domain of the search
# and the language(s) present in the corpus
#######################################################
l_thematic_stop_words = ['finance', 'financial', 'business', 'financiële', 'jij', 'bent', 'onze', 'jouw', 'experience', 'binnen', 'wij', 'ervaring', 'functie', 'werken', 'work', 'ga', 'elk', 'kijken', 'stellenbörse', 'stammt', 'erfahren', 'mehr', 'anzeige', 'stammt', 'looking', 'jobdescdetails', 'team', 'controller', 'management', 'organisatie', 'hebt', 'control', 'kennis', 'skills', 'manager', 'company', 'jaar', 'working', 'jou', 'per', 'contact', 'probleem', 'diploma', 'bruto', 'mogelijk', 'projecten', 'ontwikkelen', 'nieuwe', 'goede', 'organisatie', 'vragen', 'minimaal', 'collega', 'daarnaast', 'including', 'sowie', 'bedrijf', 'process', 'processes', 'bieden', 'bieten', 'unternehmen', 'gut', 'gute', 'guter', 'goede', 'goed', 'within', 'waar', 'verder', 'verschillende', 'graag', 'samen', 'processen', 'waarbij', 'zoek', 'role', 'rolle', 'rol', 'strong', 'required']
stop_words.extend(l_thematic_stop_words)

Enhance data frame with text for wordcloud WITHOUT any words from the stopword list:

df_w2['DOC_LIST4WC'] = df_w2[c_category].apply(f_stopword_filter)  # perform stop-word elimination on df column

Going back from tokenized list to single string….

df_w2['DOC_TEXT4WC'] = df_w2['DOC_LIST4WC'].apply(f_filtered_txt_strg)  # transform from list to string

The following code just serves to compare the impact of the previous steps on the 50th search result as an example: it prints the text of the 50th job description BEFORE the text treatment (aka: as it was scraped) next to the text after applying all NLP manipulations for the word cloud.
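A minimal sketch of such a before/after comparison (index 49 as "the 50th result" is an assumption based on the default integer index):

print(df_w1.loc[49, c_category])     # the job description as it was scraped
print(df_w2.loc[49, 'DOC_TEXT4WC'])  # the same text after cleaning, tokenization and stop-word removal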

4c) The word cloud creation

all_words = ' '.join([str(text) for text in df_w2['DOC_TEXT4WC']])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(18, 12))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

This last line produces a word cloud shown further above in this text. You can, of course, still tweak and play around with the settings of the word cloud package.
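For example, a hedged variation of the call above; the parameters used are standard WordCloud options, but the values are just assumptions to experiment with:

wordcloud = WordCloud(width=800, height=500, background_color='white',
                      max_words=150, collocations=False,   # drop two-word collocations, cap the word count
                      random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(18, 12))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()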

LDA-Analysis

This code simply continues after the word cloud creation, in the same notebook.

4d) Data preparation for LDA-analysis

Prepare the data for the analysis. v_num_topics is an important parameter: as in k-means, the user must define the number of topics upfront. It is up to the user to repeat and experiment in order to find a good number of topics.

For the further steps, a document corpus must be generated as a simple list, with each list element corresponding to one text from the original data frame:

l_docs = df_w2['DOC_TEXT4WC'].tolist()
lemma = WordNetLemmatizer()
l_doc_lem = [f_lematize(doc).split() for doc in l_docs]
corpus = corpora.Dictionary(l_doc_lem) # instantiate a corpus dictionary object
mat_doc_term = [corpus.doc2bow(doc) for doc in l_doc_lem] # generate document term matrix !!

4e) Training the LDA model and returning results

ACTUAL LDA-model-training with the previously generated doc term matrix
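The model training itself runs roughly along these lines; a minimal sketch, with the number of topics and the number of passes as assumptions to experiment with:

v_num_topics = 6  # as with k-means: the number of topics is chosen upfront; adjust and re-run as needed
mod_LDA_DocTermMatr = gensim.models.LdaModel(corpus=mat_doc_term,
                                             id2word=corpus,        # the gensim Dictionary built above
                                             num_topics=v_num_topics,
                                             passes=10,
                                             random_state=42)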

Output textual result of the model

print("DocTermMatrix-based LDA model - Perplexity:", mod_LDA_DocTermMatr.log_perplexity(mat_doc_term))

Output (interactive) visual result of the model:

vis_DocTermMatr = gensimvis.prepare(mod_LDA_DocTermMatr, mat_doc_term, corpus)
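To actually render the interactive chart inside the Jupyter Notebook, the standard pyLDAvis calls can be used:

pyLDAvis.enable_notebook()         # activate inline rendering in the notebook
pyLDAvis.display(vis_DocTermMatr)  # show the interactive topic map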

The result should look something like this:

BERTopic

This is an entirely new notebook, so it starts all over with the package import again

1) Package Import

Importing the packages required for the code to run

import re as re
import time
import pandas as pd
import random

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer

from sklearn.feature_extraction.text import CountVectorizer

from bertopic import BERTopic

There are no variable declarations or function definitions. So we continue right away with

4) The actual program

4a) Importing the data

df_imp = pd.read_excel(r'LinkedIn_Input.xlsx', sheet_name='Sheet1')
docs = df_imp['JobDescDetails'].tolist()

4b) Data munging / preparation

The same logic as mentioned for the word cloud and LDA analysis applies to extending the stopword list for the BERTopic topic extraction.

stop_words = list(set(stopwords.words('english')) | set(stopwords.words('german')) | set(stopwords.words('dutch')))
additional_stop_words = ['finance', 'financial', 'business', 'financiële', 'jij', 'bent', 'onze', 'jouw', 'experience', 'binnen', 'wij', 'ervaring', 'functie', 'werken', 'work', 'ga', 'elk', 'kijken', 'stellenbörse', 'stammt', 'erfahren', 'mehr', 'anzeige', 'stammt', 'looking', 'jobdescdetails', 'team', 'controller', 'management', 'organisatie', 'hebt', 'control', 'kennis', 'skills', 'manager', 'company', 'jaar', 'working', 'jou', 'per', 'contact', 'probleem', 'diploma', 'bruto', 'mogelijk', 'projecten', 'ontwikkelen', 'nieuwe', 'goede', 'organisatie', 'vragen', 'minimaal', 'collega', 'daarnaast', 'including', 'sowie', 'bedrijf', 'process', 'processes', 'bieden', 'bieten', 'unternehmen', 'gut', 'gute', 'guter', 'goede', 'goed', 'within', 'waar', 'verder', 'verschillende', 'graag', 'samen', 'processen', 'waarbij', 'zoek', 'role', 'rolle', 'rol', 'strong', 'required']

EXTEND instead of APPEND, as an entire list is added, not a single element appended !!! Plus: extend works in place by default

stop_words.extend(additional_stop_words)

Removing all stopwords from the list of documents

docs_without_stopwords = []
tokenizer = RegexpTokenizer(r"[\w']+")  # note the backslash: match runs of word characters and apostrophes
for d in range(0, len(docs)):
    doc_token = tokenizer.tokenize(docs[d])
    str_no_stopwords = ""
    for w in range(0, len(doc_token)):
        if doc_token[w].lower() not in stop_words:
            str_no_stopwords = str_no_stopwords + doc_token[w].lower() + " "
        else:
            pass
    docs_without_stopwords.append(str_no_stopwords)

4c) Playing with and running the actual BERTopic extraction

Defining and training the actual model (with important parametrization to toy around with, like ngram_range or nr_topics)
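A minimal sketch of this step; the parameter values (ngram_range, nr_topics, the multilingual setting) are assumptions to toy around with, not the definitive configuration:

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stop_words)  # re-use the extended stop-word list
topic_model = BERTopic(language="multilingual",           # the corpus mixes English, German and Dutch
                       vectorizer_model=vectorizer_model,
                       nr_topics="auto")
topics, probabilities = topic_model.fit_transform(docs_without_stopwords)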

Getting some basic information on the model

topic_model.get_topic_info()
topic_model.get_topic(-1)
topic_model.visualize_hierarchy()
topic_model.visualize_barchart()

Example: search for all topics relevant to a given search term:

similar_topics, similarity = topic_model.find_topics("learning", top_n=5)

Get the most relevant topic containing the search term

for topic in range(0, len(similar_topics)):
    print("Topic #" + str(topic))
    for keyword in topic_model.get_topic(similar_topics[topic]):
        print(keyword)

Print the extracted topics and their associated keywords, and check whether, by mistake, any of the found topic words happen to be included in the stopword list:

for topic in range(0, len(similar_topics)):
    print("Topic #" + str(topic))
    for keyword, probability in topic_model.get_topic(similar_topics[topic]):
        if keyword in stop_words:
            print(keyword + " -> ERROR: topic word is stop word!")
        else:
            print(keyword)

Results & Findings

Word Cloud

Easy does it! The word cloud is a good example of applying the KISS methodology: keep it simple & stupid! It is an excellent tool for storytelling and communication, especially because it is intuitively understandable by people who are domain experts, but not data scientists.

The two big caveats applying:
1) The quality depends a lot on the stopword list. And extending the stopword list manually also creates the opportunity for willful manipulation.
2) Only the entire corpus is modelled. So you do not really see "topics" as defined above, but terms that are important over the entire corpus.

LDA

This rather old method worked surprisingly well. E.g. it separated extremely well on the basis of language: many of the job descriptions scraped were in English, some in German (and also Dutch in a previous version, until LinkedIn changed to the geoID approach explained in part 1):

Upper left quadrant: only English terms from English language job offers:

Upper right quadrant: only German terms from German language job offers:

Needless to say, the clusters found in the middle, between the two languages, are job descriptions formulated in the hopeless and buzzword-ridden mixture of English and German often referred to as "Denglish" and usually used by consulting companies 😉

One salient term in all clusters for German job descriptions was, by the way, "Englischkenntnisse", aka "knowledge of English". Which is to say that it is basically impossible to get a management job in Germany without any knowledge of English, even if a good part of the job descriptions (but the minority) is still in German.

You may recall from part 1 that my use case was to find topics and trends defining the current and mid-term future finance field. Based on this, one major finding for me personally (which emerged both from the word cloud and the LDA analysis) was the importance of the terminology "business partnering" for finance jobs. Even though "partnering" should come naturally in daily work, I was not aware that this had evolved into a sort of "buzzword of the year". Not really relevant for my use case, but interesting to know if you need to tick boxes to evade ATS elimination.

Another concrete finding: management job offers in finance still demand surprisingly little concrete technical knowledge. Sure, a manager should primarily manage people, not code or mindlessly hack figures into spreadsheets. But I would have expected a bit more concrete technological skill requirements, enabling people in such positions to fully leverage the potential of digitalisation, because they know and understand that potential.

I also ran a later search on entry-level finance jobs to cross-check this finding: there, technical skills were formulated more concretely, but at a deceptively low level: most of the time "advanced / excellent Excel skills" were sufficient, with demands for SQL literacy occasionally strewn in. Some proficiency in BI tools was sometimes mentioned as a plus.

Based on these findings, I now have real doubts about the extent to which most finance departments

  • are able to drive, rather than merely follow, the digital transformation
  • are seen as equal partners by business, IT and data / AI departments in the discussion of how to implement new digital technology and processes

BERTopic

That's a short one: I have failed to leverage BERTopic so far! The code ran and produced results, but not very meaningful ones. Here, I still need to invest time and learn the tweaks. That is the reason why I spare you a screenshot.

This is certainly not a fault of BERTopic, but mine: all articles I have read so far praise the package for its abilities. So here is a good starting point for toying around during cold winter nights. And the code base should provide a solid starting point for doing so.

Serendipitous learnings: WTF is "Dirichlet"?

Now I'll digress and go completely off-topic: but it is sometimes just too surprising how things are interconnected in space and time, and how we discover this.

Skip directly to "Sources" if you are only interested in the technical part of this article and the corresponding notebooks 🙂

I had a very unexpected moment of serendipity when I discovered the subject of LDA analysis, which I simply want to share:

When I read about LDA for the first time, the full name "Latent Dirichlet Allocation" stuck with me as being particularly weird. "Latent" and "Allocation" were a bit strange, but somewhat relatable. But what is "Dirichlet"? I remember that I did actually reflect for a moment on what "Dirichlet" could stand for: just a name (as so often in mathematics, like with Cauchy-Schwarz or Euler and many others), or maybe some weird Greek word for something I had never heard of before. But in the end, I had better things to do and did not look for a detailed answer.

Just a week later, we had friends over for dinner. One of the friends invited is an Englishman from the very north (Newcastle). We usually have discussions on all kinds of subjects. He has somewhat fallen in love with German Romanticism (an art period from the early 19th century referred to under this name) and had just learned that a German composer from that period had written several symphonies inspired by the northern landscapes of the British Isles, and he wanted to know if I knew about this.

Sure enough, yes, I knew that Felix Mendelssohn Bartholdy had composed both the "Scottish Symphony" and "The Hebrides". Felix who?, you might say. Well: you know him, even if you don't. That's the guy who brought THE wedding march to you, which is actually the Wedding March from his incidental music to "A Midsummer Night's Dream".

Anyway: the English (ok, by now "naturalized German"; Brexit oblige) friend was surprised to learn that Mendelssohn Bartholdy also has a direct relationship to the city he and I live in, Düsseldorf: he was the cantor (aka: in charge of music) at a local church for several years somewhere in the early 1830s. That triggered our friend's curiosity, and we quickly went through Mendelssohn's biography in the German Wikipedia article, as he wanted to know more about Mendelssohn's life.

WTF, what does this have to do with LDA, you may ask. Well, my eyes immediately got stuck, in the middle of the article, on the word "Dirichlet". Coincidence? Typo? But the article said that Felix's sister married a mathematician of this name: Peter Gustav Lejeune Dirichlet.

As this was only the second time in my life that I had read this word, it now triggered my curiosity. All the more so as the fact that the person bearing this name was a mathematician suggested a relationship with the name in "LDA". And so it was. And what an interesting life:

He was born in 1805 in Düren (close to Aachen), at a time when the region was under Napoleonic French rule, so he was actually born a French citizen. His family originally came from the region around Liège on the Maas river (or Luik in Dutch, aka Lüttich in German), just a few kilometers to the west, before Belgium even became Belgium (in 1830). After the Napoleonic era, Düren belonged to Prussia (and no, no Germany yet at that time). And from there, it is only a few kilometers to the Rhine and to Düsseldorf, which closes the geographical circle to his future wife, the sister of Felix Mendelssohn Bartholdy.

We tend to forget this, but it is amazing how open borders (and minds) were before the sickness of nationalism befell the European nations in the middle of the 19th century. So, according to the Wikipedia biography, he received his "germanized" name "Lejeune Dirichlet" as literally denoting "the young one from Richelette", a small community close to Liège, and moved on to become a famous mathematician in Prussia. He finally even became the successor of Gauss (yes, the guy with the bell-shaped curve and, besides Bayes, the godfather of statistics) at the famous mathematics faculty of the University of Göttingen, giving us the theoretical basis for "Latent Dirichlet Allocation" in the process.

You may classify this as part of "The Encyclopedia of Useless Knowledge", but I found this discovery process as inspiring as it was surprising.

Fun fact to round all this off: Maarten Grootendorst, the mentioned creator of the BERTopic package, is Dutch and from Tilburg, within a short distance of both Liège and Düsseldorf. So that makes for a lot of topic extraction theory in a small, but culturally diverse radius (mixing Dutch, Walloon and German culture) of just ca. 50-60 miles, and over a span of roughly 200 years:

Sources

The two notebooks from this part 2 on GitHub:

Originally published at https://www.my2ndworld.com on October 24, 2022.
