Topic Modeling on Spanish Texts
Creating topics and classifying Spanish documents using Gensim and spaCy
Often we have many documents that we would like, or need, to classify or group into natural clusters to understand them better. In this post we will review topic modeling, attempting to answer questions like: what kind of words appear more often than others in a collection of documents? Can we group our data? Can we find underlying themes?
Topic Modeling
Topic models are algorithms that uncover the hidden topics or themes in a collection of documents (text). This is unsupervised machine learning because we do not have labeled data; we need to find the labels from the text itself. You can refer to this article to learn a little bit more about topic models.
In this post we will be using algorithms like Latent Semantic Indexing (LSI), the Hierarchical Dirichlet Process (HDP), and Latent Dirichlet Allocation (LDA); if you are unfamiliar with any of them, please follow the corresponding link.
Getting Ready
For this article we will need Python, spaCy, NLTK, Gensim, and their Spanish dependencies (we have a Spanish dataset). If you do not have them yet, please install all of them.
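If you are starting from scratch, a typical setup looks like this (the es_core_news_md model is an assumption; any of spaCy’s Spanish models should work):

```bash
pip install spacy gensim nltk pyLDAvis pandas
python -m spacy download es_core_news_md
```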
1. The Dataset
The dataset is composed of data extracted from a Spanish news portal; it has around 46k documents (news articles). Let’s check our data:
In this article we are only going to use the text column, because it holds the complete text of each news article.
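A minimal way to load and peek at the data with pandas (the file name noticias.csv is a hypothetical placeholder):

```python
import pandas as pd

# Load the news dataset; the file name is hypothetical
df = pd.read_csv("noticias.csv")
df.head()
```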
And the info about the dataset:
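Assuming the DataFrame loaded above:

```python
df.info()  # column names, dtypes, and non-null counts
```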
The process flow for training a topic model is as follows: load the data, preprocess and clean it, build the dictionary and the bag-of-words corpus, train the candidate models, evaluate them with the coherence score, and finally tag every document with its dominant topic.
2. Preprocessing the Data
For topic modeling, as preprocessing I recommend:
- Using lemmatization instead of stemming, because lemmatized words tend to be more human-readable than stemmed ones.
- Using bi-grams as part of your corpus before applying the topic modeling algorithm, in order to get more human-interpretable results.
- Selecting the proper parts of speech; in this case we keep only the nouns of the corpus, removing adverbs, adjectives, verbs, etc. This is done with spaCy.
- Cleaning the text properly (and deeply).
The code that implements all the steps above is:
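Below is a minimal sketch of such a cleaner, assuming the es_core_news_md model and a simple regex cleanup (the notebook’s exact implementation may differ):

```python
import re
import spacy

# Spanish spaCy model; parser and NER are disabled for speed
nlp = spacy.load("es_core_news_md", disable=["parser", "ner"])

def clean_text(text):
    """Lowercase, keep only Spanish letters, lemmatize, and keep only nouns."""
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text.lower())
    doc = nlp(text)
    return [tok.lemma_ for tok in doc
            if tok.pos_ == "NOUN" and not tok.is_stop and len(tok.lemma_) > 2]

docs = [clean_text(t) for t in df["text"]]
```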
After applying the cleaner function we must create the dictionary and the corpus:
We have cleaned the data; now it is time to train the models.
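A sketch with Gensim, including the bi-gram step recommended above (the min_count and filter_extremes thresholds are assumptions):

```python
from gensim.corpora import Dictionary
from gensim.models.phrases import Phrases, Phraser

# Detect frequent bi-grams and merge them into single tokens
bigram = Phraser(Phrases(docs, min_count=20))
docs = [bigram[doc] for doc in docs]

# Build the dictionary and drop very rare / very common tokens
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=10, no_above=0.5)

# Bag-of-words representation of each document
corpus = [dictionary.doc2bow(doc) for doc in docs]
```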
3. The Models
After preprocessing our data we will build several models (LDA, LSI, and HDP) and compare them.
For selecting and evaluating our models, in addition to visual inspection we will use the topic coherence coefficient, which is a measure of how interpretable topics are for human beings. Good topics are topics that can be described by a short label; this is what the topic coherence measure should capture, so the best models achieve the highest topic coherence scores.
To learn more about topic coherence, please follow this link.
3.1 The HDP Model
The HDP topic model infers the number of topics from the data (we don’t need to specify the number of topics in advance).
Creating the model is as simple as shown below:
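A minimal sketch, assuming the corpus and dictionary built above:

```python
from gensim.models import HdpModel

# HDP infers the number of topics on its own
hdp_model = HdpModel(corpus=corpus, id2word=dictionary)
```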
View the topics:
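One way to print the top words per topic (matching the output below):

```python
for i, topic in hdp_model.show_topics(num_topics=20, num_words=10, formatted=False):
    print("Topic {}: {}".format(i, " ".join(word for word, _ in topic)))
```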
Topic 0: presidente gobierno personas parte ministro seguridad partido trump autoridades policia
Topic 1: gobierno presidente millones personas parte ministro acuerdo seguridad partido euros
Topic 2: personas ciudad seguridad policia autoridades gobierno parte fuerzas presidente zona
Topic 3: personas gobierno millones presidente parte caso ministro autoridades seguridad policia
Topic 4: partido gobierno may acuerdo presidente ministra personas ministro elecciones vez
Topic 5: trump presidente acuerdo gobierno kim parte millones mundo anuncio lunes
Topic 6: personas caso policia autoridades lunes grados centro millones martes proceso
Topic 7: incendio paris notre abril bomberos fuego llamas catedral notredame historia
Topic 8: ministro partido acuerdo may votos presidente elecciones gobierno trump despues
Topic 9: partido gobierno mes ministro parlamento grupo spain caso abril anuncio
Topic 10: trump presidente elecciones partido cumbre gobierno ministro viernes domingo consejo
Topic 11: est pas pour une par sont qui espagne jeunes mais
Topic 12: personas policia ciudad ley parte autoridades horas martes gobierno zona
Topic 13: mujer horas personas redes von lugar vida juicio autoridades seguridad
Topic 14: personas gobierno martes lunes acuerdo mundo parlamento ley ciudad autoridades
Topic 15: trump presidente grupo gobierno medios fuerzas jueves despues parte anuncio
Topic 16: frutas personas enfermedad mujeres espacio trump mujer consumo salud peque
Topic 17: grados autoridades temblor escala magnitud viernes lugar seismo union parte
Topic 18: millones personas autoridades parte hora jueves partido sismo poblacion martes
Topic 19: presidente boeing ministro partido euros erc parlamento votos personas georgia
As we can see there are 20 topics; however, they are kind of difficult to interpret or follow, so we decided to move on to other models.
3.2 The LSI Model
This model implements fast truncated SVD (Singular Value Decomposition) to form the topics.
We fit models with several numbers of topics (in the range 1–20) and look at the coherence measure to see what the optimal number of topics is; the code snippet is below:
The chart below shows the results:
It seems that the best model is the one with only 3 topics.
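A sketch of that loop, scoring each model with Gensim’s CoherenceModel (the c_v measure is an assumption):

```python
from gensim.models import LsiModel, CoherenceModel

def compute_coherence(model, texts, dictionary):
    """Return the c_v coherence of a trained topic model."""
    cm = CoherenceModel(model=model, texts=texts,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()

lsi_coherences = []
for num_topics in range(1, 21):
    lsi = LsiModel(corpus=corpus, id2word=dictionary, num_topics=num_topics)
    lsi_coherences.append(compute_coherence(lsi, docs, dictionary))
```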
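The original chart is an image; reproducing it with matplotlib is straightforward:

```python
import matplotlib.pyplot as plt

plt.plot(range(1, 21), lsi_coherences, marker="o")
plt.xlabel("Number of topics")
plt.ylabel("Coherence (c_v)")
plt.show()
```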
View the topics:
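Retraining with the best number of topics and printing them, same pattern as before:

```python
lsi_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=3)
for i, topic in lsi_model.show_topics(num_topics=3, num_words=10, formatted=False):
    print("Topic {}: {}".format(i, " ".join(word for word, _ in topic)))
```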
Topic 0: presidente gobierno personas ministro seguridad atentados parte venezuela trump policia
Topic 1: presidente atentados venezuela trump policia personas gobierno juan paris ataques
Topic 2: trump venezuela gobierno juan nicolas casa clinton votos noviembre elecciones
It seems that the LSI model with 3 topics contains words overlapping across topics, so this model is not useful for us.
3.3 The LDA Model
In a few words, LDA generates topics based on word frequency from a set of documents.
Again we fit multiple models and look at the coherence:
For this model the best number of topics is 12; let’s review the topics:
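The same loop with LdaModel (the passes and random_state values are assumptions):

```python
from gensim.models import LdaModel

lda_coherences = []
for num_topics in range(1, 21):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   random_state=42, passes=5)
    lda_coherences.append(compute_coherence(lda, docs, dictionary))
```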
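Training the final model and listing its topics (the hyperparameters are assumptions):

```python
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=12,
                     random_state=42, passes=10)
for i, topic in lda_model.show_topics(num_topics=12, num_words=10, formatted=False):
    print("Topic {}: {}".format(i, " ".join(word for word, _ in topic)))
```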
Topic 0: derechos humanos muerte guerra tribunal juez caso libertad personas juicio
Topic 1: estudio tierra universidad mundo agua investigadores cambio expertos corea sistema
Topic 2: policia hombre casa mujer muerte familia hospital autoridades caso despues
Topic 3: presidente gobierno ministro justicia mubarak pueblo presos venezuela jefe caso
Topic 4: mujeres mundo mujer papa vida iglesia matrimonio caso hombres casos
Topic 5: personas metros rescate zona autoridades isla mar kilometros agua horas
Topic 6: partido presidente elecciones votos gobierno candidato comicios partidos domingo ministro
Topic 7: personas seguridad fuerzas ciudad ataque policia fuentes grupo ataques protestas
Topic 8: millones euros gobierno dolares medidas crisis parte empresas trabajo sector
Topic 9: personas terremoto ayuda ciudad seismo zonas zona millones agua muertos
Topic 10: gobierno presidente ministro consejo acuerdo seguridad parte paz union asuntos
Topic 11: aeropuerto avion accidente vuelos pasajeros horas vuelo seguridad autoridades personas
It looks like the topics are:
- Topic 0: is about trials (justice)
- Topic 1: looks like nature and science studies
- Topic 2: is about violence (domestic, maybe?)
- Topic 3: protests and unrest (like the Venezuelan case)
- Topic 4: is about life and family
- Topic 5: sea disasters
- Topic 6: elections
- Topic 7: is about terrorism
- Topic 8: is about the economic crisis
- Topic 9: looks like earthquakes
- Topic 10: is about peace treaties
- Topic 11: airports and security
The topics for this model look well defined, so we will select this model with this number of topics to tag our dataset.
3.4 Coherence of the Best Models
Let’s compare the coherence of the best models:
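A sketch of that comparison, feeding each model’s top words to CoherenceModel (topic counts follow the models trained above):

```python
def top_words(model, num_topics, topn=10):
    """Top words of each topic as plain lists, suitable for CoherenceModel."""
    return [[word for word, _ in topic]
            for _, topic in model.show_topics(num_topics=num_topics,
                                              num_words=topn, formatted=False)]

for name, topics in [("HDP", top_words(hdp_model, 20)),
                     ("LSI", top_words(lsi_model, 3)),
                     ("LDA", top_words(lda_model, 12))]:
    cm = CoherenceModel(topics=topics, texts=docs,
                        dictionary=dictionary, coherence="c_v")
    print(name, round(cm.get_coherence(), 3))
```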
When compared to the other algorithms, the best model is again the LDA one.
Despite the help of the topic coherence coefficient, the question of how to select the right number of topics is not easy to answer; in my experience it really depends on:
- The kind of corpus you are using.
- The size of the corpus.
- The number of topics you might expect to see, based on past or related projects for example.
4. Visualizing the Final Model
One of the most popular topic modeling visualization libraries is pyLDAvis (the Python port of LDAvis); you can use it to get a nice visualization of the topics:
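A sketch for a Jupyter notebook (in older pyLDAvis versions the module is pyLDAvis.gensim instead of pyLDAvis.gensim_models):

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
vis  # or: pyLDAvis.save_html(vis, "lda.html")
```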
From the interactive chart you should see, you can tell how some topics share words and how related (or well differentiated) they are from each other, so you can either apply more preprocessing or select a different number of topics.
5. Classifying the Whole Corpus with the Topics Found
Now that we have selected the best model and number of topics (for this article), it is time to assign a topic to every document, which means clustering the collection according to the topics.
We selected the LDA model with 12 topics and implemented a function to assign a dominant topic to each document, then mapped each topic to a label:
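A minimal sketch of such a function; the label names are hypothetical, taken from the interpretation above:

```python
# Hypothetical short labels for the 12 LDA topics interpreted above
topic_labels = {
    0: "justice", 1: "science", 2: "violence", 3: "protests",
    4: "family", 5: "sea disasters", 6: "elections", 7: "terrorism",
    8: "economy", 9: "earthquakes", 10: "peace", 11: "airports",
}

def dominant_topic(bow):
    """Return the id of the most probable topic for one bag-of-words document."""
    return max(lda_model.get_document_topics(bow), key=lambda p: p[1])[0]

df["topic"] = [dominant_topic(bow) for bow in corpus]
df["label"] = df["topic"].map(topic_labels)
```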
The result:
And mapped back to the text:
5.1 The Topic Distribution of the Labels in the Dataset
After tagging, let’s examine how many documents belong to each topic:
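With the DataFrame tagged as above, this is a one-liner:

```python
print(df["label"].value_counts())  # or .plot.barh() for a quick chart
```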
It seems that the topics are almost balanced.
Final Words
Topic modeling requires multiple runs of cleaning the data, reading the results, adjusting the preprocessing accordingly and trying again and again until you are satisfied with the results.
The complete code can be found on this Jupyter notebook, and you can browse for more projects on my Github.
If you need some help with Data Science related projects: https://www.disruptio-analytics.com/