Topic Modeling on Spanish Texts

8 min read · Sep 15, 2019

Creating topics and classifying Spanish documents using Gensim and spaCy

Often we have many documents that we would like, or need, to classify or group into natural groups in order to understand them better. In this post we will review topic modeling and try to answer questions such as: Which words appear more often than others in a collection of documents? Can we group our data? Can we find underlying themes?

Topic Modeling

Topic models are algorithms that uncover the hidden topics or themes in a collection of documents (text). They are a form of unsupervised machine learning: we do not have labeled data, so we need to find the labels from the text itself. You can refer to this article to learn a little more about topic models.

In this post we will use Latent Semantic Indexing (LSI), the Hierarchical Dirichlet Process (HDP), and Latent Dirichlet Allocation (LDA). If you are unfamiliar with them, please follow the link for each algorithm.

Getting Ready

For this article we will need Python, spaCy, NLTK, Gensim and their Spanish dependencies (we have a Spanish dataset). If you do not have them yet, please install all of them.

1. The Dataset

The dataset consists of data extracted from a Spanish news portal; it has around 46k documents (news articles). Let’s check our data:

In this article we are only going to use the text column, because it holds the complete text of each news item.

And the info about the dataset:

The workflow for training a topic model is as follows:

2. Preprocessing the Data

For topic modeling, I recommend the following preprocessing:

Use lemmatization instead of stemming, because lemmatized words tend to be more human-readable than stems.
Use bi-grams as part of your corpus before applying the topic modeling algorithm, to get results that are easier to interpret.
Select the proper parts of speech. In this case we keep only the nouns of the corpus, which removes adverbs, adjectives, verbs, etc. This is done with spaCy.
Clean the text thoroughly.

The code that implements all the steps above is:
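A minimal sketch of these steps, assuming spaCy's Spanish model `es_core_news_sm` is installed. The helper names (`clean_text`, `noun_lemmas`, `preprocess`) and the minimum token length are illustrative choices, not necessarily the ones behind the results in this post.

```python
import re
import unicodedata


def clean_text(text):
    """Lowercase, strip accents, and keep only ASCII letters and single spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def noun_lemmas(doc, min_len=3):
    """Return cleaned lemmas of the nouns in a spaCy Doc (or any iterable of tokens)."""
    lemmas = []
    for token in doc:
        if token.pos_ != "NOUN" or token.is_stop:
            continue  # drop verbs, adjectives, adverbs, stop words, ...
        lemma = clean_text(token.lemma_)
        if len(lemma) >= min_len:
            lemmas.append(lemma)
    return lemmas


def preprocess(texts, model_name="es_core_news_sm"):
    """Tag and lemmatize raw texts; return one list of noun lemmas per text."""
    import spacy  # imported here so the helpers above work without spaCy

    nlp = spacy.load(model_name, disable=["ner"])  # NER is not needed here
    return [noun_lemmas(doc) for doc in nlp.pipe(texts, batch_size=100)]
```

The tagger runs on the raw text (accents and capitalization help it), and the cleaning is applied to each lemma afterwards.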

After applying the cleaner function we must create the corpus and the dictionary:
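As a sketch, and assuming `texts` is the list of token lists produced by the cleaning step, Gensim's `Phrases` can merge frequent word pairs into bi-grams, while `Dictionary` and `doc2bow` produce the dictionary and the bag-of-words corpus. The function names and the filtering thresholds are assumptions to tune on your own data.

```python
def add_bigrams(texts, min_count=5, threshold=10.0):
    """Merge frequent word pairs into single tokens joined by an underscore."""
    from gensim.models.phrases import Phrases  # requires gensim

    bigram = Phrases(texts, min_count=min_count, threshold=threshold)
    return [list(bigram[doc]) for doc in texts]


def build_dictionary_and_corpus(texts, no_below=5, no_above=0.5):
    """Map tokens to integer ids and encode each document as (id, count) pairs."""
    from gensim.corpora import Dictionary  # requires gensim

    dictionary = Dictionary(texts)
    # Drop tokens that appear in very few or in too many documents.
    dictionary.filter_extremes(no_below=no_below, no_above=no_above)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    return dictionary, corpus


# Usage, once `texts` exists:
# texts = add_bigrams(texts)
# dictionary, corpus = build_dictionary_and_corpus(texts)
```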

We have cleaned the data; now it is time to train the models.

3. The Models

After preprocessing our data we build several models (LDA, LSI and HDP) and compare them.

To select and evaluate our models, in addition to visual inspection, we will use the topic coherence coefficient, which measures how interpretable topics are for human beings. Good topics can be described by a short label, and this is what the topic coherence measure should capture, so the best models achieve the highest topic coherence scores.

To learn more about topic coherence, please follow this link.

3.1 The HDP Model

The HDP topic model infers the number of topics from the data (we don’t need to specify the number of topics in advance).

Creating the model is as simple as shown below:
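For reference, a minimal sketch using Gensim's `HdpModel`, assuming `corpus` and `dictionary` come from the previous step. `train_hdp` and `format_topics` are hypothetical helpers; the latter prints topics as `Topic N: word word ...` lines.

```python
def train_hdp(corpus, dictionary):
    """Fit an HDP model; no number of topics has to be provided."""
    from gensim.models import HdpModel  # requires gensim

    return HdpModel(corpus=corpus, id2word=dictionary)


def format_topics(topics):
    """Turn [(topic_id, [(word, weight), ...]), ...] into 'Topic N: w1 w2 ...' lines."""
    return [
        f"Topic {topic_id}: " + " ".join(word for word, _ in words)
        for topic_id, words in topics
    ]


# Usage, once corpus and dictionary exist:
# hdp = train_hdp(corpus, dictionary)
# topics = hdp.show_topics(num_topics=20, num_words=10, formatted=False)
# print("\n".join(format_topics(topics)))
```

Note that `show_topics` lists 20 topics by default, so the number of lines printed is not necessarily the number of topics HDP actually inferred.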

View the topics:

Topic 0: presidente  gobierno  personas  parte  ministro  seguridad  partido  trump  autoridades  policia   
Topic 1: gobierno  presidente  millones  personas  parte  ministro  acuerdo  seguridad  partido  euros   
Topic 2: personas  ciudad  seguridad  policia  autoridades  gobierno  parte  fuerzas  presidente  zona   
Topic 3: personas  gobierno  millones  presidente  parte  caso  ministro  autoridades  seguridad  policia   
Topic 4: partido  gobierno  may  acuerdo  presidente  ministra  personas  ministro  elecciones  vez   
Topic 5: trump  presidente  acuerdo  gobierno  kim  parte  millones  mundo  anuncio  lunes   
Topic 6: personas  caso  policia  autoridades  lunes  grados  centro  millones  martes  proceso   
Topic 7: incendio  paris  notre  abril  bomberos  fuego  llamas  catedral  notredame  historia   
Topic 8: ministro  partido  acuerdo  may  votos  presidente  elecciones  gobierno  trump  despues   
Topic 9: partido  gobierno  mes  ministro  parlamento  grupo  spain  caso  abril  anuncio   
Topic 10: trump  presidente  elecciones  partido  cumbre  gobierno  ministro  viernes  domingo  consejo   
Topic 11: est  pas  pour  une  par  sont  qui  espagne  jeunes  mais   
Topic 12: personas  policia  ciudad  ley  parte  autoridades  horas  martes  gobierno  zona   
Topic 13: mujer  horas  personas  redes  von  lugar  vida  juicio  autoridades  seguridad   
Topic 14: personas  gobierno  martes  lunes  acuerdo  mundo  parlamento  ley  ciudad  autoridades   
Topic 15: trump  presidente  grupo  gobierno  medios  fuerzas  jueves  despues  parte  anuncio   
Topic 16: frutas  personas  enfermedad  mujeres  espacio  trump  mujer  consumo  salud  peque   
Topic 17: grados  autoridades  temblor  escala  magnitud  viernes  lugar  seismo  union  parte   
Topic 18: millones  personas  autoridades  parte  hora  jueves  partido  sismo  poblacion  martes   
Topic 19: presidente  boeing  ministro  partido  euros  erc  parlamento  votos  personas  georgia

As we can see, there are 20 topics; however, they are rather difficult to interpret or follow, so we decided to move on to other models.

3.2 The LSI Model

This model implements fast truncated SVD (Singular Value Decomposition) to form the topics.

We fit models with several numbers of topics (in the range 1–20) and look at the coherence measure to find the optimal number of topics. The code snippet is below:
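The loop can be sketched as follows. This is an illustrative version: it assumes the `c_v` coherence measure and the `corpus`, `dictionary` and `texts` objects built during preprocessing, and all function names are hypothetical.

```python
def topic_words(model, num_words=10):
    """Top words of every topic as plain lists, e.g. [['gobierno', ...], ...]."""
    shown = model.show_topics(num_topics=-1, num_words=num_words, formatted=False)
    return [[word for word, _ in words] for _, words in shown]


def coherence_sweep(train, score, topic_counts):
    """Train one model per topic count and return {num_topics: coherence}."""
    return {k: score(train(k)) for k in topic_counts}


def best_num_topics(scores):
    """Topic count with the highest coherence score."""
    return max(scores, key=scores.get)


def gensim_coherence_scores(model_cls, corpus, dictionary, texts,
                            topic_counts=range(1, 21), coherence="c_v"):
    """Coherence per topic count for a Gensim model class such as LsiModel."""
    from gensim.models import CoherenceModel  # requires gensim

    def train(k):
        return model_cls(corpus=corpus, id2word=dictionary, num_topics=k)

    def score(model):
        cm = CoherenceModel(topics=topic_words(model), texts=texts,
                            dictionary=dictionary, coherence=coherence)
        return cm.get_coherence()

    return coherence_sweep(train, score, topic_counts)


# Usage:
# from gensim.models import LsiModel
# scores = gensim_coherence_scores(LsiModel, corpus, dictionary, texts)
# print(best_num_topics(scores), scores)
```

Because `LdaModel` accepts the same `corpus`, `id2word` and `num_topics` arguments, the same function can be reused for an LDA sweep by passing `LdaModel` as `model_cls`.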

The chart below shows the results:

It seems that the best model is the one with only 3 topics.

View the topics:

Topic 0: presidente gobierno personas ministro seguridad atentados parte venezuela trump policia   
Topic 1: presidente atentados venezuela trump policia personas gobierno juan paris ataques   
Topic 2: trump venezuela gobierno juan nicolas casa clinton votos noviembre elecciones

It seems that the topics of the LSI model with 3 topics share many overlapping words, so this model is not useful for us.

3.3 The LDA Model

In short, LDA generates topics based on word frequencies across a set of documents.

Again, we fitted multiple models and looked at the coherence:

For this model the best number of topics is 12. Let’s review the topics:

Topic 0: derechos humanos muerte guerra tribunal juez caso libertad personas juicio   
Topic 1: estudio tierra universidad mundo agua investigadores cambio expertos corea sistema   
Topic 2: policia hombre casa mujer muerte familia hospital autoridades caso despues   
Topic 3: presidente gobierno ministro justicia mubarak pueblo presos venezuela jefe caso   
Topic 4: mujeres mundo mujer papa vida iglesia matrimonio caso hombres casos   
Topic 5: personas metros rescate zona autoridades isla mar kilometros agua horas   
Topic 6: partido presidente elecciones votos gobierno candidato comicios partidos domingo ministro   
Topic 7: personas seguridad fuerzas ciudad ataque policia fuentes grupo ataques protestas   
Topic 8: millones euros gobierno dolares medidas crisis parte empresas trabajo sector   
Topic 9: personas terremoto ayuda ciudad seismo zonas zona millones agua muertos   
Topic 10: gobierno presidente ministro consejo acuerdo seguridad parte paz union asuntos   
Topic 11: aeropuerto avion accidente vuelos pasajeros horas vuelo seguridad autoridades personas

It looks like the topics are:

Topic 0: trials (justice)
Topic 1: nature studies
Topic 2: violence (domestic, maybe?)
Topic 3: protests and unrest (like the Venezuelan case)
Topic 4: life and family
Topic 5: sea disasters
Topic 6: elections
Topic 7: terrorism
Topic 8: economic crisis
Topic 9: seismic events (earthquakes)
Topic 10: peace treaties
Topic 11: airports and security

The topics for this model look well defined, so we will select this model, with this number of topics, to tag our dataset.

3.4 Coherence of the Best Models

Let’s compare the coherence of the best models:

Again, compared with the other algorithms, the LDA model is the best.

Despite the help of the topic coherence coefficient, the question of how to select the right number of topics is not easy to answer. In my experience, it really depends on:

The kind of corpus you are using.
The size of the corpus.
The number of topics you might expect to see, based on past or related projects, for example.

4. Visualizing the Final Model

One of the most popular topic modeling visualization libraries is LDAvis (pyLDAvis in Python); you can use it to get a nice visualization of the topics:
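A minimal sketch with pyLDAvis, assuming a trained Gensim LDA model plus the `corpus` and `dictionary` used to fit it. The function name and the output file name are placeholders; the Gensim helper module is called `pyLDAvis.gensim_models` in recent releases and `pyLDAvis.gensim` in older ones, so the sketch tries both.

```python
def save_lda_visualization(lda_model, corpus, dictionary, out_path="lda_topics.html"):
    """Write the interactive LDAvis chart for a Gensim LDA model to an HTML file."""
    import pyLDAvis  # requires pyLDAvis and gensim

    try:
        import pyLDAvis.gensim_models as gensimvis  # recent pyLDAvis releases
    except ImportError:
        import pyLDAvis.gensim as gensimvis  # older pyLDAvis releases

    vis = gensimvis.prepare(lda_model, corpus, dictionary)
    pyLDAvis.save_html(vis, out_path)
    return out_path
```

Inside a Jupyter notebook, calling `pyLDAvis.enable_notebook()` and then `pyLDAvis.display(vis)` renders the same chart inline instead of writing a file.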

You should see an interactive chart like this:

From the chart you can see how some topics share words and how related (or well differentiated) they are from each other, so you can either apply more preprocessing or select a different number of topics.

5. Classifying the Whole Corpus with the Topics Found

Now that we have selected the best model and number of topics (for this article), it is time to assign a topic to every document, that is, to cluster the collection according to the topics.

We selected the LDA model with 12 topics and implemented a function to assign a dominant topic to each document, then mapped each topic to a label:
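One way to sketch this step, assuming `lda_model` is the trained Gensim LDA model and `corpus` the bag-of-words corpus. Gensim's `get_document_topics` returns `(topic_id, probability)` pairs for a document; the helper names and the short labels (condensed from the topic interpretation in section 3.3) are illustrative.

```python
TOPIC_LABELS = {
    0: "justice", 1: "nature studies", 2: "violence", 3: "protests",
    4: "life and family", 5: "sea disasters", 6: "elections", 7: "terrorism",
    8: "economic crisis", 9: "earthquakes", 10: "peace treaties", 11: "airports",
}


def dominant_topic(doc_topics):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])


def tag_documents(lda_model, corpus, labels=TOPIC_LABELS):
    """Return one (topic_id, label, probability) tuple per document."""
    tagged = []
    for bow in corpus:
        topic_id, prob = dominant_topic(lda_model.get_document_topics(bow))
        tagged.append((topic_id, labels[topic_id], float(prob)))
    return tagged


# Usage (df is a hypothetical pandas DataFrame holding the news):
# df["topic"] = [label for _, label, _ in tag_documents(lda_model, corpus)]
```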

The result:

And mapped to the text:

5.1 Topic Distribution in the Dataset

After tagging, let’s examine how many documents belong to each topic:
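Counting the documents per label takes only a few lines with `collections.Counter`. The sketch assumes `labels` holds one topic label per document; the function name is hypothetical and the sample values are made up purely to show the shape of the result.

```python
from collections import Counter


def topic_distribution(labels):
    """Return [(label, count, share), ...] sorted from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Made-up labels, only to show the shape of the result.
sample = ["elections", "terrorism", "elections", "earthquakes"]
for label, n, share in topic_distribution(sample):
    print(f"{label:<12} {n:>3} {share:.0%}")
```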

It seems that the topics are almost balanced.

Final Words

Topic modeling requires multiple runs of cleaning the data, reading the results, adjusting the preprocessing accordingly and trying again and again until you are satisfied with the results.

The complete code can be found in this Jupyter notebook, and you can browse more projects on my GitHub.

If you need some help with Data Science related projects: https://www.disruptio-analytics.com/