
A beginner’s guide to OCTIS vol. 2: Optimizing Topic Models



In a previous post, I introduced the Python package OCTIS (Optimizing and Comparing Topic Models Is Simple) and demonstrated its features and how to get started. As the name suggests, the package makes it simple to optimize and compare topic models. This post focuses on the first letter of the name: Optimization.

At the end of this post, you will:

  1. Have a basic understanding of hyperparameter optimization.
  2. Know what to consider when optimizing a topic model.
  3. Be able to optimize your own topic models.

Hyperparameter Optimization

Almost all machine learning algorithms have hyperparameters. Hyperparameters are the settings an algorithm uses for its learning process; based on these settings, the algorithm follows the learning procedure to learn its parameters. Determining the optimal value for each hyperparameter is not straightforward, however, and is typically done through structured trial-and-error methods, collectively called hyperparameter optimization. With any of these tuning methods, an algorithm is trained with a set of hyperparameters, and its outcome is evaluated through a performance indicator or error measure. Then, the hyperparameters are (slightly) adjusted, and a model is trained and evaluated with the new settings.

The most basic approach is a grid search, in which the algorithm is trained on all combinations of a set of predefined settings. Grid search is very easy to code, can be run in parallel, and does not need any form of tuning. However, it navigates the search space inefficiently because it does not use the information gained in earlier tries. And since the grid is unlikely to include all possible settings, there is no guarantee of finding optimal values.
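To make this concrete, here is a minimal grid-search sketch. This is not OCTIS code; the train_and_evaluate function below is a made-up placeholder standing in for training a model with a given setting and returning its score:

from itertools import product

def train_and_evaluate(dropout, num_layers):
    # Placeholder: train a model with these settings and return its evaluation score
    return -(dropout - 0.5) ** 2 - (num_layers - 2) ** 2

dropouts = [0.2, 0.5, 0.8]
layer_options = [1, 2, 3]

best_score, best_setting = float('-inf'), None
for dropout, layers in product(dropouts, layer_options):  # every combination of the predefined settings
    score = train_and_evaluate(dropout, layers)
    if score > best_score:
        best_score, best_setting = score, (dropout, layers)

print(best_setting, best_score)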

A more efficient way to find good hyperparameters is Sequential Model-Based Optimization (SMBO). Here, a surrogate model of the performance indicator (or error measure) is fitted to the previous tries and used to suggest the most promising point to evaluate next.

A common SMBO approach is Bayesian optimization, which can be explained through Gaussian process regression. With Gaussian process regression, information from each newly evaluated hyperparameter setting is added to the prior built from the points sampled so far. Bayesian optimization is efficient when optimizing only a few hyperparameters, but its performance degrades as the search space grows. Also, Bayesian optimization cannot be parallelized, since the newly learned information must be added to the priors sequentially. Further details on Gaussian process regression can be found here, and a more detailed review of hyperparameter optimization methods can be found here.
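OCTIS defines its search spaces with scikit-optimize (skopt), which we will import below. As a small, self-contained illustration of Gaussian-process-based Bayesian optimization, independent of topic models, here is skopt's gp_minimize applied to a toy objective; the function and bounds are made up purely for demonstration:

from skopt import gp_minimize

def objective(x):
    # Toy 1-D objective we want to minimize
    return (x[0] - 2.0) ** 2

# A Gaussian process surrogate is fitted to the points evaluated so far
# and used to pick the most promising point to try next
result = gp_minimize(objective, dimensions=[(-5.0, 5.0)], n_calls=20, random_state=42)
print(result.x, result.fun)  # best point found and its objective value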

Optimizing topic models with OCTIS

OCTIS uses Bayesian optimization for hyperparameter optimization. In my previous post, I trained a NeuralLDA model on the BBC dataset, so I will do that here as well.

Assuming you have installed OCTIS, let's start directly in Python. Again, we need to import the dataset and the NeuralLDA model:

from octis.dataset.dataset import Dataset
from octis.models.NeuralLDA import NeuralLDA

Furthermore, we import the search space classes in which we optimize, the Coherence metric (which we use as the performance measure for evaluating each setting), and the Optimizer itself:

from skopt.space.space import Real, Categorical, Integer
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer

Let’s fetch the dataset and initialize the model, and we are almost ready to go:

dataset = Dataset()
dataset.fetch_dataset('BBC_news')
model = NeuralLDA(num_topics=20)

In the previous post, we found NeuralLDA to have the following (hyper)parameters:

  • num_topics
  • activation
  • dropout
  • learn_priors
  • batch_size
  • lr
  • momentum
  • solver
  • num_epochs
  • reduce_on_plateau
  • prior_mean
  • prior_variance
  • num_layers
  • num_neurons
  • num_samples
  • use_partition

Search Space

Theoretically, all these variables can be optimized. But since Bayesian optimization does not perform well in a high-dimensional search space, we limit our search space dimensionality to three variables:

search_space = {"num_layers": Integer(1, 3),
                "num_neurons": Categorical({100, 200, 300}),
                "dropout": Real(0.0, 0.95)}

For Categorical dimensions, all possible values need to be listed, while for Real and Integer dimensions only the lower and upper bounds are given. Note that although num_neurons takes integer values, we define it as a Categorical dimension here so that only these three values are considered, rather than any integer between 100 and 300.
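To get a feel for the difference between the dimension types, the skopt space objects imported above can be sampled directly; this is just a small illustration and is not needed for the optimization itself:

# Draw a few example values from each type of search dimension
print(Integer(1, 3).rvs(n_samples=5, random_state=0))                 # integers between 1 and 3
print(Categorical([100, 200, 300]).rvs(n_samples=5, random_state=0))  # only the listed values
print(Real(0.0, 0.95).rvs(n_samples=5, random_state=0))               # any float in [0.0, 0.95]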

Coherence

Furthermore, we evaluate different hyperparameter settings based on the coherence score, which indicates how well different words in a topic support each other. There are various measures to calculate the coherence score. OCTIS’ default coherence measure is c_npmi. However, I recommend using c_v as it correlates better with human interpretation [1]:

coherence = Coherence(texts=dataset.get_corpus(), measure='c_v')

Optimization

Now we are almost ready to start the optimization. First, we need to define the number of optimization iterations, which is a trade-off between time and performance: more iterations typically means better results, but at the cost of a longer training time. Also, more hyperparameters require more iterations. A rule of thumb is to use 15 times the number of hyperparameters as the number of iterations. We have three hyperparameters, so we will use 45 iterations.

optimization_runs=45

Furthermore, each time a new setting is trained, the neural network is initialized with random weights. To get robust results per setting, each setting should be run several times, and we take the median of these scores as the value for that setting. Again, the number of model runs is a trade-off between time and quality; the total training time is model_runs times longer than with a single run. We will use 5:

model_runs=5
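Conceptually, what happens for a single setting is something like the following; the scores below are made-up values purely for illustration, and OCTIS performs this aggregation internally:

import statistics

# Hypothetical coherence scores from training the same setting five times
scores_for_one_setting = [0.48, 0.51, 0.47, 0.50, 0.49]
print(statistics.median(scores_for_one_setting))  # 0.49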

Now we are ready to start optimizing and saving the results. Let’s time the optimization to know how long training takes. This might be handy for future reference.

import time

optimizer = Optimizer()

start = time.time()
optimization_result = optimizer.optimize(
    model, dataset, coherence, search_space,
    number_of_call=optimization_runs,
    model_runs=model_runs,
    save_models=True,
    extra_metrics=None,  # to keep track of other metrics
    save_path='results/test_neuralLDA/')
end = time.time()
duration = end - start

optimization_result.save_to_csv("results_neuralLDA.csv")
print('Optimizing model took: ' + str(round(duration)) + ' seconds.')

While the model is training, you should see output similar to the snippet below in your console:

Epoch: [49/100] Samples: [76293/155700] Train Loss: 922.4048726517341 Time: 0:00:00.465756
Epoch: [49/100] Samples: [334/33400] Validation Loss: 920.509180857036 Time: 0:00:00.027928
Epoch: [50/100] Samples: [77850/155700] Train Loss: 922.2904436115125 Time: 0:00:00.490685
Epoch: [50/100] Samples: [334/33400] Validation Loss: 920.9605340101048 Time: 0:00:00.027925
Epoch: [51/100] Samples: [79407/155700] Train Loss: 927.4208815028902 Time: 0:00:00.435834
Epoch: [51/100] Samples: [334/33400] Validation Loss: 920.7156858158683 Time: 0:00:00.026928
Early stopping
Optimizing model took: 6922 seconds.

So, it took less than two hours. Not too bad.

Analysis

Now, we are ready to analyze the results. The result.json file saved during optimization contains a dictionary with a lot of information.

import json

results = json.load(open("results/test_neuralLDA/result.json", 'r'))
results.keys()
>>> dict_keys(['dataset_name', 'dataset_path', 'is_cached', 'kernel', 'acq_func', 'surrogate_model', 'optimization_type', 'model_runs', 'save_models', 'save_step', 'save_name', 'save_path', 'early_stop', 'early_step', 'plot_model', 'plot_best_seen', 'plot_name', 'log_scale_plot', 'search_space', 'model_name', 'model_attributes', 'use_partitioning', 'metric_name', 'extra_metric_names', 'metric_attributes', 'extra_metric_attributes', 'current_call', 'number_of_call', 'random_state', 'x0', 'y0', 'n_random_starts', 'initial_point_generator', 'topk', 'time_eval', 'dict_model_runs', 'f_val', 'x_iters'])

Instead of measuring the time ourselves as we did before, we can look up the time per model (and the total time) using 'time_eval':

results['time_eval']
>>> [184.53835153579712,
148.67225646972656,
109.24449896812439,
256.08221530914307,
154.24360704421997,
109.99915742874146,
160.35804772377014,
151.49168038368225,
155.40837001800537,
187.06181716918945,
130.2092640399933,
147.94106578826904,
123.84417510032654,
144.77754926681519,
161.64789295196533,
155.76212072372437,
135.94082117080688,
135.95739817619324,
173.89717316627502,
157.23684573173523,
163.68361639976501,
162.50231456756592,
149.47536158561707,
128.70402908325195,
160.75109601020813,
175.52507424354553,
172.95005679130554,
153.52869629859924,
186.61110424995422,
157.76874375343323,
191.34919786453247,
140.3497931957245,
132.05838179588318,
151.3189136981964,
153.88639116287231,
150.12475419044495,
178.27711153030396,
140.40700840950012,
94.96378469467163,
174.16378569602966,
92.2694320678711,
151.19830918312073,
152.21132159233093,
176.6853895187378,
147.0237419605255]
sum(results['time_eval'])
>>> 6922.101717710495

'f_val' shows the median coherence value over the different model runs for each trained setting:

results["f_val"]
>>> [0.4534176015984939,
0.4955475245370141,
0.38868674378398477,
0.4024525856722283,
0.47846034921413033,
0.3844368770533101,
0.4581705543253348,
0.4126574046864957,
0.4434880465386791,
0.49187868514620553,
0.5102920794834441,
0.48808652324541707,
0.4935514490937605,
0.4858721622773336,
0.4952236036028558,
0.465241037841993,
0.4881279220028529,
0.5023211922031112,
0.4990924448499987,
0.5024987857267208,
0.4770480035332,
0.5067195917253968,
0.4898287484351199,
0.4550063558556884,
0.4748375121047843,
0.497097248313755,
0.5034682220702588,
0.48220918244287975,
0.486563299332411,
0.4845033510847734,
0.4564350945897047,
0.4893138798649524,
0.4832975678079854,
0.4791275238679498,
0.4723902248816144,
0.5060161207074427,
0.49878190002894496,
0.4760453248044104,
0.47155389774180045,
0.5026397878110507,
0.40795520033104554,
0.5018909026526471,
0.5056155984233146,
0.4999288519956374,
0.49207181719944526]

This is not very intuitive, so let's plot it:

import matplotlib.pyplot as plt

plt.xlabel('Iteration')
plt.ylabel('Coherence score (c_v)')
plt.title('Median coherence score per iteration')
plt.plot(results["f_val"])
plt.show()

From this plot, we can see that the model did not benefit from a high number of iterations. The maximum was found in the 11th iteration, with a median coherence score of 0.51:

results['f_val'].index(max(results['f_val']))
>>> 10
results["f_val"][10]
>>> 0.5102920794834441

Now, to find the settings that were used in the 11th iteration, we use 'x_iters'. This is a dictionary containing the parameters that have been optimized:

results['x_iters'].keys()
>>> dict_keys(['dropout', 'num_layers', 'num_neurons'])

And the optimal hyperparameter settings are:

print([results['x_iters'][parameter][10] for parameter in results['x_iters'].keys()])
>>> [0.5672132691312042, 1, 200]

So, with the given number of iterations and number of topics, the optimal settings in our search space are:

  • dropout : 0.5672
  • num_layers : 1
  • num_neurons : 200
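To actually use these settings, you can retrain NeuralLDA with them directly. Below is a minimal sketch that reuses the dataset and coherence objects from above; because of the random weight initialization, the resulting score will vary slightly between runs:

# Retrain the model with the optimal hyperparameters found in the search
best_model = NeuralLDA(num_topics=20, num_layers=1, num_neurons=200, dropout=0.5672)
best_output = best_model.train_model(dataset)
print(coherence.score(best_output))  # should land in the neighborhood of the 0.51 found above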

Conclusion

That’s it! If you made it to the end of this post, you’re now ready to optimize your own topic model. You have a basic understanding of the considerations for choosing the number of parameters/search space and know how to program this in Python.

With my research group, I have developed new topic modeling algorithms, which have been published in the Symposium Series on Computational Intelligence 2021. Initial results indicate high interpretability for these models in comparison to the topic models covered by OCTIS. In future posts, I will explain how these models work and compare them with existing topic models from OCTIS.

Good luck optimizing your own topic models in OCTIS and don’t hesitate to reach out to me if you have any questions/remarks.

Furthermore, here is a link to the Google Colab in which a CTM model is optimized by the OCTIS developers.

If you like this work, you are probably interested in topic modeling more broadly. In that case, the following might interest you as well.

We have created a new topic modeling algorithm called FLSA-W (the official page is here, but you can see the paper here).

FLSA-W outperforms other state-of-the-art algorithms (such as LDA, ProdLDA, NMF, CTM and more) on several open datasets. This work has been submitted but is not peer-reviewed yet.

If you want to use FLSA-W, you can download the FuzzyTM package or the flsamodel in Gensim. For citations, please use this paper.

References

[1] Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399–408).
