DataCamp Project: The Hottest Topics in Machine Learning

Pedro Meira
Published in Time to Work
6 min read · Sep 22, 2019

Using Natural Language Processing (NLP) on NIPS papers to uncover the trendiest topics in machine learning research over the years.

1. Loading the NIPS papers:

The Neural Information Processing Systems Foundation is a non-profit corporation whose purpose is to foster the exchange of research on neural information processing systems in their biological, technological, mathematical, and theoretical aspects. Neural information processing is a field which benefits from a combined view of biological, physical, mathematical, and computational sciences (nips.cc/About).

The NIPS conference (Neural Information Processing Systems) is one of the most prestigious yearly events in the machine learning community. At each NIPS conference, a large number of research papers are published. Over 50,000 PDF files were automatically downloaded and processed to obtain a dataset on various machine learning techniques. These NIPS papers are stored in datasets/papers.csv. The CSV file contains information on the different NIPS papers that were published from 1987 until 2017 (30 years!). These papers discuss a wide variety of topics in machine learning, from neural networks to optimization methods and many more (DataCamp).

2. Preparing the data for analysis

For the analysis of the papers, we are only interested in the text data associated with the paper as well as the year the paper was published in.

We will analyze this text data using natural language processing. Since the file also contains metadata such as IDs and filenames, we need to remove all the columns that do not contain useful text information.
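A minimal sketch of this step with pandas is shown below. The real file's column names are not listed here, so the columns `id`, `year`, `title`, `pdf_name` and `paper_text`, and the two sample rows, are assumptions made only for illustration; a small in-memory CSV stands in for datasets/papers.csv.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for datasets/papers.csv. The column names and rows
# are assumptions for illustration, not the real file contents.
sample_csv = io.StringIO(
    "id,year,title,pdf_name,paper_text\n"
    "1,1987,Self-Organization of Associative Database,1.pdf,text one\n"
    "2,2017,Deep Learning: A Survey,2.pdf,text two\n"
)

# For the real data this would be: pd.read_csv("datasets/papers.csv")
papers = pd.read_csv(sample_csv)

# Drop the metadata columns and keep the year plus the text columns.
metadata_columns = ["id", "pdf_name"]  # hypothetical metadata column names
papers = papers.drop(columns=metadata_columns)

print(papers.columns.tolist())
```

With the real file, the list in `metadata_columns` would need to match whatever non-text columns the CSV actually contains.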

3. How machine learning has evolved over time

In order to understand how the machine learning field has recently exploded in popularity, we will begin by visualizing the number of publications per year.

By looking at the number of published papers per year, we can understand the extent of the machine learning ‘revolution’! Typically, this significant increase in popularity is attributed to the availability of large amounts of computing power and data, and to improvements in algorithms.

Number of papers over the years
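The counts behind such a chart can be computed roughly as follows. This is a sketch on a made-up miniature table, assuming the publication year lives in a column named `year`; the numbers are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the papers table; only the 'year' column matters here.
papers = pd.DataFrame({"year": [1987, 1987, 2016, 2017, 2017, 2017]})

# Count how many papers were published in each year (index sorted by year).
counts = papers.groupby("year").size()

print(counts)
```

If matplotlib is installed, `counts.plot(kind="bar")` would draw the counts as a bar chart.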

4. Preprocessing the text data

Let’s now analyze the titles of the different papers to identify machine learning trends. First, we will perform some simple preprocessing on the titles to make them more amenable to analysis. We will use a regular expression to remove any punctuation from each title and then convert it to lowercase. We’ll then print the titles of the first few rows before and after the modification.
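The two preprocessing steps can be sketched with Python's standard `re` module. The sample titles and the helper name `preprocess_title` are made up for illustration, and the pattern `[^\w\s]` (drop everything that is not a word character or whitespace) is one possible choice of regular expression, not necessarily the one used in the project.

```python
import re

# Made-up example titles, standing in for the 'title' column of the dataset.
titles = [
    "Self-Organization of Associative Database and Its Applications",
    "Deep Learning: A (Brief) Survey!",
]


def preprocess_title(title):
    """Strip punctuation with a regular expression, then lowercase."""
    without_punctuation = re.sub(r"[^\w\s]", "", title)
    return without_punctuation.lower()


processed = [preprocess_title(title) for title in titles]

# Print each title before and after the modification.
for before, after in zip(titles, processed):
    print(before, "->", after)
```

Note that this pattern also removes hyphens, so "Self-Organization" becomes "selforganization"; replacing punctuation with a space instead would keep the two words separate.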

5. WordCloud

In order to verify whether the preprocessing happened correctly, we can make a word cloud of the titles of the research papers. This will give us a visual representation of the most common words. Visualisation is key to understanding whether we are still on the right track! In addition, it allows us to verify whether we need additional preprocessing before further analyzing the text data.

Python has a massive number of open-source libraries! Instead of trying to develop a method to create word clouds ourselves, we’ll use Andreas Mueller’s wordcloud library.

WordCloud generated:

6. Prepare the text for LDA analysis

In natural language processing (NLP), latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model.
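Written out, the generative story described above looks roughly like this. The symbols are the conventional ones and are an assumption of this sketch rather than notation used elsewhere in this post: $\alpha$ and $\beta$ are Dirichlet hyperparameters, $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, and $z_{d,n}$ is the topic assigned to the $n$-th word $w_{d,n}$ of document $d$.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{topic mixture of document } d \\
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{word distribution of topic } k \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) && \text{topic of the } n\text{-th word in document } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}}) && \text{the observed word}
\end{aligned}
```

Only the words $w_{d,n}$ are observed; fitting LDA means inferring the hidden $\theta_d$, $\phi_k$ and $z_{d,n}$ from them.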

The main text analysis method that we will use is latent Dirichlet allocation (LDA). LDA is able to perform topic detection on large document sets, determining what the main ‘topics’ are in a large unlabeled set of texts. A ‘topic’ is a collection of words that tend to co-occur often. The hypothesis is that LDA might be able to clarify what the different topics in the research titles are. These topics can then be used as a starting point for further analysis.

LDA does not work directly on text data. First, it is necessary to convert the documents into a simple vector representation. This representation will then be used by LDA to determine the topics. Each entry of a ‘document vector’ corresponds to the number of times a word occurred in the document. In short, we will convert a list of titles into a list of vectors, all with length equal to the size of the vocabulary. For example, ‘Analyzing machine learning trends with neural networks.’ would be transformed into [1, 0, 1, ..., 1, 0].

We’ll then plot the 10 most common words based on the outcome of this operation (the list of document vectors). As a check, these words should also occur in the word cloud.
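To make the vector representation concrete, here is a small standard-library sketch. The three titles are made up and already preprocessed, and the helper name `to_vector` is hypothetical. scikit-learn's `CountVectorizer` performs the same kind of conversion on real data; this hand-rolled version only illustrates the idea.

```python
from collections import Counter

# Made-up, already preprocessed titles.
titles = [
    "analyzing machine learning trends with neural networks",
    "neural networks for machine translation",
    "learning with kernels",
]

# Vocabulary: every distinct word across all titles, in sorted order.
vocabulary = sorted({word for title in titles for word in title.split()})


def to_vector(title):
    """Count how often each vocabulary word occurs in one title."""
    counts = Counter(title.split())
    return [counts[word] for word in vocabulary]


# One vector per title, each with length equal to the vocabulary size.
document_vectors = [to_vector(title) for title in titles]

# Summing the vectors column-wise gives the total count of every word.
totals = [sum(column) for column in zip(*document_vectors)]

# Most common words first; ties are broken alphabetically.
most_common = sorted(zip(vocabulary, totals), key=lambda pair: (-pair[1], pair[0]))[:3]

print(vocabulary)
print(document_vectors)
print(most_common)
```

This sketch keeps every word, including very common ones such as "with" and "for"; in practice such stop words are often removed before counting.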

7. Analysing trends with LDA

Finally, the research titles will be analyzed using LDA. Note that in order to process a new set of documents (e.g. news articles), a similar set of steps will be required to preprocess the data. The flow that was constructed here can thus easily be exported for a new text dataset.

The only parameter we will tweak is the number of topics in the LDA algorithm. Typically, one would calculate the ‘perplexity’ metric for different numbers of topics and keep the number that gives the lowest perplexity. For now, let’s simply experiment with a few different numbers of topics. From there, we can distinguish what each topic is about (‘neural networks’, ‘reinforcement learning’, ‘kernel methods’, ‘gaussian processes’, etc.).

Topics found via LDA:

Topic #0:
bayesian models data learning latent

Topic #1:
learning reinforcement gradient probabilistic algorithm

Topic #2:
learning optimization sparse online algorithms

Topic #3:
learning estimation gaussian regression non

Topic #4:
networks neural model deep network

Topic #5:
inference multi stochastic efficient optimal

Topic #6:
time spectral continuous brain analog

Topic #7:
analysis feature functions visual search

Topic #8:
learning prediction sampling recognition structure

Topic #9:
learning fast clustering markov image
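A listing in this format could be produced along the following lines with scikit-learn. This is a sketch on a made-up toy corpus, so its topics will not match the ones above; the helper `top_words_per_topic`, the topic counts tried (2 and 3) and the fixed `random_state` are all assumptions chosen for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already preprocessed titles standing in for the real dataset.
titles = [
    "deep neural networks for image recognition",
    "reinforcement learning with policy gradient",
    "bayesian inference for latent variable models",
    "kernel methods and gaussian processes",
    "stochastic optimization for sparse online learning",
    "convolutional neural networks for visual recognition",
]

# Convert the titles into word-count vectors, dropping English stop words.
vectorizer = CountVectorizer(stop_words="english")
count_data = vectorizer.fit_transform(titles)

# Vocabulary words ordered by their column index in count_data.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_words_per_topic(model, words, n_top_words):
    """Return the n_top_words highest-weighted words of every topic."""
    topics = []
    for topic_weights in model.components_:
        top_indices = topic_weights.argsort()[::-1][:n_top_words]
        topics.append([words[i] for i in top_indices])
    return topics


# Try a couple of topic counts and report the perplexity of each model.
for number_topics in (2, 3):
    lda = LatentDirichletAllocation(n_components=number_topics, random_state=0)
    lda.fit(count_data)
    print(number_topics, "topics, perplexity:", lda.perplexity(count_data))
    for index, topic in enumerate(top_words_per_topic(lda, words, 3)):
        print(f"Topic #{index}:", " ".join(topic))
```

Here the perplexity is computed on the same data the model was fitted on, which keeps the sketch short; comparing topic counts on held-out documents would give a fairer picture.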

8. The future of machine learning

Machine learning has become increasingly popular over the past years. The number of NIPS conference papers has risen exponentially, and people are continuously looking for ways to incorporate machine learning into their products and services.

Although this analysis focused on analyzing machine learning trends in research, a lot of these techniques are rapidly being adopted in industry. Following the latest machine learning trends is a critical skill for a data scientist, and it is recommended to continuously keep learning by going through blogs, tutorials, and courses.

9. Bibliography

DataCamp — Project: The Hottest Topics in Machine Learning https://projects.datacamp.com/projects/158 created by Lars Hulstaert, Data Scientist at Microsoft.
