AI Distillery (Part 2): Distilling by Embedding
Word embeddings (word2vec, fastText), paper embeddings (LSA, doc2vec), embedding visualisation, paper search and charts!
At MTank, we work towards two goals: (1) Model and distil knowledge within AI. (2) Make progress towards creating truly intelligent machines. As part of these efforts we release pieces about our work for people to enjoy and learn from. If you like our work, then please show your support by following, sharing and clapping your asses off.
- Part 1: A bird’s eye view of AI research
- Part 2: Distilling by Embedding
In part 1 we presented an overview of a problem confronting the field of AI: staying abreast of new AI research is hard, and the tools for discovering it are suboptimal. Finding new and relevant papers was often like wading through a riverbed sifting for gold, with the most lucrative spots often pointed out by the denizens of Twitter.
We built a web-app at ai-distillery.io which we believed was the beginning of tackling this problem, with different features for paper search, paper/word embeddings, and proximity and visualisation of these embeddings. Part 1 was an overview; here we'll dive a little deeper into the methods we used to construct these embeddings, as well as our 'Paper Search' page. If you're unfamiliar with our work so far, we'd advise swinging back to part 1 at this point: it'll bring you fairly nicely up-to-speed.
Word2vec is a popular algorithm for generating vector representations (aka embeddings) of words. The approach is grounded in distributional semantics: words that appear in similar contexts are similar. A context, in this case, is the neighbourhood of a word. The number of words to consider within the context of a word is defined by the window size.
For instance, consider the sentence “I like machine learning” and a context window of size 1. Then, the words which give context, or appear in the context window around the word “machine”, are “like” and “learning” (the window is considered both on the left and on the right).
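That context extraction can be sketched in a few lines of Python (a toy helper for illustration, not from our codebase):

```python
def context_words(tokens, centre_idx, window=1):
    """Return the words within `window` positions of the centre word."""
    left = tokens[max(0, centre_idx - window):centre_idx]
    right = tokens[centre_idx + 1:centre_idx + 1 + window]
    return left + right

tokens = "I like machine learning".split()
print(context_words(tokens, tokens.index("machine")))  # ['like', 'learning']
```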
The key idea of word2vec is to start with the centre word “machine”, look at its vector representation (initialised randomly), and use this vector to predict the context words: “like” and “learning”. The vector representation is then updated according to the error signal from this prediction step. Then, the algorithm proceeds with the following word as the new centre word, i.e. “learning”, sets up the new context, and repeats the same procedure.
Therefore, given an input corpus, word2vec generates vector representations of words in such a way that similar words end up close together in the vector space. Word2vec was first introduced by Mikolov et al. You can find a great introductory tutorial here with more details on how word2vec achieves these vector representations.
Two of the most popular methods for creating word embeddings at present are fastText and word2vec. It was for this reason, namely their proliferation and popularity, that we opted to deploy them in our first AI Distillery iteration.
While we have trained both models, we haven't yet compared the embeddings quantitatively. You can, however, compare them qualitatively on our web-app by searching for the same word on the word embedding proximity page (see below).
Paper embeddings: A paper’s essence (in theory)
For search-ability, we're also interested in automatically learning representations of entire research papers, as well as the similarity between them. It is typically difficult to fit an entire paper into a single vector, due to the amount of varied and complex information within such a large body of text. Apart from averaging the word vectors of all the words within a paper, there are also specialised approaches to learning embeddings for whole documents. As a first step, we investigate a traditional approach, latent semantic analysis (LSA), and doc2vec, which is based on the word2vec approach.
Both LSA and doc2vec allow us to take documents of text and embed them in a vector space. These embedding vectors should, in theory, contain somewhere between a meaningful portion and the majority of “semantic meaning” relevant to the piece of research within them. If the embedding vectors work as expected, computer vision papers should be closer together in this space, and reinforcement learning (RL) papers close to other RL papers. Simple, like with like.
Latent semantic analysis and doc2vec
Latent semantic analysis (LSA) is a traditional technique that can be employed to learn vector representations of documents. It also forms the basis of two well-known extensions: probabilistic LSA and latent Dirichlet allocation (LDA).
The typical approach is to conduct (truncated) singular value decomposition on the tf-idf weighted term-document matrix. The first N singular vectors are used to transform the documents into a lower dimensional, continuous vector space. The main benefit is that LSA does not purely rely on a lexical match to obtain a notion of similarity. In essence, the global co-occurrences of words within documents are taken into account to span the lower dimensional vector space.
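This pipeline can be sketched with scikit-learn (an illustration of the technique, not AI Distillery's exact code; the three toy documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "convolutional networks for image classification",
    "deep convolutional image recognition",
    "policy gradients for reinforcement learning agents",
]

# tf-idf weighted document-term matrix, then truncated SVD down to N=2 dimensions.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = lsa.fit_transform(tfidf)   # one 2-D vector per document
```

With real papers, `n_components` would be in the hundreds rather than 2.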
In contrast, doc2vec  (or paragraph vectors) exploits the techniques introduced in word2vec and leverages them to the document level. In addition to word vectors, the training algorithm keeps track of document (or paragraph) vectors and optimizes them in the same way as word vectors. In our case, we deliberately use the whole full-text of an ArXiv submission as the document. For an awesome blog on more recent techniques for word embeddings and sentence/document embeddings check here and this repo for a list of all the recent papers in this space.
We are curious as to how the two approaches differ in terms of modelling the topical similarity of scientific papers. In future work we may present a detailed comparison of these approaches and others that have been implemented in the project.
Readers, please feel free to input your favourite papers, or papers from your general research area to see what other papers appear as similar. And let us know by way of the comments if you find papers you’d uncovered yourself, ones you didn’t know, and if the ‘similar’ papers are completely unrelated.
Ok, so now we have word embeddings and paper embeddings. This is starting to get a little more interesting. The fact that word and paper embeddings are both still embeddings (i.e. just a vector of numbers) means that we can actually visualise both of them with the same tool.
So here come a few cool visualisations: no longer will you have to wonder what a map of global AI research, colour-coded by topic and embedded in vector space, looks like. (Well, actually, you'll still have to wonder, because right now the colours are just k-means clusters, but in the future you won't.)
Beyond the development advantages, using a single tool for both embeddings makes them easier to understand and visualise. So we borrowed the word2vec-explorer GitHub repo and adjusted it to our needs: visualising any type of embedding (not just the default of gensim word2vec models) with t-SNE.
Within both embedding pages, the user can choose the number of embeddings to show, how many k-means clusters to split them into, and which embedding type to show. The types available are LSA and doc2vec on the paper embedding page, plus word2vec and fastText on the word embedding page. Below we have some examples and descriptions to give you a better idea of what we're talking about.
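The core of what these pages compute can be sketched as follows (scikit-learn assumed; the random matrix is a stand-in for a real word or paper embedding matrix):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))   # 200 items, 50-dimensional embeddings

# Project to 2-D for plotting, and cluster in the original space for colouring.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embeddings)
# Each point i is drawn at (coords[i, 0], coords[i, 1]), coloured by clusters[i].
```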
Visualise any type of embedding
Word embeddings: Distance in vector-space
Word embeddings: Sanity-check
Paper embeddings: Zooming in to the similarity between papers and their clusters
So you grabbed a corpus, added some embeddings and figured out a nifty way to visualise the research. OK, why? What's the point of all this?
Well, now we have search-ability, automatic charts, and cool insights into the generally opaque body of research we dive into with our lucky artefacts each day.
Paper Search, baby.
We had a hell of a journey trying to create a paper search system that is efficient in both speed and memory. In the last blog we got a workable system up with Whoosh (awesome). However, it tended to be quite slow on the server, and the tool isn't that popular and therefore isn't well documented (not awesome).
Determined to be our best-selves, we converted our paper indexing and search tool from Whoosh to ElasticSearch, which gives us much more flexibility, robustness and speed in dealing with large quantities of papers (both for querying and indexing). It’s a bonus that ElasticSearch is also considered the standard for general text information retrieval.
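Under the hood, a paper query in Elasticsearch's query DSL looks roughly like the body below (the index name "papers" and the field names are illustrative, not our actual schema):

```python
# A multi_match query over title and abstract, boosting title hits twice as much.
query = {
    "query": {
        "multi_match": {
            "query": "generative adversarial networks",
            "fields": ["title^2", "abstract"],
        }
    },
    "size": 10,
}

# With the official Python client, this would be sent to a running cluster as:
#   es = Elasticsearch("http://localhost:9200")
#   hits = es.search(index="papers", body=query)["hits"]["hits"]
```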
If people (yes, like you) like the paper search, go star the repo at ai-distillery, and in return we’ll make sure to prioritise updating the search functionality just for you.
On the back of all of this work, now we can create valuable metrics for the AI community. Silicon Valley says ‘you can’t manage what you don’t measure’, and we say ‘laziness is the mother of invention’. Automate it for everyone, and then we can go make something else.
There are quite a few metrics available now on the 'charts and additional insights' page for anyone who's curious: absolute paper numbers over time, word mentions, citation counts, and a ranking of prolific authors.
Personally, we quite like the ranking of most commonly assigned topics via Semantic Scholar, but we're also into the ranking, ordering and stoking of rivalries between the world's most prolific researchers.
- Number of papers released on arxiv over time
- The number of mentions of “GAN” in titles over time since 2014
- Top 100 cited papers according to semantic scholar from 2014+ (in our arxiv dataset)
- Authors who have published the most papers from 2014+
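A chart like the "GAN" mentions one boils down to a simple count over titles, grouped by year; a sketch (the titles below are made up for illustration):

```python
from collections import Counter

papers = [
    (2014, "A Hypothetical GAN Architecture"),
    (2016, "Training GANs at Scale"),
    (2016, "Attention Models for Translation"),
    (2017, "Self-Attention GANs for Image Synthesis"),
]

# Note: a real pipeline should tokenise titles to avoid substring false
# positives (e.g. "organ" contains "gan").
mentions = Counter(year for year, title in papers if "gan" in title.lower())
print(sorted(mentions.items()))  # [(2014, 1), (2016, 1), (2017, 1)]
```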
Roundup and future plans
We’re only just beginning, so if you’re into what we’re at, be sure to follow along for future updates. We’d like to be here for the long haul, converting research into insights and our insights into research. Show some support and let us know how we can improve!
Summarising progress within AI is tough business, but AI will help us achieve it, just as it helps automate difficult tasks within countless other fields. As AI Distillery moves forward, here’s some more stuff we think is cool, and that we’d like to do:
- Analysing more datasets e.g. twitter feeds, conference submission sites and other APIs.
- Entity embeddings from graph methods like GCN and GAT (using one of these graph approaches would allow us to represent author embeddings, institution embeddings, and embeddings of entities found in papers, e.g. "VAEGAN").
- Better visualisations of embeddings, e.g. 3D visualisations with WebGL and more ways of interacting and exploring the space
- Timeline visualisation, e.g. showing the biggest papers or the rise of trends within the last few years or even going as far back as comparing the rise of symbolic and connectionist periods within AI.
- More embeddings of every type (e.g. ELMo, concept embeddings, link prediction embeddings)
If you would like to collaborate with us in our wild journey of making AI progress more transparent, or have any comments regarding any part of our research or web-app, we’re open to suggestions. So feel free to reach out in the comment section or by email (firstname.lastname@example.org).
Be sure to follow along here, or on the ai-distillery.io site.
Star our repo: ai-distillery
And clap your little hearts out for MTank!
- Harris, Z. S. (1954). Distributional structure. Word, 10(2–3), 146–162.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In NIPS (pp. 3111–3119).
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135–146.
- Conneau, A., Kruszewski, G., Lample, G., Barrault, L., & Baroni, M. (2018). What you can cram into a single \$ &!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 2126–2136).
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 391–407.
- Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In ICML (pp. 1188–1196).
- Hofmann, T. (1999, July). Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth conference on Uncertainty in Artificial Intelligence (pp. 289–296). Morgan Kaufmann Publishers Inc.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. JMLR, 3(Jan), 993–1022.
- Salton, G., & Buckley, C. (1988). Term-weighting Approaches in Automatic Text Retrieval. Information Processing & Management, 24(5), 513–523.
- Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579–2605.