AI Distillery (Part 1): A bird’s eye view of AI research
Different lenses to see through AI; motivations and introduction to our web app
At MTank, we work towards two goals. (1) Model and distil knowledge within AI. (2) Make progress towards creating truly intelligent machines. As part of these efforts we release pieces about our work for people to enjoy and learn from. If you like our work, then please show your support by following, sharing and clapping your asses off. Thanks in advance!
- Part 1: A bird’s eye view of AI research
- Part 2: Distilling by Embedding
What is this and why did you do it?
Welcome to the first instalment of the AI Distillery Project, where our MTank team, frustrated by the volume of AI research globally, attempts to hack a solution to our pile of unread papers, which mounts daily. In our previous vision blog, we playfully, and accurately, described the volume of global AI research publications as a firehose: incredibly high in volume, but a medium which prevents one from quenching their thirst properly. So we decided to try our hand at aqueduct-ing that force into refreshing insights about AI and various related fields.
Why, you ask? Well, firstly, we thought it was an interesting problem. Secondly, we heard (and matched) the tortured wails of researchers distraught at their inability to keep up with progress, even in the most esoteric of AI sub-subfields. Often, a researcher needs to divide their time between reading, coding, admin, teaching, etc. And sometimes, when one needs to write a paper before a deadline, authors guiltily admit that they don’t read any new papers for possibly months at a time while they prepare for their submission.
To a large extent, scientific knowledge is disseminated in one main format: scientific papers. More recently, public online repositories that allow citation, like ArXiv, have become a widely adopted method to rapidly publish scientific content (see Yann LeCun’s tweet). Papers still hold primacy in terms of how we transfer knowledge within science, a paradigm that is yet to meaningfully shift. These papers get accepted to journals and conferences, or just get popular from social media alone. At present, ArXiv is where the majority of the biggest papers within AI surface well before peer review.
In one sentence, our goal is to:
Automatically model and distil knowledge within AI
This goal is large, vague and perfect for the kind of work we would like to accomplish over the next few years. It, of course, includes the work we have done manually in our previous two survey publications: A Year in Computer Vision and Multi-Modal Methods. While writing these publications, we scrambled to add the best and most recent state-of-the-art (SOTA) papers within these sub-fields, until we realized how futile that was.
The looming monster of AI progress was unrelenting in its push forward as we desperately tried to digest, quantify and write about its adventures. For AI Distillery, however, the aim is to extend our approach and tackle research from another angle: we’d draw your keen eye to the word automatically.
Maybe it’s time to apply AI to AI, and automate the curation and summarisation of knowledge in the field? We know there are many wonderful resources dedicated to AI research, for instance distill.pub, but the compilation, editing and creative process of such resources is very time consuming. Is there another way to create insights near-passively?
The field of Network Science is quite dedicated to studying and finding relationships within large citation networks. Arxiv-sanity, one of our biggest inspirations, greatly helps people find the papers they are looking for and recommends papers they might like. That’s a check for search-ability and automation.
But we’re interested in the meta-research game — what can our research itself say about AI research? Where’s all this university, startup and industry fervour headed? What fields are collaborating most? What’s hot right now, and what’ll be hot soon research-wise?
We don’t know yet, but follow along and maybe we’ll find out together.
The problem from an information retrieval (IR) perspective
Different situations require different methods for retrieving information. Conducting exploratory search is difficult in standard IR systems, as terminology might differ even in closely related fields (network analyses vs graph neural networks). How do you find similar phrases without knowing what you’re searching for? How do you find papers related to your new idea in the forest of GAN papers?
Modern natural language processing has yielded tools to conduct these types of exploratory search; we just need to apply them to data from valuable sources, such as ArXiv. In doing so, we aim to supply the most relevant, meaningful information as quickly and as accurately as possible. This way, researchers and practitioners would be relieved of cumbersome “query-engineering” when looking for the information they need in the large pool of papers.
Crafting a dataset
As a starting point for our lofty goal, we used the arxiv-sanity code base (created by Andrej Karpathy) to collect ~50,000 papers from the ArXiv API. These papers were released from 2014 onwards and fall within the fields of cs.[CV|CL|LG|AI|NE] or stat.ML. Kudos to both of these systems, as such incredible open-source resources bring us to a point in which anyone can access this knowledge. However, at least one little externality has arisen as a result:
How do we find what we need if there are so many [goddamn] papers?
Well, perhaps there’s a way to visualise papers, old and new, in the context of the research around them. That is, not just the sub-field itself, but the various nestings which it inhabits. Exploration becomes easier, and discovery and navigation are aided significantly, by first knowing where in the space of papers and knowledge you are located and what is around you.
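To make the collection step a little more concrete, here is a minimal sketch of how a query against the ArXiv API for those categories could be put together. It is not the arxiv-sanity code itself: the function names, the paging defaults and the assumption that each paper is a dict with a "published" ISO timestamp are ours, for illustration only. Nothing is actually downloaded here.

```python
from urllib.parse import urlencode

# Public ArXiv API endpoint and the categories named in the text.
ARXIV_API = "http://export.arxiv.org/api/query"
CATEGORIES = ["cs.CV", "cs.CL", "cs.LG", "cs.AI", "cs.NE", "stat.ML"]


def build_query_url(categories=CATEGORIES, start=0, max_results=100):
    """Build one paged query URL covering all given categories (hypothetical helper)."""
    search_query = " OR ".join(f"cat:{c}" for c in categories)
    params = {
        "search_query": search_query,
        "sortBy": "lastUpdatedDate",
        "start": start,              # offset of the first result in this page
        "max_results": max_results,  # page size (an assumed default)
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def released_since(papers, first_year=2014):
    """Keep papers whose 'published' ISO timestamp falls in first_year or later.

    Assumes each paper is a dict with a 'published' string such as
    '2014-01-01T00:00:00Z'; that shape is an assumption of this sketch.
    """
    return [p for p in papers if int(p["published"][:4]) >= first_year]
```

Paging through results would then be a matter of calling build_query_url with increasing start values and parsing each Atom response, which we leave out here.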
Purifying the textual corpus
The ~50,000 papers were converted to plain text using pdf2text. We removed stopwords (e.g. “a”, “the”, “of”) and tokens that appear fewer than a threshold number of times (e.g. 5 or 30, depending on the method). Common bigrams (“deep_learning”) and trigrams (“convolutional_neural_networks”) are what we’d like to learn embeddings for, but naively creating n-grams leads to a combinatorial explosion.
Put simply, we would like to avoid learning embeddings for bi-grams like “and_the” and “this_paper”, of which there are thousands. Even more simply, they provide no value in the context of AI research; they represent the vernacular of papers generally.
Instead, we manually defined the important set of concepts from the larger set of most common n-grams — “recurrent neural networks”, “support vector machine”, etc. As a first approach, we find these concepts in the text and replace them with concept tokens (convolutional_neural_networks, support_vector_machine).
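The cleaning steps above can be sketched roughly as follows. This is a toy illustration rather than our actual pipeline: the stopword set, the concept list, the regex tokenizer and the default threshold are all made-up stand-ins, and the real lists are far longer.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-ins for the real stopword and concept lists.
STOPWORDS = {"a", "the", "of", "and", "this", "we", "to", "for", "in", "is"}
CONCEPTS = ["convolutional neural networks", "support vector machine", "deep learning"]


def replace_concepts(text, concepts=CONCEPTS):
    """Replace each known multi-word concept with one underscore-joined token."""
    text = text.lower()
    # Longest phrases first, so a longer concept is not split by a shorter one.
    for phrase in sorted(concepts, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text


def tokenize(text):
    """Split into lowercase tokens, keeping underscores inside concept tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def clean_corpus(documents, min_count=2):
    """Apply concept replacement, drop stopwords, then drop rare tokens."""
    tokenized = [
        [tok for tok in tokenize(replace_concepts(doc)) if tok not in STOPWORDS]
        for doc in documents
    ]
    # Count tokens over the whole corpus, then filter by the frequency threshold.
    counts = Counter(tok for doc in tokenized for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in tokenized]
```

Doing the concept replacement before tokenizing is what keeps “and_the”-style bigrams out: only phrases on the hand-picked list ever become a single token.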
AI Distillery: a web-app for exploring AI research
We created a web-app, available via ai-distillery.io, which is where we will show the majority of our results, tools, widgets, insights, charts and more. The web-app lets anyone explore some of the models we trained on the datasets we collected: explore related concepts, find similar papers, or get an overview of trends and track their progress over time. In total there are currently 6 pages available, and we plan to update this a lot over the coming months. These are:
Paper Search: similar in functionality to arxiv-sanity-preserver but we use the Whoosh search library for more flexibility and scalability. Throw a query and find the most relevant papers to this query.
Word Embedding Proximity: find semantically similar words, e.g. “CNN” is close to “convnet” and “RNN” is close to “LSTM”
Paper Embedding Proximity: find similar papers, e.g. the “AlexNet” paper might be close to the “GoogLeNet” paper or, more generally, papers within the same field will tend to be closer than papers from separate fields.
Word Embedding Visualization: 2D t-SNE chart showing which words are close to each other in the embedding space, using two word embedding methods: word2vec and fastText
Paper Embedding Visualization: another t-SNE chart, this time visualising the paper embedding space itself, with our two chosen embedding methods being LSA and doc2vec.
Charts and Additional Insights: Charts and insights we find interesting and that we created along our journey e.g. top authors, top papers, number of papers released per month, etc.
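To give a feel for what “proximity” means on the embedding pages, here is a small sketch of a nearest-neighbour lookup. It assumes cosine similarity as the closeness measure, which is a common choice but an assumption on our part here, and the 3-dimensional vectors are invented for illustration; real embeddings come from a trained model and have many more dimensions.

```python
import math

# Toy vectors, made up for this example only.
EMBEDDINGS = {
    "cnn": [0.9, 0.1, 0.0],
    "convnet": [0.8, 0.2, 0.1],
    "rnn": [0.1, 0.9, 0.1],
    "lstm": [0.2, 0.8, 0.2],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, embeddings=EMBEDDINGS, top_k=2):
    """Return the top_k (item, similarity) pairs closest to the query item."""
    target = embeddings[query]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != query  # an item is trivially closest to itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```

With these toy vectors, nearest("cnn") ranks “convnet” first and nearest("rnn") ranks “lstm” first. The same lookup works for papers once each paper has its own vector.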
We use our best weapons to tame the beast of AI progress, namely Flask, ReactJS, D3.js, ChartJS and Whoosh. We had a fun journey swapping from Heroku (too little RAM) to Google Compute Engine (too expensive for too little RAM), before finally hosting the current version of the app with Hetzner.
We began AI Distillery with two “paper embedding” methods, Latent Semantic Analysis (LSA) and doc2vec, and two word embedding algorithms, word2vec and fastText. In our next instalment we’ll walk readers through these embeddings, as well as each of the pages we’ve created. For now, feel free to explore the site (ai-distillery.io). You can find our experiment code at the AI Distillery GitHub repo, where we used frameworks like gensim, sklearn and spacy to do some of the above.
As always, thanks for taking the time to read our work. And please like, clap for and share MTank’s work with anyone you think might like it. Your support keeps us all motivated to try new things and contribute our two cents to the AI community. So, in this case, don’t hold your applause if you like what we’re doing!
If you would like to collaborate with us in our wild journey of making AI progress more transparent or have any comments regarding any part of our research or web-app, we’re open to suggestions so feel free to reach out in the comment section or by email (email@example.com). Keep an eye out for Part 2 of this series which is coming soon and the beginning of the new blog series we mentioned in our vision blog (From Cups to Consciousness).