The Too-small World of Artificial Intelligence
Overcrowded and overlooked parts of the AI world
I spent the last eight years inside the artificial intelligence (AI) community, working for different companies and in various roles. At DeepTrait, with our focus on AI applications for genetic analysis and engineering, I viewed the same AI community from a very different angle. Here, I relay my perceptions of AI from both the inside and the outside points of view.
The success of AlexNet at the ImageNet competition in 2012 foretold a rebirth of neural networks and the beginning of a new, exciting cycle within the field. I became involved with machine learning (ML) in 2011, right before the explosion of its popularity, and witnessed its growth over these years. ICML 2013, one of the top-tier ML conferences, was a quiet gathering in a hotel in Atlanta, Georgia, with a few hundred attendees. In 2018, the same conference was a giant event in Stockholm, Sweden, bringing together five thousand participants from all around the world. In December 2019, NeurIPS, the largest conference on the topic, brought together a staggering thirteen thousand AI researchers and engineers.
With the growth of funding and participation, ML research flourished. For historical reasons, virtually all AI papers are free and accessible on arXiv. Today, there are more than sixty thousand AI papers published there, with the numbers growing exponentially since 2012 (Fig. 2).
In 2013, a determined industry AI expert could be familiar with all publications in her subfield. In 2019, this would be impossible. Today, the vast majority of AI engineers in the industry rely on “best paper” and other shortlists.
Working in such a popular and rapidly growing field gives one the impression that AI is everywhere. If you need a neural network for object recognition — no problem, just take a look at the state of the art in image recognition and pick the architecture that fits your requirements. If you need something for sentiment analysis — same story, just go through the publications on this problem and choose the solution that works for your data, on your hardware, and with the required performance. Even when no pre-existing publication or relevant solution addresses your particular problem, the gap usually amounts to a "subproblem of a subproblem." For example, the standard augmentation techniques do not produce desirable results for your dataset. Or the architecture of your favorite neural network underperforms on the data you collected. Or the best-in-class word-embedding technique does not work well with the specific vocabulary of your task. And so on.
Over the years, the experience of solving these subproblems of subproblems creates the impression that all the big problems of AI have largely been solved. This impression is reinforced by the growing majority of published papers focused on an ever-decreasing scope.
Naturally, when we started DeepTrait to develop an AI system for genome analysis, we explored the existing literature. We figured that everything must have been explored in detail within deep learning, not to mention the various related problems of heterogeneous data analysis. Today, genome analysis is one of the most promising and vital areas of human research, and more than sixty thousand AI papers have been published over the lifetime of the field. There must be an extensive body of work to build upon, right?
Wrong. Access arXiv on Dec. 12, 2019, and search for "deep learning": you would find 22,140 papers. Now change the query to "deep learning genome," and you would find only 76, many of which did not address genomic data but merely mentioned genomes as potential, future, or relevant applications.
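Counts like these are easy to reproduce programmatically. A minimal sketch using arXiv's public Atom API is below; the endpoint and parameter names are the documented ones, but the exact counts will of course differ from the December 2019 snapshot. The Atom feed returned by this URL reports the total number of matching papers in its `opensearch:totalResults` element.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_count_url(query: str) -> str:
    """Build an arXiv API URL whose Atom response carries the total
    number of matching papers in <opensearch:totalResults>."""
    params = {
        "search_query": f'all:"{query}"',
        "max_results": 0,  # we only need the count, not the entries
    }
    return f"{ARXIV_API}?{urlencode(params)}"

print(arxiv_count_url("deep learning genome"))
```

Fetching that URL and parsing out `totalResults` is all it takes to track how attention to a subfield changes over time.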
After searching all other sources, including bioRxiv, for deep learning papers on genomics, we found slightly more than two hundred. The vast majority of them used outdated neural network architectures and training techniques. A substantial number used these tools improperly, for example, applying convolutional neural networks to heterogeneous data such as SNPs. This resulted in underperforming models, something any AI expert could have easily predicted. We found this to be a repeating pattern.
Those who used AI tools correctly did so to analyze small subsequences of the genome, such as promoters or protein-binding sites. Their input data ranged from one to twenty thousand nucleotides, at most. Nothing even close to the 135 million nucleotides of the Arabidopsis thaliana genome, the one we targeted in our first major test. There was nothing we could build on: no examples, no neural network architectures, and no training techniques for sequences of this size. Nothing at all! We had to start from scratch.
Where is everybody?
That made me wonder. Understanding the genome has enormous potential. High-throughput sequencing produces tons of data, and AI seems to be the obvious tool to make sense of it all. And still, genomics gets about 1% of AI research attention, as measured by the fraction of papers. Where are the remaining 99%? This is clearly an opportunity. If such a ripe opportunity could be overlooked, perhaps there are more.
I went back to arXiv to look for other potentially revolutionary AI applications. For example, modern astronomy generates vast amounts of highly varied data: imagery, radio frequencies, annotated celestial bodies for every tiniest fraction of the sky, and so on. And there are huge questions that could potentially change our understanding of the universe, such as "what is dark matter?", and of ourselves, such as Enrico Fermi's famous "where is everybody?". Harnessing the power of AI to solve these critical mysteries by probing the combined astronomical data on our universe should be an obvious idea, right?
Still, an arXiv query “deep learning dark matter” gives you 20 results today.
What next? How about material science? Modern reinforcement learning models can beat the best human players in Go and StarCraft 2. These models are so good that the victory of AlphaGo was featured in Nature, and Lee Sedol, one of the best Go players in the world, recently retired, saying that "AI cannot be defeated."
That should be inspiring, right? How about applying the same approach to material science? Humanity already knows quite a bit about physics and chemistry. We could build a simulator in which reinforcement learning could learn, on its own, how to create new materials such as graphene. These new materials could enable new plane and ship designs, space elevators, underwater stations, and possibly extraterrestrial human colonies. It should be an exciting problem to work on.
Yet, “deep learning crystal structure” gives 16 papers on arXiv.
The small world
It turns out that virtually all modern AI research and industrial applications center on a dozen technical problems in two subfields: computer vision and natural language processing (Fig. 3).
We can model the AI world as an inverted pyramid. Each lower level enables the level above it, shapes it, and in some sense defines it.
At the very bottom lies the deep, basic science and technology: the theoretical understanding of neural networks, optimization algorithms, and the statistical and probabilistic properties of these tools.
In the middle, there is the Technical problem level. Here lie the dozen technical subproblems I mentioned earlier. For computer vision, they are image recognition, image segmentation, and image generation; for NLP, parsing, text classification, machine translation, and question answering. The latter are well represented by the General Language Understanding Evaluation (GLUE) benchmark.
Most researchers and industry experts live at this level. Surely, not all of them are focused on the listed GLUE or vision tasks, and if you are one of the exceptions, you might rightfully disagree with me. However, as an insider, you can also very well estimate how few of us, living at this level, work on anything outside of this task list, its reformulations, or combinations.
The limits of the middle layer are circumscribed by the bottom layer of theoretical science. Any new idea arising at the bottom level, such as gradient descent, a memory cell, or a convolutional filter, enables a whole range of new movements at the Technical problem level.
Just as advances in theoretical science enable a whole range of technical expansion, solving a single technical problem enables a whole range of industrial applications at the top of the pyramid.
This model illustrates an essential limitation of the industry: while casting a product idea from the Technical problem level into an Industrial application is relatively straightforward, the reverse can easily turn out to be impossible. Think of the flow of applications as a series of one-way arrows. If all we have at the technical level is a dozen specific computer vision and natural language processing tools, many industrial applications will lie outside their reach. In fact, the vast majority do. A dedicated AI specialist who starts her journey with the need to design an Industrial application may hope to finish somewhere in the Technical problem layer, but could actually end up with something much broader and more exciting.
The descent into AI
The current state of the technical problems and industrialization practices makes carving the reverse path, from an application outside this funnel back to the existing technical-level tools, next to impossible. The existing toolbox is tailored to very specific applications in computer vision and NLP, and the more advanced the tool, the narrower its focus.
One example is the size of the data. In plant genomics, for instance, we start with the 135-million-"letter" genome of Arabidopsis thaliana. To put it on a scale: printed out as books, a single A. thaliana genome would take 150 volumes, and that is one data point. And that is just the beginning. The tomato genome would give you a 950-million-"letter" text, or 1,055 printed volumes; barley, 5.3 billion "letters" or 5,888 volumes; and wheat, 17 billion "letters" or 18,888 volumes. Current NLP does not work with anything even close to this size. All the modern deep learning tools for NLP, such as transformer-like networks, can only handle sequences up to thousands of elements long.
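The volume counts above follow from simple arithmetic. A quick sketch, assuming roughly 900,000 printed "letters" per volume (my own assumption about the page math, which is consistent with all four figures in the text):

```python
# Assumed capacity of one printed volume, in genome "letters".
LETTERS_PER_VOLUME = 900_000

genome_sizes = {  # approximate genome lengths in nucleotides
    "Arabidopsis thaliana": 135_000_000,
    "tomato": 950_000_000,
    "barley": 5_300_000_000,
    "wheat": 17_000_000_000,
}

for species, letters in genome_sizes.items():
    volumes = letters // LETTERS_PER_VOLUME  # whole volumes filled
    print(f"{species}: {letters:,} letters ≈ {volumes:,} volumes")
```

Running this reproduces the 150, 1,055, 5,888, and 18,888 volume figures, which makes the gap to the few-thousand-token limit of modern NLP models concrete.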
Another example is the nature of the data. A genome is made of four discrete nucleotides, symbolized by four "letters": A, C, T, and G. A nucleotide cannot be "slightly more T" or "slightly less T." In addition, changing a single T to, say, A can lead to an entirely different phenotype or a lethal condition. This limits the use of computer vision techniques developed for continuous data. The data size compounds the problem: a human genome represented as a square four-channel "image" would have a resolution of 54,772 by 54,772 "pixels," far exceeding anything that modern computer vision neural networks can process.
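The "pixel" figure is easy to verify: with a human genome of roughly three billion nucleotides (a commonly cited approximation), a square layout needs a side length of about the integer square root of three billion.

```python
import math

HUMAN_GENOME_LETTERS = 3_000_000_000  # approximate human genome length

# Largest integer n such that n * n <= 3 billion.
side = math.isqrt(HUMAN_GENOME_LETTERS)
print(side)  # 54772
```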
The nature and the size of genomic data eliminate all the state-of-the-art deep learning tools from our list. There are no existing neural network architectures or training practices we can borrow from the computer vision or NLP worlds that could solve our problem.
A quick overview suggests that astronomy, chemistry, and materials science are all data-rich applications with the same issue: they cannot use the existing AI toolsets from the very narrow set of computer vision and NLP solutions. There are several popular workarounds, such as transforming arbitrary data into an image, resizing it, and feeding it into computer vision tools, but they do not help much.
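For illustration, here is a toy sketch of that workaround in NumPy (an illustration of the pattern, not a recommendation): one-hot encode a nucleotide sequence into four channels and zero-pad it into a square "image" that a vision model could ingest.

```python
import math
import numpy as np

def sequence_to_square_image(seq: str) -> np.ndarray:
    """One-hot encode a DNA string into 4 channels, then zero-pad it
    into a (side, side, 4) array, i.e. a square four-channel "image"."""
    channels = "ACGT"
    onehot = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        onehot[i, channels.index(base)] = 1.0
    side = math.isqrt(len(seq) - 1) + 1  # smallest square side that fits
    padded = np.zeros((side * side, 4), dtype=np.float32)
    padded[: len(seq)] = onehot
    return padded.reshape(side, side, 4)

img = sequence_to_square_image("ACGTACGTAC")  # 10 letters -> 4x4 image
print(img.shape)  # (4, 4, 4)
```

The reshape imposes an artificial two-dimensional locality on what is really a one-dimensional sequence, which is part of why such workarounds rarely help.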
At this point, those who are persistent enough to seek a solution have no option but to descend to the deepest level of AI, the level of theory. This root of the AI ecosystem offers many findings on how deep neural networks work, how different architectures affect their behavior, how different activation functions interact with particular data distributions, and so on. In other words, here live the tools that allow you to create your own toolbox, applicable to the industrial application you care about.
This is a tough journey that requires time, deep expertise, dedication, and a bit of luck, but in the end, you will develop a brand-new Technical problem level in the AI ecosystem. Despite being built for a particular industrial application, this new toolset enables a whole range of others, just as solving image recognition opened the way for a wide variety of products and product prototypes, from radiology analysis to self-driving systems such as the Tesla Autopilot.
Working on technical problems for computer vision and NLP is a very secure, predictable, and safe path. There are lots of research groups, startups, and established companies working in these fields. The largest of them offer engineers a fortune straight out of college to join their AI force. Specializing in computer vision or NLP also guarantees you access to great tools: datasets, GPU technologies, and frameworks, along with tons of open-source repositories complete with samples, libraries, benchmarks, and other useful resources. They make our work much less arduous and much more productive. Perhaps this explains the clustering of AI talent in these two particular areas.
The quest for your own AI toolbox for astronomy, genetics, chemistry, material science, geoscience, or economics, on the other hand, is a challenging, sometimes frustrating, lonely journey in which you can rely only on yourself and your team. However, the prize it promises is the entire field, large enough to build another billion-dollar company or a whole research institution.
There are hundreds and hundreds of vitally important and yet unresolved questions humanity faces right now. For many of them, brave pioneering researchers have already collected more data than they can analyze. These researchers have a narrow purpose: they collect the data and move on. The data are there, in open access, waiting for someone to make sense of them, sometimes years later. Many of those questions remain unanswered because they have proven impossible to solve explicitly. However, AI technologies are famous for exactly that: being able to learn how to solve explicitly unsolvable problems.
Away from crowded trails, there are entire worlds, overlooked by the AI community and waiting for their pioneers for decades. Unmapped and unexplored, they promise all their treasures to those who take this quest to its end.