Is Massively Unsupervised Language Learning Dangerous?

Photo by Cristofer Jeschke on Unsplash

Ilya Sutskever, who leads OpenAI research, has a simple theory about how to reach AGI. He argues that all that is needed is massively large deep learning models. Sutskever has publicly remarked that there is a good chance that AGI is achievable in five years. I wrote previously that generative or organic programming leads to systems that achieve competence without comprehension. Under this assumption, Sutskever’s method, in which brute-force computation leads to innovation, is an approach that one cannot dismiss.

Yann LeCun has a well-known analogy of intelligence and a cake:

If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.

So the prevailing consensus is that “we still don’t know how to make the cake”. This may not even be true, because understanding unsupervised learning may require only a change of perspective and not any real technological change. In my 2019 predictions for Deep Learning, I wrote that unsupervised learning was indeed solved. Like many extremely hard problems, a solution can be found with a simple change of perspective:

Unsupervised learning isn’t as valuable as one may think. I found this perspective in a very insightful set of essays about reinforcement learning by Ben Recht.

But in the chart above, what makes a method more valuable? And what makes a method more difficult? A method becomes more valuable if it can lead to action. From this perspective, unsupervised learning is at the bottom of the food chain in value, in that a comprehension-free reorganization of data contributes little to action that requires comprehension. The difficulty dimension refers to sample efficiency. Unsupervised learning is extremely sample efficient and therefore not difficult. In contrast, reinforcement learning requires interaction with virtual worlds (or, even worse, physical worlds) to achieve learning. Current deep learning methods are data hungry, and this is detrimental to progress in reinforcement learning.

Supervised learning in the chart above is a solved problem; Deep Learning has solved it. The major flaw of Deep Learning, however, is its requirement for massive amounts of data. The brute-force approach is to use humans to collect and curate massive amounts of data. The ideal situation is a more efficient supervised learning method, which, in a naive way, points back to unsupervised learning as a solution. If you don’t require human curators and mechanical, comprehension-free ones will do, then the economic hurdle is solved.

Natural Language Processing (NLP) practice has employed what are known as neural embeddings to train neural networks. Neural embeddings are language models (or world representations) that are created through unsupervised learning. The earliest kind is Word2Vec, which was invented at Google in 2013. Alternative language models or neural embeddings have been introduced over the years. Examples include GloVe, Sense2Vec, Doc2Vec, Category2Vec, Sentence2Vec, Anything2Vec, Shape2Vec, DocTag2Vec, HyperEdge2Vec and Concept2Vec. I’m absolutely sure I’ve missed many others.
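To make the "no human curators" point concrete, here is a minimal sketch of how a skip-gram style method in the spirit of Word2Vec can manufacture its own training pairs from raw text. The function name `skipgram_pairs`, the toy sentence and the window size are my own illustrative assumptions, and a real implementation adds much more (subsampling, negative sampling and the actual network).

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for each token and its neighbours.

    The "labels" (context words) come straight from the text itself,
    which is why no human annotation is needed.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy corpus, purely for illustration.
sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(len(pairs))   # 10
print(pairs[:3])    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Every pair is a free training example: the network is asked to predict the context word from the center word, and the embedding falls out as a side effect.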

In 2018, there was an explosion of better language models that significantly improved NLP in deep learning. Sebastian Ruder wrote in mid-2018 that NLP’s ImageNet moment had arrived. ImageNet refers to the image recognition benchmark that was a catalyst for the explosion of deep learning in image processing. Ruder was pointing out that the same explosion is happening in NLP. In 2018, new methods arrived such as ELMo (from AllenAI), ULMFiT (Ruder and FastAI), BERT (from Google) and GPT (from OpenAI). Many of these methods are a consequence of a new kind of neural network architecture that was introduced in a very provocatively titled paper, “Attention is All You Need”, in 2017. This architecture, inconspicuously named the “Transformer”, belongs to a new class dubbed relational neural networks.

A few days ago (February 2019), OpenAI unveiled a massive new neural network christened GPT-2. This massive network is in line with Ilya Sutskever’s hypothesis that bigger is better. GPT-2 is based on the Transformer network model, and the largest version of the model has 1.5 billion parameters and 48 layers. By comparison, the BERT model (October 2018) had 340 million parameters. GPT-2 is more than four times larger, and this is just four months later.

To illustrate the capability of GPT-2, here is what it wrote (without comprehension):

This is a very impressive ‘autocomplete’ capability! GPT-2 has extremely high competence in English text generation, but to be clear, it has no comprehension of what it generates. It’s analogous to the GAN-based generators of highly realistic faces. These are systems with high competence in their narrow tasks but without comprehension.
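To give a feel for what ‘autocomplete’ without comprehension means, here is a deliberately tiny sketch. It is not how GPT-2 works (GPT-2 is a Transformer with 1.5 billion parameters); I am swapping in a bigram counter, the simplest next-word predictor I can think of. The corpus and the function names `train_bigrams` and `autocomplete` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def autocomplete(counts, prompt, n_words=5):
    """Greedily append the most frequent follower of the last word."""
    out = prompt.split()
    for _ in range(n_words):
        followers = counts.get(out[-1])
        if not followers:      # the last word was never followed by anything
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


# Hypothetical toy corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat sat on the sofa"
model = train_bigrams(corpus)
print(autocomplete(model, "the", n_words=4))   # the cat sat on the
```

The output is fluent-looking only because it echoes the statistics of its corpus. The program has no idea what a cat is. GPT-2 is incomparably more capable, but the argument here is that the same gap between competence and comprehension remains.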

Here’s the kicker, though: OpenAI has decided that the 1.5 billion parameter model is too dangerous to release into the wild and instead released a smaller model that is less than one-tenth the size (117 million parameters). OpenAI left the following statement:

Today, malicious actors — some of which are political in nature — have already begun to target the shared online commons, using things like “robotic tools, fake accounts and dedicated teams to troll individuals with hateful commentary or smears that make them afraid to speak, or difficult to be heard or believed”. We should consider how research into the generation of synthetic images, videos, audio, and text may further combine to unlock new as-yet-unanticipated capabilities for these actors, and should seek to create better technical and non-technical countermeasures.

The question, though, is this: if this is the state of the art in February 2019, what will the state of the art be four months from now? Will we see models with 7.5 billion parameters? How competent is a 7.5 billion parameter network?
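As a rough sanity check on these numbers, here is a back-of-the-envelope sketch. The parameter counts are the ones quoted in this article; the variable names are mine, and repeating the same growth factor is only a naive extrapolation, not a forecast.

```python
# Parameter counts as quoted in this article.
GPT2_LARGE = 1_500_000_000   # largest GPT-2 (February 2019)
GPT2_SMALL = 117_000_000     # the model OpenAI actually released
BERT_PARAMS = 340_000_000    # BERT (October 2018)

# How much bigger is GPT-2 than BERT?
growth = GPT2_LARGE / BERT_PARAMS

# What fraction of the full model was released?
released_fraction = GPT2_SMALL / GPT2_LARGE

# Naive extrapolation: apply the same growth factor once more.
same_growth_again = GPT2_LARGE * growth

# Growth factor implied by a 7.5 billion parameter model.
implied_growth = 7_500_000_000 / GPT2_LARGE

print(f"GPT-2 vs BERT:        {growth:.1f}x")                          # 4.4x
print(f"Released fraction:    {released_fraction:.1%}")                # 7.8%
print(f"Same growth again:    {same_growth_again / 1e9:.1f} billion")  # 6.6 billion
print(f"7.5B implies growth:  {implied_growth:.1f}x")                  # 5.0x
```

So a 7.5 billion parameter model would need roughly a fivefold jump, a little faster than the jump from BERT to GPT-2.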

What is the real danger? The problem is that human cognitive bias leads us to assume that agents that have shown competence in one domain are competent in an entirely different domain. We do this all the time: we elect leaders who inherited their wealth (a bad proxy for financial competence), assuming that the same competence will be valuable for governance. If automation constructs a massive number of statements (without grammatical errors or typos), then a human reader might attribute comprehension to that agent even if it has none. An agent that is capable of generating text (without comprehension) can leverage a strategy of information overload to overwhelm a human population’s cognitive defenses. Is that not what is already happening today?

I worry that information warfare is asymmetric. It is always easier to pollute and increase divisiveness than it is to enrich and increase unity.

P.S. If you see errors in English, that’s because I have yet to incorporate GPT-2 in my workflow. ;-)
