Why Neural Networks are Good for (Language) Science
I’d like to know how humans learn language and use it to share information. I also want to build computational models that replicate this ability. So I am trying to be a language scientist and a language engineer. This blog is about how these two ambitions interact. In short, I will try to explain why the long-held view that they are not complementary (or are even incompatible) is wrong. Separating language science and language engineering is a mistake, for both language scientists and language engineers.
I think the emergence of artificial neural networks as a tool for processing and understanding massive data sets should be an important catalyst behind the greater integration of language science and engineering. That neural networks are good for engineering is incontrovertible. They have advanced the state-of-the-art in various tasks that require learning from large datasets, including computer vision, character and handwriting recognition and speech recognition. And the range of practical language processing tasks on which neural networks beat the alternatives is also growing quickly.
How can neural networks bring science and engineering closer together? For a long time they were considered plain bad for science because they are ‘black boxes’. In other words, they might be good at predicting some natural data, but because so few of the design details of the trained model were specified by the humans who designed it, we never really know how they make their predictions. Consequently, when we use them we can’t learn anything about the natural phenomenon that the data pertains to.
Given that neural networks are much better understood now than a few years ago, I’m sure the number of people who agree with this reasoning is dwindling. In particular, there are now numerous effective means for interrogating and visualising neural nets to better understand the knowledge they encode, and how this helps them perform their objective (here, here and here, for a start). However, for me, that’s not even the most important reason why the ‘neural nets are bad for science’ claim is wrong. The main reason neural nets are good for science is that they learn in a comparatively unbiased and objective way from raw data reflecting natural phenomena. Therefore they are resistant to (most of) the subjectivity associated with human data processing and analysis.
In the context of language science this is especially important; because language data is particularly human, it is very hard (if not impossible) for humans themselves to objectively analyse it.
This blog is in four parts. If you want to save your time and not read any more, then just finish this sentence: neural networks are good for science, particularly language science. If you want to know more of the gory details (including some of the most spectacular scientific car-crashes of the last 70 years), read on.
In the first part, I’ll explain why language science, perhaps more than any other science, needs some good news right now. The idea that it is worth studying language in a scientific way (with typical scientific objectives etc.) is actually not well established, and certainly not universally accepted. And even when people try to study language scientifically, ghosts from the past keep clouding their objectivity. So any help from new techniques, methods and researchers with fresh perspectives is great news for language science.
In the second part, I’ll show some unfortunate consequences of studying language in an unscientific way. Attempts to build technology that understands and processes language have been greatly influenced by unscientific research on language, even with no clear evidence that it helps. Who said science and technology go hand in hand?
In the third part, I’ll explain a bit about how neural networks have (relatively) recently been applied to large-scale language data. In many cases where there is a lot of data to learn from, these models are the best way to get a machine understanding language a bit like a human does. Maybe that shouldn’t be a surprise. One feature of neural nets is that they combine multiple layers of (objective) data processing and analysis before making decisions or predictions. It’s no great shock to me that a model whose structure is determined by an objective analysis of the data (i.e. scientifically) beats one whose structure isn’t.
Finally, in the fourth part, I’ll show how these advances in (language) technology can also be hugely beneficial for (language) science. By creating models whose objectives are to do things with language that humans find easy, and then allowing them to learn from lots of raw data (even if not as much as humans get), we can for the first time get amazing insights into the nature of language and how humans learn and use it.
I. Language and Science: an Awkward Relationship
A lot of research in linguistics in the second half of the twentieth century was characterised by symbols and mathematical-style notation. This notation, popularised by, among others, Noam Chomsky, made linguistics look like science. However, if a science is an academic field that follows the scientific method, then linguistics was not a science. Chomsky certainly didn’t believe in testing linguistic hypotheses with experiments. He thought introspection on the part of the researcher was sufficient, saying things like
linguistic work, at what I believe to be its best, lacks many of the features of the behavioral sciences
I have no doubt it would be possible to devise operational and experimental procedures that could replace the reliance on introspection with little loss, but it seems to me in the present state of the field, this would simply be a waste of time and energy.
Unfortunately for those interested in studying language scientifically, Chomsky was very influential. He convinced linguists to adopt new theories and pseudo-scientific notation, while simultaneously dissuading anyone from checking them empirically. This is like a populist leader coming to power and immediately pronouncing that the independent judiciary is actually a waste of time and money given the current state of the country.
Did nobody think this was a bit fishy? Well, my guess is that linguists loved the new-found respect (and funding) associated with having blackboards covered in maths-like notation (especially in a humanities faculty). It’s also very hard to prove people wrong if they don’t believe in scientific validation, which is ironic given Chomsky’s non-academic views on the dangers of established power structures.
Anyway, whatever the cause, linguists around the world began to adopt a Chomskian idea of language. For his part, Chomsky wrote a set of rules that attempted to determine which combinations of nouns, verbs, adjectives etc. constitute acceptable sentences in all of the world’s languages. These rules were quickly developed and extended by others. Chomsky in his writings often mentioned people like Newton or Einstein, which gave the impression that linguists writing the rules of grammar was like physicists discovering the laws of physics.
But there’s a big difference. The theories of Newton and Einstein were evaluated by how well they predicted the outcome of experiments. Newton’s laws could explain some of the observed data; later, Einstein’s could explain a little more. For Chomsky, introspection was enough.
As time passed (and scepticism grew), there were some attempts to validate Chomsky’s ideas with experiments. But Chomsky’s proposals refer to abstract linguistic notions that are not manifested in physical reality, so it is very hard to test them objectively.
For instance, to test a rule that a certain sequence of nouns, verbs, adjectives etc. is acceptable but that a different sequence is not, we probably want to take sentences that exemplify both cases and measure how people process or interpret them. But even finding sentences that exemplify the two sequences is a hugely subjective exercise. Categories like noun, verb etc. (parts-of-speech) are not pre-defined by a language authority. In fact, even after detailed instructions, English speaking university graduates schooled in western linguistic conventions still disagree on the tag for almost 10% of English words. What’s more, even assuming a satisfactory way of checking if a single given sentence did or did not conform to Chomsky’s predictions, nobody seemed at all inclined to calculate what proportion of all sentences in real language data conformed to Chomsky’s predictions (you know, like a statistic). Einstein would not be impressed.
Of course, there is some subjectivity in almost all science, and natural categories cannot always be ‘defined’. In a biological theory about hearts it might not be possible to define where a heart stops and the rest of an organism starts, or whether an organ in an organism is a heart or not. Any particular (subjective) classification of hearts might cause results to favour one theory over another. That said, over time, widespread experimentation smooths out the effects of this subjectivity. And the theories that ultimately prosper are those that explain the highest proportion of (everyone’s) empirical observations.
But forgetting all of that, to me it just doesn’t seem likely that a researcher (or even a team of researchers) could ‘discover’ a set of rules that describe the world’s languages. The number of external (cultural, geographical and inter-personal) and internal (~80 billion neurons) factors or variables that language depends on is enormous. I’d be similarly sceptical if someone told me he had discovered the rules of the world’s weather, by paying particular attention to it. Or if someone told me he had discovered a pen that never runs out.
We are a long way from being able to connect the causal interactions between masses of subatomic particles (in the cells of the brains of interacting humans) with effects at a higher level of analysis (like the behaviour of those humans). So it seems to me we should think of language, and cognitive sciences in general, as complex phenomena (like climate science or financial economics), rather than law-based systems (like those found in chemistry or physics). That’s not to say that the physical properties of the brain do not constrain these possible interactions and shape the nature of language. But the fact that the earth’s physical geography constrains the weather doesn’t make scheduling a wedding in the UK much less of a lottery.
II. Science and Mythology in Language Technology
Why am I harping on about Chomsky when other language tech researchers are talking about new stuff like twitter, or singularities? After all, most researchers in these fields acknowledge that he is sort-of old news. Nevertheless, Chomsky’s ideas still pervade both cognitive science and language technology. And it’s currently Summer 2015.
Some of this influence is undeniably positive. Chomsky promoted an approach to psychology sometimes called cognitivism. He argued that we will not understand how humans behave without trying to understand how the brain works. That might seem obvious, but at the time it was not obvious to many psychologists. And while Chomsky’s particular method of understanding how the brain works (by thinking for a while and then writing down how the brain works) doesn’t really meet the standards of science, the influence of his cognitivism shifted focus (and money) towards cognitive science, cognitive and computational neuroscience, AI and other research fields whose importance is now clear.
However, there is a downside. It is often said that engineers and those working in language technology are sceptical about the benefits of integrating linguistic research into their systems. Nevertheless, without considering the influence of Chomsky it’s hard for me to understand why so much of language technology research in the last 40 years has focused on parsing. Parsing is the enterprise of turning sentences (like the boy with red shorts kicked the ball and scored a goal) into tree-like structures like this one.
The task of mapping sentences to tree structures is a hard problem to define and a hard thing to do automatically. Thanks to a huge investment of effort we now have numerous automatic systems that are very good at it (at least for certain languages). However, this came at the cost of lots of time spent by lots of smart people. And many graduate students in linguistics had to take courses on how to construct such trees consistently to produce the training data needed to train models to do it automatically. That’s a big investment. But how did we know it was the right thing to do?
Let’s put this another way. Assume a language understanding system (like question answering, translation, search etc.) is any engine that takes language input and responds with some behaviour (see the diagram below). This behaviour might involve generating language (as with machine translation) or some other response (as with a voice-controlled phone system) — it’s not really important. On the way to the desired behaviour the system will probably have to transform or represent the input in some way other than simple plain text. That’s pretty uncontroversial. But what is the evidence that representing the input with a tree-like structure is necessary (or even helpful) for getting the desired behaviour?
Of course, if you ask around it is possible to find examples of language understanding systems (English to Chinese translation, for example) where the state-of-the-art model does rely on such a representation. But that does not mean a better system could not be constructed without it. And, more tangibly, any systems (like English-to-Chinese translators) that need a parser necessarily rely on a lot of expensive annotation.
Tree-like representations for Chinese and English designed by trained annotators almost certainly encode connections and delimitations that correlate to some degree with the semantics of the sentences (as defined by neural processes or activations, say). So it is unsurprising that applying a parser might work better than doing nothing. But is this really the best use of expensive and highly educated humans if our goal is to build the best possible system? There are lots of other ways in which humans could label those sentences that might also help to improve the performance of systems. And, on the scientific side, does all the tree stuff have anything to do with what a human brain does when it solves the same problem?
Despite the influence of Chomsky on research in language technology, syntax and other linguistic formalities don’t actually form part of the most widely used language technologies, like Google search or speech recognition, a fact sometimes bemoaned by linguists. It’s been suggested that this is because people working in technology don’t understand the subtleties of language and don’t see the importance of getting language completely right. But I don’t think that’s the reason at all.
If adding linguistics to a language understanding system actually worked (i.e. led to measurable improvements in how it performs), then tech investors and Buzzfeed journalists would be all over it, as would other engineers. The fact that it doesn’t shouldn’t be a surprise though. Linguistics would help language technology if it explained important factors of variation in language data. There is no reason to expect it to do so, however, if linguistic ‘facts’ have not been tested for how well (in terms of statistics) they reflect actual language.
III. Neural Networks for Language
The recent achievements of Deep Learning (the name given to neural networks trained on lots of data) have undeniably changed AI, and an increasing number of these improvements relate to language technology. But even trying to look beyond the hype, I still think deep learning is great news for both language science and language technology. To see why, it’s best to look at how neural networks (of varying depth) have been applied to some specific language engineering problems. If you don’t like too many details, just skim this section. And if you know a lot about deep learning already, it won’t harm to skip it altogether.
The first neural language model
Perhaps the first application of deep learning to large-scale language engineering came in 2001. Yoshua Bengio and collaborators applied a neural network to the established problem of language modelling. Language modelling is the task of assigning probabilities or likelihoods to sequences of words (or sentences). For example, in English the sequence of words it’s raining in London is likely to occur but the sequence it’s in sunny London is less likely. A model that knows whether sequences are likely or not can be used to improve any systems that produce language as output, like Google Translate or Cortana.
The obvious way to attack this problem is to simply check huge corpora of English text. If a sentence occurs many times, we conclude that its probability is high, and if it occurs fewer times its probability is lower. However, the statistics of language mean that, even with today’s massive corpora, many sentences that we are interested in might never occur in our dataset. So previous methods used to break sentences down into smaller chunks, or sequences of words, and see how often they occur in the corpus, before merging these statistics together to get an estimate for the whole sentence.
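To make the counting idea concrete, here is a minimal sketch that breaks sentences into pairs of adjacent words (bigrams), counts them, and multiplies the resulting estimates to score a whole sentence. The three-sentence corpus, the function names and the add-one smoothing are all assumptions made for illustration; real systems used far larger corpora and more careful smoothing.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count how often each word follows each other word in a tiny corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()   # <s> marks the sentence start
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])          # words that have a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab

def sentence_probability(sentence, contexts, bigrams, vocab, alpha=1.0):
    """Merge the chunk statistics: multiply smoothed bigram probabilities."""
    tokens = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        # add-alpha smoothing so unseen pairs get a small non-zero probability
        prob *= (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * len(vocab))
    return prob

# Hypothetical toy corpus, invented for this example.
corpus = [
    "it's raining in London",
    "it's sunny in London",
    "it's raining in Paris",
]
contexts, bigrams, vocab = train_bigram_counts(corpus)

# Neither sentence appears in the corpus, but the first is built from seen chunks.
likely = sentence_probability("it's sunny in Paris", contexts, bigrams, vocab)
unlikely = sentence_probability("it's in sunny London", contexts, bigrams, vocab)
print(likely > unlikely)
```

Even though neither test sentence occurs in the corpus, the one made of familiar chunks gets the higher score, which is the whole point of breaking sentences down.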
Bengio designed a model that doesn’t explicitly count how often sequences of words occur in a corpus, but instead attacks the problem much more like a human would (intuitively at least). If I gave you the word sequence the pig rolled over in the… you could make a decent guess at what the next word is likely to be. But it seems unlikely to me that you would do it by thinking ‘well I’ve read a lot of books and whenever I see the words over in the together the next word was puddle 7 times and mud 9 times’. I don’t know how the brain works, but it is certain to be very inefficient to store an explicit record of these sorts of statistics for all the bazillions of possible different word sequences in a language. That’s more of a fact about physics than brains.
When I try and solve a language modelling problem, my thought process seems to be more like ‘we’re on the subject of pigs, and one thing I know about pigs is that they like mud, and what they tend to do with mud is roll in it’. In other words, I seem to use the meaning of the word sequence as a whole, built up from the meaning of the individual words, to guess the next word. Bengio’s model does exactly the same.
The model starts with the aim of learning what every word in English ‘means’, so that these meanings can be used when guessing the next word. It will store each word as a vector (just a list) of numbers (e.g. [a1, a2 …] for the word pig in the diagram above), so that each word lives at a point in a multi-dimensional space.
When training begins, the model has no idea what any words mean, and the points for each word are scattered randomly all over the space. The model reads the first n words in the training corpus, and retrieves its vector for each of the words. These n vectors are then combined into a single vector [s1 s2 …] by a mathematical function, which is determined by another set of numbers called weights (w1, w2 etc. in the figure). Again, at first the model doesn’t know what the best function is; these weights (and therefore the function itself) are set at random and unlikely to be correct.
Once the model has the vector [s1 s2 …] representing the first n words, it uses this vector (and its current knowledge of the meaning of all words in the language) to make a guess at the next word. This guess is then checked against the actual next word in the training corpus. If the guess is correct, nothing happens: the model doesn’t learn anything from correct guesses. However, if the guess is wrong, the model updates its understanding of language in order to make a better guess the next time. Specifically, it updates its understanding of what the words like pig and rolled mean (the vectors [a1 a2 …], [b1 b2 …] etc.) and it also updates its understanding of how to combine word meanings to represent the whole phrase of n words (the weights [w1 w2 …]). I won’t go into the details of how this updating works, but if you are interested you could try looking here or here.
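The read, combine, guess and update loop can be sketched in a few lines. This is a deliberately simplified stand-in, not Bengio's actual architecture: it assumes one scalar weight per context position, a softmax over a seven-word vocabulary, and plain gradient descent on a single repeated example. Unlike the description above, a gradient step like this nudges the numbers a little even when the top guess is already right. All names and sizes are invented for illustration.

```python
import math
import random

random.seed(0)
vocab = ["the", "pig", "rolled", "over", "in", "mud", "puddle"]
index = {w: i for i, w in enumerate(vocab)}
dim, n = 4, 3  # vector length and context size (arbitrary choices)

def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

word_vecs = [rand_vec() for _ in vocab]   # random initial 'meaning' of each word
out_vecs = [rand_vec() for _ in vocab]    # vectors used to score candidate next words
weights = [random.uniform(0.5, 1.5) for _ in range(n)]  # random combination weights

def forward(context):
    """Combine the context vectors into s, then give every word a probability."""
    s = [sum(weights[i] * word_vecs[index[w]][k] for i, w in enumerate(context))
         for k in range(dim)]
    logits = [sum(o[k] * s[k] for k in range(dim)) for o in out_vecs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return s, [e / total for e in exps]

def train_step(context, target, lr=0.1):
    """Guess the next word, measure the error, and adjust vectors and weights."""
    s, probs = forward(context)
    t = index[target]
    d_logits = [p - (1.0 if v == t else 0.0) for v, p in enumerate(probs)]
    # All gradients are computed from the current values before anything changes.
    d_s = [sum(d_logits[v] * out_vecs[v][k] for v in range(len(vocab)))
           for k in range(dim)]
    d_w = [sum(word_vecs[index[w]][k] * d_s[k] for k in range(dim)) for w in context]
    for v in range(len(vocab)):
        for k in range(dim):
            out_vecs[v][k] -= lr * d_logits[v] * s[k]
    for i, w in enumerate(context):
        for k in range(dim):
            word_vecs[index[w]][k] -= lr * weights[i] * d_s[k]
    for i in range(len(context)):
        weights[i] -= lr * d_w[i]
    return -math.log(probs[t])  # error measured before this update

context, target = ["over", "in", "the"], "mud"
losses = [train_step(context, target) for _ in range(200)]
_, probs = forward(context)
guess = vocab[probs.index(max(probs))]
print(round(losses[0], 3), round(losses[-1], 3), guess)
```

A real model would of course train on millions of different contexts rather than one, which is what forces the word vectors to become useful in general.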
The model starts with an entirely random (and probably incorrect) idea of what each word in English means and of how to combine them, so its first guesses at the next word will be terrible. However, after repeating the updating procedure for many sentences it starts to show signs of really understanding language. First, it becomes very good at predicting the next word. For a fixed amount of English text to learn from, it predicts the next word better than the counting approach I sketched out above.
Second, the way the model understands words really matches how humans understand them. Remember, the model stores a vector for each word in English that encodes its meaning in a high-dimensional space. Here is a 2-d projection of where words appear in such a space after a neural language model has been trained.
If you check out the space in more detail here, you will see that classes of words people would naturally group together, such as pronouns, professions, countries or days of the week, are all clustered together by the model. In fact, the idea that these models learn word meanings that align with human semantic memory has been confirmed in numerous empirical tests. I think it’s fascinating that a model trained to get better at a simple language exercise (trying to guess the next word in a sentence) also learns to understand the meanings of individual words in a similar way to humans. Maybe that can give a clue about how children learn language so efficiently without anyone telling them explicitly what words mean — a question that has baffled philosophers for decades (no, centuries). After seeing such a model, it certainly seems unnecessary to conclude that all word meanings must be innate.
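One common way to check whether two words ended up close together in such a space is cosine similarity between their vectors. The sketch below uses hand-written three-dimensional vectors that are purely invented (a trained model would have hundreds of dimensions and learned values), so it only illustrates the measurement, not any real result.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "monday": [0.9, 0.1, 0.0],
    "tuesday": [0.8, 0.2, 0.1],
    "france": [0.1, 0.9, 0.1],
    "spain": [0.0, 0.8, 0.2],
    "doctor": [0.1, 0.1, 0.9],
}

def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in toy_vectors if w != word)
    return max(others, key=lambda w: cosine(toy_vectors[word], toy_vectors[w]))

print(nearest("monday"), nearest("france"))
```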
Neural machine translation
The model described above is almost 15 years old, and predicting the next word in a sentence may seem quite an esoteric objective. So let’s look at a new language task whose application should be clear to anyone — translating from one language to another.
Researchers have been trying to build systems to translate between languages since the 1950s. From the late 1980s, the most effective way for systems to translate a sentence was by using already-translated documents to look up the correct translation of that sentence. The availability of documents translated by humans is growing all the time, particularly thanks to international organisations like the EU or UN (and Google persuading people in Kyrgyzstan to translate more stuff). This approach of using these documents to look up translations of phrases and sentences is usually called (phrase-based) statistical machine translation.
This phrase-based lookup approach led to big improvements in machine translation and is still the basis for systems like BabelFish and Google Translate. Even so, if you click the ‘translate’ button on Facebook for a sentence like I can’t believe that moron photobombed my brother’s wedding, you will notice that the results aren’t always perfect. That’s because the sentence probably hasn’t been translated previously in any of the documents that the system has access to. So the system will try to break the sentence down into chunks that it has seen translated already.
For instance, it will probably have seen I can’t believe in many already-translated sentences in its data, so it can try to work out which words in those translations correspond to I can’t believe. Once it has done the same for all the different chunks in the sentence (e.g. [that moron], [photobombed], [my brother’s wedding]) it will try to stick the translations together in a way that seems natural to produce a whole translated sentence. (In fact, a language model like the one I described in the previous section can really improve this sticking-together.) However, at each stage of the process there are decisions to be made (what are the best chunks to break the sentence into? which parts of the translated sentence in the training data correspond to each of these chunks? etc.), and it is very hard for the system to get all of this right. The result can often be a frustratingly cryptic translation that is unlikely to impress the Laotian tour guide from your gap year.
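The chunk-and-stick procedure can be sketched as a lookup in a table of already-seen phrases. Everything here is a toy assumption: the phrase table is hand-written rather than extracted from translated documents, the segmentation is a simple greedy longest match, and no language model is used to smooth the joins. Real phrase-based systems weigh many competing segmentations and translations.

```python
# Hypothetical English-to-French phrase table, invented for this example.
phrase_table = {
    ("i", "can't", "believe"): "je ne peux pas croire",
    ("that", "moron"): "que cet idiot",
    ("photobombed",): "a photobombé",
    ("my", "brother's", "wedding"): "le mariage de mon frère",
}

def translate(sentence, table, max_len=3):
    """Greedily match the longest known chunk, then glue the translations."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        for size in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + size])
            if chunk in table:
                output.append(table[chunk])
                i += size
                break
        else:
            # No chunk starting here was ever seen: copy the word untranslated.
            output.append(words[i])
            i += 1
    return " ".join(output)

result = translate("I can't believe that moron photobombed my brother's wedding",
                   phrase_table)
print(result)
```

The glued-together output is only as good as the chunks and the joins between them, which is exactly where the decisions described above can go wrong.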
The neural network approach to translation, called neural machine translation, works differently. While there are various neural machine translation architectures, in each case, as with the (monolingual) neural language model, the idea is to get the model to learn an approximation of the meaning of words and how to combine them. However, for neural machine translation, the model must learn words and composition functions in two languages simultaneously. As before, when training begins the model has no idea about either language, and these word meanings and composition functions are just random.
The image shows a neural machine translation model based on recurrent neural networks. The bottom half of the model is called the encoder. Its job is to ‘read’ the input sentence (in English in this case) and summarise the important information in a single vector. The top half of the model is called the decoder. Its job is to decode the information in the vector into an output sentence (in French in this case) in a way that preserves all of the information in the vector and respects the rules of the output language.
To train a neural translation model you need similar training data to a statistical translation model; that is, a large corpus of sentences that have already been translated by humans (so let’s hope they don’t stop in Kyrgyzstan just yet). As with the neural language model, all words (in both languages) are represented as vectors in a multi-dimensional space. For a sentence and its correct translation in the training data, the model starts by taking the vector of the first word and applies a projection function to it (represented by the gold arrow) to compute a new vector, the first ‘hidden state’ of the encoder (hidden states are shown as blue boxes). A composition function (represented by the green arrow) then combines this hidden state with the projection of the next input word to compute the second hidden state. This process is repeated until the whole English sentence has been ‘read’ and a sentence-final ‘hidden state’ has been computed.
As before, the projection function and the composition function are not fixed in advance. They are determined by a set of weights that are initially random (the model knows nothing) and updated during training. If the model can learn good values for the weights in these functions and the values in the word vectors, it should be able to compute a final hidden state that encodes all of the important syntactic and semantic information in the input sentence.
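The encoder's reading loop might look roughly like this. It is a sketch under assumptions: the projection and composition functions are taken to be small random matrices followed by a tanh squashing step, the vectors have three dimensions, and nothing is trained, so the output is meaningless except as a demonstration of how any sentence is folded into one fixed-size vector.

```python
import math
import random

random.seed(1)
dim = 3  # size of word vectors and hidden states (arbitrary choice)

def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

def rand_matrix():
    return [rand_vec() for _ in range(dim)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Randomly initialised parameters: the untrained model 'knows nothing'.
english_vecs = {w: rand_vec() for w in ["the", "pig", "rolled", "over"]}
projection = rand_matrix()    # plays the role of the gold arrow
composition = rand_matrix()   # plays the role of the green arrow

def encode(words):
    """Read the sentence one word at a time, updating a single hidden state."""
    hidden = [0.0] * dim  # empty state, so the first step is just the projection
    for word in words:
        projected = matvec(projection, english_vecs[word])
        carried = matvec(composition, hidden)
        hidden = [math.tanh(p + c) for p, c in zip(projected, carried)]
    return hidden  # sentence-final hidden state

short = encode(["the", "pig"])
longer = encode(["the", "pig", "rolled", "over"])
print(len(short), len(longer))
```

Sentences of any length come out as vectors of the same size, and because each step depends on the previous state, word order affects the result.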
The decoder then uses the encoded sentence vector to generate a sentence with the same meaning but in French. Like the encoding, this process happens one word at a time. To do it well, the model must learn good representations of the meaning of all French words (stored as vectors again, at the top of the diagram). By considering the meaning vectors of all possible output words together with the encoded sentence vector, the decoder makes a prediction for the first word in the output (French) sentence. During training, the model knows the correct translation, so it compares its own prediction with the correct first word, noting if it makes a mistake. The decoder then updates its own hidden state via a composition function (brown arrow) and by projecting the correct first word (blue arrow). Based on this updated hidden state it makes a prediction for the second word in the output sentence. This process repeats until the output sentence is complete. Phew.
While encoding and decoding a sentence, the model has access to the correct translation so knows if it makes an error in its prediction for the next translated word. These errors are accumulated, and are important information for telling the model how to update both its weights and its word representations so that the errors will gradually become less common.
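Here is a rough sketch of the decoder's side of training: predict a word, compare it with the correct one, add the error to a running total, then feed the correct word back in before predicting the next. The assumptions are the same as before (random untrained matrices, tanh, three dimensions), the four-word French vocabulary and the stand-in sentence vector are invented, and the step that actually uses the accumulated error to adjust the weights is left out.

```python
import math
import random

random.seed(2)
dim = 3
french_vocab = ["le", "cochon", "roule", "</s>"]  # </s> marks the sentence end

def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(dim)]

def rand_matrix():
    return [rand_vec() for _ in range(dim)]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

french_vecs = {w: rand_vec() for w in french_vocab}
projection = rand_matrix()    # plays the role of the blue arrow
composition = rand_matrix()   # plays the role of the brown arrow

def predict(hidden):
    """Score every French word against the hidden state, then normalise."""
    scores = {w: sum(a * b for a, b in zip(vec, hidden))
              for w, vec in french_vecs.items()}
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

def decode_with_teacher(sentence_vector, correct_words):
    """Predict each word in turn, accumulating the error against the truth."""
    hidden, total_error, guesses = list(sentence_vector), 0.0, []
    for correct in correct_words:
        probs = predict(hidden)
        guesses.append(max(probs, key=probs.get))
        total_error += -math.log(probs[correct])  # small when the guess was good
        # Update the state using the correct word, not the model's own guess.
        carried = matvec(composition, hidden)
        projected = matvec(projection, french_vecs[correct])
        hidden = [math.tanh(c + p) for c, p in zip(carried, projected)]
    return guesses, total_error

sentence_vector = [0.2, -0.4, 0.1]  # stand-in for the encoder's final hidden state
guesses, total_error = decode_with_teacher(
    sentence_vector, ["le", "cochon", "roule", "</s>"])
print(guesses, round(total_error, 3))
```

In a full system the accumulated error would be passed backwards through both decoder and encoder to adjust every weight and word vector; this sketch stops at measuring it.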
Once the model has been trained (which can take several days), the encoder should know what each word in English means and how to combine these word meanings to build a representation of the meaning of entire sentences. And the decoder should know what each word in French means, and how to update its current knowledge based on one French word so it can best predict the next French word. More importantly, by applying the trained encoder and decoder to entirely new English sentences and sampling its predictions, the model can translate accurately into French. Although training the model takes a long time, using a trained model to get translations is really quick — you can try it here.
Other neural language models
In the last year or so, quite a few other applications of neural language models have emerged. For instance, if you give a neural net lots of images with accompanying captions to learn from, it can become very good at describing in words exactly what a computer sees in an image. In fact, you can now upload a photo and let a neural language model describe what it sees.
In a different project, KyungHyun Cho and I trained a recurrent neural language model on the information encoded in dictionaries (real dictionaries, like the Oxford English Dictionary). It’s easy to look up words in a dictionary by alphabetical order, but what should we do if we can describe the word we are looking for but can’t remember it? By training a neural language model on a dictionary, we can easily create a ‘reverse dictionary’, where users provide a definition and the model tells you what word best fits the definition. I’m not sure if I’m totally convinced, but people who try to sell commercial reverse dictionaries are adamant that they are really useful for professional writers who keep losing words on the tip of their tongue.
Interestingly, we found that after training on just six publicly-available dictionaries, our neural language model works about as well as a commercial reverse dictionary (not using neural language models) that has access to over 1000 published dictionaries. We also found another advantage of using neural language models for this application — they are flexible enough to transfer the knowledge encoded in dictionaries and encyclopedias to the task of answering general knowledge crosswords. When we train our model on dictionaries and some entries from Wikipedia, it answers general-knowledge crossword questions better than any commercial crossword-answering system that we could find. And it only took us a month or so to put everything together. Check out our demo and see if it can solve your crossword questions here (but don’t expect it to cope with cryptic clues).
IV. Why Neural Networks are Good for Science (as well as technology)
So it’s pretty clear that neural networks are important for current language technology and engineering. But, as I mentioned at the start, neural networks are sometimes considered bad for science. This is because they are ‘black boxes’. In other words, they don’t typically have a design or features that are specific to the type of data or real-world domain that they are being applied to, and so don’t tell us anything about the world, even if they work well. I think this criticism is completely wrong. The fact that they do not encode such domain-specific rules is precisely why neural networks are better for science than other approaches to computational modelling of natural phenomena.
Let’s take some examples. Traditional language understanding systems encoded hard rules like ‘an adjective cannot come immediately before a verb’, which were prescribed by language experts. Similarly, an approach to building a model for learning the weather might be to specify some latent variables, which, on the advice of geographers, are organised such that the chance of sunshine is bigger after a red sky the previous evening. Suppose also that these models can both predict new language (or weather) with 80% accuracy. Does this mean that the principles we encoded into the design of our model are what exists in the world (i.e. in weather, or language)? We have no idea whether there was some better model out there that could describe the data with 90% or 100% accuracy, and which relied on totally different hard rules and assumptions.
With a neural network, we don’t encode any hard principles. The model infers the important structures, properties and relationships directly from raw data, in a way that allows it to best achieve its objective. Suppose we built two very general neural networks in this way and trained them to predict language or weather data. Even if we observed that they both got only 80% accuracy (i.e. the same accuracy as the more structured models), we could still interrogate the neural networks to answer questions like how did they learn to achieve this accuracy?
Further, numerous methods are emerging that allow us to better understand how neural networks learn to do what they do. It’s certainly easier to work out what is going on in an artificial neural network than what is going on in a real one. When we interrogate an artificial network to see how it works, we learn about the strategies that a very simple computational learning machine uses to transform and combine observed data so as to better make the predictions it needs to. This in turn helps us to understand the most salient principles characterizing variation in the data, which is, after all, the object of science. And because all of the computations performed by the model are simple and generic, this insight is (relatively) objective and (relatively) unbiased by established wisdom. Given that traditions, paradigms and dogmas can exert great influence on the established wisdom at any particular point in time (especially when it comes to little-understood things like brains) this can only be good for science.
In fact, as algorithms develop and access to data grows, I think that a big part of science will be concerned with understanding (at different levels of explanation) the internal machinations of man-made models and machines that successfully analyse, model or interact with naturally occurring data.
But what about language?
So what does this all mean for language science and technology? My view is that neural language models will lead to a convergence of these two disciplines.
For the last 30–40 years, engineers working in language technology have focused on improving the performance of models on certain tasks. These tasks included things like parsing — translating sentences into the tree-like structures I discussed before, or word sense disambiguation — working out which sense of a word a particular usage of that word corresponds to. Although a system that could resolve such a task would have very few applications in itself, people worked on such systems because, according to classes and textbooks, they were necessary steps on the way to a complete language understanding system. (They are also, incidentally, tasks that untrained humans find difficult or impossible, and even trained humans don’t often agree on the correct answers.) However, now that neural language models generally outperform systems that have these modular building blocks, I think language engineers should be more interested in understanding the diversity of statistical patterns, trends and variation that characterise the world’s language data. There should be more demand in the world of language technology for those who are experienced at dealing with language data, and who know how it can be used to construct learning objectives and training scenarios from which models can learn human-like linguistic behaviours.
So knowing about language can benefit engineering, but what can neural language models really teach us about language? I’ve already explained why I think a neural network approach to modelling is good for science in general. When it comes to understanding language, one of the key benefits of neural models is that they are typically trained to do something a human language-learner can generally do (like guessing the next word, or describing an image), or at least to do something a human will find immediately useful (like translation). In the former case, we can consider neural language models to be computational models of learning and cognition. As with any cognitive model, they can therefore provide insight into the nature of the objective and how humans might resolve the analogous real-world problem. In the latter case, while the network might not reflect the way in which a human tackles the problem, experiments with neural language models can still teach us about the nature of language data and canonical linguistic problems (such as the mapping between languages).
Let’s take machine translation as an example. Translation isn’t something that every language user can do, so a model of translation isn’t really a canonical cognitive model. Nevertheless, to my mind at least, neural machine translation models attack the problem of translation a bit more like a human would than statistical translation models (at least if the human hadn’t been taught any special technique). That is, they read the sentence in the source language, represent what it means and then figure out the best way to convey that meaning in the target language. And by understanding more about this process in the model, we can learn about the languages in question and get some insights into how humans might learn to translate between them.
For instance, by interrogating trained neural machine translation models, we can see that a crucial step for learning to translate seems to be establishing an accurate sense of which concepts in the world are similar to each other and which are different. We can do other things with trained translation models too. On this website you can get a neural model to translate from English to French and at the same time check which English words are most important for the model when it generates its guess for each of the French words. Such output could help us to understand better the diverse strategies employed by languages to encode information.
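As a rough sketch of what ‘checking which English words are most important’ can look like, the Python below computes attention-style weights over source words for a single output step. I am assuming simple dot-product scoring followed by a softmax, which may differ from what any particular translation system does, and the vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(decoder_state, encoder_states):
    """Score each source position against the current decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    return softmax(scores)


def most_attended(source_words, weights):
    """Return the source word that receives the largest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return source_words[best]


# Hypothetical states: one vector per English word, one for the French
# word currently being generated (all values invented).
source_words = ["the", "black", "cat"]
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder_state = [0.1, 0.2, 2.0]

weights = attention_weights(decoder_state, encoder_states)
focus = most_attended(source_words, weights)
```

Reading off one row of weights per generated word gives the kind of word-by-word importance display described above.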
Numerous other insights about the nature of language and language learning can also be gleaned from neural language models. We have already seen how learning rich semantic representations of concepts seems to go hand-in-hand with predicting which word is coming next in a sentence. For instance, analysis of trained neural language models suggests that, given a good enough understanding of the meanings of individual words in English, the order of words is comparatively unimportant for computing accurate representations of phrases or sentences. Of course, this won’t be true for all sentences and all language tasks (in our experiments, it doesn’t seem to be true for translation). Nevertheless, the extent to which it is true is surprising (at least to me) and suggests that the real heavy lifting in the linguistic brain may be done by a high-resolution conceptual system rather than a bespoke module for getting words in the right order.
On that note, and as explained in the first part of this blog, many linguists like to analyse the connections between different words in a sentence. Instead of connections like those in the syntax trees mentioned above, others believe in a more general type of connection between words, known as a dependency. For instance, in the sentence ‘green was the colour of his hat’, there might be a dependency between green and hat (because one is the colour of the other) and also one between his and hat (because one is the determiner of the other). The notion of dependency always seemed pretty vague and unscientific to me, but neural language models can help to make things a lot clearer.
Let’s suppose we are interested in the question of which words are ‘connected’ in a sentence, but we don’t want to rely on a textbook or annotation guidelines telling us which words are connected. After training a language model to do something with language, we can interrogate its weights to see how strongly it learns to connect different words. Of course, by doing this we can’t conclude that two words in a sentence are strongly connected, only that it helps to connect them in order to do X, where X is some function, like classifying sentiment, translating, predicting words or whatever. However, if the same two words are connected by various types of language model, that would suggest that it is useful to connect them when understanding language in general, and would thus be circumstantial evidence that humans make the same connection when they understand language.
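Here is a small Python sketch of that last step, under some loud assumptions: that we have already extracted, from each trained model, a matrix of connection strengths between word positions (from attention weights, gradients or similar), and that a fixed threshold is a sensible way to call a connection ‘strong’. The model names, the numbers and the threshold are all invented for illustration.

```python
def strong_pairs(strengths, threshold):
    """Return index pairs (i, j), i < j, whose connection meets the threshold."""
    n = len(strengths)
    pairs = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Treat connections as symmetric: take the stronger direction.
            if max(strengths[i][j], strengths[j][i]) >= threshold:
                pairs.add((i, j))
    return pairs


def shared_connections(strengths_by_model, threshold=0.5):
    """Keep only the pairs that every model connects strongly."""
    per_model = [strong_pairs(s, threshold) for s in strengths_by_model.values()]
    if not per_model:
        return set()
    return set.intersection(*per_model)


def toy_matrix(n, links):
    """Build an n-by-n matrix of zeros with the given (i, j) -> strength links."""
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), value in links.items():
        matrix[i][j] = value
    return matrix


words = "green was the colour of his hat".split()

# Hypothetical connection strengths from two models trained on different tasks.
strengths_by_model = {
    "sentiment": toy_matrix(7, {(0, 6): 0.8, (5, 6): 0.9, (1, 3): 0.6}),
    "translation": toy_matrix(7, {(6, 0): 0.7, (5, 6): 0.6, (2, 3): 0.9}),
}

shared = shared_connections(strengths_by_model)
shared_words = sorted((words[i], words[j]) for i, j in shared)
```

In this toy case only green–hat and his–hat survive, because each of the other connections shows up in just one model. With real models, how connection strength is measured and where the threshold is set would both need careful justification.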
Phew — that was a long one, so I’m grateful if you’ve got this far. I hope I’ve explained a bit about neural networks and how they are now being applied to interesting problems in language. I’ve also mentioned why I think the neural language model has the potential to be one of the most important developments in recent language science. Along the way, I’ve argued that language and science didn’t really have much to do with each other for most of the twentieth century. And the unhelpful influence of non-empirical linguistics on language technology remains acute to this day.
Having said that, I’m convinced that language people have a critical role to play in language technology. The insight and conclusions from a science of language that is actually scientific (i.e. one with experiments on large-scale, unbiased linguistic and behavioural data) might be essential to the development of systems that understand and use language like humans do. I hope language people like myself can contribute by analysing and interpreting both the output and the internal states of models that learn to process language, and relating this to child and adult language processing. We can also determine the best objectives and data for training these models so that their linguistic behaviour resembles that of humans as much as possible.
There is still some way to go, of course. In some recent papers, neural language models have been applied to the problem of parsing, that is, translating sentences into tree-like structures. These models learn rich representations of the meaning of words and the sentences in which they appear (representations of the sort that could be decoded to answer a question or produce a translation), but then use them to decide which of several possible trees is best. Given that humans do not understand trees (but they do understand answers, and translations), this seems to me to be completely missing the point of developing a language understanding system.
Many apparent facts about language are not facts at all, but simply the result of cultural, educational or intellectual traditions and paradigms. They can be so conventionalized, however, that even those working with language every day don’t notice. The criteria for ascribing ontological status to any cognitive or linguistic phenomenon must be empirical. And we must not forget that questions like ‘what is a word?’ or ‘what are the different types of word?’ would have quite different answers among speakers of languages like English and those from very different linguistic cultures.
Ironically, it was Chomsky’s criticism of BF Skinner’s behaviourism that began the cognitive revolution and led to the first neurally-inspired computational models of cognition. However, following Chomsky’s input, Skinner’s methodology was replaced by an approach that was equally limited. Although Chomsky was right that we will not fully understand human behaviour or language unless we try to understand the mind and brain, he was wrong in his refusal to accept the converse. It is equally improbable that we will ever understand the mind or the brain without considering the behaviours and functions for which those mechanisms were optimized. Thankfully, when it comes to understanding human linguistic behaviour, at least, the amount of available data is big and getting bigger all the time.
What do I know anyway?
I’m doing a PhD trying to get computers to understand language at Cambridge University. I speak a couple of languages quite well and a couple more quite badly. I find English the easiest. After studying pure maths, which definitely isn’t a language, for my undergrad and masters, and then failing to discover myself, I read some books by Steven Pinker, because he knew how the mind worked, and that got me pretty interested in language. Members of the philosophy faculty at UNAM convinced me that Pinker was totally right, before Ricardo Maldonado, who was one of Ronald Langacker’s students, convinced me he was in fact totally wrong. Around this time I stood on a Yellow Pages and had my picture taken with George Lakoff, which you should be able to see below.
I then studied some psycholinguistics with nice people like John Williams and Napoleon Katsos who didn’t much like having arguments, and instead taught me to do experiments. During this time, I spent a lot of time agreeing with Christian Bentz, who does like an argument but pretends not to. After that, Douwe Kiela and Anna Korhonen taught me how to use a computer, which it turns out is borderline essential for computational linguistics. Then Yoshua Bengio made it clear I should use neural networks (I believe his exact words were “you should use neural networks”) before he and KyungHyun Cho taught me how. I started to think about why neural nets work so well, noticed the comparatively objective way in which they learn from data, then realised how much maths I had forgotten and then realised that luckily neural nets are mercifully simple (locally, at least).
Now, I try to work with as many different language-inclined people as possible, including some who I hope have an inkling how the brain works, like Brian Murphy and Niko Kriegeskorte, and some who may not, but are building cutting-edge language technology, like Antoine Bordes and Jason Weston.
Thanks to Douwe Kiela, Ivan Vulic, Christian Bentz and KyungHyun Cho for drinking enough beer with me to make writing this seem like a good idea.