For the third part of our ACL in Review series, we will be recapping the second day of the ACL 2019 conference (for reference, here are the first part and second part of this series). This section of the conference is titled “Machine Learning 3” and contains several interesting new ideas in different areas of study, three of them united by the common theme of interpretability for models based on deep neural networks and distributed representations. I will provide ACL Anthology links for all papers discussed, and all images in this post are taken from the corresponding papers unless otherwise specified.
For this part, a major recurring theme is interpretability: how do we understand what the model is “thinking”? In NLP, attention-based models have a kind of built-in interpretability in the form of attention weights, but we will see that it is not the end of the story. Also, this time there was no doubt which paper is the highlight of the section: we will see a brand new variation of the Transformer model, Transformer-XL, which can handle much longer context than previous iterations. But we will leave this to the very end, as we are again proceeding in the order in which papers were presented on ACL. Let’s begin!
FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms
Throughout machine learning, but especially in NLP, it can be hard to differentiate between the latest models: the results are close, you have to vary both random seeds in the model and train/test splits in the data, and the variance may be quite large. And even after that you still have a very hard decision. Here is a striking real life example from an NLP model with two different character representation approaches:
Which representation should you choose? And how many experiments is it going to take for you to make the correct decision?
To overcome this difficulty, Henry Moore et al. (ACL Anthology) propose to use a well-known technique from the field of reinforcement learning: multi-armed bandits. They are widely used for precisely this task: making noisy comparison choices under uncertainty while wasting as few evaluations as possible. They consider two settings: fixed budget, where you want to make the best possible decision under a given computational budget (given the number of evaluations of various models), and fixed confidence, where you want to achieve a given confidence level for your choices as fast as possible. The authors compare classical algorithms for multi-armed bandits based on Thompson sampling and indeed show that the result becomes better.
To me, using multi-armed bandits for comparison between a pool of models looks like a very natural and straightforward idea that is long overdue to become the industry standard; I hope it will become one in the near future.
Is Attention Interpretable?
Interpretation is an important and difficult problem in deep learning: why did the model make the decision it made? In essence, a neural network is a black box, and any kind of interpretation is often a difficult problem in itself. In this work, Sofia Serrano and Noah A. Smith (ACL Anthology) discuss how to interpret attention in NLP models, where it has been a core mechanism for many recent advances (Transformer, BERT, GPT… more about this below). Attention is very tempting to use as a source of interpretation, as it provides a clear numerical estimate of “how important” a given input is, and this estimate often matches our expectations and intuition. If you look in the papers that use attention-based models, you will often see plots and heatmaps of attention weights that paint a very clear and plausible picture.
But, as the authors show in this work, things are not quite that simple. There are a number of intuitions that may go wrong. For example, we may think that higher-weight representations should be more responsible for the final decision than lower-weight representations. But that’s simply not always true: even the simplest classifier may have different thresholds for different input features. The authors present several tricks designed to overcome this and other similar problems and to test for real interpretability. As a result, they find that attention does not necessarily correspond to importance: even the highest attention weights often do not correspond to the most important sets of inputs, and the set of inputs you have to flip to get a different decision may be too large to be useful for a meaningful interpretation.
The conclusion is unsurprising: with attention, as with mostly everything else, you should proceed with caution. But, to be honest, I don’t really expect this paper to change the way people feel about attention weights: they are still very tempting to interpret, and people still will. A little bit more carefully, maybe.
Correlating Neural and Symbolic Representations of Language
In a way, we continue the theme of interpretability with this paper. How can we understand neural representations of language? One way is to use features such as word embeddings, sentence embeddings, and the like to train diagnostic models, simple classifiers that recognize information of interest (say, parts of speech) and then analyze how these classifiers work. This is well and good when the information of interest is simple, such as a part-of-speech tag, but what if we have a more complex structure such as a parse tree?
Grzegorz Chrupala and Afra Alishahi (ACL Anthology) present a work where they propose to better understand neural representations of language with the help of Representation Similarity Analysis (RSA) and tree kernels. The idea of RSA is to find correspondences between elements of two different data domains by correlating the similarities between them in these two domains, with no known mapping between the domains. Here is an explanatory illustration from a very cool paper where Kriegeskorte et al. used RSA to find relations between multi-channel measures of neural activity, which is about as unstructured a source of data as you can imagine:
RSA can be applied given a similarity/distance metric within spaces A and B, and when there is no need for a metric between A and B. This makes it suitable for our situation, where A is, e.g., the space of real-valued vectors and B is the set of trees.
To validate this approach, the authors use a simple synthetic language with well-understood syntax and semantics that it is possible to fully learn by an LSTM, namely the language of arithmetic expressions with parentheses. To measure the similarity between trees, they use so-called tree kernels, where similarity is based on the amount of common substructures within the trees. As a result, they get a reasonably well-correlated correspondence between representation spaces and parse trees. In general, it seems that this work brings to NLP and interpretation of neural networks a new tool, RSA, and this tool may prove to be very promising in the future.
Interpretable Neural Predictions with Differentiable Binary Variables
Here again, we continue our investigations into interpretability. Joost Bastings et al. (ACL Anthology) ask the following question: can we make text classifiers more interpretable by asking them to provide a rationale for their decisions? By a rationale they mean highlighting the most relevant parts of the document, a short yet sufficient part that defines the classification result. Here is a sample rationale for a beer review:
Having this rationale, you can, e.g., manually validate the quality of the classifier much faster than you would if you had to read the whole text.
Formally speaking, there is a Bernoulli parameter for every word that corresponds to whether it gets included into the rationale, and the classifier itself works on the masked input after sampling these Bernoulli variables. Sampling is a non-differentiable operation, however, and a classical way to get around this would be to use the reparametrization trick: get samples separately, transform them into Bernoulli variables, and then the gradient can flow through the reparametrization part. To achieve this, the authors use the stretch-and-rectify trick: starting from the Kumaraswamy distribution, which is very similar to the Beta distribution, first stretch it to include 0 and 1 in the support and then rectify by passing it through a hard sigmoid. The transition penalties are still non-differentiable but now admit a Lagrangian relaxation; I will not go into the mathematical details and refer to the paper for all the details.
Suffice it to say, the resulting model does work, can produce rationales for short texts such as reviews with the sentiment classifier on top, and the results look quite interpretable. They also show how to use this as a hard attention mechanism that can replace standard attention in NLP models. This results in a very small drop of accuracy (~1%) but very sparse attention weights (~8.5% nonzero attention weights), which is also very good for interpretability (in particular, most of the critiques from the paper we discussed above do not apply now).
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
And finally, the moment we have all been waiting for. In fact, Transformer-XL is old news: it first appeared on arXiv on January 9, 2019; but the submission and review process takes time, so the paper by Zhang Dai et al. (ACL Anthology) is only just appearing as an official ACL publication. Let’s dig into some details.
As is evident from the title, Transformer-XL is a new variation of the Transformer (Vaswani et al., 2017), the self-attention-based model that has (pardon the pun) completely transformed language modeling and NLP in general. Basically, Transformer does language modeling by the usual decomposition of the probability of a sequence into conditional probabilities of the next token given previous ones. The magic happens in how exactly Transformer models these conditional probabilities, but this will remain essentially the same in Transformer-XL so we won’t go into the details now (someday I really have to write a NeuroNugget on Transformer-based models…).
What’s important now is that in the basic Transformer, each token’s attention is restricted so that it does not look “into the future”, and Transformer also reads the segments (say, sentences) one by one with no relation between segments. Like this:
This means that tokens at the beginning of every segment do not have sufficient context, and the entire context is anyway limited to segment length, which is capped at the available memory. Moreover, each new segment requires to recompute the model from the very beginning; basically we start over on every segment.
The key ideas of Transformer-XL designed to solve these problems are:
- use segment-level recurrence, where the model caches and reuses hidden states from the last batch; now, each segment can capture the context in the features from previous segments; this extra long context still does not fit into memory, so we cannot propagate gradients back to previous segments, but this is still much better than nothing:
- moreover, this idea means that on the evaluation phase, where we do not have to backpropagate the gradients, we are free to use extra-long contexts and do not have to recompute everything from scratch! This speeds things up significantly; here is how it works at the evaluation phase:
There is a new problem now, however. Transformer relies on positional encodings to capture the positions of words in the segment. This stops working in Transformer-XL because now the extra long context relies on words from previous segments, so their positional encodings would be the same, and chaos would ensue.
The solution of Dai et al. is quite interesting: let’s encode distances on edges rather than absolute positions! When the token attends to its immediate predecessor, we add an embedding of distance 0, when it attends to the previous token we add the “distance 1” embedding, and so on.
This allows for much longer contexts, with 4.5x more tokens in the context than the regular Transformer and 80% longer contexts than state of the art RNNs; at the same time, Transformer-XL is up to 1800x (!) times faster than vanilla Transformers. It significantly improves state of the art language modeling results in terms of perplexity, and it is able to generate coherent texts with thousands of tokens.
Finally, let’s see a sample text from the Transformer-XL arXiv paper. First, the seed; Dai et al. initialize Transformer-XL with a long context to provide a lot of information and let it shine by using the whole thing:
= Battle of Dürenstein =
The Battle of Dürenstein (also known as the Battle of <unk>, Battle of <unk> and Battle of <unk>; German: <unk> bei <unk> ), on 11 November 1805 was an engagement in the Napoleonic Wars during the War of the Third Coalition. Dürenstein (modern <unk>) is located in the <unk> Valley, on the River Danube, 73 kilometers (45 mi) upstream from Vienna, Austria. The river makes a crescent-shaped curve between <unk> and nearby Krems an der Donau and the battle was fought in the flood plain between the river and the mountains. At Dürenstein a combined force of Russian and Austrian troops trapped a French division commanded by Théodore Maxime Gazan. The French division was part of the newly created VIII Corps, the so-called Corps Mortier, under command of Édouard Mortier. In pursuing the Austrian retreat from Bavaria, Mortier had over-extended his three divisions along the north bank of the Danube. Mikhail <unk> Kutuzov, commander of the Coalition force, enticed Mortier to send Gazan’s division into a trap and French troops were caught in a valley between two Russian columns. They were rescued by the timely arrival of a second division, under command of Pierre Dupont de l ’Étang. The battle extended well into the night. Both sides claimed victory. The French lost more than a third of their participants, and Gazan’s division experienced over 40 percent losses. The Austrians and Russians also had heavy <unk> to 16 <unk> perhaps the most significant was the death in action of Johann Heinrich von Schmitt, one of Austria’s most capable chiefs of staff. The battle was fought three weeks after the Austrian capitulation at Ulm and three weeks before the Russo-Austrian defeat at the Battle of Austerlitz. After Austerlitz Austria withdrew from the war. The French demanded a high indemnity and Francis II abdicated as Holy Roman Emperor, releasing the German states from their allegiance to the Holy Roman Empire.
= = Background = =
In a series of conflicts from 1803–15 known as the Napoleonic Wars, various European powers formed five coalitions against the First French Empire. Like the wars sparked by the French Revolution (1789 ), these further revolutionized the formation, organization and training of European armies and led to an unprecedented militarization, mainly due to mass conscription. Under the leadership of Napoleon, French power rose quickly as the Grande Armée conquered most of Europe, and collapsed rapidly…
This is a real Wikipedia page, you can check it out here and see how it really continues. But here are the next 1000 tokens generated by Transformer-XL (yes, we do need to have such lengthy examples, the whole point is to see a very long context preserved for a very long time):
after the French conquest of Italy. This victory marked the beginning of the Second Coalition. Napoleon’s rapid advance caused Austria — Russia, Britain and Russia to make peace negotiations. The Russian army under Prince Mikhail Mikhailovich Mikhailovich Chaffee, commander of the Imperial Russian Army in Switzerland, was able to drive off the Austrians in the Battle of Stockach (1805) and to take Pressburg (modern Ö3 Austria) from the Austrians. At an early stage the Austro-Hungarian fleet had been damaged in a storm off Cape Matapan; this allowed the Allies to launch a full-scale invasion of Italy in February 1805. The Battle of Austerlitz was the decisive French victory against Napoleon and the largest naval battle in history, involving more modern-day European navies. The French military situation worsened as Napoleon faced several threats related to his newly formed Army of Europe, particularly Russia itself, which was now on the verge of collapse. The Russian general and friend of Napoleon, Alexander, had been dissatisfied with the conduct of the Austrians; he was still in conflict with Emperor Napoleon, the French Republic’s king, who had declared war on Austria during the summer of 1804. With the war over, Napoleon decided to launch a second invasion of Italy in October 1805.
= = Prelude = =
In July 1805, the French 1st Army entered southern Italy. The army, under the command of Marshal Marmont, were reinforced by a few battalions of infantry under Claude General Auguste de Marmont at the town of Philippsburg and another battalion at Belluno. On 17 September 1805, the army marched from Belluno towards Krems. By 29 September, they had reached Belluno and conducted its advance against a small Austrian force. By 31 September, the whole force had been reinforced by a brigade from the Army of Tyrol under the command of Pierre Augereau. The Austrians were now under the command of Marshal Jean Victor Marie Moreau, a member of the Directory. Moreau had taken command of the Austrian invasion force in the spring of 1805. His command included the VI Corps commanded by Jean Baptiste Drouet de Ney and the VI Corps commanded by Generals Jean Victor Marie Moreau and Joseph Souham. Ney’s corps consisted of the III. Corps and VI. Corps, which consisted of the III Corps and VI. Corps, located in the Austrian Netherlands, was commanded by Friedrich Joseph, Count Baillet de Latour. Moreau’s army consisted of six divisions and several associated brigades.
= = Aftermath = =
= = = First Coalition forces = = =
On 9 October 1805 the French Army of the Danube was attacked by an Austrian army under Archduke Charles at the Battle of Austerlitz. Although Charles and Charles had not had much time to regroup, on 10 October, he launched his attack on the Polish forces under Friedrich Joseph, Count of Lauenburg. After three days, Charles’ army captured Lauenburg. The French forces pursued the Austrians to the Silesian border, where they encountered strong Austrian resistance. These conflicts forced the Austrians to retreat into Tyrol and Austria agreed to a truce. The Austrian army, commanded by Wenzel Anton Karl, Count of Merveldt, was reduced to around 10,000 men. It was initially planned that Archduke Charles would launch a counter-attack against the French army on the same day, as Napoleon had hoped, but this was not carried out. On 25 October, Merveldt left Styria for Tyrol. On the same day, Austria launched its new offensive against the French at Ulm. Charles withdrew his army from the region for a third time at the Battle of Elchingen, under the overall command of the Austrian generals, Ferdinand and Friedrich Wilhelm of Jülich-Cleves-Berg. To prevent Archduke Charles from escaping from the battlefield, the commander of the Habsburg army, Archduke Charles, planned to occupy the fortress Linz; instead, he decided to force Franz von Hipper to surrender the city. However, as Charles moved to the south, Moreau arrived on the scene with additional soldiers — including the entire Imperial Guard — and defeated the Austrians at the Battle of Hohenlinden on 28 October. The loss of Linz resulted in Austria’s complete defeat at Hohenlinden. In the meantime, the French Army of Observation and Preparedness was reorganized into the Army of the Danube under Feldzeugmeister (Colonel-General) Friedrich Freiherr von Hotze. The army was composed of the I, IV, VI, VI, VII, VIII and IX Corps. With reinforcements from Italy and France, it formed new battalions, companies, and squadrons in the Austrian army. On 17 November 1804, at the Battle of Jena-Auerstadt the Army of Silesia and the Army of Silesia joined forces, but by the time that the after the disastrous invasion of Russia in 1812. Napoleon’s empire ultimately suffered complete military defeat in the 1813–14 campaigns, resulting in the restoration of the Bourbon monarchy in France. Although Napoleon made a spectacular return in 1815, known as the Hundred Days, his defeat at the Battle of Waterloo, the pursuit of his army and himself, his abdication and banishment to the Island of Saint Helena concluded the Napoleonic Wars.
= = Danube campaign = =
From 1803–06 the Third Coalition fought the First French Empire and its client states (see table at right ). Although several naval battles determined control of the seas, the outcome of the war was decided on the continent, predominantly in two major land operations in the Danube valley: the Ulm campaign in the upper Danube and the Vienna campaign, in the middle Danube valley. Political conflicts in Vienna delayed Austria’s entry into the Third Coalition until 1805. After hostilities of the War of the Second Coalition ended in 1801, Archduke <unk> emperor’s <unk> advantage of the subsequent years of peace to develop a military restructuring plan. He carefully put this plan into effect beginning in 1803–04, but implementation was incomplete in 1805 when Karl Mack, Lieutenant Field Marshal and Quartermaster-General of the Army, implemented his own restructuring. Mack bypassed Charles ’ methodical approach. Occurring in the field, Mack’s plan also undermined the overall command and organizational structure. Regardless, Mack sent an enthusiastic report to Vienna on the military’s readiness. Furthermore, after misreading Napoleon’s maneuvers in Württemberg, Mack also reported to Vienna on the weakness of French dispositions. His reports convinced the war party advising the emperor, Francis II, to enter the conflict against France, despite Charles ’ own advice to the contrary. Responding to the report and rampant anti-French fever in Vienna, Francis dismissed Charles from his post as generalissimo and appointed his <unk> brother-in-law, Archduke Ferdinand, as commander. The inexperienced Ferdinand was a poor choice of replacement for the capable Charles, having neither maturity nor aptitude for the assignment. Although Ferdinand retained nominal command, day-to-day decisions were placed in the hands of Mack, equally ill-suited for such an important assignment. When Mack was wounded early in the campaign, he was unable to take full charge of the army. Consequently, command further devolved to Lieutenant Field Marshal Karl Philipp, Prince of Schwarzenberg, an able cavalry officer but inexperienced in the command of such a large army.
= = = Road to Ulm = = =
The campaign in the upper Danube valley began in October, with several clashes in Swabia. Near the Bavarian town of Wertingen, 40 kilometers (25 mi) northwest of Augsburg, on 8 October the 1st Regiment of dragoons, part of Murat’s Reserve Cavalry Corps, and grenadiers of Lannes ’ V Corps surprised an Austrian force half its size. The Austrians were arrayed in a line and unable to form their defensive squares quickly enough to protect themselves from the 4,000 dragoons and 8,000 grenadiers. Nearly 3,000 Austrians were captured and over 400 were killed or wounded. A day later, at another small town, <unk> south of the Danube <unk> French 59th Regiment of the Line stormed a bridge over the Danube and, humiliatingly, chased two large Austrian columns toward Ulm. The campaign was not entirely bad news for Vienna. At Haslach, Johann von Klenau arranged his 25,000 infantry and cavalry in a prime defensive position and, on 11 October, the overly confident General of Division Pierre Dupont de l’Étang attacked Klenau’s force with fewer than 8,000 men. The French lost 1,500 men killed and wounded. Aside from taking the Imperial Eagles and <unk> of the 15th and 17th Dragoons, Klenau’s force also captured 900 men, 11 guns and 18 ammunition wagons. Klenau’s victory was a singular success. On 14 October Mack sent two columns out of Ulm in preparation for a breakout to the north: one under Johann Sigismund Riesch headed toward Elchingen to secure the bridge there, and the other under Franz von Werneck went north with most of the heavy artillery. Recognizing the opportunity, Marshal Michel Ney hurried the rest of his VI Corps forward to re-establish contact with Dupont, who was still north of the Danube. In a two-pronged attack Ney sent one division to the south of Elchingen on the right bank of the Danube. This division began the assault at Elchingen. At the same time another division crossed the river to the east and moved west against Riesch’s position. After clearing Austrian pickets from a bridge, the French attacked and captured a strategically located abbey at French approached Vienna, the Prussians had already surrendered. As the Austrians did not want to allow the war to continue, they decided to abandon their territories in the north and move their army to the north and west, cutting off Charles from Vienna. The Battle of Warsaw was fought on 23 November 1805 between the French Army of the Danube and the Austrian Army of Styria in the vicinity of Warsaw and Pressburg (modern Trnava, Slovakia). At that time Habsburg forces…
Cool, right? That’s light years beyond the classical “The meaning of life is the tradition of the ancient human reproduction”, and most sentences look extremely human-like… but it is still painfully clear that the model does not really understand what it’s talking about.
Transformer-XL is one of the newest state of the art language models, with an attention-based architecture that improves over but also closely follows the original Transformer. This line of models (also including BERT, GPT, and GPT-2) has resulted in unprecedented text generation quality (check out GROVER!), but it’s still clear that there is no true “understanding”, whatever that means. To have that, we need to imbue our models with common sense, that is, some kind of understanding of the world around us. I’ve seen some very smart people at ACL 2019 say that they are working on this problem exactly. Let’s hope they succeed, and see you next time for our next installment!
Chief Research Officer, Neuromation