Deep Learning Achievements Over the Past Year

Great developments in text, voice, and computer vision technologies

Eduard Tyantov
Cube Dev
19 min read · Dec 21, 2017

At Statsbot, we’re constantly reviewing the deep learning achievements to improve our models and product. Around Christmas time, our team decided to take stock of the recent achievements in deep learning over the past year (and a bit longer). We translated the article by a data scientist, Ed Tyantov, to tell you about the most significant developments that can affect our future.

1. Text

1.1. Google Neural Machine Translation

Almost a year ago, Google announced the launch of a new model for Google Translate. The company described in detail the network architecture — Recurrent Neural Network (RNN).

The key outcome: the gap with human translators in translation accuracy was narrowed by 55–85% (as rated by people on a 6-point scale). It is difficult to reproduce good results with this model without the huge dataset that Google has.

1.2. Negotiations. Will there be a deal?

You probably heard the silly news that Facebook turned off its chatbot, which went out of control and made up its own language. This chatbot was created by the company for negotiations. Its purpose is to conduct text negotiations with another agent and reach a deal on how to divide items (books, hats, etc.) between the two of them. Each agent has its own goal in the negotiations that the other does not know about. It’s impossible to leave the negotiations without a deal.

For training, they collected a dataset of human negotiations and trained a supervised recurrent network. Then they fine-tuned the agent with reinforcement learning by having it negotiate with itself, under one constraint: its language had to stay similar to human language.

The bot learned one of the real negotiation strategies: showing a fake interest in certain aspects of the deal, only to give up on them later and benefit on its real goals. It was the first attempt to create such an interactive bot, and it was quite successful.

Full story is in this article, and the code is publicly available.

Certainly, the news that the bot had allegedly invented a language was blown out of proportion. When training (in negotiations with the same agent), they disabled the constraint on the similarity of the text to human language, and the algorithm modified the language of interaction. Nothing unusual.

Over the past year, recurrent networks have been actively developed and used in many tasks and applications. The architecture of RNNs has become much more complicated, but in some areas similar results were achieved by simple feedforward networks (DSSM). For example, Google reached the same quality for its mail feature Smart Reply as it previously had with an LSTM. In addition, Yandex launched a new search engine based on such networks.

2. Voice

2.1. WaveNet: A generative model for raw audio

DeepMind researchers described audio generation in their article. Briefly, they built WaveNet, an autoregressive, fully convolutional model based on previous approaches to image generation (PixelRNN and PixelCNN).

The network was trained end-to-end: text for the input, audio for the output. The researchers got an excellent result: the gap with human speech was reduced by 50%.

The main disadvantage of the network is its low speed: because of the autoregression, sounds are generated sequentially, and it takes about 1–2 minutes to create one second of audio.
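To see why autoregression forces sequential generation, here is a minimal sketch. It is not WaveNet itself: the linear predictor, the `weights` array, and the noise scale are hypothetical stand-ins for the convolutional network. The only point is that a sample cannot be computed before the samples it depends on exist.

```python
import numpy as np

def generate(n_samples, weights, noise_scale=0.1, seed=0):
    """Toy autoregressive generator (illustrative; not the WaveNet architecture)."""
    rng = np.random.default_rng(seed)
    context = len(weights)
    audio = np.zeros(context + n_samples)  # leading zeros act as silent history
    for t in range(context, context + n_samples):
        # Each new sample is predicted from the previous `context` samples,
        # so this loop cannot be parallelized over time steps.
        prediction = np.dot(weights, audio[t - context:t])
        audio[t] = np.tanh(prediction + noise_scale * rng.standard_normal())
    return audio[context:]

samples = generate(16, weights=np.array([0.2, 0.3, 0.4]))
```

With real audio, this loop runs once per output sample, which is where the long generation time comes from.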

Look at… sorry, hear this example.

If you remove the dependence of the network on the input text and leave only the dependence on the previously generated phoneme, then the network will generate phonemes similar to the human language, but they will be meaningless.

Hear the example of the generated voice.

The same model can be applied not only to speech, but also, for example, to creating music. Imagine audio generated by the model after it was trained on a dataset of piano music (again without any dependence on the input data).

Read a full version of DeepMind research if you’re interested.

2.2. Lip reading

Lip reading is another deep learning achievement and victory over humans.

Google DeepMind, in collaboration with Oxford University, reported in the article “Lip Reading Sentences in the Wild” on how their model, which had been trained on a television dataset, was able to surpass a professional lip reader from the BBC.

There are 100,000 sentences with audio and video in the dataset. Model: LSTM on audio, and CNN + LSTM on video. These two state vectors are fed to the final LSTM, which generates the result (characters).

Different types of input data were used during training: audio, video, and audio + video. In other words, it is an “omnichannel” model.

2.3. Synthesizing Obama: synchronization of the lip movement from audio

The University of Washington has done serious work on generating the lip movements of former US President Obama. He was chosen because of the huge number of recordings of his speeches available online (17 hours of HD video).

The network alone was not enough, as it produced too many artifacts. Therefore, the authors of the article added several crutches (or tricks, if you like) to improve the texture and timings.

You can see that the results are amazing. Soon, you won’t be able to trust even a video of the president.

3. Computer vision

3.1. OCR: Google Maps and Street View

In their post and article, the Google Brain team reported on how they introduced a new OCR (Optical Character Recognition) engine into Google Maps, through which street signs and store signs are recognized.

In the process of developing the technology, the company compiled a new dataset, FSNS (French Street Name Signs), which contains many complex cases.

To recognize each sign, the network uses up to four photos of it. The features are extracted with the CNN, weighted with the help of spatial attention (pixel coordinates are taken into account), and the result is fed to the LSTM.

The same approach is applied to the task of recognizing store names on signboards (there can be a lot of “noise” data, and the network itself must “focus” in the right places). This algorithm was applied to 80 billion photos.

3.2. Visual reasoning

There is a type of task called visual reasoning, where a neural network is asked to answer a question using a photo. For example: “Is there a same size rubber thing in the picture as a yellow metal cylinder?” The question is truly nontrivial, and until recently, the problem was solved with an accuracy of only 68.5%.

And again, the breakthrough was achieved by the team from DeepMind: on the CLEVR dataset they reached a super-human accuracy of 95.5%.

The network architecture is very interesting:

  1. Using the pre-trained LSTM on the text question, we get the embedding of the question.
  2. Using the CNN (just four layers) with the picture, we get feature maps (features that characterize the picture).
  3. Next, we form pairwise combinations of coordinatewise slices on the feature maps (yellow, blue, red in the picture below), adding coordinates and text embedding to each of them.
  4. We drive all these triples through another network and sum up.
  5. The resulting representation is run through another feedforward network, which provides the answer on the softmax.
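Steps 3 to 5 above can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not DeepMind’s implementation: the layer sizes, the single-layer networks, and the helper `init_params` are hypothetical, and the CNN feature map and question embedding are replaced by random arrays.

```python
import numpy as np

def init_params(channels, question_dim, hidden, n_answers, seed=0):
    """Random weights for the two small networks (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    pair_dim = 2 * (channels + 2) + question_dim  # two objects with (y, x) + question
    return {
        "g_w": rng.standard_normal((pair_dim, hidden)) * 0.1,
        "g_b": np.zeros(hidden),
        "f_w": rng.standard_normal((hidden, n_answers)) * 0.1,
        "f_b": np.zeros(n_answers),
    }

def relation_network(feature_map, question, params):
    h, w, c = feature_map.shape
    # Step 3: every spatial cell becomes an "object" tagged with its coordinates.
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    objects = np.concatenate([feature_map, ys[..., None], xs[..., None]], axis=-1)
    objects = objects.reshape(h * w, c + 2)
    n = len(objects)
    left = np.repeat(objects, n, axis=0)   # object i, repeated for every j
    right = np.tile(objects, (n, 1))       # object j
    q = np.tile(question, (n * n, 1))      # question embedding attached to every pair
    triples = np.concatenate([left, right, q], axis=1)
    # Step 4: one shared network over all triples, then a sum.
    relations = np.maximum(triples @ params["g_w"] + params["g_b"], 0.0).sum(axis=0)
    # Step 5: a feedforward layer and a softmax over the possible answers.
    logits = relations @ params["f_w"] + params["f_b"]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(1)
answer_probs = relation_network(
    rng.standard_normal((3, 3, 4)),   # stand-in for CNN feature maps
    rng.standard_normal(5),           # stand-in for the LSTM question embedding
    init_params(channels=4, question_dim=5, hidden=8, n_answers=3),
)
```

Because the pairs are summed, the result does not depend on the order in which the objects are enumerated.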

3.3. Pix2Code

An interesting application of neural networks was created by the company Uizard: generating layout code from a screenshot made by an interface designer.

This is an extremely useful application of neural networks, which can make life easier when developing software. The authors claim that they reached 77% accuracy. However, this is still under research and there is no talk of real-world use yet.

There is no code or dataset in open source, but they promise to upload it.

3.4. SketchRNN: teaching a machine to draw

Perhaps you’ve seen Quick, Draw! from Google, where the goal is to draw sketches of various objects in 20 seconds. The corporation collected this dataset in order to teach the neural network to draw, as Google described in their blog and article.

The collected dataset consists of 70 thousand sketches, which eventually became publicly available. Sketches are not pictures, but detailed vector representations of drawings (the point at which the user pressed the “pencil,” where the line was drawn, where it was released, and so on).

Researchers have trained a Sequence-to-Sequence Variational Autoencoder (VAE) using RNNs as the encoding/decoding mechanism.

Eventually, as befits an autoencoder, the model produces a latent vector that characterizes the original picture.

Since the decoder can reconstruct a drawing from this vector, you can change the vector and get new sketches.

And even perform vector arithmetic to create a catpig:

3.5. GANs

One of the hottest topics in Deep Learning is Generative Adversarial Networks (GANs). Most often, this idea is used to work with images, so I will explain the concept using them.

The idea is in the competition of two networks — the generator and the discriminator. The first network creates a picture, and the second one tries to understand whether the picture is real or generated.

Schematically it looks like this:

During training, the generator produces an image from a random vector (noise) and feeds it to the discriminator, which says whether it is fake or not. The discriminator is also given real images from the dataset.

It is difficult to train such a construction, as it is hard to find the equilibrium point of the two networks. Most often the discriminator wins and the training stagnates. However, the advantage of the system is that we can solve problems in which it is difficult to define the loss function ourselves (for example, improving the quality of a photo): we delegate it to the discriminator.
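The competition between the two networks can be written down as two loss functions. The sketch below uses the standard binary cross-entropy formulation with the common non-saturating generator loss; it is a generic illustration, not code from any of the projects mentioned here, and the discriminator outputs are plain arrays of probabilities instead of a real network.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """The discriminator wants D(real) close to 1 and D(fake) close to 0."""
    return -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))

def generator_loss(d_fake, eps=1e-12):
    """The generator wants the discriminator to rate its fakes as real."""
    return -np.mean(np.log(d_fake + eps))

undecided = np.full(4, 0.5)      # discriminator cannot tell real from fake
confident = np.full(4, 0.01)     # discriminator is sure the fakes are fake
```

When the discriminator becomes very confident, its own loss gets small while the generator’s loss grows, which is one way to picture the “discriminator wins” situation.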

A classic example of the GAN training result is pictures of bedrooms or people.

Previously, we looked at autoencoding (Sketch-RNN), which encodes the original data into a latent representation. The same thing happens with the generator.

The idea of generating an image using a vector is clearly shown in this project in the example of faces. You can change the vector and see how the faces change.

The same arithmetic works over the latent space: “a man in glasses” minus “a man” plus a “woman” is equal to “a woman with glasses.”
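Here is a toy illustration of that arithmetic. The attribute directions and the 16-dimensional latent space are hypothetical, and each concept is built as an exact sum of directions; in a trained GAN the codes would come from the generator’s latent space and the result would only be approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical attribute directions in a 16-dimensional latent space.
man, woman, glasses = rng.standard_normal((3, 16))

latents = {
    "man": man,
    "woman": woman,
    "man with glasses": man + glasses,
    "woman with glasses": woman + glasses,
}

def nearest(vector, table):
    """Return the label whose latent code is closest to `vector`."""
    return min(table, key=lambda name: np.linalg.norm(table[name] - vector))

result = latents["man with glasses"] - latents["man"] + latents["woman"]
label = nearest(result, latents)
```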

3.6. Changing face age with GANs

If you add a controlled parameter to the latent vector during training, you can change that parameter at generation time and so control the desired attribute of the picture. This approach is called conditional GAN.
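One common way to add such a parameter is to append a one-hot condition to the noise vector before it goes into the generator. The sketch below shows only that input step; the age bins and the noise size are assumptions made for illustration, not values taken from any particular paper.

```python
import numpy as np

AGE_GROUPS = ["0-18", "19-29", "30-39", "40-49", "50-59", "60+"]  # hypothetical bins

def conditioned_input(noise, age_group):
    """Concatenate the noise vector with a one-hot encoding of the condition."""
    one_hot = np.zeros(len(AGE_GROUPS))
    one_hot[AGE_GROUPS.index(age_group)] = 1.0
    return np.concatenate([noise, one_hot])

noise = np.random.default_rng(0).standard_normal(100)
young = conditioned_input(noise, "19-29")
old = conditioned_input(noise, "60+")
```

Keeping the noise fixed and changing only the condition is what lets you ask for “the same face, but older.”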

This is what the authors of the article “Face Aging With Conditional Generative Adversarial Networks” did. Having trained the model on the IMDB dataset with known ages of actors, the researchers were able to change the age of a person’s face.

3.7. Professional photos

Google has found another interesting application of GANs: selecting and improving photos. The GAN was trained on a professional photo dataset: the generator tries to improve bad photos (professionally shot and then degraded with the help of special filters), and the discriminator tries to distinguish “improved” photos from real professional ones.

The trained algorithm went through Google Street View panoramas in search of the best composition and produced some pictures of professional and semi-professional quality (as rated by photographers).

3.8. Synthesization of an image from a text description

An impressive example of GANs is generating images using text.

The authors of this research suggest feeding the text embedding not only to the generator (conditional GAN), but also to the discriminator, so that it verifies that the text corresponds to the picture. To make sure the discriminator learned to perform its function, they additionally trained it on pairs of real pictures with incorrect text.

3.9. Pix2pix

One of the eye-catching articles of 2016 is, “Image-to-Image Translation with Conditional Adversarial Networks” by Berkeley AI Research (BAIR). Researchers solved the problem of image-to-image generation, when, for example, it was required to create a map using a satellite image, or realistic texture of the objects using their sketch.

Here is another example of the successful performance of conditional GANs. In this case, the condition is the whole picture. UNet, popular in image segmentation, was used as the architecture of the generator, and a new PatchGAN classifier was used as the discriminator to combat blurred images (the picture is cut into N patches, and the fake/real prediction is made for each of them separately).
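The patch idea can be sketched as follows. This is a simplification under stated assumptions: the real PatchGAN is a convolutional network whose outputs cover overlapping regions, while here the image is cut into non-overlapping tiles, and `toy_score` is a hypothetical stand-in for the discriminator.

```python
import numpy as np

def split_into_patches(image, patch):
    """Cut an (H, W) or (H, W, C) image into non-overlapping patch x patch tiles."""
    h, w = image.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    grid = image.reshape(rows, patch, cols, patch, -1).swapaxes(1, 2)
    return grid.reshape(rows * cols, patch, patch, -1)

def patch_predictions(image, patch, score_fn):
    """One real/fake probability per patch instead of one per image."""
    return np.array([score_fn(p) for p in split_into_patches(image, patch)])

def toy_score(p):
    # Hypothetical discriminator: flat (blurry) patches look fake.
    return 1.0 / (1.0 + np.exp(-(p.var() - 0.05) * 50))

image = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)  # sharp checkerboard
image[:2, :2] = 0.5                                         # one blurred corner
predictions = patch_predictions(image, 2, toy_score)
```

Only the blurred corner gets a low score, so the penalty lands exactly where the image lacks detail.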

Christopher Hesse made the nightmare cat demo, which attracted great interest from the users.

You can find the source code here.

3.10. CycleGAN

In order to apply Pix2Pix, you need a dataset with corresponding pairs of pictures from different domains. In the case of maps, for example, it is not a problem to assemble such a dataset. However, if you want to do something more complicated like “transfiguring” objects or styling, then pairs of objects cannot be found in principle.

Therefore, authors of Pix2Pix decided to develop their idea and came up with CycleGAN for transfer between different domains of images without specific pairs — “Unpaired Image-to-Image Translation.”

The idea is to train two generator-discriminator pairs to transfer an image from one domain to another and back, while requiring cycle consistency: after applying the two generators in sequence, we should get an image similar to the original (measured with an L1 loss). The cycle loss is required to ensure that the generator does not simply start producing pictures from the other domain that are completely unrelated to the original image.
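The cycle-consistency term itself is short. In the sketch below the two generators are passed in as plain functions (`g_xy` and `g_yx` are hypothetical names), and toy arithmetic mappings stand in for the real networks.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 distance between each image and its round trip through both generators."""
    forward = np.mean(np.abs(g_yx(g_xy(x)) - x))    # X -> Y -> X
    backward = np.mean(np.abs(g_xy(g_yx(y)) - y))   # Y -> X -> Y
    return forward + backward

x = np.linspace(0.0, 1.0, 8)
y = np.linspace(1.0, 3.0, 8)

# A consistent pair of toy "generators": each one inverts the other.
consistent = cycle_consistency_loss(x, y, lambda a: 2 * a + 1, lambda b: (b - 1) / 2)
# A generator that ignores its input can still produce valid-looking outputs,
# but the round trip no longer recovers the original.
unrelated = cycle_consistency_loss(x, y, lambda a: np.ones_like(a), lambda b: (b - 1) / 2)
```

In training, this term is added to the usual adversarial losses of the two discriminators.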

This approach allows you to learn the mapping of horses -> zebras.

Such transformations are unstable and often create unsuccessful options:

You can find the source code here.

3.11. Development of molecules in oncology

Machine learning is now coming to medicine. In addition to analyzing ultrasound and MRI scans and assisting with diagnosis, it can be used to find new drugs to fight cancer.

We already reported in detail about this research. Briefly, with the help of Adversarial Autoencoder (AAE), you can learn the latent representation of molecules and then use it to search for new ones. As a result, 69 molecules were found, half of which are used to fight cancer, and the others have serious potential.

3.12. Adversarial-attacks

Adversarial attacks are an actively explored topic. What are they? Standard networks trained, for example, on ImageNet are completely unstable when special noise is added to the classified picture. In the example below, we see that the picture with noise is practically unchanged to the human eye, but the model goes crazy and predicts a completely different class.

Such an attack is carried out with, for example, the Fast Gradient Sign Method (FGSM): having access to the parameters of the model, you can make one or several gradient steps towards the desired class and change the original picture.
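A single targeted gradient-sign step can be shown on a model small enough to differentiate by hand. The sketch below attacks a two-feature logistic regression instead of an ImageNet network; the weights, the input, and `epsilon` are made-up values chosen for illustration.

```python
import numpy as np

def predict(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def targeted_fgsm(x, target, w, b, epsilon):
    """One gradient-sign step that pushes the prediction towards `target`."""
    # Gradient of the cross-entropy loss for the target label w.r.t. the input.
    grad = (predict(x, w, b) - target) * w
    # Step against the gradient to lower the loss for the target class.
    return x - epsilon * np.sign(grad)

w, b = np.array([1.0, -1.0]), 0.0
x = np.array([0.2, 0.1])                      # classified as class 1
x_adv = targeted_fgsm(x, target=0.0, w=w, b=b, epsilon=0.3)
```

Every feature moves by at most `epsilon`, which is why the change can stay nearly invisible on an image with thousands of pixels.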

One of the tasks on Kaggle is related to this: the participants are encouraged to create universal attacks/defenses, which are all eventually run against each other to determine the best.

Why should we even investigate these attacks? First, if we want to protect our products, we can add noise to the captcha to prevent spammers from recognizing it automatically. Secondly, algorithms are more and more involved in our lives — face recognition systems and self-driving cars. In this case, attackers can use the shortcomings of the algorithms.

Here is an example of when special glasses allow you to deceive the face recognition system and “pass yourself off as another person.” So, we need to take possible attacks into account when teaching models.

Similar manipulations of signs also prevent them from being recognized correctly.

• A set of articles from the organizers of the contest.
• Ready-made libraries for attacks: cleverhans and foolbox.

4. Reinforcement learning

Reinforcement learning (RL) is also one of the most interesting and actively developing approaches in machine learning.

The essence of the approach is to learn the successful behavior of the agent in an environment that gives a reward through experience — just as people learn throughout their lives.

RL is actively used in games, robots, and system management (traffic, for example).

Of course, everyone has heard about AlphaGo’s victories over the best professional Go players. Researchers used RL for training: the bot played against itself to improve its strategies.

4.1. Reinforcement learning with unsupervised auxiliary tasks

In previous years, DeepMind used DQN to learn to play arcade games better than humans. Currently, algorithms are being taught to play more complex games like Doom.

Much attention is paid to speeding up learning, because gathering the agent’s experience of interacting with the environment requires many hours of training on modern GPUs.

In its blog, DeepMind reported that introducing additional losses (auxiliary tasks), such as predicting a frame change (pixel control) so that the agent better understands the consequences of its actions, significantly speeds up learning.

Learning results:

4.2. Learning robots

OpenAI has been actively studying how humans can train an agent in a virtual environment, which is safer for experiments than real life.

In one of the studies, the team showed that one-shot learning is possible: a person shows in VR how to perform a certain task, and one demonstration is enough for the algorithm to learn it and then reproduce it in real conditions.

If only it was so easy with people. :)

4.3. Learning on human preferences

Here is the work of OpenAI and DeepMind on the same topic. The bottom line is that an agent has a task, the algorithm shows a human two possible solutions, and the human indicates which one is better. The process is repeated iteratively, and with 900 bits of feedback (binary labels) from the person, the algorithm learned how to solve the problem.
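One way to turn such pairwise choices into a training signal is to fit a reward model so that the preferred solution gets the higher total reward. The sketch below uses a logistic (Bradley-Terry style) model over summed rewards; treat it as an illustration of the idea, with made-up reward values, not as the authors’ code.

```python
import numpy as np

def preference_probability(rewards_a, rewards_b):
    """Modelled probability that the human prefers solution A over solution B."""
    diff = np.sum(rewards_a) - np.sum(rewards_b)
    return 1.0 / (1.0 + np.exp(-diff))

def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the model's prediction and one bit of human feedback."""
    p = preference_probability(rewards_a, rewards_b)
    return -np.log(p if human_prefers_a else 1.0 - p)

# Predicted per-step rewards for two short clips of agent behavior (made up).
clip_a = np.array([0.5, 0.7, 0.9])
clip_b = np.array([0.1, 0.2, 0.1])
```

Each comparison contributes exactly one binary label, which is why the amount of feedback can be counted in bits.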

As always, the human must be careful and think about what they are teaching the machine. For example, the evaluator decided that the algorithm really wanted to take the object, but in fact, it just simulated this action.

4.4. Movement in complex environments

There is another study from DeepMind. To teach a robot complex behavior (walking, jumping, etc.), and even to make it similar to human behavior, you have to put a lot of effort into choosing a loss function that will encourage the desired behavior. However, it would be preferable for the algorithm to learn complex behavior by itself, learning from simple rewards.

Researchers managed to achieve this: they taught agents (body emulators) to perform complex actions by constructing a complex environment with obstacles and with a simple reward for progress in movement.

You can watch the impressive video with results. However, it’s much more fun to watch it with a superimposed sound!

Finally, here is a link to the RL algorithms recently published by OpenAI. Now you can use more advanced solutions than the standard DQN.

5. Other

5.1. Cooling the data center

In July 2017, Google reported that it took advantage of DeepMind’s development in machine learning to reduce the energy costs of its data center.

Based on the information from thousands of sensors in the data center, Google developers trained a neural network ensemble to predict PUE (Power Usage Effectiveness) and to manage the data center more efficiently. This is an impressive and significant example of the practical application of ML.

5.2. One model for all tasks

As you know, trained models transfer poorly from task to task, as each task requires its own specifically trained model. A small step towards the universality of models was made by Google Brain in its article “One Model To Learn Them All.”

Researchers have trained a model that performs eight tasks from different domains (text, speech, and images). For example, translation from different languages, text parsing, and image and sound recognition.

In order to achieve this, they built a complex network architecture with various blocks to process different input data and generate a result. The blocks for the encoder/decoder fall into three types: convolution, attention, and gated mixture of experts (MoE).
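Of the three block types, the gated mixture of experts is the easiest to show in isolation: a gating network picks a few experts for each input and mixes their outputs. The sketch below is a minimal top-k version with random weights; the sizes, the single-layer experts, and the function names are hypothetical, and it is not the tensor2tensor implementation.

```python
import numpy as np

def top_k_gates(logits, k):
    """Softmax over the k largest gate logits; all other experts get weight 0."""
    top = np.argsort(logits)[-k:]
    gates = np.zeros_like(logits, dtype=float)
    exp = np.exp(logits[top] - logits[top].max())
    gates[top] = exp / exp.sum()
    return gates

def mixture_of_experts(x, gate_w, expert_ws, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    gates = top_k_gates(x @ gate_w, k)
    out = np.zeros(expert_ws[0].shape[1])
    for i in np.flatnonzero(gates):
        # Only the selected experts are evaluated, which keeps the cost low.
        out += gates[i] * np.maximum(x @ expert_ws[i], 0.0)
    return out, gates

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
gate_w = rng.standard_normal((3, 4))                       # 4 experts
expert_ws = [rng.standard_normal((3, 5)) for _ in range(4)]
output, gates = mixture_of_experts(x, gate_w, expert_ws, k=2)
```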

Main results of learning:

  • Almost perfect models were obtained (the authors did not fine tune the hyperparameters).
  • There is a transfer of knowledge between different domains: on tasks with a lot of data, the performance stays almost the same, and on small problems (for example, parsing) it improves.
  • Blocks needed for different tasks do not interfere with each other and even sometimes help, for example, MoE — for the Imagenet task.

By the way, this model is present in tensor2tensor.

5.3. Learning on Imagenet in one hour

In their post, Facebook staff told us how their engineers were able to train the ResNet-50 model on ImageNet in just one hour. Truth be told, this required a cluster of 256 GPUs (Tesla P100).

They used Gloo and Caffe2 for distributed learning. To make the process effective, it was necessary to adapt the learning strategy with a huge batch (8192 elements): gradient averaging, warm-up phase, special learning rate, etc.
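A typical recipe for such huge batches is to scale the learning rate linearly with the batch size and to ramp it up gradually during the first epochs. The sketch below shows that schedule; the reference rate of 0.1 per 256 images and the five-epoch warm-up are assumptions for illustration, not values quoted in this article.

```python
def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256, warmup_epochs=5):
    """Linear scaling of the learning rate with batch size, plus a linear warm-up."""
    target = base_lr * batch_size / base_batch
    if epoch < warmup_epochs:
        # Start from the small-batch rate and grow linearly to the target.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target

schedule = [learning_rate(epoch, batch_size=8192) for epoch in range(8)]
```

The warm-up avoids taking very large steps while the freshly initialized network is still changing quickly.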

As a result, it was possible to achieve an efficiency of 90% when scaling from 8 to 256 GPUs. Now researchers from Facebook can experiment even faster, unlike mere mortals without such a cluster.

6. News

6.1. Self-driving cars

The self-driving car field is developing intensively, and the cars are being actively tested. Among relatively recent events, we can note Intel’s purchase of Mobileye, the scandal around Uber and Google technologies stolen by a former employee, the first death when using an autopilot, and much more.

I will note one thing: Google Waymo is launching a beta program. Google is a pioneer in this field, and it is assumed that its technology is very good because its cars have driven more than 3 million miles.

As to more recent events, self-driving cars have been allowed to travel across all US states.

6.2. Healthcare

As I said, modern ML is beginning to be introduced into medicine. For example, Google collaborates with a medical center to help with diagnosis.

DeepMind has even established a separate unit.

This year, under the Data Science Bowl program, a competition was held to predict lung cancer within a year on the basis of detailed images, with a prize fund of one million dollars.

6.3. Investments

Currently, there are heavy investments in ML, as there were before in big data.

China invested $150 billion in AI to become the world leader in the industry.

For comparison, Baidu Research employs 1,300 people, while FAIR (Facebook) employs 80. At the last KDD, Alibaba employees talked about their parameter server KungPeng, which runs on 100 billion samples with a trillion parameters, which “becomes a common task” ©.

You can draw your own conclusions: it’s never too late to study machine learning. One way or another, over time all developers will use machine learning, and it will become a common skill, just as the ability to work with databases is today.

Link to the original post.
