Deep learning guru Yoshua Bengio offers 26 lessons

Lei Feng Network: translated by Liu Xiangyu, a software development engineer focusing on machine learning, neural networks, and pattern recognition.

1. The need for distributed representations

Yoshua Bengio opened his lecture by saying, “This is the slide I focus on the most.” The slide is shown below:

Suppose you have a classifier that needs to classify people as male or female, wearing glasses or not, and tall or short. With a non-distributed representation, you are dealing with 2×2×2=8 classes of people, and to train an accurate classifier you need enough training data for each of the 8 classes. With a distributed representation, however, each property is captured in its own dimension. This means that even if the classifier has never encountered a tall man with glasses, it can still recognize him, because it has learned to recognize gender, glasses, and height separately from other samples.
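To make the counting argument concrete, here is a small illustrative sketch (the attribute names and encoding are made up for this example): a non-distributed encoding needs one class per combination of attributes, while a distributed encoding uses one dimension per attribute.

```python
import itertools

# Three binary attributes: gender, glasses, height.
attributes = ["male", "glasses", "tall"]

# Non-distributed: one class per combination -> 2*2*2 = 8 distinct classes,
# and the classifier needs training examples for every one of them.
joint_classes = list(itertools.product([0, 1], repeat=len(attributes)))
print(len(joint_classes))  # 8

# Distributed: one dimension per attribute -> a person is a 3-dimensional
# binary vector, and each dimension can be learned from separate examples.
def encode(male, glasses, tall):
    return [int(male), int(glasses), int(tall)]

# A "tall man with glasses" is representable even if this exact combination
# never appeared in training, as long as each attribute was seen on its own.
print(encode(male=True, glasses=True, tall=True))  # [1, 1, 1]
```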

2. Local minima are not a problem in high dimensions

Yoshua Bengio’s team found that when optimizing the parameters of a high-dimensional neural network, there are effectively no local minima. Instead, there are saddle points, which are local minima in some dimensions but not global minima. This means that training can slow down considerably at these points until the network figures out how to escape them, but as long as we are willing to wait long enough, it will always find a way out.

The figure below shows a network oscillating between two states during training: approaching a saddle point and then escaping it.

For a given dimension, there is a small probability p that a point is a local minimum, but not a global minimum, in that dimension. The probability that a point in a 1000-dimensional space is such a local minimum in every dimension is then p^1000, an astronomically small value. However, the probability that it is a local minimum in some of those dimensions is actually quite high. And when we hit a minimum in many dimensions at once, training can appear to be stuck until it finds the right direction.

In addition, the probability p increases as the loss function approaches the global minimum. This means that if we do find a true local minimum, it will be very close to the global minimum, and the difference will be irrelevant.

3. Derivatives, derivatives, derivatives

Leon Bottou presented some useful tables of activation functions, loss functions, and their corresponding derivatives. I am keeping them here for later reference.

Update: as pointed out in the comments, the min and max in the formula for the ramp function should be swapped.

4. Weight initialization strategy

The currently recommended strategy for initializing the weights of a neural network is to sample values uniformly from the range [-b, b], with b defined as follows:

This is recommended by Hugo Larochelle and was published by Glorot and Bengio (2010).
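The formula image does not survive in this text version; for reference, the normalized initialization from Glorot and Bengio (2010) uses b = sqrt(6 / (H_k + H_{k+1})), where H_k and H_{k+1} are the sizes of the layers on either side of the weight matrix. A minimal sketch:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random):
    """Normalized initialization (Glorot & Bengio, 2010):
    sample W uniformly from [-b, b] with b = sqrt(6 / (fan_in + fan_out))."""
    b = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(low=-b, high=b, size=(fan_in, fan_out))

W = glorot_uniform(fan_in=784, fan_out=256)
print(W.shape, W.min(), W.max())
```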

5. Training tips

Hugo Larochelle gave some practical advice (pulled together in the sketch after this list):

Normalize real-valued data: subtract the mean and divide by the standard deviation.

Decrease the learning rate as training progresses.

Update using mini-batches; the gradient will be more stable.

Use momentum to push through plateaus.
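A minimal numpy sketch that pulls these tips together (the data, model, and hyperparameters are invented for illustration): normalize the inputs, train on mini-batches, decay the learning rate, and use momentum.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)            # made-up data
y = rng.randn(1000, 1)

# Tip 1: normalize real-valued data (subtract mean, divide by std).
X = (X - X.mean(axis=0)) / X.std(axis=0)

W = np.zeros((20, 1))
velocity = np.zeros_like(W)
lr, momentum, batch_size = 0.1, 0.9, 32

for epoch in range(10):
    lr_t = lr / (1.0 + 0.1 * epoch)              # Tip 2: decay the learning rate
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):   # Tip 3: mini-batch updates
        batch = idx[start:start + batch_size]
        err = X[batch] @ W - y[batch]
        grad = X[batch].T @ err / len(batch)     # gradient of 0.5 * mean squared error
        velocity = momentum * velocity - lr_t * grad   # Tip 4: momentum
        W += velocity
```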

6. Gradient checking

If you implemented backpropagation and it is not working, then with 99% probability there is a bug in the gradient computation. Use gradient checking to locate the problem. The main idea is to apply the definition of the gradient: if we increase a weight by a small amount, how much does the error of the model change?

A more detailed explanation can be found here: Gradient checking and advanced optimization.
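A minimal sketch of the idea (using a toy quadratic loss as a stand-in for a real network): perturb each weight by a small epsilon and compare the finite-difference estimate with the analytic gradient.

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    """Finite-difference gradient: how much does the loss change if we
    nudge each weight by a tiny amount?"""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Toy example: loss(w) = sum(w^2), so the analytic gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
loss_fn = lambda w: np.sum(w ** 2)
analytic = 2 * w
numeric = numerical_gradient(loss_fn, w)
print(np.max(np.abs(analytic - numeric)))  # should be tiny (~1e-9 or smaller)
```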

7. Motion tracking

Human motion tracking can be done with remarkably high precision and accuracy. Below are examples from the paper by Graham Taylor et al. (2010), “Dynamical Binary Latent Variable Models for 3D Human Pose Tracking”. The method uses conditional restricted Boltzmann machines.

8. To use syntax or not? (“Do we need to think about grammar?”)

Chris Manning and Richard Socher have put a lot of effort into developing models that combine neural embeddings with more traditional parsing approaches. This culminated in the Recursive Neural Tensor Network, which combines additive and multiplicative interactions between word meanings with a syntactic parse tree.

The model was then beaten (by quite a margin) by the Paragraph Vector (Le & Mikolov, 2014), which knows nothing about sentence structure or syntax. Chris Manning called this result “a defeat for creating ‘good’ compositional vectors”.

More recently, however, work that makes use of syntactic parse trees has turned the result around. Irsoy and Cardie (NIPS 2014) managed to beat the Paragraph Vector with deeper multi-dimensional networks. Finally, Tai et al. (ACL 2015) combined LSTM networks with parse trees and improved the results even further.

The accuracies of these models on the Stanford 5-class sentiment dataset are as follows:

So far, models that use syntactic parse trees remain superior to the simpler approaches. I am curious when the next non-syntax-based approach will appear and how it will push this race forward. After all, the goal of many neural models is not to discard the underlying syntax, but to capture it implicitly inside the network.

9. Distributed vs. distributional

Chris Manning himself clarified the difference between the two terms.

Distributed: a representation as continuous activations over several elements, such as dense word embeddings, as opposed to 1-hot vectors.

Distributional: represented by contexts of use. Word2vec is distributional, but so are count-based word vectors, since both use the contexts of a word to model its meaning.

10. Dependency parsing

A comparison of dependency parsers on the Penn Treebank:

The final result is from Google “pulling out all the stops”, training the Stanford parser on massive amounts of data.

11. Theano

I knew a bit about Theano before, but I learned a lot more at the summer school. And it is wonderful.

Since Theano originated in Montreal, it was very useful to be able to ask the Theano developers directly.

Most of the information about it can be found online, in the form of interactive Python tutorials.
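A minimal Theano example in that spirit (a sketch, not part of the official tutorial): define a symbolic expression, let Theano derive the gradient, and compile callable functions.

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic input and a shared (trainable) parameter vector.
x = T.dvector('x')
w = theano.shared(np.ones(3), name='w')

# Symbolic expression for a toy loss, and its gradient derived by Theano.
loss = T.sum((w * x) ** 2)
grad_w = T.grad(loss, w)

# Compile the expressions into callable functions.
f_loss = theano.function(inputs=[x], outputs=loss)
f_grad = theano.function(inputs=[x], outputs=grad_w)

print(f_loss([1.0, 2.0, 3.0]))   # 14.0
print(f_grad([1.0, 2.0, 3.0]))   # [ 2.  8. 18.]
```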

12. Nvidia Digits

NVIDIA has a toolkit called DIGITS that lets you train and visualize complex neural network models without writing any code. They also sell the DevBox, a customized machine built to run DIGITS and other deep learning software (Theano, Caffe, etc.). It has four Titan X GPUs and is priced at $15,000.

13. Fuel

Fuel is a toolkit for managing dataset iteration: it can split data into minibatches, shuffle it, and apply various preprocessing steps. There are prebuilt functions for established datasets such as MNIST, CIFAR-10, and Google’s One Billion Word corpus. It is mainly designed to be used together with Blocks, a toolkit that simplifies building networks with Theano.

14. Multimodal linguistic regularities

Remember “king − man + woman = queen”? It turns out this works with images as well (Kiros et al., 2015).

15. Taylor series approximation

When we are at a point x0 and take a step Δx, we can estimate the value of the function at the new location by using its derivatives, via a Taylor series approximation:

Similarly, when we update the parameters θ, we can estimate the loss function:

where g is the gradient (the first derivative with respect to θ) and H is the Hessian (the second derivatives with respect to θ).

This is a second-order Taylor approximation; accuracy can be improved by using higher-order derivatives.
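The equation images do not survive in this text version. Written out (a standard restatement of the second-order expansions, consistent with the definitions of g and H above), the two approximations are:

f(x_0 + \Delta x) \approx f(x_0) + f'(x_0)\,\Delta x + \tfrac{1}{2} f''(x_0)\,\Delta x^2

L(\theta + \Delta\theta) \approx L(\theta) + g^{\top} \Delta\theta + \tfrac{1}{2}\, \Delta\theta^{\top} H \,\Delta\theta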

16. Computational intensity

Adam Coates presented a strategy for analyzing the speed of matrix operations on a GPU. It is a simplified model that tells you whether more time is spent reading memory or doing computation. Assuming you can calculate both, you will know which part is the bottleneck.

Suppose we multiply an M×N matrix by a vector:

If M=1024 and N=512, then the number of bytes we need to read and store is:

4 bytes ×(1024×512+512+1024)=2.1e6 bytes

The number of floating-point operations is:

2×1024×512=1e6 FLOPs

If we have a GPU that does 6 TFLOP/s with a memory bandwidth of 300 GB/s, the total running time is:

max{2.1e6 bytes /(300e9 bytes/s),1e6 FLOPs/(6e12 FLOP/s)}=max{7μs,0.16μs}

This means the process takes 7 μs, dominated by copying to or from memory, and using a faster GPU would not speed it up at all. As you can probably guess, the situation improves with matrix-matrix operations, as the matrices and vectors get bigger.

Adam also described how to calculate the computational intensity of an operation:

Intensity = (# arithmetic operations) / (# bytes loaded or stored)

In the previous scenario, the intensity is:

Intensity = (1e6 FLOPs) / (2.1e6 bytes) ≈ 0.5 FLOPs/byte

Low intensity means the system is bottlenecked by memory bandwidth; high intensity means it is bottlenecked by GPU speed. This can be visualized in order to determine which of the two should be improved to speed up the whole system, and where the sweet spot lies.
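A small sketch of the same back-of-the-envelope calculation (using the GPU figures assumed above):

```python
# Roofline-style estimate for a matrix-vector product y = W x, with W of size M x N.
M, N = 1024, 512
bytes_per_float = 4

bytes_moved = bytes_per_float * (M * N + N + M)   # read W and x, write y
flops = 2 * M * N                                 # one multiply + one add per entry

bandwidth = 300e9        # bytes/s
compute = 6e12           # FLOP/s

memory_time = bytes_moved / bandwidth
compute_time = flops / compute
intensity = flops / bytes_moved

print(f"memory time  : {memory_time * 1e6:.2f} us")   # ~7 us
print(f"compute time : {compute_time * 1e6:.2f} us")  # ~0.17 us
print(f"intensity    : {intensity:.2f} FLOPs/byte")   # ~0.5
print("bound by     :", "memory" if memory_time > compute_time else "compute")
```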

17. Minibatches

Continuing from computational intensity: one way to increase the intensity of your network (so that it is limited by computation rather than memory) is to process data in minibatches. This avoids some memory operations, and GPUs are good at processing large matrices in parallel.

However, increasing the batch size too much can start hurting the training algorithm, so that convergence takes more time. It is important to find a good balance in order to get the best results in the shortest amount of time.

18. Training on adversarial examples

Recent work has shown that neural networks are easily fooled by adversarial examples. In the case below, the image on the left is correctly classified as a goldfish. However, if we add the noise pattern shown in the middle, the model receives the image on the right and the classifier becomes convinced it is a picture of a daisy. The images are from Andrej Karpathy’s blog post “Breaking Linear Classifiers on ImageNet”, where you can read more about the phenomenon.

The noise pattern is not chosen at random, but carefully calculated in order to fool the model. The point remains, though: the image on the right is clearly a goldfish, not a daisy.

Apparently, strategies such as ensembling models, voting, and unsupervised pretraining cannot fix this vulnerability. Applying heavy regularization helps, but only at the cost of accuracy on clean (noise-free) images.

Ian Goodfellow proposed the idea of training on these adversarial examples. They can be automatically generated and added to the training set. The results below show that, in addition to helping on adversarial examples, this also improves accuracy on the original samples.

Finally, we can improve the results further by penalizing the KL divergence between the class distributions predicted for the original and the adversarial examples. This optimizes the network to be more robust and to predict similar class distributions for similar (adversarial) images.
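A minimal sketch of generating an adversarial example with the fast gradient sign method of Goodfellow et al. (the logistic-regression “model” and data here are stand-ins invented for this example; the real work uses deep networks on images):

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in "model": logistic regression with fixed weights.
w = rng.randn(100)
b = 0.0
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.randn(100)          # stand-in "image"
y = 1.0                     # true label

# Gradient of the cross-entropy loss with respect to the *input* x.
p = sigmoid(w @ x + b)
grad_x = (p - y) * w

# Fast gradient sign method: a small step in the direction that increases the loss.
epsilon = 0.1
x_adv = x + epsilon * np.sign(grad_x)

print("clean prediction      :", sigmoid(w @ x + b))
print("adversarial prediction:", sigmoid(w @ x_adv + b))
```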

19. Everything is language modeling

Phil Blunsom argued that almost every NLP task can be framed as language modeling. We can do this by concatenating the input and the output, and trying to predict the probability of the whole sequence.

Translation:

P(Les chiens aiment les os || Dogs love bones)

Question answering:

P(What do dogs love? || bones .)

Dialogue:

P(How are you? || Fine thanks. And you?)

The latter two need to be conditioned on some knowledge about the world. The second part does not even have to be words; it could be labels or some structured output, such as dependency relations.
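Written out, this framing is just the chain rule applied to the concatenated sequence (a standard restatement, not taken from the slides), where the sequence is the source followed by a separator and the target:

P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})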

20. SMT had a rough start

When Frederick Jelinek and his team at IBM submitted one of the first papers on statistical machine translation to COLING in 1988, they received the following anonymous review:

The validity of a statistical (information-theoretic) approach to MT has indeed been recognized, as the authors mention, by Weaver as early as 1949. And it was universally recognized as mistaken by 1950 (cf. Hutchins, Machine Translation: Past, Present, Future, Ellis Horwood, 1986, p. 30ff and references therein). The crude force of computers is not science. The paper is simply beyond the scope of COLING.

21. The state of Neural Machine Translation

Apparently a very simple neural network model can produce surprisingly good results. Below is an example of translating Chinese into English, from Phil Blunsom’s slides:

In this model, the Chinese word vectors are simply summed together to form a sentence vector. The decoder consists of a conditional language model that takes the sentence vector, combines it with the vectors of the two most recently generated English words, and then generates the next word of the translation.

However, neural models still do not beat the performance of heavily optimized traditional machine translation systems. They do come very close, though. Results from “Sequence to Sequence Learning with Neural Networks” (Sutskever et al., 2014):

Update: @stanfordnlp pointed out that there are some recent results where neural models do outperform state-of-the-art traditional machine translation systems. See “Effective Approaches to Attention-based Neural Machine Translation” (Luong et al., 2015).

22. A great classification demo

Richard Socher demonstrated a great image classification demo where you can train your own classifier by uploading images. I trained a classifier to tell Thomas Edison apart from Albert Einstein (I could not find enough personal photos of Tesla), using five example images per class and one test image for each class. It seemed to work quite well.

23. Optimizing gradient updates

Mark Schmidt gave two presentations on numerical optimization in different settings.

In deterministic gradient methods, we calculate the gradient over the whole dataset and then apply the update. The iteration cost is linear in the dataset size.

In stochastic gradient methods, we calculate the gradient on a single data point and then apply the update. The iteration cost is independent of the dataset size.

Each iteration of stochastic gradient descent is much faster, but it usually requires many more iterations to train the network, as shown in the figure below:

To get the best of both worlds, we can use batching. More specifically, we could do one pass of stochastic gradient descent over the dataset to quickly get into the right region, and then start increasing the batch size. The gradient error decreases as the batch size grows, although the cost of each iteration eventually depends on the dataset size again.

Stochastic Average Gradient (SAG) avoids this, achieving a linear convergence rate with only one gradient per iteration. Unfortunately, it is not feasible for large neural networks, because it needs to remember the gradient update for every data point, which consumes a lot of memory. Stochastic Variance-Reduced Gradient (SVRG) reduces this memory cost and needs only two gradient computations per iteration (plus occasional full passes).
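A minimal sketch of the SVRG update on a toy least-squares problem (the data and step size are invented for illustration): keep a snapshot of the weights, compute the full gradient at the snapshot on occasional passes, and correct each stochastic gradient with it.

```python
import numpy as np

rng = np.random.RandomState(0)
n, d = 500, 10
X = rng.randn(n, d)
y = X @ rng.randn(d) + 0.1 * rng.randn(n)   # toy least-squares problem

def grad_i(w, i):
    # Gradient of the i-th squared error, 0.5 * (x_i . w - y_i)^2.
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(d)
lr = 0.01

for epoch in range(20):
    w_snapshot = w.copy()
    full_grad = (X.T @ (X @ w_snapshot - y)) / n   # occasional full pass
    for _ in range(n):
        i = rng.randint(n)
        # SVRG update: stochastic gradient, corrected using the snapshot.
        g = grad_i(w, i) - grad_i(w_snapshot, i) + full_grad
        w -= lr * g

print(np.mean((X @ w - y) ** 2))   # should approach the noise level (~0.01)
```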

Mark said that one of his students has implemented a variety of optimization methods (AdaGrad, momentum, SAG, and so on). When asked which method he would use on a black-box neural network, the student pointed to two: Streaming SVRG (Frostig et al., 2015), and a method they have not yet published.

24. Theano profiling

If you set “profile=true” in THEANO_FLAGS, Theano will analyze your program and show the time spent on each operation. Very helpful for finding performance bottlenecks.

25. The adversarial network framework

Following Ian Goodfellow’s lecture on adversarial examples, Yoshua Bengio talked about having two systems compete with each other.

System D is a discriminative system whose aim is to classify data as either real data or synthetically generated data.

System G is a generative system that tries to generate data that System D will incorrectly classify as real.

As we train one system, the other needs to get better as well. In experiments this works, but the step size must be kept quite small so that System D can keep up with System G. Below is an example from “Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks”, a more advanced version of this model that tries to generate images of churches.
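Below is a minimal sketch of the two competing systems on toy one-dimensional data (the architecture, data, and the use of PyTorch are all choices made for this illustration, not something prescribed in the lecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: "real" data are samples from N(4, 1). G maps noise to scalars,
# D outputs the probability that its input is real.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 4 + torch.randn(64, 1)
    fake = G(torch.randn(64, 8))

    # Train D: push D(real) towards 1 and D(fake) towards 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train G: try to make D believe the generated samples are real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # should drift towards the real mean (~4)
```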

26. arXiv.org numbering

arXiv identifiers contain the year and month of submission, followed by a sequence number. For example, paper 1508.03854 was the 3854th paper submitted in August 2015. Good to know.


Lei Feng Network note: this article is published by Lei Feng Network with authorization from CSDN (follow the “Lei Feng Network” public account). Please contact us for authorization before republishing.



Originally published at rod407.wordpress.com on October 11, 2016.