Understanding Artificial Intelligence
And what you can do with it.
We want you to understand the core ideas in AI today so you can start thinking about the problems you want to solve and how existing ML techniques could be applied to them. There are so many exciting and important problems that have yet to be approached because those who intimately understand them and the people able to structure them so they can be approached technically don’t overlap. If you’re already involved with AI on the nontechnical side, this should give you a better look under the hood. It’s hard to know which are the core ideas and where they’re best explained — that’s what we’re here for.
For those with technical skills interested in getting involved in one of the most exciting fields today, we hope this guide takes you from 0 to 1, and gives you the resources to go much further. At the end we’ve written out a curriculum of the best resources out there so you can take yourself all the way from basic programming literacy to understanding the most cutting edge research in AI.
To make this tutorial more digestible, we’ve broken it into sections:
- The gist of it — a high level background of AI
- Neural Networks — what is a neural network and how does it work?
- Convolutional Neural Networks
- Recurrent Neural Networks
- Generative Adversarial Networks
- Reinforcement Learning
- Natural Language Processing
- Useful Tools — Transfer Learning and Visualization
- How can I use this?
- AI curriculum and resource list
The gist of it
At the very highest level, AI is about creating machines capable of solving problems like we do, through reasoning, intuition and creativity. In the past, this was approached by writing complex sets of rules for the machine to follow. However, these systems failed to generalise to new situations or inputs the way we can.
Machine learning (ML) is the process of teaching machines to solve problems by showing them huge amounts of examples, and letting them infer their own patterns of thinking. This is a process much more similar to how we learn.
At its core, it is all about taking some input, feeding it through a model and taking the output that appears on the other side. Your input could be anything from a camera’s view of the road or an English sentence to a DNA sequence or the joint positions of a robot. The corresponding outputs could be a desired steering wheel angle, the Mandarin translation of that sentence, unusual genes, or the next torques to be applied to the motors in the robot’s joints.
Most of the success in ML today has been driven by labelled data. What this means is that for each input, you know what the output should have been. The error is how different the output your model produced was from the label. This allows you to adjust the model so that it is more likely to produce the correct output next time — in a process called training. Learning with labelled data is called supervised learning. An example of labelled data is a set of sentences with their corresponding translations, or pictures of the road with the corresponding next wheel angle from a human. Later we’ll deal with unsupervised learning, where you don’t know how to classify the data, but want the algorithm to find structure and groups in the data by itself.
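To make the ideas of error and training concrete, here is a minimal sketch in Python. The toy data, the one-weight model and the learning rate are all assumptions we made up for illustration. Real models have millions of weights, but the loop is the same in spirit.

```python
# Toy labelled data: each input x comes with a known correct output (label) y = 2x.
inputs = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]


def predict(weight, x):
    """A one-parameter 'model': the output is the input times a weight."""
    return weight * x


def mean_squared_error(weight):
    """The error: how far the model's outputs are from the labels, on average."""
    return sum((predict(weight, x) - y) ** 2 for x, y in zip(inputs, labels)) / len(inputs)


def train(weight=0.0, learning_rate=0.01, steps=200):
    """Training: repeatedly nudge the weight in the direction that lowers the error."""
    for _ in range(steps):
        # Gradient of the mean squared error with respect to the weight.
        gradient = sum(2 * (predict(weight, x) - y) * x
                       for x, y in zip(inputs, labels)) / len(inputs)
        weight -= learning_rate * gradient
    return weight


trained_weight = train()
print(trained_weight, mean_squared_error(trained_weight))
```

Starting from a weight of 0, the model ends up with a weight very close to 2, which is the relationship hidden in the labels.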
For our non-technical readers, understanding what data you’re putting in, what you want out and how much you need is by far the most important issue. We don’t actually feed our models photos, sentences and DNA sequences; we have to represent those with numbers. Over the course of this guide, you’ll see how that’s done in many different domains.
Neural Networks (NNs)
Neural networks lie at the heart of all this. They do exactly what we just discussed: they learn the relationship between certain inputs and outputs. For example, they may map how height and weight relate to the likelihood that a person plays basketball. The numbers which make up our inputs and outputs are arranged in vectors, which are simply rows of numbers.
There is no better explanation of how NNs work than this video by 3Blue1Brown, where the animation above comes from. But how do they learn? The input is transformed into the output by numbers within the model called weights. The process of determining the influence of each weight on the final output, and adjusting the weights so that the NN is more correct next time, is called backpropagation, which the following videos in his 4 part series delve into.
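As a rough sketch of how weights turn an input vector into an output, here is a tiny two-layer network for the basketball example. The layer sizes, the weight values and the input scaling are hypothetical and untrained; we picked them only so the code runs. A trained network would have learned its weights through backpropagation.

```python
import math


def sigmoid(z):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs, squashed by a sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical, untrained weights chosen only for illustration.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.3]]  # 3 hidden neurons, 2 inputs each
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]                      # 1 output neuron, 3 inputs
output_biases = [0.0]


def forward(person):
    """Feed the input vector through the hidden layer, then the output layer."""
    hidden = layer(person, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]


# Input vector: [height in metres, weight in units of 100 kg] (an assumed scaling).
likelihood = forward([1.98, 0.95])
print(likelihood)
```

The output is a single number between 0 and 1 that we could read as a likelihood. With random weights it is meaningless; training is what makes it useful.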
We can represent many things as a vector. For example, images (which are really just a grid of numbers representing pixel values) can be stretched out into a vector as seen in the video. Audio signals can be represented as a vector, with each number corresponding to the loudness of the signal at a particular frequency. Everything from words to coordinates can be vectors! Even you are probably represented as a vector somewhere, where each number corresponds to whether or not you’ve visited a certain site, or follow a certain profile.
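Stretching an image out into a vector is simple enough to sketch in a few lines. The 3x3 image and the choice to scale pixels to the range 0 to 1 are our own assumptions for illustration.

```python
# A tiny 3x3 greyscale 'image': a grid of pixel intensities from 0 (black) to 255 (white).
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]


def image_to_vector(grid):
    """Stretch the grid out row by row and scale each pixel to the range 0..1."""
    return [pixel / 255 for row in grid for pixel in row]


vector = image_to_vector(image)
print(vector)
```

The nine pixels become one row of nine numbers, which is exactly the kind of input a neural network expects.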
Convolutional Neural Networks (CNN)
CNNs are specifically designed for working with images. Through them, our AIs can now see and understand the world around them. They reduce an image down into a much smaller, but more meaningful, vector. Then you can use a much smaller NN with this ‘feature vector’ as its input and your original target as its output (for example, which class the image belongs to). This means that the NN trains faster because it is smaller, requiring fewer mathematical operations by the computer. It is also more accurate, because the CNN is great at reducing an image to only the essential information and cutting out the noise.
Remember, as we saw in the 3Blue1Brown video, images are represented as grids of numbers giving the intensity of each pixel, either in black and white or in Red Green Blue (RGB).
CNNs use two insights.
- Pixels close to each other are often arranged in features like lines, edges or dots. Those low level features are themselves arranged into more complex shapes, and so on. A normal neural network has to learn that things close to each other in the image are very likely to carry meaningful information about each other, whereas a CNN’s ‘filters’ are built with this in mind.
- The same object in different parts of the image should be treated the same. NNs have to learn about each part of the frame separately, because different connections go to every part of the input image. CNNs take the same ‘filters’ and apply them all over the image.
But what are these filters, and how do they actually work? The following video is an excellent guide to the concept. Alternatively, this blog post is also a superb guide.
To interactively play with a convolutional layer, go here
In fact, in theory you now have all you need for a car to drive itself! A CNN can convert a camera’s image into features, which are interpreted by an NN that outputs a number representing how much to turn the wheel. Thousands of students have already done this in Udacity’s project, Teaching a Machine to Steer. This will get your car to drive correctly 80–90% of the time. Unfortunately, that isn’t quite good enough for a 1 ton hunk of steel travelling at around 70 km/h. It’s that final 10% of unusual situations that currently occupies the many teams working on self-driving cars.
The most exciting applications of CNNs are in extremely accurate image classification, image segmentation to really understand a scene (by introducing the idea of region proposals) and image captioning by using them in combination with the recurrent neural networks to be introduced in the next section.
The key thing to remember is that CNNs take an input image as a grid of numbers, and either output another grid of numbers which encodes more meaningful information about what is at that location in the image, or reshape the new grid into a vector to feed to a NN.
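Here is a minimal sketch of a single filter being slid across an image. The 4x4 image and the hand-written vertical edge filter are assumptions for illustration; in a real CNN the filter values are learned during training. Like most deep learning libraries, this computes what is strictly called a cross-correlation, with no padding and a stride of 1.

```python
def convolve2d(image, kernel):
    """Slide the filter over every position where it fully fits and take a weighted sum."""
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    out_height = len(image) - kernel_height + 1
    out_width = len(image[0]) - kernel_width + 1
    output = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            total = 0
            for di in range(kernel_height):
                for dj in range(kernel_width):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# 4x4 greyscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# A hand-written filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)      # another, smaller grid of numbers
flattened = [value for row in feature_map for value in row]  # reshaped into a vector
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four filter weights are reused at every position, and the output grid lights up only in the middle column, where the edge is.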
Recurrent Neural Networks (RNNs)
Everything we’ve dealt with so far has been a single input. What about when we want a model to base its reaction to the current input on the recent past? This is sequence data. Sentences are sequences of words, videos are sequences of images and the real world is a sequence of sensory inputs!
Once again, the input is a vector of numbers, but we’re also introducing another vector which we will call the ‘state’, a representation of the past. We’ll join these two vectors end to end into one long vector, which we feed into a neural network. The output of this neural network is the new state. In this way, the new state is based on both the old state and the current input. Every time the network sees a new input, it recalculates the state (by combining the previous state and the current input).
RNNs give our ML models the ability to do things like converse, translate and write.
Hopefully it’s clear that a recurrent neural network isn’t at its core any different to a normal neural network; only the input that we give it is different. Let’s use a sentence as an example: as our model is given the vector for each word (word vectors are explained in the NLP section), it updates the state by considering both the old state and the word it just read.
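The state update described above can be sketched as follows. The vector sizes, the weight values and the use of a single tanh layer are our assumptions for illustration, not a trained model.

```python
import math


def rnn_step(state, x, weights, biases):
    """Join the old state and the current input end to end, then apply one tanh layer."""
    combined = state + x  # list concatenation gives one long vector
    return [math.tanh(sum(w * v for w, v in zip(row, combined)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical sizes and untrained weights, for illustration only.
state_size, input_size = 2, 3
weights = [
    [0.1, -0.2, 0.3, 0.0, 0.5],
    [0.4, 0.1, -0.3, 0.2, -0.1],
]  # state_size rows, each with state_size + input_size columns
biases = [0.0, 0.0]


def run(sequence):
    """Read a whole sequence, reusing the same weights at every step."""
    state = [0.0] * state_size  # start with an empty memory of the past
    for x in sequence:
        state = rnn_step(state, x, weights, biases)
    return state


sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
final_state = run(sequence)
print(final_state)
```

Because each new state depends on the previous one, feeding the same inputs in a different order gives a different final state, which is what lets the network tell sequences apart.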
Recurrent neural networks can either give an output at each step, or only after ingesting the entire sequence — this output is usually calculated by feeding the state at that step into another ‘output’ neural network.
What does this let us do? As a simple case, it could let us classify each frame of a video using the information not only from that frame, but from every frame before it. In this case, the input at each step would be one frame of the video.
Another simple case is generating text by using the last word that was generated as an input, and the entire past sentence represented as a state.
In order to do more complex tasks like translation or responding to sentences, RNNs need to be modified to give an output sequence after the input sequence. This is a sequence to sequence RNN. At its simplest, this takes the final state after reading the input sentence as a representation of the entire sentence. After receiving an ‘End of Input’ signal, it begins to generate the translation in the same way as text generation: continuously updating the state, but using the last word it output as the next input. Once the state indicates it has translated the entire sentence or made a full response, it will stop of its own accord.
Unfortunately, RNNs actually have difficulty with long sequences because they recalculate the entire state at every time step. LSTMs successfully handle this issue — and this is a beautiful explanation of them.
There are 3 really important ideas in RNNs, all of which dramatically extend their abilities.
- Attention allows it to selectively refer back to other parts of the input sequence when desired. This revolutionised translation.
- External Memory allows it to store information in a memory as it is operating, allowing it to remember more specific information and answer detailed questions, for example about the entire London Underground map.
- Variable Compute Time means the RNN has the ability to decide how long it spends computing each input, taking multiple timesteps on more complex inputs.
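To give a feel for the first of these ideas, here is a minimal sketch of attention. We assume the dot-product form, where the current state (the ‘query’) is compared with the state saved at each input position; this is one common variant, and the vectors below are made up.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight each saved input state by how well it matches the query, then blend them."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(size)]
    return weights, context


# Hypothetical states saved while reading a three-word input sentence.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 3.0]  # hypothetical current state of the network while writing its output

weights, context = attend(query, encoder_states)
print(weights, context)
```

Here most of the weight lands on the second input position, because its state points in the same direction as the query.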
Generative Adversarial Networks (GANs)
So far, our neural networks have primarily been used for inferring information about the input, whether that be classifying what it is or making some decision based on it. GANs are a clever idea which empowers our networks to create. They’ve been described as potentially ‘the most interesting idea in the last ten years in machine learning’.
Think back to CNNs: they take a lot of information (an image) and distill it into a small feature vector. Could we build a network which did the opposite, generating an image from either a feature vector or a class specification (e.g. create an image of a cat)?
How would we know how to improve our model? We can’t compare how different it is to our existing images of cats, because then it will only directly recreate things it’s seen before.
However, we can make another neural network whose job it is to determine whether it is looking at a real or generated (fake) image. If we have a dataset of real images, we can try generating a few images, and then train this second network on how accurately it can tell real from fake.
The first network is the generator, the second is the discriminator. The generator’s error signal comes from whether it fools the discriminator, while the discriminator’s loss signal comes from telling real from fake.
As it turns out, a CNN is perfect for the discriminator, since it is just a supervised classification task. The reverse of a CNN, a deconvolutional network, is the generator. By competing against each other, they both improve!
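The two error signals can be sketched as a pair of loss functions. We assume the discriminator outputs the probability that an image is real, and we use the log-loss form from the original GAN formulation (with the commonly used ‘non-saturating’ generator loss). The probabilities below are made up.

```python
import math


def discriminator_loss(p_real_on_real, p_real_on_fake):
    """Low when the discriminator says 'real' for real images and 'fake' for generated ones."""
    return -(math.log(p_real_on_real) + math.log(1.0 - p_real_on_fake))


def generator_loss(p_real_on_fake):
    """Low when the discriminator is fooled into calling a generated image real."""
    return -math.log(p_real_on_fake)


# A sharp discriminator: 90% sure the real image is real, 10% sure the fake is real.
print(discriminator_loss(0.9, 0.1), generator_loss(0.1))

# A fooled discriminator: it thinks the fake is 90% likely to be real.
print(discriminator_loss(0.9, 0.9), generator_loss(0.9))
```

Notice that the two losses pull in opposite directions on the generated image: whatever lowers the generator’s loss raises the discriminator’s, which is the competition that drives both to improve.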
There are a number of small modifications needed to make this elegant idea work in practice, the most significant of which is the Wasserstein GAN (WGAN). This is explained here, but the article requires some existing technical background to understand. For a more intuitive explanation with superb code examples, dive into the fast.ai courses linked in the curriculum at the end.
Of course, setting two networks against each other in adversarial competition is applicable not just to images, but text as well. Things start getting really exciting when the generated image has to be based on some constraints which a user can define — this has incredible potential in next generation creative tools.
The techniques outlined so far are capable of astounding things that we would definitely describe as requiring intelligence. Driving a car is a complex task, translating language even more so! Despite all this, can you imagine a CNN or RNN ever being conscious? Even GANs are closer to Photoshop than Picasso. There is a reason most ML techniques are called ‘narrow AI’: they just perform a pretty well defined task really well, mapping input to output.
Reinforcement Learning offers a path to intelligence more similar to our own. As humans, we’re constantly observing our environment and making decisions about what actions to take. Throughout this entire process we’re thinking, we’re curious and we’re conscious. Reinforcement learning is all about an agent (an intelligence), which gets some observation of the world which it is in at every timestep (in our case through the senses), and makes a set of actions in response (in our case, what we do with our muscles). It does this to maximize rewards it receives (this is up to you! Happiness?).
This sounds a lot like mapping input to output! We haven’t quite managed to find where the line between a sophisticated neural net and human consciousness lies. But hypothetically, if we were a neural network, the loss would be how far from ideal our actions were at each time step.
If we were told at every moment how correct our actions were, then we’d be dealing with a pretty easy supervised learning problem, just like teaching that car how to drive itself when we know what the next steering angle should have been based on the input image.
Instead, in real life it can take years for actions to have consequences. Reinforcement learning is the study of how to train an agent to go from observation to action when it is only sporadically rewarded. It’s the question of how to assign credit or blame to decisions from many time-steps ago. It is also one of the most exciting fields in AI, because it’s our best guess of how to develop a mind like ours. Researchers are developing reinforcement learning algorithms that learn to succeed in ever more complicated simulated environments, from Atari Games to Go and complicated Robotics simulations. The hope is that eventually we’ll be able to deploy them in the incredibly complex real world, and they’ll learn to hold their own here.
Reinforcement learning algorithms centre around ‘Policy Gradients’ and ‘Q Learning’. Policy Gradients attempt to map what they observe directly to actions. Concretely, this may mean that a vector of coordinates is mapped to a vector of numbers signifying how much force to apply to each motor. Q learning instead tries to learn the expected reward for each of its possible actions at the moment (e.g. in a game: up, down, left, right), and then chooses between the actions.
Simple NNs are perfect for this task of mapping input to output, CNNs can be introduced if the input is visual, and RNNs if the output strongly relies on recent history. The rewards received are just a number recorded at each time-step, e.g. whether it successfully achieved a goal. These rewards are then added to the number recorded for previous time-steps in progressively smaller amounts, under the reasonable assumption that more recent actions contributed most to receiving the reward. The loss of the Q learning network is how far its estimation was from the true reward, while the loss of the policy network is more complex, and typically calculated in variants of the Actor-Critic architecture.
Q learning has achieved incredible performance in Atari games. Policy Networks are now Go champions, and are beginning to show huge promise in robotics. Unfortunately, despite how exciting RL is, it’s enormously difficult and often unreliable. This is described comprehensively here: Deep Reinforcement Learning Doesn’t Work Yet. That being said, it’s getting better, and fast. It won’t be long before it seriously impacts how we live and work.
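As a minimal sketch of the Q learning side, here is the classic table-based version, where the expected rewards are stored in a dictionary rather than predicted by a neural network. The two states, the reward of 1 and the values of `alpha` (learning rate) and `gamma` (discount) are assumptions for illustration.

```python
ACTIONS = ["up", "down", "left", "right"]


def choose_action(q_values):
    """Pick the action with the highest expected reward."""
    return max(q_values, key=q_values.get)


def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Move the estimate for (state, action) towards reward + discounted best future value."""
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])


# A hypothetical two-state game: all estimates start at zero.
q_table = {
    "start": {action: 0.0 for action in ACTIONS},
    "goal": {action: 0.0 for action in ACTIONS},
}

# The agent moved right from 'start', reached 'goal' and received a reward of 1.
q_update(q_table, "start", "right", reward=1.0, next_state="goal")

print(q_table["start"]["right"])          # 0.5
print(choose_action(q_table["start"]))    # right
```

In practice a small amount of random exploration is usually mixed into the action choice, so the agent keeps trying actions whose value it has not yet learned.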
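Spreading a reward back over earlier time-steps in progressively smaller amounts can be sketched like this. The discount factor `gamma` and the example rewards are assumptions for illustration.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each time-step, add up its reward plus all later rewards, shrunk by gamma per step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# The agent is only rewarded at the final step of a four-step episode.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

The most recent action gets full credit for the reward, and each earlier action gets half as much as the one after it.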
Given a little technical background, these are a set of exceptionally good resources: one introducing reinforcement learning, one detailing the background behind a state of the art approach to policy gradients, and one introducing Q learning. For a more comprehensive look, this guides you through all the core RL ideas by programming them, and this covers the latest research, taught by the people who did most of it. This is a list of some exciting results in RL that should show promising progress.
What about when we don’t have labels, but still want to analyse or cluster our data? This is unsupervised learning! We saw before that NNs and CNNs progressively distill their bigger input into a richer feature vector. The idea behind autoencoders is that one network distills the input, and the other attempts to reconstruct the original input from this distillation. The loss is how far the reconstruction is from the original. In this way they cooperate to build a very informative, but much smaller, distillation.
What kind of network would work to reconstruct the input? If we’re using images, then we can use the very same style of network that generates fake images in GANs: deconvolutional networks. Otherwise, we can use normal NNs for inputs represented as vectors.
An image which could be composed of thousands of pixel values can be reduced to just a few hundred values from which the original image is reconstructed with great accuracy. Algorithms like t-SNE can then be used to plot these representations in two or three dimensions, revealing a lot about our data through the clusters that we can see.
For more on using these representations and t-SNE to understand your data and cluster it, see Visualizing Representations.
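The shape of an autoencoder can be sketched with a toy example that squeezes two numbers into one and back. To keep it short, the weights here are picked by hand rather than learned, and the data is assumed to lie on the line y = 2x. A real autoencoder would find its weights by training to minimise the reconstruction error.

```python
def encode(point):
    """The first network: distill a 2-number input into a 1-number code."""
    x, y = point
    return [0.2 * x + 0.4 * y]  # hand-picked weights, not learned


def decode(code):
    """The second network: try to rebuild the original 2 numbers from the code."""
    return [1.0 * code[0], 2.0 * code[0]]  # hand-picked weights, not learned


def reconstruction_error(point):
    """The loss: squared distance between the original and the reconstruction."""
    rebuilt = decode(encode(point))
    return sum((a - b) ** 2 for a, b in zip(point, rebuilt))


print(reconstruction_error([1.0, 2.0]))  # a point that follows the pattern y = 2x
print(reconstruction_error([1.0, 0.0]))  # a point that does not
```

Points that follow the pattern in the data survive the round trip almost perfectly, while points that break the pattern are reconstructed badly. That is why the small code in the middle ends up capturing the structure of the data.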
Natural Language Processing
Working out how to represent images or coordinates as numbers is easy: images are a grid of numbers, and coordinates are usually already in vector form. But what about text? Assuming we want to use NNs or RNNs, we need words as vectors. Every word is distinct, so early implementations used the ‘one-hot’ encoding that is also used for class labels. Here, the word vectors are as long as the vocabulary of the network needs to be (which could be tens of thousands of words), with a 1 in the entry corresponding to that word and 0 everywhere else.
Unfortunately, this doesn’t tell us anything about the meaning of the word. There’s no similarity between the vectors for similar words, and we’d rather have something like this!
Enter Word2Vec. Words that are used in similar contexts (e.g. both cat and dog will often be seen in the presence of fluffy) will end up having similar vectors. This encodes meaning into the vector representation, and allows the use of much smaller vectors of a few hundred numbers, which in turn enables far smaller neural nets. This is a superb explanation of word vectors.
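One-hot encoding is easy to sketch. The four-word vocabulary is a made-up example; real vocabularies are far larger.

```python
# A hypothetical four-word vocabulary.
vocabulary = ["cat", "dog", "fluffy", "car"]


def one_hot(word):
    """A vector as long as the vocabulary, with a single 1 at that word's position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(a, b):
    """Multiply matching entries and add them up: a simple measure of overlap."""
    return sum(x * y for x, y in zip(a, b))


print(one_hot("dog"))                       # [0, 1, 0, 0]
print(dot(one_hot("cat"), one_hot("dog")))  # 0
```

The overlap between any two different words is always zero, so this encoding treats cat and dog as being exactly as unrelated as cat and car.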
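Similarity between word vectors is usually measured with the cosine of the angle between them. The three-number vectors below are invented for illustration; they are not real Word2Vec output, which would have a few hundred numbers per word.

```python
import math


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean they are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up word vectors, chosen so that cat and dog sit close together.
word_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(word_vectors["cat"], word_vectors["dog"])
cat_car = cosine_similarity(word_vectors["cat"], word_vectors["car"])
print(cat_dog, cat_car)
```

With these vectors, cat is far more similar to dog than to car, which is the kind of relationship a one-hot encoding cannot express.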
When you take the dimension of the word vectors down to two, some seriously interesting spatial relationships emerge — they really do capture the meaning inherent in words.
Now that sentences are sequences of vectors, RNNs can easily be used. But what does this mean for data? Translation makes a lot of sense: an input sequence of vectors naturally converts to an output sequence of vectors that is very closely related. On the other hand, question answering is much harder. Given some question (a sequence of vectors) and the text the answer is included in (a much longer sequence of vectors), the model needs to produce an answer which is, as you might be able to guess by now, another sequence of vectors!
At the moment, the most commonly used dataset for this has extractive answers. That means the answer is somewhere, verbatim, in the text. This means your intelligence devolves to sophisticated sentence matching with the question. True abstractive answers, where the model synthesizes and combines information in the text for its response, are still a few breakthroughs away.
This is a case where neural networks find something that is relatively easy for us, reading comprehension, much harder than something quite hard for us, learning a foreign language.
Some Interesting Recent Research in NLP
Two recent papers really stood out to us at the intersection of NLP and machine learning.
Generating Wikipedia by Summarizing Long Sequences is a step towards abstractive models, because it learns to successfully summarise and write Wikipedia articles by using the sources which they cite and the top Google Search results for that topic as inputs, and either the actual article itself or the introduction as the label. This is a neat usage of data.
Transformer Networks. The model introduced here is beating all state of the art recurrent models in language translation by appropriating a trick developed for RNNs — attention.
The datasets needed to train these models to be truly accurate are huge, so what can we do? It turns out that it’s very easy to adapt a model that’s already been trained on some tasks, and retrain it for another or extend it to include more range (for example more classes). The more similar the tasks are, the less retraining you’ll need to do.
This makes sense. It’s much harder for a human to learn how to catch a ball from scratch than it is to learn to catch a differently shaped ball. Similarly, an RNN which has learnt to translate English to Mandarin has already built an understanding of English sentences as an input, so it will be much faster to retrain it as an all English chatbot than it would be to train a new RNN from scratch. As a result, if you have limited data then just find a similar task — often there will be existing pre-trained models! At this point, barely any CNNs are trained from scratch because understanding images translates so well between tasks.
How do you understand what’s going on under the hood?
It can be frustrating to work out why neural networks are failing to learn, but it can be just as fascinating to see what they actually have learned. We can’t just ask them about their thought processes, but we can interrogate them in a way.
The first is colloquially called ‘Deep Dreaming’, after Google used it to create art. Here, we modify the input image rather than the model itself to better activate either individual neurons in the model, or classes in the final layer. This lets us see what that model has learned quintessentially represents a certain output. This is most intuitive in dealing with images, but has applications in NLP as well. To truly understand how to implement this, check the fast.ai course linked in the curriculum section. This blog post details the implications and interesting findings.
The second is using the methods we saw earlier with regard to Autoencoders to examine what is going on in the internal layers of our neural networks: how much does each layer contribute to the end output? This is intuitive when we think about classifying our input as one of several classes. At each layer, how distinct are the class clusters? For more, read here and here.
How can I use this?
It all comes down to the data.
Firstly, what problem are you trying to tackle? Break that down into the specific inputs and outputs you need. Often, the problem you’re trying to tackle will need to be broken down into a set of tasks. Initially, it will be much easier to augment people by automating specific and contained parts in the process of creating an output, rather than automating them completely.
How easy these tasks will be relies on whether you can find sufficient amounts of supervised data, and whether there are similar enough models to use transfer learning. Many problems related to interpreting and analysing images are either solved or rapidly becoming solvable through the use of CNNs. Consider these data and label pairings.
- Medical Scans — Type of disorder
- Production Line Images — Presence of defects etc
- Climate or Economic Analysis through Satellite Images — Positions of vegetation, cars, roads etc
Tasks involving sequences, like processing textual information, are beginning to fall to RNNs.
- French Sentences — Russian Translations of those sentences
- Images — Captions. Aside from any analytical uses, this is amazing for the vision impaired.
- Time Series Forecasting — The series itself can be used as the label, if you only take one section then try to predict that ‘future’.
- Gene Sequences — Physical characteristics like height
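The time series pairing in the list above, where the series acts as its own label, can be sketched by sliding a window along it. The window length and the example series are assumptions for illustration.

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that comes next."""
    pairs = []
    for i in range(len(series) - window):
        pairs.append((series[i:i + window], series[i + window]))
    return pairs


# Each pair is (input section, the 'future' value to predict).
print(make_windows([1, 2, 3, 4, 5], window=3))  # [([1, 2, 3], 4), ([2, 3, 4], 5)]
```

No human labelling is needed here: every section of the history comes with its own correct answer, which is simply whatever happened next.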
Without labels, the task becomes harder. The unsupervised learning methods seen before can assist human analytics experts in finding patterns in data, but aren’t yet able to reliably perform the whole decision making pipeline. Reinforcement learning is still focused on academic performance.
What lies ahead?
AI is currently limited mostly to the digital world of information, but RL promises to bring it into the physical world and automate away many of the physical tasks that occupy us today. This currently rests on progress in a few sub fields, namely learning a reward function via demonstration or criticism (likely Inverse RL), faster learning algorithms, and work which generalizes performance in simulation to the real world.
As NLP gets more advanced, it promises to automate away much of the work done by the information workers of today. However, there are a few key breakthroughs that have yet to be made. Language translation is extremely good, as is basic textual analysis like sentiment analysis. Finding related information and classifying similar pieces of information in a search-like fashion are strong. However, complex question answering and synthesis of information are still in their infancy, which is why machine learning powered chatbots are currently toys unless focused and trained on extremely specific situations. That said, the rate of progress is dramatic.
A Full Curriculum
If these ideas interest you, and you want to get to the point where you can put them into code and start really tackling problems, here’s a suggestion of how to go about it. If you don’t yet understand how to code, find a course on Python which appeals and begin there. Then work through this guide!
Firstly, build your intuition by watching 3Blue1Brown again, but this time make sure to watch all 4 videos in the series. In an hour, you’ll have a baseline understanding of how training a model works. The maths in the 4th video may be unfamiliar for the moment, and that’s alright!
Take the two fast.ai 7 week courses and you’ll get to the point where you can write world class machine learning models. They beautifully simplify concepts which others can make extremely complicated.
These courses teach from the top down, and you’ll be implementing exciting ML models within a week. This is a much better place to start than with the theoretical fundamentals. They also introduce you to PyTorch, arguably the most intuitive machine learning framework, and guide you through your technical set up. One note: when cross-entropy is introduced as a loss function, watch this video to properly understand it.
Now you’re ready to explore any sub fields you’re particularly interested in — vision, NLP, RL, art or even how we might develop an AI with consciousness, an artificial general intelligence (AGI). Here are the best in class courses in each of those topics. If everything seems equally interesting, progress in the following order.
The assignments here will also give you a superb understanding of backpropagation, how machine learning learns, because you’re guided through writing it yourself! Why should you know how to do this when many machine learning libraries will do it for you? This is why.
If you find the learning curve a little steep in CS231N, look here first. Deep Learning is taught in the opposite direction to the fast.ai courses, from the fundamentals up. This makes it a useful complement, but if you find you can do CS231N’s assignments after fast.ai then you don’t really need to spend the time here as CS231N covers the same theory.
This is the first in 8 articles which guide you through the intuition of the fundamental ideas in reinforcement learning by coding them yourself. Doing this before the Berkeley course will mean you understand it much faster.
This course is taught by the people who are driving a lot of the exciting progress in reinforcement learning today and covers the cutting edge of research.
Artificial General Intelligence
This is a fun lecture series discussing some of the big ideas in artificial intelligence — because we still have no idea how to approach AGI! It’ll break up this set of intensive subjects.
This is an interesting set of articles by top researchers at the intersection of AI and neuroscience. The best model of intelligence we have is our own, so it makes sense to try to copy what we know.
Natural Language Processing
Deep Learning For Art
If you can work through and genuinely understand the courses above, then you should be able to keep up with most of the current research and definitely tackle any practical problems you’d like to. There are a few mathematical concepts which will pop up from time to time in papers which will be unfamiliar. A good way to get a grounding in these is to tackle a theoretical machine learning course which focuses on trying to prove why machine learning works and uncovering the fundamental laws underlying ML models. Unfortunately, the theories we have at the moment have little ability to explain the success of most techniques developed in the last decade. A strong mathematical background in not only linear algebra and statistics, but also analysis will be required.
This course covers the concepts well, but the notes assume a fair amount of mathematical literacy.
This blog covers many of the concepts that will be found in the course above, but builds up the intuition and mathematics step by step so you don’t get lost.
However, a lot of innovative research is currently done with little reference to theoretical machine learning mathematics. This is the subject of an interesting debate amongst researchers. The current state of research has been likened to alchemy because we don’t truly understand the fundamentals of why everything works. There’s no equation we can solve which tells us how many layers we need to use, or how much data we’ll need for a new architecture to learn. We have no way of proving if a new idea will work or not, we can only try them. It’s a bit like designing bridges without knowing physics. The alchemists shrug and keep using intuition to make astounding progress. Luckily, we can simulate all our ideas so it doesn’t matter how many times the bridge falls down.
Both alchemists and their critics are important. Better knowledge of the underlying mathematics would be a huge boon to research, but consistent improvements and fresh ideas grounded in intuition and creativity are just as necessary.
How to Keep Up to Date
Machine Learning is a fast moving field. Here’s where you should look for the latest advances, and who you should follow for great explanations.
- The latest interesting papers are often discussed in some depth here.
- Introduces recent interesting papers in a fun way.
- Stay on top of recent work with your own paper recommendation engine.
- Gorgeous visual explanations of some interesting ML concepts.
- Good ML blog.
- Whilst the author has mostly moved to distill, it includes several best in class explanations of fundamental concepts.
- The blog of one of the best educators in AI today, who also runs the self driving program at Tesla.
- OpenAI, DeepMind , and BAIR publish a lot of good RL related work on their blogs.
- Aurelien is off to a promising start of making ideas extremely easy to grasp, can’t wait to see what he explains next.
- Follow the ICML, NIPS and ICLR conferences and what is released at them.
- Check out a bit of the seminal work up to 2016 that the field is based on today.
We hope you learned a lot from this tutorial and we wish you the best for your further journey into the rich and fascinating field of AI. We hope we were able to share our passion and excitement for AI with you so that you continue down this path and contribute to the AI community in the future. If you found this resource useful, please share it with others. The more people that learn about AI the better. The world needs more people who understand AI so that we can be better informed about its benefits as well as its dangers, and so that ultimately we can use this technology to build a better world.
Co-authored by Sholto Douglas and Nick White at Zeroth.AI