AI Society - Medium

GANs from Scratch 1: A deep introduction. With code in PyTorch and TensorFlow

Diego Gomez Mosquera — Fri, 02 Feb 2018 15:03:50 GMT

Generative Networks Explained

“The coolest idea in deep learning in the last 20 years.” — Yann LeCun on GANs.

TL;DR #ShowMeTheCode

In this blog post we will explore Generative Adversarial Networks (GANs). If you haven’t heard of them before, this is your opportunity to catch up on what you’ve been missing. GANs have attracted a lot of hype during the last few years, especially in the last six months. Though they have only existed since 2014, they have already become widely known for their versatility and their outstanding results in generating data.

They have been used in real-life applications for text/image/video generation, drug discovery and text-to-image synthesis. Just to give you an idea of their potential, here’s a short list of incredible projects created with GANs that you should definitely check out:

Image-to-Image Translation using GANs. [1]

In this blog post we’ll start by describing Generative Algorithms and why GANs are becoming increasingly relevant. An overview and a detailed explanation on how and why GANs work will follow. Finally, we’ll be programming a Vanilla GAN, which is the first GAN model ever proposed! Feel free to read this blog in the order you prefer.

For demonstration purposes we’ll be using PyTorch, although a TensorFlow implementation can also be found in my GitHub Repo github.com/diegoalejogm/gans. You can check out some of the advanced GAN models (e.g. DCGAN) in the same GitHub repository if you’re interested, which by the way will also be explained in the series of posts that I’m starting, so make sure to stay tuned.

Output of a GAN through time, learning to Create Hand-written digits. We’ll code this example!

1. Introduction

Generative Adversarial Networks (or GANs for short) are one of the most popular Machine Learning algorithms developed in recent times.

For those new to the field of Artificial Intelligence (AI), we can briefly describe Machine Learning (ML) as the sub-field of AI that uses data to “teach” a machine/program how to perform a new task. A simple example of this would be using images of a person’s face as input to the algorithm, so that a program learns to recognize that same person in any given picture (it’ll probably need negative samples too). For this purpose, we can describe Machine Learning as applied mathematical optimization, where an algorithm represents data (e.g. a picture) in a multi-dimensional space (remember the Cartesian Plane? That’s a 2-dimensional space), and then learns to distinguish new multi-dimensional vector samples as belonging to the target distribution or not. For a visual understanding of how machines learn I recommend this broad video explanation and this other video on the rise of machines, which I found very fun to watch. Though this is a fascinating field to explore and discuss, I’ll leave the in-depth explanation for a later post. We’re here for GANs!

Google Trend’s Interest over time for term “Generative Adversarial Networks”

What’s so magical about GANs?

In short, they belong to the set of algorithms named generative models. These algorithms belong to the field of unsupervised learning, a sub-set of ML which aims to study algorithms that learn the underlying structure of the given data, without specifying a target value. Generative models learn the intrinsic distribution function of the input data p(x) (or p(x,y) if there are multiple targets/classes in the dataset), allowing them to generate both synthetic inputs x’ and outputs/targets y’, typically given some hidden parameters.

In contrast, supervised learning algorithms learn a mapping y’=f(x) from labeled data (x, y). An example of this would be classification, where one could use customer purchase data (x) and each customer’s age (y) to classify new customers. Most supervised learning algorithms are inherently discriminative, which means they model the conditional probability distribution function (p.d.f.) p(y|x) instead, that is, the probability of a target (age=35) given an input (purchase=milk). Although one can make predictions with this distribution, it does not allow sampling new instances (simulated customers with ages) from the input distribution directly.
Side-note: It is also possible to use discriminative algorithms that are not probabilistic; these are called discriminant functions.

GANs have proven to be really successful at modeling and generating high-dimensional data, which is why they’ve become so popular. Nevertheless, they are not the only type of generative model; others include Variational Autoencoders (VAEs), PixelCNN/PixelRNN and Real NVP. Each model has its own tradeoffs.

Some of the most relevant pros and cons of GANs are:

They currently generate the sharpest images
They are easy to train (since no statistical inference is required), and only back-propagation is needed to obtain gradients
GANs are difficult to optimize due to unstable training dynamics.
No statistical inference can be done with them (except here):
GANs belong to the class of direct implicit density models; they model p(x) without explicitly defining the p.d.f.

So.. why generative models?

Generative models are one of the most promising approaches to understanding the vast amount of data that surrounds us nowadays. According to OpenAI, algorithms which are able to create data might be substantially better at intrinsically understanding the world. The idea that generative models hold greater potential for solving our problems can be illustrated with a quote from one of my favourite physicists.

“What I cannot create, I do not understand.” — Richard P. Feynman

(I strongly suggest reading his book “Surely You’re Joking Mr. Feynman”)

Generative models can be thought of as containing more information than their discriminative counterparts, since they can also be used for discriminative tasks such as classification or regression (where the target is a continuous value, y ∈ ℝ). One can calculate the conditional p.d.f. p(y|x) needed for most such tasks by applying statistical inference to the joint p.d.f. p(x,y), if the generative model makes it available: p(y|x) = p(x,y) / p(x), where p(x) is obtained by summing or integrating p(x,y) over y.
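To make that inference step concrete, here is a minimal sketch in plain Python. It assumes a tiny discrete joint distribution over two purchases and two age groups; the table, its numbers and the helper name `conditional` are all invented for illustration and do not come from any real dataset or library.

```python
# Hypothetical joint distribution p(x, y): x = purchase, y = age group.
# The probabilities are invented for illustration and sum to 1.
joint = {
    ("milk", "under_30"): 0.10,
    ("milk", "over_30"): 0.30,
    ("beer", "under_30"): 0.40,
    ("beer", "over_30"): 0.20,
}

def conditional(joint, x):
    """Return p(y | x), computed from the joint as p(x, y) / p(x)."""
    # Marginal p(x): sum the joint over every target y.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}

p_age_given_milk = conditional(joint, "milk")
print(p_age_given_milk)
```

A purely discriminative model would only ever give us the final dictionary; the joint table additionally lets us sample new (purchase, age) pairs.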

Though generative models work for classification and regression, fully discriminative approaches are often more successful at these tasks.

Use Cases

Among several use cases, generative models may be applied to:

Generating realistic artwork samples (video/image/audio).
Simulation and planning using time-series data.
Statistical inference.
Machine Learning Engineers and Scientists reading this article may have already realized that generative models can also be used to generate inputs which may expand small datasets.

I also found a very long and interesting curated list of awesome GAN applications here.

2. Understanding a GAN: Overview

Global concept of a GAN

Generative Adversarial Networks are composed of two models:

The first model is called the Generator, and it aims to generate new data similar to the expected data. The Generator can be likened to a human art forger, who creates fake works of art.
The second model is called the Discriminator. Its goal is to recognize whether an input is ‘real’ (it belongs to the original dataset) or ‘fake’ (it was generated by a forger). In this scenario, the Discriminator is analogous to an art expert, who tries to classify artworks as authentic or fraudulent.

How do these models interact? Paraphrasing the original paper which proposed this framework, the Generator can be thought of as having an adversary, the Discriminator. The Generator (forger) needs to learn how to create data in such a way that the Discriminator can no longer identify it as fake. The competition between these two players is what improves their knowledge, until the Generator succeeds in creating realistic data.

Mathematically Modeling a GAN

Though the GANs framework could be applied to any two models that perform the tasks described above, it is easier to understand when using universal approximators such as artificial neural networks.

A neural network G(z, θ₁) is used to model the Generator mentioned above. Its role is to map input noise variables z to the desired data space x (say images). Conversely, a second neural network D(x, θ₂) models the Discriminator and outputs the probability, in the range (0,1), that the data came from the real dataset. In both cases, θᵢ represents the weights or parameters that define each neural network.

As a result, the Discriminator is trained to correctly classify the input data as either real or fake. This means its weights are updated so as to maximize the probability that any real data input x is classified as belonging to the real dataset, while minimizing the probability that any fake image is classified as belonging to the real dataset. In more technical terms, the loss/error function used maximizes D(x) and minimizes D(G(z)).

Furthermore, the Generator is trained to fool the Discriminator by generating data as realistic as possible, which means that the Generator’s weights are optimized to maximize the probability that any fake image is classified as belonging to the real dataset. Formally, this means that the loss/error function used for this network maximizes D(G(z)).

In practice, the logarithm of the probability (e.g. log D(…)) is used in the loss functions instead of the raw probabilities, since using a log loss heavily penalises classifiers that are confident about an incorrect classification.

Log Loss Visualization: Low probability values are highly penalized
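The same effect can be checked numerically. The following is a small sketch in plain Python (the helper name `log_loss` is just an illustrative choice) that prints the penalty -log(p) for a few values of p, the probability the classifier assigned to the correct class.

```python
import math

def log_loss(p_correct):
    """Penalty -log(p) when the classifier assigns probability p to the true class."""
    return -math.log(p_correct)

# Probability assigned to the correct class, going from
# confident-and-right down to confident-and-wrong.
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  loss = {log_loss(p):.3f}")
```

The loss stays close to zero while the classifier is right and confident, and grows without bound as p approaches 0.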

After several steps of training, if the Generator and Discriminator have enough capacity (if the networks can approximate the objective functions), they will reach a point at which neither can improve anymore. At this point, the Generator generates realistic synthetic data, and the Discriminator is unable to differentiate between the two types of input.

Since during training the Discriminator and Generator are trying to optimize opposite loss functions, they can be thought of as two agents playing a minimax game with value function V(G,D). In this minimax game, the Generator is trying to maximize the probability of having its outputs recognized as real, while the Discriminator is trying to minimize this same value.

Value Function of Minimax Game played by Generator and Discriminator
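For readers who cannot see the figure, the value function from the original paper [2] can be written out as below. Here p_data denotes the distribution of the real data and p_z the noise prior; the paper writes the arguments as V(D, G), while this post uses V(G,D) for the same quantity. The Discriminator maximizes V and the Generator minimizes it.

```latex
\min_G \max_D V(G, D) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```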

Training a GAN

Since both the Generator and Discriminator are modeled with neural networks, a gradient-based optimization algorithm can be used to train the GAN. In our coding example we’ll be using a stochastic gradient descent method (Adam, see the Optimization section), as this family of methods has proven to be successful in multiple fields.

Algorithm on how to train a GAN using stochastic gradient descent [2]

The fundamental steps to train a GAN can be described as follows:

Sample a noise set and a real-data set, each with size m.
Train the Discriminator on this data.
Sample a different noise subset with size m.
Train the Generator on this data.
Repeat from Step 1.

3. Coding a GAN

Finally, the moment several of us were waiting for has arrived. 🙌

We’ll implement a GAN in this tutorial, starting by downloading the required libraries.

pip install torchvision tensorboardx jupyter matplotlib numpy

In case you haven’t downloaded PyTorch yet, check out their download helper here. Remember that you can also find a TensorFlow example here.

We’ll proceed by creating a file/notebook and importing the following dependencies.

import torch
from torch import nn, optim
from torch.autograd import Variable  # no-op wrapper since PyTorch 0.4; kept so the code below runs unchanged
from torchvision import transforms, datasets

To log our progress, we will import an additional file I’ve created, which will allow us to visualize the training process in console/Jupyter, and at the same time store it in TensorBoard for those who already know how to use it.

from utils import Logger

You need to download the file and put it in the same folder where your GAN file will be. It is not necessary that you understand the code in this file, as it is only used for visualization purposes.

The file can be found in any of the following links:

Preview of the file we will use for logging.

Dataset

MNIST Dataset Samples

The dataset we’ll be using here is LeCun’s MNIST dataset. Its training set consists of 60,000 grayscale images of handwritten digits, each 28x28 pixels in size. This dataset will be preprocessed according to some ‘hacks’ that have proven useful for training GANs.

Specifically, the input values, which range over [0, 255], will be normalized to between -1 and 1. This means the value 0 will be mapped to -1, the value 255 to 1, and similarly all values in between will get a value in the range [-1, 1].
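The mapping happens in two steps, which the sketch below reproduces in plain Python. It assumes 8-bit pixel values; `to_unit_range` and `normalize` are illustrative helper names that mimic what torchvision's `ToTensor` and `Normalize` (with mean 0.5 and standard deviation 0.5) do to each pixel, not the library functions themselves.

```python
def to_unit_range(pixel):
    """Step 1 (what ToTensor does to 8-bit pixels): scale [0, 255] to [0, 1]."""
    return pixel / 255.0

def normalize(value, mean=0.5, std=0.5):
    """Step 2 (what Normalize does): subtract the mean, then divide by the std."""
    return (value - mean) / std

# Darkest pixel, mid-gray, and brightest pixel.
for pixel in (0, 127.5, 255):
    print(pixel, "->", normalize(to_unit_range(pixel)))
```

The darkest pixel lands on -1, mid-gray on 0 and the brightest pixel on 1, which matches the output range of the TanH layer used by the Generator later on.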

def mnist_data():
    compose = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((.5,), (.5,))  # MNIST has one channel, so one mean and one std
        ])
    out_dir = './dataset'
    return datasets.MNIST(root=out_dir, train=True, transform=compose, download=True)

# Load data
data = mnist_data()

# Create loader with data, so that we can iterate over it
data_loader = torch.utils.data.DataLoader(data, batch_size=100, shuffle=True)
# Num batches
num_batches = len(data_loader)

Networks

Next, we’ll define the neural networks, starting with the Discriminator. This network will take a flattened image as its input, and return the probability of it belonging to the real dataset, or the synthetic dataset. The input size for each image will be 28x28=784. Regarding the structure of this network, it will have three hidden layers, each followed by a Leaky-ReLU nonlinearity and a Dropout layer to prevent overfitting. A Sigmoid/Logistic function is applied to the real-valued output to obtain a value in the open-range (0, 1):

class DiscriminatorNet(torch.nn.Module):
    """
    A three hidden-layer discriminative neural network
    """
    def __init__(self):
        super(DiscriminatorNet, self).__init__()
        n_features = 784
        n_out = 1
        
        self.hidden0 = nn.Sequential( 
            nn.Linear(n_features, 1024),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3)
        )
        self.hidden1 = nn.Sequential(
            nn.Linear(1024, 512),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3)
        )
        self.out = nn.Sequential(
            torch.nn.Linear(256, n_out),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x

discriminator = DiscriminatorNet()

We also need some additional functionality that allows us to convert a flattened image into its 2-dimensional representation, and another one that does the opposite.

def images_to_vectors(images):
    return images.view(images.size(0), 784)

def vectors_to_images(vectors):
    return vectors.view(vectors.size(0), 1, 28, 28)

On the other hand, the Generative Network takes a latent variable vector as input, and returns a 784-valued vector, which corresponds to a flattened 28x28 image. Remember that the purpose of this network is to learn how to create images of hand-written digits that are indistinguishable from real ones, which is why its output is itself a new image.

This network will have three hidden layers, each followed by a Leaky-ReLU nonlinearity. The output layer will have a TanH activation function, which maps the resulting values into the (-1, 1) range, the same range in which our preprocessed MNIST images are bounded.

class GeneratorNet(torch.nn.Module):
    """
    A three hidden-layer generative neural network
    """
    def __init__(self):
        super(GeneratorNet, self).__init__()
        n_features = 100
        n_out = 784
        
        self.hidden0 = nn.Sequential(
            nn.Linear(n_features, 256),
            nn.LeakyReLU(0.2)
        )
        self.hidden1 = nn.Sequential(            
            nn.Linear(256, 512),
            nn.LeakyReLU(0.2)
        )
        self.hidden2 = nn.Sequential(
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2)
        )
        
        self.out = nn.Sequential(
            nn.Linear(1024, n_out),
            nn.Tanh()
        )

    def forward(self, x):
        x = self.hidden0(x)
        x = self.hidden1(x)
        x = self.hidden2(x)
        x = self.out(x)
        return x

generator = GeneratorNet()

We also need some additional functionality that allows us to create the random noise. The random noise will be sampled from a normal distribution with mean 0 and variance 1 as proposed in this link.

def noise(size):
    '''
    Generates a (size, 100) tensor of Gaussian-sampled random values,
    i.e. one 100-dimensional noise vector per sample
    '''
    n = Variable(torch.randn(size, 100))
    return n

Optimization

Here we’ll use Adam as the optimization algorithm for both neural networks, with a learning rate of 0.0002. The proposed learning rate was obtained after testing with several values, though it isn’t necessarily the optimal value for this task.

d_optimizer = optim.Adam(discriminator.parameters(), lr=0.0002)
g_optimizer = optim.Adam(generator.parameters(), lr=0.0002)

The loss function we’ll be using for this task is named Binary Cross Entropy Loss (BCE Loss). It is used in this scenario because it resembles the log-loss for both the Generator and Discriminator defined earlier in the post (see Mathematically Modeling a GAN). Specifically, we’ll be taking the average of the loss calculated over each minibatch.

loss = nn.BCELoss()

Binary Cross Entropy Log. Mean is calculated by computing sum(L) / N.

In this formula the values y are named targets, v are the inputs, and w are the weights. Since we don’t need the weight at all, it’ll be set to wᵢ=1 for all i.
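Written out, the formula referred to above is the standard binary cross entropy as documented for PyTorch's nn.BCELoss, with the symbols renamed to match the text (v for inputs, y for targets, w for weights, N for the minibatch size):

```latex
L = \{\ell_1, \dots, \ell_N\}, \qquad
\ell_i = -w_i \left[ y_i \log v_i + (1 - y_i) \log(1 - v_i) \right], \qquad
\operatorname{mean}(L) = \frac{1}{N} \sum_{i=1}^{N} \ell_i
```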

Discriminator Loss:

Discriminator’s Loss.

If we replace vᵢ = D(xᵢ) and yᵢ=1 ∀ i (for all i) in the BCE-Loss definition, we obtain the loss related to the real images. Conversely, if we set vᵢ = D(G(zᵢ)) and yᵢ=0 ∀ i, we obtain the loss related to the fake images. In the mathematical model of a GAN I described earlier, the Discriminator’s objective had to be maximized by gradient ascent, but PyTorch and most other Machine Learning frameworks minimize functions instead. Since maximizing a function is equivalent to minimizing its negative, and the BCE-Loss term has a minus sign, we don’t need to worry about the sign.
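These two substitutions can be verified by hand. The sketch below is a plain-Python re-implementation of the mean BCE with all weights set to 1; it is not PyTorch's own code, and the Discriminator outputs in `d_real` and `d_fake` are made-up numbers used only for illustration.

```python
import math

def bce(values, targets):
    """Mean binary cross entropy with all weights w_i = 1."""
    losses = [
        -(y * math.log(v) + (1 - y) * math.log(1 - v))
        for v, y in zip(values, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical Discriminator outputs, invented for illustration.
d_real = [0.9, 0.8, 0.7]   # D(x_i) on real images
d_fake = [0.2, 0.1, 0.3]   # D(G(z_i)) on generated images

loss_real = bce(d_real, [1, 1, 1])   # reduces to -mean(log D(x_i))
loss_fake = bce(d_fake, [0, 0, 0])   # reduces to -mean(log(1 - D(G(z_i))))
print(loss_real, loss_fake)
```

With targets of 1 only the log v term survives, and with targets of 0 only the log(1 - v) term does, so the sum of the two values is the negative of the Discriminator's objective estimated on this batch.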

Additionally, we can observe that the real-images targets are always ones, while the fake-images targets are zero, so it would be helpful to define the following functions:

def ones_target(size):
    '''
    Tensor containing ones, with shape (size, 1)
    '''
    data = Variable(torch.ones(size, 1))
    return data

def zeros_target(size):
    '''
    Tensor containing zeros, with shape (size, 1)
    '''
    data = Variable(torch.zeros(size, 1))
    return data

By summing up these two discriminator losses we obtain the total mini-batch loss for the Discriminator. In practice, we will calculate the gradients separately (they accumulate across the two backward passes), and then update the weights once using both.

def train_discriminator(optimizer, real_data, fake_data):
    N = real_data.size(0)
    # Reset gradients
    optimizer.zero_grad()
    
    # 1.1 Train on Real Data
    prediction_real = discriminator(real_data)
    # Calculate error and backpropagate
    error_real = loss(prediction_real, ones_target(N) )
    error_real.backward()

    # 1.2 Train on Fake Data
    prediction_fake = discriminator(fake_data)
    # Calculate error and backpropagate
    error_fake = loss(prediction_fake, zeros_target(N))
    error_fake.backward()
    
    # 1.3 Update weights with gradients
    optimizer.step()
    
    # Return error and predictions for real and fake inputs
    return error_real + error_fake, prediction_real, prediction_fake

Generator Loss:

Generator’s Loss

Rather than minimizing log(1 - D(G(z))), training the Generator to maximize log D(G(z)) will provide much stronger gradients early in training. The two losses can be swapped because, as noted in the original paper [2], they result in the same fixed point of the dynamics for the Generator and Discriminator.
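A quick calculation shows where the stronger gradients come from. The sketch below assumes the Discriminator's output is a sigmoid applied to a real-valued score a (as in the network defined earlier), and compares the derivative of each loss with respect to that score. The function names and the value a = -5 are illustrative choices, not part of the original code.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the loss the Generator would minimize."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the loss used in practice."""
    return -(1.0 - sigmoid(a))

# Early in training the Discriminator rejects fakes confidently,
# so D(G(z)) = sigmoid(a) is close to 0. Here sigmoid(-5) is about 0.0067.
a = -5.0
print(abs(grad_saturating(a)), abs(grad_non_saturating(a)))
```

When D(G(z)) is close to 0, the first gradient is nearly zero (the loss saturates) while the second is close to 1 in magnitude, so the Generator still receives a useful learning signal.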

Maximizing log D(G(z)) is equivalent to minimizing its negative, and since the BCE-Loss definition has a minus sign, we don’t need to take care of the sign. As with the Discriminator, if we set vᵢ = D(G(zᵢ)) and yᵢ=1 ∀ i, we obtain the desired loss to be minimized.

def train_generator(optimizer, fake_data):
    N = fake_data.size(0)

    # Reset gradients
    optimizer.zero_grad()

    # Run the discriminator on the generated (fake) data
    prediction = discriminator(fake_data)

    # Calculate error and backpropagate
    error = loss(prediction, ones_target(N))
    error.backward()

    # Update weights with gradients
    optimizer.step()

    # Return error
    return error

Testing

One last thing before we run our algorithm: we want to visualize how the training process develops while our GAN learns. To do so, we will create a static batch of noise, and every few steps we will visualize the batch of images the Generator outputs when using this noise as input.

num_test_samples = 16
test_noise = noise(num_test_samples)

Training

Now that we’ve defined the dataset, networks, optimization and learning algorithms we can train our GAN. This part is really simple, since the only thing we’ve got to do is code in Python the pseudocode shown earlier for training a GAN (see Training a GAN).

We’ll be using all the pieces we’ve coded already, plus the logging file I asked you to download earlier for this procedure:

# Create logger instance
logger = Logger(model_name='VGAN', data_name='MNIST')

# Total number of epochs to train
num_epochs = 200

for epoch in range(num_epochs):
    for n_batch, (real_batch,_) in enumerate(data_loader):
        N = real_batch.size(0)

        # 1. Train Discriminator
        real_data = Variable(images_to_vectors(real_batch))

        # Generate fake data and detach 
        # (so gradients are not calculated for generator)
        fake_data = generator(noise(N)).detach()

        # Train D
        d_error, d_pred_real, d_pred_fake = \
              train_discriminator(d_optimizer, real_data, fake_data)

        # 2. Train Generator

        # Generate fake data
        fake_data = generator(noise(N))

        # Train G
        g_error = train_generator(g_optimizer, fake_data)

        # Log batch error
        logger.log(d_error, g_error, epoch, n_batch, num_batches)

        # Display Progress every few batches
        if (n_batch) % 100 == 0: 
            test_images = vectors_to_images(generator(test_noise))
            test_images = test_images.data

            logger.log_images(
                test_images, num_test_samples, 
                epoch, n_batch, num_batches
            )
            # Display status Logs
            logger.display_status(
                epoch, num_epochs, n_batch, num_batches,
                d_error, g_error, d_pred_real, d_pred_fake
            )

And that’s it, we’ve made it! 🎊

Results

In the beginning the images generated are pure noise:

But then they improve,

Until you get pretty good synthetic images,

It is also possible to visualize the learning process. As you can see in the next figures, the discriminator error is very high in the beginning, as it doesn’t yet know how to correctly classify images as real or fake. As the discriminator becomes better and its error decreases to about .5 at step 5k, the generator error increases, indicating that the discriminator is outperforming the generator and can correctly classify the fake samples. As time passes and training continues, the generator error decreases, implying that the images it generates are getting better and better. While the generator improves, the discriminator’s error increases, because the synthetic images are becoming more realistic each time.

Generator’s Error through Time

Discriminator’s Error through Time

You can also check out the notebook named Vanilla Gan PyTorch in this link and run it online. You may also download the output data.

runs/ folder contains the tensor board data.
data/ folder contains the images generated through time and the already trained neural network models.

Conclusions

In this blog post I have introduced Generative Adversarial Networks. We started by learning what kind of algorithms they are, and why they’re so relevant nowadays. Next we explored the parts that make up a GAN and how they work together. Finally, we linked the theory with the practice by programming a fully working implementation of a GAN that learned to create synthetic examples of the MNIST dataset.

Now that you’ve learned all of this, the next step would be to keep on reading and learning about the more advanced GAN methods that I listed in the Further Reading Section. As mentioned earlier, I’ll keep writing this kind of tutorial to make it easier for enthusiasts to learn Machine Learning in a practical way, picking up the required maths along the way.

References

[1] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros, Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2018, https://arxiv.org/abs/1703.10593

[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks, 2014, https://arxiv.org/abs/1406.2661

GANs from Scratch 1: A deep introduction. With code in PyTorch and TensorFlow was originally published in AI Society on Medium, where people are continuing the conversation by highlighting and responding to this story.

What happens next? (an opinion on AI and jobs)

Diego Montoya Sefair — Tue, 27 Jun 2017 19:28:21 GMT

When you actually sit down to think about it, it turns out to be really complex, especially as the effects are starting to be visible right now. I'm talking about Artificial Intelligence (AI) and its effects on jobs.

Being immersed in these kinds of topics makes you realize that it’s not insane to come to the following conclusion: Sooner or later, a machine will be able to perform any job better than a human. A machine could become a better doctor, a better chef, a better psychologist (yes, a better psychologist), a better lawyer, or a better programmer.

The reason I believe this is relatively simple. Firstly, it is worth mentioning the remarkable success that AI has had recently in a large number of fields. Cars can now drive themselves, computers are learning to play some Atari games better than humans, and even read lips better than humans. Moreover, computers are very good at doing mathematical computations, and a lot of the core of machine learning is mathematics. An optimization problem. And while brains are incredible biological machines, a computer really has the potential to take every single variable into account and calculate outputs that are closer to a global optimum than the brain would.

Now, the key difference between yesterday’s and today’s techniques is that instead of explicitly writing programs to solve a problem, we are writing programs that learn to solve the problem. Instead of explicitly designing a program to recognize handwritten digits we design a program which, by seeing examples of a lot of different pictures and what the number is, learns a method (a-priori unknown to the programmer) to recognize them with near-perfect accuracy. This is an incredibly remarkable achievement. Computers are learning to solve problems, rather than following a pre-crafted recipe for solving them. This resembles the way our brains learn, and it’s producing absolutely impressive results so far.

But the purpose of this article is not to convince you that computers will most likely surpass humans at almost every task. So if you are not yet convinced of this, there are plenty of other resources that could help. You could also take a look at the following video from CGP Grey for a broader view of the situation. Either way, I would recommend watching it since it complements this article's introduction.

https://medium.com/media/bb25538df5ca862ff143e95d613f674d/href

In summary, I see no reason to doubt that AI will continue to evolve at an ever increasing (exponential) pace. In fact, AI can also help the very research of AI since Machine Learning is becoming one of the core components of Data Science, the art of processing and making sense of data.

The ultimate industrial revolution

As AI evolves and becomes more capable, it starts to displace humans at the tasks where it does better (or comparably well). Why? For one, machines work 24 hours a day, 365 days a year, without complaining. And if they do the job better and faster than humans, that combination is a huge deal for companies. From their point of view they become a lot more profitable: they reduce costs and increase quality (and quantity, if applicable). Overall, they offer better products or services at lower prices. This is not a new thing, as the Industrial Revolution showed us. However, now it's the turn of intellectual jobs rather than technical ones.

This, by itself, is not bad. We are producing better things, and even better knowledge. Just imagine putting machines to think about the most difficult problems humanity has right now, 24/7. The largest and most efficient research center driven by machines that don't sleep, working for the well-being of us and asking nothing in return. We would just set them and wait to see what they discover. Physics, chemistry, medicine, mathematics. In general, any area of knowledge. Envision AI super-computers dedicated to finding the cure to cancer, or to solving the most challenging questions of the universe.

Due to these amazing possibilities I would say research on AI is not going to stop. We are approaching (and have already begun) a phase of profound change. During the Industrial Revolution we already saw how machines can perform a lot of repetitive jobs as well as or better than humans. But now we are beginning to see how machines can also outperform humans (or do equally or sufficiently well) at tasks we would never have imagined: intellectual tasks. As this process happens we start losing jobs, only this time the new jobs that (temporarily) replace the lost ones will require a high level of knowledge and education.

In the beginning new jobs emerge, as we are seeing today and as we saw during the Industrial Revolution. New jobs in the fields of data science and machine learning have been created recently. However, as progress is made more and more jobs also disappear (exponentially), and at a faster rate than the ones emerging. And this is not something that will happen, it's something that is happening right now as you have probably seen in the news and in your everyday life. Eventually, and as we have discussed earlier, we can get to the point where there are little to no jobs a machine cannot perform better than a human. If that happens, that would be it: the ultimate revolution.

The collapse of the monetary system

Regardless of what happens (i.e. whether or not machines outperform humans at every intellectual task), the economic system is going to suffer. Jacque Fresco, the proposer of something big he called The Venus Project (we’ll get to this later), had a position on this. In one of his talks he remarked precisely on how machines are replacing jobs in multiple sectors of industry. As we discussed earlier, he said this is going to happen naturally since industries find it far more profitable to have as few humans as possible. No wages or any other legal requirements attached to regular contracts, much faster (and usually better) work, and to add to that, no 9–5 weekday shifts but 24/7 operation, 365 days a year.

As jobs get replaced, humans start to make their way out, and as this happens productivity will rise substantially while purchasing power falls drastically: the collapse of the monetary system. Production is off the charts, but no one has money to buy things. What should we do then? This is my big question, and it is what gives this article its title. It’s almost certain that the economy will collapse, so what happens next? We are clearly not prepared for this, and if you think about it, any attempt to prevent it would probably not get very far.

One option would be to implement laws to regulate companies and force them to hire people. But how should we handle jobs that humans can no longer do because of the effort or precision they demand? How can we be fair to all companies? Maybe we come up with some solution to make this work, but what are we accomplishing? Does it even make sense to force companies to be more inefficient? In the end, machines give companies the potential to produce better products and services (also at a lower price, but that's irrelevant since customers wouldn't be able to afford them).

Maybe another solution then is to stop research, literally making research on AI illegal. This could certainly stop progress and probably save the economic system. But again, what would we be accomplishing? What is worth more to us: money or progress? Should we give up the potential of discovering the cure for cancer in order to save the economic system? Should we agree to let people die in countless driving accidents caused by human error in order to prevent driverless vehicles from taking over the transportation industry? Should we make people waste time in supermarket lines to preserve human cashiers? Should we reduce the potential of understanding the biggest mysteries of the universe to prevent AI from evolving?

So, we will be faced with some controversial decisions, and this is the main point I want to make here. A crisis will come, but it will not be like the economic crises of the past. This time we will reach a point where money itself stops making sense. Money rewards work, but if few or no humans are doing work then money loses its very reason for existing. Machines work for us and they don’t ask for anything in return (hopefully we are smart enough that this doesn’t change).

If we continue with things as they are today, research will be done, technology will improve, and so will services, products, and health. But many jobs will cease to exist, and we are already seeing the effect. Naturally, repetitive jobs will be the first to suffer. Technology like Google's Waymo project (Google's version of a self-driving car) and autopilots will displace every human from the transportation industry. And those are not technologies far in the future. We are talking 1–2 years before we see commercial, fully autonomous vehicles out there, and you guessed it, they are very profitable for the transportation industry. If you are not convinced, you might take a look at what Uber already has in the works.

Just taking away transportation jobs from humans is enough to create a giant disruption in the system. Where are these people going to go? During the economic crisis that is beginning to happen, new temporary jobs will be created (temporary since they’ll most likely be replaced later as well). The difference is that these jobs will all have a common denominator: they will be intellectual and require education. Repetitive jobs will come to an end and people will be forced to seek education if they want to survive.

If you ask me, stopping progress is not a very clever solution, and I would think governments will probably agree. If that is so, an economic crisis will happen, and it will be the ultimate crisis since it will take the monetary system down. I won't deny it: the transition will be very hard, especially for those doing the jobs that will disappear in the first wave. I'm eager to see what people figure out to ease this transition, but something big is about to happen.

What happens next?

And so we get to our ultimate question. What happens next? What happens after the crisis? I can't answer this question since only time will tell, but I would say our way of living will see a dramatic change in the coming decades. I think money will lose all of its fundamental value, and that's when we can remove it altogether and start to give things away for free.

If you want an idea of a solution, Jacque Fresco thought about one for some time during his life. His response was The Venus Project, a resource-driven economy. He focuses precisely on the idea of removing money from the system and moving on to a way of life in which we are given things for free. Everyone would have a very high standard of living, and this would be possible since machines would do almost everything. In fact, one of Fresco’s points is that by removing money we remove most of the problems humanity faces right now: poverty, corruption, most forms of violence and war, among many other problems that derive from money. We could stop worrying about work and focus on doing other things. Is this the right path? Who knows, but what is? We would have to take a lot of care, for example, with respect to the psychological implications this would have on people.

But my point isn't to discuss The Venus Project in detail. I just referenced it so you could get an idea of what could be next. The point that I want to make can be summarized as follows. Something big will happen in the coming decades, and this is because it doesn't make much sense to even try to prevent progress. If we have the potential to save lives, shouldn't we? If we can make key advances in medicine, shouldn't we? If we have the potential to solve the most challenging mysteries of the universe, shouldn't we? As a side effect, money will most certainly stop making sense, and yes, the transition to a money-less system will be hard. Humanity will probably undergo the biggest change in history, and some decades from now life will be nothing like it currently is. In my humble opinion, this is something worth thinking about now, since something big is about to happen, and it's going to happen sooner than we realize.

Why Convolutional Neural Networks are a Great Architecture for Machine Translation

Esteban Vargas — Wed, 07 Jun 2017 22:48:35 GMT

Facebook AI Research recently posted a paper in which a Convolutional Neural Network architecture is proposed for machine translation instead of a Recurrent Neural Network architecture, which has been the convention until now. In this post we will explain why a CNN-based architecture might become the standard for machine translation (and even other NLP tasks) in the future.

Convolutional Neural Networks basics
Let’s first understand how CNNs work in order to explain why they have certain advantages over RNNs.
The basic concept underlying CNNs is that we compute a vector for every possible phrase. For instance, take the string “my favorite AI blog”. We compute a vector representation for each phrase in it: “my favorite”, “favorite AI”, “AI blog”, “my favorite AI”, and “favorite AI blog”. To do so, we first compute all the bigram vectors and then keep combining adjacent vectors until we reach a single top vector.

Example of the computation of the bigram vectors for a very simple sentence until the top vector is reached. Retrieved from the Association for Computational Linguistics.

The computation of these bigrams (or n-grams in more complex but convenient CNN architectures) can be parallelized, whereas the time-step-based computations in an RNN architecture can’t. This kind of architecture has been really successful in classifying images. Take the image of a cat as an example: the network first processes clusters of pixels, then recognizes shapes, then recognizes parts of the image (ears, legs, tail, etc.), and finally recognizes that there is a cat in the picture.
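The bottom-up composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `compose` rule (tanh of the element-wise mean) and the toy vectors are made up for the example, whereas a trained CNN would apply learned filter weights.

```python
import math


def compose(left, right):
    """Combine two adjacent vectors into one vector of the same size.

    Hypothetical rule (tanh of the element-wise mean); a trained CNN
    would apply learned filter weights here instead.
    """
    return [math.tanh(0.5 * (a + b)) for a, b in zip(left, right)]


def build_layers(word_vectors):
    """Return every layer of the pyramid, from word vectors up to the top vector."""
    layers = [word_vectors]
    while len(layers[-1]) > 1:
        prev = layers[-1]
        # Each adjacent pair is independent of the others, so a real
        # implementation can compute a whole layer in parallel.
        layers.append([compose(prev[i], prev[i + 1]) for i in range(len(prev) - 1)])
    return layers


# Toy 2-dimensional vectors for "my", "favorite", "AI", "blog" (made-up values).
toy_vectors = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.6]]
layers = build_layers(toy_vectors)
layer_sizes = [len(layer) for layer in layers]
print(layer_sizes)  # [4, 3, 2, 1]
top_vector = layers[-1][0]
```

The second layer holds the three bigram vectors, the third the two trigram vectors, and the last one the top vector for the whole string.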

Convolutional Sequence to Sequence Learning
Gehring et al. (2017) proposed using CNNs for three reasons: unlike with RNNs, computation can be parallelized; optimization is easier since the number of non-linearities is fixed and independent of the input length; and they outperform the accuracy of the LSTM setup in Wu et al. (2016). In addition, thanks to the hierarchical structure, capturing dependencies between elements n positions apart takes O(n/k) convolutional operations for kernels of width k, instead of the O(n) steps an RNN needs.

Convolutions have been known to have several advantages since the early days, such as the ones presented by Waibel et al. (1989) and LeCun & Bengio (1995), but on their own they only create representations for fixed-size contexts. RNNs made it possible to create representations for variable-size contexts, and LSTMs and GRUs tackled the problem of RNNs not capturing long-range dependencies. These were the reasons why RNNs became the standard for machine translation and CNNs became more widely adopted in fixed-size contexts such as image processing. But the convolutional architecture presented by Gehring et al. tackles these problems and outperforms RNNs.

Convolutional Architecture
In this section we’ll summarize the fully convolutional architecture proposed by Gehring et al. The first step is to embed the input elements in a distributional space and to give the model a sense of order by also embedding the absolute position of each input element. Both vectors are then combined, and we proceed similarly for the output elements that were already generated by the decoder network.

This linear combination represents the input embeddings in the source language.

Based on these input elements, intermediate states are computed for both the encoder and decoder networks. The computation of each of these states is called a block, and each block contains a one-dimensional convolution followed by a non-linearity. Gated Linear Units, as proposed by Dauphin et al. (2016), are the non-linearity, which implements a gating mechanism over the output of the convolution. Ultimately, the softmax activation function is used to compute a distribution over the T possible next target elements.

Each convolution kernel takes X (a concatenation of k input elements embedded in d dimensions) as an input and outputs Y which has twice the dimensionality of the input elements.
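To make the shapes concrete, here is a minimal sketch of one such block for a single window of k input elements. It assumes the gating takes the form A * sigmoid(B), where A and B are the two halves of Y; the function name, the toy weights, and the zero bias are made up for the example and are not the trained parameters of the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv_glu_block(window, weights, bias):
    """Apply one convolution kernel and a gated linear unit to one window.

    window:  k input elements, each embedded in d dimensions
    weights: 2*d rows, each of length k*d (hypothetical kernel weights)
    bias:    2*d values
    """
    # X: concatenation of the k input elements (length k*d).
    x = [value for element in window for value in element]
    # Y: output of the kernel, twice the dimensionality of one input element.
    y = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    d = len(y) // 2
    a, gate = y[:d], y[d:]
    # Gating: the second half of Y decides how much of the first half passes.
    return [a_i * sigmoid(g_i) for a_i, g_i in zip(a, gate)]


k, d = 3, 2
window = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]
weights = [[0.1 * ((r + c) % 3 - 1) for c in range(k * d)] for r in range(2 * d)]
bias = [0.0] * (2 * d)
output = conv_glu_block(window, weights, bias)
print(len(output))  # 2
```

The output has d dimensions again, so blocks of this kind can be stacked on top of each other.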

The top decoder output is transformed with a linear layer with weights Wo and bias bo respectively.

Then we proceed to compute attention. We combine the current decoder state with an embedding of the previous target element to get the current decoder state summary. The current attention for the current decoder layer, state, and source element is computed by taking the dot-product between the decoder state summary and each output of the last encoder block.

Next, the conditional input for the current decoder layer is computed as a weighted sum of the encoder outputs and the input element embeddings. Once this is computed, it’s added to the output of the corresponding decoder layer.
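The attention step and the conditional input can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: it assumes the decoder state summary is a plain sum (any learned projection is omitted) and that the dot-product scores are normalized with a softmax. All names and values are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conditional_input(decoder_state, prev_target_embedding,
                      encoder_outputs, input_embeddings):
    # Decoder state summary: current decoder state plus the embedding of
    # the previous target element (assumption: no learned projection).
    summary = [h + g for h, g in zip(decoder_state, prev_target_embedding)]
    # One attention weight per source element, from dot-product scores.
    weights = softmax([dot(summary, z) for z in encoder_outputs])
    # Weighted sum of (encoder output + input element embedding).
    context = [
        sum(a * (z[i] + e[i])
            for a, z, e in zip(weights, encoder_outputs, input_embeddings))
        for i in range(len(summary))
    ]
    return weights, context


attn, context = conditional_input(
    decoder_state=[0.2, 0.1],
    prev_target_embedding=[0.0, 0.3],
    encoder_outputs=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    input_embeddings=[[0.1, 0.1], [0.2, 0.0], [0.0, 0.3]],
)
```

The returned `context` vector is what would be added to the output of the corresponding decoder layer.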

Finally, a normalization strategy and careful weight initialization are applied to ensure that the variance across the network doesn’t change dramatically, which stabilizes learning. This way we finally reach our desired translation in the target language.

Experimental Setup and Results
Facebook AI Research used three major WMT translation tasks to compare both architectures. The BLEU algorithm, which measures how closely a machine-translated text corresponds to a professional human translation, was used to benchmark the translations. For English-Romanian, the convolutional architecture came out ahead by 1.8 BLEU. For English-French the difference was 1.5 BLEU, and for English-German it was 0.5 BLEU.

Why Convolutional Neural Networks are a Great Architecture for Machine Translation was originally published in AI Society on Medium, where people are continuing the conversation by highlighting and responding to this story.

Recurrent Neural Networks for Language Translation

Esteban Vargas — Mon, 13 Mar 2017 13:35:31 GMT

Tools that allow any person to communicate with any other person truly make the world a better place. The Rosetta Stone was the first of such tools, evolving into dictionaries and eventually into sophisticated systems such as a language translator that you’ve probably used before, Google Translate. Deep Learning will soon change how these systems work, and the models that will enable such a thing have all kinds of applications in NLP even outside the realm of machine translation (such as building an opinion generator, which part of the AI-Society team will actually hack on, so stay tuned).

The Rosetta Stone is considered the first language translator

Let’s first talk about recurrent neural network (RNN) based language models. Yoshua Bengio proposed using artificial neural network based statistical modeling to compute the probability of a sequence of words occurring. This approach proved to be successful; however, feedforward neural networks can’t take variable-length sequences as input, which limits the power of the model. Since RNNs allow variable-length sequences both as input and as output, they are naturally the next step in statistical language modeling. [1] The RNN architecture is presented in the diagram below.

Simple Recurrent Neural Network architecture model presented by Mikolov et al.

In this model we are given a sequence of word vectors as input. We have t time-steps, which are equivalent to the number of hidden layers, and each of these layers has neurons (each performing a linear matrix operation on its inputs followed by a non-linear operation). At each time-step, the output of the previous step and the next word vector in the text corpus are passed as inputs to the hidden layer to generate the prediction of a sequence (conditioning the neural network on all previous words).

Equation for computing the hidden state with a linear neural network at each time-step

Equation for the softmax classifier
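As a rough sketch of what these two equations compute, the snippet below implements one time-step in plain Python. It assumes the common formulation in which the hidden state is a sigmoid of a linear combination of the previous hidden state and the current word vector; the weight names `W_hh`, `W_hx`, and `W_s` and all values are made up for the example.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(h_prev, x_t, W_hh, W_hx, W_s):
    """One time-step: new hidden state and a distribution over the vocabulary."""
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_hx, x_t))]
    h_t = [sigmoid(p) for p in pre]      # hidden state at time t
    y_t = softmax(matvec(W_s, h_t))      # softmax classifier over next words
    return h_t, y_t


# Toy sizes: 3-dimensional word vectors, 2 hidden units, 3-word vocabulary.
W_hh = [[0.1, -0.2], [0.0, 0.3]]
W_hx = [[0.5, 0.1, -0.3], [-0.2, 0.4, 0.2]]
W_s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

h = [0.0, 0.0]
for x in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:  # two word vectors in sequence
    h, y = rnn_step(h, x, W_hh, W_hx, W_s)
```

Because `h` is fed back in at every step, the prediction `y` is conditioned on all previous words, not just the current one.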

These are the basics of RNNs. However, simple RNN architectures have problems, which were explored by Bengio et al. In practice, simple RNNs aren’t able to learn “long-term dependencies”. Let’s analyze the following example, in which we try to predict the last word in a sentence:

“I prefer writing my code in Node JS because I am fluent in ______.”

The blank could probably be filled with a programming language, and if you know about backend development you might know that the answer is JavaScript. In order for a program to know this the program needs some context about Node JS and JavaScript from somewhere else in the text. Two fancy types of RNNs solve this problem, Long Short-Term Memories (LSTMs) and Gated Recurrent Units (GRUs). The TensorFlow documentation has an amazing tutorial on language modeling with LSTMs so for the purpose of this blog post we will do the same thing with GRUs.

Gated Recurrent Units

GRUs were introduced by Cho et al. The main idea behind this architecture is to keep around memories to capture long distance dependencies and to allow error messages to flow at different strengths depending on the inputs.

Instead of directly computing a hidden layer at the next time-step, a GRU first computes an update gate (which is another layer), taking the current word vector and the hidden state as inputs.

Update gate

Then a reset gate is computed with the same equation but with different weights.

Reset gate

If the reset gate is close to 0, the previous hidden state is ignored and the new memory content stores only the new word information.

Memory (reset)

The final memory at the current time-step combines the new memory content with the memory from the previous time-step, weighted by the update gate.

Final memory

Clean illustration of the architecture by Richard Socher
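The four steps above (update gate, reset gate, new memory, final memory) can be sketched as a single function. This assumes the commonly used formulation in which the final memory is z * h_prev + (1 - z) * new_memory; the parameter names in the dictionary and the toy weights are hypothetical, and biases are omitted for brevity.

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x_t, h_prev, p):
    """One GRU time-step. p maps the names Wz, Uz, Wr, Ur, W, U to matrices."""
    # Update gate: how much of the previous memory to keep.
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    # Reset gate: same equation, different weights.
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # New memory: if r is near 0, only the new word information is stored.
    uh = matvec(p["U"], h_prev)
    h_new = [math.tanh(a + r_i * b)
             for a, r_i, b in zip(matvec(p["W"], x_t), r, uh)]
    # Final memory: blend of previous memory and new memory.
    return [z_i * h_old + (1.0 - z_i) * h_n
            for z_i, h_old, h_n in zip(z, h_prev, h_new)]


def constant_matrix(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


# Toy sizes: 3-dimensional word vectors, 2 hidden units.
params = {
    "Wz": constant_matrix(2, 3, 0.1), "Uz": constant_matrix(2, 2, 0.1),
    "Wr": constant_matrix(2, 3, -0.1), "Ur": constant_matrix(2, 2, 0.2),
    "W": constant_matrix(2, 3, 0.3), "U": constant_matrix(2, 2, -0.2),
}
h = gru_step([1.0, 0.0, 0.5], [0.0, 0.0], params)
```

With all weights set to zero, both gates evaluate to 0.5 and the new memory to 0, so the cell simply halves the previous memory, which is a handy sanity check.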

Now that you understand the architecture of a GRU cell, you can do some really interesting things. For instance you could train this model and compare the perplexity that an LSTM yields in comparison with a GRU. You could also modify this piece of code and build your own language translator using GRUs instead of LSTMs.
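If you try the perplexity comparison, this is a minimal sketch of how perplexity can be computed from the probabilities a model assigned to the words that actually occurred. The function name and the example probabilities are made up for illustration.

```python
import math


def perplexity(probabilities):
    """Exponential of the average negative log-probability of the observed words."""
    n = len(probabilities)
    return math.exp(-sum(math.log(p) for p in probabilities) / n)


# A model that assigns probability 0.25 to every observed word is as
# uncertain as a uniform guess among four words.
uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
confident_model = perplexity([0.9, 0.8, 0.95, 0.85])
```

Lower is better: the model that yields the lower perplexity on the same held-out text is the better language model.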

Final Thoughts

Traditional machine translation models are Bayesian. Very broadly, they align the source language corpus with the target language corpus (usually at the sentence or paragraph level). After many repetitions of the alignment process, each block (sentence or paragraph) has many possible translations, and finally the best hypothesis is searched for using Bayes’ Theorem.
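The search for the best hypothesis is commonly written in the noisy-channel form below. The symbols are assumptions for illustration and do not come from the cited papers: f is the source sentence, e is a candidate translation, p(f | e) is a translation model learned from the alignments, and p(e) is a language model of the target language.

```latex
\hat{e} = \arg\max_{e} p(e \mid f)
        = \arg\max_{e} \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_{e} p(f \mid e)\, p(e)
```

The denominator p(f) can be dropped in the last step because it does not depend on e.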

EDIT: The papers cited in this post are from 2015 and before. As of March 2017 (when this post was written), we’ve been told that these systems are already used in production, replacing the Bayesian systems mentioned above.

[1] There are other reasons. For example, the RAM requirement only scales with the number of words.


The Lisp approach to AI (Part 1)

Sebastian Valencia — Tue, 28 Feb 2017 16:48:05 GMT

Common Lisp code to create an n-inputs m-units one layer perceptron. Taken from the code of AIMA, a classic textbook in Artificial Intelligence. The whole code here.

If you are a programmer who reads about the history and random facts of this lovely craft, and practices it ad honorem (just for fun), you have probably found yourself reading about a programming language called Lisp. Some praise it as a software miracle, as the best tool for programming. Some even dare to call Lisp one of the best programming languages ever invented (even if that doesn’t make sense at all). After all, before Python, Scala, and Haskell, there was programming, and before Deep Learning there was Artificial Intelligence. Great hackers that love Lisp:

Paul Graham, co-founder of Y Combinator, is a big Lisp evangelist. He wrote his startup’s code in Lisp. Viaweb (the startup) was co-founded along with Robert Tappan Morris, a legendary hacker who allegedly released the first computer worm accidentally. After being rehabilitated as a writer of worms (the worm was written in C), Robert Tappan Morris wrote the server code of the small company in Common Lisp. Viaweb was sold to Yahoo! in 1998 for $48 million. Of course, there’s not enough evidence yet to say that C leads you to jail while Lisp makes you a millionaire.
Alan Kay, a pioneer in the practical aspects of OOP and the lead developer of the original Smalltalk (another software miracle?), has called Lisp the greatest single programming language ever designed. He has also compared Lisp with Maxwell’s equations.
Edsger Wybe Dijkstra, a pioneer in the field of formal verification and specification, concurrency theory, and operating systems design, whose most famous work is a greedy one, once said:

Lisp has jokingly been called the most intelligent way to misuse a computer. I think that description is a great compliment because it transmits the full flavor of liberation: it has assisted a number of our most gifted fellow humans in thinking previously impossible thoughts.

Robert Floyd, a Stanford professor without a PhD (I’m just joking, but that’s true), the designer of the Floyd-Warshall algorithm, a pioneer in axiomatic semantics, and the author of the best book to learn mathematical induction from as a hacker, once said:

Although my own previous enthusiasm has been for syntactically rich languages, like the Algol family, I now see clearly and concretely the force of Minsky’s 1970 Turing Lecture, in which he argued that Lisp’s uniformity of structure and power of self reference gave the programmer capabilities whose content was well worth the sacrifice of visual form.

Some CS celebrities that have treated Lisp as a miracle (sometimes). A venture capitalist, a musician, Dijkstra’s algorithm inventor, and Robert Floyd (he was highly appreciated by Donald Knuth).

Three of our luminaries, along with Marvin Minsky (the guy referred to by Floyd) and John McCarthy (the inventor of Lisp), received the Turing Award. So why do many CS celebrities speak so highly of a simple programming language? Lisp is famous nowadays because of the things others have said about it, but in the early days of AI, Lisp was the de facto language to express ideas related to natural language processing, computer assisted geometry, text generation, AI planning, and automated theorem proving. Yes, there was AI before Machine Learning; indeed, there was an AI winter before the boom of neural networks and statistical approaches to AI, but that’s a topic that deserves an entire post of its own.

The Lisp approach to AI

John McCarthy, the inventor of the term “Artificial Intelligence”, the inventor of garbage collection, and the inventor of Lisp. Marvin Minsky, the founder of the AI lab at MIT.

The progress, development, and evolution of Lisp was tightly related to the early progress, development, and evolution of Artificial Intelligence. Two of the guys mentioned before were pioneers in AI. John McCarthy, the creator of Lisp, coined the term Artificial Intelligence, while Marvin Minsky shaped the content of the new field by founding the AI lab at MIT. Many of their students were the developers of the first digital milestones of artificial intelligence.

Programs for natural language understanding and generation, game playing (the link contains a paper from the man who introduced the term Machine Learning), theorem proving, early computer vision, symbolic mathematics (especially integration), problem-solving, and knowledge representation were produced at Stanford and MIT using different dialects of Lisp as a tool to express those ideas in. Was it just a coincidence, or is there something special about the idea (not just the language) of Lisp? This is a list of some classic AI programs that were expressed in Lisp.

ELIZA was a natural language processing computer program originally written by Joseph Weizenbaum at MIT. The program was a simple therapist to interact with, showing a trivial and superficial communication between a human and a machine. It was implemented in a pre-processor of Fortran that mimics Lisp features for list processing and composition of nested expressions. It was later written in Lisp by Bernie Cosell.

A typical conversation between a human and ELIZA. The paper that introduced the program is called ELIZA — A Computer Program for the Study of Natural Language Communication between Man and Machine

MACSYMA (MAC’s symbolic manipulator) was one of the first computer algebra systems originally developed at MIT’s project MAC. MACSYMA was written in a dialect of Lisp called MacLisp, and at that time it was one of the biggest Lisp programs out there. In 1982, MACSYMA was licensed to Symbolics, a computer hardware company whose main focus was the production of machines whose architecture was optimized for the development and interpretation of Lisp programs.

A very simple session in MACSYMA.

SHRDLU was the dissertation of Terry Winograd (the PhD advisor of Larry Page at Stanford University) at MIT. It was written in the AI lab created by Minsky to demonstrate a dialog with the machine that could lead to actions taken by the machine in a virtual environment that both agents (the human and the machine) were capable of understanding. Like MACSYMA, SHRDLU was written in MacLisp.

A sample session in SHRDLU. The program was supposed to understand and execute actions told by a human in natural language.

The progress of AI in its early days was not because of Lisp; I do think CS subjects should be agnostic of the language their ideas are expressed in. Lisp was used in the early days of AI because it was flexible enough to allow quick experimentation and prototyping (REPL), and it introduced fundamental ideas that were cool and fresh at the time (the IF-THEN-ELSE construct, recursion, and garbage collection). Those features proved useful for expressing the kind of ideas AI people needed to express. This innovation, and the rapid adoption of Lisp for AI (in labs and projects), helped the language grow and become a standard AI language.

Of course all these programs could have been written in other languages, but Lisp was an accepted and highly praised vehicle to explore and implement these kinds of ideas at the time.

Lisp in the real world

At this point, you may think that Lisp was just an academic invention to teach and implement symbolic AI programs. But the rapid adoption of Lisp in academia led to a massive effort to embrace Lisp (or any of its descendants) in real-world, production-ready software. The following is a collection of some of those programs; most of the programs in this list are still running in production environments, while the rest used to back large pieces of software in well-known projects or companies.

You need to know that Lisp and its dialects have evolved a lot since McCarthy first defined it, but most of the original idea of Lisp has been preserved in its descendants. Most of Lisp’s semantics has remained invariant across the Lisp implementations that were capable of powering or supporting, in one way or another, the operation of the following projects.

Some of the projects/companies whose stack has included Lisp.

After the pain Bernie Greenberg and Richard Stallman went through while implementing a language to manipulate text in the TECO text editor, Bernie decided to implement a whole new editor (written in MacLisp) with an interpreter that allowed users to manipulate the text being edited. Due to the poor portability offered by MacLisp, Richard Stallman decided to implement the editor in C, keeping the interpreted language to customize both the text and the editor. This language is called Emacs Lisp.
Douglas Lenat, a persistent believer in the power of symbolic AI, has worked on three famous AI programs through his lifetime. The first one, Automated Mathematician (AM), made heavy use of Lisp’s ability to represent programs as data (you’ll understand this later) to define a bunch of mathematical concepts that could serve as a basis to solve math problems. Its sequel, Eurisko, written in RLL-1, a language written in Lisp, sought to extend AM’s potential to other fields by working with heuristics (an abstract concept that’s hard to define using a programming language). The frustration Lenat experienced while working on Eurisko led him to start Cyc in 1984, a project to introduce common sense to machines that was supposed to enable computers to perform human-like reasoning; since 1994 it has been developed by his own company, Cycorp, Inc. This effort, often described as impractical, was launched in 2014.
Some of our favorite sources of information, Reddit and Hacker News, are or were websites powered by Lisp. Reddit was originally written in Common Lisp, the “standard” Lisp dialect, but it was rewritten in Python in 2005. Hacker News is itself powered by Arc, a programming language written by Paul Graham using the Racket programming language (another descendant of Lisp).
Planning and logistics are hard problems due to their size and the number of variables involved. AI has been able to deal with those problems by finding “clever” ways to optimize search in complex data structures. With this fact in mind, the U.S. military chose to simulate the feasibility of strategies for supply or personnel transportation using the DART program, written in Common Lisp; DART was used in the Gulf War, where it brought large budget savings.
Besides military usage, planning and scheduling with AI have found a place in industrial software. Routific is a Route Optimization as a Service startup whose routing engine, entirely written in Common Lisp, plans optimal routes for delivery companies, optimizing time and fuel spent. ITA Software, a company acquired by Google in 2011, offers its customers a simple travel search engine to search for cheap air trips taking several variables into account. ITA Software makes use of sophisticated algorithms expressed in Common Lisp.

One of Lisp’s main virtues is that it enables a programmer to create new linguistic abstractions with ease. So it should be no surprise that Lisp has influenced many popular programming languages, including two that are very close to the AI/Data Science/ML community (besides Lisp itself): R and Julia.

R was originally written as a very simple Lisp interpreter using as reference a chapter of a very popular introductory textbook on computer science, and a really good but surprisingly unknown book on Programming Languages. Lisp held an enormous influence in the development and conception of the first R implementation as documented by Ross Ihaka (the creator of R) many times:

Julia development was heavily inspired by the same Lisp dialect that inspired R. That influence was so big, that the language developers decided to write some parts of the language pipeline in it. The Julia parser is written entirely in Scheme and it’s evaluated using a Lisp dialect written by one of the language designers (femtolisp).
Another language that’s worth mentioning is Lush, a scientific object-oriented programming language designed to prototype numerical analysis, computer vision and machine learning programs. It was designed and implemented by Yann LeCun, the man behind the introduction of Convolutional Neural Networks to Computer Vision (along with Kunihiko Fukushima), and the current director of Facebook’s AI lab.

If Lisp is so great, why isn’t Lisp TensorFlow’s main language?

Most of the programs mentioned earlier made heavy use of symbolic manipulation. As mentioned by Carlos E. Perez in his post The Many Tribes of Artificial Intelligence, before ML and the Neural Network boom there were symbol-based approaches to AI, which manipulated symbols according to a collection of rules designed to encapsulate the behavior of an intelligent system. The problem in those days was not the efficient computation of numerical problems, but the manipulation and synthesis of symbols.

Just as C, C++, and Fortran shine in numerical computation where performance matters the most, Lisp shines in symbolic manipulation. One of Lisp’s greatest strengths is being able to handle symbols and lists efficiently.

Lisp is not a perfect language. It has many flaws (lots of dialects, a lack of well-known libraries, a weird syntax that does not help attract people, dynamic typing, etc.), but it was a well-suited tool for the problems AI pioneers were trying to tackle in those days, in the same way that C/C++ or Fortran are a perfect choice to implement the underpinnings of a Deep Learning system (TensorFlow is implemented in both C++ and Python). There’s no single Swiss army knife programming language; we need to pick the language that best suits the particular task we’re approaching.

Exploring AI with Lisp

The whole idea of this series is to use Lisp, more specifically its dialect Scheme, to explore ideas related to Artificial Intelligence (AI is much more than programming, and AI programming is much more than Lisp). The goal is to learn together about classical AI concepts such as general problem solving, text generation, symbolic mathematics problems, knowledge representation, expert systems, search, NLP, logical and stochastic reasoning, game playing, and even “contemporary” stuff such as neural networks, using the Scheme programming language to express those ideas.

Let’s begin our journey exploring Artificial Intelligence using Lisp. Your homework for the next post in the series is to install MIT-Scheme on your machine.


My first experience with deep reinforcement learning

Diego Montoya Sefair — Tue, 21 Feb 2017 17:10:04 GMT

Image from http://ai.berkeley.edu

Note: This article assumes previous knowledge on the basics behind neural networks and Q-learning

About six months ago I found myself needing to choose a topic for my undergraduate thesis project. Since there wasn’t much AI in my major’s curriculum, I chose to do research in that field to gain some knowledge. Next I had to decide which AI subtopic to work on, and it quickly became clear to me which one it should be.

I have always been fascinated by neural networks and their ability to learn to approximate any function at all. I have always thought that this is an absolutely remarkable feature since many (if not all) problems can be modeled as a function (i.e. something that takes some input, does some processing, and produces some output). It seems to me that, while we are still far from getting there, neural networks could play a very important role in the path toward the ultimate goal of reaching a general AI.

On the other hand, in recent years a small company called DeepMind (now owned by Google) had shown great advances in reinforcement learning, and specifically in what it calls deep reinforcement learning (i.e. combining neural networks with reinforcement learning). In the case of Q-learning, the principle is that since neural networks are very good function approximators, why not use them to approximate the Q-function? Combining deep learning with Q-learning is a very cool concept, since the techniques previously used to approximate the Q-function quickly became infeasible once the state representation grew in dimensionality. This technique enabled DeepMind to build an algorithm capable of playing many Atari games better than professional human players without explicitly coding the logic and rules of each game [1]. In other words, the algorithm learned by itself what was best to do just by looking at the pixels of the game and the score, and by being able to choose an action (i.e. manipulate the controls of the game) like any human player would.
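
To make the idea more concrete, here is a minimal sketch of the Q-learning update that the neural network is asked to approximate. It uses a plain lookup table instead of a network, and the state names, action names, learning rate and discount factor are illustrative assumptions, not values taken from DeepMind's paper.

```python
from collections import defaultdict

ALPHA = 0.5                  # learning rate (assumed value)
GAMMA = 0.9                  # discount factor (assumed value)
ACTIONS = ["left", "right"]  # hypothetical action set

# Q-table: maps (state, action) to an estimated return, defaulting to 0.0.
# Deep Q-learning replaces this table with a neural network.
q_table = defaultdict(float)

def q_update(state, action, reward, next_state):
    """Apply one Q-learning update for a single observed transition."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
    return q_table[(state, action)]

# One transition: moving right from s0 reaches s1 and earns a reward of 1.
print(q_update("s0", "right", 1.0, "s1"))
```

The table needs one entry per (state, action) pair, which is exactly what stops scaling once states are raw game frames; a network that takes the state as input sidesteps that.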

But more than this, reinforcement learning is another fascinating side of machine learning, since it resembles the way we humans learn. Everything we do in life gets us a reward in return, be it positive or negative. Doing a good job will get us the approval of our colleagues or our boss, money, or even a smile from whoever benefited from the job. Those things feel good and our brains release dopamine, so we want to do them again: a positive reward. But getting into a car crash doesn’t feel good, so next time we will try to be more careful since we don’t want that to happen again. We want to maximize our rewards, we learn, and we do it by reinforcement. With experience we get better at doing something, just as reinforcement learning algorithms do.

That said, with both deep learning and reinforcement learning we can model a huge variety of the problems we as humans face every day, and this is what makes them very interesting. These techniques are what power systems like autonomous cars for example. Could they be the answer to achieving a general AI? Only time will tell, but they are certainly getting us to interesting things.

Now, for the project…

From the results that DeepMind published in one of its papers, one of the graphs looked like this:

Comparison of DeepMind's DQN with the best reinforcement learning algorithms in the literature [1]

The above graph shows how DeepMind’s algorithm performed with respect to “the best reinforcement learning methods in the literature”. However, the interesting part is the line in the middle, which shows how the algorithm performed in comparison to professional human players. The performance of the algorithms was normalized with respect to the performance of the human players (the 100% level). As you can see, the performance of the algorithm on games like Ms. PacMan was really low. They don’t specifically mention the reason behind this, but it seems to be related to the relatively long-term planning that the game requires, combined with the fact that Q-learning as it is commonly implemented is known to have these kinds of temporal limitations.

After reading the publication I had some questions about DeepMind’s approach, specifically about the fact that they were using the raw pixels of the game as the state representation. This is remarkable since it is the same information that our brains receive as input, and it is also very good in the sense that it generalizes well to other games. However, I wondered what would happen if we gave the agent more “calculated” information, i.e. a state representation composed of information other than the pixels. What kind of impact does the state representation have on the learning process and its result? This is when I decided to work with a game (PacMan), write a deep Q-learning agent in Python and look for answers.

To clarify, I wasn’t trying to improve on the performance that DeepMind achieved in the games below the human line. What I’m trying to show is how the questions that turned into my senior thesis came to be, and it was that “human-level” line that sparked those questions in me.

Finding the impact of changing the state representation was the main objective of the research at first. However, one more question arose during the process that I thought was worth investigating, namely whether (and how) varying the topology of the neural network, and having a persisted experience replay file before beginning training, affected the learning process and the result of the algorithm.

To expand a little on the second part, experience replay is a trick that has proven to be one of the most important optimizations for enabling the neural network to learn in a reasonable time (or even to converge at all). This is because the technique breaks the correlation between consecutive transitions while also giving the network a chance to reinforce its knowledge of previous experiences more efficiently. What I wanted to know was, given that experience replay is helpful, could it also be helpful to have a large pre-populated (persisted) experience replay memory right from the start? Could this help the algorithm converge faster than having to populate the replay memory from zero each time?
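
As a rough illustration of the mechanism, a replay memory can be sketched as a fixed-size buffer that is sampled at random. The class name, the transition layout and the capacity below are my own assumptions for the example and are not taken from the thesis code.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the ordering of consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push((f"s{step}", "right", 0.0, f"s{step + 1}", False))

batch = memory.sample(2)
print(len(memory), len(batch))
```

A persisted memory would amount to saving this buffer to disk after one run and loading it before the next one starts training; that step is left out here.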

At this point I would be telling you much of what is written in the report [2], so I would encourage you to read the paper if you find the research interesting. In a nutshell though, I discovered that all three aspects affect the learning process considerably. Firstly, I could see that the state representation should be as simple as possible (but complete), since simpler state representations are considerably easier to train on. Secondly, I found that having a pre-calculated, large, persisted replay memory has the potential to improve learning speed notably, but one should take some precautions so that having one does not bias what the agent learns (e.g. when past experiences greatly outnumber new experiences). Lastly, I could also see that changing the topology of the neural network does have an important impact on the learning result. The hypothesis that I could extract is that larger networks take considerably longer to train and were not able to train in time, so one should choose an appropriate topology (i.e. one that is complex enough to approximate the Q-function for the particular problem, but not more complex than that).

The experience

I had a lot of fun with this project, but what’s more, I learned a lot. Now, from my experience, I think reinforcement learning has the potential to be very powerful, especially in combination with neural networks. However, this combination is what can make the process a little frustrating if you are expecting your model to learn in a matter of a couple of hours and then win every game you play against it. The reality is that neural networks, while very powerful, can require very fine tuning to actually learn something. But more than this, you have to be very careful, for example, with the parameters you choose, the topology, and the activation functions as some of these aspects can represent the difference between a neural network that does a very good job in a reasonable time and a neural network that doesn’t learn anything. In summary, getting a good model can require many optimizations and dedication, however when you achieve one the results can be very surprising (as groups like DeepMind have shown us).

Another aspect that can get difficult to handle is the computational cost of training a model of this kind. If you have a good GPU sitting on your desk you can improve training time by quite a lot, since neural networks benefit greatly from massive parallelism. However, not many have a GPU to spare, so testing can become a little tedious as each training session can take several hours or even days depending on the problem.

To sum up, you will learn a lot from doing different experiments. Nonetheless, if you plan to get immersed in deep reinforcement learning (pun not initially intended), I would recommend first having a good understanding of neural networks and some patience; a good machine or GPU could also be very helpful, depending on the problem.

Finally, why PacMan?

Since it’s a game I really like — I mean, who doesn’t?

References

[1] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015

[2] D. Montoya, “Exploring how different state representations and configurations affect the learning process and outcome of deep Q-learning algorithms,” Universidad de los Andes, 2016.

Interesting links

Great how-to on deep reinforcement learning: T. Matiisen, “Guest Post (Part I): Demystifying Deep Reinforcement Learning,” Nervana
Great book on neural networks: M. A. Nielsen, Neural Networks and Deep Learning.
(Video) DeepMind's algorithm playing Atari Breakout
(Video) A short introduction to DeepMind

My first experience with deep reinforcement learning was originally published in AI Society on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hello, Gradient Descent

Juan Camilo Bages Prada — Thu, 16 Feb 2017 21:00:13 GMT

Gradient Descent. Image taken from http://blog.datumbox.com/wp-content/uploads/2013/10/gradient-descent.png

Hi there! This article is part of a series called “Hello, <algorithm>”. In this series we will give some insights about how different AI algorithms work, and we will have fun by implementing them. Today we are gonna talk about Gradient Descent, a simple yet powerful optimization algorithm for finding the (local) minimum of a given function.

Getting some insight

The inspiration behind Gradient Descent comes directly from calculus. Basically, it states that if we have a differentiable function, the fastest way to decrease it is by taking steps proportional to the negative of the function’s gradient at the current point. This happens because the gradient points in the direction of steepest ascent of the function’s surface at that point.

In other words, think about the function’s surface as a mountain that you are hiking down. You know that your goal is to reach the bottom, and you may think that the fastest way to accomplish this is by proceeding through the path that makes you descend the most. In this case, that path points to the opposite of the steepest mountain direction upwards.

With this in mind, we can repeatedly take these steps in the appropriate direction and, with a suitable step size, we should eventually converge to a (local) minimum. Following our analogy, this is the equivalent of arriving at the bottom of our mountain.

Hiking down a mountain. Image taken from https://raftrek.com/wp-content/uploads/2015/10/Hiking-down-mountain-ridge.jpg

Calculating the next step

So we’ve been talking about taking steps in the right direction, but how can we calculate them? Well, the answer is in the following equation:

Θ¹ = Θ⁰ − α∇J(Θ), with the gradient evaluated at Θ⁰

Gradient Descent Step.

This formula needs some clarification. Let’s say we are currently in a position Θ⁰, and we want to get to a position Θ¹. As we said previously, every step is gonna be proportional to the opposite direction of the function’s gradient at any given point. So this definition of step for a given function J(Θ) will be equal to −α∇J(Θ).

Why minus and not plus?

Remember that we take steps in the opposite direction of the gradient. To achieve this, we subtract α∇J(Θ) from our current position.
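
Here is a minimal sketch of that update rule in code. The cost function J(Θ) = (Θ − 3)², the starting point and the learning rate are hypothetical choices made only for this illustration.

```python
def cost(theta):
    """Hypothetical cost J(theta) = (theta - 3)^2, with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def gradient(theta):
    """Derivative of the cost above: dJ/dtheta = 2 * (theta - 3)."""
    return 2.0 * (theta - 3.0)

def step(theta, alpha):
    """One gradient descent step: theta_next = theta - alpha * gradient(theta)."""
    return theta - alpha * gradient(theta)

alpha = 0.25   # learning rate (assumed value)
theta = 0.0    # starting position, playing the role of theta^0
for i in range(5):
    theta = step(theta, alpha)
    print(i + 1, theta, cost(theta))
```

Because the gradient is negative to the left of the minimum, subtracting it moves Θ to the right, toward 3; with a plus sign the same loop would walk away from the minimum.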

Okay that sounds good, but what do you mean by α?

The symbol α is called the Learning Rate. This value scales the size of each step; keeping it small makes us take little steps so we don’t overshoot the (local) minimum. A bad choice for α leads to one of the following problems:

If α is too small, our learning algorithm is gonna take too much time to converge.
If α is too large, our learning algorithm might overshoot the bottom, and even diverge, in which case the loop never terminates.

Take a look at the following examples to see what happens when we make a bad choice for the Learning Rate α:

Bad choices for Learning Rate α. Image taken from https://storage.googleapis.com/supplemental_media/udacityu/315142919/Gradient%20Descent.pdf
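
The same effect can be reproduced numerically. The sketch below runs a fixed number of steps on a hypothetical cost J(Θ) = (Θ − 3)² with three learning rates; the function and the three values of α are assumptions picked for illustration.

```python
def run(alpha, steps=20, start=0.0):
    """Run gradient descent on the hypothetical cost J(theta) = (theta - 3)^2."""
    theta = start
    for _ in range(steps):
        theta -= alpha * 2.0 * (theta - 3.0)  # gradient of J is 2 * (theta - 3)
    return theta

for alpha in (0.01, 0.25, 1.1):
    theta = run(alpha)
    print(f"alpha={alpha}: theta={theta:.4f}, distance to minimum={abs(theta - 3.0):.4f}")
```

With the smallest α the result is still far from 3 after 20 steps, the middle value lands very close to it, and the largest one ends up farther away than where it started, because every step overshoots by more than the previous one.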

Calculating the gradient

The gradient of a function J(Θ) (denoted by ∇J(Θ)) is a vector of partial derivatives with respect to each dimension or parameter Θᵢ. Notational details are given in the equation below:

∇J(Θ) = (∂J/∂Θ₁, ∂J/∂Θ₂, …, ∂J/∂Θₙ)

A little example

To make this definition of gradient clearer, let’s calculate the gradient of the following function:

As we can see, this function contains three parameters or dimensions. Thus the appropriate way to proceed is by calculating the partial derivative with respect to each param:

Now we can group those values and that will give us the function’s gradient:

And that’s it! With this vector, we can get the steepest direction at any given point simply by replacing each parameter with its corresponding value:
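
The procedure can also be checked in code. The three-parameter function below is a stand-in chosen for this sketch (it is not the function used in the example above); its hand-derived gradient is compared against a finite-difference estimate.

```python
def J(theta):
    """Stand-in function of three parameters: J = t1^2 + t1 * t2 + 3 * t3."""
    t1, t2, t3 = theta
    return t1 ** 2 + t1 * t2 + 3.0 * t3

def analytic_gradient(theta):
    """Vector of partial derivatives of J, worked out by hand."""
    t1, t2, t3 = theta
    return [2.0 * t1 + t2, t1, 3.0]

def numeric_gradient(f, theta, eps=1e-6):
    """Central-difference estimate of each partial derivative of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad

point = [1.0, 2.0, 3.0]
print(analytic_gradient(point))   # [4.0, 1.0, 3.0]
print(numeric_gradient(J, point))
```

Comparing the two vectors is a handy sanity check whenever you derive a gradient by hand before plugging it into gradient descent.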

Hacking Time

And now, for the grand finale, we will go through a full example and we will code our own algorithm for gradient descent.

Defining the example

In this section, we will apply linear regression in order to find the correct function approximation for a given set of points in a plane. The set of points we are trying to predict looks as follows:

As is common, the choice for J(Θ) will be the least-squares cost function for measuring the error of an approximation:

In the equation above:

m is the number of points in the set.
½ is a convenient constant that will cancel out when we take the gradient of J(Θ). This makes maths nicer and doesn’t affect the result.
y is the real value of the y-coordinate for the ith point.
h is our function approximation. It will give us the predicted y-coordinate for the ith-point using parameters Θ and input x.

Finally, before beginning to code let’s calculate the gradient vector of our function. You can see that as we’ve got two parameters for Θ, we will need to calculate two partial derivatives.

Okay, time to proceed. It’s important to mention that our implementation will be in a vectorized form. This means that we will transform all the formulas mentioned above into matrix operations. The advantages of this implementation are that the code will be more concise, and that our computer can take advantage of optimized underlying matrix algorithms.

To work with the vectorized form, we need to add a dummy variable x0 with a value of 1 to each point. The reason for this is that when we perform the matrix multiplication, the intercept parameter Θ0 will be multiplied by that 1 and keep its value, as the equations above require.

Below you can see the vectorized form of the error function J(Θ) and its gradient ∇J(Θ):

Coding time

With every function defined, we can proceed to code our algorithm. The first thing we should do is to declare the points dataset and the Learning Rate α.

https://medium.com/media/7b76e549ff9ea82e518878f950711e0c/href

Now we can proceed by defining the error function J(Θ) and its gradient ∇J(Θ). Remember everything will be defined in a vectorized way.

https://medium.com/media/8279417c1c1efbbce632ceae8b2f2fdf/href

This is the heart of our code. Here we will perform steps that update Θ until we reach the (local) minimum. That is, until the absolute value of every component of the gradient vector is less than or equal to some specified threshold (1e-5, i.e. 10⁻⁵, in this case).

https://medium.com/media/b350db4231e90b77e74edf8ff8fc7cd5/href
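
As a rough, dependency-free sketch of the steps described so far, the whole procedure could look like the code below. The data points, the learning rate, the function names, and the exact form of the cost, J(Θ) = (1/2m) Σ (h(xᵢ) − yᵢ)², are assumptions made for this example; the matrix products are written out with plain lists, where a matrix library would do each in one line.

```python
# Hypothetical points lying exactly on y = 2x + 1 (not the article's dataset).
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
X = [[1.0, x] for x, _ in points]    # dummy x0 = 1 prepended to each point
y = [target for _, target in points]
alpha = 0.1                          # learning rate (assumed value)
threshold = 1e-5                     # stopping threshold on the gradient

def predict(theta, row):
    """h(x) = theta0 * x0 + theta1 * x1 for one row of X."""
    return sum(t * v for t, v in zip(theta, row))

def error_function(theta):
    """Least-squares cost, assuming J = (1 / 2m) * sum((h(x) - y)^2)."""
    m = len(y)
    return sum((predict(theta, row) - t) ** 2 for row, t in zip(X, y)) / (2.0 * m)

def gradient_function(theta):
    """Gradient of the cost above: (1 / m) * X^T (X theta - y)."""
    m = len(y)
    residuals = [predict(theta, row) - t for row, t in zip(X, y)]
    return [sum(r * row[j] for r, row in zip(residuals, X)) / m
            for j in range(len(theta))]

def gradient_descent(theta, max_steps=100000):
    """Update theta until every gradient component is within the threshold."""
    for _ in range(max_steps):
        gradient = gradient_function(theta)
        if all(abs(g) <= threshold for g in gradient):
            break
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta

theta = gradient_descent([1.0, 1.0])
print(theta, error_function(theta))
```

The `max_steps` cap is a safety net I added: if α is chosen too large the gradient never drops below the threshold, and without the cap the loop would run forever.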

And we’re done! You can see the complete code in the snippet below:

https://medium.com/media/b2ac5730d00831004b5750c865432582/href

Now we can run our algorithm and it will give us the optimal values for Θ that minimize the error. Below you can see the answers I obtained after running it on my computer:

This is the scatter plot we showed before with the line corresponding to the optimal Θ:

Well, we’ve finished our code and our article. I hope that you’ve learned a thing or two about Gradient Descent and, more importantly, that you are now really excited to keep learning by taking a look at the further reading list.

An Introductory Recommender Systems Tutorial

Sebastian Valencia — Fri, 10 Feb 2017 02:53:40 GMT

A Recommender System predicts the likelihood that a user would prefer an item. Based on the user’s previous interactions with the data source the system draws on (as well as data from other users, or historical trends), the system is capable of recommending an item to a user. Think about how Amazon recommends books that they think you might like; Amazon might be making effective use of a Recommender System behind the curtains. This simple definition lets us think of a diverse set of applications where Recommender Systems might be useful. Applications that recommend documents, movies, music, romantic partners, or who to follow on Twitter are pervasive and widely known in the world of Information Retrieval.

SICP is a book about Scheme, PLT, Computer Science, etc. Customers who bought it also bought (a statistical sample) books about Scheme and Functional Programming. Apparently, Amazon makes use of “similar” users to recommend items to me.

Such amazing applications carry a huge amount of theory behind them. While the theory can be a little intimidating and dry, a basic understanding of data structures, a programming language, and a little bit of abstraction is all you need to understand the basics of recommender systems.

In this tutorial, we will help you gain a basic understanding of collaborative-filtering-based Recommender Systems by building the most basic Recommender System out there. We hope that this tutorial motivates you to find out more about Recommender Systems, both in theory and practice. The prerequisites for reading this tutorial are knowledge of a programming language (we’ll use Python, but if you know how hash maps and lists work, you’re in good shape) and a little bit of high-school algebra. You do not need prior exposure to Recommender Systems.

This tutorial makes use of a class of RS (Recommender System) algorithm called collaborative filtering. A collaborative filtering algorithm works by finding a set of people (assuming people are the only clients or users of an RS) with preferences or tastes similar to the target user’s. Using this smaller set of “similar” people, it constructs a ranked list of suggestions. There are several ways to measure the similarity of two people. It’s important to highlight that we’re not going to use attributes or descriptors of an item to recommend it; we’re only using people’s tastes or preferences for that item.

Assuming that our users are people, and our items are simply that: items, we need to organize our data to ease the processing step. We’re assuming that the data fits in memory, and that you can organize the data as follows.

The data structure that we are going to use consists of people pointing to a dictionary whose keys are the items and whose values are each person’s numeric preference for that item. If a person has never rated the item, C[i, j] is null. In this notation, C[i, j] represents the numeric rating given by Person j to Item i. No matter how the ratings are expressed, we need to convert them to numeric values. A sample data structure for our working example is the following Python dictionary. It includes some ratings by people (if you wonder who these folks are, please click on them; we computer scientists owe much to them) of computer science related books. The whole code for this toy Recommender System is on GitHub.

data = {
    'Alan Perlis': {
        'Artificial intelligence': 1.46,
        'Systems programming': 5.0,
        'Software engineering': 3.34,
        'Databases': 2.32,
    },

    'Marvin Minsky': {
        'Artificial intelligence': 5.0,
        'Systems programming': 2.54,
        'Computation': 4.32,
        'Algorithms': 2.76,
    },

    'John McCarthy': {
        'Artificial intelligence': 5.0,
        'Programming language theory': 4.72,
        'Systems programming': 3.25,
        'Concurrency': 3.61,
        'Formal methods': 3.58,
        'Computation': 3.23,
        'Algorithms': 3.03,
    },

    'Edsger Dijkstra': {
        'Programming language theory': 4.34,
        'Systems programming': 4.52,
        'Software engineering': 4.04,
        'Concurrency': 3.97,
        'Formal methods': 5.0,
        'Algorithms': 4.92,
    },

    'Donald Knuth': {
        'Programming language theory': 4.33,
        'Systems programming': 3.57,
        'Computation': 4.39,
        'Algorithms': 5.0,
    },

    'John Backus': {
        'Programming language theory': 4.58,
        'Systems programming': 4.43,
        'Software engineering': 4.38,
        'Formal methods': 2.42,
        'Databases': 2.80,
    },

    'Robert Floyd': {
        'Programming language theory': 4.24,
        'Systems programming': 2.17,
        'Concurrency': 2.92,
        'Formal methods': 5.0,
        'Computation': 3.18,
        'Algorithms': 5.0,
    },

    'Tony Hoare': {
        'Programming language theory': 4.64,
        'Systems programming': 4.38,
        'Software engineering': 3.62,
        'Concurrency': 4.88,
        'Formal methods': 4.72,
        'Algorithms': 4.38,
    },

    'Edgar Codd': {
        'Systems programming': 4.60,
        'Software engineering': 3.54,
        'Concurrency': 4.28,
        'Formal methods': 1.53,
        'Databases': 5.0,
    },

    'Dennis Ritchie': {
        'Programming language theory': 3.45,
        'Systems programming': 5.0,
        'Software engineering': 4.83,
    },

    'Niklaus Wirth': {
        'Programming language theory': 4.23,
        'Systems programming': 4.22,
        'Software engineering': 4.74,
        'Formal methods': 3.83,
        'Algorithms': 3.95,
    },

    'Robin Milner': {
        'Programming language theory': 5.0,
        'Systems programming': 1.66,
        'Concurrency': 4.62,
        'Formal methods': 3.94,
    },

    'Leslie Lamport': {
        'Programming language theory': 1.5,
        'Systems programming': 2.76,
        'Software engineering': 3.76,
        'Concurrency': 5.0,
        'Formal methods': 4.93,
        'Algorithms': 4.63,
    },

    'Michael Stonebraker': {
        'Systems programming': 4.67,
        'Software engineering': 3.86,
        'Concurrency': 4.14,
        'Databases': 5.0,
    },
}

In this example, Leslie Lamport rates the book Software engineering with 3.76, while Robin Milner rates the Programming language theory book with 5.0. A simple problem that we might want to solve using this dataset and a recommender system is how likely Marvin Minsky is to like the title Programming language theory. To solve this kind of problem, we need a way to measure how similar people are based on their ratings. A naive but popular approach is to compare every pair of people and compute a similarity score; the problem then is to find an adequate similarity score. The most common approaches to the similarity problem are scoring by Euclidean Distance and using the Pearson Correlation Coefficient; both are deeply rooted in statistics and linear algebra.

Euclidean distance score

The Euclidean distance between two points is the length of the line segment connecting them. Our Euclidean space in this particular case is the positive portion of the plane where the axes are the rated items and the points represent the scores that a particular person gives to both items. Two people belong to a certain preference space if and only if they have both rated the two items that define the preference space. So we define a preference space for each pair of distinct items, and the points in this preference space are given by the people who rated the two items. To visualize this idea, we consider the preference space defined by the items Systems programming and Programming language theory.

This plot shows the users according to their tastes in both Systems Programming and Programming Language Theory. We can recognize similar users by looking at the cluster they belong to. For example, Robert Floyd is not similar to Leslie Lamport taking into account those two items.

The figure shows the people who have rated both items in a preference space defined by those items, and the scores given by each person to each item independently. In the chart, Leslie Lamport appears so low because he has rated Systems programming with 2.76 and Programming language theory with 1.5. We can clearly see that, regarding these items, John McCarthy and Tony Hoare are pretty similar, while Robin Milner and Robert Floyd are slightly different; Dennis Ritchie and Leslie Lamport have little in common (regarding those items). We can now define the distance between two people in the preference space just as we define the distance between a pair of points in the plane:

If d(Person[i], Person[j]) is small, then Person[i] is similar to Person[j]. Since we want a metric that tells us how similar two people are, we need a number that grows with the similarity of Person[i] and Person[j]. To achieve that, we take a normalized value based on d(Person[i], Person[j]). Our final similarity metric based on Euclidean distance is:

This formula is designed to avoid division by zero and to give the proportionality that we need.

The closer this metric is to one, the more similar Person[i] is to Person[j]. If we extend this idea to the set of items that two people have both rated, we can design an algorithm that tells us the similarity of a pair based on their tastes. We just need the items common to the two people, and to compute this metric over them. The following algorithm computes the Euclidean Similarity between two people based on their common tastes. Those tastes are retrieved from our main data structure, stored in our data variable.

The Python implementation of the Euclidean Distance similarity measure is directly inspired by the formula we just found. The whole code for this toy Recommender System is on GitHub.
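
One possible sketch of such a function follows. It assumes the similarity takes the form 1 / (1 + d), where d is the Euclidean distance over all the items both people have rated; the function signature is my own choice, and the data is a small excerpt of the ratings above, so this is not necessarily the code in the repository.

```python
from math import sqrt

# Excerpt of the article's ratings, limited to two items for brevity.
data = {
    'John McCarthy': {'Programming language theory': 4.72, 'Systems programming': 3.25},
    'Tony Hoare': {'Programming language theory': 4.64, 'Systems programming': 4.38},
    'Dennis Ritchie': {'Programming language theory': 3.45, 'Systems programming': 5.0},
    'Leslie Lamport': {'Programming language theory': 1.5, 'Systems programming': 2.76},
}

def euclidean_similarity(data, person1, person2):
    """Return a score in (0, 1]; 1 means identical ratings on the shared items."""
    common = [item for item in data[person1] if item in data[person2]]
    if not common:
        return 0.0  # no shared items: treat the pair as not comparable
    distance = sqrt(sum((data[person1][item] - data[person2][item]) ** 2
                        for item in common))
    # Adding 1 keeps the denominator away from zero when the distance is 0.
    return 1.0 / (1.0 + distance)

print(euclidean_similarity(data, 'John McCarthy', 'Tony Hoare'))
print(euclidean_similarity(data, 'Dennis Ritchie', 'Leslie Lamport'))
```

The first pair scores higher than the second, which matches the reading of the chart: Ritchie and Lamport sit far apart on both axes.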

Once we have the data and the algorithm, we can analyze it. The major flaw of this algorithm, and of Euclidean-distance-based comparisons in general, is that if one person’s ratings tend to be higher than another person’s (one person is inclined to give higher scores than the other), this metric will classify them as dissimilar regardless of the correlation between the two people. There can still be a perfect correlation if the differences between their ratings are consistent. While a cleverer algorithm would classify them as similar, our Euclidean-based algorithm will say that the two people are very different because one is consistently harsher than the other. Whether that behavior is acceptable depends on the application of the recommender system (thus far, we have not created a recommender system; we’re just computing similarity).

Pearson correlation coefficient

In statistics, the Pearson correlation coefficient is a measure of the linear dependence or correlation between two variables X and Y. It has a value between +1 and −1 inclusive, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. In the case of recommender systems, we want to figure out how related two people are based on the items they have both rated. The Pearson Correlation Coefficient (PCC) is better understood in this case as a measure of how well two datasets fit a single straight line (we’re not taking dimensions into account). The derivation and the formula itself are harder to find and understand, but by using this method we eliminate the effect of harshness when measuring the relation between two people.

The PCC algorithm requires two datasets as inputs. Those datasets don’t come from everything each person has rated, but from the items that both people have rated. PCC helps us find the similarity of a pair of users. Rather than considering the distance between the ratings on two products, we consider the correlation between the users’ ratings.

To clarify the concept of correlation, we include a new dataset and some charts. The dataset includes a few ratings that some remarkable computer scientists gave to some CS books.

A sample dataset to instruct you on this similarity measure. How some famous computer scientists have rated those famous CS books.

To understand how related two people are, we plot their preferences (treating each book as a point whose coordinates are the ratings given to it by the two users). Once we have that plot, we need to find the best-fit straight line through those points. Finding such a line requires knowledge of linear regression, a topic that’s outside the scope of this tutorial. While finding the best-fit straight line is not as trivial as it seems, finding the PCC depends only on the data that we already have. The best-fit line simply helps us explain the concept.

The plot shows the 2-dimensional space defined by the ratings of Ullman and Carmack, as well as the best-fit straight line. The positive slope of the line shows a positive correlation between those points, so the PCC for Ullman and Carmack is positive.

The last plot shows a negative correlation between Navarro and Norvig.

If we have one dataset {x[1], x[2], …, x[n]} containing n elements, and another dataset {y[1], y[2], …, y[n]} containing n elements, the formula for the sample PCC is:

A little algebraic manipulation leads us to the following formula:

This formula lets us write a program to compute the PCC between two people.

The Python implementation of the Pearson Correlation Coefficient similarity measure is directly inspired by the formula we just found. The whole code for this toy Recommender System is on GitHub.
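
A sketch of what such a program might look like is shown below. It assumes the usual computational form of the sample PCC, r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)); the function signatures and the three raters are hypothetical and only serve to show how harshness is factored out.

```python
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    var_x = n * sum_x2 - sum_x ** 2
    var_y = n * sum_y2 - sum_y ** 2
    if n == 0 or var_x <= 0 or var_y <= 0:
        return 0.0  # constant or empty ratings: correlation is undefined
    return (n * sum_xy - sum_x * sum_y) / sqrt(var_x * var_y)

def pearson_similarity(data, person1, person2):
    """PCC computed over the items that both people have rated."""
    common = [item for item in data[person1] if item in data[person2]]
    xs = [data[person1][item] for item in common]
    ys = [data[person2][item] for item in common]
    return pearson(xs, ys)

# Hypothetical raters: 'harsh' always scores two points below 'generous'.
ratings = {
    'generous': {'book A': 5.0, 'book B': 4.0, 'book C': 3.0},
    'harsh': {'book A': 3.0, 'book B': 2.0, 'book C': 1.0},
    'contrarian': {'book A': 1.0, 'book B': 2.0, 'book C': 3.0},
}
print(pearson_similarity(ratings, 'generous', 'harsh'))
print(pearson_similarity(ratings, 'generous', 'contrarian'))
```

The generous and harsh raters never agree on a number, yet they order the books identically, so their correlation is as high as it can be; a Euclidean score would have marked them as fairly different.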

Both similarity measures allow us to figure out how similar two people are. The logic behind a recommender system is to measure everyone against a given person and find the people closest to that person. We can do that by taking the group of people for whom the distance is small, or the similarity is high.

Using this approach, we try to predict the rating our person would give to products he has not rated yet. One of the most widely used approaches to this problem is to take the ratings of all the other users and multiply each user’s similarity to our person by the rating they gave to the product. A very popular product that has been rated by many people would then carry a greater weight; to normalize this, we divide the weighted sum by the sum of the similarities of the people who have rated the product. The following function implements this approach.

How to recommend items or retrieve similar people given a specific person and a similarity measure. We’re using a higher-order function so the recommendations can be tested with any similarity measure function. The whole code for this toy Recommender System is on GitHub.
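
A minimal sketch of the weighted-average idea could look like this. The signature recommend(person, bound, similarity), the helper, and the decision to skip non-positive similarities are my own assumptions; the data is a four-person excerpt of the ratings above.

```python
from math import sqrt

data = {
    'Marvin Minsky': {'Artificial intelligence': 5.0, 'Systems programming': 2.54,
                      'Computation': 4.32, 'Algorithms': 2.76},
    'John McCarthy': {'Artificial intelligence': 5.0, 'Programming language theory': 4.72,
                      'Systems programming': 3.25, 'Concurrency': 3.61,
                      'Formal methods': 3.58, 'Computation': 3.23, 'Algorithms': 3.03},
    'Robin Milner': {'Programming language theory': 5.0, 'Systems programming': 1.66,
                     'Concurrency': 4.62, 'Formal methods': 3.94},
    'Edgar Codd': {'Systems programming': 4.60, 'Software engineering': 3.54,
                   'Concurrency': 4.28, 'Formal methods': 1.53, 'Databases': 5.0},
}

def euclidean_similarity(data, person1, person2):
    """Assumed form 1 / (1 + distance) over the items both people rated."""
    common = [i for i in data[person1] if i in data[person2]]
    if not common:
        return 0.0
    distance = sqrt(sum((data[person1][i] - data[person2][i]) ** 2 for i in common))
    return 1.0 / (1.0 + distance)

def recommend(person, bound, similarity=euclidean_similarity):
    """Return up to `bound` (estimated rating, item) pairs, best first."""
    weighted_sums = {}     # item -> sum of similarity * rating
    similarity_sums = {}   # item -> sum of similarities of people who rated it
    for other in data:
        if other == person:
            continue
        score = similarity(data, person, other)
        if score <= 0:
            continue  # ignore people with no or negative similarity
        for item, rating in data[other].items():
            if item in data[person]:
                continue  # only predict items the person has not rated
            weighted_sums[item] = weighted_sums.get(item, 0.0) + score * rating
            similarity_sums[item] = similarity_sums.get(item, 0.0) + score
    predictions = [(weighted_sums[item] / similarity_sums[item], item)
                   for item in weighted_sums]
    predictions.sort(reverse=True)
    return predictions[:bound]

for estimate, item in recommend('Marvin Minsky', 5):
    print(f"{item}: {estimate:.2f}")
```

Passing the similarity measure as an argument is what makes it easy to compare how the two measures change the recommendations for the same person.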

Given a person included in the index (data), a bound (that is, the maximum number of items to recommend), and a function to measure the similarity between people (euclidean_similarity or pearson_similarity), this function gives an estimate of how the person would rate each item according to how similar people rated it. As an example:

While Algorithms and Concurrency are perfect topics to recommend to Alan Perlis, or at least that is what our algorithm found, we should keep Marvin Minsky far from the Databases item. There is a strange phenomenon here: depending on the similarity measure, Marvin Minsky seems to either like a lot or dislike a little the Programming language theory and Formal methods items. If you inspect the scores variable in the code while calling recommend("Marvin Minsky", 5), you can tell that Robin Milner and John McCarthy are the closest to Marvin Minsky, even though Robin Milner and John McCarthy are very different from each other, and Robin Milner tends to rate a little more harshly than John McCarthy. That insight shows that we need to compare both measures against the nature of our data. The choice of bound also affects this kind of strange recommendation.

Data exploration and wrangling are significant factors when implementing a production recommender system. The more data it can process, the better the recommendations we can give our users. While recommender systems theory is much broader than what we covered here, recommender systems are a perfect canvas to explore machine learning and data mining ideas and algorithms, not only because of the nature of the data, but also because of the relative ease of visualizing and comparing the results.

Resources on Recommender Systems

Recommender Systems: An Introduction. An academic reference whose first chapter explains the material discussed here in more detail and with more rigor. Besides the math, it includes design hints and practical usage of recommender systems. It’s the standard textbook on the topic.


Programming Collective Intelligence. Written by Toby Segaran. Its first chapter takes a mathematically lightweight approach to this amazing topic. It includes an explanation of the two similarity measures explained here and an approach to match items instead of users. It also includes “big” datasets to play with. The whole book explores machine learning related ideas using a programming-first approach.

Programming Collective Intelligence: Building Smart Web 2.0 Applications

Recommender Systems Specialization. A whole Coursera Specialization on the topic.


The following links provide useful information on deployment of real recommender systems.

How Recommendation System Works. A review of companies using recommender systems.

How Recommendation System Works | EdLab

Amazon.com Recommendations. Item-to-Item Collaborative Filtering. A popular-science description of Amazon’s recommender system, written by the engineer who was behind it.

https://medium.com/media/f2197dc997f140d989909a4299b1e1b7/href

How does the Amazon Recommendation feature work? Some hints on recommender system design and production-ready artifacts (reading the links related here requires a lot of mathematical maturity and a greedy research enthusiasm).


How Amazon’s Recommendation System works and What it might be missing.


Now Anyone Can Tap the AI Behind Amazon’s Recommendations.


How does the Netflix movie recommendation algorithm work?


An Introductory Recommender Systems Tutorial was originally published in AI Society on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Comprehensive Introduction to Word Vector Representations

Esteban Vargas — Thu, 09 Feb 2017 21:57:36 GMT

Making a computer mimic the human cognitive function of understanding text is a really hot topic nowadays. Applications range from sentiment analysis to text summarization and language translation, among others. We call this field of computer science and artificial intelligence Natural Language Processing, or NLP (gosh, please don’t confuse it with Neuro-linguistic Programming).

Bag of Words

The ‘Bag of Words’ model was an important insight that made NLP thrive. This model consists of receiving a list of labeled text corpora, making a word count on each corpus and determining how frequently each word (or morpheme [1], to be more precise) appears for every given label. After that, Bayes’ Theorem is applied to an unlabeled corpus to test which label it has a higher probability of belonging to (positive or negative, in a sentiment analysis), based on morpheme frequencies.
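To make the idea concrete, here is a minimal sketch of a bag-of-words classifier based on Bayes’ Theorem (commonly called Naive Bayes). Several things in it are assumptions made for illustration: words are split on whitespace instead of being reduced to morphemes, add-one (Laplace) smoothing is used so that unseen words do not zero out a probability, and the tiny training set is made up.

```python
from collections import Counter
from math import log


def train(labeled_texts):
    """Count words per label. labeled_texts is a list of (text, label)."""
    word_counts = {}         # label -> Counter of word frequencies
    doc_counts = Counter()   # label -> number of training texts
    for text, label in labeled_texts:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data, only to exercise the functions.
model = train([
    ("I love this movie", "positive"),
    ("what a great film", "positive"),
    ("I hate this movie", "negative"),
    ("what a terrible film", "negative"),
])
print(classify("great movie", model))
print(classify("terrible film", model))
```

Word order plays no role in the score: each text is reduced to its word counts, which is where the name of the model comes from.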

Even though decent (>90%) test scores can be achieved with this method, it has 2 problems:

Syntactic and semantic accuracy isn’t as high as it should be, because context is king. For instance, ‘Chicago’ means one thing and ‘Bulls’ means another, but ‘Chicago Bulls’ means something completely different. Counting word frequencies doesn’t take this into account.
For more practical use cases, we need to keep in mind that real-life data tends to be unlabeled, so moving from a supervised to an unsupervised learning method yields greater utility.

Simple Co-occurrence Vectors

Analyzing the context in which a word is used is the fundamental insight for attacking this problem. Taking a word’s neighboring words into account is what has made NLP take a quantum leap in recent years.

We will set a parameter ‘m’ which stands for the window size. In this example we’ll use a size of 1 for educational purposes, but 5–10 tends to be more common. A window of size 1 means that each word will be defined by its neighboring word to the left as well as the one to the right. This is modeled mathematically by sliding the window over the text and accumulating the counts in a co-occurrence matrix. Let’s look at the following example:

I love Programming. I love Math. I tolerate Biology.

Here the word ‘love’ is defined by the words ‘I’ and ‘Programming’, meaning that we increment the value both for the ‘I love’ and the ‘love Programming’ co-occurrence. We do that for each window and obtain the following co-occurrence matrix:
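The counting procedure can be sketched in a few lines of Python. This is an illustration under assumptions: the three sentences are tokenized by hand, windows do not cross sentence boundaries, and the function name cooccurrence_matrix is hypothetical.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` positions.

    sentences is a list of token lists. Returns (vocab, counts), where
    counts[i][j] is the co-occurrence count of vocab[i] and vocab[j].
    """
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]

    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Neighbors up to `window` positions to the left and right.
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[index[word]][index[sentence[j]]] += 1
    return vocab, counts


sentences = [
    ["I", "love", "Programming"],
    ["I", "love", "Math"],
    ["I", "tolerate", "Biology"],
]
vocab, matrix = cooccurrence_matrix(sentences, window=1)
for word, row in zip(vocab, matrix):
    print(f"{word:<12}", row)
```

The resulting matrix is symmetric, since every ‘I love’ count is also a ‘love I’ count, and the rows for ‘Programming’ and ‘Math’ come out identical.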

Once we have the co-occurrence matrix filled we can plot its rows as points in a multi-dimensional space. Since ‘Programming’ and ‘Math’ share the same co-occurrence values, they would land on the same point, meaning that in this context they mean the same thing (or ‘pretty much’ the same thing). ‘Biology’ would be the closest word to these 2, meaning ‘it has the closest possible meaning but it’s not the same thing’, and so on for every word. The semantic and syntactic relationships generated by this technique are really powerful, but it’s computationally expensive, since every word in the vocabulary adds a dimension and we end up with a very high-dimensional space. Therefore, we need a technique that reduces dimensionality with the least data loss possible.

Singular Value Decomposition

The idea here is to store only the most ‘important’ information in order to have a dense vector (eliminating as many 0’s as possible to keep only the relevant values) with a low number of dimensions. The way we do this is by applying a technique borrowed from Linear Algebra called Singular Value Decomposition [2], which in summary generalizes the eigendecomposition of a square normal matrix to any rectangular matrix. The matrix in the example above is symmetric, so its singular values are simply the absolute values of its eigenvalues.
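A minimal sketch of this reduction with NumPy follows. The matrix is the window-1 co-occurrence matrix for the three example sentences, written out by hand under the assumption that windows do not cross sentence boundaries. The number of dimensions kept (k = 2) is an arbitrary choice for illustration.

```python
import numpy as np

words = ["Biology", "I", "Math", "Programming", "love", "tolerate"]

# Co-occurrence counts; rows and columns follow the order of `words`.
X = np.array([
    [0, 0, 0, 0, 0, 1],   # Biology
    [0, 0, 0, 0, 2, 1],   # I
    [0, 0, 0, 0, 1, 0],   # Math
    [0, 0, 0, 0, 1, 0],   # Programming
    [0, 2, 1, 1, 0, 0],   # love
    [1, 1, 0, 0, 0, 0],   # tolerate
], dtype=float)

# X = U @ diag(s) @ Vt, with the singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k strongest directions: each word becomes a dense k-dimensional vector.
k = 2
embeddings = U[:, :k] * s[:k]

for word, vector in zip(words, embeddings):
    print(f"{word:<12}", np.round(vector, 3))
```

Words with identical rows in the co-occurrence matrix, such as ‘Math’ and ‘Programming’, still get identical vectors after the reduction. On a real corpus the matrix has one row per vocabulary word, which is where the cost of the decomposition starts to hurt.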

This approach generates really interesting semantic and syntactic relationships. Semantically, we could visualize that ‘San Francisco’ and ‘New York’ are at the highest possible level of similarity, that ‘Toronto’ is at the next level and that ‘Tokyo’ is at the one after that. Syntactically, we can find words clustered around their respective morphemes; for example, ‘write’, ‘wrote’ and ‘writing’ can be clustered together, and far away there’s another cluster with the words ‘cook’, ‘cooking’ and ‘cooked’. With this approach dimensionality has indeed been reduced. However, the computational cost scales quadratically (O(mn²) for an n×m matrix, when n < m), which is not very desirable. Let us then introduce you to a model that solves this computational complexity issue:

GloVe

The way that we are going to finally solve our computational complexity issue is by predicting the surrounding words of every word instead of counting co-occurrences directly. This method is not only more computationally efficient, it also makes it viable to add new words to the model, which means the model scales with corpus size. There are various models in this family, but we’re going to talk about one in particular that generates really powerful word relationships: GloVe, Global Vectors for Word Representation [3]. Strictly speaking, GloVe is trained on global co-occurrence counts rather than on one context window at a time, so it combines the count-based view of the previous sections with the learned, low-dimensional vectors of window-based predictive models such as word2vec.

The way these models predict surrounding words is by maximizing the probability of a context word occurring given a center word, which amounts to performing a dynamic logistic regression. This just means we are going to optimize a probability function. Review Convex Optimization [4] if this doesn’t sound familiar. Our cost function is the following:
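For reference, here is a sketch of the two objectives this section touches on, in notation that is assumed here rather than taken from the original figure: T is the number of words in the corpus, m is the window size, v_c and u_o are the vectors of a center word c and a context word o, V is the vocabulary size, and X_ij is the number of times word j appears in the context of word i. The first expression is the window-based prediction objective described above (the skip-gram form used by word2vec). The second is the weighted least-squares objective from the GloVe paper [3], where f is a weighting function that limits the influence of very frequent pairs.

```latex
% Window-based prediction (skip-gram): minimize the negative log-likelihood
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}}
            \log p(w_{t+j} \mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}

% GloVe: fit dot products of word vectors to log co-occurrence counts
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

In both cases the parameters being learned are the word vectors themselves, and the minimization is typically carried out with gradient-based methods.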

Then something mind-blowing happens. The multi-dimensional plot (represented in 2 dimensions here) captures that what Dollar is to Peso, USA is to Colombia, and also that what Dollar is to USA, Peso is to Colombia. The most impressive thing about this isn’t that cognitive intelligence assessments test how well a human can build this kind of relation, but that the semantic relation between words turns into a mathematical one. For instance, if you perform the vector operation Peso - Dollar + USA, the word vector closest to the result is Colombia. This happens because these words tend to appear in the same contexts. Imagine we are training on a corpus of economic news; you’ll often find fragments such as “The {Country} {Currency} appreciated” or “Firms that import from {Country1} to {Country2} are worried because the {Currency2} has depreciated with respect to the {Currency1}.”
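The vector arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are hand-made for illustration, not trained embeddings: the first coordinate loosely encodes ‘is a currency’ and the second encodes the country the word relates to. The extra words and the analogy helper are also hypothetical.

```python
from math import sqrt

# Hand-made toy vectors (an assumption; real embeddings are learned from text).
vectors = {
    "USA":      [0.0,  1.0],
    "Colombia": [0.0, -1.0],
    "Dollar":   [1.0,  1.0],
    "Peso":     [1.0, -1.0],
    "Euro":     [1.0,  0.0],
    "Canada":   [0.0,  0.5],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("Peso", "Dollar", "USA"))       # prints Colombia
print(analogy("Dollar", "USA", "Colombia"))   # prints Peso
```

With trained vectors the result of the arithmetic rarely lands exactly on a word, which is why the answer is taken to be the nearest neighbor by cosine similarity, leaving out the three query words.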

This first tutorial has been mostly about the mathematical background behind modern deep learning techniques for NLP. With this notion we can now crack some code to perform sentiment analysis, which we’ll do in our next tutorial.

Happy hacking!

[1] The smallest meaningful unit of a word. For example, reading, read and readable share the morpheme ‘read’. Python libraries such as nltk allow you to run algorithms that reduce each word in a corpus to its morpheme in only a few lines of code.

[2] Here’s a comprehensive tutorial from MIT OCW: https://www.youtube.com/watch?v=cOUTpqlX-Xs If you need a Linear Algebra refresher, please take it.

[3] https://pdfs.semanticscholar.org/b397/ed9a08ca46566aa8c35be51e6b466643e5fb.pdf

[4] http://cs229.stanford.edu/section/cs229-cvxopt.pdf

Thanks to Juan C. Saldarriaga, Ana M. Gómez and Melissa M. Argote for revising the drafts of this text.

A Comprehensive Introduction to Word Vector Representations was originally published in AI Society on Medium, where people are continuing the conversation by highlighting and responding to this story.
