Become a Machine Learning Expert in Under an Hour

Capital One Tech · Feb 28, 2018 · 12 min read
By Domenic Puzio, Data Engineer, and Jennifer Van, Software Engineer, Capital One

Next week, we’re headed to Austin to speak at SXSW, and we couldn’t be more excited! (Of course, speaking is cool, but we really can’t wait for the film festival.) Our talk is titled How to Become a Machine Learning Expert in Under an Hour, and, as the name suggests, it will be a crash course in all things machine learning (ML). One of the coolest things we’ll dive into is deep learning, and we’ll apply it to a text-based problem: creating chatbots.

Level-Setting on the Deep Learning Hype

Deep learning is two things:

  • An idea that takes traditional neural networks and supercharges them.
  • A huge buzzword.

The goal of our talk is to cut through the hype surrounding machine learning and get to an understanding of how it works, so we, as consumers, can better understand how to adapt to a world increasingly driven by machine learning. We hope this blog post can expand on some of the techniques we’ll discuss onstage, in addition to providing a play-by-play for creating our chatbot example.

But first — why do we care about deep learning? And when did this technology become such a huge deal?

For us, the “why” is easy: deep learning is going to be ubiquitous. It’s changing the way we drive (Tesla), the way we pick out movies (Netflix), and even who we date (Tinder). Deep learning makes noisy, unclear data valuable, like your photo library or the speed at which you scroll down your newsfeed — all of this can be transformed into intelligence about the way you talk, the places you go, and the type of person you are. And that intelligence can be turned into money. Deep learning should change the way we think about our data and how we choose to share it.

As for the “when,” the ideas behind deep learning have been around for over 50 years, but they haven’t taken off until the past several years. Deep learning requires a ton of data to understand complex problems; to reach expert level, a model needs about 10 million data points. Processing all this data requires smarter algorithms on the software side, and more computing power on the hardware side. Fortunately, we have on-demand computing resources in the cloud and specialized machine learning machines that use GPU processing to speed up calculations, both of which greatly empower deep learning. The tools required to implement deep learning techniques have also become more accessible — much of machine learning software is now open-sourced, so more people than ever can contribute to this growing field. More broadly, changes in software, hardware, and societal acceptance of technological advances over the last several years have contributed to widespread adoption and a growing appetite for deep learning applications.

Since deep learning requires a lot of computational firepower and a metric ton of data, we only want to use it when necessary. Deep learning should be reserved for problems for which there is an objective ground truth (so our model can evaluate how it is doing and learn accordingly) and in spaces that are very large in scale. We also don’t want to use machine learning to perform tasks that can be summed up by a series of rules. Programming a stoplight doesn’t need deep learning, but training a computer to drive a car does.

We use deep learning at Capital One for a variety of use cases, from cyber security to image recognition. One of our favorite examples is Eno, our ML-powered chatbot that can help you manage your Capital One bank and credit card accounts via text messaging. While our team didn’t specifically work on Eno, we thought it would be fun to build a chatbot of our own, one that’s just an example and won’t go into production, but still really fun — we set out to create a chatbot that could talk like us!

Building our Hypothetical Chatbot

First off, all steps and code that we follow can be found in detail on GitHub, so head there first if you want to try this at home.

Like we said earlier, building a deep learning model requires a lot of data. And where could we source a lot of data about the way that we talk? We headed to Facebook. As you’re likely aware, Facebook tracks your every like, photo, status, message, and poke. As terrifying as this is, there’s a slight upside: they make this information available for you to download. So, we’re going to train a chatbot using our Facebook Messenger data.

The first step is to download your data from Facebook and then use an open-source tool to parse it into messages and responses (these steps are spelled out in detail on the GitHub repository). Now, we train the model! To train a deep learning model, we need to provide our network with many sample inputs (messages sent to us) and many sample outputs (our replies). By ingesting many messages and responses, the model learns underlying patterns that are present in the language that we use: the meanings of different words, the syntax of forming sentences, and even how to converse.
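To make the pairing step concrete, here is a minimal sketch of turning a chronological message log into (incoming message, our reply) training examples. The log format and the `"ME"` sender label are illustrative assumptions, not the actual output of the parsing tool on GitHub.

```python
def build_pairs(log, me="ME"):
    """Pair each message sent to us with the reply we gave next."""
    pairs = []
    # Walk the log in order, looking at each consecutive pair of messages.
    for (sender, text), (next_sender, next_text) in zip(log, log[1:]):
        # A training example is a friend's message followed by our reply.
        if sender != me and next_sender == me:
            pairs.append((text, next_text))
    return pairs

# A toy chronological log of (sender, text) tuples.
log = [
    ("friend", "are you free tomorrow?"),
    ("ME", "yes! what's up"),
    ("friend", "want to hike old rag?"),
    ("ME", "only if the weather holds"),
]
pairs = build_pairs(log)
```

Each resulting pair becomes one input/output example for the network; messages we ignored (or replies with no preceding message from a friend) simply don’t produce a pair.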

If you don’t use Facebook Messenger frequently, or if you don’t use it to converse (most of my messages are links that I want to share with friends), the network may not be able to pick up on the underlying patterns that make up how you speak. But at the very least, the result will be interesting!

For a neural network to learn these complex relationships, it requires thousands of inputs, and it looks through each input many times. Due to these time-intensive requirements, the model is going to take a while to train — at least four hours on a standard laptop (again, the steps to start the model training are on GitHub). So, while your model is training, let’s brush up on some of the theory behind how this type of model works.

Under the hood, this chatbot model is a sequence-to-sequence long short-term memory (LSTM) network. There’s a lot to unpack there.

Sequence-to-sequence means that the model takes in data in a sequence (in this case, a sequence of words forming a message) and outputs a sequence (the reply, another sequence of words). These are two neural networks working in concert; the first, sometimes referred to as the encoder, takes the input sequence and learns how to condense it into a “thought vector” that summarizes the most important parts. The second, the decoder, learns how to take these thoughts and respond to them with an appropriate reply. The encoder might interpret that your friend is asking a question and talking about the weather; the decoder might learn to reply to weather questions with the current temperature and precipitation levels. The two networks learn the best ways to express thoughts and respond to them by viewing many examples from our training data, in this case your message history. You can see the structure of the sequence-to-sequence network below.
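The encoder/decoder flow can be sketched in a few lines of numpy. This toy version uses plain recurrent cells rather than LSTMs, and the vector size and random weights are illustrative assumptions; the point is only the shape of the computation: the encoder folds the whole input sequence into one thought vector, and the decoder unrolls a reply from it.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # size of the hidden "thought vector" (illustrative)

def rnn_step(W, U, x, h):
    # One recurrent step: mix the current input with the previous state.
    return np.tanh(W @ x + U @ h)

# Separate (random, untrained) weights for the encoder and decoder.
W_enc, U_enc = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_dec, U_dec = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def encode(inputs):
    """Condense an input sequence into a single thought vector."""
    h = np.zeros(d)
    for x in inputs:
        h = rnn_step(W_enc, U_enc, x, h)
    return h  # summarizes the whole message

def decode(thought, steps):
    """Unroll the decoder from the thought vector into an output sequence."""
    h, outputs = thought, []
    x = np.zeros(d)  # stand-in for a start-of-reply token
    for _ in range(steps):
        h = rnn_step(W_dec, U_dec, x, h)
        outputs.append(h)
        x = h  # feed the previous output back in as the next input
    return outputs

message = [rng.normal(size=d) for _ in range(5)]  # five "word" vectors
thought = encode(message)
reply = decode(thought, steps=3)
```

In a real model, each decoder output would be mapped to a probability distribution over vocabulary words, and the weights would be learned from the message/reply pairs rather than drawn at random.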

Long short-term memory networks, or LSTMs, are neural networks designed for capturing relationships in sequential input like message text. They provide short-term memory across the sequence, allowing the network to recall previous decisions. Let’s say our LSTM is trying to interpret the message, “What is the weather going to be like tomorrow?”

The model is going to see the question mark at the end of the sentence and remember that it saw the word “weather” earlier in the sequence. We call this capturing temporal dependencies, and it’s a challenge at which LSTMs excel. From sentences to stock prices, a lot of data that we care about comes in sequences, so LSTMs are a very valuable technique in the machine learning engineer’s toolbox.

The amazing thing about LSTMs is that they can capture these temporal dependencies over significant periods of time. Maybe the message looks more like this:

“I am really worried about the weather for our hike tomorrow. Last month, I was on Old Rag, and it rained. We had a miserable time because the rocks were so slippery! What do you think?”

Although this is asking a similar question about tomorrow’s weather, the word “weather” is much further from the eventual question mark. For other neural network architectures, remembering a dependency from so long ago would be difficult, but LSTMs are more capable. The secret to an LSTM’s memory is two-fold: recurrent connections and cell state.

LSTM neurons take input sequentially, word-by-word, in this case. For each new word, the LSTM gives an output value; it passes this value to the other neurons in the network but also back to itself. This means that when the neuron gets a new word, it also gets its thoughts from the previous word. The thoughts for that previous word are based around that word but also the model’s thoughts on the word that preceded the previous word, which, in turn, was based on the word coming before that one. This is recurrence: our prediction at the end is based on the values that came all throughout the sequence. The neuron’s thoughts on the very first word influence its thoughts on the second. The information extracted from each word re-occurs (or recurs) in each subsequent prediction, influencing the final output. These recurrent connections, shown in the image below, are part of how LSTMs build short-term memory.

The problem with recurrent connections is that they give a stronger weight to more recent items in the sequence. While “weather” might influence the neuron’s thoughts on “hike” (three words away) quite a bit, it influences the neuron’s view on the question mark (thirty words away) exponentially less. In short, the value of items in our sequence degrades over time. This is where the LSTM’s cell state comes to the rescue.

Each LSTM neuron — aside from taking in information about the current word and the previous words — also maintains a state, a memory that loses relevance over time. One LSTM neuron might choose to remember the type of sentence that we’re dealing with (question); one neuron might care about an important concept (weather); another might store the location (Old Rag); and one neuron might keep track of the time frame (tomorrow). With all this information in mind, the LSTM network as a whole (the combination of all these individual neurons) can make an accurate judgment on the meaning of the entire statement: a question about the weather in Old Rag tomorrow.

The way that LSTMs maintain this cell state, or memory, is through two gates. The first gate tells the neuron whether to forget its cell state. For example, we know that one neuron is keeping track of the location being discussed (Old Rag). If the text goes on to mention a new hiking spot (let’s say White Oak Canyon), we should forget about Old Rag.

Next, we need to remember the new place in our cell state: White Oak Canyon. This is what the second gate handles, adding new information to the cell state. These two gates are controlled by the inputs to the cell, which makes sense, since we only need to forget about Old Rag when we see a new location later in the sequence. In addition, we only know what to remember in place of Old Rag by looking at the new input. The forget gate and the remember gate form a dynamic cell memory that retains importance over time.

Below, we can see neurons from an example LSTM that are learning to capture relationships like line length and quotations using the cell state.

The final aspect of the cell state in LSTM neurons is a critical one: each neuron learns what it should remember. As machine learning engineers, we don’t tell the network to recall locations or timings or anything else. The network is going to learn which relationships in our data are most important and then capture those. This is what makes deep learning so powerful. Our model can learn relationships that perhaps we ourselves cannot describe (much human language understanding takes place subconsciously) and can remember information without being told what information is relevant.

Now that we know how our model is learning, let’s check in on how it’s doing.

We train our model over many epochs, or complete passes through the training data, during which the model sees our input and output values. In the initial epochs, our model is just learning. We pass it sample messages (like, “are you serious”) and it can’t come up with a reply (hence the empty brackets).
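The epoch structure itself is simple to sketch. Here a toy linear model (standing in for the much larger chatbot network) is fit by gradient descent; each epoch revisits every training example once, and the model improves a little each pass. The data, model, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))          # 50 training examples, 3 features
true_w = np.array([1.0, -2.0, 0.5])   # the pattern the model should learn
y = X @ true_w

w = np.zeros(3)
for epoch in range(200):              # each epoch is one full pass over the data
    grad = X.T @ (X @ w - y) / len(X) # average gradient of squared error
    w -= 0.1 * grad                   # step toward lower error
```

Early epochs leave the model far from the target; by the final epochs it has nearly recovered the true pattern, which mirrors how our chatbot’s replies go from empty brackets to recognizable sentences.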

After about halfway through our training, we can start to see words in the model’s response; we can actually watch the chatbot learn English. From these early responses, it’s clear that we use Facebook Messenger more often when the weather is bad and we’re not outside having fun.

The LSTM is starting to pick up on frequently-used words and style of speech. And the more Messenger information we show it, the better we expect it to do. Finally, our responses are starting to make some sense.

Now this is how Jen actually talks! And it looks like our LSTM is recalling that one time we accidentally installed malware on Jen’s laptop for a hackathon project…

While it’s not perfect, remember that this algorithm had to learn how to spell words, the meanings of those words, and how to stitch words together. This is no small task!

So why doesn’t it look like a perfect, human-crafted response? The biggest reason is lack of data. While we might have thousands of Facebook messages to look through during training, a deep learning model of this complexity would require millions of data points to truly sound like us. Given this, our model does a stellar job with the limited resources it has.

Of course, this project was just for fun — we won’t be putting this chatbot online. But we hope that it illustrates a few important concepts. First, deep learning is incredibly powerful; it has the ability to pick up on complex relationships that even humans struggle to explain, like how we recognize a friend’s face in a crowd or how you are understanding this sentence as you read it now. Second, deep learning has limitations. Without large volumes of well-organized data and the computing power to process this data, deep learning models do not live up to their full potential. The third and most important point is that you can do this too! Thanks to the open-source philosophy behind the tools used in this demo (and many other machine learning tools), anyone can get started in the world of artificial intelligence.

You may not be a machine learning expert just yet, but by stepping through the training of a deep learning model, you’re well past buzzwords and on your way to understanding why deep learning matters, when to use it, and how it works.

If you’ll be in Austin for SXSW, be sure to check out our introductory talk on machine learning!

These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.
