This month, Jam3’s R&D team explores Machine Learning and Neural Nets.
In each cycle, the Jam3 R&D team tackles a new technology or technique. This month, our primary goal was to learn more about machine learning and how it could be applied in a browser.
For our first week, we did some broad research to find out as much as we could, and also built a convolutional neural network that can classify images. In the second week, we built a more polished demo that uses a machine learning algorithm to group and visualize data from Twitter.
This final prototype, which we called Hashpipe, visualizes thousands of tweets by the semantic meaning in their text and hashtags. Over time, the program begins to recognize language patterns in the data set and forms clusters in 3D space. For demonstration, our application visualizes Twitter data around the 2016 US Election, the Rio Olympics, and incidents reported on the Twitter feed of the Toronto Police.
With each data set, the machine learns to recognize different language patterns as it analyzes the tweets. The algorithm is visualized in a three-dimensional graph, rendered with WebGL in the browser.
Check out the link below to see the demo:
(Note: Desktop Chrome, Firefox and Safari browsers only.)
What is machine learning, anyway?
At its core, machine learning is a technique for building a computer program that can teach itself to improve over time as it’s fed fresh data. As far back as 1959, it was defined as a “field of study that gives computers the ability to learn without being explicitly programmed”.
Machine learning and neural networks are currently hot topics and an active field of research. Although the public perception is largely driven by flashy projects like Google DeepDream, Style Transfer, and the Cannes Lion-winning project The Next Rembrandt, many of the real-world applications cruise along under the radar, mostly because they’re designed to solve very specific problems.
For example, machine learning is being used for handwritten text recognition, email spam filtering, facial recognition and, more recently, diagnosing rare forms of leukemia.
Feeding the machine
For this project, we decided to use Twitter as our source for data. It provides a rich and organic set of text, hashtags, usernames, emojis, images, geolocations, and other metadata.
However, it also introduces some challenges for machine learning: our Twitter API privileges only allowed us to collect approximately 3,000 tweets for a single hashtag or user timeline, and many of the collected tweets came from bot and spam accounts. Despite those challenges, the data produced some interesting results.
To gather the tweet data sets, we used Node.js and the twit module to scrape a specific query or user timeline. Some of our experiments also acted on images in tweets. For this, we used devtool to scrape, scale and crop the thumbnail images to 16x16 pixels, then write the PNGs to the local filesystem for later processing. With these tools in place, we were ready to jump into the data and see how we could apply machine learning to different slices of the Twitterverse.
Image Classification with a Convolutional Neural Net
We tested queries from Twitter with the tags #burger and #porsche (because both are certified awesome), training our network on the media we found in each tweet. After training, our network could crudely classify a user photo as either a “Porsche” or “Burger”, and output common hashtags based on that classification. Here is an example output from a photo of a burger:
With that basic proof-of-concept out of the way, we wanted to expand things out a bit and see what else we could discover. Because of the noise and variety of media on Twitter, a larger image classifier would have been difficult to train.
For that reason, we decided that analyzing text rather than images might give us more interesting results.
Semantic Meaning in Language
To analyze and visualize tweets by their text, we first needed to convert the language into a representation the computer can better understand. This process is often called word embedding, and it involves mapping words and phrases to vectors of real numbers. For example, a getEmbedding function could take an input string and return a 5-dimensional vector.
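As a rough illustration of that idea, here is a toy getEmbedding built on a hand-written lookup table. The words and numbers are invented for this sketch and do not come from a trained model; a real embedding would be learned from the tweet data.

```javascript
// Toy vocabulary: these vectors are made up for illustration only and
// were not produced by word2vec or any other trained model.
const EMBEDDINGS = new Map([
  ['burger', [0.12, -0.48, 0.33, 0.9, -0.05]],
  ['porsche', [-0.71, 0.22, 0.64, -0.13, 0.4]],
]);

const DIMENSIONS = 5;

// Returns a copy of the 5-dimensional vector for a word, or a zero
// vector when the word is not in the vocabulary.
function getEmbedding(word) {
  const vector = EMBEDDINGS.get(word.toLowerCase());
  return vector ? vector.slice() : new Array(DIMENSIONS).fill(0);
}

console.log(getEmbedding('burger'));
```

Returning a zero vector for unknown words is just one possible fallback; dropping unknown words entirely would be another reasonable choice.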
For these experiments, we used node-word2vec, a Node.js interface to word2vec, a set of tools pioneered by Google engineers for mapping words to high-dimensional vectors. When we apply it to our entire data set of tweets, we end up with vectors that roughly represent the semantic meaning of each word. We can even perform basic linear algebra on the results, such as finding similar words with a Euclidean distance function. This was a crucial first step in processing our data set, allowing us to use math to explore relationships between words.
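The Euclidean distance mentioned above is simple to write by hand. The sketch below is a plain JavaScript version, independent of node-word2vec, and the three sample vectors are invented to show how the comparison works.

```javascript
// Straight-line (Euclidean) distance between two vectors of equal length.
function euclideanDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Hypothetical 5-dimensional word vectors, for illustration only.
const vote = [0.8, 0.1, -0.3, 0.5, 0.2];
const ballot = [0.7, 0.2, -0.2, 0.6, 0.1];
const swimming = [-0.6, 0.9, 0.4, -0.2, 0.7];

// A smaller distance means the two words sit closer together in the
// embedding space, which we read as "more similar".
console.log(euclideanDistance(vote, ballot) < euclideanDistance(vote, swimming));
```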
Visualizing High-Dimensional Vectors
After building a vector for each word in our textual data, we end up with a lot of numbers that can be fed into a computer program, but are pretty much impossible for a human to understand.
In order to bring our data down to earth and out of the realm of pure abstraction, we needed to reduce the vectors into a more familiar format, like two or three dimensions. This is known as dimensionality reduction. A common approach to this problem is t-distributed stochastic neighbor embedding (t-SNE), a machine learning algorithm designed for visualizing high-dimensional data.
With t-SNE, you give the algorithm high-dimensional data (like thousands of 5-dimensional vectors) and, over many iterations, it arranges the points in a lower-dimensional space so that similar items end up close together. You can read more about t-SNE in this article.
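To give a feel for how t-SNE scores a candidate layout, the sketch below computes the pairwise similarities it uses in the low-dimensional map: a Student-t kernel, 1 / (1 + squared distance), normalized over all pairs. This is only one ingredient of the algorithm (the high-dimensional similarities and the gradient updates are left out), and the function name is ours.

```javascript
// Low-dimensional similarities as used by t-SNE: each pair of points gets
// a weight of 1 / (1 + squaredDistance), and the weights are normalized so
// that all off-diagonal entries of the matrix sum to 1.
function lowDimSimilarities(points) {
  const n = points.length;
  const Q = points.map(() => new Array(n).fill(0));
  let total = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      let squared = 0;
      for (let d = 0; d < points[i].length; d++) {
        const diff = points[i][d] - points[j][d];
        squared += diff * diff;
      }
      const weight = 1 / (1 + squared);
      Q[i][j] = weight;
      Q[j][i] = weight;
      total += 2 * weight;
    }
  }
  if (total === 0) return Q; // fewer than two points: nothing to normalize
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      Q[i][j] /= total;
    }
  }
  return Q;
}

// Two neighbours and one distant point in a 2D layout.
const Q = lowDimSimilarities([[0, 0], [1, 0], [10, 0]]);
console.log(Q[0][1] > Q[0][2]); // nearby pairs receive more weight
```

During optimization, t-SNE moves the points so that these similarities match, as closely as possible, the similarities measured in the original high-dimensional data.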
Visualizing Twitter Media
We felt that t-SNE provided a nice visual backdrop for our research into machine learning, so we built a tool that could quickly fetch, embed, and visualize any query or user timeline from Twitter. The visualization and machine learning steps run in a web browser, using WebGL to render the data and a fork of Karpathy’s tsnejs as the implementation of the algorithm. The t-SNE integration runs in a Web Worker to avoid blocking the main thread.
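A minimal sketch of what such an iteration loop might look like is shown below. It is an illustration, not the Hashpipe source. The solver is passed in so the snippet stays self-contained, and it is assumed to expose initDataRaw, step and getSolution, which are the method names documented for the upstream tsnejs library; a fork may differ. The runTsne function, its parameters and the onFrame callback are hypothetical.

```javascript
// Runs a t-SNE solver for a fixed number of iterations and reports the
// intermediate layout every `reportEvery` steps, plus once at the end.
// `solver` must provide initDataRaw(vectors), step() and getSolution().
function runTsne(solver, vectors, iterations, reportEvery, onFrame) {
  solver.initDataRaw(vectors);
  for (let i = 1; i <= iterations; i++) {
    solver.step(); // each step nudges the layout toward a better solution
    if (i % reportEvery === 0 || i === iterations) {
      onFrame({ iteration: i, points: solver.getSolution() });
    }
  }
}
```

Inside a Web Worker, the call might look like `runTsne(new tsnejs.tSNE({ dim: 3 }), vectors, 500, 10, (frame) => self.postMessage(frame))`, with the main thread listening for messages and updating the WebGL scene. Reporting only every few steps keeps the amount of data copied between threads manageable.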
We tried a few different approaches to visualize our Twitter data. First, we attempted to organize the data by the media in each tweet. When you run t-SNE on arbitrary images, it begins to group them crudely by similarity. For example, the algorithm will separate photos of leafy green trees from photos of black cats. This sort of unsupervised grouping can be useful for finding patterns and similarities in data sets too large for someone to easily examine.
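Before t-SNE can work on images, each image has to become a vector of numbers. The exact features are not spelled out here, so the sketch below assumes the simplest option: flattening the raw pixels of a small thumbnail into one long vector. The pixelsToVector name and the RGBA input layout (as found in a canvas ImageData buffer) are assumptions of this example.

```javascript
// Flattens RGBA pixel data into a feature vector of red, green and blue
// values scaled to the range 0..1. The alpha channel is ignored.
// A 16x16 thumbnail produces 16 * 16 * 3 = 768 numbers.
function pixelsToVector(rgba) {
  const vector = [];
  for (let i = 0; i + 3 < rgba.length; i += 4) {
    vector.push(rgba[i] / 255, rgba[i + 1] / 255, rgba[i + 2] / 255);
  }
  return vector;
}

// A solid white 16x16 thumbnail as a stand-in for real image data.
const thumbnail = new Uint8ClampedArray(16 * 16 * 4).fill(255);
console.log(pixelsToVector(thumbnail).length);
```

With features this simple, images can only be compared by their raw colour values, so any grouping will reflect overall colour and lightness rather than what the photos depict.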
For fun, we ran t-SNE on a data set from @archillect — a bot that walks the internet and learns to curate beautiful images.
t-SNE organized the images in the archillect data set both by brightness, with darker images on one end and lighter ones on the other, and by hue, grouping similarly-coloured images together.
Visualizing Textual Data with Hashpipe
Visualizing tweets with image data looked interesting, but we wanted our visualization to represent patterns in semantic meaning rather than just image similarity. This is where our earlier work with word2vec and word embeddings comes into play.
For our final prototype, we decided to visualize textual data in three general subjects: the US Election (#election2016), the Rio 2016 Olympics (#rio2016), and the Toronto Police Operations timeline (@TPSOperations). In each set, the machine analyzed our word vectors and learned to separate the tweets by different language patterns.
Using 1,750 tweets from the #election2016 hashtag, the t-SNE algorithm slowly began to separate the data by party affiliation — tweets favouring the Democratic Party clumped together, as did tweets favouring the Republican Party.
Once we discovered that the machine was clustering tweets by party affiliation, we manually coloured a subset of the data as red (strongly Republican) or blue (strongly Democratic), leaving the rest grey. The colouring helps confirm that the machine clustered our tweets by party affiliation, as shown below:
The machine also identified clusters of Twitter bots advocating for or against a particular candidate, and helped us discover some unusual art projects, like @t_r_u_m_p_i_n_g.
In the Rio set, the tweets began to separate by language: the computer could clearly distinguish English (red), French (blue) and other languages (grey). Interestingly, the machine placed Portuguese (yellow) and Spanish (orange) close together, treating them as closely related languages, as shown in the following screenshot.
Our third data set sampled random tweets from the Toronto Police Operations account, @TPSOperations. Over time, the machine learned to recognize different crimes (stabbing, shooting, etc.), events (collisions, missing persons) and other styles of writing (like replies to questions about police activity). Below is a screenshot of the visualization, superimposed with text to highlight the various clusters the machine recognized.
One of the most interesting things about t-SNE and machine learning is that we never explicitly told the machine to separate tweets by language, crime, party affiliation or any other metric.
Instead, we simply handed the word embeddings to the algorithm, and the machine learned over time how best to separate the data into clusters, highlighting patterns and differences in the data set.
Where could this take us?
The main use case of t-SNE is to visualize and understand high-dimensional data so that insights can be gained about the data set. For example, t-SNE helped us quickly identify problematic data such as bots, spammers, irrelevant tweets, and various other false positives. The visualization made these problems obvious almost immediately.
In a larger data set, such as millions of tweets about Coca-Cola, t-SNE could be used to visualize market segmentation and identify new topics of interest about the brand. It would also be interesting to see t-SNE applied to a database of movie or song metadata, allowing viewers to more easily discover and explore content tailored to their interests.
With Hashpipe, we also demonstrated a simple method for hashtag and keyword recommendation. This is based on the Euclidean distance between points, and can suggest hashtags for a specific input. For example, the input #imwithher may suggest #dumptrump, #uniteblue and #hillyes. This can be used to identify new hashtags and trends automatically and provide smarter tools for tagging and searching content. This kind of hashtag analysis could be used in a number of ways: as one example, you could use it to broaden your organic reach on social media more quickly and efficiently than if you simply researched and discovered related tweets on your own.
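The recommendation idea can be sketched as a nearest-neighbour search: take the point for the input hashtag, measure the Euclidean distance to every other hashtag, and return the closest few. The code below is an illustration under that assumption rather than the Hashpipe implementation, and the two-dimensional vectors are invented so the result is easy to check by hand.

```javascript
// Suggests the `count` hashtags whose vectors lie closest to the query.
// `embeddings` is a Map from token to vector; tokens that are not
// hashtags are skipped, as is the query itself.
function recommendHashtags(query, embeddings, count) {
  const origin = embeddings.get(query);
  if (!origin) return [];
  const scored = [];
  for (const [token, vector] of embeddings) {
    if (token === query || !token.startsWith('#')) continue;
    let squared = 0;
    for (let i = 0; i < origin.length; i++) {
      const diff = origin[i] - vector[i];
      squared += diff * diff;
    }
    scored.push({ token, distance: Math.sqrt(squared) });
  }
  scored.sort((a, b) => a.distance - b.distance);
  return scored.slice(0, count).map((entry) => entry.token);
}

// Hypothetical positions, for illustration only.
const points = new Map([
  ['#imwithher', [1.0, 1.0]],
  ['#hillyes', [1.1, 0.9]],
  ['#uniteblue', [0.8, 1.2]],
  ['#rio2016', [-3.0, 2.0]],
  ['vote', [1.0, 1.05]],
]);

console.log(recommendHashtags('#imwithher', points, 2));
// → [ '#hillyes', '#uniteblue' ]
```

A linear scan like this is fine for a few thousand tweets; a much larger data set would likely call for a spatial index or an approximate nearest-neighbour search.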
As we discovered, you might also find unexpected relationships between seemingly unrelated terms. That could lead to some interesting discoveries.
We developed tools to scrape large amounts of text and media from Twitter, built a deep convolutional neural network to classify images, worked with word2vec for word embeddings, and visualized a number of data sets with a web-based implementation of t-SNE.
We are now equipped with a better understanding of machine learning and how it can be integrated on the web, and we are excited to explore some more practical applications.
We’ll keep you posted when we do.
Questions? Comments? We’d love to hear from you.
Check out github.com/Jam3 for lots more open source projects and articles, or follow @Jam3 to hear about future updates from the team.