How Qchain Will Use Natural Language Processing
The goal of machine learning at Qchain is to match advertisers with publishers based on their content. Broadly speaking, the content can be broken down into two characteristics: topic and style. For example, FiveThirtyEight is a data-driven analytical sports and politics blog: Its topics are sports and politics; its style is data-driven and analytical. An advertiser may wish to place their content on blogs similar to FiveThirtyEight — and machine learning can help them achieve this goal at scale.
The domain of machine learning we will be working in is natural language processing (NLP). Topic modeling is a well-known unsupervised learning problem in NLP: the task is to group documents into categories based on their topics. The topics are usually latent in the sense that they are not explicitly specified in advance (hence unsupervised).
The traditional approach to topic modeling is based on word (or phrase) frequency. Given a large body of text, called a corpus, we can collect all the unique words and phrases that make up its vocabulary. For each document we then have word frequencies, and certain keywords will tend to appear more often in certain categories of documents. Based on these keyword distributions, we can sort the documents into clusters, or groups.
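The word-frequency step can be sketched with the Python standard library alone. The mini-corpus below is invented purely for illustration:

```python
import re
from collections import Counter

# Hypothetical two-document corpus (invented text, for illustration only).
corpus = {
    "doc_sports": "The striker scored a late goal after the penalty was saved.",
    "doc_politics": "The senate vote on the budget bill follows weeks of polling.",
}

def word_frequencies(text):
    """Lowercase the text, tokenize on letter runs, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# One frequency vector per document; keywords like "goal" and "senate"
# will separate the two categories.
frequencies = {name: word_frequencies(text) for name, text in corpus.items()}
print(frequencies["doc_sports"].most_common(3))
```

In practice the corpus would be large and the vocabulary filtered (stopword removal, frequency cutoffs), but the counting step is this simple.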
Within each cluster, the documents share similar keywords, from which we can infer the topic. For example, keywords such as “goal,” “penalty,” and “offside” may appear more frequently within certain sports blogs — and knowing these terms, we can conclude that the common topic is soccer.
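A simple way to measure whether two documents belong in the same cluster is cosine similarity between their keyword-count vectors. The counts below are invented for illustration:

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented keyword counts for three hypothetical blog posts.
soccer_1 = Counter({"goal": 4, "penalty": 2, "offside": 1})
soccer_2 = Counter({"goal": 3, "offside": 2, "keeper": 1})
politics = Counter({"senate": 3, "vote": 2, "poll": 2})

# The two soccer posts are far more similar to each other than to the
# politics post, so they end up in the same cluster.
print(cosine_similarity(soccer_1, soccer_2) > cosine_similarity(soccer_1, politics))
```

Real clustering algorithms (k-means, hierarchical clustering) build on exactly this kind of pairwise similarity.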
The most famous model for topic modeling is latent Dirichlet allocation (LDA) by David Blei, Andrew Ng, and Michael Jordan (who are all machine learning celebrities). LDA is an elegant statistical model and widely implemented.
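LDA's generative story can be sketched in a few lines of standard-library Python. The topics and word probabilities below are invented for illustration; real LDA infers them from data rather than having them given:

```python
import random

rng = random.Random(0)

# Invented per-topic word distributions (in real LDA these are learned).
TOPICS = {
    "sports":   {"goal": 0.5, "penalty": 0.3, "offside": 0.2},
    "politics": {"vote": 0.5, "senate": 0.3, "poll": 0.2},
}

def sample_dirichlet(alpha, k):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

def generate_document(n_words=8, alpha=0.5):
    """LDA's generative process: draw a topic mixture for the document,
    then for each word pick a topic from the mixture and a word from
    that topic's distribution."""
    topic_names = list(TOPICS)
    mixture = sample_dirichlet(alpha, len(topic_names))
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=mixture)[0]
        dist = TOPICS[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return mixture, words

mixture, words = generate_document()
print(mixture, words)
```

Fitting LDA runs this story in reverse: given only the documents, it infers the per-document topic mixtures and per-topic word distributions.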
Besides topics, the distribution of words can also be used for style analysis, since it essentially reflects a writer’s diction. To convey the same meaning, different writers choose distinct expressions, and researchers have used this idea to classify authorship with high accuracy.
Here, we can take a semi-supervised learning approach. First, we define a number of styles and manually label a reasonable amount of content accordingly. This labeled content provides training data for a supervised classification model. The cool thing is that we can then inspect what the model has learned and use that knowledge to compute style similarity.
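The supervised step could be as simple as a multinomial Naive Bayes classifier over word counts. The training snippets and style labels below are invented; in practice they would be manually labeled blog excerpts:

```python
import math
import re
from collections import Counter, defaultdict

# Invented hand-labeled snippets standing in for manually labeled content.
TRAIN = [
    ("the data suggest a 72 percent probability according to our model", "analytical"),
    ("we estimate the effect using regression on the polling average", "analytical"),
    ("what an absolutely unbelievable finish you have to see it", "conversational"),
    ("honestly this game was wild and i loved every minute", "conversational"),
]

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

# "Training" is just counting words per style label.
word_counts = defaultdict(Counter)
label_counts = Counter()
vocab = set()
for text, label in TRAIN:
    tokens = tokenize(text)
    word_counts[label].update(tokens)
    label_counts[label] += 1
    vocab.update(tokens)

def predict_style(text):
    """Return the most probable style label under Naive Bayes
    with add-one smoothing."""
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for token in tokenize(text):
            score += math.log((word_counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict_style("our model puts the probability at 60 percent"))
```

The learned per-style word distributions (`word_counts` here) are exactly the kind of model internals one can reuse to compute style similarity between unlabeled documents.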
Everything we have mentioned so far works at the level of individual words, which doesn’t fully account for linguistic features such as word order and sentence structure. State-of-the-art deep learning language models alleviate this issue.
The breakthrough of deep learning language models is mainly attributable to two things: word embeddings and recurrent neural networks. Words naturally live in a discrete space that is sparse and orthogonal (each word is its own dimension), which suffers severely from the curse of dimensionality. A word embedding is a mapping from this challenging space into a dense, continuous vector space in which related words end up close together. Recurrent neural networks capture language patterns beyond keywords, since entire sentences can be fed in as input sequences. Recurrent networks can also be chained in an encoder-decoder arrangement to produce output sequences, which is the basis of Google’s neural machine translation.
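The contrast between the two spaces is easy to see numerically. In the sparse one-hot space, every pair of distinct words is orthogonal; in a dense embedding space, related words have high cosine similarity. The embedding values below are invented (real embeddings are learned from a corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors: every pair of distinct words is orthogonal, so "goal"
# looks no closer to "penalty" than to "senate".
one_hot_goal    = [1, 0, 0]
one_hot_penalty = [0, 1, 0]
one_hot_senate  = [0, 0, 1]
print(cosine(one_hot_goal, one_hot_penalty))  # 0.0

# Toy dense embeddings (invented values): related words end up close.
emb = {
    "goal":    [0.9, 0.8, 0.1],
    "penalty": [0.8, 0.9, 0.2],
    "senate":  [0.1, 0.2, 0.9],
}
print(cosine(emb["goal"], emb["penalty"]) > cosine(emb["goal"], emb["senate"]))  # True
```

A real vocabulary makes the one-hot problem far worse: with 100,000 words, each one-hot vector has 100,000 dimensions, while learned embeddings typically use a few hundred.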
Luckily for us, a variety of open-source deep learning language models are available. They can help us improve both topic modeling and style analysis. Instead of clustering based on raw word counts, we can do the same thing in word-embedding space based on vectors; the transition from discrete to continuous is likely to yield better clusters. Beyond word choice, recurrent neural networks can take sentence structure into account, which is an important element of style. Recently, researchers have even experimented with transferring linguistic style between texts.
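Once documents are mapped to embedding vectors, the clustering itself can be standard k-means. A minimal sketch on invented 2-D document embeddings (real embeddings have hundreds of dimensions; requires Python 3.8+ for `math.dist`):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: repeatedly assign each point to its nearest
    centroid, then recompute each centroid as its cluster mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Invented 2-D "document embeddings": two soccer posts and two politics posts.
docs = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
clusters = kmeans(docs, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

Because the vectors are continuous, distances between documents vary smoothly, which is exactly why embedding-space clusters tend to be better than clusters built on sparse discrete counts.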
At Qchain, we will be using machine learning to explore content creator/publisher sorting, automated advertiser-creator/publisher matching, and bridging style analysis to style transfer. Of course, machine learning is difficult to apply well, so we will be careful not to make unrealistic claims or prematurely declare ourselves an “AI company.” This exploration will be gradual, and we will report our results as we rigorously test our methods.