How To Build a Machine Learning Industry Classifier
A Multi-class Text Classification Approach
Martech harvests the best that technology has to offer to accomplish measurable marketing goals, and as such it offers fertile ground for exploration.
In this article, we are excited to share our hard-earned insights on building an industry classifier. We show how we trained a model to make educated guesses above set probability thresholds and classify miscellaneous emails into their respective categories automatically, in a fraction of the time it would take to do this manually, and with far fewer errors.
Motivation
We deliver millions of emails every day, and what used to be a spaghetti of unsorted industry email paths now forms a neat hub of email paths classified by industry. To build the model, we first needed to educate our algorithm: emails in which words like “dress”, “fashion” or “trends” appeared qualified for a higher weighting factor for the Fashion industry rather than Design or Technology. What remained to be discovered, though, was which input could guarantee a solid classification: email content, website content, or a combination of the two?
Employing artificial intelligence techniques and algorithms, we built a robust industry classifier. By harnessing computer science, data analysis and pattern recognition, we were able to build another high-level application based on machine learning.
Founded upon the premises of big data and deep learning, machine learning enables us to go beyond explicitly programming computers to perform certain actions: it empowers us to teach them how to make decisions automatically.
Training Data
In order to create such an algorithm, we need to teach our program what model email content looks like for every given industry. Once we have trained our model using email content, website content, and the combination of the two, we will test the accuracy of each approach.
Text classification is a form of supervised learning. The objective is to break down an entire text into its components and identify patterns from which classification rules can be generated automatically. To achieve that, a set of training documents with known categories serves as the basis for generating the classifier. At the end of this process, any miscellaneous document should fall into the right category. Using a set of training documents D = { d1, d2, d3, …, dn } labeled with known categories C = { c1, c2, c3, …, cm }, we predict the category of a new, unlabeled document q.
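Concretely, the training set is nothing more than a list of document–category pairs. Here is a toy illustration; the texts and labels below are made up for the example:

```python
# Toy illustration of a labeled training set D with known categories C.
training_data = [
    ("new summer dresses and fashion trends", "Fashion"),
    ("latest laptop and smartphone reviews", "Technology"),
    ("minimalist interiors and graphic design ideas", "Design"),
]
# After training on such pairs, the classifier predicts the category of a new,
# unlabeled document q, e.g. q = "lightweight t-shirts for the new season".
```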
How do we distinguish between words, though? As explained above, the basic idea behind the algorithm is to find words that are semantically emblematic of an industry. In practice there are both simple and complex words: simple words are omnipresent across various types of text, whereas complex words are characteristic of one topic and highly relevant to the thematic content of a specific area, field or discipline. Without having to process an entire text to manually determine the industry category it falls into, several words serve as signifiers of each industry (the signified). For instance, words like “laptop” are commonly encountered in Technology and “t-shirt” in Fashion, so these words become the focus of our research.
Methodology
Data Gathering
The first step is to collect our data from the web. To extract information from websites, we have to build a web scraper that fetches pages and collects the useful parts. Python makes it really simple to pull data out of HTML and XML files with web-scraping frameworks such as Beautiful Soup, Cola, PySpider and Scrapy.
On the web, head sections, tags and text content can provide a good amount of quality information about the industry. Email content, on the other hand, has a huge disadvantage: it is not a consistently reliable source of input. More often than not, emails either lack informative content or are too general to be useful for our research. This can hamper the learning process, although in some cases emails do reveal more complex patterns in the data or directly give away the cluster to which they belong.
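To make the scraping step concrete, here is a minimal sketch using requests and Beautiful Soup. The URL and the choice of fields (title, meta description, visible text) are illustrative assumptions, not our production scraper.

```python
# Minimal scraping sketch: fetch a page and keep the parts that describe the business.
import requests
from bs4 import BeautifulSoup

def scrape_site_text(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Drop script/style blocks so that only human-readable text remains.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    meta = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "description": meta["content"] if meta and meta.has_attr("content") else "",
        "body_text": " ".join(soup.get_text(separator=" ").split()),
    }

# Hypothetical usage:
# features = scrape_site_text("https://example-fashion-store.com")
```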
Text Processing
Data cleaning, standardization and processing are core elements of machine learning. In order to decrease the “noise” in our data and boost the quality of learning, we must remove the HTML tags and then apply natural language processing (NLP) techniques. To achieve that, we employ the NLTK library for Python. First, we tokenize the text, that is, split it into individual words. Then we remove stop words, such as “and”, “the”, etc., which carry little weight as signifiers. Finally, we stem each word by removing or replacing prefixes and suffixes in order to reduce words to a common root and decrease the size of our vocabulary.
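A minimal sketch of these three steps with NLTK follows; the sample sentence is only there to show the effect of each step.

```python
# Tokenize, remove stop words and stem with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    tokens = word_tokenize(text.lower())                 # split into individual words
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation and numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove "and", "the", ...
    return [stemmer.stem(t) for t in tokens]             # reduce words to a common root

print(preprocess("The latest fashion trends: dresses for summer."))
# e.g. ['latest', 'fashion', 'trend', 'dress', 'summer']
```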
Bag of Words
Now that we have collected a number of words, the next step is to reduce the complexity of the labeled set and extract useful information. This can be achieved by building a matrix of words (represented as columns) and documents (represented as rows). The simplest and most efficient way to complete this task is Bag of Words (BoW) with tf-idf (term frequency-inverse document frequency). More specifically, we employ a weighted BoW: a plain binary BoW only records whether a word is present (0 or 1), whereas we need varying weights to measure how relevant a word is to one industry or another.
This is why the BoW model is a very common solution to natural language processing (NLP) problems. The idea is to represent every document as a BoW vector and group similar vectors together.
BoW takes all the unique words in a document and puts them in a list. Tf-idf then works as a numeric measure that scores the importance of a word across a set of documents, assigning values between 0.0 and 1.0, from the least important words to the most important ones.
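In scikit-learn this takes only a few lines; here is a sketch with a toy corpus, where the documents stand in for scraped website and email content:

```python
# Weighted bag of words with tf-idf: rows are documents, columns are vocabulary words.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "dress fashion trend summer collection",
    "laptop processor software technology release",
    "dress shirt fashion sale discount",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the vocabulary (one column per word)
print(X.toarray())                         # tf-idf weights between 0.0 and 1.0
```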
Dimensionality Reduction
To avoid overfitting, namely the model clinging to the training data and being unable to generalize outside the training set, we have to reduce the dimensionality of the feature space of our sample. At the same time, we must keep the most valuable, relevant information in order to improve the scalability, efficiency and accuracy of our classifier. This is where Singular Value Decomposition (SVD) comes in. We perform SVD on our training set to go from n features to k features (with n > k), where n is the number of words in the vocabulary and k the number of most informative dimensions we keep.
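A sketch of this step with scikit-learn's TruncatedSVD, the usual way to apply SVD to sparse tf-idf matrices; the corpus and the value of k are toy choices:

```python
# Truncated SVD: project the tf-idf matrix from n vocabulary features down to k latent features.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "dress fashion trend summer collection",
    "laptop processor software technology release",
    "dress shirt fashion sale discount",
    "smartphone laptop gadget review technology",
]
X = TfidfVectorizer().fit_transform(corpus)

k = 2                                        # in production k would be far larger, but still k << n
svd = TruncatedSVD(n_components=k, random_state=42)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)        # (4, n_words) -> (4, 2)
print(svd.explained_variance_ratio_.sum())   # how much information the k components retain
```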
Below we see our industry clusters after SVD:
Classification
Text classification problems are often linearly separable, and a linear kernel works impressively well with a large number of features. Because mapping the data to a higher-dimensional space rarely improves performance in this setting, a Linear SVC is the right tool for the job.
We preferred “probabilistic” classification over “hard” classification, so we predicted the probability of a client belonging to each industry and evaluated the classification results across three probability thresholds, namely 60%, 70% and 80%.
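One way to obtain those probabilities, since scikit-learn's LinearSVC does not expose predict_proba directly, is to calibrate its decision scores. The sketch below makes that assumption and uses toy feature vectors in place of the real tf-idf + SVD output:

```python
# Probabilistic linear SVC: calibrate decision scores into probabilities, then apply a threshold.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Toy reduced feature vectors and their known industry labels.
X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
                    [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
y_train = np.array(["Fashion", "Fashion", "Fashion",
                    "Technology", "Technology", "Technology"])

clf = CalibratedClassifierCV(LinearSVC(), cv=3)   # LinearSVC alone has no predict_proba
clf.fit(X_train, y_train)

X_new = np.array([[0.82, 0.18], [0.5, 0.5]])
probs = clf.predict_proba(X_new)                  # one probability per industry, per document
confidence = probs.max(axis=1)
predicted = clf.classes_[probs.argmax(axis=1)]

THRESHOLD = 0.80                                  # we also evaluated 0.60 and 0.70
labels = np.where(confidence >= THRESHOLD, predicted, "unclassified")
print(list(zip(predicted, confidence.round(2), labels)))
```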
Evaluation table
The table provides an overview of the three probability thresholds for each input source. Email content alone does not score as highly as the other two input sources and cannot serve as a reliable basis for high-probability classification, since it rarely provides quality data. Probabilities are high when email and website content are combined, but website content on its own proves the safer choice: although it trails the combined input at the 60% threshold, that difference is negligible because its stronger performance at the 80% threshold counterbalances it. In this way we managed to correctly classify more complex structures.
The Model
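As a sketch of the model, here is one way the pieces described above could be chained together as a scikit-learn Pipeline. The component choices mirror the steps in this article, but the parameter values are placeholders rather than our tuned settings.

```python
# Sketch of the full industry classifier as a scikit-learn Pipeline:
# weighted bag of words -> SVD dimensionality reduction -> calibrated linear SVC.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

industry_classifier = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),   # BoW with tf-idf weights
    ("svd", TruncatedSVD(n_components=100)),            # keep the k most informative features
    ("clf", CalibratedClassifierCV(LinearSVC())),       # probabilistic linear SVC
])

# texts: preprocessed website/email content; industries: their known labels.
# industry_classifier.fit(texts, industries)
# probabilities = industry_classifier.predict_proba(new_texts)
```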
Conclusion
Congratulations! You have officially taught your model to distinguish and categorize. But, as in real life, learning is an ongoing process of continually enriching the model with new input. The preprocessing and the training sample are key to the success of an accurate classifier. To improve it further, we highly recommend building a quality dataset and tuning parameters with optimization techniques.