
Boosting Sales With Machine Learning

How we use natural language processing to qualify leads

Per Harald Borgen
Jun 7, 2016 · 7 min read

The problem

It started with a request from Edvard, one of our business development representatives, who was tired of the tedious task of going through big Excel sheets filled with company names, trying to identify which ones we ought to contact.

An example of a list of potential companies to contact, pulled from

In essence, Xeneta helps companies that ship containers discover saving potential by providing sea freight market intelligence.

This customer had a 748K USD saving potential down to market average on its sea freight spend.
This widget compares a customer's contracted rate (purple line) to the market average (green graph) for 20-foot containers from China to Northern Europe.
  • Freight forwarding
  • Chemicals
  • Consumer & Retail
  • Low paying commodities

The hypothesis

Though the broad range of customers represents a challenge when finding leads, we’re normally able to tell if a company is of interest for Xeneta by reading their company description, as it often contains hints of whether or not they’re involved in sending stuff around the world.

Given a company description, can we train an algorithm to predict whether or not it’s a potential Xeneta customer?

If so, this algorithm could prove a huge time saver for the sales team, as it could roughly sort the Excel sheets before they start qualifying the leads manually.

The development

As I started working on this, I quickly realised that the machine learning part wouldn't be the only problem. We also needed a way to get hold of the company descriptions.

  • Loop through the search result and find the most likely correct URL
  • Use this URL to query the FullContact API
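
How to find "the most likely correct URL" isn't spelled out here. One plausible heuristic, and purely an assumption on my part, is to compare each result's domain with the company name and keep the closest match. The function names and sample URLs below are hypothetical and are not taken from the actual script.

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse


def domain_label(url):
    """Return the first label of the URL's host, ignoring a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.split(".")[0]


def pick_best_url(company_name, urls):
    """Hypothetical heuristic: pick the URL whose domain looks most like the name."""
    target = "".join(ch for ch in company_name.lower() if ch.isalnum())

    def similarity(url):
        label = domain_label(url).replace("-", "")
        return SequenceMatcher(None, target, label).ratio()

    return max(urls, key=similarity)


# Made-up search results for a made-up company.
search_results = [
    "https://en.wikipedia.example/wiki/Acme_Corporation",
    "https://www.acme-shipping.example/about",
    "https://www.linkedin.example/company/acme-shipping",
]
best_url = pick_best_url("Acme Shipping", search_results)
```

A string-similarity score like this is crude, so in practice you would probably want a minimum threshold below which the company is skipped rather than matched to the wrong site.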

The dataset

Having these scripts in place, the next step was to create our training dataset. It needed to contain at least 1000 qualified companies and 1000 disqualified companies.
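
As a rough sketch of what such a dataset could look like in code, the snippet below labels the two groups and holds out part of the data for testing. The example descriptions, the 80/20 split and the use of scikit-learn's `train_test_split` are my assumptions for illustration only.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: the real dataset would hold ~1000 descriptions per class.
qualified = ["global exporter of frozen seafood", "importer of furniture from asia"] * 5
disqualified = ["local law firm", "web design agency for startups"] * 5

descriptions = qualified + disqualified
labels = [1] * len(qualified) + [0] * len(disqualified)  # 1 = qualified, 0 = disqualified

# Hold out 20% for testing; stratify keeps the class balance the same in both parts.
train_texts, test_texts, Y_train, Y_test = train_test_split(
    descriptions, labels, test_size=0.2, stratify=labels, random_state=0
)
```

Keeping the two classes roughly the same size, as we did, means plain accuracy is a reasonable first metric.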

Cleaning the data

With that done, it was time to start writing the natural language processing script, with step one being to clean up the descriptions, as they are quite dirty and contain a lot of irrelevant information.

An example of a raw description.


The first thing we do is use regular expressions to get rid of non-alphabetical characters, as our model will only be able to learn words.

import re
description = re.sub("[^a-zA-Z]", " ", description)
After removing non-alphabetical characters.
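
To make the effect concrete, here is a small self-contained sketch of the same regular expression applied to a made-up description. The lower-casing and the split into words are my additions, and `clean_description` is a hypothetical helper name.

```python
import re


def clean_description(raw):
    """Replace every non-letter with a space, lower-case, and split into words."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw)
    return letters_only.lower().split()


raw = "Founded in 1998, Acme Ltd. ships 20,000+ containers/year!"
words = clean_description(raw)
# words == ['founded', 'in', 'acme', 'ltd', 'ships', 'containers', 'year']
```

Note that the character class only covers a-z, so accented letters in non-English company names would be dropped as well.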


We also stem the words. This means reducing multiple variations of the same word to a common stem. So instead of treating words like manufacture, manufacturer, manufactured and manufacturing as different words, we simplify them all to manufactur.

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
# getDescription() is assumed to return the description as a list of words
description = getDescription()
description = [stemmer.stem(word) for word in description]
After stemming the words.
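
If you want to check this behaviour yourself, the following sketch runs a handful of variants through the same stemmer. The word list is my own example, and I'd expect the variants to collapse into a single stem.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

variants = ["manufacture", "manufacturer", "manufactured", "manufacturing"]
stems = [stemmer.stem(word) for word in variants]

# If stemming works as intended, all variants share one stem.
unique_stems = set(stems)
```

A stem doesn't need to be a real word. It only has to be the same for all variations, so the model counts them as one feature.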

Stop words

We then remove stop words, using the Natural Language Toolkit (NLTK). Stop words are words that have little relevance for the conceptual understanding of the text, such as is, to, for, at, I, it etc.

from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
# getDescription() is assumed to return the description as a list of words
description = getDescription()
description = [word for word in description if word not in stopWords]
After removing stop words.
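
One practical detail: NLTK's stop word list is a separate corpus that has to be downloaded once, with `nltk.download("stopwords")`, before it can be loaded. The sketch below shows the filtering on a made-up sentence and falls back to a tiny hand-written list if the corpus is missing. The fallback list and the `load_stop_words` helper are my own illustration.

```python
# A tiny stand-in list, used only if NLTK or its stop word corpus is unavailable.
FALLBACK_STOP_WORDS = {"is", "to", "for", "at", "i", "it", "a", "the", "of", "and"}


def load_stop_words():
    """Load NLTK's English stop words, or the stand-in list if that fails."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # LookupError means the corpus has not been downloaded yet.
        return FALLBACK_STOP_WORDS


stop_words = load_stop_words()
words = ["it", "is", "a", "supplier", "of", "steel", "to", "the", "industry"]
filtered = [word for word in words if word not in stop_words]
# filtered == ['supplier', 'steel', 'industry']
```

Storing the stop words in a set keeps each lookup cheap, which matters once you filter thousands of descriptions.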

Transforming the data

But cleaning and stemming the data won’t actually help us do any machine learning, as we also need to transform the descriptions into something the machine understands, which is numbers.

Bag of Words

For this, we're using the Bag of Words (BoW) approach. If you're not familiar with BoW, I'd recommend reading this Kaggle tutorial.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
# Learn the vocabulary from the training data before transforming it
vectorizer.fit(training_data)
vectorized_training_data = vectorizer.transform(training_data)
An example of a very small (35 items) Bag of Words vector. (Ours is 5K items long).
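
If the idea is new to you, this is roughly what happens under the hood, written out in plain Python on three made-up documents. It is a simplified sketch of the concept, not scikit-learn's implementation.

```python
from collections import Counter

documents = [
    ["ship", "container", "china"],
    ["law", "firm", "oslo"],
    ["container", "ship", "ship"],
]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for doc in documents for word in doc})


def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


vectors = [to_vector(doc) for doc in documents]
# vocabulary == ['china', 'container', 'firm', 'law', 'oslo', 'ship']
# vectors[2] == [0, 1, 0, 0, 0, 2]
```

Every document ends up as a vector of the same length, one slot per vocabulary word, and word order within the document is thrown away. That is where the "bag" in the name comes from.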

Tf-idf Transformation

Finally, we also apply a tf-idf transformation, which is short for term frequency-inverse document frequency. It's a technique that adjusts the importance of the different words in your documents: words that appear in almost every description get less weight, while rarer words get more.

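from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm='l1')
# Learn the idf weights from the training data before transforming it
tfidf.fit(vectorized_training_data)
tfidf_vectorized_data = tfidf.transform(vectorized_training_data)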
The vector after applying tf-idf. (Sorry about the bad formatting)
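
To show what the transformation does to the numbers, here is a hand-rolled sketch on a toy count matrix. It uses the smoothed idf formula described in scikit-learn's documentation, idf = ln((1 + n) / (1 + df)) + 1, followed by L1 normalisation. The toy data is made up, and the code is only meant as an illustration of the idea.

```python
import math

# Toy count vectors (rows = documents, columns = words), e.g. from a Bag of Words step.
vocabulary = ["container", "ship", "lawyer"]
counts = [
    [1, 2, 0],
    [1, 0, 3],
    [1, 1, 0],
]

n_docs = len(counts)
# Document frequency: in how many documents each word appears at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
# Smoothed idf: words found in fewer documents get a larger weight.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_l1(row):
    """Weight raw counts by idf, then scale the row so its values sum to 1."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    total = sum(weighted)
    return [value / total for value in weighted] if total else weighted


tfidf = [tfidf_l1(row) for row in counts]
```

In the third document, "container" and "ship" both occur once, but "ship" ends up with the higher weight because it appears in fewer documents overall.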

The algorithm

After all the data has been cleaned, vectorised and transformed, we can finally start doing some machine learning, which is one of the simplest parts of this task.

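from sklearn.ensemble import RandomForestClassifier

def runForest(X_train, X_test, Y_train, Y_test):
    # Train a random forest and return its accuracy on the test set
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(X_train, Y_train)
    score = forest.score(X_test, Y_test)
    return score

forest_score = runForest(X_train, X_test, Y_train, Y_test)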
  • Gram Range: size of phrases to include in Bag of Words (currently 1–3, meaning up to three-word phrases)
  • Estimators: number of estimators to include in the Random Forest (currently 90)
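
As a sketch of where these two parameters would plug in, the snippet below chains the steps into a scikit-learn `Pipeline`. Using a pipeline at all, the toy training texts and the fixed `random_state` are my assumptions; only the parameter values come from the list above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # ngram_range=(1, 3): single words plus two- and three-word phrases.
    ("bow", CountVectorizer(analyzer="word", ngram_range=(1, 3), max_features=5000)),
    ("tfidf", TfidfTransformer(norm="l1")),
    # n_estimators=90: the number of trees in the forest.
    ("forest", RandomForestClassifier(n_estimators=90, random_state=0)),
])

# Placeholder training data; 1 = qualified, 0 = disqualified.
texts = [
    "global exporter of frozen seafood",
    "importer of furniture from asia",
    "local law firm",
    "web design agency for startups",
]
labels = [1, 1, 0, 0]

pipeline.fit(texts, labels)
prediction = pipeline.predict(["exporter of furniture"])
```

A pipeline also makes sure the vocabulary and idf weights are learned from the training data only, and then reused unchanged on new descriptions.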

The road ahead

However, the script is by no means finished. There are tons of ways to improve it. For example, the algorithm is likely to be biased towards the kind of descriptions we currently have in our training data. This might become a performance bottleneck when testing it on more real-world data.

  • Test other types of data transformation (e.g. word2vec)
  • Test other ML algorithms (e.g. neural nets)
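
For the last point, a comparison could be set up along these lines. This is only a sketch of how I might approach it: the toy data, the choice of `MLPClassifier` as the neural net and the two-fold cross-validation are all assumptions, and nothing here has been run on our real dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Placeholder data; 1 = qualified, 0 = disqualified.
texts = [
    "global exporter of frozen seafood",
    "importer of furniture from asia",
    "manufacturer shipping steel worldwide",
    "retailer importing consumer electronics",
    "local law firm",
    "web design agency for startups",
    "accounting services for small businesses",
    "recruitment consultancy in oslo",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "random forest": RandomForestClassifier(n_estimators=90, random_state=0),
    "neural net": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # TfidfVectorizer combines the Bag of Words and tf-idf steps in one object.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    scores[name] = cross_val_score(pipeline, texts, labels, cv=2).mean()
```

With a dataset this small the scores are meaningless, but the same loop works unchanged on the full set of descriptions.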

