Boosting Sales With Machine Learning

How we use natural language processing to qualify leads

Per Harald Borgen
Jun 7, 2016 · 7 min read

In this blog post I’ll explain how we’re making our sales process at Xeneta more effective by training a machine learning algorithm to predict the quality of our leads based upon their company descriptions.

Head over to GitHub if you want to check out the script immediately, and feel free to suggest improvements as it’s under continuous development.

The problem

An example of a list of potential companies to contact.

This kind of pre-qualification of sales leads can take hours, as it forces the sales representative to figure out what every single company does (e.g. by reading about them on LinkedIn) so that he/she can make a qualified guess at whether or not the company is a good fit for our SaaS app.

And how do you make a qualified guess? To understand that, you’ll first need to know what we do:

In essence, Xeneta helps companies that ship containers discover saving potential by providing sea freight market intelligence.

This customer had a 748K USD saving potential down to market average on its sea freight spend.

More specifically, if your company ships above 500 containers per year, you’re likely to discover significant saving potential by using Xeneta, as we’re able to tell you exactly where you’re paying above the market average price.

This widget compares a customer’s contracted rate (purple line) to the market average (green graph) for 20 foot containers from China to Northern Europe.

This means that our target customers are vastly different from each other, as their only common denominator is that they’re somewhat involved in sea freight. Here are some examples of company categories we target:

  • Automotive
  • Freight forwarding
  • Chemicals
  • Consumer & Retail
  • Low paying commodities

The hypothesis

This made us think:

Given a company description, can we train an algorithm to predict whether or not it’s a potential Xeneta customer?

If so, this algorithm could prove a huge time saver for the sales team, as it could roughly sort the Excel sheets before they start qualifying the leads manually.

The development

We considered crawling the companies’ websites and fetching the About us section. But this smelled like a messy, unpredictable and time-consuming activity, so we started looking for APIs to use instead. After some searching we discovered FullContact, which has a Company API that provides you with descriptions of millions of companies.

However, their API only accepts company URLs as inputs, which are rarely present in our Excel sheets.

So we had to find a way to obtain the URLs as well, which made us land on the following workflow:

  • Using the Google API to google the company name (hacky, I know…)
  • Loop through the search result and find the most likely correct URL
  • Use this URL to query the FullContact API

There’s of course a loss at each step here, so we’re going to find a better way of doing this. However, this worked well enough to test the idea out.
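The middle step — picking the most likely URL out of the search results — is the only part that is pure logic, so here is a sketch of how such a ranking could look. The scoring rule below is an illustrative assumption on my part, not the exact heuristic we use:

```python
from urllib.parse import urlparse

def most_likely_url(company_name, result_urls):
    """Pick the search result whose domain best matches the company name.

    Scores each candidate by comparing the domain label to the
    (lowercased, alphanumeric-only) company name; ties go to the
    earlier search result. This is a toy heuristic for illustration.
    """
    name = "".join(ch for ch in company_name.lower() if ch.isalnum())
    best_url, best_score = None, -1
    for url in result_urls:
        domain = urlparse(url).netloc.lower().replace("www.", "")
        label = domain.split(".")[0].replace("-", "")
        if name == label:
            score = 2          # exact match on the domain label
        elif label in name or name in label:
            score = 1          # partial match
        else:
            score = 0
        if score > best_score:
            best_url, best_score = url, score
    return best_url

# Example: the company's own domain should win over directories.
most_likely_url("Acme Shipping",
                ["",
                 "",
                 ""])
```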

The dataset

The first category was easy, as we could simply export a list of 1000 Xeneta users from SalesForce.

Finding 1000 disqualified companies was a bit tougher though, as we don’t keep track of the companies we’ve avoided contacting. So Edvard manually disqualified 1000 companies.

Cleaning the data

In the examples below, I’ll go through each of the cleaning techniques we’re currently applying, and show you how a raw description ends up as an array of numbers.

An example of a raw description.


Regular expressions

import re
description = re.sub("[^a-zA-Z]", " ", description)

After removing non-alphabetical characters.


Stemming

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
description = getDescription()
description = [stemmer.stem(word) for word in description]
After stemming the words.

Stop words

from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
description = getDescription()
description = [word for word in description if word not in stopWords]
After removing stop words.
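Putting the three steps together, here is a dependency-free sketch of the whole cleaning pipeline. Two assumptions to note: the stop-word set below is a tiny illustrative subset of NLTK’s English list, and naive_stem is a crude suffix-stripper standing in for SnowballStemmer('english'):

```python
import re

# Tiny illustrative subset of NLTK's English stop-word list.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "is", "we"}

def naive_stem(word):
    """Crude suffix-stripping stand-in for SnowballStemmer('english')."""
    for suffix in ("ing", "ers", "er", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_description(description):
    """Raw description -> list of lowercase, stemmed, non-stop-word tokens."""
    description = re.sub("[^a-zA-Z]", " ", description)  # letters only
    words = description.lower().split()
    words = [naive_stem(w) for w in words]               # stem each word
    return [w for w in words if w not in STOP_WORDS]     # drop stop words

clean_description("We are the leading shippers of chemicals!")
```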

Transforming the data

Bag of Words

BoW is a simple technique to turn text phrases into vectors, where each item in the vector represents a specific word. Scikit-learn’s CountVectorizer gives you a super simple way to do this:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', max_features=5000)
vectorized_training_data = vectorizer.fit_transform(training_data)

The max_features parameter tells the vectorizer how many words you want in the vocabulary. In this example, the vectorizer will include the 5000 words that occur most frequently in our dataset and reject the rest.

An example of a very small (35 items) Bag of Words vector. (Ours is 5K items long).
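To make the idea concrete, here is a pure-Python sketch of what CountVectorizer does under the hood — a toy re-implementation for intuition, not the real thing:

```python
from collections import Counter

def bag_of_words(documents, max_features=5):
    """Build a vocabulary of the most frequent words, then represent
    each document as a vector of word counts over that vocabulary."""
    totals = Counter(word for doc in documents for word in doc.split())
    # Keep the max_features most common words, sorted for a stable order.
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Two toy "descriptions"; each row counts occurrences per vocabulary word.
vocab, vectors = bag_of_words(["sea freight rates",
                               "sea freight data data"], max_features=3)
```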

Tf-idf Transformation

Tf-idf will emphasise words that occur frequently in a description (term frequency), while de-emphasising words that occur frequently across the entire dataset (inverse document frequency).

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm='l1')
tfidf_vectorized_data = tfidf.transform(vectorized_training_data)

Again, scikit learn saves the day by providing tf-idf out of the box. Simply fit the model to your vectorized training data, and then use the transform method to transform it.
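For intuition, here is the arithmetic sketched in pure Python. It uses the smoothed idf formula that scikit-learn applies by default (idf = ln((1 + n) / (1 + df)) + 1), followed by l1 normalisation so each row sums to one — a minimal sketch, not a drop-in replacement for TfidfTransformer:

```python
from math import log

def tfidf_l1(count_vectors):
    """Tf-idf with smoothed idf and l1 row normalisation."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for row in count_vectors if row[j] > 0)
          for j in range(n_terms)]
    idf = [log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    result = []
    for row in count_vectors:
        weighted = [tf * idf[j] for j, tf in enumerate(row)]
        total = sum(abs(w) for w in weighted) or 1.0   # l1 norm
        result.append([w / total for w in weighted])
    return result

# Term 0 appears in both documents, term 1 only in the first,
# so term 1 gets the higher weight in document 0.
rows = tfidf_l1([[1, 1], [1, 0]])
```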

The vector after applying tf-idf. (Sorry about the bad formatting)

The algorithm

I first sliced the data into 70% training data and 30% testing data, and then started off with two scikit learn algorithms: Random Forest (RF) and K Nearest Neighbors (KNN). It quickly became clear that RF outperformed KNN, as the former quickly reached more than 80% accuracy while the latter stayed at 60%.

Fitting a scikit learn model is super simple:

from sklearn.ensemble import RandomForestClassifier

def runForest(X_train, X_test, Y_train, Y_test):
    forest = RandomForestClassifier(n_estimators=100)
    forest =, Y_train)
    score = forest.score(X_test, Y_test)
    return score

forest_score = runForest(X_train, X_test, Y_train, Y_test)

So I continued with RF to see how much I could increase the accuracy by tuning the following parameters:

  • Vocabulary: how many words the CountVectorizer includes in the vocabulary (currently 5K)
  • Gram Range: size of phrases to include in the Bag of Words (currently 1–3, meaning up to three-word phrases)
  • Estimators: number of estimators to include in the Random Forest (currently 90)

With these parameters tuned, the algorithm reaches an accuracy of 86.4% on the testing dataset, and is actually starting to become useful for our sales team.
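Those three knobs map directly onto scikit-learn parameters (CountVectorizer’s max_features and ngram_range, and RandomForestClassifier’s n_estimators), so one standard way to tune them — not necessarily how we did it — is a grid search over the whole pipeline. The corpus below is made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("bow", CountVectorizer(analyzer="word")),
    ("tfidf", TfidfTransformer(norm="l1")),
    ("forest", RandomForestClassifier()),
])

# The three knobs from the list above.
param_grid = {
    "bow__max_features": [1000, 5000],
    "bow__ngram_range": [(1, 1), (1, 3)],
    "forest__n_estimators": [90, 100],
}

# Toy stand-in corpus; the real company descriptions are private.
docs = ["sea freight shipping containers"] * 6 + ["bakery bread and cakes"] * 6
labels = [1] * 6 + [0] * 6

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_score_, search.best_params_)
```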

The road ahead

Here are a few activities we’re considering for the road ahead:

  • Get more data (scraping, other APIs, improved data cleaning)
  • Test other types of data transformation (e.g. word2vec)
  • Test other ML algorithms (e.g. neural nets)

We’ll be pushing to GitHub regularly if you want to follow the progress. And feel free to leave a comment below if you have anything you’d like to add.



Thanks for reading! We are Xeneta — the world’s leading sea freight intelligence platform. We’re always looking for bright minds to join us, so head over to our website if you’re interested!

You can follow us on both Twitter and Medium.

