Basic text classification pipeline and deployment

Lehar Oha
Mar 27, 2019 · 5 min read

TL;DR: When data is missing, a fairly common task is to build a binary classification model that infers the customer type from the customer’s name. Such models can be built in a variety of ways. In this brief walk-through we look at three distinct approaches, each with its own benefits and drawbacks. We use data from the Estonian Tax and Customs Board and deploy our final model as an open web application.

Problem statement

For most incoming bank transactions the bank can tell whether a payment comes from a private person or from an institutional (e.g. corporate) payer. Here we look at the case where it cannot.

Being able to create such an indicator/flag is important for the following reasons:

  • Usage in credit models & strategic planning
  • Usage in income detection
  • Usage in payment classification (e.g. salary metric)
  • Creating better guarantees for payment comment trustworthiness (limit adversarial attacks)
  • Overall improvement of downstream use cases

Data and preprocessing

We build a binary text classification model on public data from the Estonian Tax and Customs Board (EMTA) and aim to predict from an entity’s name whether it is a private person or an institution. Source data can be found here (we take the Excel files for the last 8 quarters).

Selected sample of data:

[Image: selected sample of the data]

To prepare the data for modeling and reduce its dimensionality, we apply some preprocessing to name and obtain clean_name, which is what is actually used to predict the label/target is_institution. The label is derived by treating all self_employed entities as privates; institutions are encoded as 1 and privates as 0.

Below we also list some preprocessing steps:

  • Unifying long names among common non-privates (e.g. aktsiaselts -> as)
  • Transliterating Russian strings
  • Replacing digits with a common numeric tag
  • Replacing punctuation with space
  • Joining company types to one token (e.g. G.M.B.H -> GMBH)
  • Normalizing Unicode
  • Replacing repeating spaces
  • Lower-casing and removing one letter tokens
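The steps above could be sketched roughly as follows. This is a minimal illustration, not the original implementation: the function name clean_name is borrowed from the column name, while the numeric tag, the short-form mapping and the tiny Cyrillic table are assumptions made only for this example.

```python
import re
import unicodedata

# Hypothetical mapping of long legal forms to short ones (one example from the text).
LONG_TO_SHORT = {"aktsiaselts": "as"}
# Deliberately tiny, illustrative Cyrillic-to-Latin table (lower-case letters only).
CYRILLIC = str.maketrans({"а": "a", "б": "b", "в": "v", "к": "k", "н": "n", "о": "o"})
NUM_TAG = "numtag"  # assumed name for the common numeric tag


def clean_name(name: str) -> str:
    """Apply the listed preprocessing steps to a raw entity name."""
    text = unicodedata.normalize("NFKC", name).lower()  # normalize Unicode, lower-case
    text = text.translate(CYRILLIC)                      # transliterate Russian strings
    # Join dotted abbreviations into one token, e.g. g.m.b.h -> gmbh.
    text = re.sub(r"\b(?:\w\.){2,}\w?", lambda m: m.group(0).replace(".", ""), text)
    text = re.sub(r"\d+", f" {NUM_TAG} ", text)          # digits -> common numeric tag
    text = re.sub(r"[^\w\s]|_", " ", text)               # punctuation -> space
    tokens = [LONG_TO_SHORT.get(t, t) for t in text.split()]  # unify long names
    tokens = [t for t in tokens if len(t) > 1]           # drop one-letter tokens
    return " ".join(tokens)                              # also collapses repeated spaces
```

Calling clean_name on the imaginary name used later in this post, ‘KALASAARE VETIKMÕISA, OÜ (PANKROTIS)’, gives a lower-cased, punctuation-free string of four tokens.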

For our final analysis we have a data-frame with ~157k rows and 2 columns.

A quick look at the data reveals that only around 4% of names belong to private entities and over 80% are local Estonian businesses. Also note that in the original data set the non_resident type is a mix of privates and non-privates, which ideally should be corrected.

We also split the data into train, validation and test sets (weights: 60%, 20% and 20%).
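One way to obtain such a 60/20/20 split with scikit-learn is to split twice. The sketch below assumes stratification on the label (a reasonable choice given that only about 4% of names are private, but not something stated above); the function name and the seed are illustrative.

```python
from sklearn.model_selection import train_test_split


def split_60_20_20(names, labels, seed=42):
    """Split into train/validation/test with weights 60/20/20."""
    # First hold out 40% of the data, keeping the class ratio (assumed stratification).
    x_train, x_rest, y_train, y_rest = train_test_split(
        names, labels, test_size=0.4, stratify=labels, random_state=seed)
    # Then halve the held-out part into validation and test sets.
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_val, y_val), (x_test, y_test)
```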

Fun and light deviation

Before we go on to build our model, let’s look at the taxpayers’ names (particularly companies), flag whether a company is bankrupt (e.g. imaginary name string: ‘KALASAARE VETIKMÕISA, OÜ (PANKROTIS)’) and see which tokens are most associated with those companies (note that we omitted company types such as ‘OÜ’ and ‘AB’ from the names):

  • grupp (EN: group)
  • eesti (EN: estonia(n))
  • ehitus (EN: construction)
  • invest
  • transport
  • baltic
  • group

So the lesson here (to be taken with a strong smile & some doubt) is that if you want to stay in business, then do not name your company:

Ehitus & Transport Grupp Invest OÜ

Implementation toolbox

For the implementation we use Python 3.6 and libraries such as numpy, pandas, scikit-learn, gensim, flask and requests, to name a few. We highly appreciate their contributions to open source software.

Rule-based model

This model is rule-based: we look at the most common tokens associated with institutions, for example:

  • Name might contain explicit numbers (e.g. 42)
  • Name might contain common types of business structures: ‘AS’, ‘AB’, ‘OY’ etc.
  • Name might contain local government related tokens: ‘vallavalitsus’ (rural municipality), ‘linnavalitsus’ etc.
  • Name contains non-profit related tokens: ‘koü’, ‘mtü’ (non-profit acronym) etc.
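As a rough illustration, the rules above can be written as simple pattern matching. The token set below contains only the examples listed here, and the function name is made up for this sketch; the real rule set was presumably larger.

```python
import re

# Only the example tokens from the list above; a real rule set would be longer.
BUSINESS_TYPES = {"as", "ab", "oy"}
GOVERNMENT = {"vallavalitsus", "linnavalitsus"}
NON_PROFIT = {"koü", "mtü"}
INSTITUTION_TOKENS = BUSINESS_TYPES | GOVERNMENT | NON_PROFIT


def is_institution_by_rules(name: str) -> int:
    """Return 1 if any rule fires (institution), otherwise 0 (private)."""
    lowered = name.lower()
    if re.search(r"\d", lowered):  # the name contains explicit numbers
        return 1
    tokens = re.findall(r"\w+", lowered)
    return int(any(token in INSTITUTION_TOKENS for token in tokens))
```

Any institution whose name contains none of the listed tokens falls through to 0, which is how such a model ends up predicting private too often.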

Applying this model (or rather, pattern matching on the rules above) to the validation set, we see that it often ‘predicts’ private when the entity is actually an institution (around 66% of true labels end up wrongly classified).

Such a system for scanning payment events would be easy to get started with and quite transparent, but hard to manage because the rules and their coverage do not scale. The performance is not so good either, so let’s try something better.

Bag of words & character-based models

Here we stack together word- and character-based text vectorizers (simply counting the number of times each token appears in a name) and put a logistic regression classifier on top. We use the top 15k features/tokens for each: uni-grams for words and n-grams in the range (2, 4) for characters (e.g. ‘swedbank’ -> sw, we, swe, swed etc.). The shape of the training feature matrix is therefore around (100 000, 30 000), meaning each of the 100k training samples is described by a vector of length 30k.

We combine it all into a convenient scikit-learn pipeline.
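A pipeline of this kind might look roughly like the sketch below. The step names, the "char" analyzer (rather than "char_wb"), the raised max_iter and the toy names are assumptions for illustration; only the 15k feature limit, the n-gram ranges and the choice of logistic regression come from the description above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


def build_bow_pipeline(max_features=15000):
    """Word uni-gram and character (2,4)-gram counts feeding a logistic regression."""
    features = FeatureUnion([
        # The default token pattern keeps tokens of two or more characters.
        ("words", CountVectorizer(analyzer="word", ngram_range=(1, 1),
                                  max_features=max_features)),
        ("chars", CountVectorizer(analyzer="char", ngram_range=(2, 4),
                                  max_features=max_features)),
    ])
    # max_iter is raised here as an assumption, to help the solver converge.
    return Pipeline([("features", features),
                     ("clf", LogisticRegression(max_iter=1000))])


# Toy usage with made-up cleaned names (1 = institution, 0 = private).
toy_names = ["swedbank as", "ehitus grupp oü", "tartu linnavalitsus",
             "mari tamm", "jaan kask", "anna saar"]
toy_labels = [1, 1, 1, 0, 0, 0]
model = build_bow_pipeline().fit(toy_names, toy_labels)
predictions = model.predict(["tartu ehitus as", "jaan tamm"])
```

Because the two vectorizers sit in a FeatureUnion, their outputs are concatenated column-wise, which is where the combined width of up to 30k features comes from.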

This model achieves a strong result; the confusion matrix for the validation set is below:

[Image: confusion matrix for the validation set]

Looking at the model’s mistakes:

  • It predicts institution for less known (rare) and foreign names.
  • It predicts private for some non-residents, unitokens (e.g. ‘gandi’), some combined names (‘riigikantselei’, ‘poksiklubi’ (boxing club)) and institutions named after some personal names (e.g. ‘Juhan Liivi nimeline kool’; ‘Susan Tao Language School’)

We now explain the model globally by taking the features (logistic regression coefficients) that contribute most to the institutional class (top 50, from the word uni-gram set):

All of the above seems logical and matches human perception: it is highly unlikely that a private person’s name could be ‘põllumajandusühistu’ (agricultural cooperative).

For the private class we have (top 20):

Again, all good. We see that we have common Estonian names (with a certain gender bias).
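Rankings like these can be read off a fitted logistic regression by sorting its coefficients. The sketch below uses a word-only vectorizer and made-up names to stay short; the helper name top_tokens is hypothetical. With the stacked word and character features described earlier, the word columns would be the first block of the coefficient vector, which is why the slice is there.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def top_tokens(vectorizer, classifier, k=50):
    """Return (institution_tokens, private_tokens) ranked by coefficient."""
    # vocabulary_ maps token -> column index; order tokens by that index.
    vocab = vectorizer.vocabulary_
    tokens = np.array(sorted(vocab, key=vocab.get))
    # Keep only the columns belonging to this vectorizer (the leading block).
    coefs = classifier.coef_[0][: len(tokens)]
    order = np.argsort(coefs)
    institution = tokens[order[::-1][:k]].tolist()  # most positive coefficients
    private = tokens[order[:k]].tolist()            # most negative coefficients
    return institution, private


# Toy usage with made-up cleaned names (1 = institution, 0 = private).
toy_names = ["swedbank as", "ehitus as", "transport as",
             "mari tamm", "jaan tamm", "anna tamm"]
toy_labels = [1, 1, 1, 0, 0, 0]
vectorizer = CountVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(toy_names), toy_labels)
institution_tokens, private_tokens = top_tokens(vectorizer, clf, k=3)
```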

Embedding-based model

Let’s also try something fancier and use so-called embeddings (tokens mapped to vectors of real numbers; where above each name had a vector of length 30k, it now has one of length 100). For that we trained fastText (and also word2vec) on the train data and composed entity name vectors from the token vectors (via the commonly used vector averaging). Therefore, the shape of the training feature matrix is (100k, 100). This was fed into a random forest classifier.
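The averaging step can be sketched with numpy alone. Here token_vectors stands in for whatever the trained fastText or word2vec model provides (a plain dict is assumed for illustration), and returning a zero vector for names without any known token is an assumption of this sketch, not something described above.

```python
import numpy as np


def name_vector(name, token_vectors, dim=100):
    """Average the vectors of the tokens in a name."""
    vectors = [token_vectors[t] for t in name.split() if t in token_vectors]
    if not vectors:
        # Assumed fallback: no known token, so return a zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


def names_to_matrix(names, token_vectors, dim=100):
    """Stack one averaged vector per name into a (n_names, dim) feature matrix."""
    return np.vstack([name_vector(n, token_vectors, dim) for n in names])
```

The resulting matrix has one row per name and one column per embedding dimension, which is the (100k, 100) shape mentioned above, and can be passed directly to a classifier such as a random forest.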

Let’s see the resulting confusion matrix:

[Image: confusion matrix for the embedding-based model]

We see that the model underperforms the BOW model and has a higher off-diagonal weight.

Public model deployment

Next we deploy the best model (BOW). To this end we serialize (pickle) the model pipeline, build a simple Flask (micro web framework) app and deploy it to a Heroku server.
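A minimal Flask app along these lines might look as follows. The route /predict, the query parameter name, the response fields and the helper names are all hypothetical choices for this sketch; the actual app may differ.

```python
import pickle

from flask import Flask, jsonify, request


def load_model(path):
    """Load a pickled scikit-learn pipeline from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def create_app(model):
    """Build a Flask app that serves predictions from the given model."""
    app = Flask(__name__)

    @app.route("/predict")  # hypothetical route name
    def predict():
        # Read the name from the query string, e.g. /predict?name=swedbank+as
        name = request.args.get("name", "")
        label = int(model.predict([name])[0])
        return jsonify({"name": name, "is_institution": label})

    return app
```

In this sketch the app would be created with something like create_app(load_model("model.pkl")), where the file name is again an assumption. Passing the model in, rather than loading it inside the route, means it is unpickled once at start-up rather than on every request.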

To test the final classifier on names of your choice, please try the following web app request (it may take some time to load):
