“What’s in a Name?” Shakespeare’s age-old question answered using machine-learning

Crimecheck Blog
17 min read · May 17, 2022

Using Classical machine learning and Deep Learning models to identify whether a given name refers to an individual or a corporate entity.

Is it really true that a name doesn't convey any information beyond itself?

We decided to dig into one aspect of this question and find the answer not with opinions, but with data. We collected around 3 crore (1 crore = 10 million, so 30 million) names of companies and individuals and looked for patterns in them. While we were collecting those names, we pondered the idea of classifying a name as that of a company or an individual. This article details how we found patterns in the names, how we built a machine-learning model around them, and yes, how we were able to answer the question “What’s in a name?”

This blog is a part of the research project undertaken by the Machine Learning team at CrimeCheck.ai — a product from GetUpForChange Services Private Limited. We are India’s leading AI-powered search engine for the legal domain.

Contents:

  1. Data Collection
  2. Exploratory Data Analysis
  3. Data Preprocessing and Data Cleaning
  4. Model Training Using Classical Machine Learning Models
  5. Model Training Using Deep Learning
  6. Performance Testing
  7. Conclusion

1. Data Collection

We started our data collection by gathering names of individuals and corporate entities from as many sources as we could. For individual names, we brainstormed different ways of collecting them and eventually settled on Indian court records. However, those records were not readily available in a structured form, so our data engineering team at CrimeCheck crawled legal records and extracted the names of petitioners and respondents from high court sites and other court websites. The data thus acquired reflected varied nomenclature (different practices of writing an individual's name across India): in some cases it was just the person's name, whereas in many other cases the father's or guardian's name was also present. There were a number of formatting issues as well, like misplaced spaces, tabs, and incorrect spellings, which are analysed further in the preprocessing section.

Examples of the types of individual names extracted were:

["Rakesh sharma s/o girish jetha sharma","Rakesh","Avni","Ananya","Ramesh","Avni kumar wife of prasanth prasad singh","shah gijanjali" ]

Person Data

Company data

Gathering names of companies was quite a challenge. We managed to collect corporate entity names from the following sources:

1. Companies registered in India between 1857 and 2020, available on the Kaggle platform. This dataset alone has around 19 lakh (1 lakh = 100,000) names.

2. MCA (Ministry of Corporate Affairs) data, available on the mca.gov.in website. It consists of all the limited companies in India and amounts to around 17 lakh names.

3. TIN (Taxpayer Identification Number) Dataset, which is extracted at the central and state level from TIN portal sources. This data is collected from various sources and integrated as one by our parent company GetUpForChange Services Pvt Ltd.

4. Startup and unicorn company names were also added to our dataset using some readily available datasets from Kaggle; the links are provided in the references section.

5. Common words which usually appear in company names were also added to our dataset. A few of them are as follows:

['ankle', 'eileen', 'clicks', 'analyse', 'drills', 'closest', 'semi', 'carrier', 'continuously', 'actuarial', 'makefile', 'sensual', 'sweep', 'leveraging', 'oak', 'adopted', 'mediterranean', 'natal', 'esp', 'husband', 'faulty', 'bouquets', 'stairs', 'cleaning', 'applications', 'trojan', 'sparrow', 'gateways', 'toilets', 'feeders', 'reg', 'aging']

6. Some verbs can be company names in their own right, or appear as part of company names. A few of them are as follows:

[‘Accelerate’, ‘Accommodate’, ‘Accomplish’, ‘Accumulate’, ‘Achieve’, ‘Acquire’, ‘Act’, ‘Activate’, ‘Adapt’, ‘Add’, ‘Address’, ‘Adjust’, ‘Administer’, ‘Advertise’, ‘Advise’, ‘Advocate’, ‘Aid’, ‘Aide’, ‘Align’, ‘Allocate’, ‘Amend’, ‘analyse’, ‘Answer’, ‘Anticipate’, ‘Apply’, ‘Appoint’, ‘Appraise’, ‘Approve’, ‘Arbitrate’, ‘Arrange’, ‘Articulate’, ‘Ascertain’, ‘Assemble’, ‘Assess’, ‘Assign’, ‘Assist’, ‘Assume’, ‘Attain’, ‘Attend’, ‘Attract’, ‘Audit’, ‘Augment’, ‘Author’, ‘Authorize’, ‘Automate’, ‘Avert’, ‘Award’, ‘Bargain’, ‘Begin’, ‘Bolster’]

7. We also added single-word company names and stock symbols of companies to our dataset to improve the model's predictions. For example:

[‘hdfc’,’sbin’,’tcs’,’rtnindia’, ‘steelcity’, ‘tata’, ‘ficosa’, ‘nocil’, ‘nitinfire’, ‘nxtdigital’, ‘coffeeday’, ‘fcl’, ‘wabag’, ‘meitu’, ‘manugraph’, ‘creative’, ‘rmmil’, ‘jf’, ‘ongc’, ‘hotelrugby’, ‘morepenlab’, ‘airolam’, ‘ey’, ‘jetknit’, ‘rohltd’, ‘menonbe’, ‘maton’, ‘repl’, ‘grobtea’, ‘mbecl’, ‘rcom’, ‘bharatidil’, ‘sathaispat’, ‘dalbharat’, ‘lactalis’, ‘noise’, ‘pentagold’, ‘petronas’, ‘geship’, ‘srpl’, ‘enil’, ‘khfm’, ‘shreepushk’, ‘bpcl’, ‘cipla’, ‘concentrix’, ‘intentech’, ‘cantabil’, ‘allcargo’, ‘apollohosp’, ‘mitsubishi’]

So, after gathering data from these different sources, our dataset contained about 3 crore names; the distribution of the individual and company classes is explained further in the EDA section.

2. Exploratory Data Analysis

  1. The class distribution of our dataset was as follows: approximately 2.11 crore names (~69.5% of the total) were individual names and the remaining ~97 lakh (~30.5%) were company names.

As evident from the bar chart, our dataset is not perfectly balanced, but it has enough records for both classes, company and individual.

2. Company data comprises different types of entities, including limited companies, non-limited companies, government bodies, and symbols of limited companies.

Fig: Different types of naming patterns found in company names.

We can observe that the majority of company data belongs to the limited company category, followed by the non-limited category and then government bodies. A very small subset, less than 1% of the total, consists of stand-alone names like ‘Swiggy’ or ‘Zomato’ and symbols of listed companies (e.g. HDFC, SBIN), which is referred to as ‘symbol’ in the count plot above.

3. Distribution of the length of names

Fig: Box plot of name lengths.

From the box plot, we see that most name lengths lie between 0 and 40 characters. A natural first step would be to remove all text longer than 40 characters in preprocessing, but that cutoff causes a great deal of data loss (>7%). On analysing the names longer than 40 characters, we realised that removing them would discard meaningful information. So we analysed the data in iterations of character length to decide on a sensible maximum length for our data. We concluded that names longer than 60 characters can be treated as junk: a 60-character limit did not drop any important information, and the amount of junk retained was minimal. We therefore removed all names longer than 60 characters, which led to less than 1% data loss.

4. Distribution of the number of words in each name

Fig: Distribution of the number of words per name.

Using the number of words as a parameter for outlier detection, we see that most company and individual names have 1–7 words. So, can names with more than 9–10 words be considered outliers? To observe and understand the data better, we look at the distribution of character lengths separately for the company and individual labels.

As evident from the above distributions, company names are longer on average than individual names, and company name lengths can go up to 60 characters.

After analysing the maximum name lengths, their respective medians and means, the box plot, and the per-entity distributions, we decided to keep 60 characters as the limit: any text longer than 60 characters is considered an outlier and removed.

3. Pre-processing of Data

As we have seen, the majority of our dataset is extracted from court data, which contains junk such as addresses, multiple names in a single row that need to be separated, and junk words that can be categorised as outliers.

Pre-processing steps:

  1. Cleaning person name data

In our dataset, respondent and petitioner names contain markers like "@", "alias", or "aka" which carry meaning, for example:

"rakesh sharma @ raksh", "ramchandra alias ram"

Here, "@" and "alias" introduce other names that refer to the same person; these can be read as nicknames. To capture the meaning of "@", "alias", and "aka", we convert them all to a common token, "synonym", and we convert "&" to "and" to capture its meaning. If we did not convert them first, these characters would simply be stripped out during cleaning and their meaning would be lost. After these conversions, we remove all special characters other than alphabets, numbers, and "/". We keep "/" because it is one of the most frequently used characters in our data (as in "s/o" and "w/o").
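A minimal sketch of this cleaning step in Python (the function name and exact rules are illustrative, not our full pipeline):

import re

def clean_person_name(text):
    # Lowercase first so replacements and filters are case-insensitive.
    text = text.lower()
    # Preserve the meaning of alias markers before stripping special characters.
    text = text.replace("@", " synonym ")
    text = re.sub(r"\b(alias|aka)\b", "synonym", text)
    # "&" carries meaning, so spell it out instead of dropping it.
    text = text.replace("&", " and ")
    # Remove everything except alphabets, numbers, "/" and spaces.
    text = re.sub(r"[^a-z0-9/ ]", " ", text)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_person_name("rakesh sharma @ raksh"))  # rakesh sharma synonym raksh
print(clean_person_name("ramchandra alias ram"))   # ramchandra synonym ram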

2. In the name dataset, some text is combined with addresses, for example:

["sh rajneesh malik pinki address main 100 ft road sant nagar bur"]

Such text needs to be split on the "address" keyword, after which we can extract the person part of the text.
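A minimal sketch of this split (assuming the literal keyword "address" separates the name from the address):

def strip_address(text):
    # Keep only the part before the "address" keyword, if present.
    return text.split(" address ")[0].strip() if " address " in text else text

print(strip_address("sh rajneesh malik pinki address main 100 ft road sant nagar bur"))
# -> "sh rajneesh malik pinki"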

3. Pre-processing the company text

Most company names consist of words as follows:

common_words = ["pvt", "limited", "private limited", "p ltd", "co", "company", "ltd", "groups", "technologies"]

The presence of these words makes company names easy to identify, but we want our model to predict the entity even when they are absent. So in preprocessing we remove these words and split the text around them, so that the model can recognise a company without needing the above-mentioned words to be present.

Before preprocessing, the data looks like this:

['anand corporate holdings private limited', 'kuber hotels p ltd', 
'hiring partners hr solutions privatelimited', 'saffire trendz co']

After preprocessing, the data looks like this:

['anand corporate holdings ', 'kuber hotels ', 'hiring partners hr solutions', 'saffire trendz ']

This preprocessing is done to increase the prediction capability of the model for company names.
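A minimal sketch of this keyword-removal step (the word list mirrors common_words above, extended with the fused spelling "privatelimited" seen in the example; longer phrases come first so "private limited" is removed before "limited"):

import re

common_words = ["private limited", "privatelimited", "p ltd", "pvt",
                "limited", "ltd", "company", "co", "groups", "technologies"]
pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in common_words) + r")\b")

def drop_company_keywords(name):
    # Strip the common suffix words and tidy up the leftover whitespace.
    return re.sub(r"\s+", " ", pattern.sub(" ", name)).strip()

names = ['anand corporate holdings private limited', 'kuber hotels p ltd',
         'hiring partners hr solutions privatelimited', 'saffire trendz co']
print([drop_company_keywords(n) for n in names])
# ['anand corporate holdings', 'kuber hotels', 'hiring partners hr solutions', 'saffire trendz']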

4. In our company data, we kept a 70:30 ratio: 70% of the records do not contain the common words listed above in common_words, and 30% do.

5. To get a better idea of the common words in our data, we can use a word cloud to visualise them and understand the data better.

Word Cloud of Company Data

As we can see from the above image, some individual names appear in the word cloud too, e.g. Vijay, Ram, Balaji, Laxmi. These come from non-limited companies like “Balaji enterprises” or “Vijay general stores”.

Word Cloud of Individual Data

From the above word cloud of individual data, we see that common words include surnames (Patel/Singh/Sharma) and words like “synonym”, “father”, and “pita ka naam” (Hindi for “father's name”), which tells us that our data contains many individual names along with their guardians' names. The token “synonym” appears because, as described in the preprocessing steps above, “@” is converted to “synonym” to preserve its meaning.
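For reference, a minimal sketch of generating such word clouds with the wordcloud library (the lists company_names and individual_names are assumed to exist):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

def show_wordcloud(names, title):
    # Build one big string and let WordCloud size words by frequency.
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(names))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()

show_wordcloud(company_names, "Word Cloud of Company Data")
show_wordcloud(individual_names, "Word Cloud of Individual Data")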

4. Model Training Using Classical Machine Learning Models

After preprocessing and outlier removal, we next convert our text into numeric form to feed our machine-learning model. There are various ways to vectorize text, such as TF-IDF vectorization, or word/character embedding techniques like chars2vec, word2vec, and GloVe.

TF-IDF is a method that gives a numerical weight to words, reflecting how important a particular word is in a document. In our case the texts are short and the useful signal lies more at the character level than the word level, so we use character N-grams to extract the significance of the text at the character level. With N-grams we can capture commonly co-occurring characters and weigh their significance across the whole dataset.

4.1. Model Training with TF-IDF Vectorization Method.

Vectorization of our data was done using the character n-gram TF-IDF vectorizer.

Vectorization of the model with TF-IDF with N-Gram.
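The original snippet is shown as an image; a minimal equivalent sketch (the exact n-gram range and feature count are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams within word boundaries capture sub-word patterns.
tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), max_features=5000)
X = tfidf.fit_transform(names)  # names: list of cleaned name strings
print(X.shape)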

After vectorization, we split our data into training and validation sets. The training sample contained around 2 lakh names, which is enough to train a Classical machine-learning model.

For our problem statement, we want probabilistic estimates as outputs. Some classification models, such as Naive Bayes, Logistic Regression, and Multilayer Perceptrons (when trained with an appropriate loss function), are naturally probabilistic. SVMs and tree-based models, however, do not provide true probability estimates: the scores from their predict_proba() method are uncalibrated (over-confident or under-confident rather than true estimates). As we want to use a tree-based model for its low-bias nature, we use CalibratedClassifierCV from the sklearn library, which converts uncalibrated model scores into calibrated probability estimates. After tuning the LGBMClassifier with RandomizedSearchCV over parameters like learning rate, number of leaves, max depth, and number of estimators, we train a calibrated version of the tuned model.

Hypertuning of LGBMClassifier with RandomizedSearchCV
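A sketch of the tuning and calibration steps (the search grid values are assumptions; X_train and y_train come from the split above):

from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.calibration import CalibratedClassifierCV

param_dist = {
    "learning_rate": [0.01, 0.05, 0.1],
    "num_leaves": [10, 31, 50],
    "max_depth": [5, 7, 9],
    "n_estimators": [200, 500, 1000],
}
search = RandomizedSearchCV(LGBMClassifier(), param_dist, n_iter=20,
                            scoring="accuracy", cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_estimator_)

# Wrap the tuned model so predict_proba() returns calibrated estimates.
calibrated = CalibratedClassifierCV(search.best_estimator_, cv=3)
calibrated.fit(X_train, y_train)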

>> output: LGBMClassifier(learning_rate=0.05, max_depth=7, n_estimators=1000, num_leaves=10)

After tuning the parameters of the LGBMClassifier, we train the model and evaluate its accuracy on the validation dataset.

The scores achieved by the calibrated LGBMClassifier model were:
>> Train Score: 0.945
>> Validation Score: 0.949

The confusion matrix and classification report of the calibrated model looks as follows:

Classification report

4.2. Model Training with Character Embedding Vectorization Method.

Another technique for vectorizing text is character-level embedding with the chars2vec library. The chars2vec language model is based on symbolic representations of words and maps each word to a fixed-length vector. Chars2vec captures syntactic similarity of texts at the character level, whereas TF-IDF cannot capture the sequence of the text.

Vectorization code snippet for character embedding looks as follows:
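A minimal sketch using the chars2vec package (the pretrained 'eng_50' model choice is an assumption; other sizes such as 'eng_100' exist):

import chars2vec

# Load a pretrained 50-dimensional character-level embedding model.
c2v_model = chars2vec.load_model('eng_50')

# Each name is mapped to a fixed-length 50-dimensional vector.
embeddings = c2v_model.vectorize_words(names)  # names: list of name strings
print(embeddings.shape)  # (len(names), 50)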

Model training was done with LGBMClassifier as the base classifier inside CalibratedClassifierCV. The accuracy achieved with character-level embeddings was 0.85.

The confusion matrix and classification report for this method looks like this:

Comparing the two vectorization methods, TF-IDF gave better accuracy, so we chose it as our final vectorization technique.

Limitations of the Classical machine-learning model:

  • The Classical model cannot predict company entities without keywords like private, limited, pvt, ltd, co, group; the main aim of our model is to recognise companies without these keywords.
  • The model cannot predict single-word company names or companies by their stock symbols.

For example, the model was able to predict “SBI Bank” as a company but could not predict “SBIN” (the stock symbol of SBI Bank) as one. Examples of false negatives for this model are:

['SBIN', 'Zomato', 'Paytm', 'Meesho', 'TATASTEEL']

Our goal is to make our model accurate enough to predict a company without the common words present in company names, and also predict companies from their Symbols.

  • As we are using TF-IDF for vectorization, the sequence of the text is not preserved, yet in our problem the sequence plays an important role. The alternative is to use Deep Learning models such as RNNs, with a vectorization method that preserves the sequence, e.g. the rank method at the character level.

Looking at the false-positive and false-negative data points and the limitations above, we next train a Deep Learning model to overcome the limitations of the Classical machine-learning model.

5. Model Training Using Deep Learning

All the limitations of the Classical model can be addressed with Deep Learning, where the sequence of the text can be preserved and used in predicting the entity. Since TF-IDF vectorization for the Classical model does not preserve the order of characters, we use the rank method for vectorization instead.

The rank method can operate at the word or character level, with each word or character given a weight called a rank. Since our texts are short, we use a character-level rank dictionary in which each character is assigned a rank according to its occurrence in the whole dataset: the higher a character's occurrence, the higher its rank.

For example, if the text is “tata steel”, it is converted into a list of numbers whose length equals the length of the text, with each character assigned a number (its rank) from the character-level rank dictionary.

The code snippet of creating a rank dictionary:
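A minimal sketch of building such a dictionary, assuming ranks follow character frequency as described above (train_names is the list of cleaned names):

from collections import Counter

def build_rank_dict(texts):
    # Count every character across the whole dataset.
    counts = Counter("".join(texts))
    # Most frequent character gets rank 1, the next rank 2, and so on.
    return {ch: rank for rank, (ch, _) in enumerate(counts.most_common(), start=1)}

rank_dict = build_rank_dict(train_names)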

rank dictionary: {'a': 1, 'h': 2, 'i': 3, 'd': 4, 'c': 5, '1': 6, 'k': 7, 'f': 8, 'j': 9, '/': 10, 'v': 11, '6': 12, '5': 13, 't': 14, 'q': 15, 'o': 16, '0': 17, 'z': 18, 'y': 19, '4': 20, 's': 21, 'm': 22, 'e': 23, '3': 24, 'l': 25, 'r': 26, '9': 27, ' ': 28, 'u': 29, 'g': 30, 'p': 31, 'b': 32, 'x': 33, 'n': 34, '2': 35, '7': 36, '8': 37, 'w': 38}

The vectorization code looks as follows:
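A minimal sketch of the rank-based vectorization and zero padding (pad_sequences pre-pads by default, matching the array shown below):

from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 100  # fixed input dimension

def vectorize(text, rank_dict):
    # Map each character to its rank; unseen characters map to 0.
    return [rank_dict.get(ch, 0) for ch in text]

vectors = [vectorize(t, rank_dict) for t in train_names]
X = pad_sequences(vectors, maxlen=MAX_LEN, padding="pre", dtype="int32")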

Every name is converted into numeric form, but names vary in length. So we fixed the input dimension at 100 and pad each numeric vector with zeros to bring every text to the same dimension. After padding, a vector looks like this:

array([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 35,  6, 35,  6, 22, 26,
        35, 13, 13, 14]], dtype=int32)

After converting all the training data points into numeric character embeddings like the one above, we split the data into training and validation sets and feed the vectors to the model. We train a sequence model, which reads the input character sequence and predicts the entity class.

Architecture of our Deep Learning model
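The architecture figure is an image in the original post; a sketch of a comparable Keras model (layer sizes are illustrative assumptions, not our exact production network):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

vocab_size = len(rank_dict) + 1  # +1 for the padding index 0

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64, input_length=100),
    LSTM(128),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),  # P(name is a company)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()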

Now, let’s talk about the LSTM architecture. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture used in Deep Learning. Unlike standard feed-forward neural networks, LSTMs have feedback connections, so they can process not only single data points (such as images) but also entire sequences of data (such as text, audio, or video).

There are many RNN-based architectures, for example LSTMs, fully recurrent networks, recursive neural networks, and Gated Recurrent Unit (GRU) networks. All of these architectures are used for sequence classification.

We could have used a GRU network instead of an LSTM. The key difference between the two architectures is that a GRU has 2 gates (update/reset) while an LSTM has 3 gates (input/forget/output).

GRU Vs LSTM

  • GRU’s architecture is less complex than LSTM’s; GRUs are usually preferred for small datasets.
  • A GRU uses fewer training parameters than an LSTM, so it consumes less memory and executes faster. GRUs are used when memory consumption must be low and fast results are needed.
  • LSTM is the better choice when dealing with long sequences and accuracy is the main concern.
  • Our data size is in multiple crores and we care more about the accuracy of results than resource consumption or speed. Hence we preferred LSTM, even though its architecture is a bit more complex than GRU’s.

Data Generators

Data generators are used when we face RAM limitations on our local systems while training and building Deep Learning models. They let us process a large dataset in small batches, one batch per iteration, and allow us to choose the batch size passed to the model at a time.
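A minimal sketch of such a generator (the batch size is an assumption):

import numpy as np

def batch_generator(X, y, batch_size=1024):
    # Loop forever, as Keras expects, yielding shuffled mini-batches.
    n = len(X)
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

# model.fit(batch_generator(X_train, y_train),
#           steps_per_epoch=len(X_train) // 1024, epochs=10)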

Training accuracy reached 0.986, and accuracy reached 0.97 on validation data explicitly created to test the model on real-world examples.

Model Training

6. Performance Testing

We created a separate test set of 10,000 data points to measure the performance of our models, and made it available on GitHub. Performance testing was done in two phases: one for the Classical model and one for the Deep Learning model.

6.1. Performance Testing of Classical ML Model

  1. Reading the unit test data and its shape.

2. After loading the data, we vectorize it for the Classical machine-learning model with the TF-IDF vectorizing method.

3. Testing the performance of the Classical machine-learning model on the test data

4. From the accuracy of the Classical machine-learning model on the test data, we see that the model is not overfitting. A rough sketch of these steps is shown below.
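Put together, a minimal sketch of these steps (the file and column names are illustrative; tfidf and calibrated are the fitted objects from the training sketches above):

import pandas as pd
from sklearn.metrics import accuracy_score

test_df = pd.read_csv("test_data.csv")  # illustrative file name
print(test_df.shape)                    # (10000, 2)

# Reuse the fitted TF-IDF vectorizer; never re-fit it on test data.
X_test = tfidf.transform(test_df["name"])
y_pred = calibrated.predict(X_test)
print(accuracy_score(test_df["label"], y_pred))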

6.2. Performance Testing of Deep Learning Model

  1. Vectorization for the sequence model (LSTM) was done with the rank dictionary method.

2. After vectorization, we predict the scores and calculate the model's accuracy on the test data, as sketched below.
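A rough sketch of these two steps (reusing rank_dict, vectorize() and model from the training sketches, and test_df from above):

from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Vectorize test names with the same rank dictionary and padding as training.
X_test_seq = pad_sequences([vectorize(t, rank_dict) for t in test_df["name"]],
                           maxlen=100, padding="pre", dtype="int32")
y_prob = model.predict(X_test_seq).ravel()
y_pred = (y_prob > 0.5).astype(int)  # threshold the sigmoid output at 0.5
print(accuracy_score(test_df["label"], y_pred))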

The accuracy score of the Deep Learning model on the test set is 0.988. Having reviewed both models' performance, we can draw the conclusions below.

7. Conclusion

  1. After performance testing both models, we saw that the training and testing scores were close to each other, which indicates that the models are not overfitting.
  2. The Deep Learning model captures more patterns than the Classical machine-learning model because it preserves the sequence of the text, which is of great importance in our problem statement. Its accuracy scores are also better than the Classical model's.
  3. In summary, this article described one of the business problems that we at CrimeCheck.ai are trying to solve, and walked through the steps we took to build the model: data gathering, EDA, preprocessing, Classical machine-learning model building with various vectorization methods, a Deep Learning model as an alternative, and performance testing for all the models. Please refer to the references section for all the relevant links used while working on this project.
