Classifying Gender Based On Indian Names

Simpl - Under The Hood
Jun 8, 2016

In the business of finance, gender can prove to be a very critical feature that explains variance when used along with other strong features. We definitely wanted to track the gender of all our users, but adding another field to the signup process might not have been the best decision. Unlike in the US, we could not find a publicly available repository of Indian names with their genders that we could query. The other option was scraping baby-name websites and building our own repository, but before doing that we wanted to try something more creative.

So, we decided to take a machine learning approach to deciding the gender of our users based on their first names. Why first names? Because that was all we had at the time, and we were surprised by the results we got.

Creating The Dataset

At that time we had data for about 25,000 Indian users with their genders, collected from various sources. After extracting first names from users' full names with a simple whitespace split, we started looking for typos in the names. By doing a quick sort on the names and applying some string-similarity matching, we were able to remove most of the typos. With a few other data cleaning steps, we ended up with ~20,000 first names to train and test our model.
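As an illustration of the typo check, here is a minimal sketch using Python's difflib. The 0.85 threshold and the adjacent-pair comparison over a sorted list are illustrative assumptions, not the exact cleaning rules used on the real data.

```python
# Sketch: flag likely typos by comparing adjacent names in sorted order.
# The threshold is an illustrative assumption, not the production value.
from difflib import SequenceMatcher

def near_duplicates(names, threshold=0.85):
    """Return pairs of sorted, adjacent names that are suspiciously similar."""
    names = sorted(set(names))
    pairs = []
    for a, b in zip(names, names[1:]):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

print(near_duplicates(["rahul", "rahhul", "priya", "prya", "amit"]))
# [('priya', 'prya'), ('rahhul', 'rahul')]
```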

Feature Engineering

We started with the features below and dropped a few along the way to make sure we did not end up overfitting the model. In the end we had ~500 features, most of which were various n-grams selected on the basis of their importance.

1. Ending n-grams
We took the 1-gram, 2-gram, 3-gram and 4-gram from the end of the first name. Eventually we dropped the 4-gram feature as it was leading to overfitting of the model.

2. Last letter is a vowel
This feature is a flag with the value True if the last letter of the first name is a vowel and False otherwise. In our analysis we noticed that 85% of female users' names end with a vowel, while the figure is just 13% for male users.

3. Ends with a sonorant
Sonorants are those sounds in phonetics which are marked by a continuing resonant sound. In English, y, w, l, r, m, n and ng are sonorants. In Devanagari there are 8 sonorants: ह ha, म्ह mha, न्ह nha, ण्ह ṇha, व्ह vha, ल्ह lha, ळ्ह ḷha and र्ह rha. In our analysis, we found that a higher percentage of Indian female names end with a sonorant compared to male names.

4. Ratio of open syllables to total syllables
Open syllables are those where nothing comes after the vowel in the syllable. During our analysis we found that the ratio of open syllables to total syllables in a name is higher for female names than for their male counterparts. A sketch of how these features might be extracted follows this list.
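To make the four features concrete, here is a minimal sketch of how they could be extracted, assuming romanised, lowercase names. The syllabifier is a crude vowel-group heuristic, and the helper names are illustrative rather than the exact production code.

```python
import re

VOWELS = "aeiou"
SONORANT_ENDINGS = tuple("ywlrmn") + ("ng",)  # English sonorants from the list above

def syllabify(name):
    # Crude syllabifier for romanised names: each chunk is an optional onset
    # plus a vowel group; a trailing consonant cluster closes the last syllable.
    chunks = re.findall(rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+$", name)
    if len(chunks) > 1 and chunks[-1][-1] not in VOWELS:
        chunks[-2:] = [chunks[-2] + chunks[-1]]
    return chunks

def features(name):
    name = name.lower()
    sylls = syllabify(name)
    open_count = sum(1 for s in sylls if s[-1] in VOWELS)
    feats = {f"suffix_{n}": name[-n:] for n in (1, 2, 3)}  # ending n-grams
    feats["ends_with_vowel"] = name[-1] in VOWELS
    feats["ends_with_sonorant"] = name.endswith(SONORANT_ENDINGS)
    feats["open_syllable_ratio"] = open_count / max(len(sylls), 1)
    return feats

print(features("priya"))  # suffixes 'a'/'ya'/'iya', vowel flag True, ratio 1.0
print(features("amit"))   # suffixes 't'/'it'/'mit', vowel flag False, ratio 0.5
```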

Model Building

We started by creating a Naive Bayes classifier, which gave us an F1 score of ~86%. Next we tried random forest and a few gradient boosting algorithms as well. Finally, with SVM we were able to take it to ~94%. I am not a big fan of ensembles unless it's a competition where getting a high score is the only objective; we did not want to make the model overly complicated for a few decimal places of accuracy, so we stuck with the classic Support Vector Machine (SVM).
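A hedged sketch of such a comparison, reusing the features() helper from the earlier sketch; load_labelled_names() is a hypothetical loader standing in for the real dataset, and the scores quoted above came from that data, not from this toy setup.

```python
# Sketch: compare candidate classifiers with cross-validated F1.
# load_labelled_names() is hypothetical; features() is sketched above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

names, genders = load_labelled_names()  # hypothetical: ~20,000 (name, gender) pairs
X = DictVectorizer().fit_transform(features(n) for n in names)

for clf in (BernoulliNB(), RandomForestClassifier(), SVC(kernel="linear")):
    scores = cross_val_score(clf, X, genders, scoring="f1_macro", cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```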

Initially we trained the models with single features, and realised that the n-grams alone were giving us an F1 score of ~84%. Combined with the vowel flag it went up to ~95%. The other variables were not making much of a difference. To further improve accuracy, we created two models for better predictions: one trained on names ending with vowels and the other on the rest. This was done to make sure the model learns enough from the other variables as well.
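A minimal sketch of that two-model split, again reusing the features() helper; the routing rule simply mirrors the vowel flag, and the helper names are illustrative.

```python
# Sketch: one SVM for vowel-ending names, one for the rest,
# with prediction routed by the same rule. Reuses features() from above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def ends_with_vowel(name):
    return name[-1].lower() in "aeiou"

def train_split_models(names, genders):
    models = {}
    for flag in (True, False):
        pairs = [(n, g) for n, g in zip(names, genders) if ends_with_vowel(n) == flag]
        sub_names, sub_genders = zip(*pairs)
        model = make_pipeline(DictVectorizer(), SVC(kernel="linear"))
        model.fit([features(n) for n in sub_names], list(sub_genders))
        models[flag] = model
    return models

def predict_gender(models, name):
    return models[ends_with_vowel(name)].predict([features(name)])[0]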

To get better overall results, before passing a first name to the classification algorithm we first check whether it already exists in the corpus of names for which we have the gender marked. This ensures we use the algorithm only for new names, which optimises execution time as well.
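The lookup-first flow is tiny; a sketch, with known_genders standing in for the labelled corpus and models coming from train_split_models() above:

```python
# Sketch: resolve known names from the corpus, classify only unseen ones.
def gender_of(first_name, known_genders, models):
    name = first_name.strip().lower()
    if name in known_genders:        # known name: skip the model entirely
        return known_genders[name]
    return predict_gender(models, name)  # fall back to the SVMs above
```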

About Model Selection and Hyperparameters

The Support Vector Machine is one of the classic ML algorithms and has been used in a lot of classification tasks. You can read more about it here. One of the most important hyperparameters in an SVM is the choice of kernel. The most popular and widely used kernel is RBF, which stands for Radial Basis Function, mostly because it can implicitly work in an infinite-dimensional feature space. To our surprise, in our case RBF performed worse than the linear kernel. This was mainly due to overfitting.
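A sketch of how such a kernel comparison can be run with cross-validation; X and genders are from the comparison sketch above, and the C grid is an illustrative assumption.

```python
# Sketch: cross-validated F1 for linear vs RBF kernels. The linear kernel
# won on this data, but the outcome is dataset-dependent.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

search = GridSearchCV(
    SVC(),
    param_grid={"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X, genders)
print(search.best_params_, round(search.best_score_, 3))
```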

The other model that we tried initially but dropped later was the Naive Bayes classifier. These are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Below is how a conditional probability is represented as per Bayes' theorem, and this forms the basis of naive Bayes classifiers.
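P(A | B) = P(B | A) · P(A) / P(B)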

This is one of the oldest and yet most popular ML classifiers, used widely in text retrieval and classification (the classic email categorising problem, spam vs ham). You can read more about the classifier here.

We also tried random forest and a few gradient boosting algorithms, all of which gave us an F1 score of about 89–90%.

Future Work

We have been working on combining various other data points, including purchase behaviour, to make the predictor more stable and accurate. We are also planning to open this up as an API so that others can use this feature during signups and other such flows, making the user experience even better.

PS: Graphs used in the article are for indicative purposes and might not be accurate in all cases.

This post was written by Raj Vardhan, our data sciences lead at Simpl.
