Classifying Nationalities Using Names

Asha Vishwanathan
Published in The Startup · 5 min read · Sep 14, 2020

What’s in a name?

Well, there’s a lot. We had a use case for our product Immersify where we needed to separate Indian from non-Indian names. Approaches can vary in complexity, but often simpler models get the job done too.

Let’s build some models to classify names as Indian or non-Indian.

Data

There’s a convenient dataset located at https://www.kaggle.com/chaitanyapatil7/indian-names
This consists of both male and female Indian names.

We load and clean this data to remove some bad entries and end up with around 13K records.
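
As a rough sketch, the loading and cleaning step might look like the following. The file names and the specific filters here are assumptions for illustration, not the exact code from the repo.

import pandas as pd

# Assumed file names from the Kaggle dataset; adjust to the actual downloads
male = pd.read_csv("Indian-Male-Names.csv")
female = pd.read_csv("Indian-Female-Names.csv")
indian_data = pd.concat([male, female], ignore_index=True)

# Lowercase, strip whitespace, and drop entries with digits or stray punctuation
indian_cleaned_data = indian_data["name"].dropna().str.lower().str.strip()
indian_cleaned_data = indian_cleaned_data[~indian_cleaned_data.str.contains(r"[^a-z\s]")]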

len(indian_cleaned_data)
13754

For non-Indian names, there’s a nifty package called Faker. This generates names from different regions.

from faker import Faker
fake = Faker('en_US')
fake.name()
'Brian Evans'

We generate approximately the same number of names as in the Indian set and knock off samples with more than 5 words. Our Indian set contains many entries with only a first name, so we make the word-count distribution in the non-Indian set similar as well.
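
A minimal sketch of this generation and filtering step, assuming a handful of Faker locales and an illustrative sample count:

import random
import pandas as pd
from faker import Faker

# Assumed mix of locales; the exact set of regions is an illustration
fakers = [Faker(loc) for loc in ['en_US', 'en_GB', 'es_ES', 'de_DE', 'it_IT']]
names = [random.choice(fakers).name().lower() for _ in range(15000)]

non_indian_data = pd.DataFrame({'name': names})
non_indian_data['count_words'] = non_indian_data['name'].str.split().str.len()

# Knock off samples with more than 5 words
non_indian_data = non_indian_data[non_indian_data['count_words'] <= 5]

# Keep only the first name for a fraction of rows, to mimic the Indian set
single = non_indian_data.sample(frac=0.3).index
non_indian_data.loc[single, 'name'] = non_indian_data.loc[single, 'name'].str.split().str[0]
non_indian_data['count_words'] = non_indian_data['name'].str.split().str.len()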

non_indian_data.head()

                 name  count_words
0    sara gulbrandsen            2
1  kathryn villarreal            2
2  jennifer mccormick            2
3         james eaton            2
4        melissa bond            2

We finally end up with about 14K non-Indian names and 13K Indian names.

Approaches

We will try the following approaches.

  1. Naive Bayes with Count Vectorizer / TFIDF Vectorizer
  2. Naive Bayes with SentencePiece Encoding
  3. LSTM with Character Encodings
  4. LSTM with SentencePiece Encodings

Naive Bayes with Count Vectorizer / TFIDF Vectorizer

A count vectorizer simply one-hot encodes the words in each name. There are 11,345 unique tokens in our set, so that's the dimensionality of each vector.
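
As a quick sketch, fitting the count vectorizer could look like this (train_names is assumed to hold the raw training names, which the original snippets don't show):

from sklearn.feature_extraction.text import CountVectorizer

# Each unique word across all names becomes one feature
vectorizer = CountVectorizer()
X_ = vectorizer.fit_transform(train_names.values.astype('U'))
print(X_.shape)  # (number of training names, 11345)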

When we fit a multinomial Naive Bayes to this data, the model calculates the conditional probabilities for any given name.

For instance, “Ramesh Kumar” translates to the joint probability p(Indian|Ramesh) * p(Indian|Kumar), where each factor comes from Bayes’ rule:

p(Indian|Ramesh) = p(Ramesh|Indian) * p(Indian) / p(Ramesh)
p(Non-Indian|Ramesh) = p(Ramesh|Non-Indian) * p(Non-Indian) / p(Ramesh)

So the model calculates the likelihoods and the prior probabilities for all names in the corpus, and the probability that a name like Ramesh is Indian will be much higher than the probability that it's non-Indian.

Intuitively we can see that such a model will have a big problem when it encounters names it hasn't seen before. Let's build the model and check.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

model = MultinomialNB()
model.fit(X_, train_labels)
X_test = vectorizer.transform(test_names.values.astype('U'))
test_predicted = model.predict(X_test)
print(classification_report(test_labels, test_predicted))

              precision    recall  f1-score   support

      indian       0.99      0.77      0.87      2751
  non_indian       0.82      0.99      0.90      2912

    accuracy                           0.89      5663
   macro avg       0.91      0.88      0.88      5663
weighted avg       0.90      0.89      0.88      5663

Our curated set of names which aren't present in the corpus:

['lalitha', 'tyson', 'shailaja', 'shyamala', 'vishwanathan', 'ramanujam', 'conan', 'kryslovsky', 'ratnani', 'diego', 'kakoli', 'shreyas', 'brayden', 'shanon']

Let's check the Naive Bayes model's predictions on this set.
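
Scoring the new names is just a matter of transforming them with the same vectorizer; a minimal sketch (new_names is the list above):

new_names = ['lalitha', 'tyson', 'shailaja', 'shyamala', 'vishwanathan', 'ramanujam',
             'conan', 'kryslovsky', 'ratnani', 'diego', 'kakoli', 'shreyas', 'brayden', 'shanon']

X_new = vectorizer.transform(new_names)
predictions_nb_cv = model.predict(X_new)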

            names  predictions_nb_cv
0         lalitha         non_indian
1           tyson         non_indian
2        shailaja         non_indian
3        shyamala         non_indian
4    vishwanathan         non_indian
5       ramanujam         non_indian
6           conan         non_indian
7      kryslovsky         non_indian
8         ratnani         non_indian
9           diego         non_indian
10         kakoli         non_indian
11        shreyas         non_indian
12        brayden         non_indian
13         shanon         non_indian

As expected, it does pretty badly on new names.

Naive Bayes with SentencePiece Encoding

We can recognize names as Indian even if we have never heard a particular name before. The reason is that certain ‘structures’ within the word sound Indian.

Let's break a name into structural pieces, for instance ‘anita kumar’. This could be split as ani|ta|kum|ar or some other such structuring.

We can see the potential this would have for new names as well. There are various types of subword encoding we can try, like Byte-Pair Encoding, BERT WordPiece encoding, etc., which are available in the Hugging Face package called Tokenizers.

We will use SentencePiece encoding, which supports both Byte-Pair Encoding and the Unigram Language Model. It is very similar to BERT's WordPiece encoding, with the added advantage of operating directly on raw sentences.
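
The tokenizer trains from a plain-text file, so we first write the training names to disk; a minimal sketch, assuming train_names holds the raw strings and using the 2000-token vocabulary mentioned later:

from tokenizers import SentencePieceBPETokenizer

# One name per line so the tokenizer sees each name as a separate "sentence"
with open("./train_names.txt", "w") as f:
    f.write("\n".join(str(name) for name in train_names))

vocab_size = 2000  # subword vocabulary used for the Naive Bayes variant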

tokenizer = SentencePieceBPETokenizer()
tokenizer.train(["./train_names.txt"], vocab_size=vocab_size, min_frequency=2)

Example of a SentencePiece encoding:

'kulvinder kaur' becomes '▁kul vinder ▁kaur'
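
In code, that looks something like:

print(tokenizer.encode("kulvinder kaur").tokens)
# ['▁kul', 'vinder', '▁kaur']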

Now we build a Naive Bayes model as before, but use the SentencePiece subword tokens as the features.

So the model now looks at probabilities like p(indian|'▁kul') * p(indian|'vinder') * p(indian|'▁kaur').
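
A minimal sketch of feeding the subword tokens to Naive Bayes, assuming the trained tokenizer from above; the space-joined-token trick is an illustration, not necessarily the exact code from the repo:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def to_subwords(names, tokenizer):
    # Represent each name as a space-joined string of its SentencePiece tokens
    return [" ".join(tokenizer.encode(str(name)).tokens) for name in names]

# Split on whitespace so each subword token becomes one feature
subword_vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
X_train_sp = subword_vectorizer.fit_transform(to_subwords(train_names, tokenizer))
X_test_sp = subword_vectorizer.transform(to_subwords(test_names, tokenizer))

model = MultinomialNB()
model.fit(X_train_sp, train_labels)
test_predicted = model.predict(X_test_sp)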

print(classification_report(test_labels, test_predicted))

              precision    recall  f1-score   support

      indian       0.97      0.97      0.97      2751
  non_indian       0.97      0.97      0.97      2912

    accuracy                           0.97      5663
   macro avg       0.97      0.97      0.97      5663
weighted avg       0.97      0.97      0.97      5663

Let's test the model on the new names.

            names  predictions_nb_enc_tf
0         lalitha                 indian
1           tyson             non_indian
2        shailaja                 indian
3        shyamala                 indian
4    vishwanathan                 indian
5       ramanujam                 indian
6           conan             non_indian
7      kryslovsky             non_indian
8         ratnani                 indian
9           diego             non_indian
10         kakoli                 indian
11        shreyas             non_indian
12        brayden             non_indian
13         shanon                 indian

The model did pretty well, both on the test set and on the new names!

Let's now compare it with more advanced LSTM-based techniques.

LSTM with Character Encodings

We now use character encodings to tokenize the names. An LSTM preserves sequential information and hence the structure of the name.

We tokenize using the Keras tokenizer:

def char_encoded_representation(data, tokenizer, vocab_size, max_len):
    # Convert each name to a sequence of character indices
    char_index_sentences = tokenizer.texts_to_sequences(data)
    # One-hot encode each character index
    sequences = [to_categorical(x, num_classes=vocab_size) for x in char_index_sentences]
    # Pad all names to the same length
    X = sequence.pad_sequences(sequences, maxlen=max_len)
    return X
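
A minimal sketch of fitting the character-level tokenizer and building the tensors; train_names / test_names are assumed to hold the raw name strings:

from tensorflow.keras.preprocessing.text import Tokenizer

# char_level=True makes every character a token
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(train_names.values.astype('U'))

vocab_size = len(tokenizer.word_index) + 1  # ~53 characters in this corpus
max_len = max(len(str(name)) for name in train_names)

X_train = char_encoded_representation(train_names, tokenizer, vocab_size, max_len)
X_test = char_encoded_representation(test_names, tokenizer, vocab_size, max_len)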

We build a simple model with a single-layer LSTM.

def build_model(hidden_units, max_len, vocab_size):
    model = Sequential()
    model.add(LSTM(hidden_units, input_shape=(max_len, vocab_size)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    return model

model = build_model(100, max_len, vocab_size)
model.fit(X_train, y_train, epochs=20, batch_size=64, callbacks=[myCallback(X_test, y_test)])
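
The myCallback used in fit isn't shown in the snippet; a minimal sketch of such a callback, assuming it simply reports held-out accuracy after each epoch:

from tensorflow.keras.callbacks import Callback

class myCallback(Callback):
    def __init__(self, X_val, y_val):
        super().__init__()
        self.X_val, self.y_val = X_val, y_val

    def on_epoch_end(self, epoch, logs=None):
        # Evaluate on the held-out set after every epoch
        loss, acc = self.model.evaluate(self.X_val, self.y_val, verbose=0)
        print(f"epoch {epoch + 1}: val_loss={loss:.3f} val_acc={acc:.3f}")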

Training the model for about 20 epochs gives us an accuracy of 0.96 on the validation set.

Our check with the new names gives the predictions below.

            names  predictions_lstm_char
0         lalitha                 indian
1           tyson             non_indian
2        shailaja                 indian
3        shyamala                 indian
4    vishwanathan                 indian
5       ramanujam                 indian
6           conan             non_indian
7      kryslovsky             non_indian
8         ratnani                 indian
9           diego             non_indian
10         kakoli                 indian
11        shreyas                 indian
12        brayden             non_indian
13         shanon             non_indian

It does well, as expected. Let's test one final model.

LSTM with SentencePiece Encodings

WordPiece encoding is the subword technique used to train SOTA models like BERT, so let's use a similar encoding technique along with our LSTM. This time, though, let's drastically reduce the vocabulary to only 200, compared to the 2000 we had with Naive Bayes.

def sent_piece_encoded_representation(data, tokenizer):
    # Encode each name into its SentencePiece subword token ids
    encoded_tokens = [tokenizer.encode(str(each)).ids for each in data]
    # One-hot encode the ids and pad all names to the same length
    sequences = [to_categorical(x, num_classes=vocab_size) for x in encoded_tokens]
    X = sequence.pad_sequences(sequences, maxlen=max_len)
    return X
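
Putting the pieces together, a minimal sketch of this final variant, reusing the build_model and myCallback helpers from before; the max_len computation here is an assumption:

# Much smaller subword vocabulary than the 2000 used for Naive Bayes
vocab_size = 200
tokenizer = SentencePieceBPETokenizer()
tokenizer.train(["./train_names.txt"], vocab_size=vocab_size, min_frequency=2)

max_len = max(len(tokenizer.encode(str(name)).ids) for name in train_names)
X_train = sent_piece_encoded_representation(train_names, tokenizer)
X_test = sent_piece_encoded_representation(test_names, tokenizer)

model = build_model(100, max_len, vocab_size)
model.fit(X_train, y_train, epochs=20, batch_size=64, callbacks=[myCallback(X_test, y_test)])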

After training, we get a validation accuracy of 0.956, almost the same as the previous character-based model. For short sequences like names, a character-based LSTM still manages to do a good job.

Conclusion

We found that a simple Naive Bayes coupled with a subword encoding technique does pretty well, and it generalizes to new names. Character encoding, with a total vocabulary size of only 53, also does well when paired with a sequential learning model like an LSTM.

Hope you liked the article! Do leave your comments. You can check out the full code at https://github.com/ashavish/name-nationality
