Do you want to check whether the country names in your data are correct? Here is a Python function that does it for you.

Richa Monga
4 min read · Aug 4, 2018


Recently I started using Python extensively for data analysis, and I kept running into data sets that contained country names. The problem was: how do I validate whether the country names in millions of rows of my data set are correct?

Have you ever come across this issue? If yes, then this post should be helpful for you. I have written a function in Python that can be used to find incorrect country names in a data set.

I will explain it by applying it to a Kaggle data set that has incorrect country names in its country column.

Data source: timesData.csv from the ‘World University Rankings’ data set on Kaggle

This data set has incorrect or misspelled entries for country name, such as ‘Unisted States of America’ and ‘Unted Kingdom’. This can be seen in the screenshot below.

Records from timesData.csv which have invalid country names

To catch such anomalies, I have written a function that identifies invalid country names.

This function uses the ‘pycountry’ library, which contains ISO country names. For each country it provides a two-letter country code, a three-letter country code, a name, a common name, an official name and a numeric country code. A sample is shown below:

List of countries in pycountry

Step-by-step demo of this function using timesData.csv

Step 1:

Import Python libraries

Step 2:

Load the csv file into a pandas data frame

Step 3:

Fetch the ‘country’ column from the data set as a list and convert each element of the list to upper case, so that we can do a case-insensitive comparison with the list of countries from pycountry.
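Steps 1 to 3 can be sketched roughly as follows. This is only an illustration, not the original code: it reads a small, made-up inline sample with the standard library's `csv` module instead of loading the real timesData.csv into pandas, so that it runs anywhere. The helper name `load_country_column` and the sample rows are assumptions.

```python
import csv
import io

# Hypothetical inline sample standing in for timesData.csv.
# Only the 'country' column matters for this step.
SAMPLE_CSV = """university_name,country
University A,United States of America
University B,United Kingdom
University C,Unisted States of America
University D,Unted Kingdom
"""


def load_country_column(csv_file):
    """Return the 'country' column as a list of stripped, upper-cased strings."""
    reader = csv.DictReader(csv_file)
    return [row["country"].strip().upper() for row in reader]


countries_upper = load_country_column(io.StringIO(SAMPLE_CSV))
print(countries_upper)

# With pandas and the real file, a comparable list could be built with:
#   pd.read_csv("timesData.csv")["country"].str.upper().tolist()
```

Upper-casing once, up front, means every later comparison can be a plain equality or set-membership test.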

Step 4:

Call the function “country_name_check()”, which has been designed to pick out invalid country names. It should give us ‘Unisted States of America’ and ‘Unted Kingdom’ as the result.

Now that we have seen that the function returns the expected results, let us have a look at the function definition:

This function compares each country name in the input list with each of the following attributes provided by pycountry.countries:

alpha_2 : Two character country code

alpha_3 : Three character country code

name: Country name

common_name : Common name for the country

official_name : Official name for the country

Also, the comparison is done after converting the content of each of the above attributes to upper case, since the input country name list is also in upper case.
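The comparison described above could look something like the sketch below. It is not the original function: the names `build_reference_names` and `check_country_names`, and their signatures, are assumptions made for illustration. To keep the sketch runnable without pycountry installed, it is exercised on two hand-made stand-in records; with pycountry available, `pycountry.countries` would be passed in their place.

```python
from types import SimpleNamespace

# Attributes compared against, as listed above.
FIELDS = ("alpha_2", "alpha_3", "name", "common_name", "official_name")


def build_reference_names(countries):
    """Collect every available attribute value, upper-cased, into one set."""
    reference = set()
    for country in countries:
        for field in FIELDS:
            # Not every record carries every attribute, so default to None.
            value = getattr(country, field, None)
            if value:
                reference.add(value.upper())
    return reference


def check_country_names(names_upper, countries):
    """Return distinct input names that match no reference value, in first-seen order."""
    reference = build_reference_names(countries)
    invalid = []
    for name in names_upper:
        if name not in reference and name not in invalid:
            invalid.append(name)
    return invalid


# Stand-in records (assumption: they mimic the shape of pycountry entries).
sample_countries = [
    SimpleNamespace(
        alpha_2="GB",
        alpha_3="GBR",
        name="United Kingdom",
        official_name="United Kingdom of Great Britain and Northern Ireland",
    ),
    SimpleNamespace(
        alpha_2="US",
        alpha_3="USA",
        name="United States",
        official_name="United States of America",
    ),
]

sample_input = [
    "UNITED KINGDOM",
    "UNISTED STATES OF AMERICA",
    "UNTED KINGDOM",
    "USA",
    "UNTED KINGDOM",
]
invalid = check_country_names(sample_input, sample_countries)
print(invalid)  # ['UNISTED STATES OF AMERICA', 'UNTED KINGDOM']
```

Building the reference set once and testing membership against it keeps the check fast even when the input has millions of rows.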

Another thing to note is that I have created a list called ‘tobe_deleted’ in the function definition. This list contains those countries for which pycountry has a different version of the name, so that these countries do not appear as invalid country names when our function is called.

Example:

1. North Korea and South Korea are two separate countries and hence both are valid country names. However, pycountry.countries does not have an entry for ‘North Korea’ or ‘South Korea’; instead it has an entry like this:
Entry for Korea in pycountry.countries

Our function handles this situation: if North Korea or South Korea appears in the input data set, it is treated as valid even though that exact name is not present in pycountry.countries.

2. MACAU is also spelled MACAO, so both names are valid. However, pycountry.countries has only one entry, spelled MACAO. country_name_check() can handle both MACAO and MACAU.

3. Similarly, pycountry.countries has an entry for IRELAND with name=’Ireland’. However, it is also sometimes referred to as ‘Republic of Ireland’. country_name_check() can handle both ‘Ireland’ and ‘Republic of Ireland’ in the input data set.

4. In the same fashion, multiple names are handled for Iran and Sudan in the country_name_check() function.
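The exception handling described in the examples above can be sketched as a simple allow-list filter. This is an assumption about how such a list might be applied, not the original ‘tobe_deleted’ code; the names `EXTRA_VALID_NAMES` and `drop_accepted_variants` are hypothetical. The accepted variants for Iran and Sudan are not spelled out in this post, so they are left out here.

```python
# Hypothetical allow-list of upper-cased variants taken from the examples above.
# Variants for Iran and Sudan are not listed in the post, so none are included.
EXTRA_VALID_NAMES = {
    "NORTH KOREA",
    "SOUTH KOREA",
    "MACAU",
    "REPUBLIC OF IRELAND",
}


def drop_accepted_variants(flagged, extra_valid=EXTRA_VALID_NAMES):
    """Remove names that were flagged as invalid but are accepted variants."""
    return [name for name in flagged if name not in extra_valid]


# Names that a plain comparison against the reference list would flag.
flagged = ["NORTH KOREA", "UNTED KINGDOM", "MACAU", "REPUBLIC OF IRELAND"]
remaining = drop_accepted_variants(flagged)
print(remaining)  # ['UNTED KINGDOM']
```

Keeping the variants in one place makes it easy to add more accepted spellings later without touching the comparison logic.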

This function can be extended to handle more such scenarios, which would make it even more useful.

I hope this function helps anyone who has struggled with invalid country names in data sets during data analysis. Thanks for reading my post; any suggestions and feedback to improve this function are welcome. I will keep posting my solutions to commonly encountered data analysis issues.

If you liked this article please 👏 and feel free to connect with me on LinkedIn.

Thank you once again for your time and Happy learning!
