Data cleaning fundamentals for any Machine Learning project

Augusto Gonzalez-Bonorino
Published in Analytics Vidhya
12 min read · Jul 6, 2021


Hello everyone! In this article I will be talking about every data scientist’s favorite task in any project… cleaning data. Just kidding, almost no one enjoys this task. As you can read in this Forbes article, 76% of data scientists find data cleaning the most time-consuming and least enjoyable part of their job. Nevertheless, it is arguably the single most important stage of any machine or deep learning project, so a good grasp of the fundamentals of data cleansing is crucial for your project’s success. In this post I will address two main categories:

1. Handling structured data
  • What it is and where to get it from.
  • Basic techniques to clean dataframes.

2. Handling unstructured data
  • What it is and how to get it.
  • A web scraper example to extract data from the web.

This is not meant to be a comprehensive guide to data cleaning. Instead, my goal is to help you understand the main concepts of data cleansing and what data scientists are trying to achieve when they do it. There is no fixed recipe for cleaning data; in some ways it is more of an art than a science, so keep an open mind and try different things. There are also many more complex algorithms and techniques for cleaning messier datasets, which I will cover in future posts. I hope you find this a useful beginner exercise.

Structured data

Structured data is the category given to datasets that are organized, clearly defined and easily accessible. Cool, what does this mean? It means that a data analyst can analyze this type of data with little friction. Because it is clearly labeled and efficiently organized, it also simplifies the machine learning engineer’s task of building the training/validation sets needed to train models. Some examples of structured data are Excel spreadsheets, SQL databases and medical records. In general, it just makes working with massive amounts of data much easier.

How do you get ahold of one of these datasets? There are many possibilities: you can download one from Kaggle, build a spreadsheet yourself or leverage your company’s database. Check out this article for a more detailed overview of where to get datasets from. Today, we are going to download a dataset from Kaggle containing economic data about the BRICS economies and perform basic Exploratory Data Analysis (EDA) and data cleaning. BRICS is the name given to the group of the largest emerging economies in the world: Brazil, Russia, India, China and South Africa.

You can download the data from here.

If you do not have a Kaggle account, this is a good time to create one. Kaggle provides an incredible variety of datasets and competitions, which give you both data for your personal projects and the opportunity to showcase your skills; the competitions often lead to jobs, internships or very juicy prizes.

Very well, let’s get our hands dirty. I stored the csv file containing the economic data on my computer and used pandas to read it in a Jupyter notebook (you can find the link to the notebook at the bottom of this article).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
economicData = pd.read_csv('C:\\Users\\Bonoc\\Documents\\Medium\\BRICS\\archive\\BRICS Development Indicators\\Economy_Data.csv', sep=';')

Next, we can start looking into some of the details of our dataset. First, we will print out the first 10 rows to check how it is organized.

economicData.head(10)

Great, we observe that the dataset is in long format and that there seem to be several NaN values. Let’s check the data type of each column.

economicData.info()

The first thing that comes to mind when seeing this description is that we could change the data type of “Year” to something simpler. We could convert it to datetime, but since we are only given the year (not a full date) we might as well just normalize it to int64. The info() method is very useful when you want a quick overview of everything your dataset contains. Alternatively, you could inspect each piece separately with the following commands:

# returns a tuple (num of rows, num of columns)
economicData.shape
# Returns the data type of each column
economicData.dtypes
# Returns the name of the columns in your dataset
economicData.columns

The next thing we should check for is null values. We could check this in a plethora of ways but I find the most descriptive way to be the percentage of null values relative to the number of records.

for col in economicData.columns:
    pct_missing = np.mean(economicData[col].isnull())
    print('{} - {}%'.format(col, round(pct_missing*100)))

Now we are faced with a tricky decision. We note that 30% of the values are missing. Ok, what do we do with them? We could delete those rows, fill the null values with a statistical measure like the mean or the median (or the mode for categorical data), or assign them a special sentinel value (e.g. -99999 or 12345678) just to flag them. The decision is in your hands: if you believe those rows do not add a significant amount of information to your analysis, just delete them; if you want to build a prediction model, you might try replacing them with the mean or median of the column. Since in this post we are not trying to develop a particular model, I decided to delete the rows with missing values to reduce the size and complexity of the dataset. We achieve that by leveraging pandas’ dropna method:

economicData.dropna(subset=['Value'], inplace=True)
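If you would rather impute than delete, here is a minimal sketch of the alternatives mentioned above (mean or median fill, or a sentinel value). Treat it as an illustration rather than part of the original pipeline, and pick one option, not all three:

# Illustrative alternatives to dropping rows (pick one)
economicData['Value'] = economicData['Value'].fillna(economicData['Value'].mean())      # mean imputation
# economicData['Value'] = economicData['Value'].fillna(economicData['Value'].median())  # median imputation
# economicData['Value'] = economicData['Value'].fillna(-99999)                          # sentinel value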

Beware of inplace=True in the dropna call above: it applies the changes to the dataframe directly, whereas inplace=False returns a preview of how the dataframe would look without applying the changes. Additionally, I would like to delete some columns that do not add significant information either: “SeriesCode” and “CountryName”. You might be wondering, why “CountryName” and not “CountryCode”? The answer is that the country code is a universal representation of a country and does not change between datasets, while names can be abbreviated or written in different languages. This matters if at some point you decide to combine another dataset with additional data about these countries: if you keep the country code, you can join or merge the datasets on that column (because join and merge require the cell values to be identical).

economicData.drop(['SeriesCode', 'CountryName'], axis=1, inplace=True)
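As an aside, keeping “CountryCode” pays off later. A hypothetical merge with a second dataframe, keyed on that column, might look like this (otherData is made up purely for illustration):

# Hypothetical second dataset keyed on the same country codes
otherData = pd.DataFrame({'CountryCode': ['BRA', 'RUS', 'IND', 'CHN', 'ZAF'],
                          'Region': ['South America', 'Europe/Asia', 'South Asia', 'East Asia', 'Africa']})
merged = economicData.merge(otherData, on='CountryCode', how='left')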

Awesome! Our dataset is looking much neater already. If we print it out again we will notice that the index has gaps, because we deleted rows without updating it. Fixing this is an optional step but often useful. Let’s reset the index now that we have cleaned most of the dataset:

economicData.reset_index(drop=True, inplace=True)

Finally, we can select specific indicators to shrink the dataset even further and focus only on what we are truly interested in analyzing. I hand-picked 8 indicators and filtered the existing economicData dataframe to only contain data about those indicators; since we already handled null values, we can be certain that every remaining row contains a value. Then I reset the index one last time and changed the “Year” column’s data type to int64.
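The snippet below relies on a list called labels holding those hand-picked SeriesName values. The exact eight are a matter of choice; only 'GDP (current US$)' is confirmed by the plotting code further down, so treat this as a placeholder to fill in with the indicators you care about:

# Placeholder: list the SeriesName strings you want to keep.
# 'GDP (current US$)' is the only one used later in this post; the rest are up to you.
labels = ['GDP (current US$)',
          # ... seven more hand-picked indicator names ...
          ]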

economicData = economicData.loc[economicData['SeriesName'].isin(labels)]
economicData.reset_index(drop=True, inplace=True)
economicData.loc[:, 'Year'] = pd.to_datetime(economicData['Year'], format='%Y').dt.year
economicData

Amazing! Now we are left with a much smaller subset of the original dataset, with no null values and columns of the preferred data types. From here the world is yours; for example, you could create a separate dataframe for each country like this:

brazil = economicData[economicData['CountryCode'] == 'BRA']
china = economicData[economicData['CountryCode'] == 'CHN']
southAfrica = economicData[economicData['CountryCode'] == 'ZAF']
russia = economicData[economicData['CountryCode'] == 'RUS']
india = economicData[economicData['CountryCode'] == 'IND']

Or you could plot the evolution of a specific indicator over time. For example, we could visualize GDP (current US$) for each country year by year. Excuse the lack of neatness of my code, I am also still mastering visualizations in Python :) Note that each country dataframe is filtered by the GDP indicator and the values are scaled to billions.

plt.plot(brazil.loc[brazil['SeriesName'].isin(['GDP (current US$)'])]['Year'],
         brazil.loc[brazil['SeriesName'].isin(['GDP (current US$)'])]['Value']/1000000000, label='Brazil')
plt.plot(russia.loc[russia['SeriesName'].isin(['GDP (current US$)'])]['Year'],
         russia.loc[russia['SeriesName'].isin(['GDP (current US$)'])]['Value']/1000000000, label='Russia')
plt.plot(southAfrica.loc[southAfrica['SeriesName'].isin(['GDP (current US$)'])]['Year'],
         southAfrica.loc[southAfrica['SeriesName'].isin(['GDP (current US$)'])]['Value']/1000000000, label='South Africa')
plt.plot(india.loc[india['SeriesName'].isin(['GDP (current US$)'])]['Year'],
         india.loc[india['SeriesName'].isin(['GDP (current US$)'])]['Value']/1000000000, label='India')
plt.plot(china.loc[china['SeriesName'].isin(['GDP (current US$)'])]['Year'],
         china.loc[china['SeriesName'].isin(['GDP (current US$)'])]['Value']/1000000000, label='China')
plt.title('BRICS GDP (current US$) in billions')
plt.xlabel('Year')
plt.ylabel('GDP (current US$) in billions')
plt.legend()
plt.show()
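By the way, seaborn was imported at the top but never used. As a hedged alternative, the same chart can be drawn in a few lines by letting hue split the series by country:

# Alternative sketch with seaborn: one line per CountryCode
gdp = economicData[economicData['SeriesName'] == 'GDP (current US$)'].copy()
gdp['GDP (US$ billions)'] = gdp['Value'] / 1e9
sns.lineplot(data=gdp, x='Year', y='GDP (US$ billions)', hue='CountryCode')
plt.title('BRICS GDP (current US$) in billions')
plt.show()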

Unstructured data

Unlike structured data, these datasets do not have a specific format and are not organized in an efficient way. This does not mean they are better or worse, just different. Structured and unstructured data often complement each other because they have very different uses. Some examples of unstructured data are PDFs, tweets, natural language and data scraped from the web. Unstructured data is often slower and more challenging to work with, but it gives us the opportunity to run all kinds of cool experiments like sentiment analysis, image classification and automated financial analysis, among many other tasks. Besides, who doesn’t love a good challenge?

Today we are going to scrape the Berkshire Hathaway portfolio, clean it and store it in a pandas dataframe. Note that this post focuses on data cleansing, so I will not put much emphasis on explaining web scraping concepts. Nevertheless, if this is something you are interested in, let me know in the comments and I will write an article explaining the art of web scraping in detail :)

Let’s get started…

First we will code our web scraper using Beautiful Soup and requests. Here is how to do so:

import pandas as pd
import bs4 as soup
import requests
import numpy as np
## link where we will get our data from
url = 'https://finance.yahoo.com/u/yahoo-finance/watchlists/the-berkshire-hathaway-portfolio?.tsrc=fin-srch'
## download webpage HTML code
info = requests.get(url)
data = soup.BeautifulSoup(info.text, 'lxml')

In a nutshell, a web scraper is a program that extracts the HTML code of a given webpage (or webpages) and lets you interact with it to extract information. It is a very common tool for all kinds of data mining tasks.
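One small defensive step worth adding before parsing (my addition, not in the original code): confirm the request actually succeeded, since Yahoo Finance occasionally blocks or redirects scripted requests.

# Defensive check (an addition): bail out early if the page did not load
if info.status_code != 200:
    raise RuntimeError('Request failed with status code {}'.format(info.status_code))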

Now that we have the HTML we need to inspect it and find the information that is relevant to our project. In our case we will extract the table containing all the holdings in the Berkshire Hathaway portfolio with their respective information (price, volume, ticker…). To do this, right-click on the table and select “Inspect”. If this is your first time looking at HTML it might seem a little daunting; take a deep breath and relax, it is much simpler than it seems. Just hover over the different lines of code and you’ll see that the browser highlights the part of the webpage each line relates to. The <tbody> tag seems to contain all the data in the table. If you click on the <tbody> tag, several other tags will appear. Again, hover over them and notice that the <tr> tags contain the table rows and that the <td> tags contain the contents of each individual cell. Pretty intuitive, right? Very well, since we want the information in each cell of the table, we will loop through all the <td> tags in this HTML and store their text in a Python list:

## we leverage bs4 to inspect the HTML we just downloaded
## once we find the location of the data we are interested in we store it on a list
stocks = []
for i in data.findAll('td'):
    stocks.append(i.text)

I encourage you to print out the raw data and analyze it. Play around with it a little; experimentation is a key part of any research project. Doing this, we can observe, for example, that looping over <td> returned the cells of all the tables on the webpage, of which there are two in this case. So, I want to get rid of the information from the first table, which contains data about the performance of the overall portfolio.
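For instance, a quick look like this (purely illustrative) makes the ten summary cells of that first table easy to spot:

# Illustrative inspection: the first ten cells belong to the portfolio-summary table
print(len(stocks))
print(stocks[:15])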

## extra code to eliminate the first ten data points about the overall portfolio
parsedStocks = stocks[10:]

Great! We are halfway there. The next thing we must do is reorganize our data so that we can create a pandas dataframe. For this, I thought it would be helpful to build a nested list, with the information for each individual stock in its own sub-list. Nine data points are provided per stock, so if we split our parsedStocks list into chunks of nine elements we should get exactly what we want. Let’s try it out:

## Let's create a list per stock.
## This way it will be easier to create a dataframe with each row representing a ticker and all its info
x = 0
y = len(parsedStocks)
listStocks = []
for i in range(x, y, 9):
    listStocks.append(parsedStocks[i:i+9])

Here we are looping over our parsedStocks list from 0 up to (but not including) the length of the list, stepping nine elements at a time. We append each nine-element slice to a new list called listStocks (not very creative, I know) and voilà! We have our cleaned nested list containing information about the Berkshire Hathaway holdings. Let’s create our beloved pandas dataframe now:

## The column names for our dataframe
headers = ['Ticker', 'Company', 'Last Price', 'Change US$', 'Change %',
           'Time', 'Volume', 'Avg Volume (3 months)', 'Market Cap']
## Create dataframe
BHportfolio = pd.DataFrame(columns=headers)
## Finally, populate the dataframe with our stock data
for row in listStocks:
    length = len(BHportfolio)
    BHportfolio.loc[length] = row

BHportfolio

Incredible! From here we can go on to use any of the techniques we experimented with while cleaning our BRICS dataset. Take a look at BHportfolio.info() and check the dtypes and the number of columns/rows. See if you can delete anything to make it more concise (I decided to drop the Time column, for example, because I had no use for it), and check whether there are any null values.
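A hedged sketch of that follow-up cleanup, assuming you also have no use for “Time” and want one of the string columns as a number, could look like this:

# Follow-up cleanup sketch (assumptions: 'Time' is dropped as in the text,
# and 'Last Price' is a string like '1,234.56' that we want as a float)
BHportfolio.info()
BHportfolio.drop(columns=['Time'], inplace=True)
BHportfolio['Last Price'] = pd.to_numeric(
    BHportfolio['Last Price'].str.replace(',', ''), errors='coerce')
print(BHportfolio.isnull().mean() * 100)  # percentage of nulls per column, as before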

Alternatively, you could build a SQL database very easily from your new dataframe. Just copy the following code:

import sqlite3

database = sqlite3.connect('CursoDB2.db')  # Any name you want
c = database.cursor()
# I create a table called "STOCKS"
# Note we are now using SQL syntax, but in Python! Pretty cool...
c.execute("CREATE TABLE STOCKS (Ticker, Company, LastPrice, ChangeDollar, ChangePerc, Volume, AvgVolume, MarketCap)")
database.commit()
BHportfolio.to_sql('STOCKS', database, if_exists='replace', index=False)
# Now we can access our new SQL database and perform queries
c.execute('''
SELECT * FROM STOCKS
''')
for row in c.fetchall():
    print(row)
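One subtlety worth flagging: because if_exists='replace' drops and recreates the table from the dataframe, the final column names are the dataframe’s (spaces and all), not the ones in the CREATE TABLE statement. A small follow-up sketch, quoting a column name that contains a space and closing the connection when finished:

# Column names come from the dataframe after to_sql(..., if_exists='replace'),
# so names with spaces need quoting in SQL
c.execute('SELECT Ticker, "Last Price" FROM STOCKS')
print(c.fetchall())
database.close()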

Wow, Python truly is an amazing tool for all kinds of projects. From here you are bound only by your imagination. For example, I used this program with a .bat file to automatically scrape stocks from Yahoo Finance every day before and after market open, and from there you could develop a trading bot that uses the updated data to make trades or generate alerts. The possibilities are endless!

Conclusion

I hope you found this article helpful and informative. The main goal was to give you a simple methodology to download, analyze and clean datasets in both structured and unstructured formats, and to explain the main differences between the two (especially why one should not be considered “better” than the other). An understanding of the fundamentals of data cleansing and analysis is crucial not only for machine or deep learning projects, but for almost any project, given the rise of Big Data, the growing demand for employees and interns with programming skills in every discipline, and the general push toward a more digitized, data-driven society.

Thank you so much for reading!


Augusto Gonzalez-Bonorino

2nd year PhD Economics at Claremont Grad Univ. From Argentina. I created the Entangled Mind blog. Check it out ;) https://agbonorino.medium.com/membership