Data Cleaning with Python For Beginners — Introduction
Data cleaning is a vital part of machine learning because the algorithm needs to be fed data that is as error-free as possible. Errors in the data used to train the algorithm will lead to inaccurate results.
Who should read this?
If you are a budding data scientist or machine learning engineer who jumped straight into machine learning or data analysis and is learning Python in parallel, this is for you. If you are already an expert in machine learning and a professional data analyst, you may still pick up something new here or gain more clarity on what you already know, because I will keep things basic and explain each concept well.
Series Outline
- Data Cleaning with Python For Beginners — Introduction (you are here)
- Data Cleaning with Python For Beginners — Code Implementation
- Data Cleaning with Python Using Libraries — NumPy and Pandas
- Data Cleaning with Python Using Libraries — Natural Language Toolkit (NLTK)
- Data Cleaning with OpenRefine
Prerequisites
I assume you have Jupyter Notebook installed on your system. If not, download the Anaconda Distribution. It is one of the easiest ways to do Python/R data science and machine learning on Linux, Windows, and Mac OS X, and it comes with many Python packages pre-installed.
What is Data Cleaning?
Data cleaning is also referred to as data cleansing, data washing or data scrubbing. It can be defined as the process of removing or correcting corrupt data. The data in question can be a record in a table, a record set or a database.
#Data
In statistics, data can be defined as facts or figures from which conclusions can be drawn. The singular of data is datum, but it is rarely used because data is usually treated as an uncountable noun.
#Table
A table is an arrangement of data in rows and columns. It can also be referred to as a spreadsheet or datasheet. It is made up of records and fields. If you tend to confuse a “row” with a “column”, the popular children's song can help you: Row, row, row your boat. Gently down the stream. Merrily, merrily, merrily, merrily. Life is but a dream. You row your boat from left to right, not from top to bottom.
#Record
A record is a row, which runs horizontally (along the x-axis). It contains all the fields present in the table. Consider Table 1 above: the first record is “1, DELE, GIWA, 23” and the last record is “4, Oseni, Kunle, 43”.
#Field
A field is a column, which runs vertically (along the y-axis). Within a record, a field holds a single data entry. Consider Table 1 above: the AGE (field) for the first record is 23 (data), and the LAST NAME (field) for the last record is Kunle (data).
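To make records and fields concrete, here is a minimal sketch in plain Python. It reproduces only the two records quoted from Table 1. The field names AGE and LAST NAME come from the text, while ID and FIRST NAME are assumed names for the first two columns, and `get_field` is a hypothetical helper.

```python
# AGE and LAST NAME are named in the article; ID and FIRST NAME are assumed names.
FIELDS = ("ID", "FIRST NAME", "LAST NAME", "AGE")

# Each tuple is one record (row). Only the two records quoted in the text are used.
table = [
    (1, "DELE", "GIWA", 23),
    (4, "Oseni", "Kunle", 43),
]


def get_field(record, field_name):
    """Return the single data entry stored under field_name in one record."""
    return record[FIELDS.index(field_name)]


first_record = table[0]
last_record = table[-1]

print(get_field(first_record, "AGE"))       # 23
print(get_field(last_record, "LAST NAME"))  # Kunle

# Reading one field (column) across every record in the table
ages = [get_field(record, "AGE") for record in table]
print(ages)  # [23, 43]
```

A record is read across (one tuple), while a field is read down (the same position in every tuple).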
#Record Set
A record set is a collection of records. It is also referred to as a table or spreadsheet.
#Database
A database is an organized collection of data. In a relational database, this data is stored in multiple tables.
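As a rough illustration of a database holding several tables, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table names `people` and `visits`, their column names, and the sample site are assumptions made up for this example. Only the two people records come from the text.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables inside one database (table and column names are hypothetical).
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER)"
)
conn.execute("CREATE TABLE visits (person_id INTEGER, site TEXT)")

# The two records quoted earlier in the article.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, "DELE", "GIWA", 23), (4, "Oseni", "Kunle", 43)],
)
# A made-up row for the second table.
conn.execute("INSERT INTO visits VALUES (?, ?)", (1, "https://example.com"))

# List the tables the database contains.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)  # ['people', 'visits']

# Fetch the last record of the people table.
last_person = conn.execute(
    "SELECT * FROM people ORDER BY id DESC LIMIT 1"
).fetchone()
print(last_person)  # (4, 'Oseni', 'Kunle', 43)

conn.close()
```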
Why is Data Cleaning Important?
Data cleaning is often the most important, as well as the most time-consuming, part of data science because of the sheer volume of data generated every day. It is no longer possible to skim through such a large pool of data by eye, so it has become essential to examine the data with a tool and fix the errors in it. Errors that can be present in the data include:
- Irrelevant observations
- Unwanted outliers
- Missing data
#Irrelevant observations
An irrelevant observation may be a duplicate observation, and this type of data is generated a lot. For example, my browsing history contains some sites that I have visited more than once, but I only need the list of distinct websites I have visited. It is imperative that I remove the duplicate websites from my browsing history before performing other operations on the data. Irrelevant observations can also be observations that are simply not needed in the analysis.
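The browsing-history example can be sketched in plain Python as follows. The URLs are placeholders invented for illustration, and `remove_duplicates` is a hypothetical helper name. This is one simple way to drop repeats while keeping the order of first visits, not the only one.

```python
# Hypothetical browsing history; the URLs are made up for illustration.
history = [
    "https://example.com",
    "https://example.org/news",
    "https://example.com",
    "https://example.net",
    "https://example.org/news",
]


def remove_duplicates(items):
    """Return the items with repeats dropped, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


unique_sites = remove_duplicates(history)
print(unique_sites)
# ['https://example.com', 'https://example.org/news', 'https://example.net']
```

If the order of visits does not matter, `set(history)` would also remove the duplicates.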
#Unwanted outliers
An outlier is a value that differs markedly from the rest of the data. An example of an unwanted outlier when calculating the mean wealth in Nigeria is Aliko Dangote, whose net worth is over $11 billion. If we include his net worth in the calculation, the mean will be misleading, because less than 0.03% of the population in Nigeria has that kind of wealth. It is better to remove such outliers from our data before performing any kind of operation.
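To see how a single extreme value distorts the mean, here is a small sketch using Python's built-in `statistics` module. The wealth figures are invented for illustration, apart from the $11 billion value mentioned above. The 1.5 × IQR fence used to flag outliers is a common rule of thumb that I am assuming here, not a method taken from this article.

```python
import statistics

# Made-up wealth figures in dollars, plus one extreme value of $11 billion.
wealth = [1200, 1500, 1800, 2000, 2500, 3000, 11_000_000_000]

# The mean is dragged far above what a typical person in the sample has.
mean_with_outlier = statistics.mean(wealth)
print(mean_with_outlier)

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(wealth, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

cleaned = [value for value in wealth if lower_fence <= value <= upper_fence]
print(cleaned)  # [1200, 1500, 1800, 2000, 2500, 3000]

mean_without_outlier = statistics.mean(cleaned)
print(mean_without_outlier)  # 2000
```

With the extreme value removed, the mean falls back in line with the rest of the sample.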
#Missing data
In our tables, some important fields might be empty for a record. It is imperative that we either remove such records from our dataset or fill in the missing values before performing other operations on the data, so that we don't get a skewed result.
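A minimal sketch of handling missing data in plain Python is shown below. The records with IDs 2 and 3 are hypothetical, `None` is assumed to mark an empty field, and `is_complete` is a made-up helper. The first approach drops incomplete records. The second, a common alternative, fills a missing AGE with the median of the known ages.

```python
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"ID": 1, "LAST NAME": "GIWA", "AGE": 23},
    {"ID": 2, "LAST NAME": None, "AGE": 31},
    {"ID": 3, "LAST NAME": "Ade", "AGE": None},
    {"ID": 4, "LAST NAME": "Kunle", "AGE": 43},
]

REQUIRED_FIELDS = ("LAST NAME", "AGE")


def is_complete(record):
    """Return True when every required field has a non-empty value."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


# Option 1: remove records that have a missing required field.
complete_records = [record for record in records if is_complete(record)]
print([record["ID"] for record in complete_records])  # [1, 4]

# Option 2: fill a missing AGE with the median of the known ages.
known_ages = [record["AGE"] for record in records if record["AGE"] is not None]
median_age = statistics.median(known_ages)
filled_records = [
    {**record, "AGE": median_age} if record["AGE"] is None else record
    for record in records
]
print(filled_records[2]["AGE"])  # 31
```

Which option fits depends on the dataset: dropping records loses information, while filling values introduces estimates.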
Ways of Cleaning Data
Since the advent of big data, various libraries have been developed to assist in the deep analysis of data. These libraries not only analyse data but also look for relationships within it. The libraries are named in the list of articles in this series, near the top of this post.
TL;DR
Although it is possible to use the APIs provided by Python, or by any other programming language, to clean and analyse data, these built-in APIs are not robust enough to clean and analyse complex datasets.
What’s Next?
In the next article in this series, we’ll make use of Python APIs to analyse data. We’ll go through a real scenario, identify the errors in the data and analyse it.
Thank you for taking the time to read this article. Liked it? Clap 👏 to say “thanks!” and please share it to help others find this article.