Playing with Data 101: Data Cleaning

Zhan Yu
Analytics Vidhya
Published in
4 min readApr 13, 2020

Basic data cleaning using Pandas in Python3 of Jupyter Notebook.

From the Internet

Pandas is derived from the term “panel data”. The Pandas package provides us with DataFrame operations. Typically, Pandas will be our package of choice for creating and manipulating tabular data that is represented as DataFrames, with each column stored as a Series object.

In this blog, we are going to use SAT and ACT scores in 2017 and 2018 to demonstrate some basic data cleaning and what we can find out from our data.

Some background:
SAT and ACT scores are widely used for admission decisions of educational institutes and awarding merit-based scholarships. Even though most colleges do not prefer one test over the other, SAT and ACT participation rates are very different in each state of the US.

Data reading

We have been given four datasets: 2017 SAT Scores, 2017 ACT Scores, 2018 SAT Scores and 2018 ACT Scores from the source for the SAT data here, and the source for the ACT data here(the website has already updated to the year 2019). They are not big datasets so we can read them and check for errors manually. We have totally about 1000 items in these 4 datasets. Apart from “States”, all others look like numbers. The States column has 50 states in the US and capital city DC. Then we code-checking the DataFrames to see the numbers of rows and columns and if there are missing values.

Data cleaning

Even though the DataFrames does not contain missing values, sat_2017 does not have a row of national participation rates or scores. So we drop this row. Also, we discovered that there are 3 errors:

  1. In sat_2017, Maryland average Math score is 52 which is impossibly low since the lowest math score is 200.
  2. In act_2017, the Science score for Maryland is 2.3 which is also not reasonable since the Composite score is 23.6.
  3. The type of Composite in act_2017 is object since in row 51 the value is a string 20.2x. We fixed the errors and then changed all numerical columns into int or float.
SAT Scores 2017
ACT Scores 2017

Merging DataFrame

Before merging data frames, we need to rename the columns with all lowercase letters and no spaces. Also, the column names should be unique and informative since we have several same column names among 4 DataFrames.
When we merged DataFrames, we found that in act_2018 there is District of columbia instead of District of Columbia coursing many NaN. We fixed that too.
We did it in 2 steps:

  1. Merge 2017 data frames as Combined SAT and ACT Scores in 2017 and 2018 data frames as Combined SAT and ACT Scores in 2018
  2. Merge 2017 and 2018 data frames as Final Scores for 2017 and 2018

Then we exported the data files which looks like this:

Outside research

Iowa, Kansas and New Mexico are the three states have the lowest test participation rate in 2017. Iowa, Kansas, and Alaska are the three states have the lowest test participation rate in 2018. And a fun factor is that ACT’s headquarters is in Iowa City, IA. On the other hand, some states have the participation rate of the SAT or ACT as high as 100%. Does this mean Some states make the SAT or ACT as a requirement for all eligible students for several reasons but for many students and parents, SAT or ACT scores can be critical for applying for scholarships.
For students applying for scholarships, many scholarships are more competitive in “smarter” states. For students who want to compare themselves to their in-state peers, state average scores are also very useful. Source

Why do those states have low SAT and ACT participation rates? Does it mean their education systems have problems?
The research shows that more and more universities do not consider SAT/ACT as a mandatory factor for admission. Two Ivy League schools have decided that many of their graduate school programs do not need a test score for admissions, fresh evidence of growing disenchantment among educational institutions with using high-stakes tests as a factor in accepting and rejecting students.
The reason is that many pieces of research show SAT and ACT scores are strongly linked to family income, mother’s education level, and race. Nearly 50 accredited colleges and universities that award bachelor’s degrees announced from September 2018 to September 2019 that they were dropping the admissions requirement for an SAT or ACT score. That brings the number of accredited schools to have done so to 1,050 — about 40 percent of the total. Source

While education institutes are adapting more ways to attract a wider range of applicants, it is more difficult for CollegeBoard and ACT Inc. who are competing with each other to increase the participation rate.

--

--