Playing with Data 101: Data Cleaning
Basic data cleaning using Pandas in Python3 of Jupyter Notebook.
Pandas is derived from the term “panel data”. The Pandas package provides us with DataFrame operations. Typically, Pandas will be our package of choice for creating and manipulating tabular data that is represented as DataFrames, with each column stored as a Series object.
In this blog, we are going to use SAT and ACT scores in 2017 and 2018 to demonstrate some basic data cleaning and what we can find out from our data.
Some background:
SAT and ACT scores are widely used for admission decisions of educational institutes and awarding merit-based scholarships. Even though most colleges do not prefer one test over the other, SAT and ACT participation rates are very different in each state of the US.
Data reading
We have been given four datasets: 2017 SAT Scores, 2017 ACT Scores, 2018 SAT Scores and 2018 ACT Scores from the source for the SAT data here, and the source for the ACT data here(the website has already updated to the year 2019). They are not big datasets so we can read them and check for errors manually. We have totally about 1000 items in these 4 datasets. Apart from “States”, all others look like numbers. The States
column has 50 states in the US and capital city DC. Then we code-checking the DataFrames to see the numbers of rows and columns and if there are missing values.
Data cleaning
Even though the DataFrames does not contain missing values, sat_2017
does not have a row of national participation rates or scores. So we drop this row. Also, we discovered that there are 3 errors:
- In
sat_2017
, Maryland averageMath
score is 52 which is impossibly low since the lowest math score is 200. - In
act_2017
, theScience
score for Maryland is 2.3 which is also not reasonable since theComposite
score is 23.6. - The type of
Composite
inact_2017
is object since in row 51 the value is a string20.2x
. We fixed the errors and then changed all numerical columns intoint
orfloat
.
Merging DataFrame
Before merging data frames, we need to rename the columns with all lowercase letters and no spaces. Also, the column names should be unique and informative since we have several same column names among 4 DataFrames.
When we merged DataFrames, we found that in act_2018
there is District of columbia
instead of District of Columbia
coursing many NaN
. We fixed that too.
We did it in 2 steps:
- Merge 2017 data frames as Combined SAT and ACT Scores in 2017 and 2018 data frames as Combined SAT and ACT Scores in 2018
- Merge 2017 and 2018 data frames as Final Scores for 2017 and 2018
Then we exported the data files which looks like this:
Outside research
Iowa, Kansas and New Mexico are the three states have the lowest test participation rate in 2017. Iowa, Kansas, and Alaska are the three states have the lowest test participation rate in 2018. And a fun factor is that ACT’s headquarters is in Iowa City, IA. On the other hand, some states have the participation rate of the SAT or ACT as high as 100%. Does this mean Some states make the SAT or ACT as a requirement for all eligible students for several reasons but for many students and parents, SAT or ACT scores can be critical for applying for scholarships.
For students applying for scholarships, many scholarships are more competitive in “smarter” states. For students who want to compare themselves to their in-state peers, state average scores are also very useful. Source
Why do those states have low SAT and ACT participation rates? Does it mean their education systems have problems?
The research shows that more and more universities do not consider SAT/ACT as a mandatory factor for admission. Two Ivy League schools have decided that many of their graduate school programs do not need a test score for admissions, fresh evidence of growing disenchantment among educational institutions with using high-stakes tests as a factor in accepting and rejecting students.
The reason is that many pieces of research show SAT and ACT scores are strongly linked to family income, mother’s education level, and race. Nearly 50 accredited colleges and universities that award bachelor’s degrees announced from September 2018 to September 2019 that they were dropping the admissions requirement for an SAT or ACT score. That brings the number of accredited schools to have done so to 1,050 — about 40 percent of the total. Source
While education institutes are adapting more ways to attract a wider range of applicants, it is more difficult for CollegeBoard and ACT Inc. who are competing with each other to increase the participation rate.