Working with Data while Trying to Stay DRY
A Computing Principle to live by
As someone relatively new to data science, I spend most of my time practicing new skills while trying not to pick up bad habits. Blind panic was my impetus to begin my data science journey.
My go-to systems were offline, and I needed to give my team access to large amounts of data without printing a forest's worth of spreadsheets.
One of my passions is providing equitable and sustaining education. Not only is this enshrined in our constitution, but it has also been the throughline of my personal story. As a new American who emigrated from the Commonwealth of the Bahamas, access to an extraordinary education has helped me navigate my life. This led me to dig into this data, and along the way, I learned the following principles:
- DRY — Don’t Repeat Yourself. Use libraries, functions, and classes to explore data and generate insights
- Simple is Best — Use the latest Python Style Guide when documenting
- Reproducible — Use Git for version control to write, revise and share content with collaborators
- Accessible — Leverage storytelling in Markdown within Jupyter Notebooks and create public repositories with an easy-to-access, well-documented set of content
- Systematic — Be consistent by following a framework like CRISP-DM to go through the data cycle for each project you tackle
What’s the Problem?
Currently, I am analyzing the graduation outcomes for subgroups of students in Consortium Schools and comparing them with the outcomes for New York City Public Schools, using data hosted by the NYSED and on NYC Open Data.
- What is the graduation rate of each subgroup in NYC Public High Schools?
When I started working with this dataset, I was excited because I could grab data using pandas.read_html.
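import pandas as pd
# Read tables from the NYSED website; read_html returns a list of DataFrames, one per table found
tables = pd.read_html('https://data.nysed.gov/gradrate.php?instid=7889678368&year=2021&cohortgroup=1')
df = tables[0]  # assumption: the first table on the page is the one of interest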
But then I found myself writing the same code snippet repeatedly:
- Load the data
- Look at the Shape
- Identify the features of the data
- Remove or Replace outliers
- Review the data
I know it does not seem like that many steps, but imagine curating a Jupyter Notebook with ~40 schools where you were trying to make each data frame as squeaky clean as possible.
This is where the Don't Repeat Yourself principle comes in. If you find yourself writing the same code over and over, you have headed down the wrong way and need to return to the path of Python righteousness.
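The repeated steps listed above can be folded into one reusable function. What follows is only a minimal sketch, not the code from my notebook: the name `first_look`, the `grad_rate` column, and the 0 to 100 plausibility range are all assumptions made for illustration, and the sample rows are made up.

```python
import pandas as pd


def first_look(df: pd.DataFrame, value_col: str,
               low: float = 0, high: float = 100) -> pd.DataFrame:
    """Run the same first-pass checks on any school's table (hypothetical helper)."""
    # Look at the shape and identify the features
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    # Coerce to numeric so text placeholders such as "-" become NaN
    values = pd.to_numeric(df[value_col], errors="coerce")
    # Replace values outside the plausible range with NaN instead of dropping rows
    cleaned = df.assign(**{value_col: values.where(values.between(low, high))})
    # Review the result
    print(cleaned.head())
    return cleaned


# Made-up rows purely to exercise the function; these are not real NYSED figures
sample = pd.DataFrame({"subgroup": ["A", "B", "C"],
                       "grad_rate": ["82", "-", "250"]})
cleaned = first_look(sample, "grad_rate")
```

Because the function returns a new data frame rather than modifying its input, the raw table stays available for comparison if a cleaning rule turns out to be too aggressive.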
Applying this principle took my code from 400 lines to 1 line per school. I created helper functions to summarize and plot data for subgroups, which greatly simplified my code.
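def explore_df(df, school_name, grad_rate_comp_type):
    """
    Explore a dataframe and generate a bar graph
    and a summary of the dataframe.
    Inputs
    --
    df: pandas dataframe
    school_name: name of the school, passed to plot_bar
    grad_rate_comp_type: "city" or "state", the benchmark to compare against
    Outputs
    --
    formatted_df : formatted data frame
    summary_df : summary of data frame
    plot : bar graph
    """
    # Benchmark graduation rate (percent) for the chosen comparison
    if grad_rate_comp_type == "state":
        grad_rate = 85
    elif grad_rate_comp_type == "city":
        grad_rate = 79
    else:
        # Stop here; otherwise grad_rate would be undefined below
        raise ValueError("Please enter either city or state")
    formatted_df = format_df(df)
    summary_df = summarize_dataframe(formatted_df)
    plot = plot_bar(formatted_df, school_name, grad_rate)
    # Return all three results as a tuple; with three separate return
    # statements only the first one would ever run
    return formatted_df, summary_df, plot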
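The helpers that `explore_df` calls are not shown in this post. To give a sense of what such helpers might look like, here is a rough sketch of stand-ins for `format_df` and `summarize_dataframe`. The bodies, the `Grad Rate` column name, and the sample rows are assumptions made for illustration and are not the versions in my repository.

```python
import pandas as pd


def format_df(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: tidy column names and make the rate numeric."""
    # "Grad Rate" -> "grad_rate", " Subgroup " -> "subgroup", and so on
    out = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Strip a trailing percent sign, then convert; unparseable cells become NaN
    out["grad_rate"] = pd.to_numeric(
        out["grad_rate"].astype(str).str.rstrip("%"), errors="coerce"
    )
    return out


def summarize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: basic summary statistics for the rate column."""
    return df["grad_rate"].agg(["count", "mean", "min", "max"]).to_frame("grad_rate")


# Placeholder rows purely for illustration; these are not real NYSED figures
raw = pd.DataFrame({"Subgroup": ["X", "Y"], "Grad Rate": ["90%", "70%"]})
formatted = format_df(raw)
summary = summarize_dataframe(formatted)
print(summary)
```

Keeping each helper small and single-purpose means a change to the cleaning rules or the summary statistics happens in one place and applies to every school.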
While I am sure that I will improve as I spend more time internalizing Flatiron’s principles for best Data Science practices, I am ecstatic that I am finally starting to stay a bit DRY.
Here are a few visualizations I have been able to generate quickly while using this principle.
The visualization illustrates that as of June 2021, students who were female, Asian, White, Multiracial, Non-English Language Learners, and were not Homeless exceeded the mean graduation rate for New York City Public Schools. I will continue to dig into this data and keep you posted on my progress.
What’s my next step?
I would like to gain a better understanding of the following:
- How do the graduation rates of students at member schools of the Performance Standards Consortium compare to the mean city and state graduation rates for each subgroup?
- Why is there no graduation data for migrant students who attend NYC Public High Schools?
- Is there a way to review students’ performance at the intersection of subgroups? For example, Hispanic male students who are English Language Learners.
I am currently working on creating visualizations for all of the member schools of the New York Performance Standards Consortium using seaborn.
Contact Me
If you would like to be updated with my latest articles, follow me on Medium. You can also connect with me on LinkedIn or email me at tenicka.norwood@gmail.com.