Working with Data while Trying to Stay DRY

4 min read · Nov 12, 2022

A Computing Principle to live by

As someone relatively new to data science, I spend most of my time practicing new skills while trying not to pick up bad habits. Blind panic was my impetus to begin my data science journey.

My go-to systems were offline, and I needed to give my team access to large amounts of data without printing a forest's worth of spreadsheets.

One of my passions is providing equitable and sustaining education. Not only is this enshrined in our constitution, but it has also been the throughline of my personal story. As a new American who emigrated from the Commonwealth of the Bahamas, access to an extraordinary education has helped me navigate my life. This led me to dig into this data, and along the way, I learned the following principles:

DRY — Don’t Repeat Yourself. Use libraries, functions, and classes to explore data and generate insights
Simple is Best — Use the latest Python Style Guide when documenting
Reproducible — Use Git for version control to write, revise and share content with collaborators
Accessible — Leverage storytelling in Markdown within Jupyter Notebooks and create public repositories with an easy-to-access, well-documented set of content
Systematic — Be consistent by following a framework like CRISP-DM to go through the data cycle for each project you tackle

What’s the Problem?

Currently, I am analyzing the graduation outcomes for subgroups of students in Consortium Schools and comparing those outcomes to New York City Public Schools according to data hosted by the NYSED and on NYC Open Data.

What is the graduation rate of each subgroup in NYC Public High Schools?

When I started working with this dataset, I was excited because I could grab data using pandas.read_html.

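import pandas as pd
# Read tables from the NYSED website (read_html returns a list of DataFrames, one per table)
tables = pd.read_html('https://data.nysed.gov/gradrate.php?instid=7889678368&year=2021&cohortgroup=1')
df = tables[0]  # assumption: the first table on the page is the one of interest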

First 15 rows of Data Frame of NYC DOE High School Subgroup Graduation Rates
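
Because pandas.read_html returns a list of DataFrames (one per table it finds on the page), it helps to pick the right table by its columns rather than by position. Here is a minimal sketch of that idea. The pick_table helper, the column names, and the values are all hypothetical, and a hand-built list stands in for the real read_html result so the example runs offline.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError(f"No table has columns {sorted(required_columns)}")


# Stand-in for the list that pd.read_html(url) would return.
# Column names and values are hypothetical placeholders.
tables = [
    pd.DataFrame({"Note": ["page header"]}),
    pd.DataFrame({"Subgroup": ["Group A", "Group B"],
                  "Grad Rate": ["90%", "70%"]}),
]

df = pick_table(tables, ["Subgroup", "Grad Rate"])
print(df.shape)
```

Selecting by column name keeps the notebook working if the page later adds or reorders tables, at the cost of assuming the column labels stay stable.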

But then I found myself writing the same code snippet repeatedly:

Load the data
Look at the Shape
Identify the features of the data
Remove or Replace outliers
Review the data
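
The steps above can be bundled into one function so they are written only once. This is a rough sketch, not the exact code from my notebook: quick_clean is a hypothetical name, the data is made up, and "remove outliers" is interpreted here as dropping values that cannot be a valid percentage.

```python
import pandas as pd


def quick_clean(df, rate_column):
    """Inspect a data frame and clean one percentage column."""
    print("Shape:", df.shape)              # look at the shape
    print("Features:", list(df.columns))   # identify the features
    cleaned = df.copy()
    # Strip a trailing percent sign and convert to numbers; entries that
    # cannot be parsed (such as a placeholder dash) become NaN.
    rates = cleaned[rate_column].astype(str).str.rstrip("%")
    cleaned[rate_column] = pd.to_numeric(rates, errors="coerce")
    # Keep only values between 0 and 100; NaN fails this check too.
    valid = cleaned[rate_column].between(0, 100)
    cleaned = cleaned[valid].reset_index(drop=True)
    print(cleaned.head())                  # review the data
    return cleaned


# Hypothetical placeholder data, not real NYSED figures.
raw = pd.DataFrame({
    "Subgroup": ["Group A", "Group B", "Group C", "Group D"],
    "Grad Rate": ["90%", "-", "70%", "250%"],
})
clean = quick_clean(raw, "Grad Rate")
```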

I know it does not seem like that many steps, but imagine curating a Jupyter Notebook with ~40 schools where you were trying to make each data frame as squeaky clean as possible.

This is where the Don’t Repeat Yourself principle comes in. If you find yourself writing the same code repeatedly, you have headed down the wrong way and need to return to the path of Python righteousness.

Applying this principle took my code from 400 lines to 1 line per school. I created helper functions to summarize and plot data for subgroups, which greatly simplified my code.

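def explore_df(df, school_name, grad_rate_comp_type):
    """
    Explore a dataframe and generate a bar graph
    and a summary of the dataframe.

    Inputs
    --
    df: pandas dataframe
    school_name: name of the school, passed to the plot
    grad_rate_comp_type: "state" or "city", the benchmark to compare against

    Outputs
    --
    formatted_df : formatted data frame
    summary_df : summary of data frame
    plot : bargraph
    """
    # Benchmark graduation rate (percent) for the chosen comparison
    if grad_rate_comp_type == "state":
        grad_rate = 85
    elif grad_rate_comp_type == "city":
        grad_rate = 79
    else:
        # Stop here; continuing would leave grad_rate undefined
        raise ValueError("Please enter either city or state")
    # format_df, summarize_dataframe and plot_bar are helper functions
    # defined elsewhere in the notebook
    formatted_df = format_df(df)
    summary_df = summarize_dataframe(formatted_df)
    plot = plot_bar(formatted_df, school_name, grad_rate)

    # A function returns only once, so hand back all three results together
    return formatted_df, summary_df, plot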
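
One more way to stay DRY with the benchmark choice is to keep the rates in a single dictionary and look them up. Below is a small sketch of that idea, using the 85 and 79 values from the function above. The names BENCHMARKS, get_benchmark, and compare_to_benchmark are hypothetical, and the school rates are made up for illustration.

```python
# Benchmark graduation rates (percent) from the function above.
BENCHMARKS = {"state": 85, "city": 79}


def get_benchmark(comp_type):
    """Look up the comparison rate, failing loudly on bad input."""
    try:
        return BENCHMARKS[comp_type]
    except KeyError:
        raise ValueError(
            f"Please enter one of {sorted(BENCHMARKS)}, got {comp_type!r}"
        ) from None


def compare_to_benchmark(school_rates, comp_type):
    """Return each school's rate minus the chosen benchmark."""
    benchmark = get_benchmark(comp_type)
    return {name: rate - benchmark for name, rate in school_rates.items()}


# Hypothetical schools and rates, for illustration only.
school_rates = {"School A": 91, "School B": 74}
gaps = compare_to_benchmark(school_rates, "city")
print(gaps)
```

Adding a new benchmark then means adding one dictionary entry instead of another if branch.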

While I am sure that I will improve as I spend more time internalizing Flatiron’s principles for best Data Science practices, I am ecstatic that I am finally starting to stay a bit DRY.

Here are a few visualizations I have been able to generate quickly while using this principle.

Visualization of the Subgroups that exceed the mean (78%) graduation rate of NYC DOE High School Students

The visualization illustrates that as of June 2021, students who were female, Asian, White, Multiracial, not English Language Learners, or not homeless exceeded the mean graduation rate for New York City Public Schools. I will continue to dig into this data and keep you posted on my progress.
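
Finding the subgroups that exceed a benchmark comes down to a single filter in pandas. Here is a minimal sketch, assuming a data frame with one row per subgroup. The above_benchmark helper, the column names, and every number in the table are hypothetical placeholders rather than real NYSED figures. Only the benchmark of 78 comes from the chart above.

```python
import pandas as pd


def above_benchmark(df, benchmark, rate_column="Grad Rate"):
    """Return rows whose rate exceeds the benchmark, highest first."""
    selected = df[df[rate_column] > benchmark]
    return selected.sort_values(rate_column, ascending=False).reset_index(drop=True)


# Hypothetical placeholder values, not real NYSED figures.
rates = pd.DataFrame({
    "Subgroup": ["Group A", "Group B", "Group C", "Group D"],
    "Grad Rate": [81.0, 72.0, 88.0, 65.0],
})

top = above_benchmark(rates, benchmark=78)
print(top)
```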

What’s my next step?

I would like to gain a better understanding of the following:

How do the graduation rates of students at member schools of the Performance Standards Consortium compare to the mean city and state graduation rates for each subgroup?
Why is there no graduation data for migrant students who attend NYC Public High Schools?
Is there a way to review students’ performance at the intersection of subgroups? For example, Hispanic male students who are English Language Learners
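
For the last question, one possible approach is sketched below, and it rests on a big assumption: that row-per-student records are available. The public tables I have used so far report one subgroup at a time, so this may not be possible with them. All column names and records here are hypothetical.

```python
import pandas as pd

# Hypothetical student-level records, for illustration only.
students = pd.DataFrame({
    "ethnicity": ["Hispanic", "Hispanic", "Hispanic", "Asian"],
    "gender": ["Male", "Male", "Female", "Male"],
    "ell": [True, True, True, False],
    "graduated": [True, False, True, True],
})

# Group by several subgroup columns at once; the mean of a boolean
# column is the share of True values, scaled here to a percentage.
intersections = (
    students.groupby(["ethnicity", "gender", "ell"])["graduated"]
    .mean()
    .mul(100)
    .rename("grad_rate_pct")
    .reset_index()
)
print(intersections)
```

Intersections can produce very small groups, so the resulting rates should be read alongside the number of students in each group.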

I am currently working on creating visualizations for all of the member schools of the New York Performance Standards Consortium using seaborn.

Contact Me

If you would like to be updated with my latest articles, follow me on Medium. You can also connect with me on LinkedIn or email me at tenicka.norwood@gmail.com.
