Flatiron Data Science Fellowship-Final Project Plan

Timothy Mango
Jul 16, 2019


Data Science Project Management Tips, Final Project Proposal, and Final Project Data

This week marks the end of formal coursework for Flatiron School’s Washington, DC Data Science Fellowship (April 22 cohort). It has been a blast! As the coursework ends, my cohort nears final projects. I have been waiting a long time for this, and I am excited to see the projects each of us will create. The timeline for the final project is 2.5 weeks, and graduation is scheduled for August 2nd. The intent of this post is to share tips for managing data science projects, my final project proposal, and my final project data. My next post will include some initial visualizations for the final project using GeoPandas and the GeoJSON data described below.

1) General Data Science Project Management Tips

A) Preparing for an interesting new project…

In my opinion, data is king when preparing for a new data science project. The first objective is to find a great dataset (or multiple datasets). Next, it’s time to brainstorm relevant questions that could be answered with the data. A successful project answers one or more specific questions and delivers some kind of value to a project stakeholder (it is best if the stakeholder is not just you). After examining the dataset and the question or questions of interest, it is time to create a rough project scope and timeline. This scope and timeline are high level and can be adjusted during the exploratory data analysis (EDA) and data cleaning stages of the project.

B) Obtaining Data: Finding data, Scraping Data, Querying and Merging Data

The main factors to consider are how clean the desired data is and how much time your project allows. Maybe you want to clean dirty data to more effectively answer a question for an organization. Maybe you want to spend more time training certain types of models to slightly improve model predictions. Maybe you want to better understand and explain a difficult topic.

Lucky for me, I maintain a list of interesting datasets I find while adventuring through the internet (literally while reading other blogs, working on other projects, or reading papers/watching YouTube videos). This usually provides a base resource for starting new projects, and more data can be scraped, queried, or merged as the prospective project solidifies. Timeline and project scope are your best friends here; if you know you already have good data, stick with your gut.

Some datasets answer certain kinds of questions more easily than others, and the choice of data will define the rest of your project! Remember, it’s always possible to use the same data to answer different questions in later projects. It is also possible to repeat projects with new techniques. Most projects will never be completely “finished”; more value can almost always be added.

C) How to be original, resourceful, and generous with credit

The scientific process requires research to be built on top of other research; otherwise we would be stuck solving the same problems! A final project does not have to be “original” in all of its elements. I attended local data science meetups and read through academic papers, data science forums, and other GitHub projects before deciding on my final project. However, my project is unique in its approach, and its deliverable is valuable. It is important to remember that you don’t need to recreate the data science wheel from scratch. If you can be resourceful and generous with credit, you will improve your ability to manage project scope. Just remember to cite your sources!

2) My Data Science Final Project Topic

A) Initial Problem and Project Value

I attended a DataKind meetup in Washington, DC, where volunteer data scientists/analysts are working on projects related to Earth Challenge 2020. At this event I was introduced to this project, which maps waste cleaned up by volunteers (volunteers use an app to submit waste cleanup data). My thought was that volunteers can pick up certain kinds of waste, but what if they encounter hazardous waste? My goal is to create a map (visualization) of locations that have a high likelihood of containing hazardous waste sites. These areas can then be investigated, and if a new hazardous waste site is identified, it can be cleaned up by the government.

B) Initial Project Proposal

My first project deliverable is a list of 50–100 census block groups that may contain an unidentified hazardous waste location (superfund site). My second project deliverable is a heat map of census blocks that have a high likelihood of containing an unidentified superfund site.

First, I will combine US census block group data with a dataset of superfund site addresses. Next, I will apply multiple prediction models with the goal of predicting which block groups contain superfund sites. The target variable for each prediction model will be whether a census block group contains a superfund site, and the explanatory variables will come from the census data (and maybe the much larger dataset shown below). The heat map will use the false positives from the various prediction models to identify new potential superfund site locations. An alternative idea is to find the best performing predictive model and create a heat map of its prediction probabilities.
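
To make this concrete, here is a minimal sketch of the modeling step, assuming the census and superfund data have already been merged into one table with a row per census block group. The file name and the GEOID/has_superfund columns are hypothetical placeholders, and a random forest is just one of the models I plan to try:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical merged table: one row per census block group, census
# features, plus a binary target marking known superfund block groups.
df = pd.read_csv("merged_block_groups.csv")
X = df.drop(columns=["GEOID", "has_superfund"])
y = df["has_superfund"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" because superfund block groups are rare.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

# Predicted probabilities can feed the heat map; false positives
# (predicted to contain a site, but no known site) are candidate
# locations for unidentified superfund sites.
proba = model.predict_proba(X_test)[:, 1]
preds = model.predict(X_test)
candidates = X_test[(preds == 1) & (y_test.values == 0)]
```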

3) Census and Superfund Site Data

My data search/acquisition was inspired by this Kaggle dataset.

A) Target Variable Data: To Be Merged

Superfund site addresses. These addresses are updated from the Kaggle dataset and need to be geocoded and linked to census block groups. This data will create the target variable for the prediction models. The data includes deleted NPL sites, current NPL sites, and proposed NPL sites (NPL = National Priorities List). Each entry represents a hazardous waste site at a specific address.
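
Here is a rough sketch of how the geocoding and linking step might look, using the free Nominatim geocoder via geopy and a GeoPandas spatial join. The file names, the "address" column, and the block group shapefile are all assumptions, and I may end up using a different geocoder:

```python
import geopandas as gpd
import pandas as pd
from geopy.geocoders import Nominatim
from shapely.geometry import Point

geolocator = Nominatim(user_agent="superfund-site-project")

# Hypothetical CSV of NPL addresses with an "address" column.
sites = pd.read_csv("superfund_addresses.csv")

def geocode(address):
    location = geolocator.geocode(address)
    return (location.latitude, location.longitude) if location else (None, None)

sites[["lat", "lon"]] = sites["address"].apply(lambda a: pd.Series(geocode(a)))
sites = sites.dropna(subset=["lat", "lon"])

sites_gdf = gpd.GeoDataFrame(
    sites,
    geometry=[Point(lon, lat) for lon, lat in zip(sites["lon"], sites["lat"])],
    crs="EPSG:4326",
)

# Block group polygons (e.g. from TIGER/Line shapefiles); path is assumed.
block_groups = gpd.read_file("block_groups.shp").to_crs("EPSG:4326")

# Attach each geocoded site to the block group polygon that contains it.
sites_with_bg = gpd.sjoin(sites_gdf, block_groups, how="left", predicate="within")
```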

B) Census Data: Scope Fit

The 2019 PDB (Planning Database) report is an update of the report used in the Kaggle dataset (the Kaggle dataset uses the 2015 report). My data is coming from the United States Census Bureau and can be found at this link.

Data summary: ~230k rows, one per census block group; ~300 variables, mostly demographic information.
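
Loading the PDB should be straightforward with pandas; here is a quick sketch, where the file name and the GIDBG ID column are assumptions based on the 2019 block-group-level release:

```python
import pandas as pd

# Keep the block group ID as a string so leading zeros survive.
pdb = pd.read_csv("pdb2019bg.csv", dtype={"GIDBG": str}, low_memory=False)

print(pdb.shape)         # expect roughly (230k, ~300)
print(pdb.columns[:15])  # mostly demographic variables per block group
```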

C) Census Data: Scope Reach

I found an amazing dataset through this Quora article; this resource is really unique. The freely available data by census block can be found here. Warning: downloading the .zip file took about 20 minutes.

Dataset 1 summary: ~230k rows, one per census block; ~7,500 variables

Dataset 2 summary: includes GeoJSON files for each census block. This is what I used to create the GeoPandas mapping (see the next blog post and the sketch after these summaries).

Dataset 3 summary: includes metadata for the ~7,500 variables attached to census blocks, plus other information
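
As a preview of that mapping, here is a bare-bones sketch of reading one of the GeoJSON files with GeoPandas and plotting it (the file name is a placeholder for one of the downloaded files):

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Placeholder name for one of the downloaded GeoJSON files.
blocks = gpd.read_file("census_blocks.geojson")

fig, ax = plt.subplots(figsize=(8, 8))
blocks.plot(ax=ax, color="whitesmoke", edgecolor="gray", linewidth=0.2)
ax.set_axis_off()
ax.set_title("Census block polygons (sample area)")
plt.show()
```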

I don’t know exactly how the predictive models will perform, but I am convinced I will be able to create an interesting final visualization. Even if model performance is poor, I will still be able to produce the deliverables described above. I am determined to find at least one predictive model that is up to the challenge!
