If you want to tackle hard questions, you need the right data

Finding the right data for a project is hard, but that challenge is also an opportunity

Shreye Saxena
SAP.iO
5 min read · Oct 17, 2018

Atlas by SAP works hard to make valuable datasets accessible to our customers, because finding the right data for a project is difficult. Tools like Google’s new Dataset Search make looking for data easier, but it can still be tricky to know which dataset to use.

Data scientists bridge the gap between questions and the data that answers them

Atlas curates a wealth of geospatial data that is easy both to find and to consume. NAF works with education, business, and community leaders to transform the high school experience. So when NAF wanted to take a data-driven approach to building community partnerships within California school districts, we decided to work together to identify the right data for their needs. Together we defined three questions to guide our data hunt:

  • Where do we find students who are at the greatest risk of dropping out?
  • What business sectors offer the most employment around these districts?
  • What colleges and universities would be accessible for those students?

With our strategic questions laid out, we were ready to take on the tricky one: where should the data search begin?

1. Find folks asking the same questions

First, we wanted to find high schools that had students with a high risk of dropping out. For any complex problem, particularly for social issues like improving educational outcomes, there are bound to be many other people working to measure progress. A great place to begin a data search for a project is by researching institutions tackling similar challenges.

Sure enough, the California Department of Education releases tons of data on educational outcomes and demographics at public schools across the state. After some old-fashioned data cleaning, we organized measures to track graduation outcomes and socioeconomic measures of at-risk youth in the five Bay Area school districts we’re studying.

Plotting data from the California Department of Education, we see that Vallejo City Unified School District has a high school dropout rate 7.5 times higher than that of neighboring Sonoma Valley Unified School District.
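As a rough illustration of the kind of comparison behind that chart, here is a minimal Python sketch that turns cohort counts into dropout rates and compares two districts. The district names and counts are placeholders invented for the example, not California Department of Education figures, and the field names `cohort_size` and `dropouts` are assumptions.

```python
# Placeholder cohort counts for illustration only; these are NOT real
# California Department of Education figures.
cohorts = {
    "District A": {"cohort_size": 1000, "dropouts": 120},
    "District B": {"cohort_size": 500, "dropouts": 15},
}


def dropout_rate(record):
    """Share of the cohort that dropped out, as a fraction between 0 and 1."""
    return record["dropouts"] / record["cohort_size"]


rates = {name: dropout_rate(rec) for name, rec in cohorts.items()}

# Compare the district with the highest rate to the one with the lowest.
highest = max(rates, key=rates.get)
lowest = min(rates, key=rates.get)
ratio = rates[highest] / rates[lowest]

for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1%}")
print(f"{highest} has {ratio:.1f} times the dropout rate of {lowest}")
```

Working with rates rather than raw counts keeps districts of very different sizes comparable.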

2. When the “right” data doesn’t exist, be resourceful with what does

Second, we wanted to find the business sectors with the greatest presence in and around these school districts. NAF promotes five industry themes to develop a future-ready workforce, and aligning those themes with local economic opportunities would have a powerful impact.

This isn’t the type of data school districts track, but the Census Bureau has lots of data on industry employment for census block groups. Using Atlas, we were able to easily roll block-group employment data up to school districts and identify the employment opportunities in the area.
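The roll-up itself is a plain group-and-sum once each block group is assigned to a district. The sketch below shows the idea in Python; it is not Atlas’s implementation. The block group IDs, district names, industry labels, and job counts are all hypothetical, and the code assumes every block group falls entirely inside one district. Where a block group straddles a boundary, some area- or population-based weighting would be needed.

```python
from collections import defaultdict

# Hypothetical crosswalk: block group ID -> the school district containing it.
block_group_to_district = {
    "BG-001": "District A",
    "BG-002": "District A",
    "BG-003": "District B",
}

# Hypothetical employment counts per block group and industry.
block_group_employment = [
    {"block_group": "BG-001", "industry": "Health care", "jobs": 120},
    {"block_group": "BG-002", "industry": "Health care", "jobs": 80},
    {"block_group": "BG-002", "industry": "Finance", "jobs": 40},
    {"block_group": "BG-003", "industry": "Finance", "jobs": 65},
    {"block_group": "BG-999", "industry": "Finance", "jobs": 10},  # not in crosswalk
]


def roll_up(rows, crosswalk):
    """Sum jobs by (district, industry), skipping unmapped block groups."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        district = crosswalk.get(row["block_group"])
        if district is None:
            continue  # block group lies outside the districts under study
        totals[district][row["industry"]] += row["jobs"]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {d: dict(by_industry) for d, by_industry in totals.items()}


district_jobs = roll_up(block_group_employment, block_group_to_district)
for district, by_industry in district_jobs.items():
    print(district, by_industry)
```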

Government datasets are great resources for a wide range of information and insights. Coupling Census data with Atlas’s catalog of proprietary datasets allowed us to deliver even more value to NAF. Leveraging our data on over 30 million points-of-interest, which includes industry classifications, we created a list of business partners that operate within each school district and that align with NAF’s five industry themes. That represents millions of potential bridges between motivated students and meaningful work experiences.
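The matching step can be sketched as a simple lookup. This is an illustration only: the source does not say which classification system the points-of-interest data uses, so the two-digit NAICS sector codes, the theme labels, the mapping between them, and the sample businesses below are all assumptions. Each point of interest is also assumed to already carry the district it falls in.

```python
from collections import defaultdict

# Assumed mapping from two-digit NAICS sector codes to theme labels.
# The sector definitions are real NAICS sectors; pairing them with these
# theme labels is a hypothetical choice made for this example.
THEME_BY_NAICS_SECTOR = {
    "52": "Finance",                 # Finance and Insurance
    "62": "Health Sciences",         # Health Care and Social Assistance
    "72": "Hospitality & Tourism",   # Accommodation and Food Services
    "51": "Information Technology",  # Information
    "54": "Engineering",             # Professional, Scientific, and Technical Services
}

# Hypothetical points of interest; only the first two digits of "naics" matter here.
points_of_interest = [
    {"name": "Example Credit Union", "naics": "522130", "district": "District A"},
    {"name": "Example Clinic", "naics": "621111", "district": "District A"},
    {"name": "Example Hardware Store", "naics": "444130", "district": "District A"},
    {"name": "Example Hotel", "naics": "721110", "district": "District B"},
]


def partners_by_district(pois, theme_by_sector):
    """Group businesses whose sector maps to a theme by school district."""
    partners = defaultdict(list)
    for poi in pois:
        theme = theme_by_sector.get(poi["naics"][:2])
        if theme is not None:  # drop businesses outside the chosen themes
            partners[poi["district"]].append((poi["name"], theme))
    return dict(partners)


partners = partners_by_district(points_of_interest, THEME_BY_NAICS_SECTOR)
for district, matches in partners.items():
    print(district, matches)
```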

Although employment data isn’t reported for school districts, we can easily aggregate labor measures from the Census block groups a school district is composed of.

3. Keep it simple

Finally, we wanted data on colleges that would be accessible to students in NAF’s Bay Area academies. But what does “accessible” mean in this context? Atlas could compute catchment areas based on drive time from a high school. Or, given the right data, we could use machine learning to link high schools to the colleges their students are most likely to attend. For any geospatial analytics problem, there are bound to be many creative and resourceful solutions.

But sometimes creativity and resourcefulness can take you off course. The original question was a simple one: which colleges could students reach after graduating from NAF’s academies? So the data should be simple too. Remembering the first tip, we turned to the U.S. Department of Education’s College Scorecard data for colleges and universities across the country. As NAF prepares students to be college and career ready, those students’ ability to realize opportunities at local, public, post-secondary institutions with two- and four-year programs is really about “proximity”. It is much simpler to filter the College Scorecard down to public programs within 20 miles of each school district’s boundaries than it is to implement a machine learning model. The resulting list of 31 colleges is exactly what NAF needed to explore potential partners in the region. Sometimes, simple is all it takes.

By extending the boundaries of the school districts by 20 miles, we can identify 31 public colleges and universities as potential local destinations for students who graduate from NAF’s academies.
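A proximity filter like this needs very little code. The sketch below is one simplified way to do it, not the method Atlas used: it approximates a district by a handful of sample points along its boundary and keeps public colleges within 20 miles (great-circle distance) of any of them. The coordinates, college names, and the `control`, `lat`, and `lon` field names are hypothetical. A geometry library that buffers the actual district polygon would give a more exact answer.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_reach(college, boundary_points, max_miles=20.0):
    """True if the college is within max_miles of any sampled boundary point."""
    return any(
        haversine_miles(college["lat"], college["lon"], lat, lon) <= max_miles
        for lat, lon in boundary_points
    )


# Hypothetical sample points along one district's boundary (lat, lon).
district_boundary = [(38.10, -122.25), (38.15, -122.20), (38.08, -122.18)]

# Hypothetical college records.
colleges = [
    {"name": "College X", "control": "public", "lat": 38.20, "lon": -122.15},
    {"name": "College Y", "control": "private", "lat": 38.12, "lon": -122.22},
    {"name": "College Z", "control": "public", "lat": 39.50, "lon": -121.50},
]

# Keep only public institutions that are close enough to the district.
nearby_public = [
    c["name"]
    for c in colleges
    if c["control"] == "public" and within_reach(c, district_boundary)
]
print(nearby_public)
```

Filtering on institution type first and distance second keeps the rule easy to explain to partners, which is the point of keeping it simple.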

There is no such thing as the perfect dataset, but focusing on the underlying questions can help bridge the gap

Finding the right data for a project is just the first step of an analysis, yet it is one of the most important for ensuring that results are meaningful. With so much data readily available, finding the right dataset to work with can feel a lot like looking for the right needle in a stack of needles. But laying out questions that address the underlying problem can make the data search less daunting and, in the case of NAF, empower conversations with partners on effectively aligning programs with community needs.

Remember these tried-and-true tips when looking for data to use in a project: Start by researching institutions that are tackling a problem similar to yours. If you can’t find the exact dataset you need, get creative with the data that you do have available. And before diving into an analysis, ask if there is a way to make the data simpler.

— — — —

Atlas by SAP is an SAP-created venture, funded by the SAP.iO Venture Studio. Shreye Saxena leads data science efforts for Atlas, enabling customers to derive meaningful insights from location-based data.
