How Can We Make Open Data Even More Awesome?

Data Clinic uses NYC Open Data Week to gather community input on potential new features for scout that would make open data easier to use

Data Clinic
Apr 9 · 3 min read
A screengrab of scout, showing results from a search for dog-related open data
Using scout to look for dog-related open datasets

In 2020, right before the pandemic hit, we launched scout. This new, open source tool was designed to make it easier to discover the open datasets that you need for your work. It does so by recommending datasets that would be easy to join together, and those that have a similar theme. We have been working hard this year to broaden the reach of scout, allowing it to see into data portals beyond NYC.
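To make "easy to join together" a little more concrete, here is a minimal sketch of one way joinability between two datasets could be scored: by measuring how much the values in a pair of columns overlap (Jaccard similarity). This is an illustrative assumption, not a description of how scout actually ranks its recommendations. The dataset names, column names, and values are all made up for the example.

```python
from itertools import product


def jaccard(left_values, right_values):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    left, right = set(left_values), set(right_values)
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


def best_join_columns(left_dataset, right_dataset):
    """Return (score, left_column, right_column) for the best-overlapping pair.

    Each dataset is a dict mapping a column name to a list of values.
    """
    scored_pairs = [
        (jaccard(left_values, right_values), left_column, right_column)
        for (left_column, left_values), (right_column, right_values)
        in product(left_dataset.items(), right_dataset.items())
    ]
    return max(scored_pairs)


# Hypothetical, hand-made samples standing in for two open datasets.
dog_licenses = {
    "zip_code": ["10001", "10002", "11201"],
    "breed": ["Beagle", "Poodle", "Beagle"],
}
dog_runs = {
    "zipcode": ["10001", "11201", "11215"],
    "park_name": ["Park A", "Park B", "Park C"],
}

score, left_column, right_column = best_join_columns(dog_licenses, dog_runs)
print(f"Best join: {left_column} <-> {right_column} (overlap {score:.2f})")
```

In this toy case the two ZIP code columns share two of four distinct values, so they come out as the most promising join key even though their column names differ.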

Thank you to the 2021 NYC Open Data Week coordinators, NYC Mayor’s Office of Data Analytics, BetaNYC, and Data through Design

We have also been exploring other ways that scout can make open data more accessible. During last month’s Open Data Week, we held a workshop to explore these potential new features with members of the open data community.

The first feature we wanted to explore was enabling scout to host community-curated crosswalks to standardize names, IDs and categories across all of NYC’s open datasets. Joining datasets from the open data portal can often be tricky because IDs don’t match among or even within datasets. This can be due to different agencies using different standards, to typos, or to names being truncated. For example, in datasets that contain a school_name column, there are 18,157 unique school names, but there are only 1,866 schools in NYC. This feature would allow users to submit mappings for each of these unique IDs to a canonical ID and to use these mappings to make clean versions of these datasets that are easier to work with.

A mockup of what a community-owned crosswalk feature might look like
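As a rough illustration of the crosswalk idea, the sketch below maps several spellings of a school name to one canonical ID and then uses that mapping to add a clean ID column to a dataset. Everything here is hypothetical: the school names, the `SCHOOL-0001` style IDs, the `canonical_id` column, and the normalization rule are assumptions made for the example, not scout's implementation or real NYC data.

```python
import re


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def build_crosswalk(submitted_mappings):
    """Turn (raw name, canonical ID) submissions into a lookup table."""
    return {normalize(raw_name): canonical_id
            for raw_name, canonical_id in submitted_mappings}


def apply_crosswalk(rows, crosswalk, column="school_name"):
    """Return copies of the rows with a canonical_id field added.

    Names that nobody has mapped yet get None, so they are easy to find
    and submit to the crosswalk later.
    """
    return [
        {**row, "canonical_id": crosswalk.get(normalize(row[column]))}
        for row in rows
    ]


# Hypothetical community submissions covering typos and truncated names.
submitted_mappings = [
    ("P.S. 001 Example", "SCHOOL-0001"),
    ("PS 1 Example School", "SCHOOL-0001"),
    ("PS 1 Exampel School", "SCHOOL-0001"),  # typo
    ("Sample High School", "SCHOOL-0002"),
    ("Sample High Sch", "SCHOOL-0002"),      # truncated
]

# Hypothetical rows from a dataset with a school_name column.
rows = [
    {"school_name": "P.S. 001 EXAMPLE", "enrollment": 410},
    {"school_name": "ps 1 exampel school", "enrollment": 410},
    {"school_name": "Sample High Sch", "enrollment": 1200},
    {"school_name": "Unknown Academy", "enrollment": 75},
]

crosswalk = build_crosswalk(submitted_mappings)
cleaned_rows = apply_crosswalk(rows, crosswalk)
for row in cleaned_rows:
    print(row["school_name"], "->", row["canonical_id"])
```

Keeping the crosswalk as a plain lookup table means two datasets that spell a school differently can still be joined on `canonical_id`, and unmapped names surface as gaps rather than silent mismatches.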

The second potential feature focuses on exposing prior work that the community has done on a particular dataset. When you encounter a dataset for the first time, it can be hard to know where to start cleaning or exploring. Thankfully, you’re probably not the first person ever to try to use a given dataset. The internet is full of GitHub repos, gists, cleaning scripts, blog posts, and tools, all using the dataset you’re interested in. Too often, though, these resources are hard to find, and not knowing about them means redoing work that someone else has already done.

To combat this, we are looking at adding a feature to scout that would allow resources relevant to a dataset to be shown alongside it. These resources could either be automatically identified or submitted by members of the community.

A mockup of what a community resources feature might look like

Despite the challenges of running a workshop remotely during a pandemic, we are really happy with the feedback we got from the community. Attendees confirmed that they want and need these features, and they shared useful suggestions about the best ways to implement them, along with some thoughts about potential issues.

We are excited to continue to expand scout’s usefulness as a community resource and are going to be working hard to implement new features over the next six months. Watch this space for updates, and reach out to us if you have any feature requests!

Finally, a huge thank you to everyone who attended the session, the Open Data Week team for making this all possible, and MODA and the open data coordinators for providing such an awesome data portal for us all to build off of!

Data Clinic

As the data- and tech-for-good arm of the financial services company Two Sigma, Data Clinic provides pro bono data science and engineering support to nonprofits and engages in open source tooling and research that contribute to the broader Data and Tech for Good movement.

