ARGO @ APA NPC ‘17 Data Jam.

Documenting our efforts at the NYC DCP + BetaNYC’s Data Jam.

Last Monday, Team ARGO participated in a Data Jam organized by NYC’s Department of City Planning (DCP) along with BetaNYC, the city’s premier civic tech community.

For the most part, the Jam was an intimate mashup of interesting and capable people from the city’s civic tech scene, including a crew of city officials, students, former students, map tech people, and freelance do-gooders.

The event was a small part of the much larger American Planning Association’s 2017 National Planning Conference, which occupied the entire lower level of the Javits Center. That may give you some sense of the conference’s scale.

The data jam was organized around a very particular dataset, New York’s FacDB, short for Facilities Database.

FacDB aggregates information about facilities that impact NYC neighborhood quality and are owned, operated, funded, licensed, or certified by a City, State, or Federal agency and span critical domains of government function.
Lots of facilities (some duplicates) via NYC DCP’s Facilities explorer

Prior to the event, DCP and BetaNYC curated the following challenges to focus our jammin. At each “challenge table” was a data expert who was very helpful in navigating through basic questions and helped us quickly dive into the data with context and purpose.

  1. How can the rate and accuracy of identifying duplicate records be improved?
  2. How can sites that are administrative locations rather than service locations be identified in FacDB?
  3. How can complete BBL and BIN info be gathered for large facilities, such as campuses that include multiple buildings and tax lots, by hacking together different datasets?
  4. How can the FacDB database architecture and maintenance process be improved in order to streamline updates and improve data quality?
  5. How can DCP and other City agencies use FacDB to inform the co-location of compatible facilities?
  6. What tools and analysis can be developed using the FacDB to better empower communities to make their needs known and become more informed?

Manushi Majumdar, an ARGO researcher, was assigned to Challenge group #1 (duplicate records), while I was assigned to Challenge group #5 (co-location).

Finding duplicate facilities using SeatGeek’s tools.

In a matter of a few hours, Manushi’s team coded up a fun way to identify and filter out duplicate facility listings within the same building. It turns out that SeatGeek, the popular events search engine, had a similar issue: it had to avoid showing duplicate event listings.

SeatGeek open-sourced its solution as the “fuzzywuzzy” package, which quickly identifies similar phrases, and even blogged about it.

Manushi’s team went a step further and visualized a sample of the facilities data by combining facility name + address + agency identifier into a single string and fuzzy wuzzy-ing it to weed out duplicates. The result was the ability to visually identify duplicate records.

String similarity visualized for facilities in Zip code 10019 — The ones colored bright yellow are highly likely to be duplicate records. Source: Will Geary + Manushi Majumdar
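The combine-and-compare step can be sketched in a few lines of Python. This is a minimal illustration, not the team's notebook: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy scorer, and the sample records, field names, and the 0.85 threshold are all made up for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; field names are illustrative, not FacDB's schema.
facilities = [
    {"name": "Midtown Child Care Center", "address": "123 W 57 St", "agency": "DOHMH"},
    {"name": "Midtown Childcare Center", "address": "123 West 57 St", "agency": "DOHMH"},
    {"name": "Columbus Circle Library", "address": "10 Columbus Cir", "agency": "NYPL"},
]

def combined_key(record):
    """Join name, address and agency into one lowercase string."""
    return " ".join([record["name"], record["address"], record["agency"]]).lower()

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def likely_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every pair of records scoring at or above threshold."""
    keys = [combined_key(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(keys)), 2):
        score = similarity(keys[i], keys[j])
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

# Assumed threshold of 0.85; in practice it would need tuning against known duplicates.
duplicates = likely_duplicates(facilities)
```

With these sample rows, only the first two records are flagged as a likely duplicate pair. A pairwise score matrix built the same way is the kind of input a similarity heatmap needs.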

This approach becomes more computationally intensive as the number of records grows, but there are other ways DCP can automate pruning this dataset for duplicates.
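To put the scaling concern in concrete terms: comparing every record with every other record takes n(n-1)/2 comparisons, so doubling the number of records roughly quadruples the work. One common mitigation is "blocking", where records are only compared if they share a cheap key such as a ZIP code. The sketch below is an assumption for illustration, not something DCP or the team implemented, and the `zip` field and sample values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_count(n):
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2

def blocked_pairs(records, block_key):
    """Yield index pairs only for records that share the same block key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[record[block_key]].append(index)
    for indices in blocks.values():
        yield from combinations(indices, 2)

# Hypothetical records: six facilities spread over three ZIP codes.
records = [
    {"name": "A", "zip": "10019"}, {"name": "B", "zip": "10019"},
    {"name": "C", "zip": "10019"}, {"name": "D", "zip": "11201"},
    {"name": "E", "zip": "11201"}, {"name": "F", "zip": "10451"},
]

naive = all_pairs_count(len(records))          # 15 comparisons
blocked = list(blocked_pairs(records, "zip"))  # 3 + 1 + 0 = 4 comparisons
```

The trade-off is that duplicates whose block keys differ (for example, a mistyped ZIP code) are never compared, so the key should be a field that is reliably filled in.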

OpenRefine, an open-source project spun out of Google, has long been a go-to tool for data journalists.

ProPublica, a leading independent, non-profit newsroom that has pioneered digital tools for world-class investigative journalism, has a great post that uses OpenRefine to clean messy data. Using OpenRefine’s point-and-click interface, we selected a series of steps that pruned this dataset and found about 6,000 duplicate records, all without writing a single line of code. OpenRefine has the added benefit of being repurposable: you can export an entire OpenRefine project, and the exported file contains not only the data but also a history of changes. This way, the person who receives your project file knows exactly what operations you performed, and can even undo them.
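For readers curious what a point-and-click clustering step does under the hood, one of OpenRefine's clustering methods is key collision with a "fingerprint" key. The sketch below is a simplified Python approximation of that idea, not OpenRefine's own code, and it is not necessarily the series of steps we used. The sample names are invented.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Simplified fingerprint key: ignores case, punctuation, accents and word order."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, remove repeated tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster_by_fingerprint(values):
    """Group values whose fingerprints collide; return only groups with 2+ members."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return [group for group in clusters.values() if len(group) > 1]

# Invented facility names for illustration.
names = [
    "St. Mary's Day Care, Inc.",
    "ST MARYS DAY CARE INC",
    "Day Care St. Mary's Inc",
    "Hudson Street Senior Center",
]
clusters = cluster_by_fingerprint(names)
```

Here the three "St. Mary's" variants collapse to the same key and form one cluster, while the fourth name stays on its own. Unlike pairwise fuzzy matching, this needs only one pass over the records.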

Lastly, Trifacta Wrangler is yet another data product that offers a free desktop GUI tool to create automated workflows to massage data.

OpenRefining FacDB
Taming Disorderly Data via Trifacta Data Wrangler

Colocating facilities with a service bias towards vulnerable populations.

The background provided to my group included finding ways to help the city “save money on leases by sharing and consolidating space and help make services more accessible by consolidating facilities into community hubs.”

Our group initially discussed using clustering algorithms and networking tools to identify interesting connections within the data before concluding that there was not enough data to cluster by lease or service type as these are not currently present in FacDB. To summarize our discussion around co-location:

  1. We defined co-location by identifying compatible features across 1) facility characteristics: space (distance between facilities) and agency (the City/State agency that operates the facility), and 2) service characteristics: the effort required by citizens to access particular services, possibly integrating FacDB with a 311 service vocabulary.
  2. Eliminate obvious non-compatible facilities to reduce the overall problem space.
  3. An obvious co-location story included having a daycare facility near schools that serve younger people. A non-obvious co-location story included co-locating schools and youth facilities near green spaces.
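The framework above can be sketched as a simple filter: drop pairs whose facility types are not on a compatibility list, then keep the remaining pairs that sit within a chosen distance of each other. Everything in this sketch is assumed for illustration: the facility types, the compatibility pairs (borrowed from the daycare/school and school/green-space stories), the coordinates, and the 500 m radius.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Hypothetical compatibility list; unordered pairs of facility types.
COMPATIBLE = {frozenset({"daycare", "school"}), frozenset({"school", "park"})}

# Hypothetical facilities with made-up coordinates.
facilities = [
    {"name": "Daycare A", "type": "daycare", "lat": 40.7650, "lon": -73.9850},
    {"name": "School B", "type": "school", "lat": 40.7660, "lon": -73.9840},
    {"name": "Park C", "type": "park", "lat": 40.7680, "lon": -73.9810},
    {"name": "Depot D", "type": "depot", "lat": 40.7651, "lon": -73.9851},
]

def colocation_candidates(items, max_distance_m=500):
    """Return (name, name, metres) for compatible pairs within max_distance_m."""
    candidates = []
    for a, b in combinations(items, 2):
        # Eliminate non-compatible pairs first to shrink the problem space.
        if frozenset({a["type"], b["type"]}) not in COMPATIBLE:
            continue
        distance = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
        if distance <= max_distance_m:
            candidates.append((a["name"], b["name"], round(distance)))
    return candidates

candidates = colocation_candidates(facilities)
```

In this toy data the depot is filtered out by type even though it is the closest facility to the daycare, which is the point of applying the compatibility check before the distance check.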

We then interrogated and wrangled the data to answer some simple questions about how facilities were distributed across the city. Here is what we found out.

  • “Public” “K-12 Schools” “(DEP)” account for 37% of all FacDB facilities.
  • “Pre-school” “Childcare” “(DOHMH)” account for 22% of all FacDB facilities.
  • “Non-public” “K-12 Schools” “(NYSED)” account for 14% of all FacDB facilities.
  • “All Age” “CAMPS” “(DOHMH)” account for 8.5% of all FacDB facilities.

These 4 groups, combined, account for ~82% of all facilities in FacDB.
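Shares like these come from counting records per category combination and dividing by the total. The sketch below shows the calculation on a handful of made-up rows with hypothetical labels, so its percentages are not FacDB's. The last lines simply re-add the four shares reported above.

```python
from collections import Counter

# Made-up rows of (group, subgroup); labels are placeholders, not FacDB values.
rows = (
    [("Group A", "Schools")] * 4
    + [("Group B", "Childcare")] * 3
    + [("Group C", "Schools")] * 2
    + [("Group D", "Camps")] * 1
)

def category_shares(records):
    """Percentage of records in each category, largest first."""
    counts = Counter(records)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.most_common()}

shares = category_shares(rows)

# The four shares reported in the post add up to 81.5%, i.e. roughly 82%.
reported_shares = [37, 22, 14, 8.5]
reported_total = sum(reported_shares)
```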

This simple fact, combined with our compatibility framework, led us to explore what co-location would mean if biased towards serving the mentally ill; mental illness is a chronic issue in New York City.

What if a majority of New York’s facilities could organize their services to strengthen the support fabric for the city’s mentally ill?

We abstracted this statement further to probe what a co-location decision framework could look like for “vulnerable populations”.

What if a majority of New York’s facilities could organize their services to strengthen the support fabric for the city’s vulnerable populations?

Vulnerable populations, according to the American Journal of Managed Care, include:

  • the economically disadvantaged,
  • racial and ethnic minorities,
  • the uninsured,
  • low-income children,
  • the elderly,
  • the homeless,
  • those with other chronic health conditions, including severe mental illness.

As with all data jams and hackathons of this kind, time is a precious resource. This is where we left our discussions. We hope this was a meaningful contribution to future improvements and applications of DCP’s FacDB.

A big thank you to all DCP staff and to the ever-present and active BetaNYC and Noel Hidalgo.

Feel free to comment and call us out on anything we missed or could have done better; our intent is to help FacDB be an awesome data resource.

Visualizing DCP’s FacDB through different categories. Click the image to be taken to the visualization.


  • Manushi’s + Will Geary’s IPython notebook illustrating their cleaning efforts. (TBD)