An inspiration from Refactoring Facilities Database

Let’s build a more sustainable framework

Baoling Zhou
NYC Planning Tech
6 min read · Mar 13, 2020


Last April, I joined the Data Engineering team at the NYC Department of City Planning. Almost a year later, we celebrated our 2nd birthday and welcomed a fourth team member on board. For me, the Data Engineering team is a tech hub full of excitement, where we embrace open-source technologies and work to create transparent and reproducible workflows. A big part of our work focuses on designing and building data products to meet the business needs of different internal and external data users.

The first data product I helped to build is the Facilities Database, which includes all the facilities owned, operated, or licensed by a city, state, or federal agency, with geometries and categories created by us. The City Planning Facilities Database has been around for over 20 years, and it is used for neighborhood planning, fair share analysis, and facility siting.

The Map of Facilities Database on NYC Capital Planning Platform

For a very long time, this data product was a big headache for City Planning to maintain, for several reasons.

  • First, all the source data used to be ingested and cleaned manually, which was extremely time-consuming.
  • Second, the schema of each source dataset could change over time, which might accidentally break the downstream ETL pipelines.
  • Finally, the old Facilities Database was built on a Node.js framework whose dependencies were complicated to install, so the same workflow might not be reproducible over time.

When the Data Engineering team set out to refactor the Facilities Database, we spent several months brainstorming and iterating on the workflow. In the end, we successfully addressed all of the above inefficiencies and published the December 2019 version of the Facilities Database on Bytes of the Big Apple. In this latest update, we applied a brand-new data infrastructure that is 100% cloud-based and improved several processes, namely data loading and geocoding. This upgrade has undoubtedly been a success: we can now build the Facilities Database within five minutes, on any machine, at any time.

Let’s break down what we’ve done to make this happen.

Accountable Data Ingestion

Knowing how troublesome it is to intake 50+ data sources manually each time, we set a goal to prioritize using open data when we started refactoring the Facilities Database. If input datasets were not published or up to date, we would actively reach out to the data owners/city agencies and ask them to update their datasets on NYC Open Data or another open data platform on a reliable frequency, with a consistent URL provided. In the end, we were able to extract 40 datasets through open data URLs without manually maintaining them. It required a lot of communication and some long waits at the beginning, but in the long run, we believe it will foster engagement within the open data community as well as save us a lot of overhead in accessing source data.
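
As a rough illustration, pulling one of those sources from a stable export URL can be as simple as the sketch below. The URL and dataset ID are hypothetical placeholders, but NYC Open Data exposes CSV export endpoints of this general shape.

```python
import csv
import io

import requests

# Hypothetical export URL -- swap in the dataset ID agreed upon with the source agency.
SOURCE_URL = "https://data.cityofnewyork.us/api/views/<dataset-id>/rows.csv?accessType=DOWNLOAD"

response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()

# Parse the CSV in memory; in practice this feeds straight into the loading step described below.
rows = list(csv.DictReader(io.StringIO(response.text)))
print(f"Pulled {len(rows)} records from the source URL")
```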

Reliable Data Infrastructure

After intaking all the datasets, we are ready to do some data munging. In order to do so, we just need to load all the datasets into a build environment, develop ETL pipelines for each of them, and get an output table. It doesn’t sound very complicated, right? But the reality is much tougher than that. To refactor the Facilities Database, we needed to iterate on the building process hundreds of times to get an output table we were all satisfied with. During that time, a source’s schema or URL might change, which would break the downstream ETL completely and take a lot of time to fix. Consequently, the iteration process would get held up.

To address this problem, we created a PostgreSQL server on DigitalOcean, a user-friendly cloud service provider, to store all relevant source data before doing any data munging. We call this PostgreSQL database Recipe, and in it we store the current and historical versions of the different source datasets. In Recipe, each dataset is given a designated schema, where the name of each table reflects the date the dataset was loaded into the database. On top of that, each schema has a view called “latest”, which points to the most recent version that was loaded in. With this data buffer, Recipe, we can queue up all the source data in a centralized place before running the ETL pipelines, which allows us to keep iterating on the building process even if one of the source datasets changes.
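
Here is a minimal sketch of that versioning pattern; the connection string, schema, and table names are made up for illustration. Each source dataset gets its own schema, each pull gets a dated table, and the “latest” view is repointed at the newest one.

```python
import psycopg2

# Hypothetical connection string and names -- one schema per source dataset,
# one dated table per version pulled into Recipe.
RECIPE_URL = "postgresql://user:password@recipe-host:5432/recipes"

with psycopg2.connect(RECIPE_URL) as conn:
    with conn.cursor() as cur:
        cur.execute('CREATE SCHEMA IF NOT EXISTS dcp_colp;')
        # (the dated table dcp_colp."20191201" itself is loaded via GDAL -- see below)

        # Repoint the "latest" view at the newest version, so downstream ETL
        # steps can always read from dcp_colp.latest without knowing the date.
        cur.execute('DROP VIEW IF EXISTS dcp_colp.latest;')
        cur.execute('CREATE VIEW dcp_colp.latest AS SELECT * FROM dcp_colp."20191201";')
```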

Data Infrastructure used by NYC Planning’s Data Engineering Team

To make this happen, our data engineering team uses an open-source library called GDAL, which allows us to load datasets in many different formats, such as CSV, shapefile, and GeoJSON, into a PostgreSQL database at high speed. In addition, column names are normalized during this process, so standard naming conventions can be expected in downstream processes.
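
A minimal sketch of that loading step, assuming GDAL’s ogr2ogr is installed and using made-up connection details, file names, and table names:

```python
import subprocess

# Hypothetical connection details for the Recipe database.
PG_CONN = "PG:host=recipe-host port=5432 dbname=recipes user=user password=password"

subprocess.run(
    [
        "ogr2ogr",
        "-f", "PostgreSQL", PG_CONN,
        "dcp_colp_source.csv",             # also works for shapefiles, GeoJSON, etc.
        "-nln", "dcp_colp.20191201",       # write to a dated table inside the dataset's schema
        "-lco", "LAUNDER=YES",             # normalize column names (lowercase, no special characters)
        "--config", "PG_USE_COPY", "YES",  # bulk-load with COPY for speed
    ],
    check=True,
)
```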

Later on, all the source data are loaded from Recipe into another PostgreSQL database, Build Engine, where we process the data and create the final output. In the end, the output datasets are exported to a third PostgreSQL database, Publishing, where other teams can get the data and use it in their applications or publish it on our open data platform, Bytes of the Big Apple.
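
To give a flavor of that hand-off, here is a simplified sketch that streams one table out of Recipe and into the Build Engine using PostgreSQL’s COPY. The connection strings and table names are placeholders, the target table is assumed to exist already, and the in-memory buffer is only for illustration.

```python
import io

import psycopg2

# Hypothetical connection strings for the Recipe and Build Engine databases.
RECIPE_URL = "postgresql://user:password@recipe-host:5432/recipes"
BUILD_URL = "postgresql://user:password@build-host:5432/build"

buffer = io.StringIO()

# Pull the latest version of one source dataset out of Recipe ...
with psycopg2.connect(RECIPE_URL) as src, src.cursor() as cur:
    cur.copy_expert('COPY (SELECT * FROM dcp_colp.latest) TO STDOUT WITH CSV HEADER', buffer)

buffer.seek(0)

# ... and load it into the Build Engine, where the ETL pipelines run.
with psycopg2.connect(BUILD_URL) as dst, dst.cursor() as cur:
    cur.copy_expert('COPY staging.dcp_colp FROM STDIN WITH CSV HEADER', buffer)
```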

Reproducible Workflow

Now we’ve figured out a more efficient way to store datasets and transfer them from one database to another, but we still needed a solution to make this workflow reproducible. In the previous update, one of the biggest roadblocks we encountered was that we couldn’t install all the required dependencies we had used in the past, due to unavailable old versions or differences in computer environments. This could break the whole data pipeline we’d built. For this reason, we brought in Docker containers, which work like lightweight virtual machines where you can run your ETL scripts whenever and wherever you want.

Learning how to use it is much easier than you may think. Let’s say you want to load a dataset from its source URL into Recipe, the database that contains all of our archived source data. In this case, you can either install all the required dependencies, such as gdal and postgresql-client-10, on your computer, or you can simply pull a Docker image, cook, that lives in the cloud with all the dependencies pre-installed. With this Docker image on your computer, you can spin up a virtual machine (technically, a Docker container) by executing a single command, in which you can replicate the same data loading process at any time. You can read the following article if you would like to learn more about Docker images and Docker containers.
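
A sketch of that single command, wrapped in Python here for consistency with the other examples. The image tag, mounted script name, and environment variable are assumptions; the point is that the container arrives with gdal, postgresql-client, and friends already installed.

```python
import os
import subprocess

# A minimal sketch: spin up a container from the cook image and run a loading
# script inside it. The tag, script name, and env var are hypothetical.
subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{os.getcwd()}:/workdir",  # mount local scripts into the container
        "-w", "/workdir",
        "-e", "RECIPE_ENGINE",            # pass the Recipe connection string from the host environment
        "nycplanning/cook:latest",        # image with GDAL, psql, etc. pre-installed
        "./load_dataset.sh",              # hypothetical data loading script
    ],
    check=True,
)
```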

Our data engineering team uses a GitHub repository to maintain different types of Docker images, where we specify all the dependencies that need to be installed in a Dockerfile. This file is used to build a Docker image on Docker Hub, where anyone can access it.

A Dockerfile using GDAL as a base image

Geocoding

We also took advantage of Docker containers to automate our geocoding process. Geosupport allows you to get the geographic information for any address in New York City, as long as it has a valid house number, street name, and borough code (or zip code). This software is very reliable and fast, and we use it to geocode all of our data products, such as the Facilities Database. More precisely, we utilize python-geosupport, the Python bindings built on Geosupport Desktop, which let you call the software from a Python script. To make it more accessible, we built a Docker image, python-geosupport, based on it. After applying this Docker image and leveraging multiprocessing in our workflow, we can now geocode one million records in 10 minutes. You should check out the following article if you are interested in using python-geosupport for your own geocoding projects.
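
A minimal sketch of that pattern, assuming Geosupport Desktop (or the python-geosupport Docker image) is available on the machine; the input dictionary keys and output field names below are assumptions, so check them against your Geosupport version.

```python
from multiprocessing import Pool

from geosupport import Geosupport, GeosupportError

geo = Geosupport()  # requires Geosupport Desktop to be installed and configured

def geocode(record):
    """Geocode one facility record with Geosupport via python-geosupport."""
    try:
        result = geo.address(
            house_number=record["house_number"],
            street_name=record["street_name"],
            borough=record["borough"],
        )
        # Output field names are assumptions -- inspect the returned dict for your version.
        record["latitude"] = result.get("Latitude")
        record["longitude"] = result.get("Longitude")
    except GeosupportError:
        record["latitude"] = record["longitude"] = None
    return record

if __name__ == "__main__":
    records = [
        {"house_number": "120", "street_name": "Broadway", "borough": "Manhattan"},
        # ... one dict per input record
    ]
    # Fan the records out across worker processes to speed up large batches.
    with Pool(processes=4) as pool:
        geocoded = pool.map(geocode, records, chunksize=1000)
    print(geocoded[0])
```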

So far, we’ve applied the above data infrastructure and workflow to the majority of our data products, such as the Facilities Database and PLUTO. We plan to keep improving this framework, and we are about to incorporate GitHub Actions, a continuous integration tool, to streamline the workflows of all our data products. Stay tuned!
