Mapping Historical Capital Spending in NYC: Introducing CSDB

Alexandra Thursland
Published in NYC Planning Tech
Sep 18, 2023

The Data Engineering Fellows at the NYC Department of City Planning (DCP) are excited to present the culmination of our ten-week summer fellowship: the brand-new Capital Spending Database.

Part 1: Background

What is the Capital Spending Database?

The Capital Spending Database (CSDB) is a data product that represents the first step toward producing a spatialized, historical view of the City’s capital funding. Each row in the dataset contains key information about one of over 16,000 unique capital projects, including:

  • a list of agencies involved in the project
  • the project category
  • the sum of all checks disbursed for the project

And, most vitally:

  • geospatial information where possible

This product was created in direct response to requests from DCP staff and agency partners: previously, there existed no such dataset linking historical capital spending to location information.

Who am I?

I’m a Data Engineering Fellow through Coding it Forward, one of two fellows matched with the Data Engineering Team (DE) here at DCP. I’m a rising senior at Duke University studying computer science and statistics, and I’m keenly interested in applying my data-focused skillset to effect positive social change in the world, especially in the realm of social justice. Born and raised on Long Island, I grew up in NYC’s backyard, and I’m thrilled to be working in a department whose interdisciplinary work touches and improves the lives of every New Yorker.

What are the potential use cases for CSDB?

This data product will provide indispensable insights into historical liquidations of capital funds. The Capital Planning team is primarily interested in using this data to support per capita spending analysis and asset management, in service of more equitable investment strategies in New York City.

But what do equitable investment strategies look like? Take the Community Parks Initiative as an example. By analyzing per capita spending, the Department of Parks and Recreation identified small, local parks for redesign and reconstruction based on a threshold of less than $250k per acre. Many of these parks had not seen investment in a generation.

How did we create this dataset?

Our work on this project was split into two distinct parts: an exploratory phase followed by funneling that work into a data pipeline.

Part 2: Exploratory phase

The development of CSDB was a direct response to a new data product request from the Capital Planning team at DCP. At the core of their request was the need for a historical, spatialized view of capital spending. Breaking that request down, we determined that our final data product must primarily involve capital spending information that is both:

  • historical, in that this data product should have a significant backwards scope, ideally up to 20 years
  • spatialized, in that as much capital spending as possible must be mapped to geospatial information

With these aspects in mind, we began exploring potential input data sources. At minimum, we needed two components: a source of capital spending data and a methodology for assigning that data to geospatial information.

Finding input data and a methodology for adding geospatial information

Fig 1. A screenshot of the Checkbook NYC dashboard.

From the beginning, we looked to Checkbook NYC as our source for capital spending data. Checkbook NYC is an open-source dataset and tool from the NYC Comptroller’s Office that publishes every check disbursed by the city. For our exploratory work, we filtered the data to include only checks pertaining to capital spending, as defined by Checkbook NYC. This input data had nearly 2 million rows, each representing a single capital spending check issued by NYC. Each record includes the check amount, date issued, managing agency, and a few minimally descriptive text fields.
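For a sense of what this filtering step can look like in code, here is a minimal sketch; the file name and column names are assumptions for illustration rather than the actual Checkbook NYC schema.

```python
# Illustrative sketch of loading the Checkbook NYC extract and keeping only
# capital spending checks. The file name and column names are assumptions,
# not the actual Checkbook NYC schema.
import pandas as pd

checks = pd.read_csv("checkbook_nyc_extract.csv")

# Keep only checks classified as capital spending by Checkbook NYC.
capital_checks = checks[checks["spending_category"] == "Capital Spending"]

print(f"{len(capital_checks):,} capital spending checks")
```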


After determining the source of our capital spending data, we still needed to choose a method of assigning geometries to that data. Initially, we explored using fuzzy string matching to extract addresses or facility names from the Budget Code and Contract Purpose columns in Checkbook NYC — those two minimally descriptive text fields — in the hopes that we could then map individual checks to geospatial information using a geocoding tool like Geosupport. But this proved to be both difficult to implement and computationally expensive. And while addresses and facility names are useful, what we ultimately wanted to attach to the capital spending data were actual geometries, like points or polygons.
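For illustration, here is a minimal sketch of the kind of fuzzy matching we ruled out. It assumes the rapidfuzz package; the facility names, column name, and score threshold are all made up for the example.

```python
# Illustrative only: the kind of fuzzy matching we decided against.
# Assumes the rapidfuzz package; facility names and threshold are made up.
import pandas as pd
from rapidfuzz import process, fuzz

facility_names = ["ST MARY'S PARK", "QUEENSBORO BRIDGE", "BROOKLYN CENTRAL LIBRARY"]

checks = pd.DataFrame({
    "contract_purpose": [
        "RECONSTRUCTION OF ST MARYS PARK PLAYGROUND",
        "CITYWIDE IT EQUIPMENT PURCHASE",
    ]
})

def best_facility_match(text: str, score_cutoff: int = 85):
    """Return the closest facility name, or None if nothing scores high enough."""
    match = process.extractOne(
        text,
        facility_names,
        scorer=fuzz.token_set_ratio,
        score_cutoff=score_cutoff,
    )
    return match[0] if match else None

checks["matched_facility"] = checks["contract_purpose"].map(best_facility_match)
print(checks)
```

Even in this toy form, every check has to be compared against every candidate facility name, which hints at why the approach became computationally expensive at the scale of nearly 2 million rows.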

Without explicit addresses or facility names, it wasn’t feasible to add geometries to the Checkbook NYC dataset directly. So, we needed an external source of geometries — another dataset with geospatial information that we could join onto some column in Checkbook NYC. But which column would make the most sense to join on?

Each check in Checkbook NYC contains a number of ID fields (see the glossary for a breakdown of all the columns), from contract ID to document ID to capital project ID. Although Checkbook NYC is the only comprehensive data source for capital spending information at the level of individual checks, there is another robust source of geospatial information at the capital project level — the Capital Projects Database (CPDB), a data product owned and maintained by DE which contains current and planned capital projects in the Capital Commitment Plan.

Fig 2. A screenshot of the Capital Planning Explorer, an online tool for interacting with the subset of CPDB data with geospatial information.

So, if we collapsed Checkbook NYC to the capital project level — aggregating and preserving the most significant columns — we could join the geospatial information from CPDB onto the capital projects in Checkbook NYC.

But each release of CPDB reflects only those capital projects included in the current fiscal year’s Capital Commitment Plan, and each new release replaces the prior one. DE maintains records of prior releases internally. So in the interest of obtaining a source of geospatial information that was also historical, we merged these prior releases with the current version of CPDB.

Limitations of input data

Limited backwards scope

CPDB only goes back to 2017 and Checkbook NYC to 2010. The backwards scope of these input data sources is limited and does not reach the 20-year mark requested by Capital Planning. However, it is unlikely that any equally robust data sources exist with a longer backwards scope. Regardless, these two datasets were readily available to us and served as a strong launching point for a beta version of the Capital Spending Database.

Missing capital spending data from NYCHA

Currently, our source of capital spending data is limited to checks issued by citywide agencies. Incorporating data from the NYC Housing Authority (NYCHA) is a potential future enhancement.

Exploratory data analysis

Once we decided on input datasets, our next step involved conducting exploratory data analysis to understand the data better and visualize the initial results of joining the two together. We used Jupyter Notebooks and standard Python data libraries like pandas, geopandas, matplotlib, and seaborn to explore and visualize our input data as well as our intermediate results.

Initially, we were interested in learning which agencies spent the most on capital projects and seeing if there were any trends in this spending over time.

Fig 3. A bar chart displaying the top 10 citywide agencies by the total sum of all checks disbursed according to Checkbook NYC.
Fig 4. The figure on the left shows a time series analysis of percent total check amounts disbursed by citywide agencies over time. The figure on the right shows the same information as a heat map.
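As an illustration, a chart like Fig 3 boils down to a groupby and a bar plot; the toy data and column names below are stand-ins for the real Checkbook NYC extract, which has nearly 2 million rows.

```python
# Illustrative sketch of the aggregation behind a chart like Fig 3.
# The data and column names are made up; the real extract has ~2M rows.
import pandas as pd
import matplotlib.pyplot as plt

capital_checks = pd.DataFrame({
    "agency": ["DEP", "DOT", "DPR", "DEP", "DOT"],
    "check_amount": [1_200_000, 800_000, 450_000, 300_000, 950_000],
})

top_agencies = (
    capital_checks.groupby("agency")["check_amount"]
    .sum()
    .nlargest(10)      # top agencies by total amount disbursed
    .sort_values()     # ascending so the largest bar sits at the top of the chart
)

top_agencies.plot(kind="barh")
plt.xlabel("Total check amount disbursed ($)")
plt.title("Top agencies by total capital spending (illustrative)")
plt.tight_layout()
plt.show()
```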

While this information was really interesting, it didn’t quite address the specific request from Capital Planning. Having gotten to know the input data a little better, we decided to pivot and focus on the core of the project: assessing the feasibility of joining geospatial information from CPDB onto Checkbook NYC capital projects.

Joining geospatial information from CPDB to Checkbook NYC Capital Projects

After collapsing the Checkbook NYC records onto capital projects, we were left with 16,687 rows, each row representing a unique capital project. During our grouping process, we aggregated the sum of check amounts and concatenated pertinent descriptive information like the issuing agency, budget code and contract purposes into semicolon-delimited strings. After that, we could directly join records in CPDB onto matching capital projects in Checkbook NYC using the unique capital project ID.
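The sketch below illustrates the general shape of this collapse-and-join step. All data, IDs, and column names are made up; they only mirror the structure of the real inputs.

```python
# Illustrative sketch of collapsing checks to the capital project level and
# joining CPDB geometries onto them. Data, IDs, and column names are made up.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

capital_checks = pd.DataFrame({
    "capital_project_id": ["850PV265", "850PV265", "841HWQ123"],
    "agency": ["DPR", "DPR", "DOT"],
    "budget_code": ["PV265", "PV265", "HWQ123"],
    "contract_purpose": ["PARK RECONSTRUCTION", "PLAYGROUND EQUIPMENT", "STREET RECONSTRUCTION"],
    "check_amount": [250_000.0, 75_000.0, 1_000_000.0],
})

def join_semicolon(values: pd.Series) -> str:
    """Concatenate unique, non-null values into a semicolon-delimited string."""
    return "; ".join(sorted(values.dropna().astype(str).unique()))

# Collapse individual checks into one row per capital project.
capital_projects = capital_checks.groupby("capital_project_id", as_index=False).agg(
    check_amount_total=("check_amount", "sum"),
    agencies=("agency", join_semicolon),
    budget_codes=("budget_code", join_semicolon),
    contract_purposes=("contract_purpose", join_semicolon),
)

# Toy stand-in for CPDB records that carry geometries.
cpdb = gpd.GeoDataFrame(
    {"capital_project_id": ["850PV265"]},
    geometry=[Point(-73.92, 40.83)],
    crs="EPSG:4326",
)

# Left join so projects without a CPDB match are kept (their geometry is null).
csdb = capital_projects.merge(
    cpdb[["capital_project_id", "geometry"]],
    on="capital_project_id",
    how="left",
)
```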


Our preliminary results showed that directly joining CPDB geospatial information onto Checkbook NYC capital projects resulted in about 21% of capital spending represented in Checkbook NYC being mapped to geospatial information, along with approximately 23% of total capital projects.

Could we call this a success? At first glance, 23% doesn’t sound very promising. But what proportion of the capital projects left without a geometry is even mappable in the first place? To truly measure our success, we had to approximate the answer to that question — which led us to our category assignment process.

Category Assignment

CPDB uses a list of keywords to assign every capital project to one of three categories:

  • Fixed Asset: bridges, sewers, buildings, etc.
  • ITT, Vehicles, and Equipment: fire trucks, fiber cables, and the like
  • Lump Sum: non-specific funding for a large program or initiative

If a capital project does not contain any keywords in these categories, then no category is assigned.

The intuition here is that some capital projects are simply not “mappable” because they aren’t place-specific. For example, assigning a geometry to a capital project that installed new IT equipment in citywide agency offices across all five boroughs is a lower priority than mapping the redevelopment of a playground.

Fixed Assets are inherently more mappable than other categories of capital projects, so one of our goals was to apply the same categorization process used in CPDB to Checkbook NYC. Then, we could use the proportion of Fixed Assets in Checkbook NYC mapped to geospatial information to evaluate our methodology.

Leveraging the same SQL query used in the CPDB category assignment process, we used two descriptive text fields in Checkbook NYC — Budget Code and Contract Purpose — to assign capital projects to categories.
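As a rough Python analogue of that keyword logic (the keyword lists below are a small, made-up subset, not the actual CPDB lists):

```python
# Rough Python analogue of the keyword-based category assignment.
# The keyword lists below are illustrative, not the actual CPDB lists.
from typing import Optional

CATEGORY_KEYWORDS = {
    "Fixed Asset": ["BRIDGE", "SEWER", "PARK", "BUILDING", "RECONSTRUCTION"],
    "ITT, Vehicles, and Equipment": ["EQUIPMENT", "VEHICLE", "FIBER", "COMPUTER"],
    "Lump Sum": ["LUMP SUM", "CITYWIDE", "PROGRAM"],
}

def assign_category(budget_code: str, contract_purpose: str) -> Optional[str]:
    """Return the first category whose keywords appear in either text field."""
    text = f"{budget_code} {contract_purpose}".upper()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None  # no keywords matched, so no category is assigned

assign_category("PV265", "RECONSTRUCTION OF ST MARYS PARK")  # -> "Fixed Asset"
```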

Fig 7. A bar chart communicating the results of assigning categories to capital projects in Checkbook NYC. Fixed Assets make up the majority of capital projects.
Fig 8. The same bar chart filtered for only capital projects successfully joined to a geometry from CPDB.

We found that:

  • 61% of all capital projects were categorized as Fixed Assets
  • 32% of Fixed Assets were mapped to geometries

We presented this information to the Capital Planning team. Given that we only used one source of geospatial information for our exploratory work, mapping 32% of Fixed Assets to geometries was incredibly promising! Galvanized by these preliminary results, we decided to carry this work through the end of our summer at DCP by funneling it into a reproducible data pipeline.

Part 3: Building a Data Pipeline

Using the insights derived from our exploratory work, we took everything we learned and turned it into an end-to-end data pipeline in Python.

What is a data pipeline?

As defined by IBM, a data pipeline is a “method in which raw data is ingested from various data sources and then ported to a data store, like a data lake or data warehouse, for analysis” [source]. Along the way, input data might be processed — whether by standard cleaning, filtering, aggregation, joining disjoint data sources, etc. Data pipelines often follow a standard ETL (extraction, transformation, loading) process, in which the output of each step in the process becomes the input to the subsequent step. The CSDB data pipeline follows these standards.
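Schematically, an ETL pipeline can be thought of as three functions chained together. The sketch below is a generic skeleton with placeholder file names and trivial function bodies, not the actual CSDB implementation.

```python
# Schematic ETL skeleton; file names and function bodies are placeholders,
# not the actual CSDB pipeline.
import pandas as pd

def extract() -> dict[str, pd.DataFrame]:
    """Pull raw inputs (e.g. Checkbook NYC and CPDB extracts) from the data lake."""
    return {
        "checkbook": pd.read_csv("checkbook_nyc.csv"),
        "cpdb": pd.read_csv("cpdb.csv"),
    }

def transform(raw: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Clean, aggregate, and join the inputs; each step feeds the next."""
    cleaned = raw["checkbook"].dropna(subset=["check_amount"])
    return cleaned  # real pipeline: group by project, join CPDB, assign categories

def load(output: pd.DataFrame) -> None:
    """Write the final dataset back to the data store."""
    output.to_csv("csdb.csv", index=False)

if __name__ == "__main__":
    load(transform(extract()))
```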

Extraction

Our input data sources for the extraction step of the data pipeline were:

  • Checkbook NYC, filtered to include only checks classified as pertaining to capital spending, as defined by the Office of the Comptroller, containing checks from 2010 to the present.
  • Adopted versions of the Capital Projects Database (CPDB), 2017 to 2022; and the executive version of CPDB from 2023, filtered to include only capital projects with geospatial data.

We programmatically uploaded this source data to Digital Ocean (DO), the cloud storage provider the DE team uses to house its data lake. The motivation here is to streamline data ingestion by ensuring that our disjoint input data all resides in the same place for easy access.
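Because DigitalOcean Spaces is S3-compatible, uploads like this can be scripted with boto3. The endpoint, bucket name, and paths below are placeholders rather than the DE team's actual configuration.

```python
# Illustrative upload to DigitalOcean Spaces (S3-compatible) using boto3.
# Endpoint, bucket, credentials, and paths are placeholders, not DE's setup.
import os
import boto3

session = boto3.session.Session()
client = session.client(
    "s3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",
    aws_access_key_id=os.environ["DO_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["DO_SECRET_ACCESS_KEY"],
)

# Upload a local extract into the data lake bucket.
client.upload_file(
    Filename="data/checkbook_nyc.csv",
    Bucket="example-data-lake",
    Key="raw/checkbook_nyc/checkbook_nyc.csv",
)
```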

Transformation

We proceeded to read, clean, wrangle, and join the data as necessary to produce CSDB.

Our steps proceeded as follows, primarily using the pandas Python package:

  1. Clean Checkbook NYC. We standardized column names and excluded checks with invalid data, like negative check amounts.
  2. Group Checkbook NYC by capital project. We grouped the cleaned Checkbook NYC data by FMS ID, aggregating the sum of all checks issued for each distinct capital project and preserving the values of descriptive text fields like Budget Code and Contract Purpose, which we later used to assign categories.
  3. Horizontally join historical versions of CPDB. Where multiple distinct geometries were associated with a capital project across historical versions, i.e. where there were duplicate capital projects, we kept the most recent geometry (see the sketch after this list).
  4. Join the merged CPDB onto Checkbook NYC capital projects to get geospatial data. We joined the two datasets directly on the unique capital project FMS ID.
  5. Assign categories to capital projects. We used the aggregated descriptive text fields to assign categories to projects with the same keyword-based logic as CPDB.
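The sketch below shows one way to keep the most recent geometry per capital project across releases, using a vertical concat and deduplication on made-up data; the actual pipeline's join strategy may differ.

```python
# One way to keep the most recent geometry per capital project across CPDB
# releases. The releases and geometries below are made up; the real pipeline's
# join strategy may differ.
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

releases = {
    2022: gpd.GeoDataFrame(
        {"capital_project_id": ["850PV265"]},
        geometry=[Point(-73.92, 40.83)],
        crs="EPSG:4326",
    ),
    2023: gpd.GeoDataFrame(
        {"capital_project_id": ["850PV265", "841HWQ123"]},
        geometry=[Point(-73.91, 40.84), Point(-73.95, 40.75)],
        crs="EPSG:4326",
    ),
}

# Stack all releases and tag each row with its release year...
stacked = pd.concat(
    [gdf.assign(release_year=year) for year, gdf in releases.items()],
    ignore_index=True,
)

# ...then keep only the most recent geometry for each capital project.
latest_geometries = (
    stacked.sort_values("release_year", ascending=False)
    .drop_duplicates(subset="capital_project_id", keep="first")
)
```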

Loading

At the end of the transformation stage, we had a viable, spatialized, historical capital spending dataset. From here, we used a Python script to upload the output of our work back to DO, where it currently resides.

After implementing this third and final stage, we had a working, end-to-end data pipeline that one can run from the command line.

Testing

After funneling our exploratory work into a streamlined ETL process, how could we be confident in the integrity of our output dataset? With a suite of tests, of course! We ensured the veracity of our outputs as follows:

  • Unit testing. To ensure that each step of the transformation applied to the input data sources — especially the aggregation and category assignment processes — resulted in expected behavior, we used pytest to write a suite of unit tests (a minimal example follows this list).
  • CI testing. The Data Engineering team uses GitHub Actions to ensure that all the steps of the data pipeline work together and in the context of the larger codebase.
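As an illustration, here is a minimal pytest-style test of a hypothetical aggregation helper; it mirrors the kind of checks we wrote but is not the actual CSDB test suite.

```python
# Minimal pytest-style unit test. The aggregation helper tested here is a
# hypothetical stand-in for the pipeline's grouping logic, not the real code.
import pandas as pd

def group_checks_by_project(checks: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: sum check amounts per capital project."""
    return (
        checks.groupby("capital_project_id", as_index=False)
        .agg(check_amount_total=("check_amount", "sum"))
    )

def test_group_checks_by_project_sums_amounts():
    checks = pd.DataFrame({
        "capital_project_id": ["850PV265", "850PV265", "841HWQ123"],
        "check_amount": [100.0, 200.0, 50.0],
    })
    result = group_checks_by_project(checks)

    totals = dict(zip(result["capital_project_id"], result["check_amount_total"]))
    assert totals == {"850PV265": 300.0, "841HWQ123": 50.0}
```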

Part 4: Next steps and takeaways

Where does CSDB go next?

There’s plenty of work left to be done! From enhancing our sources for geospatial information to improving our categorization process and publishing the data publicly, there’s a world of possibilities for CSDB. For an in-depth discussion of potential avenues for improvement, please see my co-fellow Dea Bardhoshi’s accompanying Medium article.

Note: You can view our source code on GitHub, but not only that — you can contribute! The Data Engineering team’s repo is open-source. If you’d like to tackle any of these potential extensions to the Capital Spending Database or any other DE data product, submit or pick up an issue.

Lessons learned

My summer at DCP has been immensely fulfilling. In the process of building a viable new data product living in DE’s codebase, I became proficient in valuable technical skills, like:

  • Writing functional, modular code in SQL and Python, leveraging fundamental Python data science packages like pandas, geopandas, and sqlalchemy
  • Using Python to write and run comprehensive unit tests
  • Constructing a data pipeline using an ETL process
  • Version control with git, including writing meaningful commit messages and PR descriptions; reviewing code and responding to code review; and best practices for opening and closing issues on GitHub

And equally as important, I learned intangible soft skills, like:

  • Communicating with supervisors, coworkers, stakeholders, and the general public about the outcomes of a technical project — at varying levels of minutiae
  • Collaborating remotely with a team of engineers
  • Establishing viable, short-term goals for my work in the pursuit of long-term deliverables, like presenting our exploratory work to Capital Planning or creating CSDB

I was so lucky to end up on a team as special as the Data Engineering team at DCP, and I’m especially grateful to my supervisor and teammates for the mentorship I received. Thank you to Amanda, Damon, Finn, and Alex for constantly supporting Dea and me throughout the summer.
