Mapping Historical Capital Spending in NYC: Part 2

Where does the Capital Spending Database go from here?

Dea Bardhoshi
NYC Planning Tech
8 min read · Sep 28, 2023



What is the Capital Spending Database?

The Capital Spending Database (CSDB) is a new data product that represents the first step toward a spatialized, backward-looking view of the City’s capital spending. Each row in the dataset contains key information about one of over 16,000 unique capital projects, including:

  • a list of agencies involved in the project
  • the project category
  • the sum of all checks disbursed for the project

And, most vitally:

  • geospatial information where possible

This product was created in direct response to requests from DCP staff and agency partners: previously, no dataset linked historical capital budget spending to location information. At its current stage, the two source inputs are Checkbook NYC and the Capital Projects Database (2017–2023).

Our work so far has focused on joining Checkbook NYC data, which spans from 2010 to the present, to the Capital Projects Database (CPDB) geospatial data (2017–2023). Despite the temporal limitations of the CPDB data, we successfully assigned geometries to ~23% of Checkbook NYC capital projects. Additionally, using keywords in descriptive fields, we categorized 92% of the projects into one of three categories:

  • Fixed Asset: bridges, sewers, buildings, etc.
  • ITT, Vehicles, and Equipment: fire trucks, fiber cables, and the like
  • Lump Sum: non-specific funding for a large program or initiative

We prioritized Fixed Asset projects, as these are often place-specific physical infrastructure projects.

This post explores how we can improve the categorization of projects and increase the number of projects that are mapped.

Who am I?

I am a recent UC Berkeley graduate with a Bachelor’s degree in Data Science and a concentration in Geospatial Information and Technology. I am very interested in working at the intersection of data science, public policy, and especially urban planning. I love thinking about how cities develop and the role technology plays in their complexity, so contributing to the interdisciplinary work DCP does has been very inspiring. This summer, I worked as a Data Fellow with DCP’s Data Engineering team, and here I’ll share some of the results that my co-fellow Ali Thursland, the rest of the Data Engineering team, and I have worked on.

How else can we add geospatial information to the Capital Spending Database?

In the current version of CSDB, we have been able to map ~23% of all projects and ~32% of projects categorized as Fixed Asset. What about mapping the rest?

The fields we use to categorize projects contain valuable descriptive information that we could use to enrich the dataset with additional geospatial data. For instance, one budget code entry is “MEA 3 (Flushing Meadows Corona Park)”. Clearly, this information is useful: if we could extract “Flushing Meadows Corona Park” from the budget code, we would have a place that we could map to geospatial information using one of the other City Planning datasets.
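As a minimal sketch of that extraction idea (the function name is ours and the pattern is simplified):

```python
import re

def extract_parenthetical_place(budget_code: str) -> str | None:
    """Pull a place name out of a parenthesized budget code entry."""
    match = re.search(r"\(([^)]+)\)", budget_code)
    return match.group(1) if match else None

print(extract_parenthetical_place("MEA 3 (Flushing Meadows Corona Park)"))
# -> Flushing Meadows Corona Park
```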

To answer this question, we mined the descriptive fields of the dataset (Budget Code and Contract Purpose) for what we call “place mentions”. In our analysis, a place mention is either a street address (e.g. “155th Street”) or a keyword (such as “McCarren Park”). The reasoning behind this strategy is that once we have place mentions, we can match them to other datasets that contain the same information but also carry geometries or other location information.

In our exploration phase, we spent some time pursuing this approach. First, for street addresses, we used regular expressions to extract common street address patterns, such as “NUMBER STREET” or “NUMBER AVENUE”, while also accounting for common typos. The initial results were relatively modest: most street addresses in the descriptive columns were incomplete, though a few were complete and correct.
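Below is a simplified sketch of this kind of extraction; the expressions we actually used were more involved and covered more patterns and typos:

```python
import re

# Illustrative pattern for "NUMBER STREET" / "NUMBER AVENUE" style mentions.
STREET_PATTERN = re.compile(
    r"\b(\d+(?:st|nd|rd|th)?)\s+(street|avenue|st|ave)\b",
    flags=re.IGNORECASE,
)

def find_street_mentions(text: str) -> list[str]:
    """Return every street-like mention found in a descriptive field."""
    return [" ".join(groups) for groups in STREET_PATTERN.findall(text)]

print(find_street_mentions("RECON OF 155TH STREET AND 8TH AVE"))
# -> ['155TH STREET', '8TH AVE']
```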

We were able to extract a few of these patterns, however. After extracting street information, we were able to map several million dollars in spending along 155th Street, 8th Avenue, and 79th Street. A key limitation of this approach is that it does not yield exact addresses; rather, it maps spending to entire streets. That loss of granularity matters for the spatial analysis our product is trying to inform.

Second, we homed in on capital projects whose “Budget Code” contained a keyword indicating that the project was likely a Fixed Asset and a place-specific infrastructure project. Keywords included “park”, “garage”, “facade”, and the like.

Our initial results here were also relatively modest: only around 6% of the checks in our data contained one of the keywords we were looking for. Collectively, those checks accounted for around $14 billion in spending. Projects with “sewer” in the budget code accounted for the most spending at $2.5B, and projects with “plant” in the budget code accounted for $2B. This approach shares the street-address limitation: extracting keywords is only an intermediate step toward mapping these checks to exact locations. Still, this analysis gives an upper bound on the spending these keywords could let us map.
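A rough sketch of this kind of keyword query using pandas, with invented column names and toy rows standing in for the real checks data:

```python
import pandas as pd

# Toy stand-in for the checks data; column names are illustrative.
checks = pd.DataFrame({
    "budget_code": ["SEWER RECONSTRUCTION BK", "PLANT UPGRADE", "FLEET VEHICLES"],
    "check_amount": [2_500_000.0, 1_000_000.0, 300_000.0],
})

keywords = ["park", "garage", "facade", "sewer", "plant"]

# Flag checks whose budget code contains any keyword...
has_keyword = checks["budget_code"].str.contains("|".join(keywords), case=False, na=False)
print(f"{has_keyword.mean():.0%} of checks contain a keyword")

# ...then total the spending attributable to each keyword.
for kw in keywords:
    mask = checks["budget_code"].str.contains(kw, case=False, na=False)
    print(kw, checks.loc[mask, "check_amount"].sum())
```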

Once we had these place mentions, our approach for mapping them to other datasets was fuzzy string matching, a technique that seeks partial matches between two strings. The higher the similarity score, the more likely the two strings refer to the same thing: for example, “Harlem Success Academy Charter” and “Success Academy Charter School Harlem” receive a relatively high match score (0.43 on a scale from 0 to 1).
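As a rough illustration of string similarity using only the Python standard library (our actual scores came from the matching packages described below, so the exact numbers differ):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; one of many possible scorers."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity(
    "Harlem Success Academy Charter",
    "Success Academy Charter School Harlem",
))
```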

The two datasets we identified as useful for this fuzzy string matching against our checkbook data were the Parks Properties dataset and the Facilities Database (FacDB). Each contains geospatial information which, if matched to our capital spending data, could extend the geospatial reach of our product.

The fuzzy string matching process came with some challenges and interesting discoveries. First, the challenges. When we tried to match two datasets with fuzzy strings, the run time was prohibitively long, even on subsets of our data: matching just the roughly 1,500 Department of Education (DOE) checks took about 30 seconds, and larger subsets ran for minutes. To mitigate this, we found that some data cleaning was often useful: for instance, fixing typos within the “Budget Code” descriptions, or restricting our analysis to certain agencies, such as the Parks Department.
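As a hypothetical illustration of that cleanup (the typo map and column names are invented):

```python
import pandas as pd

# Toy stand-in for the checks data.
checks = pd.DataFrame({
    "agency": ["Parks Department", "DOE"],
    "budget_code": ["MCCARREN PRK RECON", "PS 123 FACADE"],
})

typo_fixes = {"prk": "park"}  # hypothetical typo map

def clean_budget_code(text: str) -> str:
    """Lowercase a budget code and apply known typo fixes."""
    out = text.lower()
    for typo, fix in typo_fixes.items():
        out = out.replace(typo, fix)
    return out

# Restrict to one agency and normalize its budget codes before matching.
parks = checks.loc[checks["agency"] == "Parks Department"].copy()
parks["budget_code_clean"] = parks["budget_code"].map(clean_budget_code)
print(parks)
```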

There were some interesting tools that we came across as well. Our package of choice during this initial exploration was fuzzymatcher, a tool that matches and joins two datasets on a common column. A more advanced tool built on fuzzy matching is Splink, a package for probabilistic record linkage designed to join two datasets that lack a shared unique identifier column. Splink predicts which rows should link together and produces the most likely match, at much better performance.
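To give a flavor of the fuzzymatcher workflow, here is a toy join; the frames and column names below are stand-ins for our checks and FacDB:

```python
import pandas as pd
import fuzzymatcher

# Toy stand-ins for our checks and FacDB records.
checks = pd.DataFrame({"place_mention": ["Harlem Success Academy Charter"]})
facilities = pd.DataFrame({
    "facname": ["Success Academy Charter School Harlem"],
    "borough": ["Manhattan"],
})

# Pair each check with its best-scoring facility record.
matched = fuzzymatcher.fuzzy_left_join(
    checks, facilities, left_on="place_mention", right_on="facname"
)
print(matched[["best_match_score", "place_mention", "facname"]])
```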

What were our results? Evaluating them points to another nuance of the technique: high match scores often serve as a proxy for accuracy, since it is difficult to know whether a match is truly correct (i.e. whether the two strings refer to the same entity). For the Department of Education checks, we were able to match around 2% of the money disbursed to records from FacDB. Despite this modest percentage, all hope is not lost for fuzzy string matching! In the future, we can expand the analysis to agencies beyond the DOE and use other datasets better suited to matching these mentions.

How can we improve the categorization process?

A large part of our exploratory analysis focused on categorizing the capital projects in the data into one of the three CPDB categories: “Fixed Asset”, “Lump Sum”, and “ITT, Vehicles, and Equipment”. When assigning categories, we used the same strategy CPDB uses: a keyword query on the descriptive columns of our dataset (Budget Code and Contract Purpose). In our categorization process, we prioritized assigning Fixed Asset, as those projects typically correspond to physical infrastructure and are therefore the most useful to add geospatial information to.
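Here is a minimal sketch of that keyword-based categorization; the keyword lists are short, invented stand-ins for the real queries:

```python
# Illustrative keyword lists; the real queries are longer and tuned per column.
CATEGORY_KEYWORDS = {
    # Checked first because we prioritized assigning Fixed Asset.
    "Fixed Asset": ["bridge", "sewer", "building", "park", "facade"],
    "ITT, Vehicles, and Equipment": ["vehicle", "truck", "equipment", "fiber"],
    "Lump Sum": ["lump sum", "program", "initiative"],
}

def categorize(description: str) -> str | None:
    """Return the first category whose keywords appear in the description."""
    lowered = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return category
    return None  # projects with no keyword hit stay uncategorized

print(categorize("SEWER RECONSTRUCTION, BOROUGH OF QUEENS"))
# -> Fixed Asset
```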

Our categorization process was successful: we matched 92% of projects to a category, and around 61% of all projects to Fixed Asset specifically. To improve on these initial results, we also explored other keywords we could add to the queries. Exploring the datasets surfaced some potential new additions: for instance, accessibility-related terms such as “ADA” or the word “accessibility” itself. How many projects contained one of these new keywords? It turned out to be a relatively small number (80–150 projects in total) for each of the two columns we queried.

Conclusion

As this post has outlined, there are a couple of exciting directions for our new data product. We’ve explored how to extract information from our main dataset, Checkbook NYC, using various text-based techniques. For instance, we extracted street and place mentions, estimated how many projects can be mapped this way, and used fuzzy string matching to link them with other geospatial datasets. We’ve also worked on improving how we categorize these projects, noticing new keywords appearing in the data, such as terms for emerging technologies. There’s more exciting work to build on from here!

Lessons Learned

Working with the Data Engineering team has already taught me so much in a relatively short amount of time. From merging my first-ever pull request and reviewing other team members’ code, to best practices for managing to-dos and learning how to scope open-ended projects, my time with the team has been incredibly instructive. I’ve had the chance to learn how a data engineering team can operate in the larger context of urban planning, how the products the team builds and maintains are created in collaboration with planners, GIS specialists, and demographers, and how the work we do directly affects New Yorkers. Seeing the data products in action and helping to make them better on the back end has been a very inspirational experience, and I am thankful to everyone in Data Engineering and DCP for their help and support!
