PLUTO in Five Acts

Amanda Doyle
NYC Planning Tech
Jun 1, 2022

I set out to write this blog post over two years ago. Instead of a written piece, my ideas first became a talk that I gave at FOSS4G-NA in 2019 and then to several more audiences. Since PLUTO remains a core NYC dataset and I wanted this information to reach a wider audience, I have finally put in the time to complete this post and share the work DCP's Data Engineering team has done on PLUTO over the past four years.

Act I: Introducing PLUTO

PLUTO stands for Primary Land Use Tax Lot Output. It’s NYC’s definitive tax lot dataset that was first published by the NYC Department of City Planning (DCP) in 2002.

At the micro level, users can look up the owner, zoning, size, and building class of an individual tax lot. At the macro level, they can study the built landscape of NYC and how it has changed over time.

PLUTO has a multitude of use cases because of the variety of information it contains about all NYC tax lots. PLUTO’s 87 columns capture lot level, building level, and geographic attributes. It’s this variety of data that causes PLUTO to be the “Frankenstein of the NYC lot data sets,” as David Elner describes in his blog post.

I do not know how and why DCP started to publish PLUTO, but PLUTO did become DCP’s premier public data product. The information in PLUTO is so valuable that academics, real estate developers, agencies, planners, civic technologists, and other members of the public were willing to pay $300 per borough to access the data. In 2013, PLUTO’s user base expanded with the passage of the 2012 NYC Open Data Law, and PLUTO was released on Bytes of the Big Apple free for all to use.

Act II: Modernizing PLUTO

In 2018 the Enterprise Data Management (EDM) and Data Engineering teams were created within DCP, and our CIO tasked us with rebuilding and releasing PLUTO using modern technologies.

When we took on the project of modernizing PLUTO's build process, no one person understood the entire build process. At the time, PLUTO was generated over the course of six weeks across three different teams, each contributing their piece of the puzzle. Furthermore, much of PLUTO was processed and built on the mainframe, in PL/I code. Since we're not mainframe programmers, the code was opaque and inaccessible.

How do you recreate a product whose code you can’t edit or read? You reverse engineer it.

Step 1: Get source data
Most of the attributes found in PLUTO come from three datasets from the Department of Finance (DOF): the Property Tax System (PTS), which recently replaced RPAD; the Computer Automated Mass Appraisal System (CAMA); and the Digital Tax Map (DTM). Input datasets are also sourced from DCP, the Department of Citywide Administrative Services (DCAS), and other City and Federal agencies.

Step 2: Read documentation
Data dictionaries and documentation were key to our success. Before diving into code, we read through the data dictionaries for the source datasets, and started mapping PLUTO fields to the fields of the input data sources. To give you a flavor of what this process was like, here is RPAD’s data dictionary, here is CAMA’s, and here is PLUTO’s.

Step 3: Start coding
Referencing previous versions of PLUTO, we started to translate research into code. This involved experimentally manipulating attributes from source data until we produced output values matching those in the previous versions of PLUTO. While this section is very short in this blog post, most of the time spent modernizing PLUTO was spent developing and iterating on code.

Step 4: Review the output
Given how many people use PLUTO on a regular basis and rely on its data to inform decisions that shape NYC, we couldn't publish a new version of PLUTO that deviated far from previous versions or that had incorrect data. So, we devised a way to compare our version of PLUTO to a previous version and report how many records share the same Borough, Block, and Lot (BBL) value (PLUTO's unique identifier) but don't have the same value in a given field. This comparison was imperative to make sure that we were transforming and interpreting the source data correctly. After dissecting PLUTO and putting it back together, our version did not always match the data reported in previous versions; a few factors contributed to those differences.
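
The comparison logic can be sketched in Python. The mapping shape and field names below are illustrative, not DCP's actual schema; the team's real implementation is in their open-source repository.

```python
def compare_versions(old, new, fields):
    """Count, per field, how many lots share a BBL across two PLUTO
    versions but carry different values for that field.

    `old` and `new` map BBL -> {field: value}; this schema is
    illustrative, not DCP's actual table layout.
    """
    shared_bbls = old.keys() & new.keys()
    mismatches = {field: 0 for field in fields}
    for bbl in shared_bbls:
        for field in fields:
            if old[bbl].get(field) != new[bbl].get(field):
                mismatches[field] += 1
    return mismatches

# Two toy "versions" keyed by BBL (Borough, Block, Lot):
old = {"1000010001": {"landuse": "05", "numfloors": 6},
       "1000010002": {"landuse": "04", "numfloors": 12}}
new = {"1000010001": {"landuse": "05", "numfloors": 7},
       "1000010002": {"landuse": "04", "numfloors": 12}}

print(compare_versions(old, new, ["landuse", "numfloors"]))
# {'landuse': 0, 'numfloors': 1}
```

A report like this makes it easy to spot which fields drifted between versions and to chase each discrepancy back to a source-data or methodology change.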

  1. We didn’t have the same versions of the source data that were used to develop previous versions of PLUTO, so naturally our output would differ.
  2. We changed some of the input data sources to use open datasets. Now, we use NYC GIS Zoning Features to programmatically determine the zoning designations of each tax lot, instead of relying on a spreadsheet that was manually updated to capture the zoning data of each tax lot. To populate the owner type field we map data attributes from the City Owned and Leased Properties (COLP) dataset. The Landmarks Preservation Commission’s (LPC) Historic Districts and Individual Landmarks datasets are used to populate the historic district and landmark fields.
  3. In cases where a value could be selected at random, we added logic to make sure that values are selected consistently. For example, there can be multiple lot types associated with a single lot, and in the past one value was selected at random. Now, we implement logic to assign the lowest valid lot type value.
  4. We changed the format and precision of some of PLUTO’s fields to be consistent within PLUTO and with other DCP data products. For example, the fields Lot Frontage, Lot Depth, Building Frontage, and Building Depth are now all rounded to two decimal places. Additionally, we standardized address formatting and removed extra spaces. Lastly, we standardized the use of NULLs so that a NULL consistently means that no information is available.
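
The deterministic selection described in point 3 above can be sketched as follows. The lot type codes and the set of valid codes here are made up for illustration; only the "lowest valid value" rule comes from the post.

```python
def pick_lot_type(candidates, valid_codes):
    """Deterministically choose one lot type when the source data
    offers several: keep only recognized codes, then take the lowest.
    The specific code values are illustrative; the 'lowest valid'
    rule mirrors the selection logic described in the post.
    """
    valid = [c for c in candidates if c in valid_codes]
    return min(valid) if valid else None

# A lot reported with three candidate lot types, one unrecognized:
print(pick_lot_type([5, 3, 9], valid_codes={0, 1, 2, 3, 4, 5}))  # 3
```

Because the result depends only on the inputs, rebuilding PLUTO from the same source data now yields the same value every time, which the old random selection could not guarantee.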

Step 5: Publish!
After a year of research, development, and experimentation, PLUTO production migrated off the mainframe and the Data Engineering team released its first version of PLUTO in 2019. Now, PLUTO is built and reviewed in less than two days by a single team using open source technologies, and all of the code is open on GitHub.

It’s important to us that PLUTO’s build process is transparent to support open analytics and algorithms, and enables users to explore how each of PLUTO’s attribute values is calculated. More importantly, now that we understand how to create PLUTO, we can improve it.

Act III: Improving PLUTO

After publishing PLUTO, we immediately started soliciting input from Planners, other city agencies, and the public on how to improve it. These conversations led to some enhancements, including:

  1. Normalizing the owner name values for City agencies. For example, previously, lots owned by the Department of Parks and Recreation could have an owner name value of “Parks Department,” “Dept. of Parks,” “NYC Parks Dept,” or one of five other variations. Now, all of these lots have the owner name standardized to “Department of Parks and Recreation.” This improvement makes it much easier to filter and analyze this attribute.
  2. Updating the input data for the year built value. LPC informed us that they had a new dataset with detailed year built information for landmarked buildings and buildings in historic districts. Many of the year built values from DOF for buildings constructed before 1980 are rounded to the nearest five years, yet the field is defined as recording the year built in 1-year increments. There wasn’t a building boom in 1920, 1925, and 1930; mixing 5-year-interval data into a 1-year-increment field invites exactly that misreading. Similarly, lots with buildings constructed before 1900 were often given a year built value of 1899, even though construction in NYC certainly didn’t start that year. As a result of this research and the improved LPC source data, many of these lots now have more precise year built values.
  3. Implementing manual corrections. Some values in PLUTO were just wrong. There is not a 100-story building in Staten Island. We implemented a method to overwrite values at the lot level, conditional on the original value. For example, if lot 5012900096 has 100 floors, we’ll add a record to our corrections table saying: “change the number of floors value to 0 for lot 5012900096 anytime the number of floors is originally 100.” If DOF fixes the number of floors (e.g., sets it to 10), no overwriting happens. A PLUTO user can see if a record was edited by DCP based on the value in the DCPEdited field. Then, a user can look up that BBL in the manual research table to see all the changes that DCP made to that PLUTO record.
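
The conditional overwrite described above can be sketched in Python. The keys (`bbl`, `field`, `old_value`, `new_value`, `dcpedited`) are an illustrative schema, not the actual layout of DCP's corrections table.

```python
def apply_correction(record, correction):
    """Apply a lot-level correction only when the record still carries
    the original bad value; if DOF has since fixed the field, leave it
    alone. The dict schema here is illustrative, not DCP's actual
    corrections-table layout.
    """
    if record["bbl"] == correction["bbl"] and \
       record[correction["field"]] == correction["old_value"]:
        record[correction["field"]] = correction["new_value"]
        record["dcpedited"] = True  # flag the record as DCP-edited
    return record

correction = {"bbl": "5012900096", "field": "numfloors",
              "old_value": 100, "new_value": 0}

# DOF still reports 100 floors, so the correction fires:
lot = {"bbl": "5012900096", "numfloors": 100, "dcpedited": False}
print(apply_correction(lot, correction)["numfloors"])  # 0

# DOF has since fixed the value to 10, so nothing is overwritten:
lot = {"bbl": "5012900096", "numfloors": 10, "dcpedited": False}
print(apply_correction(lot, correction)["numfloors"])  # 10
```

Keying each correction to the original bad value means a fix from DOF automatically retires the correction, so stale overrides never clobber good upstream data.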

Through workshops with PLUTO users from DCP, other City agencies, and the public, we started a backlog of other potential improvements we could make. These include modifications of existing fields as well as the addition of new fields to help meet existing user needs and expand PLUTO’s use cases.

Furthermore, DCP improved its Quality Assurance and Quality Control (QAQC) process for reviewing a version of PLUTO before publishing it. Building off of the comparative review established when reverse engineering PLUTO, we built out bite-sized reports and documentation on a QAQC site. This site makes the QAQC process manageable with clear sections and descriptions on what a reviewer should be aware of, and enables the team to see how PLUTO has changed between versions. If several members of EDM review the PLUTO outputs and QAQC reports and do not report any anomalies that cannot be explained via source data or methodology changes, then PLUTO is cleared for publication and made available to the public.

Act IV: Automating PLUTO

In 2020 the emergence of new technologies thrust the team’s data production pipelines into the age of cloud automation. Additionally, DOF moved many of their data products off of the mainframe and began publishing them on the internet, as opposed to the intranet. Using GitHub Actions, we were able to operationalize and automate the PLUTO production process, including data ingestion, execution, and publication. As a result, we published eight versions of PLUTO in 2020, instead of the usual twice-yearly updates, all while working from home.

Act V: Future PLUTO

I don’t know exactly what the future holds for PLUTO, but I do know PLUTO is a prominent dataset in the open data world and people have big ideas and high expectations for it. As a team, we’ll continually plan for PLUTO improvements. Additionally, because of PLUTO’s large user base and lack of a defined governance process, we’ll continue to solicit feedback from users to get ideas on what changes would be useful and to help us prioritize enhancements. So, if you have an idea for PLUTO, please let us know by opening an issue here or by reaching out to us directly.

--

Urban scientist / Geographer / Data engineer / City enthusiast