Meet the Engineers: Molly Graber

Molly Graber
Published in NYC Planning Tech
9 min read · Aug 20, 2021

Molly Graber, a Colorado native, has been with the Data Engineering team at NYC Planning for almost two years. Coming from a background in Geography and Statistics, Molly is the main contributor to core data products including Community Profiles and Population FactFinder. Come learn how she grew as a data engineer and how her work is making a lasting impact on NYC Open Data.

How I came to the Data Engineering team

I came to NYC generally, and to NYC’s Department of City Planning specifically, after completing a master’s degree in geography from the University of Colorado. In my graduate work, I focused on using computational techniques to understand and mitigate uncertainty in small-area American Community Survey data. I was fortunate to work in a Federal Statistical Data Research Center, a secure environment where I could run analyses on disaggregated Census responses to see how spatial enumeration units (Census blocks, tracts, etc.) affect the data that informs the allocation of billions of dollars in federal funding.

I came to NYC with a goal of using my technical background to improve data quality (and ideally decision-making, by implication), a general interest in demography and urban landscapes, and a strong desire to apply my training to contribute to my new, local community. My path through grad school and before (I’d studied quantitative economics, also at CU) was generally guided by a fascination with how our societies — small and large — make decisions with uncertain information. Spatial data particularly fascinated me, since an undergrad foray into GIS made me realize how adding a spatial component makes data uncertainty even more perplexing!

I landed in the Department of City Planning’s Data Engineering team. It was a good fit: I use my technical skills to produce high-quality data used by planners, DCP’s economic and demographic researchers, analysts at other city agencies, and the public.

What my days look like

“As a member of a small team, I’m able to contribute to our data workflows from start to finish… I’m in regular communication with my coworkers. Data Engineering is a very collaborative team, especially considering how much of our days are spent writing code on our own.”

My days as a data engineer are primarily spent either writing code in Python, SQL, or Bash, or doing “detective” work to identify inconsistencies in our data products. I particularly enjoy the investigative part of my days, where I trace data anomalies backwards to find where in our logic (or our source data) they originated. This type of problem-solving is both creative and systematic, and it improves the quality of the data we produce. Throughout all of these activities, I’m in regular communication with my coworkers.
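For a flavor of that detective work, here is a minimal sketch of the pattern, with hypothetical table, column, and connection names: find records that look wrong in a final product, then check whether the problem already existed in the source data or was introduced by our logic.

```python
# A minimal sketch of anomaly "detective" work against Postgres via
# psycopg2. All table, column, and connection names are hypothetical.
import psycopg2

conn = psycopg2.connect("postgresql://localhost/edm")  # hypothetical DSN

FINAL_NULLS = "SELECT bbl FROM final_product WHERE geom IS NULL;"
SOURCE_CHECK = """
    SELECT bbl, (geom IS NULL) AS null_in_source
    FROM source_lots
    WHERE bbl = ANY(%s);
"""

with conn, conn.cursor() as cur:
    # Step 1: collect records that look wrong in the final product
    cur.execute(FINAL_NULLS)
    suspect_bbls = [row[0] for row in cur.fetchall()]

    # Step 2: check whether the problem was already in the source
    if suspect_bbls:
        cur.execute(SOURCE_CHECK, (suspect_bbls,))
        for bbl, null_in_source in cur.fetchall():
            origin = "source data" if null_in_source else "our logic"
            print(f"{bbl}: missing geometry originated in {origin}")
```

In practice the chain is often longer, with several intermediate tables between source and product, but the back-tracing idea is the same.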

Building a data product, start-to-finish

As a member of a small team, I’m able to contribute to our data workflows from start to finish. The early stages of a project typically involve scoping out the requirements for a given data product, based on the needs of the data user and the most valuable characteristics of our source data. This process is typically iterative, and includes drafts of both the schema and the logic. As we identify the work requirements, we plan out specific tasks to tackle in code and organize these as GitHub issues. We’re very organized, which I suspect is one reason our small-but-mighty team has accomplished so much, even in the midst of a pandemic!

I’m also able to determine how we’ll “gather the ingredients.” What source data best answers a given question (that is, the key questions of our products’ end users)? Where can we most consistently get this data? How can we retrieve it in a way that is automated and ensures we have the most up-to-date information possible? What are the shortcomings of our source data, and how will they affect our final products?
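Automated retrieval often means hitting an open data API on a schedule. Here is a minimal sketch, assuming a Socrata-style endpoint like the one behind NYC Open Data; the dataset ID and output filename are placeholders.

```python
# A minimal sketch of automated source-data retrieval from a Socrata
# endpoint. The dataset ID is a placeholder; every dataset on
# data.cityofnewyork.us has its own.
import requests

DATASET_ID = "xxxx-xxxx"  # hypothetical Socrata dataset ID
URL = f"https://data.cityofnewyork.us/resource/{DATASET_ID}.csv"

# $limit caps the rows returned; Socrata defaults to only 1,000
response = requests.get(URL, params={"$limit": 500000}, timeout=60)
response.raise_for_status()

# Write a local snapshot so the rest of the pipeline runs against known input
with open("source_data.csv", "w") as f:
    f.write(response.text)
```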

Next comes writing the necessary logic, typically in Postgres, to combine our raw data into a set of tables meeting our previously scoped objectives. This portion of my work is a little less technology-heavy, and a little more logic-heavy. My trustiest tools are databases where I can test out queries, GitHub Actions workflows where I can quickly rerun entire pipelines for testing purposes (I think of these as my sandbox), and the invaluable code reviews of my coworkers.
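As a sketch of that pattern (all table, column, and connection names here are hypothetical), the core of a build step is often a single CREATE TABLE AS statement combining raw source tables into the scoped schema, run here via psycopg2:

```python
import psycopg2

# The pattern: combine raw source tables into an output table matching
# the scoped schema. All names below are hypothetical.
BUILD_LOTS = """
    DROP TABLE IF EXISTS product_lots;
    CREATE TABLE product_lots AS
    SELECT a.bbl,
           COALESCE(a.address, b.address) AS address,  -- prefer primary source
           b.zoning_district
    FROM raw_lots AS a
    LEFT JOIN raw_zoning AS b USING (bbl);
"""

with psycopg2.connect("postgresql://localhost/edm") as conn:  # hypothetical DSN
    with conn.cursor() as cur:
        cur.execute(BUILD_LOTS)
```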

Streamlit application for PLUTO version-to-version QAQC

I’m also able to brainstorm ways of checking the quality of our resulting datasets. As someone passionate about data quality, I love this aspect of my work. Some QAQC processes are ones Data Engineering uses to identify bugs in our own logic; these might take the form of SQL queries flagging edge cases or outliers. Sometimes, we use similar queries to create tables flagging records that potentially need review by a domain expert, which we then incorporate as manual corrections; these corrections allow for record-specific exceptions to our logic. Other times, we use QAQC tables to identify potential improvements to source data, including the files underlying Geosupport (NYC’s geocoding software). Not all of our QAQC processes take the form of tables, however: we’ve also created Streamlit applications to aid in domain-expert review, with visualizations showing version-to-version differences and various summary statistics.
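A Streamlit QAQC app along those lines can be quite small. This sketch assumes two CSV exports of a product keyed on a hypothetical bbl column, and surfaces records added or dropped between versions:

```python
# Run with: streamlit run qaqc_app.py
import pandas as pd
import streamlit as st

st.title("Version-to-version QAQC")

old = pd.read_csv("product_previous.csv")  # hypothetical version exports
new = pd.read_csv("product_latest.csv")

# Headline check: did the record count move more than expected?
st.metric("Record count", len(new), delta=len(new) - len(old))

# Flag records added or dropped between versions for domain-expert review
merged = old.merge(new, on="bbl", how="outer", indicator=True)
changed = merged[merged["_merge"] != "both"]
st.subheader("Records added or dropped")
st.dataframe(changed)
```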

Data delivery is a fun and varied aspect of our work. We strongly believe that making good quality data includes making data that is easily accessible to its intended audience. We’re constantly improving the ways we deliver our data to users, largely guided by their particular needs and skillsets. Often our data ends up hosted as CSVs on DigitalOcean, with download links directly in a GitHub repository’s ReadMe. Sometimes these hosted datasets are automatically synced with tables in Carto for the “staging” or “testing” applications run by Planning Labs, the team in charge of many of DCP’s data exploration applications. Some of our datasets go to our GIS team for final processing prior to publication on Bytes of the Big Apple and NYC Open Data. We’ve even created ways of hosting our data on SharePoint, since much of city government work happens in Microsoft ecosystems. Most recently, we’ve been exploring hosting data on Google Cloud Platform for immediate integration with analysis and visualization tools.
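Because DigitalOcean Spaces speaks the S3 API, uploads like this can go through boto3 with a custom endpoint. A minimal sketch, with placeholder bucket name, key, and credential environment variables:

```python
import os
import boto3

# DigitalOcean Spaces is S3-compatible, so boto3 works with a custom endpoint
client = boto3.client(
    "s3",
    endpoint_url="https://nyc3.digitaloceanspaces.com",
    aws_access_key_id=os.environ["SPACES_KEY"],        # placeholder env vars
    aws_secret_access_key=os.environ["SPACES_SECRET"],
)

# "public-read" lets a ReadMe's download link work without credentials
client.upload_file(
    "product.csv",
    "my-data-space",                    # hypothetical Space (bucket) name
    "my-product/latest/product.csv",    # hypothetical key
    ExtraArgs={"ACL": "public-read"},
)
```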

Automated workflow execution via GitHub Actions

Tying it all together, we also spend our days refining the automation and execution of these steps. Over the past year or so, we’ve relied heavily on GitHub Actions. This has let us produce data products at a more frequent cadence, and even off-load some of the execution of our processes to other teams.
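A scheduled workflow along those lines might look like the following sketch (the workflow name, schedule, and entry point are hypothetical); the workflow_dispatch trigger is what lets people outside the team kick off a run from the GitHub UI:

```yaml
# A sketch of a scheduled GitHub Actions workflow; repo contents and the
# entry point are hypothetical.
name: Build data product

on:
  schedule:
    - cron: "0 6 * * 1"    # every Monday at 6:00 UTC
  workflow_dispatch:        # allows manual runs from the Actions tab

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run the pipeline
        run: |
          pip install -r requirements.txt
          python -m build_pipeline    # hypothetical entry point
```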

I “talk” a lot!

Data Engineering is a very collaborative team, especially considering how much of our days are spent writing code on our own. We’re in fairly constant communication with each other (these days, in the form of Teams chats and GitHub issue comments) and heavily rely on each other for brainstorming potential causes of bugs and ideas for process improvements. When solving a problem, multiple perspectives are usually better than one, and we learn from each other. Our highly collaborative culture means we don’t accidentally duplicate work or let things fall through the cracks by assuming someone else is working on a task. We highly value the maintainability of everything we create. Collaboration and open communication help us prevent knowledge from getting siloed in a single person.

Our chattiness extends beyond our mighty team of three. We’re also in regular communication with the business owners of our data products. We reach out to make sure that our logic matches what they intended, give them opportunities to refine what they want from a product, and keep them up to date about our progress and timeline. Since many of our datasets feed into downstream analyses, we know others rely on our ability to accurately communicate our progress, timeline, and work capacity.

We also communicate our work to the agency and the public. As a team that primarily produces behind-the-scenes data (rather than front-end applications), we often need to explicitly show off our work for it to be seen. We’ve done this by blogging, writing public-facing documentation, and (our favorite) giving presentations and demos. Our audiences are other DCP teams, other agencies, and the general public. These showcases are crucial to our continued success as a team: they draw more people to our data and keep us in the loop about new use cases as they pop up.

Why DCP’s Data Engineering team is great

“I’m able to work with cutting-edge technologies and techniques typical of the private sector, but within a city agency… I’m surrounded by experts in various domains (planners, demographers, economists, urban designers, lawyers), and am able to learn from their expertise.”

Working for DCP’s Data Engineering team is truly unique. I’m able to work with cutting-edge technologies and techniques typical of the private sector, but within a city agency. I appreciate that the code I develop is placed in a larger context. Many users of my work are within my own organization. The data we create and the quality improvements we bring have direct impact on decision-making for the city where I live and work. I’m able to indirectly support some of the most basic and fundamental city services — ones I rely on myself! My job also fuses the old with the new. While the tools I use on a day-to-day basis are typically modern and rapidly changing, I’m able to add my contributions to legacy data that have existed in some form since before digitization, and likely will continue to exist for years to come.

I especially appreciate that I am a tech employee in an organization that is predominantly not staffed by people with tech backgrounds. I love improving my ability to communicate technical concepts to audiences with other skillsets. I’m also surrounded by experts in various domains (planners, demographers, economists, urban designers, lawyers), and am able to learn from their expertise.

The best part of my role in Data Engineering is the people I work with. I have incredibly talented coworkers who expose me to new tools or concepts daily. I work under supportive leadership and have lots of room to grow within my position.

What I’ve learned over the past few years

“We’re able to pivot and explore new tools fairly quickly, which allows for a fast pace of skill development… While technical skill acquisition is probably most measurable, I’ve picked up project management skills as well. I’ve also learned a lot about NYC!”

Over my time with Data Engineering, I’ve been thrilled at how quickly I’ve been able to hone new technical skills. Within the past year and a half, I’ve vastly improved my Python, SQL, Bash, Docker, and AWS abilities. I’ve learned PostGIS and GitHub Actions, and how to integrate automated workflows with platforms such as SharePoint and Carto. I’ve also been able to work in Google Cloud Platform (a relatively new addition to our toolkit). We’re able to pivot and explore new tools fairly quickly, which allows for a fast pace of skill development.

While technical skill acquisition is probably the most measurable, I’ve picked up project management skills as well. By taking on our latest enhancement effort for the Developments Database, I’ve learned how to scope product improvements with business owners (weighing the relative difficulty and value of various changes), break larger tasks into actionable items, translate business requests into technical GitHub issues, and communicate our progress throughout the duration of the enhancement period.

I’ve also learned a lot about NYC! Working closely with the Geographic Research Unit, I’ve learned about many of the city’s geographical quirks. In processing Census and American Community Survey data for Population FactFinder, I’ve learned about NYC’s demography and why counting people in the city is such a challenge. I’ve learned about the zoning process and how both neighborhood studies and lot-level zoning actions ultimately affect what can get built where. I’ve learned about the building permitting process and how valuable this data can be for suggesting where future New Yorkers will live. In interactions with my DCP colleagues, as well as others working for the City, I’ve learned the ways in which data supports the essential functions of a city agency. I’ve also learned that data alone is insufficient to guide decision-making, and that community engagement and relationships are also key.

Dreams for the future

“Going forward, we aim to focus on creating a larger data ecosystem.”

As a team, Data Engineering has big dreams for our future. We’ve gotten good at producing high quality data in a way that is transparent, efficient, and automated. Going forward, we aim to focus on creating a larger data ecosystem. This means using platforms such as Google Cloud Platform to allow for easy data discovery, a searchable data catalogue, integrated metadata, and direct access to computational resources for exploratory data analysis and visualization. We also hope to add value to our data by communicating “best practices” and dataset literacy, particularly for working with uncertain data such as small-area ACS data. We’re hoping to build not only data, but also a community of data users. We’re exploring ways of allowing city analysts and the public to share analysis ideas, processing tips, and feedback with us and each other. In short, the future of Data Engineering holds opportunity for learning new tools and improving the context and quality of our work.
