Be like water: building a flexible data pipeline for the Public Utilities Commission

DataSF
Jun 8, 2023

In 1971, when martial artist, actor, philosopher, and San Francisco native Bruce Lee was asked how best to express his philosophy on martial arts, he said, “Be water, my friend.” Water, he explained, was “formless, shapeless,” taking the form of whatever vessel it was in and adapting perfectly to its circumstances.

Admittedly, there is very little overlap between this martial arts philosophy and creating a good data pipeline, but one takeaway does flow through: data pipelines should be flexible and able to take whatever form the data requires.

In the rest of this blog, I’ll talk about how taking a flexible approach was critical to building a data pipeline to publish a novel dataset on public water assets in San Francisco.

Mapping public water assets

When staff of San Francisco’s Public Utilities Commission reached out to DataSF to collaborate on a map of all bathrooms and water fountains within San Francisco (“Public water assets”), we knew a flexible data pipeline would be necessary. This data could power maps and resources for people experiencing homelessness and for tourists looking for a bathroom, and it could enable analysis of how equitably water assets are distributed across the city. The challenge was that the data for this project was spread across multiple departments.

No fewer than five departments manage public water assets: Public Utilities Commission (“PUC”), Recreation and Parks (“REC”), San Francisco Port (“PRT”), Department of Public Works (“DPW”), and the Public Library (“LIB”). Each has a different system of record, different data fields, and different formatting. DataSF had to find a way to combine all of this data, knowing that the sources and formatting would likely change over time.

To arrive at our final destination, we would need to prep, extract, transform, and upload the data; we’ll explore each of these steps in the sections that follow.

[Figure: boxes for each of the five departments (PUC, REC, PRT, DPW, LIB) connect to a question mark box, which connects to the final water assets map]
Our starting point

Prepping the data

Before doing anything else, we needed to make sure that (1) all datasets were stored somewhere DataSF could access them and (2) they all had some mechanisms in place to validate data and prevent unintended changes.

Ditch the spreadsheets

One of the less-stable places data can be stored is a spreadsheet. Spreadsheets are great for quick calculations and visualization, but are prone to human error and accidental changes in column names and structure. To address this, we uploaded all spreadsheet data to our Open Data Portal. There we could more easily enforce rules like consistent column names and data types. Uploading data to the Open Data Portal also came with the added benefit of a friendly REST API which allowed us to easily extract data and copy it to our data staging area (more on this below).
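For readers who want to see what that extraction looks like, here is a minimal sketch of pulling rows from a Socrata-backed portal through the SODA API. The dataset ID below is a placeholder, not one of the real department datasets.

```python
import requests

# Every dataset on the Open Data Portal (Socrata) is reachable through the SODA REST API.
# "abcd-1234" is a placeholder dataset ID, not one of the real department datasets.
DATASET_ID = "abcd-1234"
URL = f"https://data.sfgov.org/resource/{DATASET_ID}.json"

# Fetch the rows as JSON; $limit caps how many records come back in one request.
response = requests.get(URL, params={"$limit": 5000}, timeout=30)
response.raise_for_status()
rows = response.json()  # a list of dicts, one per record

print(f"Fetched {len(rows)} records from the Open Data Portal")
```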

Check the other sources

Our other sources of data were the Recreation & Park data on an ArcGIS server and the DPW COVID-19 Pit Stops data, which was already on the Open Data Portal.

[Figure: boxes for each of the five departments (PUC, REC, PRT, DPW, LIB) connect to the location of their dataset (Excel, ArcGIS, Socrata), then to a question mark, then to the final water assets map]
Where we stand after doing some data prep

Setting up a staging area

Next, we created a data staging area: an intermediate step in the pipeline meant to bring all the raw datasets together in a central location so they are ready to be combined and transformed (this is also referred to as a data landing zone).

Extracting data from sources

Next, we needed a tool that could extract data from these sources and move it into our staging area. For this step, we selected Microsoft Azure Data Factory (“ADF”) based on its prebuilt connectors, easy setup, and integration with our Azure Blob Storage staging area.

Though we could have transformed the data as we copied it into the staging area, we decided to copy the raw version of the data into staging. This is sometimes referred to as ELT (Extract, Load, Transform) because you extract and load the raw data, then do the transformations afterward. This method allows for much more flexibility down the line because you have more data to work with and don’t have to define a schema until later in the process.
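ADF’s copy activity handles this load for us in production, but as a rough illustration of the “load it raw, transform later” idea, here is a sketch of writing an extracted payload into a Blob Storage staging container with the azure-storage-blob SDK. The connection string, container, and blob path are placeholders.

```python
import json
from azure.storage.blob import BlobServiceClient

# ADF's copy activity does this step for us; this sketch just shows the equivalent
# "load it raw" idea using the azure-storage-blob SDK.
# The connection string, container, and blob path are placeholders.
CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;"
CONTAINER = "staging"

service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = service.get_container_client(CONTAINER)

# `rows` is whatever a source handed back (e.g. the SODA API call shown earlier).
# It is written out exactly as received -- no schema, no cleanup -- so a renamed
# or added column upstream can't break this step.
rows = [{"facility_type": "Drinking Fountain", "department": "REC"}]
container.upload_blob(
    name="rec/water_assets_raw.json",
    data=json.dumps(rows),
    overwrite=True,
)
```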

[Figure: adding the extract and load steps to our flow chart]

Transforming the data with Snowflake

We finally get to the step that we’ve all been waiting for: combining all the datasets together into a single source of truth for bathroom and water fountain locations!

As with the last step, there are many options for performing these transformations (almost any data warehouse will do!). DataSF uses Snowflake, a cloud data warehouse. Though there are pros and cons to every tool, Snowflake allows us to copy in unstructured (non-schematized) data and then define the schemas as we do the transformations. In simple terms, this means that if one of the data owners decides to delete a column, add a new value, or format their data differently, nothing will break in our pipeline up until this point. The data remains shapeless, formless, just as Bruce Lee would have wanted.
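As a sketch of what that schema-on-read approach looks like in Snowflake (the credentials, table, and field names below are illustrative, not our real schema), raw records can be landed in a single VARIANT column and only cast into typed columns when it is time to transform:

```python
import snowflake.connector

# Placeholder credentials and names -- the real ones live in DataSF's configuration.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="MY_WH", database="STAGING", schema="WATER_ASSETS",
)
cur = conn.cursor()

# Land the raw JSON in a single VARIANT column: no schema is imposed at load time,
# so a renamed or added source column can't break the pipeline at this point.
cur.execute("CREATE TABLE IF NOT EXISTS rec_raw (raw VARIANT)")

# Types and column names are only pinned down when we transform.
# (raw:objectid and raw:facility_type are illustrative field names.)
cur.execute("""
    SELECT raw:objectid::string      AS asset_id,
           raw:facility_type::string AS facility_type
    FROM rec_raw
""")
```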

Within Snowflake, all the joins (i.e., combining of data tables) and data transformations are done with SQL queries.
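A simplified sketch of the kind of query that stitches the department tables together is below; the table names, column names, and department-by-department mappings are illustrative rather than the real ones.

```python
# A simplified sketch of the SQL that combines the department tables into one.
# Table names, column names, and mappings are illustrative, not the real schema.
COMBINE_SQL = """
CREATE OR REPLACE TABLE water_assets AS
SELECT raw:facility_type::string AS facility_type,
       raw:location::string      AS location,
       'REC'                     AS department
FROM   rec_raw
UNION ALL
SELECT raw:type::string          AS facility_type,
       raw:address::string       AS location,
       'PUC'                     AS department
FROM   puc_raw
-- ...and so on for PRT, DPW, and LIB
"""

cur.execute(COMBINE_SQL)  # `cur` is the cursor opened in the previous sketch
```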

Publishing the dataset

Once the data is combined into its final form, all we have to do is pull it into our Open Data Portal with a pre-built gateway, write documentation, and give the people of San Francisco (and beyond) access to a nice, new dataset (and map).

This new dataset allows anyone to see where City-maintained public water fountains and bathrooms are located, the type of facility, and who operates them. From the Open Data Portal, you can peruse the facilities on a map, download the data for analysis, or create an automated connection via API to use in your own applications.
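If you want to build that API connection yourself, here is a small sketch using a SoQL filter. The dataset ID and column name are placeholders, so check the dataset’s API documentation on the portal for the real ones.

```python
import requests

# Placeholder dataset ID for the published water assets dataset; look up the real
# one (and the real column names) on data.sfgov.org before relying on this.
URL = "https://data.sfgov.org/resource/xxxx-xxxx.json"

# SoQL lets an application filter server-side, e.g. just the drinking fountains.
params = {"$where": "facility_type = 'Drinking Fountain'", "$limit": 1000}
fountains = requests.get(URL, params=params, timeout=30).json()
print(f"{len(fountains)} drinking fountains returned")
```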

[Figure: map of San Francisco with blue and red dots representing drinking fountains and bathrooms, respectively]
Public bathrooms and water fountains across San Francisco
