Breathing Easy with Bigtable

Elissa Lerner
Jun 13 · 6 min read

As summer wildfire season officially begins in California (it started May 20th this year, according to Cal Fire), it’s hard to believe the devastation of 2018’s Camp Fire happened barely six months ago. At the time, the air quality was so bad that it rivaled the worst pollution levels in the world. And given the recent influx of IoT-connected air sensors, real-time air quality maps with scary-looking red and purple hazard zones became ubiquitous across the internet.

But as accurate as these sensors may be, the majority of them are static and sit on top of buildings. Without deploying hundreds of sensors around the Bay Area (or anywhere else with air quality problems, for that matter), there are bound to be some areas with less coverage. If only there were a way to make these sensors mobile, process the data in real time, and maybe provide some historical and predictive insights around air quality while you’re at it. Right?

So that’s what we did.

In the months leading up to Google Cloud NEXT, a few Google engineers collaborated with Aclima to devise a solution using their mobile sensor networks. Recognizing we could essentially extend our existing Airview setup, which places Aclima’s air monitoring devices in Street View cars, we used their dataset for real-time, historical, and predictive analysis, wrapped it up in an interactive web app, and demonstrated it at NEXT. Here’s how we made it happen.

NEXT booth demo with Aclima sensor box

Architecture

Aclima already uses Google Compute Engine (GCE) and Kubernetes to translate information from sensors into data points for analysis and write them to Google Cloud Storage and BigQuery. But for the real-time analytics solution we had in mind, we needed something different. So we looked to Cloud Bigtable, which is ideal for large time-series data. We knew Bigtable would easily be able to handle the queries coming from the application we built, as well as the writes coming from the Aclima sensors, all at millisecond latencies.

Our ingestion pipeline was fairly straightforward. All sensor data was calibrated by Aclima’s pipeline, which receives sensor data in batches, processes it, and stores it in BigQuery. To that, we attached a lightweight (read: not production-quality) program that reads the data from BigQuery and publishes it to Pub/Sub for real-time processing and visualization. From there, our Dataflow pipeline picks up the data from Pub/Sub and sends it to GeoMesa, an HBase integration. Once processed, the data is written to Bigtable.

Ingestion pipeline
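The bridge program between BigQuery and Pub/Sub isn't shown in this post. As a rough sketch only, assuming each calibrated reading is a row with hypothetical field names (`lat`, `lon`, `timestamp`, `no2_ppb`), it might turn rows into the byte payloads that Pub/Sub messages carry. The publish call itself is left out here; in practice it would go through a Pub/Sub client library.

```python
import json
from typing import Dict, Iterable, Iterator


def rows_to_messages(rows: Iterable[Dict]) -> Iterator[bytes]:
    """Encode each reading as UTF-8 JSON bytes, one payload per row."""
    for row in rows:
        yield json.dumps(row, sort_keys=True).encode("utf-8")


def decode_message(payload: bytes) -> Dict:
    """Inverse of the encoding above, as a downstream consumer would apply it."""
    return json.loads(payload.decode("utf-8"))


# Made-up reading for illustration only; field names are assumptions.
sample_rows = [
    {"lat": 37.7749, "lon": -122.4194,
     "timestamp": "2019-04-09T17:00:00Z", "no2_ppb": 21.5},
]
messages = list(rows_to_messages(sample_rows))
```

Keeping the encoding step separate from the publish step makes it easy to test the payload format without any cloud credentials.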

From past experience, ranging from GeoMesa demos back at NEXT ’17 to the Codelabs we’ve built to help users get familiar with designing geospatial data schemas, we knew GeoMesa would be helpful for this kind of data project. Since the standard Bigtable APIs only let you query by row key (or a range of row keys), creating more nuanced queries requires planning, iterating, and testing. With just a few lines of code, however, GeoMesa builds row keys along a Z-curve that combines latitude, longitude, and timestamp, so that points close together in space and time tend to end up close together in the table and queries stay performant.

Sample Z-Curve from GeoMesa
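To make the Z-curve idea concrete, here is a minimal sketch of bit interleaving in plain Python. It is not GeoMesa's actual implementation: the function names, the 21 bits per dimension, and the one-week time window are all assumptions chosen for illustration.

```python
def normalize(value: float, lo: float, hi: float, bits: int = 21) -> int:
    """Scale a value in [lo, hi] to an integer in [0, 2**bits - 1]."""
    max_int = (1 << bits) - 1
    frac = (value - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range inputs
    return int(frac * max_int)


def interleave3(x: int, y: int, t: int, bits: int = 21) -> int:
    """Interleave the low `bits` bits of x, y, and t into one integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (3 * i)
        z |= ((y >> i) & 1) << (3 * i + 1)
        z |= ((t >> i) & 1) << (3 * i + 2)
    return z


SECONDS_PER_WEEK = 7 * 24 * 3600  # assumed time window for this sketch


def z3_key(lat: float, lon: float, seconds_into_week: float) -> int:
    """Combine longitude, latitude, and time into a single sortable key."""
    x = normalize(lon, -180.0, 180.0)
    y = normalize(lat, -90.0, 90.0)
    t = normalize(seconds_into_week, 0.0, SECONDS_PER_WEEK)
    return interleave3(x, y, t)


key = z3_key(37.7749, -122.4194, 3600)
```

Because the three dimensions share the high-order bits of the key, a bounding box in space and time maps to a small number of contiguous key ranges, which is the kind of row-range scan Bigtable handles well.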

With our underlying pipeline set and retrieving as many as 50,000 data points per second on busy days, we were ready to build a front-end app. We knew we wanted to be able to look at a given area in a selected time frame. We used Angular Material and the Google Maps API to create views for both live data and historical data, since they were stored and queried the same way. While Google Maps isn’t really a high-fidelity data visualization tool, it does make it easy to draw legends and heat maps. So we decided to sample just every tenth point of the data in order to retain the full structure of the map while removing some of the data density. Even with these concessions, the app still delivers street-level granularity of air quality data.
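Sampling every tenth point amounts to a one-line stride over the ordered results. The sketch below is in Python for brevity rather than the app's own front-end code, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_every_nth(points: Sequence[T], n: int = 10) -> List[T]:
    """Keep the 1st, (n+1)th, (2n+1)th, ... points and drop the rest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return list(points[::n])


# Stand-in values; real points would carry coordinates and readings.
thinned = sample_every_nth(list(range(100)))
```

Striding over points that are already ordered along each drive keeps the overall shape of the routes on the map while cutting the number of markers by roughly a factor of ten.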

Machine learning and analysis

Aggregating live data and serving it in real time is pretty neat, but we wondered what else we might be able to do with the power of Bigtable and approximately three years of historical data in the Bay Area. Since we had metrics by latitude, longitude, and date, might we be able to make predictions about air quality on a future date? Or create a prediction for a past date that didn’t have existing data?

We decided to start training a machine learning model in TensorFlow using the data in Bigtable. Normally, TensorFlow models rely on data in GCS. But one of the perks of Bigtable is its extremely fast read speed, which means you can train a model without having to move data out of Bigtable. This is especially helpful if you’re using an online model, which is constantly updating and learning.

We ended up training our model on GCE, because we needed TensorFlow 1.11, which at the time wasn’t available on Cloud Machine Learning Engine (CMLE). If we were to redo the experiment today though, we’d be able to perform both the training and serving from AI Platform using Bigtable.

We tried a few different models, and ultimately ended up with a linear regression model using input features of NO, CO2, BC, hour of day, day of month, month, and zip code. The model was then hosted on CMLE, which provided an endpoint from which we could serve online predictions via Cloud Functions.

Training ML model
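For readers less familiar with linear models, the arithmetic behind one prediction looks roughly like the following. This is plain Python rather than the TensorFlow code we trained with, and the field names, the one-hot treatment of zip codes, and the placeholder weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def encode_features(reading: Dict, zip_codes: Sequence[str]) -> List[float]:
    """Numeric features first, then a one-hot vector for the zip code."""
    numeric = [
        reading["no"], reading["co2"], reading["bc"],
        reading["hour"], reading["day"], reading["month"],
    ]
    one_hot = [1.0 if reading["zip_code"] == z else 0.0 for z in zip_codes]
    return [float(v) for v in numeric] + one_hot


def predict(weights: Sequence[float], bias: float,
            features: Sequence[float]) -> float:
    """A linear model: bias plus the weighted sum of the features."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Made-up reading and placeholder weights, not values from the trained model.
zip_codes = ["94103", "94607"]
reading = {"no": 2.0, "co2": 410.0, "bc": 0.5,
           "hour": 17, "day": 9, "month": 4, "zip_code": "94607"}
features = encode_features(reading, zip_codes)
weights = [1.0] + [0.0] * (len(features) - 1)
estimate = predict(weights, 0.5, features)
```

Training is the process of choosing the weights and bias so that predictions like `estimate` match observed values as closely as possible.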

We decided to focus our experiments on NO2 predictions since we had the most data for that value. The Cloud Function would produce the predicted NO2 level for any zip code and date combination, and the entire system could continue to update as new data was ingested.

Serving architecture

While not earth-shattering news, we found that heavy traffic affected NO2 levels: bottlenecks around the Bay Bridge and areas next to airports and highways showed consistently higher levels of NO2. We also found that the feature that influenced this prediction the most was the presence of NO. This makes sense: NO is the primary nitrogen gas pollutant emitted by sources like vehicle tailpipes. It oxidizes in the atmosphere in less than a minute, making this oxidation one of the more common ways that NO2 is formed. And while CO2 is another indicator of combustion, the relative amounts of CO2 and NO emitted depend on the source type, so there’s less of a correlation between CO2 and NO2.

Sample prediction

The bigger the better

Ultimately, this kind of street-level air quality demo is just a proof-of-concept of the ways you might use Bigtable to work with real-time data. As Aclima continues to add more cars and regions around the world, the insights and patterns around air quality will also grow. You could imagine tracking (and predicting) spikes of pollutants in the air throughout fire seasons, or other environmental factors. You could imagine supplementing this information with public datasets like Historical Air Quality from the EPA or elsewhere. Or perhaps you’re imagining something entirely different. Whatever your biggest, fastest problem may be, Bigtable can handle it.

To try out Bigtable on your own, check out this Codelab.

Thanks to Billy Jacobson (@billyjacobson), Daniel Bergqvist (@bexie), and Robert Kubis (@hostirosti)

Google Cloud Platform - Community

A collection of technical articles published or curated by Google Cloud Platform Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Written by Elissa Lerner

Writer. Editor. Googler. New Yorker.
