BigQuery best practice IoT

Google Cloud best practice for data science with IoT.

Nicholas Ord
Google Cloud - Community
5 min read · Apr 1, 2019


Here are a few essentials for getting “data science ready” data into Google BigQuery.

This applies primarily to industrial applications like predictive maintenance, but is illustrated here with some fun examples for the home.

Engineering Requirements:

  1. Each sensor gets its own microsecond timestamp (UNIX).
  2. The device (holding multiple sensors) also gets its own timestamp.
  3. The server stamps that device data packet when it arrives.
  4. Atomic clock reference for all three of the above (the device pulls this on boot). A sketch of the resulting packet layout follows this list.
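A minimal sketch of what a single device packet might look like under these requirements; the field names are my own illustration, not the article’s actual schema:

    import json
    import time

    # Hypothetical packet layout (field names are illustrative): every sensor
    # reading carries its own microsecond UNIX timestamp, the device adds one
    # for the whole packet, and the server appends a third on arrival. All
    # three are referenced to the atomic-clock sync pulled at boot.
    def build_packet(device_id, readings):
        return {
            "device_id": device_id,
            "device_ts_us": int(time.time() * 1_000_000),  # device-level timestamp
            "sensors": [
                {
                    "sensor_id": r["sensor_id"],
                    "value": r["value"],
                    "sensor_ts_us": r["sensor_ts_us"],  # per-sensor timestamp
                }
                for r in readings
            ],
            # "server_ts_us" is added by the ingestion service, not the device.
        }

    packet = build_packet("kitchen-01", [
        {"sensor_id": "light", "value": 312.4, "sensor_ts_us": 1554105600123456},
        {"sensor_id": "co2", "value": 640.0, "sensor_ts_us": 1554105600123501},
    ])
    print(json.dumps(packet, indent=2))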

BigQuery errors which drive data scientists crazy:

BigQuery entry error: device_time increased (OK) and server_time followed (OK); the light values changed quickly (blue circle), but the sensor timestamp (red circle) stayed the same (NOT OK).
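One quick sanity check for that class of error is a query that flags rows where a sensor’s timestamp stops moving while the device timestamp keeps advancing. A minimal sketch using the google-cloud-bigquery Python client; the table and column names are illustrative assumptions, not the article’s actual schema:

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()

    # Flag readings whose sensor timestamp did not move between consecutive
    # rows even though the device timestamp did (the "red circle" case).
    query = """
    WITH ordered AS (
      SELECT
        device_id,
        sensor_id,
        device_ts_us,
        sensor_ts_us,
        LAG(sensor_ts_us) OVER (PARTITION BY device_id, sensor_id
                                ORDER BY device_ts_us) AS prev_sensor_ts_us
      FROM `my_project.iot.readings`  -- illustrative table name
    )
    SELECT *
    FROM ordered
    WHERE sensor_ts_us = prev_sensor_ts_us  -- the timestamp is stuck
    """
    for row in client.query(query).result():
        print(row.device_id, row.sensor_id, row.sensor_ts_us)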

Issue

Handling concurrent microsecond timestamps across 11 sensors in real time (synced to an atomic clock) is hopeless in Linux, but it can be done in an RTOS (C / Assembler). We used a version from commercial helicopters, where synchronous inputs are vital, running on an ARM Cortex-M4 with FPU and crypto:

Real-time section kept separate from the IP/Linux section

Why is this important?

If we don’t make this effort at the IoT source level, working closely with the cloud, then the data flowing into Google Pub/Sub => Google Dataflow will be junk.
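This is roughly where the effort pays off in code: whatever the device publishes is all that Pub/Sub, Dataflow and ultimately BigQuery will ever see. A minimal publishing sketch with the google-cloud-pubsub client; the project, topic and field names are illustrative assumptions:

    import json
    import time

    from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

    # Illustrative project and topic names; the packet shape follows the
    # requirements above (per-sensor and device timestamps set at source).
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "iot-readings")

    packet = {
        "device_id": "kitchen-01",
        "device_ts_us": int(time.time() * 1_000_000),
        "sensors": [{"sensor_id": "light", "value": 312.4,
                     "sensor_ts_us": int(time.time() * 1_000_000)}],
    }

    # Dataflow and BigQuery only ever see what gets published here, so the
    # timestamps must already be correct and synchronous at this point.
    future = publisher.publish(topic_path, data=json.dumps(packet).encode("utf-8"))
    print("published message id:", future.result())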

Timestamps need to be precise for each sensor event and synchronous with each other, or data scientists get very frustrated.

Imagine driving a car but looking out at traffic movements from 10 seconds ago. Would you drive? Real-time data and near-real-time analysis in milliseconds are crucial for anything termed “important” or “urgent”.

Share prices, aviation, military intelligence, energy supply grids, hospitals — but what about any asset which will be online in the future?

Consider that the first customer in this value chain is the data scientist. Only they can tell a product owner what might be possible with the data.

Examples using Google DataLab

The noise dB levels shown in blue (digitally rectified audio — only amplitude not frequency; so no speech) correlate well with the thermal imaging of a person in red. We can infer that the person is alive and doing something like cooking in that location without needing to film them or overhear their voice.
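A minimal sketch of how that correlation check might look in a Datalab-style notebook with pandas, assuming both streams have already been pulled into DataFrames with microsecond UNIX timestamps; the column names and sample values are illustrative:

    import pandas as pd

    # Illustrative data: two independently timestamped streams.
    noise = pd.DataFrame({
        "ts_us": [1554105600000000, 1554105601000000, 1554105602000000],
        "db": [42.0, 55.0, 58.5],
    })
    thermal = pd.DataFrame({
        "ts_us": [1554105600200000, 1554105601100000, 1554105602300000],
        "heat": [0.1, 0.7, 0.8],
    })

    def to_series(df, col):
        # Convert microsecond UNIX timestamps to a DatetimeIndex and resample
        # onto a common 1-second grid so the two streams line up.
        idx = pd.to_datetime(df["ts_us"], unit="us")
        return df.set_index(idx)[col].resample("1s").mean().interpolate()

    db = to_series(noise, "db")
    heat = to_series(thermal, "heat")

    # With precise, synchronous timestamps this number is meaningful;
    # with stuck or drifting timestamps it is just noise.
    print("correlation:", db.corr(heat))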

Two sensors now in the room and the sensor streams do not correlate well

Both data sets contain about 2 million JSON readings, accurately timestamped. The cost on a Google Cloud IoT system is about 4 EUR per year per board.

How long in the bathroom?

How long someone was in the bathroom can be determined from accurate overlays of timestamped thermal sensor data points: the more intense the reading, the longer the person’s body heat made an impression, similar to a telescope on a long exposure looking at distant stars.
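As an illustration of the long-exposure idea (my own sketch, not the article’s code): timestamped thermal detections can be accumulated into a 2-D histogram weighted by dwell time, so the cells where body heat lingered longest become the most intense:

    import numpy as np

    # Illustrative detections: (x, y) grid position plus a microsecond
    # timestamp for every thermal reading above a presence threshold.
    xs = np.array([2.1, 2.2, 2.0, 2.1, 5.5])
    ys = np.array([3.0, 3.1, 3.0, 2.9, 1.0])
    ts = np.array([0, 1_000_000, 2_000_000, 3_000_000, 4_000_000])  # microseconds

    # Weight each detection by the time until the next one, so the histogram
    # behaves like a long photographic exposure: longer presence, brighter cell.
    dwell_s = np.append(np.diff(ts), 1_000_000) / 1e6
    heatmap, _, _ = np.histogram2d(xs, ys, bins=8, range=[[0, 8], [0, 8]],
                                   weights=dwell_s)
    print(heatmap)  # seconds of presence per grid cell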

What happened in the toilet?

A small room in which the CO2 starts to increase with breathing, followed by a large emission of organic gases, and then the extractor fan removes all gases to a much lower level, all within 7 minutes.

This sequence then created a machine learning label “person went successfully to the toilet today” — which (excuse the Beavis and Butt-head tittering in the background) is actually one of the top 4 questions between doctors and elderly patients. Machine learning features can preserve the privacy of the person.

Machine learning the action of a toilet: gas combinations over 7 minutes combined with thermal imaging (interpolated algorithms) of a person sitting.

The vibration “peaks” in that same toilet location correlate with entering the room (lights going on and off), which hardens the ML label. If the timestamps were not there, it would only be a mush of information. Instead we have 95% confidence, in real time, in a complex human action. This is then clear enough for an AI to “understand”.
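A hedged sketch of how such a label might be derived once the streams share one accurate time base: look for a window of roughly 7 minutes where CO2 rises, an organic-gas (VOC) spike follows, the thermal signal confirms a person, and vibration marks entering the room. The thresholds, column names and synthetic data are my own illustration, not the author’s model:

    import numpy as np
    import pandas as pd

    # Synthetic 7-minute window, 1-second resolution, one shared timestamped index.
    idx = pd.date_range("2019-04-01 08:00", periods=420, freq="1s")
    df = pd.DataFrame({
        "co2": np.linspace(600, 900, 420),  # breathing raises CO2
        "voc": np.r_[np.full(200, 50.0), np.full(40, 400.0), np.full(180, 60.0)],
        "thermal": np.r_[np.zeros(30), np.full(360, 0.8), np.zeros(30)],
        "vibration": np.r_[np.full(5, 0.5), np.zeros(410), np.full(5, 0.5)],
    }, index=idx)

    def label_toilet_event(window: pd.DataFrame) -> bool:
        co2_rises = window["co2"].iloc[-60:].mean() > window["co2"].iloc[:60].mean() + 100
        voc_spike = window["voc"].max() > 3 * window["voc"].median()
        person_there = window["thermal"].max() > 0.5
        door_event = window["vibration"].max() > 0.2  # lights / door vibration peaks
        return co2_rises and voc_spike and person_there and door_event

    print(label_toilet_event(df))  # True for this synthetic sequence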

The same can be done for sleeping, eating and living room activities, which over time create share-price-like graphs of a person’s life (and there is a ton of useful analytics software for finance which can be re-purposed for accurate health care trends).

How will I feel in 2 days?

Imagine asking the question “how is my mother?” and seeing that she is OK, in real time, with predictive medical insight for the future.

Without her wearing anything, no cameras or speech microphones. How valuable is that insight to a concerned relative? Or possibly a service trying to manage hundreds of thousands of patients in a city at the same time?

It is one thing to have real time data at low cost.

It is another to process analytics on that data fast enough, also in real time, to produce results with less than 100 milliseconds of latency at scale.

Here is how this will look in production.

Factory environments:

Temperature, pressure, humidity and organic gases for each day of the week. Strong outliers in the organic gases show emissions / cleaning events.

For certifying safety and environmental compliance, BigQuery messages are easily authenticated and hashed at source using on-board crypto.
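A minimal sketch of hashing and authenticating a message at source, assuming a per-device key held in the board’s crypto element; the key handling and field names here are illustrative, not the production scheme:

    import hashlib
    import hmac
    import json

    DEVICE_KEY = b"stored-in-on-board-secure-element"  # illustrative only

    def sign_packet(packet: dict) -> dict:
        # Canonical JSON so the hash is reproducible on the verification side.
        body = json.dumps(packet, sort_keys=True, separators=(",", ":")).encode()
        packet["sha256"] = hashlib.sha256(body).hexdigest()
        packet["hmac"] = hmac.new(DEVICE_KEY, body, hashlib.sha256).hexdigest()
        return packet

    def verify_packet(packet: dict) -> bool:
        claimed = packet.get("hmac", "")
        body = {k: v for k, v in packet.items() if k not in ("sha256", "hmac")}
        raw = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
        expected = hmac.new(DEVICE_KEY, raw, hashlib.sha256).hexdigest()
        return hmac.compare_digest(claimed, expected)

    signed = sign_packet({"device_id": "factory-07", "voc": 812.0,
                          "device_ts_us": 1554105600123456})
    print(verify_packet(signed))  # True; any tampering with the row breaks it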

Temperature (tight boxes) versus pressure (more variable due to weather changes)
Humidity versus organic gases (which have outlier spikes from machine cleaning fluids)

Large Office:

CO2 in a typical office over 12 hours. At 1750 ppm brain function significantly decreases. Some offices had 4 clusters of sensors per 100 m², and we found that clouds of CO2 tend to hang heavy in the corners even after the windows are opened. The way to get back to the 400 ppm CO2 base level is to circulate the air in the corners.
Same office with light coming through a closed, east-facing window: the sunlight peaks at midday until the sun goes around the wall, and the closed window explains the CO2 increase. Objects like chairs move (accelerometer Z axis). People going into closed office meetings after 13:00 show a reduction in dB, correlating with less vibration.

For info on the GCP IoT architecture used, check IoT on Google Cloud at scale.

Thank you to 3 ex-colleagues from the LHC at CERN who enlightened me on the importance of timestamping at source when creating this IoT lab. This work led to a number of patent applications.

Key:

93E3 BEBC C164 D766

publicly auditable identity here
