BigQuery best practices for IoT
Google Cloud best practices for data science with IoT.
Here are a few essentials for getting “data science ready” data into Google BigQuery.
This applies primarily to industrial applications like predictive maintenance but is illustrated here in some fun examples for the home.
Engineering Requirements:
- Each sensor gets its own microsecond UNIX timestamp.
- The device (holding multiple sensors) also gets its own timestamp.
- The server stamps the device data packet when it arrives.
- An atomic clock reference for all three of the above (the device pulls this on boot).
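As a minimal sketch of the requirements above, a single BigQuery-bound event could carry all three timestamps side by side. The field names (`sensor_ts_us`, `device_ts_us`, `server_ts_us`) and helper functions are illustrative, not a fixed schema:

```python
import json
import time

def make_event(device_id, sensor_id, value, sensor_ts_us, device_ts_us):
    """Build one sensor event carrying the two source timestamps.
    Both come from the device side, synced to the atomic clock reference."""
    return {
        "device_id": device_id,
        "sensor_id": sensor_id,
        "value": value,
        "sensor_ts_us": sensor_ts_us,   # per-sensor microsecond UNIX timestamp
        "device_ts_us": device_ts_us,   # device packet timestamp
    }

def server_stamp(event):
    """The ingest server adds its own timestamp when the packet arrives."""
    stamped = dict(event)
    stamped["server_ts_us"] = time.time_ns() // 1_000
    return stamped

event = server_stamp(
    make_event("dev-01", "light-03", 412.0, 1_690_000_000_123_456, 1_690_000_000_123_789)
)
print(json.dumps(event, indent=2))
```

Keeping all three timestamps in every row is what later lets a data scientist cross-check sensor, device and server clocks against each other.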
BigQuery errors which drive data scientists crazy:
- device_time increased: OK
- server_time follows: OK
- the light values changed quickly (blue circle): OK
- BUT the sensor timestamp (red circle) stayed the same: NOT OK
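The “NOT OK” case can be caught with a simple data-quality check. A hedged sketch over in-memory rows (in production this would be a query over the BigQuery table; the field names are hypothetical):

```python
def stale_timestamp_rows(rows):
    """Flag rows where the sensor value changed but the sensor
    timestamp did not advance, i.e. a frozen timestamp source."""
    flagged = []
    for prev, cur in zip(rows, rows[1:]):
        if cur["value"] != prev["value"] and cur["sensor_ts_us"] <= prev["sensor_ts_us"]:
            flagged.append(cur)
    return flagged

rows = [
    {"sensor_ts_us": 1_000_000, "value": 10.0},
    {"sensor_ts_us": 1_000_250, "value": 10.2},
    {"sensor_ts_us": 1_000_250, "value": 55.9},  # value jumped, timestamp frozen
]
flagged = stale_timestamp_rows(rows)
print(flagged)  # -> the third row
```

Running a check like this continuously on the ingest stream surfaces timestamp bugs before they poison the training data.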
Issue
Handling concurrent microsecond timestamps across 11 sensors in real time (synced to an atomic clock) is hopeless in Linux, but it can be done in an RTOS (C / Assembler). We used a version from commercial helicopters, where synchronous inputs are vital, running on an ARM Cortex-M4 with FPU and crypto:
Why is this important?
If we don’t make this effort at the IoT source level, working closely with the cloud, then the data flowing into Google Pub/Sub => Google Dataflow will be junk.
Timestamps need to be precise for each sensor event and synchronous with each other, or data scientists get very frustrated.
Imagine driving a car while looking out at traffic movements from 10 seconds ago. Would you drive? Real-time data and near-real-time analysis in milliseconds are crucial for anything termed “important” or “urgent”.
Share prices, aviation, military intelligence, energy supply grids, hospitals — but what about any asset which will be online in the future?
Consider that the first customer in this value chain is the data scientist. Only they can tell a product owner what might be possible with the data.
Examples using Google DataLab
The noise (dB) levels shown in blue (digitally rectified audio: amplitude only, not frequency, so no speech) correlate well with the thermal imaging of a person in red. We can infer that the person is alive and doing something like cooking in that location, without needing to film them or overhear their voice.
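The correlation claim above can be reproduced on any pair of timestamp-aligned, equally sampled streams with a plain Pearson coefficient. A sketch with toy numbers (not the actual lab data):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equally sampled,
    timestamp-aligned sensor streams."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy values: noise rises and falls together with the thermal reading
noise_db = [35, 38, 52, 60, 58, 41, 36]
thermal  = [20, 21, 26, 30, 29, 23, 21]
print(round(pearson(noise_db, thermal), 2))
```

Note that this only works because both streams carry accurate source timestamps; misaligned streams would have to be resampled first, and any clock skew directly degrades the correlation.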
With two sensors now in the room, the sensor streams no longer correlate well.
Both data sets contain about 2 million JSON readings, accurately timestamped. The cost on a Google Cloud IoT system is about 4 EUR per year per board.
How long in the bathroom?
What happened in the toilet?
A small room in which the CO2 starts to increase with breathing, then there is a large emission of organic gases, then the extractor fan removes all gases to a much lower level: all in 7 minutes.
This sequence then created the machine learning label “person went successfully to the toilet today”, which (excuse the Beavis and Butt-head tittering in the background) is actually one of the top 4 questions between doctors and elderly patients. Machine learning features can preserve the privacy of the person.
The vibration “peaks” in that same toilet location correlate with entering the room (lights going on and off), which hardens the ML label. Without timestamps it would only be a mush of information. Instead we have 95% confidence, in real time, in a complex human action. This is then clear enough for an AI to “understand”.
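The CO2 / organic-gas / fan sequence above amounts to a small state machine over the timestamped samples. A very simplified sketch, with illustrative (uncalibrated) thresholds and hypothetical field names:

```python
def detect_visit(samples, co2_occupied=600, voc_spike=400, voc_clear=100):
    """Toy state machine for the sequence described above:
    CO2 rises (breathing) -> organic-gas spike -> extractor fan
    clears the room. Returns True when the full sequence is seen."""
    state = "idle"
    for s in samples:  # samples must be ordered by sensor timestamp
        if state == "idle" and s["co2"] >= co2_occupied:
            state = "occupied"
        elif state == "occupied" and s["voc"] >= voc_spike:
            state = "emission"
        elif state == "emission" and s["voc"] <= voc_clear:
            return True  # label: person went successfully to the toilet
    return False

samples = [
    {"co2": 420, "voc": 50},   # baseline
    {"co2": 650, "voc": 80},   # breathing raises CO2
    {"co2": 700, "voc": 520},  # organic-gas emission
    {"co2": 500, "voc": 60},   # fan has cleared the air
]
print(detect_visit(samples))  # True
```

The whole sequence only makes sense because the samples are ordered by accurate source timestamps; shuffle the order and the same readings produce no label.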
The same can be done for sleeping, eating and living-room activities, which over time create share-price-like graphs of a person’s life (and there is a ton of useful analytics software for finance which can be re-purposed for accurate health care trends).
How will I feel in 2 days?
Imagine asking the question “how is my mother?” and seeing that she is OK, in real time, with predictive medical insight for the future.
Without her wearing anything, no cameras or speech microphones. How valuable is that insight to a concerned relative? Or possibly a service trying to manage hundreds of thousands of patients in a city at the same time?
It is one thing to have real time data at low cost.
It is another to have the analytics on that data processed fast enough, also in real time, to produce results with less than 100 milliseconds of latency at scale.
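The core of such low-latency analytics is windowed aggregation over the event timestamps. A toy stand-in (a streaming engine like Dataflow does this at scale, but the idea is the same):

```python
from collections import deque

class SlidingWindow:
    """Keep only the last `span_us` microseconds of readings and answer
    aggregate queries in amortized O(1) per event."""

    def __init__(self, span_us):
        self.span_us = span_us
        self.items = deque()  # (ts_us, value) pairs, ordered by timestamp
        self.total = 0.0

    def add(self, ts_us, value):
        self.items.append((ts_us, value))
        self.total += value
        # evict readings that fell out of the window
        while self.items and self.items[0][0] < ts_us - self.span_us:
            _, old = self.items.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.items)

w = SlidingWindow(span_us=1_000_000)  # 1-second window
for ts, v in [(0, 10.0), (400_000, 20.0), (900_000, 30.0), (1_600_000, 40.0)]:
    w.add(ts, v)
print(w.mean())  # 35.0: only the last two readings are still in the window
```

Note that the eviction logic is driven entirely by the event timestamps, which is why imprecise source stamping breaks real-time aggregation regardless of how fast the compute is.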
Here is how this will look in production.
Factory environments:
Temperature, pressure, humidity and organic gases for each day of the week. Strong outliers in the organic gases show emissions / cleaning events.
For certifying safety and environmental compliance, BigQuery messages are easily authenticated and hashed at source using the on-board crypto.
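Authenticate-and-hash at source can be sketched with an HMAC over the serialized reading. This mimics what an on-board crypto unit would do before publishing; the key handling here is illustrative only (a real device would keep the key in a secure element):

```python
import hashlib
import hmac
import json

DEVICE_KEY = b"demo-key-loaded-from-secure-element"  # illustrative only

def sign_message(payload: dict) -> dict:
    """Hash and authenticate one reading before it leaves the device."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(DEVICE_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "hmac_sha256": tag}

def verify_message(msg: dict) -> bool:
    """Server-side check that the payload was not altered in transit."""
    body = json.dumps(msg["payload"], sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["hmac_sha256"])

msg = sign_message({"sensor_ts_us": 1_000_250, "voc": 520})
print(verify_message(msg))  # True
```

Because the timestamp is inside the signed payload, a verified message also proves when the reading was taken, which is what makes it usable as compliance evidence.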
Large Office:
For info on the GCP IoT architecture used, see IoT on Google Cloud at scale.
Thank you to three ex-colleagues from the LHC at CERN who enlightened me on the importance of timestamping at source, which made this IoT lab possible. This work led to a number of patent applications.
Key:
93E3 BEBC C164 D766
publicly auditable identity here