What’s the problem?
I’ll begin with a short motivation for why exploratory data analysis (EDA) may be important for you. The data provided by sensors never exactly matches the conditions in the real world.
Depending on the domain you are working in, sensor data quality may be more or less important. If you are working with temperature sensor data in the home automation domain, accuracy is usually not critical. If you work with sensor data in the pharma domain, minor temperature differences in the processing chain of a drug might make a big difference w.r.t. the quality of the drug. In all cases where “wrong” sensor data might result in major harm, you’ll usually care a lot about sensor data quality.
If you care about sensor data quality, you can analyze it using EDA, which is “an approach to analyzing data sets to summarize their main characteristics”, usually as metrics or visualizations based on descriptive statistics. Analysis results have to be communicated to teammates, stakeholders, customers, etc. This can be done in a compact form using metrics such as variance. However, communicating analysis results using visualization is often more efficient because the results are easier to understand, especially in the case of multivariate sensor data (more about this topic later). Sensor data quality is typically affected by the following error types:
- Outliers (synonyms: anomalies, spikes): “Values that exceed thresholds or largely deviate from the normal behaviour.”
- Missing data: “Incomplete data.”
- Bias (synonym: offset): “A value that is shifted in comparison with the normal behaviour of a sensor.”
- Drift: “Readings that deviate from its true value over time due to the degradation of sensing material which is an irreversible chemical reaction.”
- Noise: “Small variations in the dataset.”
- Constant value: “Readings with a constant value over time.”
- Uncertainty: “A parameter, associated with the result of a measurement, that characterizes the dispersion of the values that could reasonably be attributed to the measurement.”
- Stuck-at-zero (synonym: dead sensor fault): “Values that are constantly at zero over an extended period of time.”
Depending on the sensor type and the sensing principle used, a sensor is affected by more or fewer error types. Sensors based on physical principles, e.g. an ambient light sensor using a photodetector, are usually affected less because they usually do not suffer from drift. In comparison, sensors based on chemical principles are affected more because they usually do suffer from drift.
One side note w.r.t. drift: the definition above is incomplete. In addition to deviation from a true value over time due to degradation of the sensing material, drift may also occur due to saturation effects w.r.t. chemical reactions. Depending on the chemical reactions, these effects may be irreversible or reversible.
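Some of the error types above can be flagged with a few lines of pandas. The following is only a minimal sketch on synthetic data: the readings, the function name `flag_errors`, the outlier threshold and the window length are assumptions made for illustration, and real sensors will need domain-specific rules.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical temperature readings sampled once per minute.
index = pd.date_range("2024-01-01", periods=12, freq="min")
readings = pd.Series(
    [21.0, 21.1, 20.9, np.nan, 21.0, 35.0, 21.1, 0.0, 0.0, 0.0, 0.0, 21.0],
    index=index,
)


def flag_errors(series, outlier_threshold=5.0, stuck_window=3):
    """Return boolean flags for three simple error types (illustrative only)."""
    flags = pd.DataFrame(index=series.index)
    # Missing data: samples without a value.
    flags["missing"] = series.isna()
    # Outliers: values far away from the median (threshold is an assumption).
    flags["outlier"] = (series - series.median()).abs() > outlier_threshold
    # Stuck-at-zero: zero for at least `stuck_window` consecutive samples.
    is_zero = series.eq(0.0)
    run_id = is_zero.astype(int).diff().ne(0).cumsum()
    run_length = is_zero.groupby(run_id).transform("size")
    flags["stuck_at_zero"] = is_zero & (run_length >= stuck_window)
    return flags


flags = flag_errors(readings)
print(flags.sum())
```

Note that flags may overlap: in this sketch the stuck-at-zero samples are also far from the median and are therefore flagged as outliers as well.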
It’s all about system thinking
Error types may result from various root causes w.r.t. the overall system. E.g. outliers may result from operating a sensor outside the specified supply voltage range, and the magnitude of noise usually depends on the supply voltage applied. In both cases the scope of the root cause is the sensor device. Missing data may result from unreliable data transmission, e.g. messages lost by the messaging technologies widely used in the IoT domain. In this case the root cause resides in the architecture or implementation of the distributed system. From a data analysis point of view you initially do not care about the root causes of “bad quality” sensor data. However, EDA may help you to identify the root causes in the overall system. Look at the same data at the same points in time at different stages of the overall system (sensor → sensor device → messaging broker component of a distributed system → etc.) and you’ll be able to identify the root causes of “bad quality” sensor data.
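Comparing the same samples at two stages of the system can be sketched as follows. The data is synthetic and the stage names `device` and `broker` are assumptions for illustration. The idea is simply to align both stages on the timestamp and look for samples that exist upstream but not downstream.

```python
import pandas as pd

# Hypothetical samples captured at two stages of the overall system.
timestamps = pd.date_range("2024-01-01 00:00", periods=6, freq="s")
at_device = pd.Series([20.1, 20.2, 20.1, 20.3, 20.2, 20.4], index=timestamps)
# Simulate two messages lost between the sensor device and the broker.
at_broker = at_device.drop(timestamps[[2, 4]])

# Align both stages on the timestamp; samples missing at a stage become NaN.
stages = pd.concat({"device": at_device, "broker": at_broker}, axis=1)

# Samples that exist at the device but never arrived at the broker.
lost_in_transmission = stages[stages["device"].notna() & stages["broker"].isna()]
print(lost_in_transmission.index.tolist())
```

If a sample is already missing at the device stage, the root cause is more likely located in the sensor device than in the data transmission.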
Several root causes may contribute to a single error type. E.g. missing values usually result from unreliable data transmission (distributed system) as well as from unreliable processing of single sample values over time (sensor device). As almost all sensors provide digital output signals, some part of the noise results from quantization error (inaccuracy of the output signal because digital signals are quantized and never as accurate as analog signals). This error is introduced by the sensor’s signal processing unit and cannot be compensated. In addition, another part of the overall noise may result from the dependence of the output signal on the temperature the sensor is operated at. This second root cause may be compensated using a characterization curve (output signal vs. operating temperature) provided by the sensor vendor.
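Both effects can be illustrated with a short sketch. The characterization curve below is made up (it is not taken from any datasheet), and the helper names `compensate` and `max_quantization_error` are hypothetical. The quantization bound assumes an ideal converter whose worst-case rounding error is half of one step.

```python
import numpy as np

# Hypothetical characterization curve: relative output (1.0 at 25 °C)
# versus operating temperature, as a vendor datasheet might tabulate it.
curve_temperature_c = np.array([-10.0, 0.0, 25.0, 50.0, 85.0])
curve_relative_output = np.array([0.94, 0.96, 1.00, 1.03, 1.08])


def compensate(raw_output, temperature_c):
    """Scale raw readings back to their equivalent value at 25 °C."""
    # Linear interpolation between the tabulated points of the curve.
    factor = np.interp(temperature_c, curve_temperature_c, curve_relative_output)
    return np.asarray(raw_output) / factor


def max_quantization_error(full_scale_range, bits):
    """Worst-case rounding error of an ideal converter: half of one step."""
    step = full_scale_range / (2 ** bits)
    return step / 2


raw = np.array([103.0, 100.0, 96.0])
temperature = np.array([50.0, 25.0, 0.0])
compensated = compensate(raw, temperature)
print(compensated)
print(max_quantization_error(full_scale_range=3.3, bits=12))
```

The temperature dependence can be removed because it is known, while the quantization error only has a known bound and stays in the data.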
Sensor types provide univariate or multivariate data
Sensors like e.g. a temperature sensor provide a single value at a given point in time, resulting in univariate data. Other sensors like e.g. a camera (univariate data per pixel color, overall multivariate data output) provide multiple values at a given point in time, resulting in multivariate data. Usually univariate (data) analysis differs from multivariate (data) analysis. W.r.t. the sensor type, the multivariate data output of e.g. a camera could be considered, in a simplifying manner, as a collection of univariate data. The assumption is that each pixel color outputs univariate data which is affected by the exact same side effects, resulting in the exact same magnitudes w.r.t. error types. However, if one needs the highest data quality, the pixels in the center of the camera chip might be impacted differently than the pixels at the border of the camera chip, both w.r.t. the relevant effects and w.r.t. potentially different error magnitudes. In the latter case the data has to be considered multivariate and the corresponding multivariate analysis has to be applied.
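Whether the simplifying assumption holds can be checked by comparing a per-pixel metric across regions of the chip. The sketch below uses a synthetic 8x8 single-channel image stack in which the border pixels are noisier by construction; the sizes and noise levels are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stack of 200 frames from a hypothetical 8x8 single-channel sensor
# looking at a uniform scene. Border pixels get more noise by construction.
frames, height, width = 200, 8, 8
border = np.ones((height, width), dtype=bool)
border[1:-1, 1:-1] = False
noise_scale = np.where(border, 3.0, 1.0)
stack = 100.0 + rng.normal(size=(frames, height, width)) * noise_scale

# Per-pixel standard deviation over time: one univariate summary per pixel.
pixel_std = stack.std(axis=0)

center_std = pixel_std[~border].mean()
border_std = pixel_std[border].mean()
print(f"center: {center_std:.2f}, border: {border_std:.2f}")
```

If the two values differ clearly, as they do here by design, treating all pixels as interchangeable univariate sources is no longer justified.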
The need for high accuracy univariate sensor data implies the need for a multivariate system design
There are two extreme scenarios:
- Best case: We use a sensor which is sufficiently characterized by its vendor.
- Worst case: We use a sensor which is not characterized by its vendor at all and we are not able to replace it with a sensor which is better characterized.
Surprisingly, sensors are often characterized by their vendors in a minimalist and incomplete manner. Many sensor datasheets do not provide characterization curves at all, or only over an insufficient operating range. This means we often have to deal with the worst case scenario and characterize the sensor ourselves to get to a situation equivalent to the best case scenario.
Even in the best case scenario we still have to compensate for the known and characterized dependencies of the sensor on operating conditions and environmental conditions. A common operating condition is the supply voltage. Common relevant environmental conditions are temperature and humidity (e.g. for gas sensors). In the case of battery powered sensor devices the supply voltage may vary over the device lifetime. This means that for high accuracy sensor data one has to place a temperature sensor as well as a humidity sensor as close as possible to the actual sensor. In addition, the supply voltage of the sensor device has to be monitored. As a result, instead of dealing with 1 sensor data value one has to deal with 4 sensor data values for every point in time.
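In pandas terms, the 4 values per point in time become 4 columns of one DataFrame that share a timestamp index. The following sketch uses synthetic data; the column names, units and the built-in temperature dependence of the gas signal are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
index = pd.date_range("2024-01-01", periods=500, freq="min")
n = len(index)

# Synthetic operating and environmental conditions (all values hypothetical).
temperature_c = pd.Series(22.0 + rng.normal(scale=2.0, size=n), index=index)
humidity_pct = pd.Series(45.0 + rng.normal(scale=5.0, size=n), index=index)
supply_voltage_v = pd.Series(np.linspace(3.3, 3.0, n), index=index)

# Synthetic gas sensor signal that depends on temperature by construction.
gas_signal = 400.0 + 3.0 * (temperature_c - 22.0) + rng.normal(scale=1.0, size=n)

# One row per point in time, one column per measured quantity.
samples = pd.DataFrame({
    "gas_signal": gas_signal,
    "temperature_c": temperature_c,
    "humidity_pct": humidity_pct,
    "supply_voltage_v": supply_voltage_v,
})

# Correlation of the target signal with each condition hints at
# dependencies that may need compensation.
print(samples.corr()["gas_signal"].round(2))
```

A correlation table like this is only a first hint. It captures linear dependencies and says nothing about which compensation model is appropriate.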
My personal journey w.r.t. interactive visualization of sensor data
At Plasmion GmbH we are doing research in the field of data processing and machine learning based on mass spectrometer data created via soft-ionization. This field is cutting edge research. The “sensor” (mass spectrometer + soft-ionization component) uses physical effects (the mass spectrometer counts ions which hit a detector) as well as chemical effects (soft-ionization for creating the ions). W.r.t. data analysis the overall system is very complex. Mass spectrometers are specified in a minimalist manner w.r.t. the effects of device configurations and potential impacts on the output data. In addition, environmental conditions like temperature and humidity may significantly influence the amount of ions created during soft-ionization. E.g. when trying to quantify the concentration of a single chemical compound in the ambient air, varying device configurations and environmental conditions (temperature, humidity) may lead to significantly different amounts of ions being counted at the detector. As a consequence, system characterization is very important and needs to be performed for every new potential application use case, which may involve ionization effects never seen before.
We are a 5 person startup including 1 domain expert with ML experience as well as 1 person with generic ML experience for setting up the ML infrastructure and taking care of MLOps. Our highly iterative EDA process needs to be super efficient, and we need to adjust the visualization functionality to the current knowledge about the system over and over again. Creating and iteratively adjusting static plots with “lightweight” frameworks like e.g. matplotlib takes a lot of time. When working with commercial, “heavyweight” frameworks one usually has two options for implementing domain or product specific visualization functionality:
- either use the built-in capabilities to e.g. plot from imported flat data (e.g. imported from .csv files, databases, etc.), or
- extend the plotting functionality using platform specific APIs.
After analyzing both options, the “heavyweight” frameworks did not seem suitable for us at all. I decided to start with a lightweight EDA setup based on Kedro, Jupyter Notebooks, representations of the mass spectrometer data as pandas DataFrames or Series, and visualizing them with the built-in interactive plotting capabilities of pandas.
Fundamentals of data visualization
How to process data to enable its visualization, what plot type to use for visualizing specific data, principles of figure design, etc. are not in the scope of this post series. For a deep dive into data visualization fundamentals I highly recommend the free online version of the book Fundamentals of Data Visualization. The book uses examples in R, but the principles addressed are generic and may be applied to plots created from pandas data structures as well.
In the following posts of the series I’ll show generic examples of how to use Jupyter Notebooks and pandas DataFrames/Series with the built-in plotting features of pandas. The examples are hosted on Binder.