DATA STORIES | MISSING VALUE HANDLING | KNIME ANALYTICS PLATFORM

Simple, correlation-based interpolation of missing values in PurpleAir sensor data

An example of low-code/no-code modeling using KNIME

Samir Omanovic
Low Code for Data Science


Photo by Jorge Ramirez on Unsplash

This story is an adapted version of the conference paper [1], “Missing Values Interpolation in PurpleAir Sensor Data based on a Correlation with Neighboring Locations using KNIME Analytics Platform”, published at the 2023 46th MIPRO ICT and Electronics Convention (MIPRO). It is based on the experience my team and I gained on a small project, “Analysis of data from PurpleAir sensors”, supported by the Federal Ministry of Education and Science of the Federation of Bosnia & Herzegovina, Bosnia & Herzegovina.

At the beginning of any research there are certain dilemmas that need to be resolved. Two of them are especially important: which methods to use and which tools to use.

Should we apply a simple method (approach, algorithm, model) or a complex one? To answer this, we first have to answer another question: are we looking for a solution that meets the given expectations and can be obtained with the available resources, or are we striving for a complex solution, not necessarily better than the simple one in any way, just to impress its users? In most cases, a simple approach is good enough. This story presents a simple approach that results in a solution that meets expectations (it is general and precise enough) and that is obtained quickly and at low cost.

Should we program a solution (using Python, etc.) or model it on a no-code/low-code platform like KNIME? Again, we first have to answer another question: do we program a solution out of habit, to make it less understandable to others, or to impress others with our programming skills, or because the problem really requires it? After more than three decades of programming, my decision is to use KNIME Analytics Platform, because I can focus on modeling instead of programming, and because seeing the model as a workflow significantly improves team discussions and understanding of the modeling process.

The data

PurpleAir sensors are low-cost sensors for air quality monitoring. They provide real-time measurements of temperature, humidity, PM1.0, PM2.5 and PM10, using laser particle counters for the particulate readings. The data used in this example covers particle pollution (PM2.5) only. A PurpleAir sensor can be connected to a Wi-Fi network, which makes its measurements accessible on that network. Additionally, users can register their sensors on the PurpleAir real-time map [2], a web application that displays a network of community-owned PurpleAir sensors and allows the data to be downloaded and used for research projects. For more details about the data, please visit https://map.purpleair.com/. Fig. 1 shows the area on the map that the data covers.

Figure 1. Map of locations with PurpleAir Sensors (image taken from the PurpleAir real-time map on 30.09.2022).

The data used is publicly available on the PurpleAir real-time map and was collected by several PurpleAir sensors located in Bosnia and Herzegovina, in the cities of Bihać, Bosanski Petrovac, Bosanska Krupa, Prijedor, and Velika Kladuša. Results and conclusions are based on data from these locations for the period 01.01.2021–29.09.2022, as 60-minute averages. The data was downloaded as CSV files on 30.09.2022.

On the map in Fig. 1, the city of Bosanska Krupa is surrounded by the cities of Bihać (air distance around 33 km), Velika Kladuša (air distance around 33 km), Prijedor (air distance around 47 km), and Bosanski Petrovac (air distance around 35 km). The geographical location of these four cities is the main reason their data was used to interpolate missing values for Bosanska Krupa.

Approach

This workflow addresses the problem of missing PM2.5 values in PurpleAir sensor data by using existing data from neighboring sites. Neighboring locations usually share climatic and geographical characteristics, which means that they generally share the same causes of particle pollution. Some causes will always be local, but if the locations are close enough we can expect them to have many causes of particle pollution in common. On that basis, there should be a significant correlation between the PM2.5 data from these locations. In general, correlation should not be taken as evidence of mutual causation, but here it serves as an indicator that the locations share common causes of particle pollution.

Modeling using KNIME

At the beginning of the workflow, a few CSV Reader nodes were used to read data from CSV files. Then, several manipulation nodes, such as the Row Filter, Duplicate Row Filter, and Sorter were used to remove records with missing PM2.5 values, remove duplicated rows and sort data by date and time. Rows with missing PM2.5 values were not used in further processing.
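For readers who think in code, the cleaning steps above can be sketched roughly as follows. This is only an illustration, not part of the KNIME workflow: the `Reading` type and the `clean` method are hypothetical names, and the sketch assumes that "duplicate" means two rows with the same timestamp.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Pm25Cleaning {
    /** Hypothetical row type: one hourly average; pm25 is null when the value is missing. */
    static final class Reading {
        final LocalDateTime time;
        final Double pm25;

        Reading(LocalDateTime time, Double pm25) {
            this.time = time;
            this.pm25 = pm25;
        }
    }

    /** Drops rows without a PM2.5 value, keeps one row per timestamp, sorts by time. */
    static List<Reading> clean(List<Reading> raw) {
        // A TreeMap keeps its keys in ascending order, which covers the sorting step.
        Map<LocalDateTime, Reading> byTime = new TreeMap<>();
        for (Reading r : raw) {
            if (r.pm25 == null || r.pm25.isNaN()) {
                continue; // missing PM2.5 value: skip the row
            }
            // Assumption: a duplicate is a row with an already seen timestamp;
            // the first occurrence is kept.
            byTime.putIfAbsent(r.time, r);
        }
        return new ArrayList<>(byTime.values());
    }
}
```

In the workflow itself these three steps remain separate nodes, which makes each intermediate table easy to inspect.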

Using a series of Joiner nodes set to inner join, a new set was created that combines the measurements for all five locations involved (four neighboring locations and the observed, target location). That set contains only the hours for which measured values exist at all five locations. If a measurement is missing for any of the five locations, that date and hour is not included in the set.
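The effect of the chained inner joins can be illustrated with a small sketch. It assumes each location has already been reduced to a map from timestamp to PM2.5 value; the class and method names are hypothetical and do not come from the workflow.

```java
import java.time.LocalDateTime;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HourlyInnerJoin {
    /**
     * Keeps only the timestamps that are present in every series. For each such
     * timestamp the returned row holds the values in the order of the input series.
     */
    static TreeMap<LocalDateTime, double[]> join(List<Map<LocalDateTime, Double>> series) {
        TreeMap<LocalDateTime, double[]> joined = new TreeMap<>();
        if (series.isEmpty()) {
            return joined;
        }
        // A timestamp shared by all series must appear in the first one,
        // so it is enough to iterate over the first series' keys.
        for (LocalDateTime t : series.get(0).keySet()) {
            double[] row = new double[series.size()];
            boolean complete = true;
            for (int i = 0; i < series.size() && complete; i++) {
                Double v = series.get(i).get(t);
                if (v == null) {
                    complete = false; // at least one location has no value for this hour
                } else {
                    row[i] = v;
                }
            }
            if (complete) {
                joined.put(t, row);
            }
        }
        return joined;
    }
}
```

Passing the four neighboring series followed by the target series would give rows of five values per hour, matching the joined set described above.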

After that, outliers are removed using the Numeric Outliers node. The resulting set is used to calculate the correlation between the four neighboring locations (Velika Kladuša, Bihać, Bosanski Petrovac, and Prijedor) and the observed location (Bosanska Krupa), using the Linear Correlation node [3]. This node calculates the correlation for all combinations of input columns, so the values of interest have to be filtered out for further processing. The results indicate a moderate to high positive correlation of PM2.5 values between the neighboring locations and the observed location. The reported p-values are zero, which means that correlations this strong would be extremely unlikely to arise by chance (a p-value shown as zero is a very small number below the displayed precision, not a literal zero probability). Fig. 2 shows a workflow snippet with the nodes that calculate the weights from the correlation values.
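For numeric columns, the correlation in question is Pearson's correlation coefficient. A minimal sketch of that calculation for one pair of columns is shown below; the class and method names are hypothetical, and the p-value calculation is left out.

```java
final class PearsonCorrelation {
    /** Pearson correlation coefficient of two equally long series. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0; // sum of products of deviations
        double sxx = 0.0; // sum of squared deviations of x
        double syy = 0.0; // sum of squared deviations of y
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // The result is NaN if either series is constant (zero variance).
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

In this setting, `x` would be the PM2.5 column of one neighboring location and `y` the column of Bosanska Krupa, both taken from the joined set.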

Figure 2. Part of the KNIME workflow where the correlation-based weight coefficients were calculated.

Four weight values based on the correlation values are calculated in the Java Snippet node, using the following simple code snippet:

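// c_Row0..c_Row3 are the four correlation values used as input
double sumOfAll1 = c_Row0 + c_Row1 + c_Row2 + c_Row3;

// Divide each correlation by the total so that the four weights sum to 1
out_w1 = c_Row0 / sumOfAll1;
out_w2 = c_Row1 / sumOfAll1;
out_w3 = c_Row2 / sumOfAll1;
out_w4 = c_Row3 / sumOfAll1;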

The weights are then transformed into flow variables using the Table Row to Variable node. These weight variables are used to calculate interpolated values as a weighted sum of the measurements from the neighboring locations. Further analysis of the results showed that the interpolated result should be corrected by multiplying it by the average ratio between the expected and the calculated values. The calculated multiplication factor is 0.896. Quality indicators are then generated by the Numeric Scorer node.
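The weighted sum and the correction factor can be sketched in a few lines. All names here are hypothetical, and the sketch makes two assumptions the text does not spell out: the "average ratio" is taken as the mean of the per-hour ratios measured/calculated, and mean absolute error stands in for the quality indicators as one example of an error measure.

```java
final class CorrelationWeightedInterpolation {
    /** Weighted sum of the neighbor measurements for a single hour. */
    static double weightedSum(double[] neighbors, double[] weights) {
        if (neighbors.length != weights.length) {
            throw new IllegalArgumentException("one weight per neighbor is required");
        }
        double sum = 0.0;
        for (int i = 0; i < neighbors.length; i++) {
            sum += weights[i] * neighbors[i];
        }
        return sum;
    }

    /**
     * Assumed definition: mean of measured/calculated over the hours where both
     * values are known. Hours with a calculated value of zero are skipped.
     * Returns NaN if no hour can be used.
     */
    static double correctionFactor(double[] measured, double[] calculated) {
        double sum = 0.0;
        int n = 0;
        for (int i = 0; i < measured.length; i++) {
            if (calculated[i] != 0.0) {
                sum += measured[i] / calculated[i];
                n++;
            }
        }
        return sum / n;
    }

    /** Mean absolute error between measured and predicted values. */
    static double meanAbsoluteError(double[] measured, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < measured.length; i++) {
            sum += Math.abs(measured[i] - predicted[i]);
        }
        return sum / measured.length;
    }
}
```

With the weights from the Java Snippet node and a factor `f`, an interpolated value for one hour would then be `f * weightedSum(neighbors, weights)`.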

Calculated weights and the multiplication factor are then used to interpolate missing values for the observed location. In the final phase, interpolated data and the existing data for the observed location were joined for further analysis of the results.

So, the interpolation is based on a weighted sum, where the weighting factors are not calculated on the basis of physical distance but on the basis of correlation between data.
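Written as a formula, the whole procedure is compact. The symbols are introduced here only for illustration: r_i is the correlation between neighboring location i and the observed location, x_i(t) is the measurement at neighboring location i for hour t, and f is the multiplication factor (0.896 in this case).

```latex
w_i = \frac{r_i}{\sum_{j=1}^{4} r_j},
\qquad
\hat{y}(t) = f \sum_{i=1}^{4} w_i \, x_i(t)
```

Because the weights are normalized, they sum to 1, so without the factor f the estimate would be a weighted average of the four neighboring measurements.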

Discussion of results

Fig. 3 compares the final calculated (interpolated) values with the real (expected) values. The calculated values are close to the real values at most points (60-minute averages). The differences that do occur are mostly at the higher peaks in the real values. They can be explained by local influences (factors specific to the observed location) that cannot be predicted from the data of the neighboring locations.

Figure 3. Checking results by comparing calculated values (after applying the correction factor) and known values.

Conclusions

Our experiments showed that interpolating PM2.5 values in PurpleAir sensor data, using existing data from neighboring locations and a weighted sum based on the historical correlation of the data, has several advantages, including (1) easy implementation and (2) the possibility to interpolate large gaps in the data. Comparing the interpolated values with the known (measured) values, we can conclude that the interpolation results are very good.

KNIME Analytics Platform enables very effective data analysis and modeling, so that the research focus is on modeling and on representing the model as a workflow, not on programming a solution.

References

[1] S. Omanovic, A. Midzic, Z. Avdagic, D. Pozderac and A. Toroman, “Missing Values Interpolation in PurpleAir Sensor Data based on a Correlation with Neighboring Locations using KNIME Analytics Platform,” 2023 46th MIPRO ICT and Electronics Convention (MIPRO), Opatija, Croatia, 2023, pp. 291–295, doi: 10.23919/MIPRO57284.2023.10159808

[2] PurpleAir, “Map Start-up Guide,” PurpleAir Community. [Online]. Available: https://community.purpleair.com/t/map-start-up-guide/90. [Accessed: Dec. 06, 2022].

[3] KNIME AG, “Linear Correlation.” [Online]. Available: https://hub.knime.com/knime/extensions/org.knime.features.base/latest/org.knime.base.node.preproc.correlation.compute2.CorrelationCompute2NodeFactory. [Accessed: Dec. 07, 2022].
