How to handle missing environmental data
The authors compared several imputation techniques to find the best one
Missing data is a very common issue when working with machine learning data in the real world. Sensors break. Invalid data gets recorded. Survey responses are left incomplete. A lot can go wrong. So what do we do? We could drop the incomplete samples. But what if we end up with a small dataset? What if we drop very important samples? When I work with data, I almost never drop data points. At worst, keeping them adds some noise to your learning, which is probably better in the long run when building generalizable models.
Instead, I prefer imputing the missing data. This just means filling in the missing values using some rules. Your specific imputation policy is determined by a lot of factors. The authors of the paper, “A computational study on imputation methods for missing environmental data”, compare three different data imputation policies to find the best one. In this article, I will talk about interesting findings from the paper. I will also share the positives of the experiment setup that you should carry into your own machine learning projects. Let me know which of the points was most interesting to you in the comments below (or through DMs). I’d love to learn more about what sticks out to you. As always, the annotated paper will be linked below. Be sure to check it out to get all my insights from the paper.
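To make “filling in with rules” concrete, here is a minimal sketch using scikit-learn’s SimpleImputer with a mean rule. The column names and values are made up for illustration; this is not the method from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy environmental table with gaps (values are made up for illustration)
df = pd.DataFrame({
    "temperature": [21.3, np.nan, 19.8, 22.1],
    "humidity":    [0.61, 0.58, np.nan, 0.64],
})

# Rule: replace each missing value with that column's mean
imputer = SimpleImputer(strategy="mean")
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled)
```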
The Positives
Following are some of the things the team did that you should do in your own projects/whitepapers.
Defining the Problem + Constraints Clearly
One of the best things you can do for your machine learning projects is to sketch out every challenging aspect. Mention what the challenge is, why it’s problematic, and what you would consider an acceptable solution. This gives your project a lot of clarity. For example, the paper explains the challenges of working with environmental data very well. In the words of the authors:
“Organizing environmental data in well-structured databases is a challenging task (Blair et al., 2019). On the one hand, the natural environment is impacted by human activities, and this calls for interdisciplinary research and analysis. On the other hand, natural phenomena cover different time and spatial scales and are generally interconnected, which makes data integration difficult. This typically results in heterogeneous data sources and generally gives rise to databases of a mixed nature, with both qualitative and quantitative entries.”
Identifying the stumbling blocks can help in designing the solutions. Alternatively, you can make some simplifying assumptions, and just make a note of the complexities (we do this a lot in my work with supply chains). Whichever route you take, having clearly defined challenges helps you create the solution.
Clearly defining constraints/challenges also helps other people understand your thought process when working in teams. This makes collaboration more effective.
Accounting for Variance
Datasets can vary a lot, both in the percentage of values missing and in the nature/distribution of the features tracked. The authors of this paper acknowledged this and accounted for both. When describing phase 1 of the paper, they had this to say about the experiment setup: “we selected 10 datasets from various sources in the literature and artificially obtained various degrees of missing data by randomly removing some of the entries. The set of selected databases was chosen to be representative of the typical characteristics found when analysing environmental data, such as varying dimensions, as well as heterogeneous data types and structural features.”
Note that they account for both variance in the degree of missingness (dropping differing amounts of data) and in the nature of the data (using different databases). This is extremely good practice for your own projects. Keep in mind that they were dropping entries from complete datasets, so the true values were known and the imputed values could be scored against them accurately.
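If you want to reproduce that kind of setup, one simple way is to start from a complete table and randomly mask a chosen fraction of its cells. A rough sketch; the masking rate, seed, and the `complete_df` name are mine, not the paper’s.

```python
import numpy as np
import pandas as pd

def add_missingness(df: pd.DataFrame, frac: float, seed: int = 0) -> pd.DataFrame:
    """Randomly blank out roughly `frac` of the cells in a complete DataFrame."""
    rng = np.random.default_rng(seed)
    mask = rng.random(df.shape) < frac   # True where a value will be removed
    return df.mask(mask)                 # masked cells become NaN

# Usage sketch: `complete_df` is a hypothetical fully observed dataset.
# The original is kept as ground truth, so imputations can be scored later.
# incomplete_10 = add_missingness(complete_df, frac=0.10)
# incomplete_20 = add_missingness(complete_df, frac=0.20)
```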
Looking at Performance
Now to answer the question you clicked on this article for: what should you do? Overall, the paper showed missForest (MF) to be the best data imputation policy in terms of error. The other methods they compared were Multivariate Imputation by Chained Equations (MICE) (Buuren and Oudshoorn, 1999) and K-Nearest Neighbors (KNN) (Troyanskaya et al., 2001). The rest of this section goes through the results of the different experiments.
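If you want to try the three approaches yourself in Python, here is a sketch of rough scikit-learn analogues; these are not the exact implementations the authors used. KNNImputer covers KNN, IterativeImputer with its default regressor behaves like MICE, and IterativeImputer driven by a random forest approximates missForest.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# KNN: fill a cell using its k nearest complete neighbours
knn_imputer = KNNImputer(n_neighbors=5)

# MICE-style: chained equations with the default regressor
mice_like = IterativeImputer(max_iter=10, random_state=0)

# missForest-style: chained equations driven by a random forest
mf_like = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)

# All three are used the same way on a numeric array X containing NaNs:
# X_imputed = mf_like.fit_transform(X)
```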
Qualitative Datasets
For qualitative datasets, we see that increasing missingness increases the error, measured as the proportion of falsely classified entries (PFC). This is not shocking. Tic-Tac-Toe is an exception and is worth studying because of its interesting behavior.
The authors had this to say:
“Even if KNN is systematically the least performing IM, neither MICE nor MF stands out from the other IMs. On average over the 1000 simulations, MF is the most performing IM on “Lanza” whereas MICE outperforms MF on “Hayes” and “Tic-Tac-Toe”. However, because of the significant rise in MICE errors on the “Tic-Tac-Toe” case, it loses its advantage as the missing data percentage increases.”
Quantitative Datasets
For quantitative data, the errors are calculated using the normalized root mean squared error (NRMSE). MF, in general, outperforms the other policies in almost every case. The authors have interesting comments about collinearity and the trend; I would suggest reading that section to get them. I don’t mention them here to keep the article concise.
Mixed Data
For mixed data, a combination of PFC and NRMSE is used at varying percentages of missingness. We see MF standing out as a clear winner here. To quote the paper, “A comparison between the respective performances of the three IMs on the graphs of Figure 4 show that MF outperforms MICE and KNN in every case.”
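For reference, both metrics are easy to compute yourself once you have kept the complete data as ground truth. Below is a minimal sketch using the usual definitions (PFC as the share of wrongly imputed categorical cells, NRMSE as the RMSE on the imputed numeric cells normalized by the variance of their true values); the function and array names are mine, not the paper’s.

```python
import numpy as np

def pfc(true_cat, imputed_cat, was_missing):
    """Proportion of falsely classified entries among the imputed categorical cells."""
    return np.mean(true_cat[was_missing] != imputed_cat[was_missing])

def nrmse(true_num, imputed_num, was_missing):
    """RMSE on the imputed numeric cells, normalized by the variance of the true values."""
    diff = true_num[was_missing] - imputed_num[was_missing]
    return np.sqrt(np.mean(diff ** 2) / np.var(true_num[was_missing]))
```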
Simply put, you will almost never go wrong with using missForest to impute your missing environmental data.
A note on Processing Times
The team also looked into processing times for their code. While this is generally not a concern (imputation only needs to be done once), it’s still an important aspect. If you are extremely cost-constrained, here is what they discovered:
TL;DR- MICE is slow.
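If runtime does matter in your setting, the comparison is easy to reproduce: time one fit_transform per imputer on the same masked data. A small sketch; the imputer objects are the hypothetical ones sketched earlier, and `X_missing` is a placeholder array with NaNs.

```python
import time

def time_imputer(imputer, X_missing):
    """Run one fit_transform and return (imputed data, wall-clock seconds)."""
    start = time.perf_counter()
    X_imputed = imputer.fit_transform(X_missing)
    return X_imputed, time.perf_counter() - start

# for name, imp in [("KNN", knn_imputer), ("MICE", mice_like), ("missForest", mf_like)]:
#     _, seconds = time_imputer(imp, X_missing)
#     print(name, round(seconds, 2), "s")
```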
Closing
As a Forest Supremacist, I am obviously pleased with the results. On a more serious note, this paper has a lot to teach. I struggled with what to write because I could have done 3 different articles here. In the end, this particular topic seemed the most valuable. However, make sure you read the paper (especially the case study). The authors have done very cool work. If you want a follow-up to this, let me know in the comments below.
An interesting extension to the paper could have been to evaluate the complexity of the policies being used. Below is a video explaining the Bayesian Information Criterion, which could have been a useful basis here as an alternative to time.
If you guys enjoyed this article do share it with others and clap on Medium. It helps our community grow. Downloadable annotated paper below.
Reach out to me
If this article got you interested in reaching out to me, then this section is for you. You can reach out to me on any of the platforms, or check out any of my other content. If you’d like to discuss tutoring, text me on LinkedIn, IG, or Twitter. If you’d like to support my work, use my free Robinhood referral link. We both get a free stock, and there is no risk to you. So not using it is just losing free money.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
My Substack: https://codinginterviewsmadesimple.substack.com/
Live conversations at twitch here: https://rb.gy/zlhk9y