[EN] Special Topics in Data Science, Vol: 1

Melikşah Çelik
5 min read · Jun 26, 2019


I am starting a series of articles in which I will share some of the more detailed and specialized topics in data science that we actually use and deal with, rather than the general subjects found in various online sources such as blog posts and online learning platforms. I hope these tips and methods will contribute to your work. If you are new to data science, you may want to start with other introductory resources.

[The first two paragraphs of the Turkish original discuss conditions and developments in the Turkish data science community; they can be skipped without losing the thread of the topic.]

Turkish version: https://medium.com/@mlkshclk/tr-veri-biliminde-ozel-teknikler-no-1-cef1ca19c41b

In this article, I will discuss a few suggestions on Missing Value Handling and a method that can be applied.

  • Level of difficulty: 5/10
  • Practicality: 9/10
  • Academic intensity: 6/10

Missing Value Handling

There are two types of missing values in general: values that should be missing, and values that are missing but should not be.

When you join data from different samples of a population, or from different populations, you cannot expect the sample-specific features to be 100 percent populated; such missing values are an example of values that should be missing. Applying a missing value analysis technique to them would simply be wrong. For example, suppose you keep different attributes of your customers in different tables and you have joined those tables to combine all the features into a single one. Company balance sheet features will be empty for your individual customers. If, at that moment, you do not stop to look at the data and simply think 'there is empty data there, let's fill it in with some average', your work will be completely meaningless.

Another example is data gathered from form fields that are not marked as 'required'. If you have not required your customers to enter their age on a form, the incoming data may be missing; in other words, 'we have allowed it to be empty.'

If there is a missing value that should not be missing, it is probably due to a mistake or an exception in your API gateways, ETL procedures, or data model / table designs. In that case you cannot simply apply missing data removal or filling methods; you should first make improvements in those areas. For example, if one or more of the raw fields you use to create a derived variable are empty after a specific date, there is probably a system or rule change behind it.

Let's say you are working on a prediction model and you have derived features, obtained from different raw data sources, that can be fed into modeling algorithms. Missing values in your derived table may come from missing values in the raw data or from the ETL processes between the raw and derived data. Suppose the ETL processes lose no data; then you need to make your corrections or deletions on the raw sources of the missing values. At this stage, there is a method that can be applied before row deletion or filling/imputation methods:

Consider a data set with N features and M observations, as follows:

We calculate a missing data percentage, or null percentage (NP), for each observation and each feature. For your model's efficiency and significance in production, as a rule of thumb, a feature you plan to fill should be at least about 70% non-null (that is, its NP should stay below roughly 30%) before a filling method is applied to it.
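To make the NP calculation concrete, here is a minimal sketch in pandas; the small DataFrame below is a made-up example standing in for your own feature table:

```python
import numpy as np
import pandas as pd

# Hypothetical example table: N = 3 features, M = 5 observations
df = pd.DataFrame({
    "age":    [35, np.nan, 42, np.nan, 29],
    "income": [5000, 7200, np.nan, np.nan, np.nan],
    "tenure": [12, 3, 8, 15, 2],
})

# Null percentage (NP) per feature (column)
np_per_feature = df.isna().mean() * 100

# Null percentage (NP) per observation (row)
np_per_observation = df.isna().mean(axis=1) * 100

print(np_per_feature)        # e.g. income is 60% null in this toy table
print(np_per_observation)
```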

What we are going to do at this stage is determine the quality of the rows and columns in terms of fullness, according to a simple Pareto rule: either run an 80–20 Pareto rule directly on the null percentages, if that makes sense for your data, or work with the percentiles of the null percentages. As a result, we aim to arrive at a data set like the following:

When we order rows and columns from low occupancy to high occupancy, the green area is the high-quality region in terms of 'meaningful data occupancy', while the red areas are the ones to avoid.

The green area may also contain missing data. However, we know that the red areas already suffer heavy data loss. Therefore, it is more accurate not to take these red observations and features into the modeling stage in their current state.
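Continuing the sketch above, one simple way to carve out the green area is to keep only the rows and columns whose NP stays below a cut-off. The 50% threshold below is chosen only so the tiny toy example keeps something; in practice you would pick a stricter value, or derive the cut-off from the percentiles of the NPs themselves:

```python
# Cut-off for the null percentage; 50% is an illustrative assumption for
# the toy data above, not a recommendation
max_np = 50.0

# Alternative: derive the cut-off from the NP distribution itself,
# e.g. a Pareto-style split: max_np = np_per_feature.quantile(0.8)

# Keep only the "green" rows and columns, i.e. those at or below the cut-off
green_cols = np_per_feature[np_per_feature <= max_np].index
green_rows = np_per_observation[np_per_observation <= max_np].index

df_green = df.loc[green_rows, green_cols]
print(df_green)   # the heavily missing 'income' column and the worst row drop out
```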

After this, you can continue your work by applying whichever missing data filling or deletion methods you prefer within the green area.
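As one hedged example of such a filling step on the green area (a per-column median fill is only one of many possible choices):

```python
# Fill the remaining numeric gaps in the green area with per-column medians;
# this is just an illustrative choice, not the right answer for every case
df_filled = df_green.fillna(df_green.median(numeric_only=True))
print(df_filled)
```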

The motivation behind this method is as follows: think of all the data preparation, modeling, tuning and other steps in your project. You have completed your work and now you will deploy your model to production. The data that reaches your model in production must have properties similar to the data you used during the learning/training phase. One of those properties is precisely the 'missing data ratio' of each variable.

If you delete or fill every missing incoming value, you may find that you have filled 80% of an incoming feature with the average or median, which is a sign that the model will be meaningless. Likewise, when you delete every row that has any missing value, you will sometimes end up with data sets where no rows are left… So the quality of the data you feed into the model matters as much as the quality of the method you apply to the missing data.
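A rough way to keep an eye on this is to compare the per-feature null percentages of the training data with those of each incoming production batch; the DataFrames, column names and the 10-point tolerance below are all made-up assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical training data and a new production batch with the same columns
train_df = pd.DataFrame({"age": [35, 41, np.nan, 29],
                         "income": [5000, np.nan, 6100, 4800]})
prod_batch = pd.DataFrame({"age": [np.nan, np.nan, 33],
                           "income": [5200, np.nan, np.nan]})

# Per-feature null percentages in both data sets
train_np = train_df.isna().mean() * 100
prod_np = prod_batch.isna().mean() * 100

# Flag features whose null percentage drifts by more than 10 points;
# the tolerance is only an illustrative assumption
drift = (prod_np - train_np).abs()
print(drift[drift > 10].sort_values(ascending=False))
```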

There is a great deal of work on data quality; here I have shared one example from the missing data side of it.

Topics that I intend to write about in the future:

  • Why should we use K-fold cross-validation?
  • Predictive Analytics vs Prescriptive Analytics
  • End-to-end data science project management: which documents are needed, and what kind of team should be formed?

Please share your questions and thoughts, thanks.
