Handling missing values with Snowpark for Python — Part 1

Photo by Towfiqu barbhuiya on Unsplash

Missing values in data can often cause “confusion” in many machine learning algorithms, resulting in unreliable output. Although many algorithms have routines for dealing with incomplete data, they are not always robust. A useful preprocessing step is therefore to handle missing values based on an understanding of the data and how it was generated.

In this two-part series, I will go through different ways of handling missing values using Snowpark for Python.
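
All examples in this post assume an active Snowpark session named session and the Snowpark functions module imported as F. A minimal setup sketch, where the connection parameters are placeholders for your own account details:

from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

# Placeholder connection parameters -- replace with your own account details
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

session = Session.builder.configs(connection_parameters).create()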

What are missing values?

Missing values are the absence of data in a feature/column, and most often they are represented as null or NaN (Not a Number), because computers need a way to know that something is missing.

However, in some cases the data-generating application has logic to handle missing values by replacing them with a “dummy” value like -1 or an empty string. Even Snowflake can do this with the use of a default value for columns in a table, meaning that the default value is used when data is inserted without values for those columns.
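
As a small illustration of that last point, the sketch below uses a hypothetical table (the table and column names are made up for this example) where a column default acts as such a dummy value whenever an insert omits the column:

# Hypothetical table: COUNTRY falls back to the dummy value 'XX'
# when an insert does not provide a value for it
session.sql("""
    create or replace temporary table customer_demo (
        id int,
        country string default 'XX'
    )
""").collect()

session.sql("insert into customer_demo (id) values (1)").collect()
session.table("customer_demo").show()
# The inserted row gets COUNTRY = 'XX' even though no value was provided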

So, before starting to implement the missing value handling, it is important to first understand how missing values are represented in the data set(s) being used.

Handling missing values

Missing values can be handled in two ways: exclude the rows that have them, or replace, “impute”, them with a new value. It is not uncommon to use a combination of both, depending on which columns have missing values.

Let’s start with creating a Snowpark dataframe to be used with most of the examples.

>>> df = session.create_dataframe([[1.0, 1, 'SE'], [float('nan'), 2, None], [float('nan'), 3, 'DK'], [4.0, None, 'SE'], [float('nan'), None, None]]).to_df("a", "b", "c")
>>> df.show()
---------------------
|"A" |"B" |"C" |
---------------------
|1.0 |1 |SE |
|nan |2 |NULL |
|nan |3 |DK |
|4.0 |NULL |SE |
|nan |NULL |NULL |
---------------------

Excluding rows with missing values

The easiest way to handle missing values is to exclude the rows that have them.

Snowpark has a dropna method that can be used for excluding rows that have columns with null/NaN values. It does not physically delete those rows from the underlying table; it just creates the logic for filtering them out.

dropna returns a new dataframe with the logic for excluding the rows added. This would usually be assigned to a variable, but since these are just examples I am using show to display the first 10 rows.
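
One way to see that only a query is built, and that no data is touched, is to look at the SQL behind the returned dataframe. A small sketch, assuming the queries property of the dataframe:

>>> print(df.dropna().queries["queries"][-1])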

If dropna is used without any parameters, it will exclude all rows that have null/NaN in any column. This is equivalent to setting the parameter how='any'.

>>> df.dropna().show()
-------------------
|"A" |"B" |"C" |
-------------------
|1.0 |1 |SE |
-------------------

Setting how='all' will only exclude rows where every column has a null/NaN value.

>>> df.dropna(how='all').show()
---------------------
|"A" |"B" |"C" |
---------------------
|1.0 |1 |SE |
|nan |2 |NULL |
|nan |3 |DK |
|4.0 |NULL |SE |
---------------------

The thresh parameter can be used to set the minimum number of non-null/non-NaN values a row needs to have in order to be kept; rows with fewer values than that are excluded.

>>> df.dropna(thresh=2).show()
--------------------
|"A" |"B" |"C" |
--------------------
|1.0 |1 |SE |
|nan |3 |DK |
|4.0 |NULL |SE |
--------------------

dropna can also look only in specific columns for null/NaN values to determine whether a row should be excluded; this is done with the subset parameter.

>>> df.dropna(subset=["a", "c"]).show()
--------------------
|"A" |"B" |"C" |
--------------------
|1.0 |1 |SE |
|4.0 |NULL |SE |
--------------------

Replacing missing values

Rather than excluding rows with missing values, another approach is to replace, or “impute”, the missing values. This can be done using a static value or a value derived from the data, for example the mean or the most frequent value.

Replacing values with a static value

For replacing missing values with a static value, the fillna method can be used. As with dropna, the fillna method does not update the underlying data; it just generates the logic for it.

To replace all null/NaN values in all columns with 3, fillna is used with 3 as the parameter.

>>> df.fillna(3).show()
Input value type doesn't match the target column data type, this replacement was skipped. Column Name: "C", Type: StringType(), Input Value: 3, Type: <class 'int'>
--------------------
|"A" |"B" |"C" |
--------------------
|1.0 |1 |SE |
|3.0 |2 |NULL |
|3.0 |3 |DK |
|4.0 |3 |SE |
|3.0 |3 |NULL |
--------------------

If the data type of the provided replacement value does not match the column data type, fillna will produce a warning and skip that column. As shown above, 3 is an integer and therefore cannot be used on column C, which is of string type, but it can be used for column A, which is of float type. Be aware that the other way around, replacing missing values in an integer column with a float value, does not work.
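
For example, using a float as the replacement value should fill column A but skip the integer column B (and the string column C), based on the behavior described above. A small sketch of that check:

>>> df.fillna(3.5).show()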

This can be avoided by specifying for each column what the replacement value should be, using a dictionary.

>>> df.fillna({"a": 3.14, "c": "XY"}).show()
---------------------
|"A" |"B" |"C" |
---------------------
|1.0 |1 |SE |
|3.14 |2 |XY |
|3.14 |3 |DK |
|4.0 |NULL |SE |
|3.14 |NULL |XY |
---------------------

In cases where a missing value is represented by a dummy value, the replace method can be used. As above, no data is changed in the underlying table.

Let’s generate a new dataframe for this, with -99 and XX representing that the value is missing.

>>> df_rep = session.create_dataframe([[1, 'SE'], [2, 'XX'], [3, 'DK'], [-99, 'SE'], [-99, 'XX']]).to_df("a", "b")
>>> df_rep.show()
-------------
|"A" |"B" |
-------------
|1 |SE |
|2 |XX |
|3 |DK |
|-99 |SE |
|-99 |XX |
-------------

The simplest way to use replace is to just provide the value to look for and the value to replace it with. By default it will look in all columns.

>>> df_rep.replace(-99, 3).show()
Input key or value type doesn't match the target column data type, this replacement was skipped. Column Name: "B", Type: StringType(), Input Key: -99, Type: <class 'int'>, Input Value: 3, Type: <class 'int'>
-------------
|"A" |"B" |
-------------
|1 |SE |
|2 |XX |
|3 |DK |
|3 |SE |
|3 |XX |
-------------

replace works the same way as fillna and will issue a warning if the replacement value is of a type that is not compatible with the data type for a column.

It is also possible to specify multiple values to look for and what to replace them with. By default it will look for the values in all columns, which produces warnings when the data types do not match.

>>> df_rep.replace({-99:3, 'XX':'SE'}).show()
Input key or value type doesn't match the target column data type, this replacement was skipped. Column Name: "A", Type: LongType(), Input Key: XX, Type: <class 'str'>, Input Value: SE, Type: <class 'str'>
Input key or value type doesn't match the target column data type, this replacement was skipped. Column Name: "B", Type: StringType(), Input Key: -99, Type: <class 'int'>, Input Value: 3, Type: <class 'int'>
-------------
|"A" |"B" |
-------------
|1 |SE |
|2 |SE |
|3 |DK |
|3 |SE |
|3 |SE |
-------------

The subset parameter is used to control which columns to do the replacement in.

>>> df_rep.replace({-99:3}, subset=["A"]).show()
-------------
|"A" |"B" |
-------------
|1 |SE |
|2 |XX |
|3 |DK |
|3 |SE |
|3 |XX |
-------------

replace can also be used directly on a column, in combination with, for example, select or with_column. This can be very useful when there are different replacement values for different columns.

>>> df_rep.with_column("A", F.replace(F.col("A"), -99, 3)).show()
-------------
|"B" |"A" |
-------------
|SE |1 |
|XX |2 |
|DK |3 |
|SE |3 |
|XX |3 |
-------------

Replacing missing values using existing data

The Snowpark DataFrame API, as of the writing of this post, does not have any specific methods for replacing missing values with values derived from the existing data.

However, if I, for example, want to replace the missing values in column A of the sample dataframe df with the mean value for that column, I can do that by using the mean and iff functions.

The iff function is used to check whether a value is missing; since column A can have both null and NaN for missing, it needs to check for both. If the value is missing, the mean value is used as a replacement. Since the mean function does not include null values, there is no need to filter those out, only the NaN values.

>>> df.with_column("A", F.iff((F.col("A") == F.lit('NaN')) | (F.col("A").is_null()), F.avg(F.iff(F.col("A") == F.lit('NaN'), F.lit(None), F.col("A") )).over(), F.col("A"))).show()
---------------------
|"B" |"C" |"A" |
---------------------
|1 |SE |1.0 |
|2 |NULL |2.5 |
|3 |DK |2.5 |
|NULL |SE |4.0 |
|NULL |NULL |2.5 |
---------------------
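
The one-liner above is rather dense, so the same logic can also be split into named expressions. A sketch that should be equivalent:

# Flag for "missing": column A uses both NaN and null to represent missing values
a_is_missing = (F.col("A") == F.lit('NaN')) | (F.col("A").is_null())

# Mean of A over the whole dataframe, with NaN values turned into null
# so that avg ignores them (null values are already ignored)
a_mean = F.avg(F.iff(F.col("A") == F.lit('NaN'), F.lit(None), F.col("A"))).over()

# Replace missing values in A with the mean, keep existing values otherwise
df.with_column("A", F.iff(a_is_missing, a_mean, F.col("A"))).show()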

If the column where I want to replace missing values with the mean only has null as missing values, I can use the ifnull function instead of iff. Since the ifnull function is currently not exposed in the Snowpark API, I need to use the call_builtin function to call it.

>>> df.with_column("B", F.call_builtin("ifnull", F.col("B"), F.avg(F.col("B")).over())).show()
----------------------
|"A" |"C" |"B" |
----------------------
|1.0 |SE |1.000 |
|nan |NULL |2.000 |
|nan |DK |3.000 |
|4.0 |SE |2.000 |
|nan |NULL |2.000 |
----------------------

When it comes to columns with string values, a common replacement method is to use the most frequent value. This can be done using the mode function, and the rest of the logic is the same as in the mean example.

>>> df.with_column("C", F.call_builtin("ifnull", F.col("C"), F.mode(F.col("C")).over())).show()
--------------------
|"A" |"B" |"C" |
--------------------
|1.0 |1 |SE |
|nan |2 |SE |
|nan |3 |DK |
|4.0 |NULL |SE |
|nan |NULL |SE |
--------------------

Conclusion

In this post I have shown how to handle missing values in different ways using Snowpark for Python, both using built-in methods and creating some more advanced logic.

One thing to keep in mind is that, from a machine learning perspective, it is important that the same replacement values are used on the training, validation and production data.

When using static values this is rather easy: just use the same values with the validation and production data. It is when we derive the replacement values from existing data, like the mean or the most frequent value, that we need a way to bring them forward to be used with our validation and production data.

For example, scikit-learn has imputers that are first applied to the training data, using a fit method, and can then be used on new data, using a transform method. If the fitted imputer object is saved, using pickle or joblib, it can be loaded and reused in production.
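
A minimal sketch of that pattern, assuming scikit-learn and joblib are installed (the file name is just an example):

import joblib
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [np.nan], [np.nan], [4.0], [np.nan]])

# Fit the imputer on the training data; it stores the column means internally
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# Persist the fitted imputer so the same means can be reused later
joblib.dump(imputer, "imputer.joblib")

# In production: load the fitted imputer and transform new data with it
loaded_imputer = joblib.load("imputer.joblib")
X_new = loaded_imputer.transform(np.array([[np.nan], [2.0]]))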

This will be the topic of the next post in this series: how you can create your own imputer, using the Snowpark API, that can be fitted on one data set and then used to impute (transform) another data set.
