A journey on Scala ML pipeline — part 3 of 3: Dealing with NAs and XGBoost

Caio Ishizaka Costa
Red Ventures Brasil - Tech
12 min read · Jun 5, 2020
Where are all my values? Photo: Wikimedia

Welcome back. If you just landed on this article, you may want to take a look at part 1 and part 2 first.

This is also available as Scala notebooks in my git, with pretty much all of the text explanation as well. Some code may have been omitted from this article (especially imported packages), but the notebooks were tested in their entirety.

We were promised that XGBoost deals with NAs by itself: https://xgboost.readthedocs.io/en/latest/faq.html#how-to-deal-with-missing-value

Though that is true, the XGBoost4J-Spark implementation is not 100% satisfactory in that sense. Here we have to discuss the different versions extensively, as they vary greatly in this regard. There is very little documentation on this, so most of my knowledge comes from forum discussions and reading the source code.

First, let’s talk about the role of missing values in an XGBoost tree. A tree makes a decision on a variable. If it is categorical, it chooses which categories go to each branch. If it is numerical, it determines a threshold: numbers under the threshold go left, numbers above the threshold go right.

Survival of passengers of Titanic Decision Tree Model, showing different criteria in each branch

Now, what role do missing values play in this decision? For categorical variables it should be pretty simple: it is just another category, no big deal. For numerical variables it is a bit different. A missing value is neither above nor below a threshold (nor equal to it), so it could go either way. XGBoost deals with this properly: it decides where missing values should go, based on the data. If a missing value shows up in test but was not present in training, it will default to one of the branches (the left one, if I am not mistaken).
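To make the mechanics concrete, here is a tiny conceptual sketch (my own simplification, not XGBoost’s actual internals) of how a learned split routes a value, including a default direction for missing values:

// Conceptual sketch only — not XGBoost's real data structures
case class Split(featureIdx: Int, threshold: Double, missingGoesLeft: Boolean)

def goesLeft(value: Double, split: Split): Boolean =
  if (value.isNaN) split.missingGoesLeft // default direction learned from the training data
  else value < split.threshold           // ordinary numeric comparison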

Missing values on XGBoost4j

Okay, you are now an expert on decision trees. Let’s move on to XGBoost4j. The following code will work (i.e., it will run and produce results) in 1.0, but will fail in both 0.8 and 0.9.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler

// Assemble the indexed categorical columns and normalised numerical columns into "features",
// keeping invalid (null/NaN) values instead of failing
val na_assembler = new VectorAssembler()
  .setInputCols((index_columns.map(name => s"${name}_index") ++ numerical_columns.map(name => s"${name}_norm")).toArray)
  .setOutputCol("features")
  .setHandleInvalid("keep")

val na_stages = indexerArray ++ normalizerArray ++ Array(labelIndexer, na_assembler, xgbClassifier, labelConverter)
val na_pipeline = new Pipeline().setStages(na_stages)
val na_model = na_pipeline.fit(training)

Let’s talk about what is going on. Missing values are handled very similarly in all versions, so let’s start with the basic behaviour. Here is what the process looks like:

  1. XGBoost gets the data
  2. If there are sparse vectors, the values omitted from them are considered, well, missing
  3. Additionally, any values in the vectors that match the “missing” parameter will also be considered missing (the name of the parameter is “missing”, the parameter is not missing)

On version 0.9 they added a validation step:

  4. If “missing” is different from 0, you have to set “allow_non_zero_for_missing” to true, otherwise it will throw an error

Fun fact: it is impossible to make 0.9 work without setting "missing" -> 0. Even if you pass "allow_non_zero_for_missing_value" -> true, it will fail. I don’t know what they did or how this made it to release (there is a specific test for that in the repo), but it is true; I tested with Databricks 6.6 ML (which includes XGBoost4j 0.9 by default). Maybe a different distribution of 0.9 has that fixed.

Another fun fact: from version 0.9 to 1.0 they renamed “allow_non_zero_for_missing_value” to “allow_non_zero_for_missing”, so for version 0.9, remember that. I believe such a parameter did not exist in 0.8.
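In code form, the rename described above boils down to this (a quick sketch of the claim, not verified against every distribution):

val extraParams_09 = Map("allow_non_zero_for_missing_value" -> true) // parameter name in 0.9
val extraParams_10 = Map("allow_non_zero_for_missing" -> true)       // parameter name in 1.0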

How does it work in practice?

Enough of (not so) fun facts, and back to the algorithm. Remember from part 1 how VectorAssembler can generate sparse vectors? Well, by default it will omit 0 as a value. But that does not mean there won’t be dense vectors with 0s: it decides whether to use a dense or a sparse representation by itself.

Now comes the problem. Missing values will be treated as missing, but in reality they are 0. Also, there are 0s in the dense vectors, which will be treated as zeroes unless you set “missing” -> 0. Therefore you are at great risk of having some 0s (omitted in sparse vectors) treated as missing and some 0s (stored in dense vectors) treated as 0s. For example, take these two identical vectors (from VectorAssembler); a short sketch of the two representations follows the list:

  • [0, 3, [0,1], [1,1]] (sparse) -> the third feature is omitted, so it will be treated as missing
  • [1, 3, [1,1,0]] (dense) -> the third feature is stored explicitly as 0, so it will be treated as zero
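Here is a quick sketch of those two representations using Spark’s Vectors factory (the values are illustrative, not from the dataset):

import org.apache.spark.ml.linalg.Vectors

val sparse = Vectors.sparse(3, Array(0, 1), Array(1.0, 1.0)) // index 2 omitted because its value is 0
val dense = Vectors.dense(1.0, 1.0, 0.0)                     // index 2 stored explicitly as 0.0

println(sparse)          // (3,[0,1],[1.0,1.0])
println(dense)           // [1.0,1.0,0.0]
println(sparse == dense) // true: logically the same vector, stored differently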

By default, XGBoost sets “missing” -> NaN and “allow_non_zero_for_missing” -> false. In version 0.8, we don’t have check (4), so it will succeed, but it will treat the same original value differently depending on the representation (as explained above). A huge flaw. By setting “missing” -> 0, even though it is not 100% correct to treat 0 as missing, at least you are consistently treating all 0s as missing.

All in all, I believe the additional check (4) is a good one. The only thing I would change is the default “missing” value. It is set to NaN by default; I would make 0 the default. If you don’t specify “missing” and you have sparse vectors, it will fail, and making something with default parameters fail seems weird to me.

So, what to do?

Finally, some practical advice:

  1. For 0.8 and 0.9, don’t sweat it that much. Just set “missing” -> 0, or your model will either fail or produce wrong results. If you have both real 0s and real NAs in the dataset, your model will fail anyway and you will need one of the following approaches.
  2. For 1.0, just keep in mind what you are doing. If you are using VectorAssembler, remember it is omitting 0s, so the only way to have a consistent model is to set “missing” -> 0. Nevertheless, if you have both 0s and NaNs, the model will fail.

Now you may ask: how can I treat 0s and missing values correctly? Great question; let me go through two different paths. Inspiration comes from their tutorial, which gives us three options.

  1. Converting all vectors to dense. Looks easy: we create a new transformer (which we are experts at doing by now) after the assembler, to convert all feature vectors to DenseVector (see the short sketch after this list). Nevertheless, if you have a lot of NAs, it will be very inefficient on memory. If you don’t have memory constraints, it is an okay path to take. I briefly tested it, with poor results. It will work on 1.0, but it will fail on 0.8, and to make it work on 0.9 you will have to set “missing” -> 0, so you would need to replace your actual 0 values with something else (0.001 maybe?). I will not pursue this option here, but rather focus on the next two.
  2. Converting missing values to something else upstream. Very pragmatic approach. Just convert all the nulls/nans to something else, so you don’t have to deal with them. I will show this method and discuss some shortcomings.
  3. Replace the assembler with a custom transformer, so we can indicate the sparse value ourselves, instead of getting 0 for sparsity. This can be way more storage efficient than (1), depending on how many missing values you have.
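For completeness, option (1) could look roughly like this: a minimal sketch, written as a plain UDF rather than a full Transformer, assuming the assembled column is called "features" (the denseData/assembledData names are mine, not from the notebook):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Convert every feature vector to its dense form so that no 0s are silently omitted
val toDenseUDF = udf((v: Vector) => v.toDense)
// val denseData = assembledData.withColumn("features", toDenseUDF(col("features")))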

Replacing missing values with something else

We don’t have to worry about categorical variables. For StringIndexer, null is just another category, and will be treated as such, no confusion.

The risk is with numerical values. Remember how we made the NumericNormalizer replace missing values with NaN? Well, we could go back and make it replace them with something else, and leave that as a param for the user to decide. As we are normalising around the mean, we can make the default replacement 0.0, which corresponds to the mean. I will comment further on replacing missing values with the mean later on, but for now, it is a value like any other.

The key risk here is having actual 0s in our set. As we are normalising, it is very unlikely to have a lot of values that are exactly equal to the mean. Even in low-cardinality columns, you have to be pretty unlucky for a numerical value to be exactly the mean. But feel free to examine your data beforehand. If you would rather set it to -999999, feel free. But understand the consequences:

-999999 is a number. In versions 0.8 and 1.0 you can set the missing value to -999999, but if you are using VectorAssembler, remember it may omit a few 0s (if they exist), so you end up with the same issue described above.

If you have both 0s and missing values in your data, I strongly suggest replacing the 0s with something else (0.0001, say), and setting the NAs to 0. That way, your VectorAssembler will omit the NAs (now 0s), and the 0s that end up in dense vectors will be correctly treated as missing as well.
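A quick sketch of that swap on a single column, before normalisation; the column name "amount" and the epsilon are mine, just for illustration:

import org.apache.spark.sql.functions.{col, lit, when}

val swapped = data.withColumn("amount",
  when(col("amount").isNull || col("amount").isNaN, lit(0.0)) // NAs become 0, which the assembler omits and "missing" -> 0 catches
    .when(col("amount") === 0.0, lit(0.0001))                 // real zeros get nudged away from 0
    .otherwise(col("amount")))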

This is what our new transformer looks like.

import org.apache.spark.SparkException
import org.apache.spark.ml.param.Param
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lit, when}

class NumericNormalizerWithReplacement extends NumericNormalizer {

  // Value used to replace nulls and NaNs in "keep" mode
  val replacementValue: Param[Double] = new Param(this, "replacementValue", """Value to replace Nulls and NaNs""")

  setDefault(replacementValue, 0.0)

  def setReplacementValue(value: Double): this.type = set(replacementValue, value)

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema, logging = true)

    // Normalise: (x - mean) / stdDev
    var result = dataset.withColumn($(outputCol), (col($(inputCol)) - $(mean)) / $(stdDev))
    $(handleInvalid) match {
      case "keep" => result = result.withColumn($(outputCol), when(col($(outputCol)).isNull || col($(outputCol)).isNaN, lit($(replacementValue))).otherwise(col($(outputCol)).cast("Double"))) // Now replacing Nulls and NaNs with replacementValue
      case "skip" => result = result.filter(col($(outputCol)).isNotNull).filter(col($(outputCol)) === col($(outputCol))) // This will remove both Nulls and NaNs
      case "error" => throw new SparkException("""Error mode not supported, sorry""")
    }
    return result
  }
}

And the model will train without further problems, as long as we remember to set “missing” -> 0.

val normalizerWithReplacementArray = numerical_columns.map(column_name => {
  new NumericNormalizerWithReplacement()
    .setInputCol(column_name)
    .setOutputCol(s"${column_name}_norm")
    .setHandleInvalid("keep")
    .setMean(data.agg(mean(column_name)).head().getDouble(0))
    .setStdDev(data.agg(stddev(column_name)).head().getDouble(0))
})

val xgbParam = Map("eta" -> 0.3,
  "missing" -> 0,
  "max_depth" -> 3,
  "objective" -> "multi:softprob",
  "num_class" -> 2,
  "num_round" -> 100,
  "num_workers" -> 3,
  "seed" -> 123)

val xgbClassifier = new XGBoostClassifier(xgbParam).setFeaturesCol("features").setLabelCol("label")

val na_2_stages = indexerArray ++ normalizerWithReplacementArray ++ Array(labelIndexer, na_assembler, xgbClassifier, labelConverter)
val na_2_pipeline = new Pipeline().setStages(na_2_stages)
val na_2_model = na_2_pipeline.fit(training)

Best Option — Features as custom Sparse Vector

High hopes for this one. The documentation says it is the best approach, and I believe it. Not blindly, but it makes sense: if we take the NaNs out of the vectors, they can’t cause an error anymore. And we have plenty of evidence that it will work with sparse vectors.

The idea here is to replace the assembler entirely. Our brand new custom-made assembler will return only sparse vectors, and it will omit only NaNs. As we are running the normaliser beforehand, it is guaranteed that all null values are now NaNs. If you want to make it more generic (to consider both NaNs and nulls), you can change the vectorizeRow function as you wish. You may even create a parameter to tell it which value should be considered missing.

Nevertheless, being very pragmatic, we will just grab the data, and put it into a SparseVector, omitting NaN values. Here’s my approach:

  • vectorizeRow is where the magic happens. As we will iterate over all the rows, nothing better than making it take a Row and return a SparseVector. The SparseVector constructor takes 3 inputs: the size of the vector, an Array with the indices of the stored values, and an Array with those values. You can check in the code that it is very straightforward: grab the Row, get its size, transform it into an array, get the indices of the non-NaN values, get those values, construct and return the vector.
  • SparseVectorAssembler is going to be very similar to VectorAssembler. So similar that I will just extend it and override the transform part (even the transformSchema can stay the same). Crazy, right? Not so much if you think of it as just a small twist on the original VectorAssembler.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions.{col, struct, udf}

// Turn a Row of doubles into a SparseVector, keeping only the non-NaN entries
def vectorizeRow(row: Row): org.apache.spark.ml.linalg.SparseVector = {
  val size = row.length
  val row_array = row.toSeq.asInstanceOf[Seq[Double]].toArray
  val indices = row_array.zipWithIndex.filter(!_._1.isNaN).map(_._2)
  val values = row_array.zipWithIndex.filter(!_._1.isNaN).map(_._1)
  return new SparseVector(size, indices, values)
}
def vectorizeRowUDF = udf(vectorizeRow _)

class SparseVectorAssembler extends VectorAssembler {

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema, logging = true)
    return dataset.withColumn($(outputCol), vectorizeRowUDF(struct($(inputCols).map(col): _*)))
  }
}
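If you do want the more generic behaviour mentioned above (treating nulls as missing too, without relying on the normaliser), the row conversion could look roughly like this; a sketch, with names of my own:

import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.Row

def vectorizeRowGeneric(row: Row): SparseVector = {
  // Treat both nulls and NaNs as missing; keep everything else
  val values = (0 until row.length).map(i => if (row.isNullAt(i)) Double.NaN else row.getDouble(i))
  val kept = values.zipWithIndex.filter { case (v, _) => !v.isNaN }
  new SparseVector(row.length, kept.map(_._2).toArray, kept.map(_._1).toArray)
}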

Beautiful. Now we don’t have to worry about 0s being considered missing. But will it work? Only in 0.8 and 1.0, and only if you remember to set “allow_non_zero_for_missing” -> true.
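One piece not shown in the snippet below is the na_3_assembler itself; in the notebook it is presumably just our new SparseVectorAssembler wired like the earlier na_assembler, along these lines:

val na_3_assembler = new SparseVectorAssembler()
  .setInputCols((index_columns.map(name => s"${name}_index") ++ numerical_columns.map(name => s"${name}_norm")).toArray)
  .setOutputCol("features")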

val na_3_xgbParam = Map("eta" -> 0.3,
  "allow_non_zero_for_missing_value" -> true,
  "allow_non_zero_for_missing" -> true,
  "max_depth" -> 3,
  "objective" -> "multi:softprob",
  "num_class" -> 2,
  "num_round" -> 100,
  "num_workers" -> 3)

val na_3_xgbClassifier = new XGBoostClassifier(na_3_xgbParam).setFeaturesCol("features").setLabelCol("label")

val na_3_stages = indexerArray ++ normalizerArray ++ Array(labelIndexer, na_3_assembler, na_3_xgbClassifier, labelConverter)
val na_3_pipeline = new Pipeline().setStages(na_3_stages)
val na_3_model = na_3_pipeline.fit(training)

auROC: Double = 0.9270027614523798

No improvement in AUC. But that was not the point. ML is not black and white; the best approach will depend on the problem and the data. Instead of 1% NAs, what if it were 10%, 30%, or 80%? Filtering out NAs is probably not going to yield better results in those cases.

Final remarks

You were given two approaches for dealing with NAs (instead of filtering them out). I would strongly recommend using the last one. As you saw, it is really easy to implement (no matter what the documentation says about extra work), and it treats NAs as what they are: missing values. It also removes any confusion caused by sparse vectors omitting 0s.

The other approach was replacing NAs with a value. Let me briefly talk about that, as promised earlier.

I am a big fan of replacing NAs with the mean of the distribution. But this may skew the model when you have a lot of NAs. You can do something more sophisticated and replace them with draws from N(m, s), i.e. random numbers from a normal distribution with mean m and standard deviation s. This keeps your distribution “unharmed”. If it is normally distributed, of course. If you know more about your feature distribution beforehand, I highly recommend a tailor-made approach (including for the normalisation).
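A hedged sketch of that idea on a normalised column (the normalized/x_norm names are mine): after normalisation the feature is roughly N(0, 1), so Spark’s randn can supply the replacement draws.

import org.apache.spark.sql.functions.{col, randn, when}

val imputed = normalized.withColumn("x_norm",
  when(col("x_norm").isNaN, randn(123)) // a fresh draw from N(0, 1) for each missing value
    .otherwise(col("x_norm")))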

But you must always ask yourself: what does a missing value actually represent? Quick example: suppose you run a survey and ask whether the respondent owns a car. If they say yes, you ask how many cars; if they say no, they are not shown that question. In the dataset, that may show up as a missing value for everyone who answered no to the car question. Now, is it fair to fill the gaps with the mean of the answers? Definitely not; in this case it is pretty clear that those missing values should be replaced by zero.

Another case where this may be a problem is when you have a lot of missing values. If 80% of the values are missing, for instance, filling them with the mean (with or without random noise) will certainly bias your model towards the mean. Adding random noise won’t help much either: if the variance in the data is just random, you won’t be able to model it (as it is random). You may fool yourself and overfit the model, but in the end, it is just noise. Any good model will just tell you that the variables don’t really matter, and return the mean of your output as the answer. Every machine learning model is trying to do the same thing: find correlation between features and output, to build a predictive model. And if you remember your statistics books correctly, what is correlation but the covariance of the variables divided by the product of their standard deviations? In other words, it is just a measure of whether the variance of two variables seems to be coupled together. When one goes above its average, does the other go too?
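In symbols, that is simply: corr(X, Y) = cov(X, Y) / (σ_X · σ_Y).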

Therefore, when I say I am a big fan of replacing NAs with the mean, it is not without boundary conditions. For a small portion of NAs (like the 1% in our example), it should be more than fine. For 80%+ missing data, definitely don’t do it. And you may ask: what kind of problem has so many missing values? A lot of them, actually. As we navigate a world of unstructured data, you will find a lot of holes in your datasets. Imagine what kind of features Google’s search algorithm uses to show YOU (specifically) the best results for a given query. Or the features Facebook uses to determine the best ad to display to you. Imagine all the data they have about people, and all the holes in it.

But let me finish by leaving you with the mother of sparse data problems, the one made famous in 2009: the Netflix Challenge. Given the ratings every single user gave to individual movies, what is the best estimate of that user’s rating for all the other movies (the ones they didn’t rate)? This is the core of Netflix’s recommendation system. I won’t talk about the solution itself, just the sparsity aspect of it. Millions of users, tens of thousands of movies. Even if we make the very generous assumption that the average user rates 100 movies/TV shows, that would still leave us with (at least) 99% missing data. How would you feel about filling the gaps with the mean?

I hope this gave you some clarity on how XGBoost4j works, how it treats missing values, and how best to treat missing values yourself.
