The Lazy LazyPredict Prediction: An Exercise in Automated Python Libraries
LazyPredict is an excellent tool that automatically runs your data through many different types of models and reports each model’s performance. It can save a great deal of time that would otherwise be spent manually testing different models. However, as I looked into this library, I could not help but wonder what other parts of the workflow I could hand off to Python libraries.
Ultimately, I landed on Feature Engine and Feature Tools to carry out EDA (exploratory data analysis) and feature engineering. With these two libraries and LazyPredict, I ran through a little exercise I have titled “The Lazy LazyPredict Prediction.”
The Lazy LazyPredict Prediction
This post provides an overview of the process I went through; for a more in-depth look, the full notebook can be found in this repository.
For this project, I decided to use Kaggle’s Titanic dataset.
After opening the data, I first looked into missing (NaN) values. A quick check showed quite a few missing for Age. Imputing Age from related groupings, for example the average age per passenger class or fare bracket, might have produced more accurate values, but that felt outside the spirit of this project. As such, I simply utilized the MeanMedianImputer from Feature Engine.
Having imputed a value for the missing Ages, I dropped the sparse Cabin column and the two rows missing Embarked values.
Working with Extreme Values
Having dealt with missing information, I moved on to extreme values. Extreme values can skew model predictions as the model attempts to accommodate the varying scale of the data. One way to work around this is to cap values at a certain threshold, something Feature Engine allows for.
Calling .describe() on the dataframe, I took a look at the different values. Looks like Age, SibSp, Parch, and Fare have some particularly high values.
Take the Age variable, for example. The mean is 29.70 and the 75th percentile is 35, yet the max is 80. The other variables show similar gaps between the max value and the rest of the distribution, in particular Fare, where the max is 512 despite a mean of 32.20.
So, time to do a little value capping with Feature Engine’s Winsorizer.
After running this, I called .describe() on the dataframe once more and obtained the following results:
This successfully brought down the extreme values to a max of three standard deviations from the mean. For example, Fare’s previous max was 512.33, but is now 181.28.
Finally, onto feature engineering. For this, I utilized the Feature Tools library.
First, I created a dictionary that told Feature Tools what type of data each column would be. This is to allow the feature engineering function to recognize what kind of interactions it should make. I assigned this to the variable “variable_types” and then set an entity for Feature Tools using these types.
Next, with the base entity set, I created some relationships that I wanted Feature Tools’ Deep Feature Synthesis to explore and create features from.
And now, all that is left to do is run the Deep Feature Synthesis tool to create new features!
Prior to this, I had ended up with 8 features after data cleaning and dropping columns. After running Feature Tools’ Deep Feature Synthesis, however, I had 50 features. That sounds great, but how does it perform in a model? Time for some LazyPredict.
Testing Models with LazyPredict
First, I split my data into a train/test split utilizing sklearn’s train_test_split.
Next, I instantiated the LazyClassifier and ran it with my data. As a side note, LazyClassifier automatically performs scaling and such on the data. Since that process may differ from how you choose to scale/normalize your data, the end result may differ in your final model.
Finally, I viewed the output.
Looks like I will be using Ada Boost Classifier for this!
Utilizing Feature Tools, Feature Engine, and LazyPredict, I was able to narrow down a model to use as well as engineer a slew of new features.
There were items I glossed over, in the spirit of almost exclusively utilizing the three libraries, but overall I was pleased with the results.
This was just a brief look at what these libraries are capable of, particularly Feature Tools. Quite a bit more can be done with them, and they all provide extensive notebook tutorials for their features, something I look forward to exploring in more depth. For those curious, here are the tutorial pages:
- Feature Tools
- Feature Engine