Reusable ML code snippets for everyday use — Part-I
If you are an ML professional or a student, you deal with many pieces of code that are reused across projects. Rewriting them can eat a fair amount of time before you even reach the modelling and hyperparameter-tuning stage. Keeping that in mind, I have created this blog to collate many such code blocks. If you find an important piece missing from this list, please suggest it in the comments and I will add it here. I use Google Colab for all my analysis and exploratory coding. Though I have tried most of the other IDEs, Colab is the one you can settle on for every type of model, whether you need a GPU for a CNN or a lot of RAM for a large dataset. Needless to say, it runs on the cloud. It has a customized flavour of the Jupyter notebook which makes it even more user-friendly. Let’s not waste time and jump to the snippets from the next section onwards.
1. Download data from Kaggle. We will use the data from the ASHRAE Energy Prediction contest. This is very useful if you are working on any of the Kaggle challenges.
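A minimal sketch of the download step, assuming you work in Colab and have generated a Kaggle API token (kaggle.json); the competition slug below is an assumption based on the ASHRAE contest page:

```python
# Colab shell commands: install the Kaggle CLI and register the API token
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/          # kaggle.json uploaded via the Colab file pane
!chmod 600 ~/.kaggle/kaggle.json

# Download and unpack the competition files
!kaggle competitions download -c ashrae-energy-prediction
!unzip -q ashrae-energy-prediction.zip
```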
2. Let’s load the CSV files and use a few basic functions to inspect our data. These functions give you a sense of your data in terms of outliers, the NaN ratio and the category columns. The output shown in the image is for the last command. “value_counts()” becomes very important for categorical features because most of the plots available for basic EDA are mainly for numerical features.
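A sketch of the inspection calls, assuming the competition’s train.csv is in the working directory (meter is one of its categorical columns):

```python
import pandas as pd

train = pd.read_csv('train.csv')

train.head()                   # first rows, a quick sanity check
train.info()                   # dtypes and non-null counts
train.describe()               # min/max/quantiles help spot outliers
train.isna().mean()            # NaN ratio per column
train['meter'].value_counts()  # distribution of a categorical feature
```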
3. Exploratory data analysis. This is a handy way to dive a bit deeper into your data using powerful charts. Just be mindful enough to create an “insights vs. action” map in your plan. It is not uncommon to do the EDA and move to the next step of pre-processing without ever acting on the insights.
a. First of these would be the heatmap. It gives a very clear view of the correlation between all pairs of features. We will use the heatmap function of the Seaborn library. Use “figsize” to adjust the size of the map, especially when you have a lot of features.
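A minimal sketch with Seaborn, assuming the train DataFrame from above (numeric_only keeps non-numeric columns out of the correlation):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Enlarge figsize when the feature count grows
plt.figure(figsize=(12, 8))
sns.heatmap(train.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
```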
b. Next would be the boxplot. It gives us a consolidated view of the median, the quantiles and, most importantly, the outliers in the data. We will use the boxplot function of the Seaborn library. You may plot a boxplot in multiple flavours, e.g. all columns in one chart, a numerical feature against a categorical feature, etc. As we can see, a lot of outliers are shown on the plot as dots. In case you are not aware of how to read a boxplot, please check this.
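A sketch of both flavours; using the meter column as the categorical axis is an assumption based on the ASHRAE train set:

```python
# All numeric columns in one chart (scale differences may squash some boxes)
plt.figure(figsize=(12, 6))
sns.boxplot(data=train.select_dtypes('number'))
plt.show()

# A numeric feature against a categorical one
sns.boxplot(x='meter', y='meter_reading', data=train)
plt.show()
```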
c. Third and last is the pairplot. It gives a zoomed view of the correlation among features; we should use it mostly to check the correlation of each input feature with the output feature, which in this case is meter_reading. We will use the pairplot function of the Seaborn library. I have selected 3 features to keep the plot uncluttered, but you should do this for all the non-categorical features.
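A sketch assuming building_metadata.csv from the same download has been merged in; the three columns picked here are illustrative:

```python
building = pd.read_csv('building_metadata.csv')
merged = train.merge(building, on='building_id')

# Three illustrative features; repeat with all non-categorical ones
sns.pairplot(merged[['square_feet', 'year_built', 'meter_reading']].dropna())
plt.show()
```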
4. Generic code snippets. This is the last part of this post. In this section, we will see a healthy list of Python snippets that we need very frequently.
reset_index — Doing a lot of operations on a DataFrame might leave its index unordered, and that can cause a join to produce NaN rows. We must reset the index prior to the join.
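A minimal sketch; drop=True discards the stale index instead of keeping it as a column:

```python
# After filtering or sorting, the index is no longer 0..n-1;
# reset it before a positional join or concat
train = train.reset_index(drop=True)
```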
inplace — Many pandas functions support this parameter; just be mindful of it and use it as much as possible to avoid unnecessary copies.
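The two calls below are equivalent; a sketch using fillna:

```python
# Alternative 1: reassign the returned copy
train = train.fillna(0)

# Alternative 2: modify the DataFrame in place, no reassignment needed
train.fillna(0, inplace=True)
```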
timestamp — The code snippet converts the string feature into a timestamp, from which we can extract individual values for day, month, year, etc. and make new features. This helps a lot in feature extraction, as a raw timestamp in itself doesn’t add much variance in many data scenarios; a day or month feature will do better.
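A sketch on the ASHRAE timestamp column:

```python
# Parse the string column, then derive coarser date parts as new features
train['timestamp'] = pd.to_datetime(train['timestamp'])
train['hour'] = train['timestamp'].dt.hour
train['day'] = train['timestamp'].dt.day
train['month'] = train['timestamp'].dt.month
train['year'] = train['timestamp'].dt.year
```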
loop on column — This is a very simple but quite effective approach to analyse different attributes of your features. Most of the attributes depend on the column type, i.e. string or numeric. The same snippet also fills NaNs, i.e. fillna(), for categorical and numeric data. I have used the most frequent value for categorical features and the mean() for numeric features. You may change the method, i.e. 0, some constant, median(), or dropping the row, as per your analysis of the respective feature.
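A sketch of the loop, matching the imputation choices described above (mode for object columns, mean for numeric ones):

```python
for col in train.columns:
    print(col, train[col].dtype, train[col].isna().sum())
    if train[col].dtype == 'object':
        # Categorical: most frequent value
        train[col] = train[col].fillna(train[col].mode()[0])
    elif pd.api.types.is_numeric_dtype(train[col]):
        # Numeric: mean (swap in median(), a constant, or dropna as needed)
        train[col] = train[col].fillna(train[col].mean())
```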
That was all from me for this post. I will continue in Part-II, where I will cover scaling, binning, one-hot and label encoding, memory optimization for DataFrames, cross-validation, code for the different metrics used in Kaggle, and a few more pieces. The focus will be more on ML models. Below is the link to the Colab notebook with the code for this post. Please do post your comments on this post and on the notebook.