Data Visualization Techniques to Analyze Outcomes of Feature Selection

How to build quick visualisation apps with Streamlit for ML projects during the process of Feature Selection

Anshuk Pal Chaudhuri
Nightingale
7 min read · Jul 6, 2020

A Machine Learning project is never really complete if we don’t have a good way to showcase it. In the past, a well-made visualisation or a small deck was enough to showcase a data science project. With the advent of data visualisation tools like Power BI, Tableau, and Qlik Sense, a good data scientist now needs a fair bit of knowledge of web frameworks to keep up.

And web frameworks are hard to learn. I still get lost in all the HTML, CSS, and JavaScript, and in the trial and error needed for something that seems simple to do.

On top of that, there are many ways to do the same thing, which is confusing for us data science folks for whom web development is a secondary skill.

This is where Streamlit comes in and delivers on its promise of creating web apps using just Python. It is worth stressing that Streamlit web apps are not only for visual storytelling when the data is finally consumed. They are just as useful during the project development and iteration phase, where they help team members, managers, and even the client's technical stakeholders understand a complex data modelling process in a visually intuitive way.

Objective

Hard questions that require deep domain and technical knowledge: how can we make them self-exploratory?

It is invariably difficult for team members, managers, and client stakeholders to answer questions like:

  • Which features have been picked up for a predictive model?
  • On what basis were the other features not selected?
  • Have you taken all the different algorithms into consideration?
  • Do the chosen features change as and when we receive new data?

As a data scientist, I can answer these questions, but what if the consumers need to explore the answers for themselves?

The answers to the above questions may require deep knowledge of the problem domain. It is possible to automatically identify the features in your data that are most useful or most relevant for the problem you are working on, using a process called feature selection. But can the results be viewed and consumed by anyone else?

Self Service

We will quickly go through the basic concepts of Feature Selection and then create a visualization web application that displays the results of Feature Selection.

Feature Selection Techniques

Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.

Having too many irrelevant features in your data can decrease the accuracy of the models. Three benefits of performing feature selection before modeling your data are:

  • Reduces over-fitting: Less redundant data means less opportunity to make decisions based on noise.
  • Improves accuracy: Less misleading data means modeling accuracy improves.
  • Reduces training time: Less data means that algorithms train faster.

Feature selection is also called variable selection or attribute selection. It is the automatic selection of attributes in your data (such as columns in tabular data) that are most relevant to the predictive modeling problem you are working on.

Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the data set, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.
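
The difference can be made concrete with a small sketch. It is only an illustration on synthetic data: the data set from make_classification, the choice of SelectKBest as the selection method, and PCA as the reduction method are all assumptions, not part of the app described in this article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (an assumption for illustration): 200 rows, 10 numeric features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=2, random_state=1)

# Feature selection: keep 4 of the original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Dimensionality reduction: build 4 new columns as combinations of all 10.
X_projected = PCA(n_components=4).fit_transform(X)

print("Columns kept by selection:", kept)
print("Selected values are original columns:", np.array_equal(X_selected, X[:, kept]))
print("Shapes:", X_selected.shape, X_projected.shape)
```

Both outputs have four columns, but only the selected ones can be traced back to named columns in the original data, which is what makes feature selection easier to explain to stakeholders.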

Feature selection methods aid you in your mission to create an accurate predictive model. They help you by choosing features that will give you as good or better accuracy whilst requiring less data.

Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model. Fewer attributes are desirable because it reduces the complexity of the model, and a simpler model is simpler to understand and explain.

In this article, we cover one algorithm, Recursive Feature Elimination (RFE for short), and show how it can be tied to a visual web app to explore data and understand the importance of feature selection.

RFE is a popular feature selection algorithm. It is popular because it is easy to configure and use and because it is effective at selecting those features (columns) in a training data set that are more or most relevant in predicting the target variable.

There are two important configuration options when using RFE: the choice in the number of features to select and the choice of the algorithm used to help choose features. Both of these hyper-parameters can be explored, although the performance of the method is not strongly dependent on these hyper-parameters being configured well.
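
As a rough sketch of these two options, the snippet below runs scikit-learn's RFE on synthetic data. The data set, the choice of five features, and the random seeds are assumptions made for illustration; the app in this article uses its own data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# The two configuration options: the estimator that ranks the features,
# and how many features to keep.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=1), n_features_to_select=5)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept features and
# grows for features that were eliminated earlier.
for index, (kept, rank) in enumerate(zip(rfe.support_, rfe.ranking_)):
    print(f"feature {index}: selected={kept}, rank={rank}")
```

The support_ and ranking_ attributes are the raw material for answering "which features were picked, and which were dropped first?" in a visual app.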

Self-Serviceable App — helping consumers to explore data and understand which features are important

Using Streamlit for Interactive Web Apps & Visualisation for Feature Selection

Data at a Glance

  • The data used here is public data stocks_data.csv
  • There are 100 independent variables, all numeric. These are stock prices at different times.
  • There is one binary output to be predicted: whether to sell (yes or no).

The goal here is to understand which variables/features are important, and how important they are, when deciding whether to sell.

Code Imports

from numpy import mean
from numpy import std
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
import streamlit as st
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from matplotlib import pyplot
import altair as alt
import seaborn as sn

Data Exploration

The data is loaded with pandas, and its first rows are displayed, using the following code.

Code Snippet to explore data

# st.cache is the caching decorator of the Streamlit release used here;
# newer releases provide st.cache_data for the same purpose.
@st.cache
def loadData():
    # First 100 columns are the features, the last column is the target.
    df = pd.read_csv("stock_data.csv")
    X = df.iloc[:, 0:100]
    y = df.iloc[:, -1]
    return df, X, y

# Basic splitting required for all the models.
def split(X, y):
    # 1. Splitting X,y into Train & Test (80/20, fixed seed for repeatability)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    return X_train, X_test, y_train, y_test

def main():
    st.title("Using Streamlit Apps - Feature Selection for Classification Problems using various Machine Learning Classification Algorithms")
    df, X, y = loadData()
    page = st.sidebar.selectbox("Choose a page", ["Homepage", "Exploration"])
    if page == "Homepage":
        st.header("This is your data explorer.")
        st.write("Please select a page on the left.")
        st.subheader("Showing raw data...")
        st.write(df.head())

Code Snippet to explore pattern among two variables

def main():
    st.title("Using Streamlit Apps - Feature Selection for Classification Problems using various Machine Learning Classification Algorithms")
    df, X, y = loadData()
    page = st.sidebar.selectbox("Choose a page", ["Homepage", "Exploration"])
    if page == "Homepage":
        st.header("This is your data explorer.")
        st.write("Please select a page on the left.")
        st.subheader("Showing raw data...")
        st.write(df.head())
    elif page == "Exploration":
        st.title("Data Exploration")
        # Let the user pick any two columns and plot them against each other.
        x_axis = st.selectbox("Choose a variable for the x-axis", df.columns)
        y_axis = st.selectbox("Choose a variable for the y-axis", df.columns)
        visualize_data(df, x_axis, y_axis)

Comparing Two Features based on Data Points

The app also lets the user select a feature selection algorithm and see, with supporting visualisations, which features matter and how much. For RFE, it answers the following:

Feature Selection Score

  • Accuracy of RFE Selection using DecisionTreeClassifier
  • Visualization (box-plot) showing the same

Using the RFE model pipeline as a final model to make predictions for classification

  • Accuracy of Prediction RFE using DecisionTreeClassifier
  • Report of RFE using DecisionTreeClassifier

This way, one can easily understand which features are important and how important they are. The same code can be extended with more algorithms, such as SelectKBest, so that the user can pick another algorithm as needed.
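
As an illustration of such an extension, the sketch below swaps RFE for SelectKBest inside a scikit-learn Pipeline. It is an assumption about how the extension could look, not code from the app: the synthetic data, the f_classif scoring function, and k=5 are all placeholder choices.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=3, random_state=7)

# Same shape as an RFE pipeline: a selection step followed by a classifier.
pipeline = Pipeline(steps=[
    ("select", SelectKBest(score_func=f_classif, k=5)),
    ("model", DecisionTreeClassifier(random_state=7)),
])

# Cross-validated accuracy, with selection refitted inside each fold.
scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
print(f"Mean accuracy with SelectKBest: {mean(scores):.3f}")

# After fitting, scores_ holds one univariate score per input feature.
pipeline.fit(X, y)
f_scores = pipeline.named_steps["select"].scores_
```

The per-feature values in f_scores could feed a bar chart in the app, giving a second view of feature importance alongside the RFE ranking.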

Code snippet for selecting a Feature Selection model (RFE in this case) and exploring its results

# Runs inside main(); choose_model holds the algorithm picked by the user.
# get_models() and evaluate_model() are helpers not shown in this article.
if choose_model == "Recursive Feature Elimination":
    st.subheader("Which Feature is Important?")
    X_train, X_test, y_train, y_test = split(X, y)
    models = get_models()
    # evaluate the models and store results
    results, names = list(), list()
    for name, model in models.items():
        scores = evaluate_model(model, X_train, y_train)
        results.append(scores)
        names.append(name)
    # Fit the last pipeline evaluated in the loop and score it on the test split.
    model.fit(X_train, y_train)
    ypred = model.predict(X_test)
    predict_score = metrics.accuracy_score(y_test, ypred) * 100
    predict_report = classification_report(y_test, ypred)
    cm = metrics.confusion_matrix(y_test, ypred)
    st.subheader("Feature Selection Score")
    st.text("Accuracy of RFE Selection using DecisionTreeClassifier: ")
    st.write(round(mean(scores) * 100, 2), "%")
    # Box plot of the cross-validation scores, drawn on an explicit figure.
    fig, ax = pyplot.subplots()
    ax.boxplot(results, showmeans=True)
    ax.set_xticklabels(names)
    st.pyplot(fig)
    st.subheader("We can also use the RFE model pipeline as a final model and make predictions for classification.")
    st.text("Accuracy of Prediction RFE using DecisionTreeClassifier is: ")
    st.write(predict_score, "%")
    st.text("Report of RFE using DecisionTreeClassifier is: ")
    # classification_report returns preformatted text; st.text keeps its columns aligned.
    st.text(predict_report)
    # Confusion matrix heatmap; annot_kws sets the annotation font size.
    fig_cm, ax_cm = pyplot.subplots()
    sn.heatmap(cm, annot=True, annot_kws={"size": 16}, cbar=False, ax=ax_cm)
    st.pyplot(fig_cm)

Feature Selection

We can also use the RFE model pipeline as a final model and make predictions for classification.
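
The get_models() and evaluate_model() helpers called in the snippet are not shown in this article. One plausible shape for them, sketched under assumptions, is given below. The pipeline step names, the dictionary key, the number of features to select, and the cross-validation settings are all hypothetical, and the synthetic data only stands in for the stock data.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def get_models(n_features_to_select=5):
    """Hypothetical helper: an RFE + decision tree pipeline keyed by display name."""
    rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
              n_features_to_select=n_features_to_select)
    pipeline = Pipeline(steps=[("s", rfe), ("m", DecisionTreeClassifier(random_state=0))])
    return {"RFE-DecisionTree": pipeline}


def evaluate_model(model, X, y):
    """Hypothetical helper: accuracy scores from repeated stratified k-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)
    return cross_val_score(model, X, y, scoring="accuracy", cv=cv)


# Synthetic stand-in data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)
for name, model in get_models().items():
    scores = evaluate_model(model, X, y)
    print(f"{name}: {mean(scores):.3f} ({std(scores):.3f})")
```

Wrapping RFE and the classifier in one Pipeline means the same object can be cross-validated for the box plot and then fitted once more to make predictions on the test split.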

RFE as model for prediction (if required, optional)

Conclusion

Data Scientists can use this code to create web apps with visualisation to understand the features and their importance. The feature selection model can also be used to make predictions for classification problems.

Git Repository Code

The complete code is available here

The data is available here
