Stuck in Exploratory Data Analysis? Try Sweetviz

Learn to perform powerful EDA with just 2 lines of code.

Shubhangi Soni
DataX Journal
6 min read · Jun 10, 2020


Anyone who works with data sets first needs to explore and understand the data. Right? That process is called exploratory data analysis (EDA): visualizing, summarizing, and interpreting the information hidden in a data set. It is one of the crucial steps in data science because it yields insights and statistical measures, and it helps us select the important feature variables that will be used in our model.

Multiple libraries are available to perform basic EDA, but this time I am going to use one of the latest open-source Python libraries, Sweetviz (the GitHub repo is linked at the end of this article). It takes pandas data frames and creates a self-contained HTML report.

What is Sweetviz?

According to pypi.org:

Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with a single line of code. The output is a fully self-contained HTML application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

It creates insightful and beautiful visualizations with just a few lines of code. Few other libraries provide such a quick and detailed analysis.

We will be analyzing and comparing data sets with respect to target variables. I have used Jupyter Notebook to write the code. Jupyter Notebook is a kind of diary for data analysts and scientists: a web-based platform where you can mix Python, HTML, and Markdown to explain your data insights.

Analyzing a single dataframe

For this article, we will be analyzing the House Prices data set and the Students Performance in Exams data set from Kaggle.

With Sweetviz we can analyze as well as compare data in just a few lines of code. We will use the Students Performance data set to analyze a single data frame, and the House Prices data set to compare the train and test sets.

Let’s start!

Install Sweetviz using pip install sweetviz. Load the pandas data frames with pd.read_csv(), then call either analyze() or compare() depending on your need. You can refer to the full documentation on GitHub. For now, let’s start with loading the data set:
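The original code cell did not survive in this text, so here is a minimal sketch of the loading step. To keep it runnable without the Kaggle download, it reads a few made-up rows from an in-memory string; I am assuming the column names match the Students Performance in Exams CSV. In your own notebook you would pass the path of the downloaded file to pd.read_csv() instead.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the Kaggle file (assumption: the real
# CSV uses these column names). In practice: pd.read_csv("<path to your CSV>").
SAMPLE_CSV = """gender,lunch,test preparation course,math score,reading score,writing score
female,standard,none,72,72,74
female,standard,completed,69,90,88
male,free/reduced,none,47,57,44
male,standard,none,76,78,75
female,standard,completed,88,95,92
male,free/reduced,completed,64,64,67
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# tail(n) returns the last n rows; with no argument it returns the last 5.
last_rows = df.tail(3)
print(last_rows)
```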

The pandas tail() method returns the bottom n rows of a data frame (5 by default).


Now we are going to analyze the data with “math score” as our target variable. (You can refer to my GitHub for the code.)

Running the following command will perform the analysis and create the report object. To get the output, simply use the show_html() command:
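Here is a minimal sketch of those two calls, assuming the data frame loaded earlier is named df. analyze() and show_html() are the Sweetviz calls named in this article; the wrapper function build_report, the default file name, and the lazy import are illustrative choices of mine, not part of the library.

```python
import pandas as pd


def build_report(df: pd.DataFrame, target: str = "math score",
                 out_path: str = "student_report.html"):
    """Analyze df against a target column and write the HTML report."""
    # Cheap sanity check before the (slower) analysis starts.
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found in the data frame")

    # Imported lazily so this cell can be defined before sweetviz is installed.
    import sweetviz as sv

    report = sv.analyze(df, target_feat=target)  # line 1: build the report object
    report.show_html(out_path)                   # line 2: write the HTML file
    return report
```

In a notebook cell, build_report(df) is then all you need; by default show_html() also opens the generated file in your browser.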

You will get the following output:

The content of the details depends on the type of the variable being analyzed: the details for a categorical (or boolean) variable differ from those for a numerical one such as our target, “math score”. You can see what amazing visualizations we get in just a few lines of code. This was just the analysis part; let’s now compare the train and test sets of the House Prices data.

Comparing two data frames (e.g. Test vs Training sets)

To compare two data sets, simply use the compare() function. Its parameters are the same as those of analyze(), except for an inserted second parameter that covers the comparison data frame. It is recommended to use the [dataframe, "name"] format for the parameters to better differentiate between the data frames.

Taking the House Prices data set, we have 2 data frames (train and test), and we would like to analyze the target value “SalePrice”, the property’s sale price in dollars. Note that in this case we know the name of the target column in advance, but specifying a target column is always optional. We can generate a report with this line of code:

Running this command will perform the analysis and the comparison and create the report object. To get the output, simply use the show_html() command:

After running this line of code, you will see something like this, which opens in your default browser:
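A minimal sketch of the comparison, using the [dataframe, "name"] format recommended above. compare() and show_html() are the Sweetviz calls; the wrapper name compare_train_test, the labels "Train" and "Test", and the output file name are my own illustrative choices.

```python
import pandas as pd


def compare_train_test(train: pd.DataFrame, test: pd.DataFrame,
                       target: str = "SalePrice",
                       out_path: str = "house_prices_report.html"):
    """Compare two data frames side by side in one Sweetviz report."""
    # The target only has to exist in the first (source) data frame.
    if target not in train.columns:
        raise KeyError(f"target column {target!r} not found in the training data")

    # Imported lazily so this cell can be defined before sweetviz is installed.
    import sweetviz as sv

    report = sv.compare([train, "Train"], [test, "Test"], target_feat=target)
    report.show_html(out_path)
    return report
```

Assuming the Kaggle files are saved as train.csv and test.csv, you would load each with pd.read_csv() and then call compare_train_test(train, test).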

Isn’t it amazing?

It gives precise information:

The summary shows us the characteristics of both data frames side by side. We can immediately see that the testing set is about the same size as the training set and that it contains the same features. The report specifies the memory consumed, types, unique values, missing values, duplicate rows, and most frequent values, and presents them in a table along with graphs.

Having problems correlating features?
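If you want to cross-check a few of those summary figures by hand, plain pandas is enough. This is only a sketch: the two tiny frames are made up (only the column names are borrowed from the House Prices data), and summarize is a hypothetical helper of mine, not something Sweetviz provides.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect a few of the figures shown in the report summary."""
    return {
        "rows": len(df),
        "features": df.shape[1],
        "duplicates": int(df.duplicated().sum()),
        "memory_kb": round(df.memory_usage(deep=True).sum() / 1024, 1),
    }


# Tiny made-up frames; the values are placeholders, not real house prices.
train = pd.DataFrame({"MSSubClass": [60, 20, 60],
                      "LotArea": [8000, 9500, 11000],
                      "SalePrice": [100000, 150000, 200000]})
test = pd.DataFrame({"MSSubClass": [20, 20, 60],
                     "LotArea": [7000, 12000, 10500]})

side_by_side = pd.DataFrame({"train": summarize(train), "test": summarize(test)})
print(side_by_side)

# Features present in the training frame only (here: just the target).
only_in_train = sorted(set(train.columns) - set(test.columns))
print(only_in_train)
```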

Here’s the solution!

Hovering your mouse over the “Associations” button in the summary will make the Associations graph appear on the right-hand side:

It shows a dense map because our data set has many features. You can try it with other data sets. In addition to the traditional numerical correlations, the graph also shows the uncertainty coefficient (for categorical-categorical pairs) and the correlation ratio (for categorical-numerical pairs). Squares represent associations that involve a categorical feature, and circles represent numerical-numerical correlations. The stronger the color, the larger the correlation magnitude.

Some of these correlations may not be accurate, because they make assumptions about the underlying distribution of the data and the relationships. However, they can be a very useful starting point.
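To make the two less familiar measures concrete, here is a small from-scratch sketch based on their textbook definitions. It is not Sweetviz’s internal code, and the function names and toy data are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Eta: how much of the spread in `values` is explained by `categories` (0 to 1)."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Spread of the group means around the overall mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                  for g in groups.values())
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(x, y):
    """Theil's U(x|y): share of the uncertainty in x removed by knowing y (0 to 1)."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x


# Toy data: lunch type strongly separates the scores.
lunch = ["standard", "standard", "free", "free"]
scores = [80, 90, 50, 60]
print(correlation_ratio(lunch, scores))

# y determines x completely in the first call and tells us nothing in the second.
print(uncertainty_coefficient(["a", "a", "b", "b"], ["p", "p", "q", "q"]))
print(uncertainty_coefficient(["a", "b", "a", "b"], ["p", "p", "q", "q"]))
```

Note that the uncertainty coefficient is asymmetric: U(x|y) and U(y|x) are generally different numbers.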

IMPORTANT: only numerical and boolean features can be targets currently.
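A quick way to see which columns qualify is to check their pandas dtypes before choosing a target. This is a rough sketch: usable_targets is a hypothetical helper, the frame is made up (the “passed” column in particular is invented), and Sweetviz’s own type detection may differ from a plain dtype check.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def usable_targets(df: pd.DataFrame) -> list:
    """Columns whose dtype is numerical or boolean."""
    return [col for col in df.columns
            if is_numeric_dtype(df[col]) or is_bool_dtype(df[col])]


students = pd.DataFrame({
    "gender": ["female", "male", "female"],  # text, so not usable as a target
    "math score": [72, 47, 88],              # numerical, usable
    "passed": [True, False, True],           # boolean, usable
})
print(usable_targets(students))
```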

Detail area (categorical/boolean)

When you move the mouse to hover over any of the variables, an area to the right will showcase the details. Here, we can see the exact statistics for each class.

As you can see, MSSubClass has no missing values, but LotFrontage has missing values in both the test and train data frames. You also get the details of the associations with each of the other features.

IMPORTANT: For the moment, you need to zoom out of the page to see the full detail area.
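The same missing-value check can be reproduced in pandas with isna(). The rows below are made up so the snippet runs on its own; only the two column names come from the House Prices data.

```python
import pandas as pd

# Made-up rows; only the column names come from the House Prices data.
train = pd.DataFrame({"MSSubClass": [60, 20, 70, 60],
                      "LotFrontage": [65.0, None, 60.0, None]})
test = pd.DataFrame({"MSSubClass": [20, 20, 60],
                     "LotFrontage": [80.0, None, 74.0]})

columns = ["MSSubClass", "LotFrontage"]

# Count missing values per column in each frame and line them up side by side.
missing = pd.DataFrame({
    "train": train[columns].isna().sum(),
    "test": test[columns].isna().sum(),
})
missing["train %"] = (100 * missing["train"] / len(train)).round(1)
missing["test %"] = (100 * missing["test"] / len(test)).round(1)
print(missing)
```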

Here are the buttons on top of the graph. These buttons change how many “bins” are shown in the graph. Now it is in auto mode.

graphs shown with different bins
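To get a feel for what changing the number of bins does, you can bin the same values yourself with pd.cut(). The scores below are made up, and equal-width binning is my assumption for illustration; I have not checked how Sweetviz picks its bin edges.

```python
import pandas as pd

# Made-up scores, just to show the effect of the bin count.
scores = pd.Series([35, 42, 47, 55, 58, 61, 64, 66, 69, 72, 74, 76, 81, 88, 95])

for n_bins in (3, 5, 10):
    # pd.cut splits the value range into n_bins equal-width intervals.
    counts = pd.cut(scores, bins=n_bins).value_counts(sort=False)
    print(f"{n_bins} bins -> {counts.tolist()}")
```

Fewer bins give a smoother, coarser picture; more bins show finer detail but also more noise.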

Conclusion

All this information from just two lines of code!

Sweetviz is an amazing library. It will definitely help users to easily deal with new data sets. I hope you will find it a useful tool in your own data analysis.

You can refer to my GitHub for the code.

Documentation: https://github.com/fbdesignpro/sweetviz

Cheers,

Shubhangi Soni
