PROs and CONs of Rapid EDA Tools

George Vyshnya
Published in SBC Group Blog
7 min read, Oct 18, 2020
Image by Aaron Olson (https://pixabay.com/users/aaronjolson-4628445)

It’s better to solve the right problem approximately than to solve the wrong problem exactly.

John Tukey

Introduction

Since John Tukey coined the term EDA in his famous book, “Exploratory Data Analysis” (1977), EDA has become a mandatory practice in industrial Data Science/ML.

What started as John Tukey’s passion has eventually turned into one of the most resource-consuming phases of any modern ML/Data Analysis project. According to industry statistics, EDA typically takes around 30% of the effort in such projects nowadays.

In business environments (even more than in scientific ones), time pressure is one of the factors driving real-world ML and Data Science projects. You must deliver results that are not only of good-enough (or, even better, excellent) quality but also ready within a reasonably limited time frame.

Therefore, compressing the EDA time without compromising its quality could be one of the beneficial ways to improve the overall ML project performance. That is where the promise of automated rapid EDA tools starts to sound attractive.

Typically, EDA can be decomposed into two principal logical parts:

  • finding the project-specific insights (this will always require substantial domain expertise, as well as project-specific engineering to visualize the data and carry out other EDA activities)
  • doing the routine checks of the data features (statistical checks, missing values, distributions, feature pair interactions, etc.), drawing on a variety of EDA techniques and practices

The second area is where we can really expect to achieve significant time savings by leveraging appropriate EDA automation.

In this case study, we are going to demonstrate the benefits and pitfalls of automating EDA routines for Drug Mechanisms of Action data with two specialized Python packages for rapid EDA: SweetViz and AutoViz.

Note: the full source code of the case study is provided in the GitHub repo

Problem Statement and Dataset Details

Image credit: https://www.labiotech.eu/cancer/forx-therapeutics-cancer-treatment/

The goal was to check how useful (or less than useful) SweetViz and AutoViz could be in a real-world problem with quite an extensive dataset to analyze. The Kaggle competition ‘Mechanisms of Action (MoA) Prediction’ was selected as the problem to tackle.

The essence of the MoA Prediction competition is to build a multi-label classification model to predict the MoA of drugs based on observed gene expression and cell viability effects (observation features).

Note: if you would like to dive into the details of the data-driven insights for this problem, you can navigate to some of the publications listed in the References section at the end of this post.

Sweetness of SweetViz

In this case study, SweetViz was quite helpful in two operational modes:

  • Generating a detailed EDA report (in HTML format) for a Pandas dataframe, to analyze the feature-to-target label interactions
  • Generating a detailed EDA report (in HTML format) to compare two dataframes with a similar structure

Generating a detailed EDA report for a dataframe is easy: it takes only a couple of lines of code.
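A minimal sketch of such a report is given here, assuming the `sweetviz` package is installed. The synthetic dataframe, its column names (`g-0`, `c-0`, `cp_time`, `target`) and the helper names are illustrative assumptions, not the code from the repo; `analyze` and `show_html` are the SweetViz entry points.

```python
import numpy as np
import pandas as pd


def make_demo_frame(n_rows=200, seed=0):
    """Build a small synthetic stand-in for the training data (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "g-0": rng.normal(size=n_rows),              # gene expression-like feature
        "c-0": rng.normal(size=n_rows),              # cell viability-like feature
        "cp_time": rng.choice([24, 48, 72], size=n_rows),
        "target": rng.integers(0, 2, size=n_rows),   # one binary class label
    })


def build_sweetviz_report(df, target, out_path="sweetviz_report.html"):
    """Analyze df against a binary or numeric target and write an HTML report."""
    import sweetviz as sv  # imported lazily; requires `pip install sweetviz`

    report = sv.analyze(df, target_feat=target)
    report.show_html(out_path, open_browser=False)
    return out_path


demo_df = make_demo_frame()
# Uncomment to generate the report:
# build_sweetviz_report(demo_df, target="target")
```

Because the MoA problem is multi-label, a report like this would presumably be generated once per class label of interest.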

Comparing two datasets with SweetViz is equally simple.
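The comparison can be sketched as follows, again assuming `sweetviz` is installed. The `cp_type` column with the values `trt_cp` and `ctl_vehicle`, the synthetic data and the helper names are assumptions made for illustration; `compare` is the SweetViz function for two-dataframe reports.

```python
import numpy as np
import pandas as pd


def treated_only(df, flag_col="cp_type", treated_value="trt_cp"):
    """Keep only the rows of compound-treated samples."""
    return df[df[flag_col] == treated_value].reset_index(drop=True)


def compare_with_sweetviz(first, second, names=("Train", "Test"),
                          out_path="sweetviz_compare.html"):
    """Write an HTML report comparing two dataframes with the same columns."""
    import sweetviz as sv  # imported lazily; requires `pip install sweetviz`

    report = sv.compare([first, names[0]], [second, names[1]])
    report.show_html(out_path, open_browser=False)
    return out_path


rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "cp_type": ["trt_cp"] * 150 + ["ctl_vehicle"] * 50,
    "g-0": rng.normal(size=200),
})
test_df = pd.DataFrame({
    "cp_type": ["trt_cp"] * 80 + ["ctl_vehicle"] * 20,
    "g-0": rng.normal(loc=0.2, size=100),
})

train_treated = treated_only(train_df)
test_treated = treated_only(test_df)
# Uncomment to generate the report:
# compare_with_sweetviz(train_treated, test_treated)
```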

As a result, SweetViz helped to quickly draw the following insights:

  • Detect the difference in key statistical parameters of the same variables in the treated and control subsamples of training and testing datasets
  • Compare the skewness of numeric variables in the treated and control subsamples of training and testing datasets
  • Pick up important feature-to-class label interactions for the most frequent labels in the training dataset
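Findings like the skewness difference can be cross-checked by hand with plain pandas. The sketch below uses synthetic data; the column names and the `cp_type` values are assumptions chosen to resemble the competition data, not results from the case study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "cp_type": ["trt_cp"] * n + ["ctl_vehicle"] * n,
    # treated values come from a right-skewed distribution, control from a symmetric one
    "g-0": np.concatenate([rng.exponential(size=n), rng.normal(size=n)]),
    "c-0": rng.normal(size=2 * n),
})

# Rows: subsample; columns: sample skewness of each numeric feature.
skew_by_group = demo.groupby("cp_type")[["g-0", "c-0"]].skew()

# Difference in skewness between treated and control, per feature.
skew_gap = skew_by_group.loc["trt_cp"] - skew_by_group.loc["ctl_vehicle"]
print(skew_by_group.round(2))
```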

Power of AutoViz

In this case study, AutoViz helped to quickly accomplish the following:

  • Detecting the most significant features for the ML problem
  • Backing up the significance of those features with comprehensive visualizations

This is also accomplished with a few lines of code.
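A minimal sketch of an AutoViz run on an in-memory dataframe, assuming the `autoviz` package is installed. The call follows the parameter list of the 0.0.7x releases mentioned in this post, and newer versions may accept additional arguments; the wrapper function and the demo dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def run_autoviz(df, target, max_rows=150000, max_cols=30):
    """Run AutoViz on an in-memory dataframe and return whatever AutoViz returns."""
    from autoviz.AutoViz_Class import AutoViz_Class  # requires `pip install autoviz`

    av = AutoViz_Class()
    return av.AutoViz(
        filename="",          # empty because the data is passed through `dfte`
        sep=",",
        depVar=target,        # name of the target column
        dfte=df,
        header=0,
        verbose=0,
        lowess=False,
        chart_format="svg",
        max_rows_analyzed=max_rows,
        max_cols_analyzed=max_cols,
    )


rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "g-0": rng.normal(size=100),
    "c-0": rng.normal(size=100),
    "target": rng.integers(0, 2, size=100),
})
# Uncomment to produce the charts:
# reduced_df = run_autoviz(demo_df, target="target")
```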

AutoViz determines feature importance as follows:

  • It rejects highly correlated features
  • From the remaining features, it rejects those detected as non-significant by an internally trained XGBoost model
  • As a result, it outputs a dataframe with only the subset of features it found significant
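The two-stage idea above can be illustrated with a simplified sketch. This is not AutoViz's actual implementation: the thresholds and helper names are assumptions, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost to keep the example light on dependencies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def drop_correlated(X, threshold=0.9):
    """Drop the later feature of every pair whose absolute correlation exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


def select_important(X, y, min_importance=0.01, seed=0):
    """Keep features whose importance in a boosted-tree model reaches min_importance."""
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    importance = pd.Series(model.feature_importances_, index=X.columns)
    return X.loc[:, importance[importance >= min_importance].index]


rng = np.random.default_rng(0)
a = rng.normal(size=300)
X = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=300),  # near-duplicate of "a"
    "c": rng.normal(size=300),             # pure noise
})
y = (X["a"] > 0).astype(int)

reduced = drop_correlated(X)           # stage 1: correlation filter
selected = select_important(reduced, y)  # stage 2: model-based filter
print(list(selected.columns))
```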

Such discoveries are helpful to set the stage for promising ML modelling experiments down the road.

It is worth mentioning that AutoViz displayed exemplary performance and execution speed. The AutoViz-based part of the analysis presented in the source code repo took less than 10 minutes, whereas the operations with SweetViz took about 90 minutes to run on a local machine with the following hardware and system software specification:

  • 2 CPU Intel Core i7–8750H 2.2 GHz
  • 16 GB RAM
  • OS Windows 10
  • Anaconda with Python 3.7

Problem-specific Visualization Instruments

Rapid EDA tools yield quite a lot of insights for little time and effort, but they cannot capture everything. They are not magic wands, after all. There is always something you will have to do manually, specifically for each project.

In this case, manual effort went into investigating:

  • Label-to-label interactions
  • Class label-to-feature interactions

Custom multi-facet plots were built to visualize multiple features from the training dataset interacting with a particular class label.
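One way to build such a multi-facet plot is sketched here with plain matplotlib. The function name, the synthetic data and the column names are assumptions for illustration; the plots in the repo may be built differently.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_features_by_label(df, features, label, ncols=3):
    """Draw one histogram facet per feature, overlaying the groups of a class label."""
    nrows = int(np.ceil(len(features) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    for ax, feature in zip(axes.ravel(), features):
        for value, group in df.groupby(label):
            ax.hist(group[feature], bins=30, alpha=0.5, density=True,
                    label=f"{label}={value}")
        ax.set_title(feature)
        ax.legend()
    # Hide the facets left over when the grid is larger than the feature list.
    for ax in axes.ravel()[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
demo = pd.DataFrame({"label": labels})
for name in ["g-0", "g-1", "c-0", "c-1"]:
    # Shift each feature slightly for the positive class so the facets differ.
    demo[name] = rng.normal(size=400) + 0.8 * labels

fig = plot_features_by_label(demo, ["g-0", "g-1", "c-0", "c-1"], "label")
# fig.savefig("features_by_label.png")  # or plt.show() in an interactive session
```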

Also, a number of domain-specific insights for the ML problem could only be captured by a human expert looking at the data AND applying subject matter expertise (see more details in my Kaggle thread ‘MoA: EDA and Feature Importance Findings’).

Additional Word on Feature Importance

AutoViz did a great job detecting the important features for this project. However, it is always recommended to measure feature importance by multiple (preferably analytical) methods.

That is why I also ran an experiment with permutation feature importance. You can see it end-to-end in the source code repo.
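For readers unfamiliar with the technique, here is a minimal sketch of permutation importance using scikit-learn's `permutation_importance`. The synthetic data, the random forest model and the single binary target are assumptions made for illustration and do not reproduce the experiment in the repo.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=400),   # the only feature that drives the target
    "noise_1": rng.normal(size=400),
    "noise_2": rng.normal(size=400),
})
y = (X["signal"] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
ranking = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Measuring on held-out data rather than on the training set is a deliberate choice here: it reflects what the model needs in order to generalize, not what it memorized.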

As happens in most real experiments, the lists of significant features detected by AutoViz and by the permutation importance method are not identical (although they overlap on a good number of features). This opens the avenue to a comparative ML study to see which set of important features leads to a better ML model (in terms of its accuracy).
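The overlap between two such lists can be quantified with basic set operations. The feature names in this sketch are made up for illustration and are not the lists obtained in the case study.

```python
def compare_feature_sets(first, second):
    """Summarize how two lists of selected features overlap."""
    first, second = set(first), set(second)
    shared = first & second
    union = first | second
    return {
        "shared": sorted(shared),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
        # Jaccard index: 1.0 means identical sets, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


autoviz_features = ["g-0", "g-1", "c-0"]        # hypothetical AutoViz selection
permutation_features = ["g-1", "c-0", "c-5"]    # hypothetical permutation selection
summary = compare_feature_sets(autoviz_features, permutation_features)
print(summary)
```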

Age of Open Source Software

Both SweetViz and AutoViz are open-source libraries, and you can start using them on your own projects for free.

All of us know the benefits of open source software in general (that’s why it was in use in 78% of businesses as of 2015, per ZDNet’s article, with the trend still going up).

However, the adoption of a particular open source application or library can face structural barriers. One of the most serious barriers of this sort is a poorly maintained codebase.

In this sense, I was really amazed at how well the AutoViz maintainer team fulfills its mission. As I worked on this case study, I faced a severe issue. The maintainer of AutoViz picked it up quickly and delivered a solid fix in short order (which resulted in the incremental update of AutoViz from version 0.0.68 to 0.0.70).

The team kept testing AutoViz further with the data for the MoA project to find additional improvements to add (and that’s how AutoViz 0.0.71 was released).

As a result, I could rely on AutoViz throughout my work on the case study.

Summary and Conclusion

John Tukey liked to emphasize,

“If a thing is not worth doing, it is not worth doing well”.

EDA is indeed worth doing, and therefore it is worth doing well.

However, if you can automate certain aspects of EDA without lowering the quality bar of your data research and the insights obtained, it is the right step to take. It will give an edge to the efficiency of your industrial ML and data analytics projects.

This is the reason why successful rapid EDA tools (like AutoViz or SweetViz) will find their niche in the Data Science and ML industry in the long run.

References

1. Tips for Automating EDA using Pandas Profiling, Sweetviz and Autoviz in Python — https://analyticsindiamag.com/tips-for-automating-eda-using-pandas-profiling-sweetviz-and-autoviz-in-python/

2. Better, Faster, Stronger Python Exploratory Data Analysis (EDA) — https://towardsdatascience.com/better-faster-stronger-python-exploratory-data-analysis-eda-e2a733890a64

3. Powerful EDA (Exploratory Data Analysis) in just two lines of code using Sweetviz — https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34

4. Sweetviz: Automated EDA in Python — https://towardsdatascience.com/sweetviz-automated-eda-in-python-a97e4cabacde

5. AutoViz: A New Tool for Automated Visualization — https://towardsdatascience.com/autoviz-a-new-tool-for-automated-visualization-ec9c1744a6ad

6. Webinar: AutoViz and AutoViML: Automated Visualization and Machine Learning — https://www.youtube.com/watch?v=QZhq8g9W-pQ

7. You Are Plotting the Wrong Things — https://towardsdatascience.com/youre-plotting-the-wrong-things-3914402a3653

8. Insights to Mechanisms of Action Prediction Competition -

9. Drugs classification: Mechanisms of Action — https://www.kaggle.com/amiiiney/drugs-classification-mechanisms-of-action

10. Mechanisms of Action: EDA and Feature Importance Findings — https://www.kaggle.com/c/lish-moa/discussion/190647

11. Feature Importance in the Age of Explainable AI/ML — https://medium.com/sbc-group-blog/feature-importance-in-the-age-of-explainable-ai-ml-242327a70017

George Vyshnya
SBC Group Blog

Seasoned Data Scientist / Software Developer with blended experience in software development, IT, DevOps, PM and C-level roles. CTO at http://sbc-group.pl