Making EDA using Python Easy: DTale (Part 2)

Published in NYU Data Science Review

6 min read · Feb 9, 2024

Welcome back to our journey on understanding how to use DTale to simplify analyses and visualizations in Python! In case you haven’t read it, click here to read Part 1, which covers the basics of this powerful library.

In this part, we’ll move forward and explore the main menu options, which let us work with the entire dataset as opposed to individual columns. As a reminder, the example dataset, obtained from IPUMS USA [1], contains USA census data from 2021 and includes variables such as sex, level of education, school type, employment, and occupation.

DTale offers three main menu options, each with its own sub-options:

Actions

Dataframe functions

This option allows you to build new columns by performing operations on existing columns. Essentially, you select the two existing columns you want to work with at the bottom and choose an action from the choices offered, which depend on the datatypes of the chosen columns. For numeric datatypes, this can mean performing arithmetic operations or transforming data through normalization and standardization. Other datatypes have their own functions, such as cleaning and transformation for string data or time series operations for datetime data.
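For comparison, here is a minimal pandas sketch of the kind of column-building operations described above. The data and column names are hypothetical and are not taken from the IPUMS extract; this shows roughly equivalent hand-written code, not DTale's own implementation.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "income": [30000.0, 45000.0, 60000.0, 75000.0],
    "hours_worked": [20.0, 30.0, 40.0, 50.0],
})

# Numeric operation on two existing columns, stored as a new column.
df["income_per_hour"] = df["income"] / df["hours_worked"]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std(ddof=0)

# Min-max normalization: rescale the values to the [0, 1] range.
income_range = df["income"].max() - df["income"].min()
df["income_norm"] = (df["income"] - df["income"].min()) / income_range

print(df)
```

Each new column sits alongside the originals, so the source values remain available for later steps.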

Merge and Stack

If you have datasets from different sources, you can load all of them into the same DTale window and work on them individually. However, the library also offers the option to merge these dataframes column-wise or stack them vertically.

Note: While having more data is generally better, mixing data from two different populations or environments can be risky. Check that the combined dataset still represents what you intend to study, so that merging does not reduce the accuracy of your models or visualizations.
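In plain pandas, merging and stacking might look like the sketch below. The frames, key column, and values are made up for illustration.

```python
import pandas as pd

# Hypothetical frames standing in for two data sources.
people = pd.DataFrame({"person_id": [1, 2, 3], "sex": ["F", "M", "F"]})
jobs = pd.DataFrame({
    "person_id": [1, 2, 4],
    "occupation": ["nurse", "teacher", "clerk"],
})

# Merge: combine columns from both frames by matching rows on a shared key.
# An inner join keeps only the keys present in both frames.
merged = people.merge(jobs, on="person_id", how="inner")

# Stack: append the rows of one frame below another with the same columns.
more_people = pd.DataFrame({"person_id": [5, 6], "sex": ["M", "F"]})
stacked = pd.concat([people, more_people], ignore_index=True)

print(merged)
print(stacked)
```

With an inner join, rows without a match in the other frame are dropped, so it is worth comparing row counts before and after a merge.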

Summarize Data

In pandas, we usually summarize data using group-by or pivot tables to create compact and structured summaries that are easy to visualize. DTale provides the same functionality without requiring us to write the code for each group-by or pivot table. Here, you can select how you want to summarize your data, which columns you want to select for this purpose, and what aggregation function you want to apply to these columns.

Tip: As mentioned in the previous article, you might benefit from saving this summary in a new instance instead of overwriting the current dataset, especially if you would like to use the original values for further cleaning or analysis.
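For reference, the hand-written pandas code that this menu option saves you from might look like the following sketch, using hypothetical data. The summaries are assigned to new variables, in the spirit of the tip above, so the original frame stays untouched.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "education": ["college", "college", "high school", "high school"],
    "income": [50000, 60000, 30000, 40000],
})

# Group-by: one row per education level with the mean income.
by_education = df.groupby("education", as_index=False)["income"].mean()

# Pivot table: education levels as rows, sex as columns, mean income as values.
pivot = df.pivot_table(
    index="education", columns="sex", values="income", aggfunc="mean"
)

print(by_education)
print(pivot)
```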

Feature Analysis

This option aims to remove features (columns) that are highly correlated with other features. Highly correlated features provide redundant information, so removing them simplifies your models and leaves a smaller set of important features that best represent your target variable.
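The idea of dropping redundant, highly correlated features can be sketched in pandas as follows. This illustrates the general technique, not how DTale implements it; the data and the 0.95 threshold are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "years_since_birth" is nearly a copy of "age".
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "years_since_birth": [25.1, 32.0, 47.2, 50.9, 62.1],
    "hours_worked": [40, 35, 20, 45, 10],
})

# Absolute pairwise Pearson correlations.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
threshold = 0.95  # assumed cut-off for this example
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Which column of a correlated pair gets dropped depends on column order here, so in practice you may prefer to keep the one that is easier to interpret.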

Visualize

Describe

This option is equivalent to the df.describe() function in pandas and provides a quick overview of the contents of your dataset. The statistics shown depend on the datatype of the chosen column, and they are accompanied by other information such as box plots, histograms of column values, and counts of unique values.
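As a quick reminder of the pandas calls being mirrored here, the following sketch uses a small hypothetical frame.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "sex": ["F", "M", "F", "F"],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# All columns: adds unique, top, and freq for non-numeric columns.
all_summary = df.describe(include="all")

# Counts of each unique value in a single column.
sex_counts = df["sex"].value_counts()

print(numeric_summary)
print(all_summary)
print(sex_counts)
```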

Missing Analysis

This feature analyzes the presence of missing data and uses the missingno Python package to visualize the missing values through matrix, bar, heatmap, and dendrogram charts.
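The numbers behind these charts can be approximated with pandas alone. The sketch below uses hypothetical data and does not call missingno; it computes per-column missing counts (what a bar chart summarizes) and the correlation between missingness indicators (the idea behind a missingness heatmap).

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 51],
    "occupation": ["nurse", None, None, "clerk"],
    "sex": ["F", "M", "F", "M"],
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Correlation between missingness indicators, restricted to columns
# that actually contain missing values.
has_missing = df.loc[:, df.isna().any()]
nullity_corr = has_missing.isna().astype(int).corr()

print(missing_count)
print(missing_share)
print(nullity_corr)
```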

Correlations

This option shows the Pearson correlation matrix for all numerical columns, displayed as a heatmap that indicates the level of correlation. The color scale runs from dark green at a correlation of 1, through yellow at 0, to red for negative correlations.

Tip: You can focus your grid by choosing specific x and y columns whose correlations you would like to view.
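The underlying matrix can be reproduced in pandas, as in this sketch with hypothetical data. Selecting specific rows and columns of the matrix corresponds to focusing the grid on chosen x and y columns.

```python
import pandas as pd

# Hypothetical toy data; "sex" is non-numeric and is excluded below.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 62],
    "income": [30000, 42000, 58000, 61000, 75000],
    "hours_worked": [40, 35, 20, 45, 10],
    "sex": ["F", "M", "F", "M", "F"],
})

# Pearson correlation matrix over the numeric columns only.
corr = df.corr(method="pearson", numeric_only=True)

# Focus on a subset: correlations of "age" with two chosen columns.
focused = corr.loc[["age"], ["income", "hours_worked"]]

print(corr)
print(focused)
```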

Predictive Power Score

The Predictive Power Score (PPS) [2] is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. It ranges from 0 (no predictive power) to 1 (perfect predictive power) and can be used as an alternative to the correlation matrix. Asymmetric here means that the score for column A predicting column B can differ from the score for B predicting A.

Time Series Analysis

Data points collected over a period of time may show internal trends or variations which can be mapped out by this option. There was no time series analysis available for my data since I only took a single year into consideration, so here’s a chart from another dataset.

Note:

The “index” option should contain the column whose values indicate the period of time. This column should be in a datetime format.
The “column” option should be the variable to be studied, stored as a numerical dtype.
You can also aggregate the data using options such as count, sum, mean, variance, maximum, and minimum, among others.
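The requirements in the note above map onto a common pandas pattern, sketched here with hypothetical data: a datetime column serves as the index, a numeric column is the variable studied, and the values are aggregated per period. This is an illustration in pandas, not DTale's own code.

```python
import pandas as pd

# Hypothetical data: a datetime column and a numeric column.
ts = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-02-03", "2021-02-25", "2021-03-10",
    ]),
    "sales": [10.0, 14.0, 9.0, 11.0, 20.0],
})

# Use the datetime column as the index, then aggregate per month.
# "MS" labels each bucket with the first day of the month.
monthly = ts.set_index("date")["sales"].resample("MS").agg(["count", "sum", "mean"])

print(monthly)
```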

Charts

DTale makes use of Plotly, an open-source Python library for data visualization, to create interactive plots for multiple variables with different sample sizes and aggregation methods. It offers a variety of charts that suit different datatypes.

Highlight

Heat Map

This option presents a heatmap for the numerical values on a column-by-column basis. The color of each cell reflects the value of its datapoint, which is most useful for data with clear patterns or structures. You can also see heatmaps for individual columns by choosing this option from the column options.

Tip: Outliers can skew the color scale, so that it no longer represents the relative values of the other data points accurately. Hence, if your data values vary greatly in magnitude, you may want to normalize the data before generating a heat map.
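One simple way to normalize before plotting is min-max scaling, sketched below in pandas with hypothetical data. It puts every column on the same [0, 1] scale; note that a single extreme value will still compress the remaining values within its own column, so this is only one possible choice.

```python
import pandas as pd

# Hypothetical data; the last income value is an outlier.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30000, 42000, 58000, 1000000],
})

# Min-max normalization applied column by column.
normalized = (df - df.min()) / (df.max() - df.min())

print(normalized)
```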

Highlight

This option highlights columns based on different factors (dtypes, missing values, outliers, range, and low variance).

Tip: When highlighting the range, you can set your custom range depending on what values you want to focus on.
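Some of these highlighting criteria can be expressed as boolean masks in pandas, as in the sketch below. The data and the 30 to 50 range are hypothetical, and the 1.5 × IQR outlier rule is a common convention assumed here for illustration, not necessarily the rule DTale applies.

```python
import pandas as pd

# Hypothetical data; 95 is an unusually high age.
df = pd.DataFrame({"age": [25, 32, 47, 51, 95]})

# Range: flag values inside a custom range (bounds are inclusive).
in_range = df["age"].between(30, 50)

# Outliers: flag values beyond 1.5 * IQR from the quartiles (assumed rule).
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
outliers = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

print(in_range.tolist())
print(outliers.tolist())
```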

This brings us to the end of my tutorial on DTale! I hope this guide helps you explore this powerful package for yourself and make your analytical journeys easier. If you have any questions or tips to help other readers, make sure to leave them in the comments below!

References:

[1] University of Minnesota, IPUMS USA, 2021, https://usa.ipums.org/usa/

[2] Lee Rowe, Predictive Power Score Implementation in Python, 2021

Written by Ananya Mittal