Making EDA using Python Easy: DTale (Part 1)

Ananya Mittal
NYU Data Science Review
7 min read · Dec 15, 2023
Photo by Luke Chesser on Unsplash

Completing an exploratory data analysis (EDA) on a Pandas dataframe can seem tedious and time-consuming, especially with the large amounts of code and complex visualizations involved.

Bringing DTale to the stage! As a powerful data analysis and exploration Python library, DTale generates an interactive graphical interface that can not only calculate statistics and summaries, but also generate visualizations and carry out more advanced functions such as performing regression and clustering. [1]

However, while this library is easy to use, even for users with limited programming experience, it can take time to understand the nuances of what it actually offers. This tutorial approaches the library through a fresh lens, explores it step-by-step, and highlights certain points in an effort to ensure that you find quick fixes to some of the common issues you might encounter when using DTale to perform an EDA on your own dataset.

The example dataset, obtained from IPUMS USA [2], contains 2021 US census data with variables such as sex, level of education, school type, employment, and occupation.

Installing DTale

You can install DTale using either pip or conda and use it in both Jupyter notebooks and Python terminals.
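For reference, the usual install commands are shown as comments in the sketch below, followed by a small check that the package can be found in your current environment. The helper name dtale_is_installed is my own and is not part of DTale.

```python
import importlib.util

# Run ONE of these in a terminal first (not inside this script):
#   pip install dtale
#   conda install dtale -c conda-forge


def dtale_is_installed() -> bool:
    """Return True if the dtale package can be found in this environment."""
    return importlib.util.find_spec("dtale") is not None


print("DTale installed:", dtale_is_installed())
```

If this prints False inside a Jupyter notebook, the notebook kernel is probably using a different environment from the one you installed into.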

Initiating a DTale interface and loading data

Option 1: Initiating the interface with the file loaded

If you already know which file you want to open, you can initiate the interface with that file loaded from the start.

Note: You need to import pandas to read the file contents. Additionally, the file should be in the same directory as your program; otherwise, you must provide the exact path to the file.
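A minimal sketch of this option, assuming a CSV file. The file name census_2021.csv and both helper names are placeholders of mine. dtale.show() and open_browser() are DTale calls, and dtale is imported inside the function only so that the loader can be used on machines where DTale is not installed.

```python
import pandas as pd


def load_census(path: str) -> pd.DataFrame:
    """Read the CSV file at `path` into a pandas dataframe."""
    return pd.read_csv(path)


def launch_dtale(df: pd.DataFrame):
    """Start a DTale instance for `df` and open it in a browser tab."""
    import dtale  # imported lazily; requires DTale to be installed

    instance = dtale.show(df)
    instance.open_browser()
    return instance


# Example usage (placeholder file name, same directory as this script):
# df = load_census("census_2021.csv")
# launch_dtale(df)
```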

Option 2: Initiating a blank interface

You can also initiate the interface without specifying what dataframe you would like to load. This offers you two options: opening the interface in your browser, or within your notebook or environment.

If you choose to open it in your notebook or Python terminal, it will open a window or generate a link, respectively. In my experience, however, the interface is easier to work with in a browser, so you may fare better with the first option.

Will open a new tab on a browser
Will open a window or generate a link
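The two variants described above can be sketched roughly as follows. This is my best reconstruction rather than a definitive recipe: it assumes a DTale version that accepts dtale.show() with no dataframe and supports the open_browser keyword, and the two function names are my own.

```python
def open_blank_in_browser():
    """Start DTale with no data and open it in a new browser tab."""
    import dtale  # requires DTale to be installed

    return dtale.show(open_browser=True)


def open_blank_inline():
    """Start DTale with no data; a notebook shows a window, a terminal shows a link."""
    import dtale  # requires DTale to be installed

    return dtale.show()


# Example usage:
# instance = open_blank_in_browser()
```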

Note: Initiating the interface without a dataframe loaded will show an error message on the main screen. However, if you hover over the top left of the window and click on the “DTale” logo, the first option will allow you to load a dataframe.

Loading the dataframe:

DTale allows the user to load dataframes either from files stored locally on their computer or by entering a URL from which the file can be fetched. Here, DTale showcases two very useful features: first, it supports a multitude of formats, including .csv, .tsv, .xls, .xlsx, and .parquet; second, it allows the user to load multiple dataframes in the same window.
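If you would rather load the file yourself before handing it to DTale, the same formats can be read with pandas. The sketch below is a pandas-side illustration, not how DTale loads files internally. The READERS table and the read_any helper are names I made up, and the Excel and Parquet readers need extra packages (such as openpyxl or pyarrow) to be installed.

```python
from pathlib import Path

import pandas as pd

# Map each file extension to the pandas reader that handles it.
READERS = {
    ".csv": pd.read_csv,
    ".tsv": lambda p: pd.read_csv(p, sep="\t"),
    ".xls": pd.read_excel,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def read_any(path: str) -> pd.DataFrame:
    """Pick a pandas reader based on the file extension and load the file."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file type: {suffix}")
    return READERS[suffix](path)
```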

Once the data is loaded, a table similar to a pandas dataframe will be displayed. However, it bears one important distinction: every cell of this table can be edited directly, as in an Excel sheet.

Column Options:

DTale allows you to work with each individual column and perform a variety of actions. While many of these remain common and constant across data types, some of them are individual to the data type you might be working with, such as string or numeric.

Column options for strings and numerics

Lock

The lock option works very similarly to the lock feature in Excel: it fixes the selected column to the left of the screen so that it stays visible while we scroll through the other columns. This is particularly useful when you need to compare two columns that are placed far apart.

Hide or Delete

The “hide” option removes the column from view without deleting the values from the actual dataframe. This will come in handy when you want to focus on certain columns or hide the original column once modifications have been made to it. You can undo this by clicking the top right of the screen.

Delete, on the other hand, removes the column from the dataframe permanently (similar to the pandas drop() method).

Note: Deleting a column makes it virtually impossible to recover without loading the original dataset again, so if you’re someone who has as much anxiety as I do and does not want to delete a column unless you’re absolutely sure, the “hide” function will be your best friend.

For this particular dataset, I altered the numerical values of 1 and 2 in the “sex” column to the string values of Male and Female respectively in the “sex_s” column, so I’m now hiding the original column.
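To make the difference concrete, here is a rough pandas analogue of hiding versus deleting. The values are a toy example I made up, not rows from the IPUMS data.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame(
    {
        "sex": [1, 2, 2],
        "sex_s": ["Male", "Female", "Female"],
        "age": [34, 51, 27],
    }
)

# "Hide": look at a subset of the columns; df itself still holds everything.
visible = df[["sex_s", "age"]]

# "Delete": drop() returns a new dataframe that no longer has the column.
trimmed = df.drop(columns=["sex"])

print(list(df.columns))
print(list(visible.columns))
print(list(trimmed.columns))
```

Keeping df around while working on trimmed is the code equivalent of hiding first and deleting only once you are sure.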

Rename

Just as the name suggests, this option allows the user to modify the column name to whatever they prefer, allowing for more clarity and less redundancy.

Duplicate and Duplicates

The first option, as the name suggests, creates a duplicate column. This feature could come in handy when you want to make significant changes to the values of a column but would like to retain the original values.

The second option, on the other hand, allows you to view or delete duplicate columns, rows or column names, a feature that can play an important role when cleaning data.
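For comparison, this is roughly what finding and removing duplicate rows and duplicate columns looks like in plain pandas. It is a sketch on made-up data and is not meant to describe what DTale does under the hood.

```python
import pandas as pd

# Toy data: row 2 repeats row 0, and "age_copy" repeats "age".
df = pd.DataFrame(
    {
        "age": [34, 51, 34],
        "sex": [1, 2, 1],
        "age_copy": [34, 51, 34],
    }
)

# Duplicate rows: duplicated() flags every repeat of an earlier row.
dup_rows = df.duplicated()
no_dup_rows = df.drop_duplicates()

# Duplicate columns: transpose so columns become rows, then reuse duplicated().
dup_cols = df.T.duplicated()
no_dup_cols = df.loc[:, ~dup_cols.to_numpy()]

print(dup_rows.tolist())
print(list(no_dup_cols.columns))
```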

Replacements

This option works in a similar fashion to the find and replace function found in Microsoft Office, Google Workspace, and many other applications. Here, you input the original value in the space corresponding to “search for” and your new desired value in the space corresponding to “replace with.”

Note: The input for the “replace with” option should be of the same datatype as the values in the selected column; otherwise, the value will be replaced with “nan,” that is, a missing value. (I didn’t realize this and struggled with missing values for eons, so hopefully you’ll be pulling out less hair than I did.)

Tip: At the top, you can choose to save these modifications in a “new column” instead of “inplace” which will ensure that the original column remains unchanged and can then be hidden or deleted.

In this dataframe, sex was assigned numerical values {Male:1, Female:2}, which I converted back to the respective string values
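The same recoding can be sketched in pandas with a lookup table. Note that I am using Series.map here, which is my own choice and not necessarily what DTale runs. It also shows a related way missing values can sneak in: any code that is absent from the lookup table comes back as NaN.

```python
import pandas as pd

# Toy codes for illustration; 1 = Male, 2 = Female as in the dataset.
sex = pd.Series([1, 2, 2, 1], name="sex")
labels = {1: "Male", 2: "Female"}

# Write the result to a new Series so the original codes stay available.
sex_s = sex.map(labels).rename("sex_s")

# A code that is not in the lookup table (9 here) becomes a missing value.
with_unknown = pd.Series([1, 9]).map(labels)

print(sex_s.tolist())
print(with_unknown.isna().tolist())
```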

Type Conversion

This function allows you to convert the values of the selected column from one datatype to another, provided it is a valid operation.

Note: Again, you might benefit from saving these changes in a different column, especially if you would like to use the original values for further cleaning or analysis.

In this example, I converted my “sex” column from a numeric datatype to a string datatype so that I can execute the “replacements” function to convert the values from numbers to words
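In pandas terms, a type conversion like this one is an astype() call. The sketch below uses toy values and also illustrates the “valid operation” caveat: text that cannot be read as a number is turned into NaN when pd.to_numeric is asked to coerce errors, a setting I picked for the example.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({"sex": [1, 2, 2]})

# Numeric -> string, saved in a new column so the original stays intact.
df["sex_s"] = df["sex"].astype(str)

# String -> numeric only works for text that looks like a number;
# with errors="coerce", anything else becomes a missing value.
mixed = pd.Series(["1", "2", "unknown"])
as_numbers = pd.to_numeric(mixed, errors="coerce")

print(df["sex_s"].tolist())
print(as_numbers.isna().tolist())
```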

Describe

This option operates like the pandas describe() function and provides a statistical summary of the column, albeit with a lot more information. It is also available from the main menu, where it provides a quick overview of the contents of your dataframe.

For string columns, it provides detailed information on the composition of the characters, the most frequent word, its frequency, and value counts along with a plot.

For numerical columns, the describe option provides the measures of central tendency and spread along with frequency tables, value counts, box plots, Q-Q plots, and histograms.
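As a point of comparison, this is the plain pandas starting point that DTale’s “Describe” builds on, shown on made-up values. The plots and the extra character-level detail are what DTale adds on top.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 34],
        "sex_s": ["Male", "Female", "Female", "Male"],
    }
)

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = df["age"].describe()

# String column: count, number of unique values, most frequent value, its frequency.
string_summary = df["sex_s"].describe()

# Value counts, the table behind a frequency plot.
counts = df["sex_s"].value_counts()

print(numeric_summary)
print(string_summary)
print(counts)
```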

Formats

This option allows the user to define how numbers and strings are displayed. For string data, it lets us hyperlink text; for numerical data, it lets us display commas and exponents, represent values as currencies, and choose the number of decimal places shown.
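The key idea is that formatting changes only how a value is shown and leaves the stored value alone. Python’s own format specifiers make a handy illustration. The number and the dollar sign below are arbitrary choices of mine.

```python
value = 1234567.891  # arbitrary example number

with_commas = f"{value:,.2f}"   # thousands separators, two decimal places
as_exponent = f"{value:.2e}"    # scientific notation
as_currency = f"${value:,.2f}"  # currency symbol in front

print(with_commas)
print(as_exponent)
print(as_currency)
print(value)  # the underlying number is unchanged
```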

Clean columns

This option is available exclusively for columns containing string data and allows you to clean values by removing punctuation or numbers, replacing spaces or hyphens, and normalizing accent characters among other options, many of which can be chosen all at once.
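To give a feel for what these cleaners do, here is a hand-rolled sketch using only the Python standard library. It chains four of the steps mentioned above (normalizing accents, replacing hyphens, removing punctuation, removing numbers). The function names and sample strings are mine, and DTale’s own implementation may differ.

```python
import string
import unicodedata


def strip_accents(text: str) -> str:
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text(text: str) -> str:
    """Apply several cleaning steps in one pass."""
    text = strip_accents(text)
    text = text.replace("-", " ")  # hyphens -> spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    text = "".join(ch for ch in text if not ch.isdigit())  # drop numbers
    return " ".join(text.split())  # collapse repeated whitespace


samples = ["Café-Owner!!", "  Teacher 2 "]
print([clean_text(s) for s in samples])
```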

Well, that’s all for the column options. Good job on making it this far! But we’re not done yet. DTale also provides a plethora of main menu options which I can only do justice to in a separate article. So please wait for that, and in the meantime, enjoy experimenting with DTale!

References:

[1] 360DigiTMG, Linkedin, 10 May 2023, https://www.linkedin.com/pulse/d-tale-360-digitmg-1f

[2] University of Minnesota, IPUMS USA, 2021, https://usa.ipums.org/usa/
