How to use plotly to visualize interactive data [python]

Jan 22, 2024

Content

Introduction plotly vs matplotlib vs seaborn
Types of plotly charts for data visualization
Template structure of the codes to graph with plotly
Some “tricks” frequently used to graph with plotly
Examples of each plotly graph for data visualization
Link repos codes

Introduction plotly vs matplotlib vs seaborn

Visualizing data is one of the first tasks in a data scientist’s job: plotting real vs. predicted values after training a machine learning model, plotting the forecast of a time series model, plotting a confusion matrix, and so on. Before any of that, however, comes exploratory data analysis (EDA), where the goal is to understand visually how the features relate to each other and to the target.

Packages such as matplotlib and seaborn, the best known and most widely used, are the usual choice for this task. However, there is also plotly, a library that lets you explore data interactively: you can zoom in and out on a trend graph, or hover over a point to see its value in a scatter plot, boxplot, parallel plot, etc. A graph that the data scientist can interact with is very valuable when exploring data.

These graphs can be viewed in a Jupyter notebook, saved as an HTML file and explored in the browser, or saved as an image or PDF to get a static graph like those generated by matplotlib or seaborn. Interactivity has a drawback, though: the amount of memory needed to interact with large volumes of data. For example, a figure with scatter-plot subplots showing the relationships between 40 variables and a target variable produces a very heavy file and, depending on the capabilities of the PC, can make it freeze and force you to fall back on a static graph.

At the code level, in my experience with these visualization libraries, once you know one of them you can pick up the others, since they follow a very similar logic. In terms of amount of code, plotting with any of these packages takes a similar number of lines; plotly may require a few more, but once you understand the logic behind it this is not a major problem.

To show the different plotly graphs that can be used in an EDA, I have written two repositories. The first contains notebooks showing how to use plotly: first the code for each graph is developed, and then it is condensed into a single parameterized function that can be reused for any dataset. The second contains a set of Python scripts that, given a dataset, generate a variety of graphs (using the functions from the first repo) and save them, all automatically. In summary:

https://github.com/joseortegalabra/exploratory-data-analysis-ds: repo with notebooks where codes are developed to perform EDA with plotly
https://github.com/joseortegalabra/exploratory-data-analysis-automatic: repo only with scripts to run an EDA with plotly automatically

Types of plotly charts for data visualization

Below is the list of plotly graphs that can be generated for an interactive EDA/visualization. The code is aimed at analyzing data that can be plotted as time series, mainly with a view to forecasting a target variable. However, many of the graphs, such as histograms and scatter plots, can be applied to any type of data.

ydata-profiling

Use the ydata-profiling package to perform basic data profiling

1 Basic profiling of the data generated by this package.

Univariate Analysis

Perform a univariate analysis of the data — continuous data (analyze each variable individually)

1 descriptive statistics table
2 histograms
3 kernel density + histograms
4 original data trend
5 boxplots with monthly aggregation
6 original vs smoothed data trend
7 Autocorrelation and partial autocorrelation functions
8 Other analyses for time series

Bivariate Analysis

Perform a bivariate analysis of the data — continuous data (analyze the relationships of the different pairs of variables that can be generated, in addition to their relationships with the target)

1 correlations
2 scatter plots
3 correlations features with lag
4 Multivariate parallel plot

Segmentation Analysis

Perform an analysis by data segments: starting from continuous data, generate a categorical variable that indicates the segment each row belongs to. Data can be segmented by percentiles or by custom criteria. The code segments by a single variable, either a feature or the target. For example, when predicting a temperature, expert knowledge may indicate that it can be classified into sub-zero, medium and high temperatures, and the goal is to understand how the relationships in the data change across those temperature groups. Currently you can only segment by one variable at a time. Types: segment by feature / segment by target

0 generate data segmentation
1 segmented data distribution
2 descriptive statistical table segmented data
3 histograms and boxplots segmented data
4 trend segmented data
5 segmented data correlations
6 scatter plots segmented data
7 parallel plot only when target is segmented
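
The two segmentation strategies mentioned above (percentiles and custom criteria) can be sketched with pandas alone. This is only a minimal illustration, not the code from the repos: the column name `temperature`, the cut points and the segment labels are all made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 temperature readings (not taken from the repos)
rng = np.random.default_rng(seed=42)
df = pd.DataFrame({"temperature": rng.normal(loc=10, scale=12, size=1000)})

# Option 1: segment by percentiles (three groups of roughly equal size)
df["segment_percentile"] = pd.qcut(
    df["temperature"], q=3, labels=["low", "medium", "high"]
)

# Option 2: segment by custom thresholds, e.g. taken from expert knowledge
custom_bins = [-np.inf, 0, 20, np.inf]  # assumed cut points
df["segment_custom"] = pd.cut(
    df["temperature"], bins=custom_bins, labels=["sub-zero", "medium", "high"]
)

# Number of rows that fall into each segment
print(df["segment_percentile"].value_counts())
print(df["segment_custom"].value_counts())
```

Either new column can then play the role of the categorical segment variable in the graphs listed above.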

Categorical Analysis

Transform all continuous variables into categorical ones and analyze the relationships of the data by categorizing them either by percentiles or by custom groups. Here all the features are categorized and two analysis groups are generated depending on whether the target is transformed into a categorical variable or not. Types: Analyze categorical features vs continuous target / Analyze categorical features vs categorical target

0 generate categorical data
1 crosstab frequency features and target
2 frequency between 2 categorical features
3 univariate analysis categorical features vs continuous target
4 univariate analysis categorical features vs categorical target
5 bivariate analysis categorical features vs continuous target
6 bivariate analysis categorical features vs categorical target
7 parallel plot features categorical vs categorical target
8 woe iv — categorical features vs binary target

Template structure of the codes to graph with plotly

Notebook reference: https://github.com/joseortegalabra/exploratory-data-analysis-ds/blob/main/0_template/2_template_functions_plotly.ipynb

The functions that generate plotly graphs share one structure: they receive as input arguments the dataframe with the data and the parameters needed for the graph, and they return a plotly figure object without altering the original input dataframe.

Why return a plotly object? Because returning the figure leaves the decision of what to do with the graph to the caller: display it in the notebook where it is being run, save it as HTML for interactive viewing on the PC, save it as an image or PDF for static viewing, etc. Following this format gives you both structure and freedom in how the generated graph is used.

import plotly.express as px


def plot_individual_hist_segment(df, var_segment, feature_hist):
    """
    Plot an individual histogram, colored by segment.

    Args
        df (dataframe): input dataframe
        var_segment (string): name of the column in the input dataframe that indicates the different segments in the data
        feature_hist (string): name of the feature in the input dataframe whose histogram will be plotted

    Return
        fig (figure plotly): plotly figure with the generated plot
    """

    # TODO: ADD CODE TO GENERATE FIGURE
    fig = px.histogram(df, x=feature_hist, color=var_segment, barmode='overlay', opacity=0.4)

    # update title
    fig.update_layout(
        title_text=f'Histogram: {feature_hist}',
        title_x=0.5,
        title_font=dict(size=20)
    )
    return fig

Access to plotly data

Another trick is to access the data of the plotly figure object (figure.data[0]). This lets you modify a graph that has already been generated, for example to change its color, without changing the arguments of the plotly function that created it.

Example: changing the color of the trend line in a scatter plot

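import plotly.express as px

# generate figure plotly (trendline="ols" requires the statsmodels package)
fig = px.scatter(df, x = feature_x, y = feature_y, marginal_x = "histogram", marginal_y = "histogram", trendline = "ols")

# change color trendline to brick red (the trendline is the last trace in fig.data)
fig.data[-1]['marker']['color'] = '#d62728'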

Subplots

A single graph can follow the structure described above. Multiple graphs/subplots can be built in two ways:

The first is to build each graph with plotly graph objects (go), which is the option shown most often in the documentation and examples.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(rows = number_rows, cols = number_columns, shared_xaxes=False, subplot_titles=list_features)

# add each histogram
for index_feature, feature in enumerate(list_features):

    # get indexes subplot (in plotly the indexes start at 1)
    row = (index_feature // number_columns) + 1
    column = (index_feature % number_columns) + 1

    # TODO: ADD CODE TO GENERATE GRAPH USING PLOTLY OBJECT (go)
    fig.add_trace(go.Histogram(x = df[feature], name = feature), row = row, col = column)

The second is to use a plotly express (px) graph. Plotly express is not designed for subplots, but you can extract the traces from the figure it generates and add them to the subplot. The main problem is that all subplots generated this way keep the same color, whereas the first option automatically varies the color between subplots (e.g. trend graphs of N features, each drawn in a different color in its own subplot). This can be a problem but, depending on the type of subplots and what you want to display, keeping a fixed color scale across subplots may be useful.

import pandas as pd
import plotly.express as px
from plotly.subplots import make_subplots

fig = make_subplots(rows = number_rows, cols = number_columns, shared_xaxes=False, subplot_titles=list_features)

for index_feature, (feature_x, feature_y) in enumerate(list_pair_features):

    # get indexes in the subplot (in plotly the indexes start at 1)
    row = (index_feature // number_columns) + 1
    column = (index_feature % number_columns) + 1

    # get cross table freq between feature_x and feature_y, with margins. It is possible to select between normalized values or not
    if ct_normalized:
        ct_freq_features = pd.crosstab(df[feature_x], df[feature_y], normalize = True, margins = True)
    else:
        ct_freq_features = pd.crosstab(df[feature_x], df[feature_y], margins = True)

    # transform cross table freq between pair of features into a heatmap
    fig_aux = px.imshow(ct_freq_features, text_auto = True, aspect = "auto")

    # add heatmap to fig global
    fig.add_trace(fig_aux.data[0], row = row, col = column)

Subplot size

In addition, the size of the generated figure is set in pixels, and the number of pixels varies from screen to screen, so images generated on a large monitor can look distorted when viewed on a laptop screen.

When using plotly express with a single graph, the size is usually adjusted automatically.

When generating subplots, on the other hand, the size must be set by hand. Only the size of the whole figure spanning all subplots can be set; the individual plots are then scaled to fit it. The figure size therefore has to grow with the number of subplots being displayed.

fig.update_layout(title = 'Title plot of subplots',
                  height = 1450 * number_rows,
                  width = 1850 * number_columns)

Subplots plotly express with multiple data

You can generate plotly express graphs with multiple series, for example a trend graph of 5 features together. This produces a single plotly figure, but its data attribute (figure.data) holds one element per trend. To add such a graph to a subplot, you must therefore add every element:

for index_feature, (feature_x, feature_y) in enumerate(list_pair_features):

    # get indexes in the subplot (in plotly the indexes starts in 1)
    row = (index_feature // number_columns) + 1
    column = (index_feature % number_columns) + 1

    # get fig individual - plotly express
    fig_aux = px.scatter(df, x = feature_x, y = feature_y, color = var_segment, opacity = 0.1)
    
    # add scatter to fig global
    for index in range(len(fig_aux.data)):
        fig.add_trace(fig_aux.data[index],
            row = row,
            col = column
        )

Multiple plots for huge data

Some multivariate graphs are very heavy to generate as subplots. Trend graphs or scatter plots that draw a lot of data from many features in one figure produce a very heavy file and slow down the PC. For this type of graph, one solution is to plot the scatter plot/trend plot of each feature individually rather than all together in a subplot. Another is to plot a random sample of the data, repeating this several times with different samples.

Examples of each plotly graph for data visualization