Beyond “tidy”: Plotly Express now accepts wide-form and mixed-form data

Nicolas Kruchten
Plotly
May 26, 2020

Plotly Express is the built-in high-level data visualization interface for Plotly.py, a leading interactive data visualization library for Python. With today’s release of Plotly.py 4.8, Plotly Express now gracefully operates on wide-form and mixed-form data – not just “tidy” long-form data.

These new capabilities dramatically expand Plotly Express’ promise of ‘interactive data visualization in a single Python statement’, by removing the need to wrangle your data into a particular form before plotting.

When we first released Plotly Express in March of last year, it supported only “tidy” long-form Pandas data frames as input, and our early adopters quickly asked about wide-form support. Since then we’ve added support for non-data-frame array-like inputs such as NumPy arrays and Pandas series and indexes; for GeoJSON-formatted geographical data; for hierarchical part-of-whole data; and for image and multi-dimensional xarray data. Today we are finally able to deliver to our users and customers some capabilities they’ve been asking for from the beginning. It took weeks of refactoring and fine-tuning of defaults, but I’m very proud of the outcome: Plotly Express’s wide-form support is flexible, expressive and totally consistent with the rest of its API.

🧐 Terminology

There are three common conventions for storing column-oriented data in tabular form, usually in a data frame with column names:

  • long-form data has one row per observation, and one column per variable. This is suitable for storing and displaying multivariate data i.e. with dimension greater than 2. This format is sometimes called “tidy data”.
  • wide-form data has one row per value of the first variable, and one column per value of the second variable. This is suitable for storing and displaying 2-dimensional data.
  • mixed-form data is a hybrid of long-form and wide-form data, with one row per value of one variable, and some columns representing values of another, and some columns representing more variables.

Plotly Express can now operate natively on all three of these formats.

A medal table, two ways, along with the Pandas code to convert back and forth.

Here’s a side by side example of the same Olympic Short-Track Speed Skating medal data in both long-form and wide-form, along with the Pandas code that will convert back and forth. Until today, if you had a wide-form dataset like the one on the left and wanted to plot it using Plotly Express, you would have had to use the Pandas .melt() operation to “tidy up” your data first. This is now no longer necessary!
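The round trip between the two forms can be sketched with plain Pandas. The table below uses made-up placeholder counts rather than the real medal data, and the nation labels and the column name "count" are assumptions chosen for illustration.

```python
import pandas as pd

# Placeholder medal table in wide form (hypothetical values, not real results).
wide_df = pd.DataFrame(
    {"gold": [3, 2, 1], "silver": [1, 0, 2], "bronze": [0, 2, 1]},
    index=pd.Index(["Nation A", "Nation B", "Nation C"], name="nation"),
)
wide_df.columns.name = "medal"

# Wide -> long: one row per (nation, medal) observation.
long_df = wide_df.reset_index().melt(
    id_vars="nation", var_name="medal", value_name="count"
)

# Long -> wide: one row per nation, one column per medal type.
# Note that pivot() sorts the resulting columns alphabetically.
round_trip = long_df.pivot(index="nation", columns="medal", values="count")

print(long_df)
print(round_trip)
```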

📐 Wide-Form Data

The core, backwards-compatible change to the Plotly Express API that makes this possible is that, for 2D-Cartesian plotting functions, the x or y argument can now accept not only a single column reference or column, but also a list of column references or columns. Here’s what this looks like in practice, with the wide-form version of the dataset above:

Here Plotly Express has read the row- and column-index names and used them to label the x-axis and legend, and these labels are automatically reflected in the hover labels as well. In the code above, we could in fact omit x and y and simply call px.bar(wide_df) with no additional arguments, because the values passed here are just the defaults, spelled out for illustrative purposes. On the other hand, a few extra arguments can come in handy for styling a Plotly Express plot the way you like, still with just one Python statement:

Note that in the plots above, the 3 columns of data passed in to y (the three medal types) are automatically differentiated by color, and the legend title is automatically set to the column-index name. This is the common convention for other plotting systems like Pandas’ own built-in .plot() family of functions and even venerable old Microsoft Excel. If this were a visualization of long-form data with Plotly Express, we could assign this medal dimension to any visual attribute of the plot, and in adding wide-form support, we found a way to retain this ability: you can set any keyword argument to medal in the function above! Let’s try it with facet_col for example:

We can do this for any attribute of any kind of 2D-Cartesian chart supported by Plotly Express, not just bars. For reference, the 2D-Cartesian plotting functions are scatter, line, area, bar, histogram, violin, box, strip, funnel, density_heatmap and density_contour. Here is a line chart using the new built-in stocks dataset, which has one row per week in 2018–2019, and one column per major tech stock:

🥄 Mixed-Form Data

This ability to map data dimensions to visual variables, no matter the form, is what allows Plotly Express to operate not only on wide-form data like the simple medal table above, where every column represented a value of the medal dimension of the data, but also mixed-form datasets, where some columns represent values of a dimension and some don’t. Let’s take a look at the new built-in mixed-form dataset called experiment, which represents the results of a hypothetical three-experiment study on 100 participants:

Here, the index tells us that each row contains data for a single participant. The first three columns contain data for the experiment dimension, so if we only had these three columns, we could describe this as a wide-form dataset with dimensions participant and experiment. But there are also the other two columns, gender and group, which on their own would make this a long-form dataset with dimensions participant, gender and group. Taken together, we can describe this dataset as mixed-form. Plotly Express works well with this kind of data too! Starting with just the wide-form portion, let’s make a violin plot:

We see that in this case, the list we passed to y is not a column-index with a name, so Plotly Express has used the Pandas .melt() convention and labelled this dimension "variable". By default, wide-form violin plots are not colored, as the violins are clearly identified by their x-position, but we can assign color="variable" if we prefer to double-encode this dimension:

What about these extra columns in the dataset? Well, it turns out we can assign them to visual variables also. Since we’re now encoding variable as color, we can assign a different dimension like group to the x position of the violins, and we can facet by gender. We can also use the standard Plotly Express labels argument to override the default value and variable labels. Finally, we can use Plotly’s interactive hover-label features by adding the participant ID (df.index, which is named participant) to the hover_data, so that we can hover over outlier points and quickly identify them:

With just a few characters’ difference, we can produce a totally different plot, faceting by experiment and coloring by group:

And since we didn’t have to rearrange our dataset to produce the plots above, we can smoothly transition into a long-form-based analysis among experiments, say to compare the results of one experiment to another:

The examples above just scratch the surface of what can be done with Plotly Express on wide- and mixed-form data, so be sure to check out the wide-form documentation for more details. But before you go, one more thing…

🐼 Plotly.py Now Has a Pandas Plotting Backend

With the changes above, Plotly Express behaves a lot more like the default Pandas plotting backend with respect to the x and y parameters of the corresponding functions. In light of this, we are taking advantage of the new pandas.options.plotting.backend feature introduced in Pandas v0.25, and offering an official plotly backend for Pandas plotting. This means that you can import Pandas as you usually do, set the plotting backend to "plotly", and when you call df.plot(), Plotly Express is invoked and returns a plotly.graph_objects.Figure object, ready to be customized, rendered, or exported to a static file. Here’s an example:

A note on API compatibility: The Plotly plotting backend for Pandas is not intended to be a drop-in replacement for the default one; it does not implement all, or even most, of the same keyword arguments (subplots=True, for example). The Plotly plotting backend for Pandas is a more convenient way to invoke certain Plotly Express functions by chaining a .plot() call without having to import Plotly Express directly. Plotly Express, as of version 4.8 with wide-form data support, implements behaviour for the x and y keywords that is very similar to that of the matplotlib backend.

❤️ A Perfect Fit for Dash

Dash is Plotly’s open-source Python framework for building rich, interactive analytical applications, no JavaScript required! Figures produced with Plotly Express – from wide- or long-form data, with or without the Pandas backend – are always directly compatible with Dash: just pass them in to the figure property of a dash_core_components.Graph() component!

📦 Getting Started

We hope you’re as excited about these new capabilities as we are! You can check out the complete release announcement on our community forum, or you can head straight to our installation instructions to get your hands on version 4.8 right now.
