Wrangle Data in Jupyter Notebooks With PixieDust Rosie

Want to easily transform CSVs in Jupyter Notebooks? “We Can Do It!”

Published in

Center for Open Source Data and AI Technologies

4 min readAug 2, 2018

We use the term wrangling in the sense of “grasping and maneuvering”. Generally, data needs a lot of prep work before it can be analyzed, and wrangling is the first step. It is where we inspect the data, remove unneeded columns, discard rows, and de-structure columns. Any data scientist can confirm, wrangling large datasets can be a time consuming, sometimes frustrating, and compute-intensive process. So we (a team of engineers at IBM) built PixieDust Rosie’s Wrangle Data Tool to make that process faster and more efficient.

Demo of the PixieDust Rosie Wrangle Data Tool

Introducing PixieDust Rosie Wrangle Data Tool

The Wrangle Data Tool is built as a PixieApp in the Jupyter Notebook environment. To use the tool in a notebook you must install pixiedust, rosie and pixiedust_rosie, which are all open source projects. The following commands show how to install them within a Jupyter Notebook cell:

Once installed, you can use the wrangle_data method in the pixiedust_rosie package. The method can take a path to a local file or a download link to a file. We currently only support CSV files as input.

Input used for tutorial:
pixiedust_rosie.wrangle_data("https://raw.githubusercontent.com/pixiedust/pixiedust_rosie/master/sample-data/HCUP_Tutorial.csv")

The wrangle_data method will, under the cover, start a PixieApp in the output of the cell where you can perform various operations on a sample set from the passed CSV file. The main screen of the UI displays a schema table and a table of the selected sample data. The schema table contains the column name, the Rosie type of the column, the Pandas DataFrame type of the column, a flag to indicate if there is a suggested transform, and a set of actions to perform on the column.

Rename, delete, and transform

Each column can be renamed, deleted, and transformed with the buttons under the Actions column in the Schema table. The method will create a new CSV, so the passed CSV will not be modified.

The transform action allows you to extract new columns from an existing column in the data set. If the column has a suggested transform (indicated by the check in the Suggested Transform column) then you can use the suggestion. You can always extract new columns by building your own Rosie pattern in the transform screen.

Transform large data sets in Jupyter Notebooks? “We Can Do It!” says Rosie the Riveter.

Learn more about the Rosie Language

Example transform

In the example above, we want to split the BreakOutCategoryId column into an alphabetical code (Category_Prefix) and a numeric code (Category_ID). To begin, we need to define a general pattern that contains the names of the new columns we want to create and any context patterns. A context pattern can be a string literal or a pattern that we want to ignore. Once we define our general pattern, we can click Extract Variables.

Now we must define a pattern for each of the extracted column variables. In this case we want to define a pattern to extract the alphabetical characters and a pattern to extract the numeric characters. Once we defined the patterns, we can click Create Columns.

If the pattern is successfully applied to a row, you will see the new data in the Sample of New Column(s) panel. Any row that did not match the pattern will have a value of None, and any errors in the pattern-building process will be displayed in this panel. The Rosie Help panel contains tips and a link to Rosie’s documentation. Once you are satisfied with the result you can click Commit New Column(s) to commit the new columns to the Schema.

“We Can [Did] It!”

Once you have performed all the necessary operations on the data, you can click the Finish button. All the operations you performed on the sample set will be applied to the full data set and a new CSV will be created. The finish screen contains a path to the new CSV file, the variable name of a Pandas DataFrame created from the new CSV, and a list of lines that were excluded from the new CSV for not matching the overall pattern. A new cell will be created that will allow you to use PixieDust to display and visualize your new data set.

Coming to a notebook near you

The PixieDust Rosie Wrangle Data Tool brings the power of data processing to the flexible environment of Jupyter Notebooks. Check out our GitHub repo to learn more and contribute!