Self-service data preparation with IBM Data Refinery

Carmen Bommireddipalli
4 min read · Nov 15, 2017


If you are like most data scientists, you probably spend a lot of time cleansing, shaping, and preparing your data before you can start on the more enjoyable part: building and training machine learning models. As a data analyst, you might face similar struggles getting data into the format you need to build your reports. In many companies, data scientists and analysts have to wait for their IT teams to provide cleaned data in a consumable format.

IBM Data Refinery addresses this issue. It provides an intuitive self-service data preparation environment where you can quickly analyze, cleanse and prepare data sets. It is a fully managed cloud service, available in open beta now.

Analyze and prepare your data

With IBM Data Refinery, you can interactively explore your data and use a wide range of transformations to cleanse and transform data into the format you need for analysis.

You can use a simple point-and-click interface for selecting and combining a wide range of built-in operations, such as filtering, replacing, and deriving values. It is also possible to quickly remove duplicates, split and concatenate values, and choose from a comprehensive list of text and math operations.

Interactive data exploration and preparation

If you prefer to code, you can enter R code directly in IBM Data Refinery, using operations from R libraries such as dplyr. We provide code templates and in-context documentation to help you become productive with the R syntax more quickly.

Code templates to help users with R syntax

If you’re not satisfied with the shaping results, you can easily undo and change operations in the Steps side bar.

The interactive user interface works on a subset of the data to give you a faster preview of the operations and results. Once you’re happy with the sample output, you can apply the transformations to the entire data set and save all transformation steps in a data flow. You can rerun the data flow later and track the changes that were applied to your data. To speed up job execution, Apache Spark is used as the execution engine.
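To make the idea of a data flow concrete, here is a minimal sketch in plain Python. It is not IBM Data Refinery's API or its R syntax; the helper names (filter_rows, replace_value, derive, remove_duplicates, run_flow) and the sample rows are made up for illustration. It only shows the pattern described above: an ordered list of steps is previewed on a small subset and then applied unchanged to the full data set.

```python
# Hypothetical sketch: a "data flow" as an ordered list of reusable steps.
# Each step takes a list of row dicts and returns a new list of row dicts.

def filter_rows(predicate):
    """Keep only the rows for which predicate(row) is true."""
    return lambda rows: [r for r in rows if predicate(r)]

def replace_value(column, old, new):
    """Replace one exact value in a column with another."""
    return lambda rows: [
        {**r, column: new} if r[column] == old else r for r in rows
    ]

def derive(column, fn):
    """Add a column computed from the other values in each row."""
    return lambda rows: [{**r, column: fn(r)} for r in rows]

def remove_duplicates():
    """Drop rows that are exact duplicates, keeping the first occurrence."""
    def step(rows):
        seen, unique = set(), []
        for r in rows:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                unique.append(r)
        return unique
    return step

def run_flow(steps, rows):
    """Apply every step in order; the input rows are left untouched."""
    for step in steps:
        rows = step(rows)
    return rows

# Made-up sample data.
data = [
    {"city": "NYC", "price": 10.0, "qty": 2},
    {"city": "NYC", "price": 10.0, "qty": 2},    # exact duplicate
    {"city": "N.Y.C.", "price": 5.0, "qty": 0},  # filtered out (qty is 0)
    {"city": "Boston", "price": 4.0, "qty": 3},
    {"city": "N.Y.C.", "price": 2.5, "qty": 4},  # inconsistent spelling
]

flow = [
    filter_rows(lambda r: r["qty"] > 0),
    replace_value("city", "N.Y.C.", "NYC"),
    derive("total", lambda r: r["price"] * r["qty"]),
    remove_duplicates(),
]

preview = run_flow(flow, data[:3])  # fast preview on a subset
result = run_flow(flow, data)       # same steps on the full data set
```

Because the steps are stored as data rather than applied in place, the same flow can be rerun later on fresh input, which is the property the saved data flow relies on. In the service itself the full run is executed on Apache Spark; this sketch runs everything in a single process.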

Profile and visualize data

Data shaping is an iterative and time-consuming process. In a traditional data science workflow, you might use one tool to apply various transformations to your data set, and then load the data into another tool to visualize and evaluate the results. Over many cycles, this continual tool hopping can become frustrating.

IBM Data Refinery soothes the pain by integrating both data transformations and visualizations in a single interface, so you can move between views with a simple click. You can use the Profile tab to view descriptive statistics for your data columns and better understand the distribution of values. As you continue to apply transformations, the corresponding profile information updates automatically.
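As a rough illustration of what a column profile contains, the following plain-Python sketch computes a few descriptive statistics per column. The function names (profile_column, profile), the choice of statistics, and the sample rows are assumptions made for this example; the Profile tab may report different or additional measures.

```python
import statistics
from collections import Counter

def profile_column(values):
    """Summarize one column: counts for every column, plus numeric
    statistics or the most frequent value depending on the data."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
    else:
        summary["most_common"] = Counter(present).most_common(1)
    return summary

def profile(rows):
    """Profile every column, taking column names from the first row."""
    columns = list(rows[0].keys()) if rows else []
    return {c: profile_column([r.get(c) for r in rows]) for c in columns}

# Made-up sample data with one missing value.
rows = [
    {"age": 30, "city": "NYC"},
    {"age": None, "city": "Boston"},
    {"age": 40, "city": "NYC"},
    {"age": 50, "city": "NYC"},
]

before = profile(rows)

# After a transformation (here: dropping rows with a missing age),
# recomputing the profile reflects the change.
cleaned = [r for r in rows if r["age"] is not None]
after = profile(cleaned)
```

Recomputing the profile after each step is the simplest way to keep statistics in sync with the data; an interactive tool would do the equivalent behind the scenes whenever a transformation is applied.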

On the Visualization tab, you can select a combination of columns to build charts using Brunel, an open source visualization library. IBM Data Refinery automatically suggests appropriate plots, and you can choose from 12 predefined chart types. You can adjust the appearance of the charts using Brunel syntax.

Connect to your data wherever it resides

IBM Data Refinery comes with a comprehensive set of 30 prebuilt data connectors so that you can set up connections to a wide range of commonly used on-premises and cloud data stores. You can connect to IBM as well as non-IBM services. If your data service is hosted on IBM Cloud (formerly IBM Bluemix), you can directly access the data service instance from IBM Data Refinery.

Once you define a connection and link a data object to your data, you can start to analyze and refine your data wherever it resides.

Try out IBM Data Refinery! Sign up for free at: https://www.ibm.com/cloud/data-refinery
