Data tidying in Data Science Experience — R edition

28 March 2017


On May 20, 2015, then U.S. Chief Data Scientist DJ Patil tweeted,

“I’ve found in my experience that cleaning the data is 80% of the hard work.” (1)

Since then, this metric has been quoted ubiquitously, both because it is accurate and because data cleansing is crucial to the success of any analytics project.


This post focuses on data tidying, a key component of the data cleansing process. Data tidying is the final stage of data cleansing, the stage that readies the data for modeling. (2)

Data tidying includes, but is not limited to:

  • Meaningful column names
  • Atomic data elements
  • Consistent observational granularity
  • Relevant features and, if required, a target variable

Why does data need to be tidy? Because most modeling algorithms work best with well-formed data in a tabular format. Exceptions include analytics on unstructured data sources, such as text analytics.

Overview

This hands-on tutorial demonstrates the basics of data tidying in Data Science Experience using an R notebook.

Specifically, you will be shown:

  1. How to access both local and external data
  2. Approaches to identifying and fixing data issues using R
  3. Best practices on documentation, sharing, and reproducibility

Upon completion, you will have a working knowledge of how to tidy a data set, and you will be provided with all of the code presented here. All source data is retrieved from an Amazon Web Services public repository; it is in the public domain and is attributed to Lending Club “Lending Statistics.”

The complete tutorial, which includes the Jupyter R notebook and associated data sets, is available for download on GitHub.

Prerequisites

  • Familiarity with the R programming language
  • A DSX account — sign up for your free account
  • Basic understanding of Jupyter notebooks
  • Optional: Working knowledge of GitHub

Create project

Begin by creating a new project within DSX, making sure to select “Object Storage” as the Storage Type.

The Target Container takes the name of the project with the spaces removed, e.g., “TidyData.” Click “Create” to continue. You will be brought to the project’s dashboard:

Associate GitHub

Although not required, I do recommend that GitHub be used for notebook versioning. It is free to join.

To use GitHub:

  • Create a repository (GitHub)
  • Generate an access token for the repository (GitHub)
  • Add the token to your profile (DSX -> profile -> settings -> integration)
  • Add GitHub repository to project

Adding the URL is accomplished by clicking the “Settings” tab in the dashboard, scrolling to the bottom, and then pasting the URL in the “Connect to a GitHub Repository” field:

Click “Add” to tie the repository to the project.

Add local data file

There are several ways to get data into a DSX project. The simplest way is to add it manually:

Browse local file system and select data file to load:

On the overview page, you will see the file has been uploaded. Check it off, then click “Apply”:

The CSV is now part of the project and is accessible to any notebook within the project. There is only one more thing needed before coding can begin: creating an R notebook.

Create R notebook

From the “Overview” page of the project, click on “add notebooks” to create a new notebook:

Give the notebook a meaningful name and description. Be sure to choose “R” as the “Language” and accept the remaining defaults:

Click “Create Notebook” to continue. You will be presented with an empty notebook:

Recall that the data set “sample.csv” is part of the project; it is stored in a container within Bluemix Object Storage. Access to the container requires credentials.

In the upper right of the notebook header there is a small icon made up of 1’s and 0’s. Click this icon and a panel will appear showing the “sample.csv” file. Beneath “sample.csv,” click “Insert to code,” then choose “Insert Credentials.”

The result is that the first code cell “In[1]” has been populated with credential information necessary to move data in and out of the object store:

The object “credentials_1” contains a lot of information; I am only showing a bit of it here because it includes sensitive credentials that I don’t want to expose. Keep this in mind if you choose to share a notebook.
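For readers following along in text rather than screenshots, the inserted cell looks roughly like the sketch below. The field names are typical of Swift-based Bluemix Object Storage, but the exact set varies by instance, and every value here is a redacted placeholder:

    # @hidden_cell
    # Inserted by DSX. Field names are typical for Bluemix Object Storage;
    # all values below are redacted placeholders.
    credentials_1 <- list(
      auth_url   = "https://identity.open.softlayer.com",
      project_id = "********",
      region     = "********",
      user_id    = "********",
      domain_id  = "********",
      username   = "********",
      password   = "********",
      container  = "TidyData",
      filename   = "sample.csv"
    )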

If you chose to use GitHub, you can publish the notebook at any time. To publish, click the publish icon on the header of the notebook, then choose “Publish on GitHub”:

You will be prompted with a dialogue in which you add a commit message. You also need to choose an option for “Cell content.” Choose “All content except hidden code cells.”

With this option, the notebook is saved without the sensitive “hidden cell” items, which are marked by the token “# @hidden_cell”.

A quick check of GitHub shows that the notebook has been published:

From this point forward, you can check in changes to GitHub at your discretion. The last step in the bootstrap process is to install and load the R packages that make moving data in and out of Bluemix Object Storage easier:

The three new code cells install and activate two R packages, “devtools” and “objectStoreR.” The first is part of the R ecosystem. The second was developed by the IBM Data Science Experience team and is open source under the Apache License 2.0.
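If you are recreating those cells by hand, they look roughly like this; the GitHub path for objectStoreR is my assumption, so check the package’s repository:

    # Cell 1: install devtools from CRAN
    install.packages("devtools")

    # Cell 2: install objectStoreR from GitHub (repository path assumed)
    devtools::install_github("ibm-cds-labs/objectStoreR")

    # Cell 3: load both packages into the session
    library(devtools)
    library(objectStoreR)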

Load and Explore Data

Explore “sample.csv”

Everything up to this point has laid the plumbing that enables us to work with R in a Jupyter notebook, accessing data both internally and externally. Recall that we uploaded the file “sample.csv.” Let’s see how easy it is to retrieve it from object storage and load it into an R data frame:
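A minimal sketch of that retrieval, assuming objectStore.get() accepts the credentials list and returns the object’s contents as text (consult the objectStoreR documentation for the exact signature):

    # Fetch "sample.csv" from Object Storage with the inserted credentials,
    # then parse the returned text into a data frame.
    # (The signature of objectStore.get is an assumption.)
    rawText <- objectStore.get(credentials_1)
    sampleData <- read.csv(textConnection(rawText), stringsAsFactors = FALSE)
    head(sampleData)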

We now know how to pull data from object storage, but what if the data is external and can only be accessed by URL? Below, I demonstrate how to pull data from an external source, put it into a data frame, and self-document along the way:

Access external data via HTTPS

Access external open source loan data using R:
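A sketch of the import, using the source URL from the data citation at the end of this post:

    # Download the Lending Club loan data over HTTPS, then read it into a
    # data frame; stringsAsFactors = FALSE keeps text columns as strings.
    url <- "https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv"
    download.file(url, destfile = "loanData.csv")
    loanDataRaw <- read.csv("loanData.csv", stringsAsFactors = FALSE)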

The data set was downloaded as a CSV file, “loanData.csv,” and then read into an R data frame, “loanDataRaw,” to make it usable in R.

Loan data exploration

The data frame “loanDataRaw” has 2,500 records and 14 columns.

Great, but what does the data look like?
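Base R answers both questions:

    dim(loanDataRaw)    # 2500 rows, 14 columns
    str(loanDataRaw)    # column names, types, and sample values
    head(loanDataRaw)   # the first six records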

This brief output demonstrates that the data, and its metadata, are not tidy. A few of the many items that should be addressed include:

  • Column headings: Inconsistent case and use of periods
  • Interest.Rate: Numerical data stored as strings
  • Loan.Length: Can be simplified and represented as numerical data
  • FICO.Range: A single-column range that would be more useful as distinct columns

The complete raw data set can be viewed on GitHub.

With a better understanding of the data and its issues, we first need to copy this data to storage before making the necessary changes.

The helper function “objectStore.put” makes quick work of copying the raw data set to object storage. A returned status of “201” indicates success.
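Roughly, the call looks like the sketch below; objectStore.put’s exact argument list is an assumption, so consult the package documentation:

    # Serialize the raw data frame to CSV, then copy it to Object Storage.
    # (The argument list for objectStore.put is assumed.)
    write.csv(loanDataRaw, "loanDataRaw.csv", row.names = FALSE)
    status <- objectStore.put(credentials_1, "loanDataRaw.csv")
    print(status)  # "201" indicates the object was created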

Let’s dive in and start tidying up this data.

Tidy data

Step 1: Fix the column headings, remove the periods, and change the names so that they are consistent:
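One way to do this in base R; the raw column names shown here, and the exact renaming scheme, may differ slightly from the notebook’s:

    names(loanDataRaw)  # before: "Interest.Rate", "Loan.Length", "FICO.Range", ...
    # Remove the periods so the names read as consistent UpperCamelCase
    names(loanDataRaw) <- gsub("\\.", "", names(loanDataRaw))
    # Fix any columns whose case is still inconsistent, one by one, e.g.:
    names(loanDataRaw)[names(loanDataRaw) == "OpenCREDITLines"] <- "OpenCreditLines"
    names(loanDataRaw)  # after: "InterestRate", "LoanLength", "FICORange", ...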

You can see several of the column names before and after the treatment. This consistent naming will make things easier moving forward.

Convert % character data to numeric

Step 2: Remove the “%” sign from “InterestRate” and “DebtToIncomeRatio,” then cast both as numeric:
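A sketch of this step using gsub() and as.numeric():

    # Strip the "%" suffix, then cast the columns to numeric
    loanDataRaw$InterestRate <- as.numeric(gsub("%", "", loanDataRaw$InterestRate))
    loanDataRaw$DebtToIncomeRatio <- as.numeric(gsub("%", "", loanDataRaw$DebtToIncomeRatio))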

Unpack packed fields

Step 3: Break up FICORange into three new columns: FICOLow, FICOHigh, FICOAvg:
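A sketch, assuming FICORange holds hyphenated strings such as “735-739”:

    # Split "735-739" on the hyphen into low and high bounds,
    # then compute the midpoint as FICOAvg
    ficoParts <- strsplit(loanDataRaw$FICORange, "-")
    loanDataRaw$FICOLow  <- as.numeric(sapply(ficoParts, `[`, 1))
    loanDataRaw$FICOHigh <- as.numeric(sapply(ficoParts, `[`, 2))
    loanDataRaw$FICOAvg  <- (loanDataRaw$FICOLow + loanDataRaw$FICOHigh) / 2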

Discover and remove null data

Step 4: Find and remove rows that have null values:
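In base R this is a two-liner:

    colSums(is.na(loanDataRaw))          # count missing values per column
    loanDataRaw <- na.omit(loanDataRaw)  # drop every row containing an NA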

Convert character categories to numeric

Step 5: Convert string values to a numerical representation for several variables: LoanPurpose, EmploymentLength, HomeOwnership, and LoanLength.

For brevity, I show just one transformation below. The other three are detailed in the complete R notebook that supports this tutorial.
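A minimal sketch of one such transformation; which variable the notebook shows, and its exact category-to-number mapping, are assumptions here:

    # Encode HomeOwnership as integer category codes
    # (factor levels are ordered alphabetically unless specified)
    loanDataRaw$HomeOwnership <- as.numeric(factor(loanDataRaw$HomeOwnership))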

A quick check of the data shows that the transformations have been applied:

  • Column headings: Consistent case
  • InterestRate, DebtToIncomeRatio: “%” removed, numerical representation
  • LoanLength: Numerical representation
  • FICORange: Single concatenated string transformed to three numerical columns
  • Plus several other transformations

The data is tidy and ready for analysis. Before I proceed, I want to write the tidy data set to Bluemix Object Storage. This step is not strictly necessary because we can always regenerate the final data set. However, I may want to perform analysis in another notebook and avoid having to process the raw data again.

As indicated earlier in the post, a “201” return message indicates it was successfully saved to object storage.

Plot data and fit linear model

Basic plotting

Using the tidy data, I will now demonstrate basic graphing, analysis, and the creation of a linear model.

Simple plot of Interest Rate vs. Avg FICO Score:
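With base R graphics this is one call:

    plot(loanDataRaw$FICOAvg, loanDataRaw$InterestRate,
         xlab = "Average FICO Score", ylab = "Interest Rate (%)",
         main = "Interest Rate vs. Average FICO Score")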

Simple correlation

Check for correlations between the variables in the data. This is one example of why numerical data is needed; recall the string-to-numerical representation step:
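A sketch that restricts the correlation matrix to the numeric columns:

    # Build a correlation matrix over the numeric columns only
    numericCols <- sapply(loanDataRaw, is.numeric)
    round(cor(loanDataRaw[, numericCols]), 2)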

Checking for correlations is an important step that helps identify variables tied to both the independent and dependent variables, known as confounders. A full confounding analysis is outside the scope of this post, but I do want to point out how this output might be used to flag confounding data.

The highlighted output shows the relationships between InterestRate, AmountFunded, LoanLength, and FICOavg. Nothing seems too suspicious here. It is reasonable to expect that the interest rate charged for a loan is functionally dependent on the variables marked in green. However, a good question to ask is whether or not AmountFundedByInvestors and LoanLength are necessary for analysis. They have similar correlations to InterestRate. Without further analysis, this question can’t be answered, but it certainly warrants further exploration.

Linear model

In data analysis and machine learning, a common output is a model that can be used to make predictions on new data as it arrives. R makes it easy to fit a linear model. Let’s look at Interest Rate vs. Average FICO Score:
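The fit itself is a single lm() call:

    # Model interest rate as a linear function of average FICO score
    fit <- lm(InterestRate ~ FICOAvg, data = loanDataRaw)
    summary(fit)  # coefficients, residuals, and R-squared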

Plotting with linear model overlay

Plot the data points and overlay the linear model:
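Replot the points and add the fitted line with abline():

    plot(loanDataRaw$FICOAvg, loanDataRaw$InterestRate,
         xlab = "Average FICO Score", ylab = "Interest Rate (%)")
    abline(fit, col = "red", lwd = 2)  # overlay the fitted regression line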

In this tutorial I demonstrated how Data Science Experience, along with an R notebook, can be used to tidy data and ready it for analytics.

Specifically:

  • Project creation in Data Science Experience
  • GitHub association to a project
  • Bluemix object storage credentials: retrieval and use
  • Accessing data in Bluemix and externally via HTTPS
  • Basic methods of data exploration using R
  • Data operations on R data frames
  • Basic introduction to data analysis using R

I strongly encourage you to pursue a deep dive on the topics presented in this tutorial.


Data Citations:

Lending Club — Lending Club Statistics. Accessed via https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv?accessType=DOWNLOAD on 03/25/2017

Giving Credit:

[1] DJ Patil. (2015). Twitter. https://twitter.com/dj44/status/601119768955920384

[2] Hadley Wickham. (2014). Journal of Statistical Software: Tidy Data. https://www.jstatsoft.org/article/view/v059i10