$600 billion a year. According to The Data Warehouse Institute, that is how much dirty data costs U.S. companies annually. Many people think of dirty data as an IT problem, but it affects every business unit, across every type of industry and organization, both private and public. When U.S. companies were asked which areas of their business were most impacted by bad data, respondents pointed to marketing and lead generation, with impacts to customer relationships not far behind. This finding is further supported by a survey of marketers, in which 54% of respondents said that “lack of data quality/completeness” is the most challenging obstacle to success.
Using dirty data can lead to unwanted consequences, such as:
· Incorrectly approving or denying loan applications
· Targeting the wrong audience for a marketing campaign
· Failing to detect customer churn
· Building an inaccurate or biased predictive model
These consequences can waste your time, frustrate your customers, and cost you money!
Data scientists surveyed by Kaggle reported that the biggest barrier they faced at work was “dirty data”. It is easy to see why: a machine learning model trained on bad data will make bad predictions.
How IBM Watson Studio and Data Refinery Can Help
You can find and remove dirty data with Watson Studio and Data Refinery. These tools provide quick visualizations that summarize your data, helping you understand its quality and distribution. Watson Studio offers a number of tools for data exploration. Typically, one would start with Data Refinery to get a quick visual perspective of the data. This method samples the first 5000 rows to quickly show a “data profile”. A similar analysis can be performed in SPSS, and for a much deeper analysis, Studio provides Jupyter notebooks for visualization with libraries like Matplotlib, Seaborn, Bokeh, or any of your other favorite visualization libraries.
In our motivating example, a data scientist has been asked to develop a machine learning model to predict the probability that an individual will default on a bank loan. Before building a model, the data scientist must perform exploratory data analysis to determine whether or not the data is rich enough to answer the business problem — which individuals will default on a loan. The data scientist must first spend some time visually inspecting the data.
In this post we will use Watson Studio with Data Refinery to explore a loan application data set. Further we will create a set of pre-processing rules to be applied to our data. These rules will be shared across any models we build on this data, assuring consistency across work done by different data scientists. This work will help us understand our data, ultimately preparing us to build a model to assess whether a loan application should be approved or denied.
Initial data load and data type exploration
Before one can visualize and refine data in Watson Studio with Data Refinery, one must create a project. Once the project is created, load the data into the Watson Studio environment as a new data asset. Click on the data asset to open the data in Data Refinery. The first tab is a preview of the data, as shown in Figure 1.
We can already make a number of observations from this preview. Our data was imported as strings, but we can use Refinery rules to convert numeric fields into appropriate numeric types. Once we select “Refine”, the column type is automatically converted for us, as shown in Figure 2. Refinery does not make changes to our data without our consent.
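For readers following along in a notebook, the same type conversion can be sketched in pandas. The column names and values here are illustrative, not the exact schema of the loan dataset:

```python
import pandas as pd

# Toy sample mimicking loan data that arrived as strings
df = pd.DataFrame({
    "loan_amnt": ["5000", "12000", "8500"],
    "annual_inc": ["45000.0", "82000.5", "61000.0"],
})

# Convert string columns to numeric types, much like Refinery's automatic
# conversion when you select "Refine"
df["loan_amnt"] = pd.to_numeric(df["loan_amnt"])
df["annual_inc"] = pd.to_numeric(df["annual_inc"])
```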
In our initial data feed, the column “term” contains values like “36 months” and “60 months”. Machine learning models will work better treating this field as a numeric, so we can create an operation to extract out the 36 and 60 as numeric values. This Refinery operation will be stored and consistently applied to our data — no longer will each data scientist write their own regular expression cleanup rules!
We apply this operation by selecting the column “term”, then “Operation +”, and finally “Split column” to remove the string part of the field by selecting the number of positions from the right to remove as in Figure 3. A similar operation will be performed on “int_rate”.
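The equivalent of these two split operations in a notebook is a one-line string extraction in pandas; this is a sketch with made-up sample values, not the Refinery-generated code:

```python
import pandas as pd

# Toy rows shaped like the "term" and "int_rate" columns in the raw feed
df = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "int_rate": ["13.56%", "7.97%", "10.42%"],
})

# Pull the leading digits out of "term" and store them as integers
df["term"] = df["term"].str.extract(r"(\d+)", expand=False).astype(int)

# Strip the trailing "%" from "int_rate" and store it as a float
df["int_rate"] = df["int_rate"].str.rstrip("%").astype(float)
```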
We can also see that some fields are “categorical” (they only have a couple of distinct values). These should be encoded as categories for the sake of modeling, for example the columns “home_ownership” (values “Own”, “Mortgage”, and “Rent”), “verification_status” (values “Verified”, “Source Verified”, and “Not Verified”), and “loan_status” (values “Fully Paid” and “Charged Off”) have a well-defined and limited set of allowed values.
These columns can be encoded in place with a numeric value that represents each string value. The benefit of this approach is its simplicity, but be aware that the more categories are present, the larger the encoded values become, and a machine learning algorithm may misinterpret a variable encoded as 5 as carrying five times the weight of a variable encoded as 1. Another option is to one-hot encode categorical variables, where each category value becomes a new column assigned a “1” or “0”. This avoids improper weighting, but it is inefficient when there are many categories, since each one creates a new column. Let’s select the “loan_status” column, then “Operation +” and the “Replace substring” operation, to change the categories “Fully Paid” and “Charged Off” to “1” and “0”, respectively.
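Both encoding strategies are easy to sketch in pandas. This toy example reuses the column values described above but is not the exact Refinery output:

```python
import pandas as pd

df = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid"],
    "home_ownership": ["Own", "Rent", "Mortgage"],
})

# In-place numeric encoding of a binary target, like the Replace substring step
df["loan_status"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})

# One-hot encoding: each category value becomes its own 0/1 column
df = pd.get_dummies(df, columns=["home_ownership"])
```

Note how one-hot encoding the three `home_ownership` values produces three new columns, which is exactly the column growth the paragraph above warns about.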
We notice that some fields are sparsely populated or not populated at all, like the columns “debt_settlement_flag_date”, “settlement_status”, and “settlement_date”. We can go back to our business team with this finding — they may be surprised to find out this data is not available. The team can explore additional avenues to find this data if they feel it is critical. We can also just remove these columns from our analysis since there is not enough information here to help a model.
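Dropping sparsely populated columns can be sketched with a simple missing-value threshold; the 70% cutoff below is an arbitrary illustration, not a rule from the tooling:

```python
import numpy as np
import pandas as pd

# Toy data with one mostly-empty column, like "settlement_date"
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8500, 10000],
    "settlement_date": [np.nan, np.nan, np.nan, "2019-01-01"],
})

# Drop any column where more than 70% of the values are missing
threshold = 0.7
sparse_cols = [c for c in df.columns if df[c].isna().mean() > threshold]
df = df.drop(columns=sparse_cols)
```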
A repeatable flow
We can explore all the fields and capture similar observations. Once all fields have been refined, we can schedule a job to create a Data Refinery flow that applies our changes to the entire dataset. Once the dataset has been refined, it will appear in the Project Assets as a Data Refinery flow.
This data refinement only needs to be done by one team member. Now the rest of our team can use the same consistent view and encoding of the data, thanks to operations encoded in the Data Refinery flow!
Profiling the data
In the Refinery flow we created rules and operations to explore data samples and consistently apply data types. We can also visually explore the data using a Profiler.
Select the Profile tab and “Create Profile”. This creates a profiling task that provides statistical information about each field. The output is a quick visualization of each data field, including the number of unique values, the max, the min, the mean, and the standard deviation. This allows you to gain a better understanding of the data and the distribution of values in each column. Notice that the average loan amount is around $13,000 and that the majority of loans have a term of 36 months. The distribution of loan terms is interesting, and we will explore it further later.
For some fields the summary statistics will not make sense (for example, the mean and standard deviation of an ID field are not meaningful), but they are a great way to start an analysis.
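The same profile-style summary is available in pandas. The sample values below are invented so that the mean lands near the $13,000 figure mentioned above:

```python
import pandas as pd

# Invented sample; real profiling would run over the loaded data asset
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8500, 26500],
    "term": [36, 36, 36, 60],
})

stats = df.describe()         # count, mean, std, min, quartiles, max per column
unique_counts = df.nunique()  # number of distinct values per column
```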
Visualizing the data
Another way to visualize the data in Refinery is to view the data graphically. This will help you quickly see insights about your data from graphs and charts. Start the process by selecting the Visualizations tab. Here we are given several options for built-in chart types. Let’s create a Histogram chart and take a look at the distribution of accepted and rejected loans.
Notice that there are more rejected loans than accepted loans. This means the dataset is imbalanced: there is more data available for rejected loans than for accepted loans. Be careful, as this could bias the machine learning model toward the rejected-loan class. One way to overcome the imbalance is to over-sample the minority class or to under-sample the majority class.
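A minimal over-sampling sketch in pandas, using a hypothetical 0/1 label column to stand in for the loan outcome (libraries such as imbalanced-learn offer more robust tooling):

```python
import pandas as pd

# Toy imbalanced data: six majority-class rows, two minority-class rows
df = pd.DataFrame({"label": [0] * 6 + [1] * 2, "loan_amnt": range(8)})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Over-sample the minority class with replacement to match the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_up])
```

Under-sampling is the mirror image: sample the majority class down to the minority's size instead.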
Let’s now visualize the credit risk grade with a Bar chart and take a look at the various credit risk ratings. In this example, the grades range from A to G.
Notice that most of the individuals that applied for a loan had a credit risk grade of A, B, or C. This may not be surprising — most loans fall into “good” or “average” grades. As you might expect from a successful bank, there are fewer “bad” loans with E through G grades. We note that we may not have enough data on bad loans to make solid predictions.
Our analysis becomes more powerful when we remember our previous insight that “loan term” had a skewed distribution. When we visualize across two dimensions (“loan grade” and “loan term”), we gain further insights. It is interesting to see that individuals with a grade below D have a longer term to repay the loan, while individuals with a grade of A have a shorter term. Is this correlation or causation? We don’t know yet, but now we have a great question to explore further. We would not have known to ask this question if we hadn’t been able to visualize and explore our data so quickly.
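This two-dimensional view corresponds to a simple cross-tabulation of the two columns; here is a sketch with made-up grades and terms:

```python
import pandas as pd

# Invented sample pairing loan grade with loan term
df = pd.DataFrame({
    "grade": ["A", "A", "B", "D", "E", "E"],
    "term":  [36, 36, 36, 60, 60, 60],
})

# Count how many loans fall into each (grade, term) combination
ct = pd.crosstab(df["grade"], df["term"])
```

Plotting `ct` as a grouped or stacked bar chart reproduces the two-dimensional visualization described above.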
Now that we have a better understanding of the data through refining and visualizing, the Data Scientist can begin to build a machine learning model to predict the probability that an individual will default on a bank loan.
There are many ways to use Data Refinery to understand what is in the data and how it is distributed. For a deeper analysis, one can use a Watson Studio Jupyter notebook by loading the data into a pandas data frame. Studio will automatically generate the Python code for you. Create a notebook, and click on the data icon at the top right of the screen (signified by the 1001 icon). You can perform a similar analysis on the data as you did with Data Refinery; the big difference is that it will run on all of your data and not just the first 5000 records.
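A sketch of that notebook workflow, with an in-memory CSV standing in for the project data asset (in Studio, the insert-to-code feature generates the actual load for you):

```python
import io
import pandas as pd

# Stand-in for the data asset; Studio's generated code would read the real file
csv_data = io.StringIO("loan_amnt,term\n5000,36\n12000,60\n8500,36\n")
df = pd.read_csv(csv_data)

print(df.shape)       # covers the full dataset, not just the first 5000 rows
print(df.describe())  # the same profile-style summary, over every record
```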
If you’re interested in consulting with IBM on your AI projects, you can learn more about and sign up for an engagement with the IBM Garage or IBM Data and AI Expert Labs & Learning. The team of experts at IBM is experienced in implementing Watson for a variety of industries and use-cases and would be happy to help with yours.