When we start a data science project to build machine learning models, we typically begin with exploratory analysis to understand the data. During this process, we skim through datasets to quickly identify the type of each attribute, spot outliers, and discover candidate relationships. To get a feel for the data, we also run univariate and multivariate techniques such as descriptive summaries, histograms, line charts, and pairwise scatterplots.
Typically, as data scientists, we would write code by hand in R or Python to explore the data. Alternatively, we could use BI tools to visualize findings and spot any relationships.
Both options have some limitations. They require:
- time to write code or create charts
- coming up with test hypotheses
- statistical knowledge to test and understand results
Db2 Augmented Data Explorer (ADE) is an IBM offering that sits somewhere between data science tools and BI tools. It understands natural language requests and returns real-time search results while a user is entering a query. These results are also augmented with statistical insights that highlight what is important in the returned data.
With ADE, we don’t need to write code, drag and drop fields to create visualizations, or spend time testing many hypotheses manually. Instead, we ask questions in natural language, and ADE recommends the best insights.
For this blog, we chose a Kiva loan dataset hosted on Kaggle. This is a public dataset, and we hope it will make it easier for others to replicate our results. Using ADE, we will join, explore, and visualize the data in order to identify characteristics of Kiva loans. We will show how exploratory analysis is a stepping stone towards building machine learning models.
A quick note: this blog is neither affiliated with nor sponsored by Kiva or Kaggle.
About the data science challenge
The challenge invited members to build machine learning models to estimate poverty levels in regions where Kiva had active loans. This requires inference based on limited information for each borrower.
There are four datasets (CSV files):
- Loans: data on some of Kiva’s loans
- MPI region: locations of sub-national regions, along with their MPI (Multidimensional Poverty Index, a supplemental measure of poverty)
- Loan theme: loan data that can be matched to loan regions
- Loan regions: regional data related to loan themes
Automatic metadata and join inference
Once the data is imported, ADE automatically infers joins by inspecting column names and the data in the columns. Here, it infers a join between the MPI Region and Loan Region tables.
ADE further determines additional metadata such as measurement levels and roles for fields.
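To make the idea of join inference concrete, here is a minimal Python sketch of one way to propose join keys from column names and column values. This is not ADE's actual algorithm, which is not described here. The function names, the `min_overlap` threshold, and the toy tables (including the column names and MPI values) are all assumptions made up for illustration.

```python
def normalize(name):
    """Lowercase a column name and drop spaces, underscores, and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def infer_join_candidates(left, right, min_overlap=0.5):
    """Propose (left_col, right_col, overlap) join candidates.

    `left` and `right` map column names to lists of values.
    A pair qualifies when the normalized names match and enough of the
    distinct values in the smaller column also appear in the other one.
    The 0.5 default threshold is an arbitrary illustrative choice.
    """
    candidates = []
    for lcol, lvals in left.items():
        for rcol, rvals in right.items():
            if normalize(lcol) != normalize(rcol):
                continue
            lset, rset = set(lvals), set(rvals)
            if not lset or not rset:
                continue
            overlap = len(lset & rset) / min(len(lset), len(rset))
            if overlap >= min_overlap:
                candidates.append((lcol, rcol, overlap))
    # Strongest candidates first.
    return sorted(candidates, key=lambda c: c[2], reverse=True)


# Toy tables with invented values, for illustration only.
mpi_region = {
    "region_name": ["Bengo", "Kabul", "Nairobi"],
    "mpi": [0.36, 0.30, 0.02],
}
loan_region = {
    "Region Name": ["Kabul", "Nairobi", "Nairobi", "Manila"],
    "amount": [500, 300, 250, 400],
}

candidates = infer_join_candidates(mpi_region, loan_region)
for left_col, right_col, overlap in candidates:
    print(f"{left_col} <-> {right_col}: {overlap:.0%} of distinct values shared")
```

A real tool would likely also weigh data types, uniqueness of the key, and fuzzy name matches. The sketch only shows that names and values together are enough to suggest a plausible join.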
Automatic univariate analysis
ADE automatically runs analysis for all fields (~40) and ranks findings based on their effect scores (no coding required).
One of the findings shows that the Sub-Saharan region has the highest count of loans.
This prompts us to explore loans by country.
Easy natural language search, automatic visualizations
We simply ask, “What is the total loan amount by country?” The results show that the Philippines has by far the greatest loan amount, followed by Kenya.
Notice ADE automatically detects country names and shows a map.
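For comparison, this is roughly what we would write by hand in Python to answer the same question. It is only a sketch of the underlying aggregation, not of how ADE parses the query. The `total_by` helper, the field names, and the loan records are hypothetical and invented for illustration.

```python
from collections import defaultdict


def total_by(rows, group_field, value_field):
    """Sum `value_field` per distinct `group_field`, largest total first."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[value_field]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Invented records, not taken from the Kiva dataset.
loans = [
    {"country": "Philippines", "loan_amount": 300.0},
    {"country": "Kenya", "loan_amount": 250.0},
    {"country": "Philippines", "loan_amount": 400.0},
    {"country": "Kenya", "loan_amount": 200.0},
    {"country": "Peru", "loan_amount": 150.0},
]

ranking = total_by(loans, "country", "loan_amount")
for country, total in ranking:
    print(f"{country}: {total:.0f}")
```

Even this small example needs a decision about grouping, aggregation, and sort order, which is the work the natural language query takes off our hands.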
Automatic natural language summary
A top finding tells us that Sector has no meaningful relationship with Loan Amount. Notice ADE generates a natural language summary to translate results from statistical tests.
This insight is useful for predictive modeling, as we can avoid using Sector as a predictor for Loan Amount.
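The blog does not say which statistic ADE uses for a categorical field against a continuous one. As an assumption, the sketch below uses eta squared from a one-way ANOVA, a common effect size for this situation: it is the share of variance in the continuous field explained by the groups. The function name and the sample values are invented for illustration.

```python
from collections import defaultdict
from statistics import fmean


def eta_squared(categories, values):
    """Share of variance in `values` explained by `categories` (0 to 1)."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = fmean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    # If every value is identical there is no variance to explain.
    return ss_between / ss_total if ss_total else 0.0


# Invented values: both sectors have the same mean loan amount.
sectors = ["Agriculture", "Agriculture", "Retail", "Retail"]
amounts = [100.0, 300.0, 100.0, 300.0]

print(f"eta squared: {eta_squared(sectors, amounts):.2f}")
```

A value near zero, as in this toy case, is the kind of result that would justify dropping Sector as a predictor of Loan Amount.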
Automatic bivariate analysis
ADE detects that Loan Amount and Term in Months are both continuous and evaluates the predictive and association strength of their relationship. It automatically runs statistical tests, such as the Fisher transformation of the Pearson correlation and the F-test for regression. The natural language generation (NLG) summary highlights a weak association between the two fields, and the linear regression model confirms that there is no meaningful relationship between loan amount and term in months.
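The two tests named above can be sketched in plain Python. This is a textbook version under stated assumptions, not ADE's implementation: the function names and the sample data are invented, the Fisher test uses the usual normal approximation with standard error 1/sqrt(n - 3), and the F statistic is the one for a simple regression with a single predictor. Turning the F statistic into a p-value needs the F distribution, which the standard library does not provide, so the sketch stops at the statistic.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_test(x, y):
    """Test H0: correlation = 0 using the Fisher transformation.

    Returns (r, z_statistic, two_sided_p_value). Needs n > 3 and |r| < 1.
    """
    n = len(x)
    r = pearson_r(x, y)
    z = math.atanh(r)                 # Fisher transformation of r
    z_stat = z * math.sqrt(n - 3)     # z divided by its standard error
    p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
    return r, z_stat, p_value


def regression_f_statistic(x, y):
    """F statistic for a simple linear regression of y on x (1 and n-2 df)."""
    n = len(x)
    r2 = pearson_r(x, y) ** 2
    return r2 / (1 - r2) * (n - 2)


# Invented values, not taken from the Kiva dataset.
term_in_months = [6, 8, 10, 12, 14, 20, 24, 36]
loan_amount = [400, 250, 900, 300, 650, 350, 800, 500]

r, z_stat, p_value = fisher_z_test(term_in_months, loan_amount)
f_stat = regression_f_statistic(term_in_months, loan_amount)
print(f"r={r:.3f}, z={z_stat:.3f}, p={p_value:.3f}, F={f_stat:.3f}")
```

A small correlation with a large p-value and a small F statistic would support the same reading as the NLG summary: the association is too weak to be useful in a model.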
Automatic time series analysis and forecasting
ADE identifies a time-series relationship between Loan Amount and Posted date. It detects seasonality, trends, and outliers. Finally, it uses an exponential smoothing model to forecast the next six months of loan amount.
The natural language summary highlights the key insights: loan amounts are trending down, with an unusually low value in July 2017. This may prompt us to probe why this is happening.
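The blog does not specify which exponential smoothing variant ADE uses. As one possible illustration, the sketch below implements Holt's linear trend method (double exponential smoothing), which tracks a level and a trend and projects them forward six periods. It ignores seasonality. The function name, the smoothing parameters `alpha` and `beta`, and the monthly totals are assumptions made up for this example.

```python
def holt_forecast(series, horizon=6, alpha=0.5, beta=0.3):
    """Forecast `horizon` steps ahead with Holt's linear trend method.

    alpha smooths the level and beta smooths the trend; both are
    illustrative defaults rather than tuned values.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + step * trend for step in range(1, horizon + 1)]


# Invented monthly totals with a downward drift.
monthly_loan_amount = [980.0, 1010.0, 940.0, 900.0, 915.0, 860.0, 700.0, 820.0]

forecast = holt_forecast(monthly_loan_amount)
print([round(value, 1) for value in forecast])
```

In practice the smoothing parameters would be fitted to the data, and a seasonal variant such as Holt-Winters would be a natural choice when the series shows the seasonality that ADE reports.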
ADE helped us quickly get familiar with the data and laid the groundwork for further data science tasks. It assisted in discovering relationships across tables and recommended interesting results to explore further. It also showed various charts for analysis and summarized useful details in natural language.
Very quickly, we shortlisted a few variables for further analysis and inclusion in models to estimate poverty levels.
IBM Db2 Augmented Data Explorer is now in beta. We invite you to try it out, and share your feedback with us!
Recommended additional reading: “Embed NLQ and Augmented search into your applications, with Db2 Augmented Data Explorer”.