Evolving from Descriptive to Predictive Analytics: Part 3, Fast Start Data Management

Chad Marston
IBM Data Science in Practice
4 min read · Apr 19, 2018

By Chad Marston and Shaikh Quader


The post that follows is the third in an ongoing series about a shift in focus from descriptive to predictive analytics. We hope you’ll check out part 1 and part 2.

At this point in our journey, we had leadership support and the skills needed to transform from descriptive to predictive analytics. The next step was identifying the tools our team would need to succeed.

Machine learning is a subset of data science in which machines learn from data. If your data management strategy exists only to feed descriptive analytics, long-term success will very likely mean that strategy needs to evolve. Most descriptive analytics solutions use purpose-built data marts that provide analytics for a specific function of the business. With machine learning, you’ll quickly find you need to combine data you’ve never combined before. In particular, you might need to bring together disparate internal, external, structured and unstructured data, which can present a major challenge. Don’t let that stop you from moving forward. As Mark Twain once said, ‘The secret of getting ahead is getting started.’ We recommend moving forward on parallel paths: evolve your hybrid data management strategy and tools while building machine learning solutions on the data foundation you have today.

In later blogs, we’ll surface the details of building a data management strategy for machine learning, but for now we’ll focus on what you need to create machine learning solutions as quickly as possible. Our team didn’t wait for a broader, scalable solution to be in place. We needed to deliver business results quickly to provide early returns on our data science investment and generate excitement around the possibilities of machine learning.

During our initial research into machine learning, we spoke to others who had already made the transition to predictive analytics. A common theme was that data management and governance challenges consumed 50–75% of the effort on most projects. We had a team of data scientists looking to apply their skills and a business eagerly awaiting results, so we needed a way to reduce the work of accessing, understanding, and preparing the data.

Data Storage

We needed to accomplish three basic functions with our data: move, store, and govern. We already had a robust relational Db2 data mart supporting our descriptive analytics, and a world-class database solution such as Db2 gave us a great starting point. But there was certainly work to do, since we needed to greatly increase the scope and currency of our data. We quickly realized we should co-locate our data with our tools. Later, we’ll discuss ‘data gravity,’ the idea that data should pull tools into its ecosystem, but in the short term we did the opposite: we moved our data mart to co-locate it with the environment that contained our data science and visualization tools. This immediately improved our data access performance and reliability.
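As a concrete illustration of that payoff, here’s a minimal sketch, in Python, of a data scientist pulling model-ready data straight from the co-located Db2 data mart. The connection string, credentials, and the MART.PIPELINE_FEATURES table are hypothetical placeholders, and it assumes the ibm_db driver’s DBAPI wrapper (ibm_db_dbi).

```python
# Hypothetical sketch: query the co-located Db2 data mart into a pandas
# DataFrame for feature exploration. Connection details and the
# MART.PIPELINE_FEATURES table are placeholders, not our actual schema.
import ibm_db_dbi
import pandas as pd

conn = ibm_db_dbi.connect(
    "DATABASE=MARTDB;HOSTNAME=localhost;PORT=50000;"
    "PROTOCOL=TCPIP;UID=ds_user;PWD=secret",
    "",
    "",
)

cur = conn.cursor()
cur.execute("SELECT * FROM MART.PIPELINE_FEATURES")
columns = [desc[0] for desc in cur.description]
features = pd.DataFrame(cur.fetchall(), columns=columns)
conn.close()

print(features.shape)
```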

Data Movement

Our data mart was the product of a decade of data engineers contributing their preferred data-load solutions along the way: we moved data with Db2 command line scripts, Cognos Data Manager and InfoSphere DataStage. We can’t overemphasize the importance of a robust data movement strategy and solution, since data will need to move quickly and reliably to support your machine learning initiatives.
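To give a flavor of those hand-rolled load jobs, here’s a minimal sketch, in Python, of the kind of ad hoc data movement we’re gradually replacing with DataStage. The CSV extract, connection details, and the STAGE.OPPORTUNITIES table are hypothetical, and it again assumes the ibm_db_dbi driver.

```python
# Hypothetical sketch: a hand-rolled load job of the kind we're migrating
# into DataStage. File, table, and connection details are placeholders.
import csv

import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=MARTDB;HOSTNAME=localhost;PORT=50000;"
    "PROTOCOL=TCPIP;UID=etl_user;PWD=secret",
    "",
    "",
)

with open("opportunities_extract.csv", newline="") as f:
    rows = [(r["id"], r["region"], float(r["amount"]))
            for r in csv.DictReader(f)]

cur = conn.cursor()
# Batch-insert the extract into a staging table in the data mart.
cur.executemany(
    "INSERT INTO STAGE.OPPORTUNITIES (ID, REGION, AMOUNT) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
conn.close()
```

A script like this works, but it has to be scheduled, monitored, and restarted by hand, which is exactly the kind of overhead a managed ETL platform removes.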

We also needed to standardize on an enterprise ETL platform. Our primary needs beyond robust ETL functionality were:

  • The ability to monitor and manage multiple jobs from a single graphical interface
  • A highly scalable parallel framework
  • Native integration into a broader suite of information governance tools

Taken together, those requirements made DataStage the perfect solution. We didn’t take the time to rebuild everything up front; instead, we continue to migrate existing ETL jobs into DataStage while creating new solutions there directly. That reliability and efficiency have turned out to be essential to our machine learning efforts.

Data Governance

The next realization took us a bit longer: the need to manage data quality. You can’t expect high-performing ML models without quality, complete data. We’ve also found that even limited success with ML invites others to engage with your projects. This meant we needed an easy way for new data scientists to understand both the data we had already cleansed and prepared for ML and the structure and quality of new data.

To address those new challenges, we implemented another part of the Information Server suite of products, Information Analyzer, which has helped us to quickly understand the structure, frequency and quality of our data. We can then focus our efforts on understanding the content and cleansing the data to ensure that it’s fit for use within our ML framework.
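For readers without access to Information Analyzer, here’s a rough sketch, in pandas, of the kind of structure, frequency and quality checks it automates for us at enterprise scale; the extract file and column names are hypothetical.

```python
# Hypothetical sketch: basic profiling of a new extract, approximating the
# structure, frequency and quality views Information Analyzer provides.
import pandas as pd

df = pd.read_csv("opportunities_extract.csv")

# Structure: inferred column types and row count.
print(df.dtypes)
print("rows:", len(df))

# Quality: completeness (null rate) and cardinality per column.
profile = pd.DataFrame({
    "null_rate": df.isna().mean(),
    "distinct_values": df.nunique(),
})
print(profile.sort_values("null_rate", ascending=False))

# Frequency: value distribution of a categorical column, useful for
# spotting unexpected or miscoded values before they reach a model.
print(df["region"].value_counts(dropna=False).head(10))
```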

Storing, moving and governing data are foundational to machine learning. Using Db2, DataStage and Information Analyzer as our hybrid data management and unified governance tools gave us an efficient, stable foundation for our short-term machine learning efforts.

In the next part of the blog series, we’ll discuss another piece that’s crucial to your early efforts: a data science tool.
