Predictive Modeling Guidelines & Best Practices

David Hsiao
Dow Jones Tech
Nov 15, 2018

Machine Learning is reshaping how organizations tackle business challenges in almost every industry. Here at Dow Jones, with growing volumes of structured and unstructured data flowing into our systems every day, we are gaining tremendous benefits from using ML tools and techniques to optimize new product development, streamline internal operations, and deliver better insights to customers.

In particular, as we think about how to personalize customer experiences for The Wall Street Journal, Barron’s and MarketWatch, and improve data insights in our Professional Information Business (Factiva, DNA, Risk & Compliance), we are learning when and where to properly leverage Machine Learning to help shape the future of our business.

In anticipation of expanding even further into the world of ML, we worked with our AI Center of Excellence to develop a set of standardized guidelines and rules to promote proper and sustainable machine learning modeling practices across Dow Jones.

Now we would like to share our work with you, in the hope that following the workflow and guidelines below will help you get a head start on implementing a successful predictive modeling application.

Workflow Chart
1. DATA PREPARATION

Rule #1: Design Metrics.

Before working with any sort of predictive modeling, you need to know what you are optimizing your model for and understand your existing data. When beginning any sort of predictive work, be sure that you can answer the following:

  • Have I generated my hypothesis? What do I want to optimize?
  • What is the source of my data? How is it collected?
  • What format is it in?
  • Are there any security or privacy concerns associated?
  • What data is relevant to the problem I am looking to solve?
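
Answering "what do I want to optimize?" is easiest when the metric is pinned down as code before any modeling begins. Here is a minimal sketch; the churn scenario, the scores, and the top-3 cutoff are hypothetical illustrations, not a Dow Jones example:

```python
# Sketch: turn a vague goal ("reduce churn") into a concrete, testable metric.
# The scenario and numbers below are hypothetical.

def precision_at_k(scores_and_labels, k):
    """Of the k accounts the model flags as most at-risk,
    what fraction actually churned?"""
    ranked = sorted(scores_and_labels, key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / k

# (risk score, actually_churned) pairs from a hypothetical validation set
validation = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 0), (0.2, 1), (0.1, 0)]
print(precision_at_k(validation, 3))  # 2 of the top 3 churned -> 0.666...
```

Writing the metric down this explicitly forces the hypothesis ("we can rank accounts by churn risk") and the optimization target ("precision among the accounts we act on") into the open, where stakeholders can agree or object.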

Rule #2: Understand your data.

You have to understand what your data is saying. It is one thing to know what data you have, but it is more important to know what that data says about the world you are looking to represent. Ask yourself the following:

  • Have I done any exploratory analysis of my data?
  • Is the post-transformation data still “true” and reflective of an objective reality?
  • Have I recorded the details of the collection and transformation processes?
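
A first pass at exploratory analysis can be as small as comparing summary statistics. A minimal sketch using only the standard library (the field name and values are hypothetical):

```python
# Minimal exploratory checks before modeling.
# The field and values below are hypothetical.
import statistics

# daily session counts, with one suspicious spike
sessions = [12, 15, 14, 13, 200, 16, 14]

summary = {
    "n": len(sessions),
    "mean": statistics.mean(sessions),
    "median": statistics.median(sessions),
    "stdev": statistics.stdev(sessions),
}
# A mean far from the median hints at outliers or collection errors
# worth investigating before the data feeds a model.
print(summary)
```

Here the mean (about 40) is nearly three times the median (14), a sign that the data may not be "true" to the reality it should represent and that the collection process deserves a closer look.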

2. PREDICTIVE MODELING

Rule #3: Keep the first model simple and get the infrastructure right.

The first model provides the biggest boost to your application, so it does not have to be fancy. A simple model will give you baseline metrics and behaviors that you can use to test more complex models later on. Make sure you align the tools and architecture with all impacted parties across the organization.

  • Can I start by selecting one or two features to test my hypothesis?
  • Are my selected features reproducible in the future?
  • Do I have a baseline model? How acceptable is it?
  • Where do I store my model and how can I scale it?
  • Are the tools and architecture agreed upon by everyone involved?
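
As a sketch of what "simple first model" can mean in practice: a majority-class baseline, then a single hand-picked feature, compared on the same accuracy metric. The data and the threshold rule are hypothetical:

```python
# A deliberately simple start: predict the majority class, then see whether
# a single hand-picked feature beats it. All data here is hypothetical.
from collections import Counter

def majority_baseline(train_labels):
    """Return a model that always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# (feature value, label) pairs -- e.g. "visits last week" vs. "subscribed"
train = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (2, 0), (7, 1), (1, 0)]
labels = [y for _, y in train]

baseline = majority_baseline(labels)        # always predicts 0
def one_feature(x):                         # single-feature threshold rule
    return 1 if x >= 5 else 0

print(accuracy(baseline, train))     # 5/8 = 0.625
print(accuracy(one_feature, train))  # 8/8 = 1.0 on this toy set
```

The baseline number (0.625 here) is the bar every later, more complex model has to clear; without it, an impressive-sounding accuracy may be no better than guessing the majority class.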

Rule #4: Test the infrastructure independently from the machine learning.

Your infrastructure has to be testable. Make sure you have tests for the code that creates examples in training and serving, and that you can load and use a fixed model during serving.

  • How do I generate representative testing data and test the model performance?
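
One way to test the infrastructure independently is to run the serving path with a fixed, trivial model, so that any failure is an infrastructure bug rather than a modeling problem. A sketch under assumed (hypothetical) function and file names:

```python
# Sketch: exercise the serving path with a fixed, trivial model, so
# infrastructure bugs are not confused with modeling problems.
# Function and file names are hypothetical.
import os
import pickle
import tempfile

def featurize(raw):
    # Shared by training and serving; skew between the two paths
    # is a classic source of silent quality loss.
    return [len(raw), raw.count(" ")]

class FixedModel:
    """A deterministic stand-in model with known, hand-checkable outputs."""
    def predict(self, features):
        return 1 if features[0] > 10 else 0

# Save and reload the model exactly as the serving code would.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(FixedModel(), f)
with open(path, "rb") as f:
    served = pickle.load(f)

# Infrastructure tests: same featurization, known predictions.
assert featurize("hello world") == [11, 1]
assert served.predict(featurize("hello world")) == 1
assert served.predict(featurize("hi")) == 0
print("serving path OK")
```

Because the model's behavior is fixed and known, the representative test data and expected predictions can be written by hand, and the test isolates exactly the load-and-serve machinery.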

Rule #5: Selecting and refining your model.

Agree with your stakeholders on what a “production-ready model” will look like in terms of its predictive performance. Understand whether your model is solving the problem you are trying to optimize. If not, revisit the objective of your model and re-select your features.

  • Have all my stakeholders agreed on an acceptable level of predictive performance for the model?
  • Have I rigorously assessed the model’s precision or overall accuracy? Does the predictive performance meet the business needs?
  • Am I confident in the robustness of the model’s predictive power across testing conditions? If not, how can I refine the model until it meets production-ready standards?
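
The stakeholder agreement can be made concrete by encoding the agreed bar next to the evaluation itself. A minimal sketch; the predictions, labels, and the 0.8 precision bar are all hypothetical:

```python
# Sketch: check a model's precision against a stakeholder-agreed bar before
# calling it production-ready. Predictions and the 0.8 bar are hypothetical.

def precision_recall(predictions, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

preds  = [1, 1, 0, 1, 0, 1, 0, 0]
labels = [1, 1, 0, 0, 0, 1, 1, 0]
p, r = precision_recall(preds, labels)

AGREED_PRECISION = 0.8  # the bar stakeholders signed off on
print(f"precision={p:.2f} recall={r:.2f} ready={p >= AGREED_PRECISION}")
```

On this toy set the model reaches 0.75 precision, below the agreed 0.8 bar, so by the agreement it is not yet production-ready and the objective or features need revisiting.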

Rule #6: Explainability of the output.

What is your model’s prediction telling you? Be sure you understand your model as more than a “black box.” Understand, and be ready to explain, why your model behaves the way it does, so that you can leverage the insights and develop appropriate strategies to meet the business needs.

  • Can I explain what my model is doing?
  • Can I implement any actionable strategies based on these insights?
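
One simple, model-agnostic explainability check is permutation importance: shuffle one feature at a time and measure how much accuracy drops. A large drop means the model leans on that feature. A toy sketch with a hypothetical two-feature model and data set:

```python
# Sketch of permutation importance: shuffle one feature at a time and see
# how much accuracy drops. Model and data below are hypothetical.
import random

random.seed(0)

def model(row):
    # A transparent stand-in: this model uses only the first feature.
    return 1 if row[0] >= 5 else 0

data = [([8, 3], 1), ([9, 1], 1), ([2, 7], 0),
        ([1, 2], 0), ([7, 9], 1), ([3, 4], 0)]

def accuracy(rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def mean_importance(i, trials=200):
    """Average accuracy drop when feature i is shuffled across rows."""
    base = accuracy(data)
    total = 0.0
    for _ in range(trials):
        col = [x[i] for x, _ in data]
        random.shuffle(col)
        permuted = [(x[:i] + [v] + x[i + 1:], y)
                    for (x, y), v in zip(data, col)]
        total += base - accuracy(permuted)
    return total / trials

for i in range(2):
    print(f"feature {i}: mean accuracy drop {mean_importance(i):.2f}")
```

Here shuffling feature 0 hurts accuracy substantially while shuffling feature 1 does nothing, which matches what we know about the stand-in model; for a real black-box model, the same check reveals which inputs the predictions actually depend on.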

3. PRODUCTIONISATION

Rule #7: Monitoring: know the requirements of your application.

Understand how much performance will degrade over a given time interval. This information can help you set priorities for your monitoring. If you lose significant quality because the model is not updated within a fixed period, make sure to update or retrain your model to meet the performance requirements. You should also be deliberate about how you use your tools and resources in order to achieve long-term sustainability.

  • Am I actively reporting on how much performance degrades over time?
  • Have I agreed with the business on the frequency of model updates to maintain consistent performance?
  • Am I actively maintaining my data storage and monitoring costs?
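
The agreed update frequency can be turned into a simple monitoring rule: compare recent accuracy against the level measured at launch, and flag when the gap exceeds an agreed budget. A sketch in which the launch accuracy, the 5-point budget, and the weekly counts are all hypothetical:

```python
# Sketch of a degradation monitor: flag when recent accuracy falls more
# than an agreed amount below launch accuracy. All numbers are hypothetical.

LAUNCH_ACCURACY = 0.92
DEGRADATION_BUDGET = 0.05  # agreed with the business: retrain beyond this

def needs_retraining(recent_correct, recent_total):
    recent_accuracy = recent_correct / recent_total
    return (LAUNCH_ACCURACY - recent_accuracy) > DEGRADATION_BUDGET

# Weekly (correct, total) counts from a hypothetical prediction log
weekly = [(915, 1000), (902, 1000), (880, 1000), (858, 1000)]
for week, (correct, total) in enumerate(weekly, start=1):
    flag = "RETRAIN" if needs_retraining(correct, total) else "ok"
    print(f"week {week}: accuracy={correct/total:.3f} {flag}")
```

In this toy log the model stays within budget for three weeks and trips the retraining flag in week four, which is exactly the kind of signal the business-agreed update frequency should be anchored to.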

Rule #8: Responsibility.

Take full ownership of communicating your findings to impacted parties, including customers. Be responsible, both budget-wise and in terms of technical maintenance, for the long-term sustainability of your model.

  • Have I documented everything so that I can back up and explain my findings?
  • Who will be impacted by my model in the long term?

Hopefully, by applying these rules to your projects, you will find success in delivering robust and scalable predictive modeling applications for your business.

If you have any questions along the way regarding these suggested guidelines, please do not hesitate to reach out to David Hsiao, Nick Varney, Alex Siegmen, Kabir Seth, or John Wiley.

Thanks to Katie Burke.
