Dear Business Managers: Your Data Is NOT Ready for Machine Learning (here’s how to fix it)

John Ang
7 min read · Mar 10, 2019


So, you’re a business executive in a department of a medium-to-large corporation. This quarter, you (or your boss) have decided that you need to move beyond using data for reporting. You now want to conduct advanced analytics on that data to make it yield more value.

“We need to LEVERAGE the insights contained in our data to optimize our business and DRIVE profitability!”

— Your boss, probably, after his/her quarterly strategy meeting

Throughout the time I’ve spent engineering AI products, I’ve heard this line launch dozens of projects. These projects get funded and staffed up with data professionals. Then, somehow, they begin to fall behind schedule and underperform shortly thereafter.

In my experience, most of these problems begin with data that is not ready for the project.

It begins with data that is not ready

Because I spent my early career on the business side, I also know that it is not easy for a non-technical person to determine when the data is ready. This is especially difficult when every tech vendor promises you off-the-shelf ML solutions, and even more so when your competitors start blogging about “advanced ML integrations” in their products.

Maybe it’s not bad data. Maybe it’s bad talent! That question is what drove me to learn for myself how to build AI products. And good news for all managers: the bottleneck is usually the data, which is far easier to fix than bad talent.

Below is a checklist you can use yourself to determine whether your data is ready for a machine learning project. No data professional needed.

1. Do You Have The Right Data?

I remember a meeting I once took with a non-profit organization. They wanted to optimize outreach and engagement across their digital and physical platforms. During this meeting, I asked the business manager, “What metrics are you tracking, and what exactly are you trying to optimize?”

The response from across the table was, “I don’t know, I need you to look at The Data and tell me!” The manager said this while waving her hands at some imaginary bytes in the air in front of her. (Bless her soul, I could actually hear the capitalization in her words.)

Unfortunately, there is no such thing as “The Data”. But even if it existed, it would not be a good use of a data professional’s time to fish around in raw data in the hope of a cure-all. Your data professional can deliver far more value per man-hour when focused on extracting insights from “good” data.

So, the first question the business manager should have asked was “What am I trying to optimize?” In this case, it would not have been enough to answer “Engagement”. The answer needs to be more specific; in other words, it needs to be something that can be described with data. It could have been “I’m trying to optimize the number of people who show up for our events” or “I’m trying to increase the number of people following us on Facebook and Instagram”.

Important note: If you’re not actually counting how many people show up for the event (versus just signing up), you cannot optimize attendance. It’s common sense, but it is surprising how often organizations take data for granted and fall into this trap.

Question 1: What am I trying to optimize?

Once our goal is set, we have to ask ourselves whether we have data that measures the difference between an optimal and a non-optimal outcome.

For example, metrics that relate to the number of people showing up for an event could include fields such as:

  1. Days since last event
  2. Number of marketing messages opened
  3. Event location
  4. Event type, etc.

When you find the sweet spot in that combination of metrics, you’ll be able to optimize the number of people showing up for an event.

In this regard, having an inventory of the data you have on hand will be useful, both for yourself and as a starting point for any data professional to begin work.

Question 2: Do I have metrics that change as the optimization target changes?
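To make this concrete, here is a minimal sketch (in Python, with purely illustrative column names and numbers, not data from the non-profit above) of how a data professional might take an inventory like the one described and check which metrics actually move with the optimization target:

```python
# A minimal sketch: does each candidate metric move with the target?
# All column names and values below are hypothetical.
import pandas as pd

# Hypothetical inventory of past events, one row per event.
events = pd.DataFrame({
    "days_since_last_event":     [30, 14, 60, 7, 21],
    "marketing_messages_opened": [120, 340, 80, 400, 210],
    "event_location": ["downtown", "suburb", "downtown", "downtown", "suburb"],
    "event_type":     ["talk", "workshop", "talk", "workshop", "talk"],
    "attendees":      [45, 90, 30, 110, 60],  # the optimization target
})

# Quick first look: which numeric metrics correlate with attendance?
print(events.corr(numeric_only=True)["attendees"].sort_values())
```

Even a rough check like this tells you whether the fields you listed carry any signal about the outcome you care about, before anyone invests in modelling.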

2. Do You Have Enough Data?

We all know how the slide rule evolved into pocket calculators as technology advanced. Similarly, machine learning is just advanced mathematics and statistics put in a more convenient form.

To do statistics on data, we need multiple samples of the data to begin to identify patterns. Bad news: there are few rules of thumb for answering the question “how much data is enough?” This is due to the interplay between the strength of the signals present in each dataset and the ability of different models to pick up on those signals.

The good news: there are clear indicators of data quality that can contribute significantly to the precision of your final product:

  1. You have more than one data point. You can’t do much with just one.
  2. Your data is not patchy. If your data has missing values, data professionals will have to guess what those missing values are, and errors here can compound significantly.
  3. You have a historical record going back at least one year. While there is much debate about the ideal volume of data, there is little you can do to change how historical data was collected. One full year of records is, at the very least, representative of a full business cycle in terms of trends. (A quick way to check both of these points is sketched below.)
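Both checks are quick to run. Here is a minimal sketch, assuming the event records live in a hypothetical CSV file with an “event_date” column (the file name and columns are illustrative):

```python
# A minimal sketch: how patchy is the data, and how deep is the history?
# "event_history.csv" and its columns are hypothetical.
import pandas as pd

events = pd.read_csv("event_history.csv", parse_dates=["event_date"])

# 1. How patchy is the data? Share of missing values per column.
print(events.isna().mean().sort_values(ascending=False))

# 2. How deep is the historical record?
start, end = events["event_date"].min(), events["event_date"].max()
print(f"History covers {(end - start).days} days "
      f"({start:%Y-%m-%d} to {end:%Y-%m-%d})")
```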

Question 3: Is my data patchy? How deep is my historical record?

3. Can You Access Your Data?

You’ve identified your optimization goal and confirmed that you have the necessary data. Now, you need to look outside the data for external factors that could limit your access to it. In my experience, these limiters fall into two main categories:

  1. Compliance-related: Sometimes the data you need belongs to a different team, and getting access to it is blocked by compliance-related processes.
  2. Technical: Even if you have compliance clearance, the data stored by other teams might be in different formats. It might also require specialized software or skills to access properly.

These external limiters are usually among the most frustrating elements of data-related projects when they crop up. More and more departments within an organization are empowered (or disempowered, due to data privacy laws) to collect data. So it is more critical than ever to make friends and contacts across departments, particularly with the back-end data teams where they exist ;)

Question 4: Is my access to data blocked by compliance-related or technical obstacles?

4. Can You Maintain and Update Your Data?

Most data professionals consider their job done once they’ve worked the data into a usable model and delivered it to the business user. But you have one more key checkpoint: can you repeat the process with new data?

Don’t forget that the entire value of the optimization project was to be able to act on live data to influence future outcomes. If the live data you are feeding into these models suffers from quality or access issues, the model will have little to work with, and prediction accuracy will suffer.

Additionally, this live data eventually becomes historical data that you can use to refine your models. So it is essential that, while the data professionals are working with the data, strong processes and official channels are created on the business side to ensure the data is properly updated and archived.

Question 5: Can you repeat the process with new data?
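What a repeatable process looks like will differ by organization, but here is a minimal sketch (illustrative file and column names, not a production pipeline) of the kind of refresh step this question is asking about: new data gets the same quality checks before it is archived and fed back into the model.

```python
# A minimal sketch of a repeatable data refresh step.
# File name, column names and the 10% threshold are all illustrative.
import pandas as pd

REQUIRED_COLUMNS = ["event_date", "event_type", "event_location", "attendees"]

def validate_new_batch(batch: pd.DataFrame) -> pd.DataFrame:
    """Reject batches of live data that would silently degrade the model."""
    missing = set(REQUIRED_COLUMNS) - set(batch.columns)
    if missing:
        raise ValueError(f"New data is missing columns: {missing}")
    if batch[REQUIRED_COLUMNS].isna().mean().max() > 0.10:
        raise ValueError("A required column has more than 10% missing values")
    return batch

def refresh_history(history_path: str, batch: pd.DataFrame) -> pd.DataFrame:
    """Append a validated batch to the archived history used for retraining."""
    history = pd.read_csv(history_path, parse_dates=["event_date"])
    updated = pd.concat([history, validate_new_batch(batch)], ignore_index=True)
    updated.to_csv(history_path, index=False)
    return updated
```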

BONUS: A 4-step process to increase your success rate when leveraging data

1. Clarify your objectives and make sure you have the right data

2. Gather the data into a single place through an auditable process. Otherwise, your project might end up unmaintainable or impossible to update.

3. Bring business people and data people together as often as possible to provide context to each other. Data people often know how to bring structure to raw data that business people have difficulty seeing value in. Likewise, business people have key domain-specific insights about characteristics of the data that a data person might never discover independently.

4. Invest in educating your employees about machine learning and AI. The more they know about how to use the free and open-source tools available, the more empowered they will be to bring advanced analytics to the workflows and processes they are in charge of, and the best insights and ideas can begin to bubble to the top!

If you liked what you read, please give me a clap below so that I know what topics to write about next! You can also connect with me on LinkedIn here. Always happy to chat!

John is currently an AI Engineer in AI Singapore’s AIAP program. He started his career as an investment banker at Barclays in NYC, before heading the corporate development team at a fast-growing biotech startup in Singapore. He enjoys bridging between technical and non-technical roles. All opinions here are his own.
