AI — Game Changer or Money Waster

How to Avoid “Crayon Eating” in Your AI Program

--

by Chris Herrera

To be completely clear, which of those two outcomes you get is entirely up to you.

Artificial intelligence is truly the next horizon in computing. By adopting AI, an organization or person is simply delegating decision making to an algorithm. Thanks to recent advancements in machine learning, not just in algorithm development but also in the frameworks that make this technology more accessible than ever before, the AI hype has reached a fever pitch. But before you go out and buy the latest copy of “AI for Dummies” or the newest turnkey AI widget (spoiler alert: this is not a real thing), there are a few things that you should be prepared for.

The Path to AI Is Paved with Good Data Practices

Much like a person, an algorithm will make bad decisions when fed with incorrect, incomplete, or generally bad data.

Now, what data, you might be asking…

No, this is not the often-heard “let’s-store-everything-and-figure-it-out-later” data store (we will call this the Garbage Can Database, or GCD). Yes, at one point, storing everything was probably a neat experiment; however, times have changed and it’s no longer enough to say “I have 12 petabytes of data in my GCD!”

Chances are your data collection processes are happening inside a company, and both the technology that runs those processes and the people who operate them are funded by an entity. That entity is probably in the business of making money or trying to effect some change…

Step 1 — Ask Yourself a Critical Question

So this is a good first step: What is it that we are doing here, and why are we doing it?

  • Do we develop life saving drugs?
  • Are we in charge of maintaining a fleet of aircraft?
  • Do we make a fancy doo-dad that wakes people up with soothing chants?

I’m willing to bet all the donuts in my pocket that there is SOMETHING that your stakeholders are working towards.

So let’s break this down:

We have a business process; let’s call it process A. Process A has 5 steps: {1, 2, 3, 4, 5}. At step 3 of process A we need to know the value of some metric (let’s call it M) and compare it against a baseline value (let’s call it B). So: is M greater than or less than B?
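To make that concrete, here is a minimal sketch in Python of what that decision point at step 3 might look like. The function name and the example values of M and B are hypothetical, purely for illustration.

    # A minimal sketch of "process A" from above: step 3 is gated by comparing
    # a metric M against a baseline B. Names and values are hypothetical.

    def step_3_decision(metric_m: float, baseline_b: float) -> str:
        """Return the branch process A should take at step 3."""
        if metric_m > baseline_b:
            return "metric above baseline - continue as planned"
        elif metric_m < baseline_b:
            return "metric below baseline - trigger corrective action"
        return "metric equals baseline - no change"

    # Example: M is the measured value, B is the baseline we compare against.
    print(step_3_decision(metric_m=42.0, baseline_b=37.5))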

Voilà! Magnifique! You are done with step one of this journey to AI!

Organizational Maturity Is Critical

The truth of the matter is that distilling a problem down to a very fine point takes a decent level of organizational maturity. Generally, any decent organization can find a group of people who can come up with a general understanding of a process, but the organizations that are truly going to harness the power of AI are the ones that can lay out their key processes just as we did above.

Great, as the saying goes…

Step 1: Define the Process

Step 3: PROFIT!!!!

Wait a second, we missed something….

Step 2 — What is My Baseline?

How do I know what B (from my fancy business process above) is, or more importantly what it should be….

Well, this is where machine learning (ML) comes into play… the necessary component of any good AI.

Much like a human, a neural network learns from data. What you use to teach it greatly affects the results. So let’s think about this: if I showed little Jimmy, a 2-year-old child, a red crayon, told him it was a hot dog, and repeated that over and over, he would eventually eat that red crayon thinking it’s a hot dog (assuming someone had already told him that hot dogs are food).

AI is no different: if you feed it incorrect, unclassified, or incomplete data, your AI algorithm will be sitting in the corner eating crayons (with little Jimmy).
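To illustrate the point (rather than prescribe a method), here is a small, hedged sketch using scikit-learn on synthetic data: the same model is trained once on correct labels and once with a portion of the training labels deliberately flipped, which is the code equivalent of telling Jimmy the red crayon is a hot dog. The data, the noise rate, and the model choice are all assumptions made only for this example.

    # Illustration of "crayon eating": identical models, one trained on clean
    # labels and one on partially mislabeled data. Synthetic data only.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Model trained on clean labels
    clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Flip 30% of the training labels to simulate bad, mislabeled data
    rng = np.random.default_rng(0)
    noisy = y_train.copy()
    flip = rng.random(len(noisy)) < 0.30
    noisy[flip] = 1 - noisy[flip]
    noisy_model = LogisticRegression(max_iter=1000).fit(X_train, noisy)

    print("accuracy with clean labels :", clean_model.score(X_test, y_test))
    print("accuracy with noisy labels :", noisy_model.score(X_test, y_test))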

How Do I Avoid “Eating Crayons” with AI Algorithms?

So how can we manage this? Well, investing in understanding the data flows, and more importantly understanding what data matters to specific PROCESSES, is the difference between harnessing the power of ML and AI and just tinkering around the edges.

Now, you might be saying, “Well, my good old Garbage Can Database has all this information.” That might or might not be true. Generally, data collection without purpose leads to an often-encountered, nasty situation called unexplained variance.

In database terms: you have a lot of rows, but you have missed a few columns. Again, I can’t stress this enough: understand the process that drives the data collection, as it will focus the effort and increase the chance of success.

Alright! We are at the PROFIT step!!!!

Almost (not really) but super close (sorry, not really).

Now that you have your data, that data still needs to be cleansed (or wrangled, as the cool kids say). You likely have missing values, values that are out of range, or bias in interpreted data (we explore this in detail in the post entitled “Discovering the Keys to Solving for Data Quality Analysis in Streaming Time Series Datasets”). This is the portion of the process that really eats up time. Estimates put “wrangling” at 70–95% of the total analysis time.
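As a hedged illustration of what that wrangling step can look like, here is a small pandas sketch; the column names, valid ranges, and interpolation strategy are all assumptions made purely for the example.

    # A sketch of basic "wrangling": flag out-of-range readings as missing,
    # then fill gaps. Column names and ranges are hypothetical.
    import numpy as np
    import pandas as pd

    raw = pd.DataFrame({
        "sensor_temp_c": [21.5, np.nan, 22.1, 400.0, 21.9],  # 400.0 is out of range
        "flow_rate":     [10.2, 10.4, np.nan, 10.1, 10.3],
    })

    cleaned = raw.copy()

    # Treat physically implausible temperatures as missing
    cleaned.loc[~cleaned["sensor_temp_c"].between(-40, 60), "sensor_temp_c"] = np.nan

    # Fill remaining gaps with simple interpolation (one of many possible strategies)
    cleaned = cleaned.interpolate(limit_direction="both")

    print(cleaned)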

Now, if the process is understood (meaning that the underlying data is also understood), we can focus on cleaning up that data during a Transformation/ETL phase so that the analysts can avoid the bad data altogether.

Another little factoid about the majority of ML algorithms is that the training set generally works best when you have roughly an even number of samples for each identified state. What this means is that if I have a system that is normally on, but I also have samples from the off, turning-on, and turning-off states, and my data availability spread is {80% on, 5% turning off, 5% turning on, and 10% off}, then I will need to downsample my “on” state (by random sampling, for example) to bring the classes into line.
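Here is one possible sketch of that rebalancing step in pandas, with hypothetical state names and proportions matching the example above; downsampling every state to the size of the rarest one is just one of several rebalancing strategies.

    # Downsample the dominant "on" state so the training set is roughly
    # balanced across states. State names and counts are hypothetical.
    import pandas as pd

    df = pd.DataFrame({
        "state": (["on"] * 800 + ["off"] * 100
                  + ["turning_on"] * 50 + ["turning_off"] * 50),
        # ... feature columns would go here ...
    })

    # Sample each state down to the size of the rarest state
    min_count = df["state"].value_counts().min()
    balanced = df.groupby("state").sample(n=min_count, random_state=0)

    print(balanced["state"].value_counts())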

You Must Understand the Process AND the Data

Understanding the process is half the problem; you really need to understand the process AND the data. Performing the base level of statistical cleansing on your data set, understanding the metric that you are trying to optimize, and understanding the features that influence that metric: THAT is AI/ML nirvana.
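As a final hedged sketch, here is one way to start quantifying which features influence the metric you care about. The synthetic features, the hypothetical metric, and the choice of a random forest are all assumptions for illustration only.

    # Estimate which features drive a metric of interest using a random forest.
    # Entirely synthetic data; feature and metric names are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    features = pd.DataFrame({
        "pressure":    rng.normal(size=500),
        "temperature": rng.normal(size=500),
        "vibration":   rng.normal(size=500),
    })
    # Hypothetical metric M, driven mostly by pressure and vibration
    metric_m = 3 * features["pressure"] + features["vibration"] + rng.normal(scale=0.1, size=500)

    model = RandomForestRegressor(n_estimators=200, random_state=0).fit(features, metric_m)
    for name, importance in zip(features.columns, model.feature_importances_):
        print(f"{name}: {importance:.2f}")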

I wholeheartedly believe that an organization with a sound data strategy built on core business drivers can leverage AI and ML. Again, at the end of the day, when leveraging AI/ML you are essentially saying: I am prepared and confident that I can delegate a particular decision-making process to my software, and in doing so I will improve my process/investment/output/you name the metric.

An Actively Curated Data Lake Is Critical

There is room, and dare I say a necessity, to build a curated data lake (not the GCD I spoke of before) when you are starting on this journey. However, my challenge to you is this: don’t let your lake be the end of your journey. Maintain it, care for it, feed it with proper business guidance and direction, and it will take you to all the shores it has to offer.

AI Should Enable Better Business Outcomes

Build your processes, strive to understand the data, and build the models. Technology should enable business, and AI can enable faster and more repeatable decision making, but choose your battles, and understand the investment.

Remember AI is a journey, but the end of that journey is better business outcomes, not another shiny object locked away in your data center.

At Hashmap Inc, we understand data and more importantly we understand that data, even the same data, is different for every industry. From ingest, to analysis, curation, utilization and consumption — we can help you on this journey.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap at https://medium.com/hashmapinc.

Chris Herrera is a Senior Enterprise Architect at Hashmap working across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers. You can follow Chris on Twitter @cherrera2001, connect with him on LinkedIn at linkedin.com/in/cherrera2001, or catch his weekly IoT on Tap podcast.
