David "Gonzo" Gonzalez
Jun 22, 2016 · 22 min read
Go ahead. Take the LEAP!

LEAP into Data Science!

In this article I explain why the common approach to bringing data science and advanced analytics in-house is wrong, and what companies and software engineers can do to actually get traction with advanced analytics.

Do you remember the very first web application you ever used? Not installed software, which may have been a videogame or a productivity application, but the very first time you went to a dynamic website and interacted with it?

If you’re not very old, it might have been Facebook, Instagram, or Snapchat. And if you’re a bit older, you might have been posting on Slashdot or using AOL Instant Messenger.

The further back you go, the less it felt like the rich web and mobile applications we use today. Even when AOL was the premier service, every tech company had this in common: the developers building those apps built them around four very simple operations.

Create, Read, Update, Delete

Modern frameworks often just reference these operations as C.R.U.D., pronounced like a Puritan swear: “Ah, crud!”

The beauty of these four operations is that they explain almost everything one could ever want to do with data on the web and almost everything one could ever want to do with data, period. At least they did.
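If you want to see just how simple those four operations are, here is a minimal sketch using Python’s built-in sqlite3 module and an in-memory database; the table and values are made up purely for illustration.

```python
# A minimal CRUD sketch: one table, one row, all four operations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")

conn.execute("INSERT INTO posts (body) VALUES (?)", ("Hello, web!",))        # Create
rows = conn.execute("SELECT id, body FROM posts").fetchall()                 # Read
conn.execute("UPDATE posts SET body = ? WHERE id = ?", ("Hello again!", 1))  # Update
conn.execute("DELETE FROM posts WHERE id = ?", (1,))                         # Delete
```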

That’s all changed.

Analytic Behemoths

Now we don’t want to just build simple CRUD applications, we want to do much more. We want the sites and services we use to give us ever more relevant information. Originally, the challenge was just finding the right “static” sites through search. But soon that also meant we expected these web applications to understand patterns in the data that had been Created and Updated within the application itself. The first movers in the space beyond CRUD created advertising, retail, and entertainment empires far better than those that came before them. Their applications have something other applications just don’t have. Somehow they feel smarter.

Facebook, Amazon, Netflix, and Google (FANG)

  • Facebook was built off of posts (Create and Update operations) but figured out how to monetize with advertisers when their feed (Read operation level: Addicting) became dynamic and personalized
  • Amazon created the first retail experience with unlimited selection AND a “boutique” experience by recommending things you’d probably want to buy on every page (Read operation level: Spooky)
  • Netflix most recently created the first worthy HBO competitor by making their own shows (Create operation level: Chill) based on ideas they “knew” would work because they’d been monitoring how people respond to their recommendations
  • Google has been successful the longest by generating relevant and personalized search results (Read operation level: Invasion of privacy) and interleaving “organic” results with advertiser-sponsored results

They and all their ilk are smarter. But not just them. There are also sports teams, casinos, credit card companies, etc., all outperforming their competition two-to-one on average by doing something with their data that their competitors aren’t. What they are doing goes by many names:

  • Data Science
  • Machine Learning
  • Artificial Intelligence
  • Advanced Analytics
  • Big Data
  • Data Mining
  • And more

But no matter what you call it, it’s something more than CRUD.

This means that for the apps you build, you should strive to bake smarts in from the beginning and backfill smarts into your existing apps. Mere CRUD functionality should be considered a competitive liability. And the services you consume should have smarts built in and make full use of the data you’ve collected.

This knowledge gap between the FANG brotherhood and everyone else has bothered me for several years. Yes, they hire very smart data analysts but their real competitive advantage is their data. Other groups, both big and small, also generate data but they haven’t been as thoughtful about using it.

Further, I think hardware engineers, software engineers, and product owners/managers are some of the most brilliant people I know. They have built all the stuff we’d gladly keep even if it meant giving up indoor plumbing!

Maybe I’m being overly optimistic and/or just naive; but I think a huge part of the problem facing everyone else who is building things is that they don’t have a paradigm for thinking about how to improve what they do. CRUD has served the software world amazingly well and it will continue to be the mental model all applications are built on, regardless of whether it is a desktop, mobile, or VR application. Where I might be naive is in thinking that these bright folks who make great things don’t need a silver bullet in the form of a person or even in the form of a skillset. They just need a more useful mental model.

Right now, the silver bullet is to hire one of these elusive data scientists we hear so much about. It makes me cringe every time I hear a startup or established company reduced to saying all they need to do is hire a rock-star data scientist. It makes me sad every time an already competent software engineer wonders if they should go do a data science “bootcamp” to stay competitive.

What is it about this new world of data that has people talking crazy talk?

Companies, a pretty good rule of thumb is that if you can’t write a reasonable job description, you probably can’t support, onboard, or feed a person who “fits” that job description. The truth for you is:

  1. There are very few rock star data scientists who can actually build you anything.
  2. Those that can aren’t going to work for you.
  3. Your competitive advantage is not hiring someone to parameterize an algorithm for you. It’s your F***ing data!

A part of you has to know this. Don’t you suspect that the recent physics/chemistry/math/CS PhD, bootcamp grad, or tenured-professor-turned-data-scientist isn’t likely to hit it out of the park for you, just like your junior developers, inexperienced managers, and mid-career transition professionals aren’t going to be hitting many home runs for you? To paraphrase a saying in tech: “No one ever got fired/lost an investor for hiring a data scientist” — but they probably should have been.

Software engineers, becoming a full-stack dev with a half-dozen languages under your belt is an awesome and insanely great skillset to go after. 99.9997% of you just use a database that gets the job done (e.g. MySQL, PostgreSQL, SQLite, SQL Server, or MongoDB) or one that someone forces you to use to get the job done (e.g. Oracle, DB2, etc.). If I told you you needed to learn how to build a database from scratch, you’d probably call me a not very nice name. If I said you should become a certified Oracle DBA, most of you would still swear at me. You don’t need to do either of those things to make full use of a database. So why do you think you need to learn a craft like “data science” to be better? It’s not a language and it’s not a framework; so you probably don’t need to learn it to make use of it.

A few sharp words about IP

But what about IP!?

Investors and board members, let me say this very plainly for your benefit:

Hiring someone to use their toolkit of choice (e.g. Weka, Python, R, Spark ML, Dato, Alpine, Amazon, Google, or, God forbid, Azure ML) or to write a “proprietary” implementation of pre-existing ideas (e.g. a recommender, a classifier, a stock predictor) only feels like IP. Maybe it is. But just barely; try and defend it. Coding up a white paper is not IP. It’s probably just a trade secret. At best you can copyright your code like everyone else.

Look, almost any doofus with enough R skills — R is an esoteric scripting language for statistics — to be dangerous, and access to Stack Overflow, can create a predictive model for you. You might feel a warm feeling when you hire that person and an even warmer feeling when you see their creation predict something. But it’s most likely brittle, non-scalable crap that favors the one algo they know over the algos they don’t, it probably took too long to build, and it will end up performing poorly in the real world. And that meat bag is very unlikely to be your key advantage unless you hired them from a university in Toronto, Canada. Even then, someone disrupting your space probably hired their equal or better from the same university, or maybe a third guy from that same university decided to start a non-profit and open-source the whole kit and caboodle. D’oh!

Your competitive advantage, and potentially the basis of any durable IP, revolves almost entirely around your data and how you managed to generate it.

One more time in all caps:

IT’S ALL ABOUT THE F***ING DATA!

So should you just give up? Of course not. But what can you do?

LEAP!

Make the LEAP!

The reality is that you’re probably not going to do data science yourself, and you are unlikely to be able to source talent for it. But you probably already have a lot of the skills needed within your organization. I’ve created the LEAP mental model as a way of hopefully helping organizations understand what they can and should do to make full use of advanced analytics.

Here’s what you can do first. Memorize this simple acronym: L.E.A.P.

Label, Explore, Analyze, Predict

And when it comes to data science, advanced analytics, etc., these are the crucial steps you need to remember. LEAP is the “CRUD” of advanced analytics. Are you, personally, going to be able to do all of them? No. And why should you? Do you personally write all the applications in your company by yourself, do all the sales, and all the reporting? You do? Congratulations! Someone better at delegating and execution is going to put you out of business.

Everyone else, except for the one-man-and-a-briefcase dot-com guy, keep reading.

I mean it, Briefcase Guy, you’re not invited. You’re not going to be your own data scientist no matter how you update your LinkedIn profile.

Label

This journey starts with data.

“Can you predict the weather?”

Well, do you have weather data?

“No.”

Then no. No, I can’t.

“Can you build me an XYZ engine that totally increases/decreases XYZ automatically?”

Do you have quite a bit of data about XYZ?

“No.”

Then again, no. Sorry, but the answer is no.

Sorry, Arthur C. Clarke

Machine Learning and Artificial Intelligence are not magic and neither are data scientists, clever quote memes notwithstanding. You may have spoken with a data scientist and she made your head hurt. High IQs aside, all we do is use computers to find patterns in data. So, the data is kind of crucial.

I’m not going to make you learn machine learning jargon. In fact I’m going to ask that you forward this to your smartest data-head colleagues and ask them, diplomatically but forcefully, that they adopt LEAP jargon when they talk to you about what they are doing or proposing — with one exception. I am going to ask that you learn one word they’ve probably said: Label.

What’s a label? It’s exactly what you think it would be.

Say you’ve asked someone to meet you for coffee and you’ve never seen a picture of this person (and I guess pretend it’s also the 1980s), and you describe yourself to her over the phone or in a personal ad (remember, it’s the ’80s). What would you say? What features are most prominent and least likely to be found in the general population that will identify you? If it were me I’d say: “I’ll be the bald man in square-rimmed glasses with a close-cropped beard, light brown skin, wearing a graphic tee, flip-flops, shorts, and sitting somewhere near a wall.” When my phone friend arrives she will look around and in her brain think:

“Stranger, Stranger, Stranger, Gonzo, Stranger. Go back! That was him.”

Based on only a few features of my appearance she was able to label me appropriately: Gonzo.

Features are just data. But don’t let your brainy data scientist say “features.” He probably will because it sounds cool and data science-y. Just raise an eyebrow and remind him we are going to say “data” or maybe “data points” — he will squirm for a minute but he’ll get over it. Labels are whatever those data describe. In the above story there are just two labels: Stranger and Gonzo.
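If it helps to see that as data, here is a minimal sketch of the coffee-shop story as rows and columns, assuming pandas; the columns are made up from the description above. Each row is one person described by a few data points, and the last column is the label.

```python
# Data points vs. labels: "label" is what we want a computer to learn to assign.
import pandas as pd

people = pd.DataFrame([
    {"bald": True,  "square_rimmed_glasses": True,  "flip_flops": True,  "label": "Gonzo"},
    {"bald": False, "square_rimmed_glasses": False, "flip_flops": False, "label": "Stranger"},
    {"bald": True,  "square_rimmed_glasses": False, "flip_flops": False, "label": "Stranger"},
])
print(people)
```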

You can have a wide variety of labels:

  • Spam, Ham, Forum post, Social feed, etc.
  • Dates
  • Prices
  • Sentiment (e.g. positive vs. negative)
  • Fraudulent vs. truthful applications
  • Items in an image or video
  • Words in a recorded phone call
  • Etc.

Really, the sky’s the limit. So when I say that data is the only thing that matters, let me clarify: labels are the data that matter. If you have a bunch of data that describes people in a coffee shop but no names, it’s pretty useless for most applications except for a specific use case I’ll describe in the section on analysis.

What if you don’t have labels?

Well, there’s something you can do about that. Are you a domain expert on the thing you wish you could teach a computer to do? Perfect! You are probably one of the most qualified labelers around. Remember, data is important but labels are required. They are the secret sauce in any predictive solution. There are also generalized and specialized services that can help you label data if you don’t have time or you have too much data and don’t want to be a full-time labeler for the rest of your life.

Labeling options

  • You and your colleagues — conveniently divides the workload and is probably the way to go if you only have a few hundred things total.
  • Services — Mechanical Turk, CrowdFlower, OpenSpace (née CloudCrowd), and CrowdSource outsource the workload and can process up to millions of labeling tasks in a reasonable period of time using humans.
  • Business rules!

How much labeled data do you need? Generally, the more the better. But if you can get even just a hundred things labeled you will have a good start. Don’t forget to start with what you already know. Maybe you already have some rules of thumb you can use to label data you already have. Better yet, maybe those rules of thumb are already labeling data for you via some process already functioning in your business. For example, if you have a process for sorting things like orders, applications, or purchases, start by using that data for your labels or begin storing it somewhere so that you can soon use it as labels.
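For the software engineers reading along, here is a minimal sketch of turning a rule of thumb into labels, assuming pandas and a hypothetical orders.csv with an “amount” column; the file name, column, and threshold are made up, so swap in whatever business rule you already use.

```python
# Labeling existing data with a business rule.
import pandas as pd

orders = pd.read_csv("orders.csv")

# Hypothetical rule of thumb: orders over $500 go to manual review.
orders["label"] = (orders["amount"] > 500).map({True: "review", False: "auto-approve"})

orders.to_csv("orders_labeled.csv", index=False)
print(orders["label"].value_counts())
```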

You don’t have to be able to create this to be useful.

Explore

Exploring data is an incredibly important step. Usually this involves charts or “visualizations” of your data and transformations of your data like what you might do with an Excel pivot table. If you are even semi-handy with Excel, you can begin this process on your own. And if you’re already relying on the data you’re thinking about using for day-to-day operations, then you probably already have some “exploratory analysis” completed — just take another look at the reports you’re already running.

Sometimes data visualizations, like Facebook’s map of interconnected members (see above), can be insightful (e.g. pythons and polar bears aren’t on Facebook) and not just pretty. However, you can often get away with simple bar charts and even pie charts to find trends — just don’t tell your data scientist friend you used a pie chart, because they have a professional obligation to hate pie charts.

With these exploratory efforts, two elements are most important:

  • The first is looking for trends in your data (spreadsheets) that relate to your label (a single column in Excel, likely).
  • The second is asking basic questions while trying to answer them visually with your data.

Example exploratory questions you can answer by yourself or with the help of your favorite Excel guru or BI analyst

  • How (many, often, frequently) does what I’m interested in happen within a period of time? Does it vary by day/month/season? (Try a bar chart using dates.)
  • How often does X happen when Y happens? (Try a pivot table.)
  • Which level/category/amount of X is most frequent? (Try a pivot table + a bar chart!)
  • Can X (a number like a price or count) be carved up into meaningful categories (i.e. new labels) that explain even better the questions above? (Try using an “if” statement to carve up X into groups like “income brackets,” “heavy users,” or “top spenders”; see the sketch after this list.)
  • Other than X, what other columns of data are you interested in predicting? Can you find some trends that relate to that?
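If you’d rather explore in code than in Excel, here is a minimal pandas sketch of a few of the questions above, assuming a hypothetical sales.csv with “date”, “region”, and “amount” columns; the file and column names are illustrative only.

```python
# Exploratory counts, a pivot table, and carving a number into categories.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["date"])

# How often does it happen per month? (the data behind a bar chart)
print(sales.groupby(sales["date"].dt.month)["amount"].count())

# Pivot table: how does amount vary by region and month?
print(pd.pivot_table(sales, values="amount",
                     index=sales["date"].dt.month,
                     columns="region", aggfunc="sum"))

# Carve a number into meaningful categories (i.e. new labels).
sales["spend_bracket"] = pd.cut(sales["amount"],
                                bins=[0, 50, 200, float("inf")],
                                labels=["light", "medium", "heavy"])
print(sales["spend_bracket"].value_counts())
```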

If you’re really eager to get started, don’t feel overly competent in Excel, and have access to Google Sheets, you’re in luck. Google has baked data exploration into their Sheets product and it takes zero effort to get started. You just click on the “explore” icon in the lower right-hand corner (as of June 2016) and voila — instant data exploration with plain-language explanations of trends in your data. It’s pretty slick. Give it a try on these Tuberculosis datasets in this shared Sheet.

To get this information and more you just need to upload your data to Google Sheets!

Things you probably can’t explore

The area you likely won’t be able to explore on your own relates to “unstructured” data. That is, data that does not fit comfortably into a rows-and-columns configuration. Examples are images, video, audio, and text.

The one item in this list on the border of doability is probably text. If you get clever with Excel “match” statements or some regular expressions, you can probably pull a word or words out of a bunch of text that might be really useful as stand-alone data points (e.g. error codes in logs, browser agent in web logs, product names in reviews, etc.).
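Here is a minimal sketch of that idea in Python, assuming a hypothetical app.log file; the error-code pattern is made up, so adjust the regular expression to match whatever your own logs or reviews contain.

```python
# Pull stand-alone data points (error codes) out of unstructured log text.
import re

pattern = re.compile(r"ERR-\d{4}")  # e.g. "ERR-0042"; purely illustrative

counts = {}
with open("app.log") as f:
    for line in f:
        for code in pattern.findall(line):
            counts[code] = counts.get(code, 0) + 1

print(counts)  # each extracted code is now a column-worthy data point you can explore or label
```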

But remember, without labels most data scientists are going to be out of luck when it comes to exploratory analysis as well. So if you want to use or need to use unstructured data, start labeling! Without those labels there’s not much a data science solution, human or otherwise, can do.

Analyze

There are several jargon-laden terms for what amounts to analyzing data for patterns. You may have heard some of them:

  • Train a model
  • Machine learning algorithm
  • Naive Bayes classifier, random forest, SVM, etc.
  • Deep learning (convolutional neural nets, deep neural nets/dnn, etc.)

Remember, this is all just pattern recognition. So whether a data scientist is using battle-tested regression (like what you learned in college), or one of the latest innovations in neural nets, they are just configuring a computer to analyze data in hopes of creating something that can accurately Label new rows of data.

The reality is that while you are probably a world-class labeler, you probably don’t want to learn how to make computers do analysis. Right now there is a huge demand for computer analyzers (e.g. data scientists) just like there is/was a huge demand for computer programmers/software developers. But don’t despair. Quite a few teams are working on making data scientists irrelevant. Unlike software development, which is akin to writing a novel and therefore pretty difficult to automate, machine learning is, by comparison, much more automatable. Data science is in a category of complexity similar to games like chess or Go, both of which are now done better by machines than by humans. You probably can’t wait five to ten years to get started; but if you can, you’ll probably be able to “hire” a data scientist robot for about the cost of an Excel license.

So if you’re not going to do analysis beyond maybe dropping a regression line on a bar chart in Excel, what should you think about when it comes to analysis?

Two things:

  • Problem type
  • Acceptable accuracy

Problem types

The more you can zero in on the type of problem you need to analyze, the quicker you’ll get to something usable and valuable. While there’s almost an infinity of nuance below each of the following categories, you don’t need it to steer analysis in the right direction. Once you have labels in hand and a decent case, supported by a trend or two you can demonstrate with your exploratory efforts, choosing one of the categories below will probably be super simple.

  • Classification — figure out a pattern that identifies a discrete category of things e.g. “things/movies/news you might like”, “man or woman’s face”, etc.
  • Regression — figure out a pattern that identifies a continuous (i.e. zero to infinity and all the decimals along the way) relationship between things e.g. “how much will this stock sell for,” “how much will this customer spend,” “how much should this person pay for insurance,” “how many items will we sell next quarter,” etc.
  • Anomaly Detection — figure out a pattern to recognize outliers in a sea of things that are “ok,” e.g. fraud detection, intrusion detection, system monitoring, etc.
  • Clustering — figure out how data relates to other data without Labels. This is often a great way to come up with some Labels you might want to predict, e.g. “bald males between the ages of 35 and 65 who frequent coffee shops” or “Hispanic females between the ages of 18 and 35 who live in the Midwest.” This is what I hinted at above when I said you can do analysis without labels.
  • Information Extraction/Artificial Intelligence — a special kind of applied machine learning that usually relies on pre-built analytic models, e.g. “what’s in this image,” speech-to-text, etc.
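To make the first two categories concrete, here is a minimal scikit-learn sketch, assuming you already have labeled rows in a hypothetical customers.csv; the file and column names are illustrative only, not a prescription.

```python
# Classification (discrete label) vs. regression (continuous number).
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")
X = df[["visits", "avg_basket", "tenure_months"]]   # the data points ("features")

# Classification: will this customer churn? (a discrete label)
Xtr, Xte, ytr, yte = train_test_split(X, df["churned"], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("classification accuracy:", clf.score(Xte, yte))

# Regression: how much will this customer spend next year? (a continuous number)
Xtr, Xte, ytr, yte = train_test_split(X, df["next_year_spend"], random_state=0)
reg = LinearRegression().fit(Xtr, ytr)
print("regression R squared:", reg.score(Xte, yte))
```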

So, while you won’t likely perform the analysis, if you have good Label data and a couple ideas about what would be useful to predict from your data exploration, you will be able to effectively manage finding someone to help do the analysis or pick a point-solution or generalized platform that you or your team can use to outsource the analysis.

Outsourcing analysis

  • Point-solutions — often SaaS providers who specialize in analyzing a specific kind of data for a specific kind of problem, e.g. analyze Google Analytics data to identify click fraud, analyze Salesforce data to identify the most qualified leads, analyze video interviews to find the best hires, etc.
  • Consultants — freelance or agency consulting will often help you do everything they can get paid to do, but if you do the leg work of identifying a clear problem or set of problems you’d like to solve and have labeled data ready to go and some initial exploration that demonstrates interesting trends, their efforts will pay dividends much faster and the bill will be much smaller; win-win… for you.
  • Generalized prediction — platforms like Amazon, Google, Microsoft, and IBM; service providers like DataRobot and Big Squid; or software like Chorus 6 and Dato that ultimately results in a hosted model (deployed in their cloud or sometimes in your datacenter).

Unfortunately, none of the above solutions are super-duper simple or cheap. Analysis is currently, and is likely to remain, a bottleneck because of the lack of talent, the lack of suppliers, and the cost of the underlying computing power needed to do the analysis. However, the more you understand your problem, get your labels in order, and confirm there’s something worth looking into from your data exploration, the cheaper and faster you’ll be able to clear the analysis hurdle.

If you’re a software engineer and can do the Label and Explore steps on your own, don’t want to wait to get started, and are handy at integrating with often picky and obtuse APIs, I can make two recommendations:

  1. Amazon Machine Learning — it has decent documentation and for problem types that it supports you just need to tell the API which data is your Label (Y) and which are your features (Xs).
  2. Google AI — if you want to incorporate speech, transcription, or image classification the Google APIs are better than most.
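As an example of the second option, here is a minimal sketch of calling Google’s Vision API for label detection over plain REST, assuming the requests library and an API key; the image file name is made up for illustration.

```python
# Label detection via the Google Cloud Vision REST endpoint.
import base64
import requests

API_KEY = "your-api-key"  # from the Google Cloud console
with open("storefront.jpg", "rb") as f:
    content = base64.b64encode(f.read()).decode("utf-8")

body = {"requests": [{
    "image": {"content": content},
    "features": [{"type": "LABEL_DETECTION", "maxResults": 5}],
}]}

resp = requests.post("https://vision.googleapis.com/v1/images:annotate",
                     params={"key": API_KEY}, json=body)
for label in resp.json()["responses"][0]["labelAnnotations"]:
    print(label["description"], label["score"])
```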

Acceptable accuracy

The second thing you should think carefully about when it comes to analysis is just how accurate your analysis needs to be to be useful. Accuracy is often measured with a statistic (e.g. F1 score, R squared, AUC — Area Under the Curve, Precision, and Recall) that spans the range from 0 to 1. So if you get back a statistic of 0.51 it would mean the analysis is 51% accurate. How accurate do you want the analysis to be?

The knee-jerk reaction is to say it needs to be 100% accurate. Or, if you’re one of those enthusiastic sports fans who talks primarily in metaphors, then you might think it needs to be 110% accurate.

The reality is you actually don’t want an analysis that is 100% accurate. If a solution provider tells you that they have created something that is 100% accurate, then they likely wasted your time and your money. Just like people, computers make mistakes. When computers don’t make mistakes in an analysis it’s likely your data science practitioner has made a careless mistake and “overfit” the algorithm to the data. That is, they created a model with their analysis that works really well on their little walled garden on their laptop but the moment it sees new data in the wild it will fall flat.
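You can see the effect with a tiny experiment. Here is a minimal scikit-learn sketch on random, made-up data, where a depth-unlimited decision tree memorizes noisy training data perfectly but does much worse on rows it has never seen.

```python
# Overfitting in miniature: perfect on the training data, mediocre on new data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)                                                # 5 columns of made-up data points
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)   # a noisy label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

print("accuracy on data it trained on:", model.score(X_train, y_train))  # ~1.0
print("accuracy on new data:          ", model.score(X_test, y_test))    # much lower
```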

Plain English levels of accuracy

  • Sorting efficiency — You may not care if the top item in a list is “the best thing.” It might be enough that everything in the top ten sorted/recommended items is better than the next ten items. When this is the case, relatively low-accuracy analysis might be super powerful. Don’t obsess about accuracy when it doesn’t really matter.
  • Better than a coin toss — 51% accurate is actually good enough to be amazing at the stock market, sports betting, and other very difficult to predict things. If you have more than two outcomes your baseline will be lower than 51% but still just as meaningful. For example, if you can consistently guess the roll of a six-sided die better than 17% of the time, you should go to Vegas and rake in the winnings.
  • Robust — You’re more sensitive to the wrong kinds of mistakes or something that is inconsistent than purely chasing accuracy measures alone. See False Negatives, False Positives, and measuring mistakes below.
  • Super-human — This, naturally, requires that you set a human benchmark through some kind of test. Currently several AI solutions like playing chess, Go, and identifying things in images are all performing at super-human accuracy levels.

False Negatives, False Positives, and measuring mistakes

The only other aspect of accuracy you need to evaluate has to do with the kinds of mistakes an algorithm will make when analyzing your data. A false positive means the algorithm thought something was the thing of interest when it actually wasn’t. A false negative is just the reverse. It is conceivable that one analysis with one algorithm could have significantly higher “overall” accuracy than another, but all that gain in accuracy might tank either Precision (a measure of false positives) or Recall (a measure of false negatives) to an alarming level.
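Here is the arithmetic in miniature, with made-up counts from a hypothetical fraud model, showing how each kind of mistake drags down a different measure.

```python
# Precision is hurt by false positives; recall is hurt by false negatives.
true_positives = 80    # flagged as fraud, actually fraud
false_positives = 40   # flagged as fraud, actually truthful (annoyed customers)
false_negatives = 20   # missed fraud (potential fines)

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"precision: {precision:.2f}")  # 0.67
print(f"recall:    {recall:.2f}")     # 0.80
```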

For some business objectives it doesn’t really matter which kind of mistake is made. But for others a False Negative is much worse than a False Positive, or vice versa, e.g. missing a fraudulent claim might mean millions of dollars in fines, but flagging a claim as fraud only to find out later it was truthful just pisses off a customer; still bad, but maybe not millions-of-dollars bad.

You should determine ahead of time what matters for your business objectives, and if certain mistakes are costly in terms of opportunity, liability, or compliance, then you will want to know specifically how an analysis performs with respect to the kinds of mistakes you’re most concerned about.

Predict

Our fourth, and last, step is prediction.

This is actually pretty simple as long as you either have a competent data scientist/data science team on tap or used a full-stack solution provider to do the Analysis step. However, if you made the mistake of hiring that physics major who only knows MATLAB, getting to prediction is likely to be very expensive in both time and money. So maybe skip that hire and let the physics major go do physics. But even your recent data science graduate is unlikely to be ready to go from zero to hero with a fully deployed prediction solution. She might know her way around R and making nice-looking charts, but the last hurdle is deploying that model so that it can be used to automate the machine-to-machine or human-in-the-loop processes you set out to automate in the first place.

If you insist on hiring a data scientist who doesn’t know how to deploy his model in a useful way, there are some solutions that help you get your investment in advanced analytics to the end of the row. But you might want to think about what you would pay a software developer who could only create things that functioned on their laptop, and factor that into your offer.

Prediction deployment solutions

  • Generalized Prediction — all providers (e.g. Google, Amazon, Microsoft, etc.) will have a solution to shift from Analysis mode to Prediction mode. All will host a prediction solution in their cloud and a few have an option for deploying a prediction solution to your datacenter — if you are unfortunate enough to work for an industry so behind the times that they can’t understand how Amazon Web Services and Amazon GovCloud are likely an order of magnitude better/safer/more reliable than your datacenter in the basement.
  • Push-button deploy — a growing number of solutions are being developed to allow your data science team to program in their language of choice, be it R, Python, Lua, Scala, etc. and deploy those scripts as services your software dev team can actually integrate with. A few providers you can google are: Domino Data Labs, DeployR, IBM, PredictionIO, etc.
  • Artificial Intelligence as a Service — pre-built models (some can be incrementally trained on your data to improve accuracy) that you integrate with so that you can add data from an AI model to your existing data e.g. Image classification and Speech to text. Take a look at Google’s offerings as a place to begin.

Once your solution is deployed you will be able to use it to make new predictions with a new row of data or a batch of new rows of data.

Row-wise prediction is great for machine-to-machine prediction but can also be extremely useful for real-time human-in-the-loop processes (e.g. qualifying a loan application, calculating the predicted contract value of an account, etc.). Batch prediction is often used for human-in-the-loop processes to prepare a bunch of predictions that can be used directly or loaded into software (e.g. lead scores for all the contacts in a new call list, predicted performance reviews on a list of job candidates, etc.).
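For the software engineers, here is a minimal sketch of what a row-wise prediction service can look like once a model exists, assuming Flask and a model saved with joblib; the file name and fields are hypothetical, and a batch job is just the same predict call looped over a file of rows.

```python
# A tiny row-wise prediction endpoint wrapping an already-trained model.
from flask import Flask, request, jsonify
from joblib import load

app = Flask(__name__)
model = load("churn_model.joblib")  # hypothetical model produced in the Analyze step

@app.route("/predict", methods=["POST"])
def predict():
    row = request.get_json()  # one new row of data, e.g. {"visits": 3, "spend": 42.5, "tenure_months": 7}
    features = [[row["visits"], row["spend"], row["tenure_months"]]]
    return jsonify({"prediction": model.predict(features).tolist()[0]})

if __name__ == "__main__":
    app.run(port=5000)
```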

No matter how you use prediction to increase automation or to just calibrate human decision making, when you do, you’ll be closer than ever to joining the elite group of truly data-driven companies that appropriately leverage data and analytics. While only 4% of companies can claim to be a part of this elite group, they are twice as likely to be among the top financial performers in their industry, three times more likely to hit their goals, and five times faster at making decisions. Wow! Sounds like a group worth joining.

That’s the LEAP framework. Let me know what you think of it.

For those of you who have zero desire to learn advanced analytics but really want to use it to improve your business, was this breakdown helpful?

For those of you who are doing or want to do data science, does the LEAP framework seem sufficient to be useful without skipping important steps or getting too bogged down in the weeds?

Please share this and let me know if it’s been helpful.