Data Science D-Mystified, Muppet Style
Welcome to Muppet Labs, where the future is being made today!
Bunsen and Beaker are the dynamic duo of science. They had the game. They had the threads. And they had mad flow… Well, Beaker did, but bald can be beautiful, too!
Below are five D’s for demystifying data science. These are key concepts to help you sort out the credible from the comic. I also added five links to some Muppet Labs skits to entertain and help you remember (just click on the pics).
Definition
Whether it is a KPI, a hypothesis, or a model input, the same rule applies: if you can’t define it, don’t trust it. Also be careful about taking supposedly simple things for granted. There are many ways to define an ‘average’: the mean, the median, and the mode can all differ on the same data. Anything that is ‘indexed’ should be subject to scrutiny, starting with what it is indexed against. And always question exactly what was tested! If you can’t easily explain it, it is unlikely to answer or predict anything.
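To see how slippery ‘average’ can be, here is a minimal sketch in Python. The order values are made up purely for illustration; the point is only that three legitimate definitions of ‘average’ give three different answers on the same data.

```python
from statistics import mean, median, mode

# Hypothetical order values in dollars (invented for illustration only).
order_values = [10, 10, 10, 20, 30, 40, 300]

# Three common definitions of "average" applied to the same data.
averages = {
    "mean": mean(order_values),      # sum divided by count; pulled up by the 300
    "median": median(order_values),  # middle value once sorted
    "mode": mode(order_values),      # most frequent value
}

for name, value in averages.items():
    print(f"{name}: {value}")
```

If a report quotes an ‘average order value’ without saying which of these it used, that is a definition worth asking about.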
Denominator
What is in? What is out? Denominators are often filled with inadvertent but dangerous nonsense. Always question anything that was left out, and don’t accept the word ‘outlier’ on its own as a satisfactory answer. This is even more critical if you are trying to size the impact of applying test results or models. You need to know: did they carve my population up like a Halloween jack-o’-lantern?
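A rough sketch of why the denominator matters when sizing impact. All counts below are hypothetical, and the variable names are mine, not from any real analysis. The tested rate only describes the customers who stayed in the denominator, so projecting it onto everyone quietly assumes the excluded customers behave the same way.

```python
# Hypothetical counts (invented for illustration only).
customers_total = 10_000
customers_excluded = 4_000   # dropped as "outliers", ineligible, and so on
conversions = 300            # observed among the customers who were kept

customers_tested = customers_total - customers_excluded

# The rate as typically reported: conversions over the carved-down population.
rate_reported = conversions / customers_tested

# How much of the real population that rate actually describes.
coverage = customers_tested / customers_total

# Naive sizing applies the reported rate to everyone, excluded or not.
naive_projection = rate_reported * customers_total

# The only conversions the data actually supports are the ones observed.
supported_by_data = conversions

print(f"reported rate: {rate_reported:.1%} on {coverage:.0%} of customers")
print(f"naive projection: {naive_projection:.0f} vs observed: {supported_by_data}")
```

The gap between the naive projection and what was observed rests entirely on the customers nobody tested, which is exactly the part to question.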
Distribution
While the ‘Long Tail’ has been popular among marketers, it has been equally dangerous among analysts and modelers. The usual t-tables, confidence intervals, and standard-deviation rules of thumb typically assume ‘Normal’ distributions. If you are testing attributes like height or weight, or are fairly certain you have a homogeneous population, this may work out fine. But human behaviors and metrics based on things like income and spending are rarely ‘Normal’. Segmentation or the use of statistical methods for other types of distributions can work… but if you aren’t certain, that ‘Tail’ may belong to a tiger.
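Here is a small sketch of what a long tail does to ‘Normal’ thinking. The spending figures are invented, with one big spender standing in for the tail. It is an illustration under those assumptions, not a recipe for handling skewed data.

```python
from statistics import mean, median, stdev

# Hypothetical monthly spend per customer in dollars (invented for illustration).
# One very large value plays the role of the long tail.
spend = [5, 8, 10, 12, 15, 20, 25, 30, 40, 2000]

avg = mean(spend)
mid = median(spend)
sd = stdev(spend)  # sample standard deviation

# The "mean minus two standard deviations" rule of thumb assumes a Normal shape.
lower_bound = avg - 2 * sd

# Share of customers who spend less than the so-called average.
share_below_mean = sum(1 for s in spend if s < avg) / len(spend)

print(f"mean: {avg:.1f}, median: {mid:.1f}")
print(f"mean - 2 sd: {lower_bound:.1f}")
print(f"share below the mean: {share_below_mean:.0%}")
```

On these numbers the mean sits far above the median, the lower bound goes negative (not a possible spend), and most customers fall below the ‘average’. Any of those is a hint that the Normal assumption is not doing its job.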
Delivery
If it isn’t perfectly clear what you are going to employ an analytic model for — don’t resource it! Data is great. But delivering actionable analysis requires solid infrastructure, execution, strategy, and relevance. Many a genius model has been little more than a sharpened banana.
Duplicate
If you can’t duplicate it, it wasn’t real. This is pivotal to the scientific process but often lost in the halls of business. Remember cold fusion? At the very least, try triangulation. Differing methodologies producing similar findings is fairly solid. But only when you can repeatedly produce the same result have you hit an analytic Ode to Joy. Warning: until you can confirm the result, don’t let it anywhere near the P&L!
Originally published at www.linkedin.com.