Why it’s important to hire data engineers early

“What challenges are you tackling at the moment?” I asked. “Well,” the ex-academic said, “it looks like I’ve been hired as Chief Data Scientist… at a company that has no data.”

“Human, the bowl is empty.” — Data Scientist. Image: SOURCE.

I don’t know whether to laugh or to cry. You’d think it would be obvious, but data science doesn’t make any sense without data. Alas, this is not an isolated incident.

So, let me go ahead and say what so many ambitious data scientists (and their would-be employers) really seem to need to hear.

What is data engineering?

If data science is the discipline of making data useful, then you can think of data engineering as the discipline of making data usable. Data engineers are the heroes who provide behind-the-scenes infrastructure support that makes machine logs and colossal data stores compatible with data science toolkits. …


Tips for identifying fakers and neutralizing their snake oil

You might have heard of analysts, ML/AI engineers, and statisticians, but have you heard of their overpaid cousin? Meet the data charlatan!

Attracted by the lure of lucrative jobs, these hucksters give legitimate data professionals a bad name.

[In a hurry? Scroll down for a quick summary at the bottom.]

Data charlatans are everywhere

Chances are that your organization has been harboring these fakers for years, but the good news is that they’re easy to identify if you know what to look for.

Data charlatans are so good at hiding in plain sight that you might be one without even realizing it. Uh-oh!

The first warning sign is a failure to understand that analytics and statistics are very different disciplines.


In a nutshell, it’s all about loneliness

The curse of dimensionality! What on earth is that? Besides being a prime example of shock-and-awe names in machine learning jargon (which often sound far fancier than they are), it’s a reference to the effect that adding more features has on your dataset. In a nutshell, the curse of dimensionality is all about loneliness.

Before I explain myself, let’s get some basic jargon out of the way. What’s a feature? It’s the machine learning word for what other disciplines might call a predictor / (independent) variable / attribute / signal. Information about each datapoint, in other words. …
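To get a feel for the loneliness before the jargon piles up, here is a minimal sketch. It assumes that "loneliness" means datapoints ending up farther from their nearest neighbor as features are added; the function name, the number of points, and the uniform random data are all made up for illustration.

```python
import math
import random


def mean_nearest_neighbor_distance(n_points, n_features, seed=0):
    """Average distance from each point to its closest neighbor."""
    rng = random.Random(seed)
    # Hypothetical data: every feature is a uniform random number in [0, 1).
    points = [[rng.random() for _ in range(n_features)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        # Distance to the closest *other* point.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


# Same number of datapoints each time; only the number of features changes.
results = {d: mean_nearest_neighbor_distance(100, d) for d in (1, 2, 10, 100)}
for d, dist in results.items():
    print(f"{d:>3} features: mean nearest-neighbor distance = {dist:.3f}")
```

With the number of datapoints held fixed, the printed distances grow as features are added: each point's closest companion drifts farther away.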


Renaming that pesky little number and relearning how to use it

Is p for probability?

Technically, p-value stands for probability value, but since all of statistics is about dealing with probabilistic decision-making, that’s probably the least useful name we could give it.

Instead, here are some more colorful candidate names for your amusement.

Painful value: They make you calculate it in class without explaining it to you properly; no wonder your brain is hurting. Honorable submissions in this category also include puzzling value, perplexing value, and punishing value.

Pesky value / problematic value: Statisticians are so tired of seeing ignoramuses abuse the p-value that some of them want to see it abolished. …


On the nature of analytics, part 2 of 2

Before we dissect the nature of analytical excellence, let’s start with a quick summary of three common misconceptions about analytics from Part 1:

  1. Analytics is statistics. (No.)
  2. Analytics is data journalism / marketing / storytelling. (No.)
  3. Analytics is decision-making. (No!)

Misconception #1: Analytics versus statistics

While the tools and equations they use are similar, analysts and statisticians are trained to do very different jobs:

  • Analytics helps you form hypotheses, improving the quality of your questions.
  • Statistics helps you test hypotheses, improving the quality of your answers.

If you’d like to learn more about these professions, check out my article Can analysts and statisticians get along?

Misconception #2: Analytics versus journalism/marketing

Analytics is not marketing. …


A look inside one of the most powerful tools of the tech trade

In a nutshell: A/B testing is all about studying causality by creating believable clones — two identical items (or, more typically, two statistically identical groups) — and then seeing the effects of treating them differently.

When I say two identical items, I mean even more identical than this. The key is to find “believable clones” … or let randomization plus large sample sizes create them for you. Image: SOURCE.

Scientific, controlled experiments are incredible tools; they give you permission to talk about what causes what. Without them, all you have is correlation, which is often unhelpful for decision-making.

Experiments are your license to use the word “because” in polite conversation.

Unfortunately, it’s fairly common to see folks deluding themselves about the quality of their inferences, claiming the benefits of scientific experimentation without having done a proper experiment. …
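Here is a minimal sketch of how randomization plus a large sample can create "believable clones". Everything in it is made up for illustration: the baseline trait, the group sizes, and the assumed true effect of +2 units. It is a toy simulation, not a recipe for running a real experiment.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical users, each with a baseline trait we do not control.
baseline = [rng.gauss(50, 10) for _ in range(10_000)]

# Randomization: shuffle, then split into two groups of equal size.
rng.shuffle(baseline)
group_a, group_b = baseline[:5_000], baseline[5_000:]

# Before any treatment, the two groups should look alike on average.
gap_before = statistics.mean(group_b) - statistics.mean(group_a)

# Treat group B only, with an assumed true effect of +2 units.
TRUE_EFFECT = 2.0
outcome_a = group_a
outcome_b = [x + TRUE_EFFECT for x in group_b]
gap_after = statistics.mean(outcome_b) - statistics.mean(outcome_a)

print(f"gap between groups before treatment: {gap_before:+.2f}")
print(f"gap between groups after treatment:  {gap_after:+.2f}")
```

Because the groups start out nearly identical, the gap that appears after treatment can be attributed to the treatment. Skip the random split (say, by letting users choose their own group) and that reasoning no longer holds.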


Not causation.

Experiments allow you to talk about cause and effect. Without them, all you have is correlation. What is correlation?

IT’S NOT CAUSATION. (!!!!!)

Sure, you’ve probably already heard us statisticians yelling that at you. But what is correlation? It’s when the variables in a dataset look like they’re moving together in some way.

Two variables X and Y are correlated if they seem to be moving together in some way.

For example, “when X is higher, Y tends to be higher” (this is called positive correlation) or “when X is higher, Y tends to be lower” (this is called negative correlation).
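A small sketch of both cases, using Pearson's correlation coefficient as one common way to put a number on "moving together". The helper `pearson` and the three tiny datasets are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (co-movement), then each variable's spread.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)


x = [1, 2, 3, 4, 5, 6]
y_up = [2, 4, 5, 4, 7, 8]    # tends to be higher when x is higher
y_down = [9, 7, 8, 5, 4, 2]  # tends to be lower when x is higher

print("positive correlation:", round(pearson(x, y_up), 2))
print("negative correlation:", round(pearson(x, y_down), 2))
```

The sign of the result tells you the direction of the relationship. It says nothing about what causes what.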

If you’re looking for the formula for (population) correlation, your friend Wikipedia has everything you need. But if you wanted that, why didn’t you go there straight away? Why are you here? Ah, you want the intuitive explanation? Cool. …


Now’s a good time to rethink our assumptions about fact and fiction

In my previous article, I explained why you shouldn’t look to statistical inference for truth. Given the prevalence of statistical techniques in scientific research, what does this mean for science?

Image from an xkcd t-shirt.

(For those who insist that you need credentials to have an opinion about science, this jerk of an author holds graduate degrees in neuroscience and mathematical statistics. Glad we got that out of the way.)

Scientific theory

A hypothesis is a description or explanation, but it needn’t be true. If it amuses me, I can hypothesize that no human is taller than five feet. …


Why statistics will never give you Truth

Prepare a box of tissues! I’m about to drop a truth bomb about statistics and data science that’ll bring tears to your eyes.

INFERENCE = DATA + ASSUMPTIONS. In other words, statistics does not give you truth.

Common myths

Here are some standard misconceptions:

  • “If I find the right equations, I can know the unknown.”
  • “If I math at my data hard enough, I can reduce my uncertainty.”
  • “Statistics can transform data into truth!”

They sound like fairytales, don’t they? That’s because they are!

Painful truths

There is no magic in the world that lets you make something out of nothing, so abandon that hope now. That’s not what statistics is about. Take it from a statistician. (As a bonus, this article might save you from wasting a decade of your life studying the dark arts of statistics to chase that elusive dream.) …


Adventures in wishful thinking, nonstationarity, and pattern-finding

Imagine that you’ve just managed to get your hands on a dataset from a clinical trial. Exciting! To help you get in character, I made up some data for you to look at:

Pretend that these datapoints map out the relationship between the treatment day (input “feature”) and the correct dosage of some miracle cure in milligrams (output “prediction”) that a patient should receive over the course of 60 days.

#The data:
(1,28) (2,17) (3,92) (4,41) (5,9) (6,87) (7,54) (8,3) (9,78) (10,67) (11,1) (12,67) (13,78) (14,3) (15,55) (16,86) (17,8) (18,42) (19,92) (20,17) (21,29) (22,94) (23,28) (24,18) (25,93) (26,40) (27,9) (28,87) (29,53) (30,3) (31,79) (32,66) (33,1) (34,68) (35,77) (36,3) (37,56) (38,86) (39,8) (40,43) (41,92) (42,16) (43,30) (44,94) (45,27) (46,19) (47,93) (48,39) (49,10) (50,88) (51,53) (52,4) (53,80) (54,65) (55,1) (56,69) (57,77) (58,3) (59,57) (60,86) ... …
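If you would rather poke at these numbers than squint at them, here is a minimal sketch that loads the 60 points listed above and compares each day's dosage with the dosage 22 days later. The lag of 22 is an assumption picked by eyeballing the list (day 1 and day 23 are both 28 mg), not something the data announces about itself.

```python
import re

raw = """
(1,28) (2,17) (3,92) (4,41) (5,9) (6,87) (7,54) (8,3) (9,78) (10,67)
(11,1) (12,67) (13,78) (14,3) (15,55) (16,86) (17,8) (18,42) (19,92) (20,17)
(21,29) (22,94) (23,28) (24,18) (25,93) (26,40) (27,9) (28,87) (29,53) (30,3)
(31,79) (32,66) (33,1) (34,68) (35,77) (36,3) (37,56) (38,86) (39,8) (40,43)
(41,92) (42,16) (43,30) (44,94) (45,27) (46,19) (47,93) (48,39) (49,10) (50,88)
(51,53) (52,4) (53,80) (54,65) (55,1) (56,69) (57,77) (58,3) (59,57) (60,86)
"""

# Parse "(day,mg)" pairs into a day -> dosage lookup.
points = [(int(day), int(mg)) for day, mg in re.findall(r"\((\d+),(\d+)\)", raw)]
dosage = dict(points)

LAG = 22  # assumption: a candidate repeat length chosen by eye

# Compare every day that has a partner LAG days later (days 1 through 38).
diffs = [dosage[day + LAG] - dosage[day] for day in range(1, 61 - LAG)]
largest_gap = max(abs(d) for d in diffs)

print(f"{len(points)} points loaded")
print(f"largest dosage change between day d and day d+{LAG}: {largest_gap} mg")
```

On these 60 days the dosage never shifts by more than 1 mg across a 22-day gap, so the listed values look close to repeating. That describes only the days shown; it is no guarantee about day 61.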

About

Cassie Kozyrkov

Head of Decision Intelligence, Google. ❤️ Stats, ML/AI, data, puns, art, theatre, decision science. All views are my own. twitter.com/quaesita
