Great post, but a few immediate reactions.

One great danger among the “data” folk that I see, which the post addresses perfectly, is that there is a proclivity to worship the data itself as an idol, mistaking it for (the entirety of the) reality. The points that you raise are a great corrective to this, but…

There is also a great number of people who don’t care much about the data. They have definite ideas about how the world should be and are happy to ignore the data when it does not suit their worldview, or they insist on being “data driven” only when (some of) the data fits it.

I think the most important aspect of “data literacy” is to teach people to see both the data and its limitations. In the end, the data we have is the data we have. Many different potential “realities” could account for how we came to see the things that we see, each with a different probability (with “reality” including the different processes of collecting the data and all the incompleteness that they entail). So what we are really doing, whether explicitly or not, when we draw inferences from the data is assigning probabilities to these possible “realities.”

We “predict” confidently when we think one possible reality (or a family of realities that are “close enough”) is much more probable than the others (those that aren’t “close enough”). The big limitation, of course, is not having a wide enough perspective to see the entire array of possibilities, the potential paths connecting them to the observed outcomes, and the associated probability distributions. This is, I think, where your point #3, about inclusivity, is critical…but with a big caveat: even if all members of the team come with different backgrounds, they all need to be able to think “probabilistically,” so that they are open to possibilities they hadn’t thought about and can think through how those possibilities could give rise to the data observed (and with what probabilities). (NB: the point raised by Zoran Nesic below, about seeing different sentences in an incomplete collection of letters, underscores this. People who learned a language “non-natively” will see different things from “native learners,” i.e. the probability distributions of “what they saw,” conditional on their backgrounds and the same collection of letters, will be different. As the famous koan about Sussman and Minsky holds, the priors are important information, and drawing inferences about the priors themselves yields valuable clues. Personally, I still find it baffling that people program neural networks so that the weights are distributed randomly, or uniformly.)

So being data literate should mean three things: the data we have is the data we have, and we can’t change that; there are many different interpretations of the same data; but we CAN quantify the different interpretations, via conditional probabilities, say, which opens up a lot of opportunities for more nuanced data analyses. The first point is where many naive data people are stuck, I think. The second point is what naive data skeptics are obsessed with. The third, encompassing more nuanced analyses that recognize the limits of the data (without giving in to blind hand waving that all interpretations are functionally the same), should point the way forward: recognize the limits of data analyses, but make the limits themselves the subject of data analyses. Quantify not just what you do know but also what you don’t know, take the uncertainty more seriously, and study it systematically.

Just my two cents.

PS. My favorite example of data being compatible with many potential realities that could have given rise to it is the overused one based on coin tosses. So, you get 67 heads in 100 coin tosses. What does that tell us? We can calculate precise probabilities of this outcome given different potential values of P(H). We might find that P(H) = .67 gives the highest probability of seeing the data that we see, but it is not impossible that P(H) = .5, P(H) = .1, or P(H) = .01 could have yielded the result. Some of these are just much less probable, and we can calculate how much less with some precision (given some assumptions, e.g. whether or not the coin tosses are independent). (Only P(H) = 1 and P(H) = 0 are ruled out categorically.) One reason that stats people have used p-values is that, properly used, they provide a potential antidote to overconfidence in our own hypothesis, since the p-value is P(data at least as extreme as what we observed | null hyp is true, i.e. our hyp is wrong). Sadly, people have misunderstood and abused p-values.
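To make the coin example concrete, here is a minimal sketch in plain Python (standard library only). It assumes independent tosses, so the number of heads is binomial. The function names `binom_pmf` and `two_sided_p_value` are my own, picked for illustration, and the p-value uses one common convention (sum the probabilities of all outcomes no more likely than the observed one under the null); other conventions give slightly different numbers.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent tosses when P(H) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(k, n, p0):
    """P(an outcome no more likely than k heads | P(H) = p0).

    Assumption: 'at least as extreme' is read as 'no more probable under
    the null than the observed count', one common exact-test convention.
    """
    pmf = [binom_pmf(i, n, p0) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return sum(q for q in pmf if q <= cutoff)

n, k = 100, 67

# Likelihood of the same data under several candidate "realities".
for p in (0.67, 0.5, 0.1, 0.01, 0.0, 1.0):
    print(f"P(H) = {p:<4}  P(67 heads in 100 | P(H)) = {binom_pmf(k, n, p):.3e}")

# How surprising the data would be if the coin were fair.
print(f"two-sided p-value against P(H) = 0.5: {two_sided_p_value(k, n, 0.5):.2e}")
```

The loop shows the ordering described above: every value strictly between 0 and 1 gives a nonzero probability for the data, while 0 and 1 give exactly zero.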

To be fair, “null hyp being true” is potentially a very broad category: if we think P(H) = .67 exactly, then one would almost never get a p-value of any usefulness, regardless of the data. So what we really mean is that we hypothesize P(H) to be in the interval (.67-x, .67+x). But this subverts the appearance of precision when we say P(H) = .67. In a sense, the x in this statement is sometimes far more important than the .67, but it also raises some questions. The hypothesis that P(H) is in the interval (.1, 1.0), given 67 heads in 100 tosses, is probably “true” under all manner of circumstances, with a very small “p-value” (used loosely here for the probability that P(H) falls outside the interval), but it is of limited usefulness. That P(H) is in the interval (.65, .69) could be more useful, if you need a more precise value of P(H), but only if that probability (that is, P(P(H) > .69) + P(P(H) < .65)) is small. This is the big question that needs to be answered, and often it cannot be answered with much confidence. So the “prediction” of P(H) has to, at minimum, come with two separate values that are really two sides of the same coin: we say P(H) is in (some interval) and P(P(H) not in this interval) = X, along with some clues as to how the latter was arrived at (the full distribution of P(H|observed data), if that is feasible, would be even better, and pymc3 is making this increasingly easier, I think, although with the attendant caveats, obviously). This is what I didn’t like about the recent war on p-values, although, knowing how the concept has been abused for so long, I couldn’t say it was unconditionally a bad thing.
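For what it’s worth, the “interval plus the probability of being outside it” idea can be sketched without any special library, under one loudly stated assumption: a flat prior on P(H). With that prior and independent tosses, P(H | 67 heads in 100) is a Beta(68, 34) distribution, and the probability of an interval is just the area under its density. The function name `posterior_interval_prob` is hypothetical, and this is a rough numerical sketch, not how pymc3 does it.

```python
from math import exp, lgamma, log

def posterior_interval_prob(k, n, lo, hi, steps=20_000):
    """Approximate P(lo < P(H) < hi | k heads in n tosses).

    Assumption: flat prior on P(H), so the posterior is Beta(k + 1, n - k + 1).
    The area is computed with a simple midpoint rule, which never evaluates
    the density at exactly 0 or 1.
    """
    a, b = k + 1, n - k + 1
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)  # log of 1 / Beta(a, b)
    width = (hi - lo) / steps
    area = 0.0
    for i in range(steps):
        p = lo + (i + 0.5) * width
        area += exp(log_norm + (a - 1) * log(p) + (b - 1) * log(1 - p))
    return area * width

k, n = 67, 100
for lo, hi in ((0.1, 1.0), (0.65, 0.69)):
    inside = posterior_interval_prob(k, n, lo, hi)
    print(f"P(H) in ({lo}, {hi}): inside = {inside:.4f}, outside = {1 - inside:.4f}")
```

The wide interval comes out nearly certain but uninformative, while the narrow one leaves a large probability of being wrong, which is the trade-off between the interval and the X described above. A different prior would change the numbers.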