Keeping Up With Data — Week 10 Reading List
5 minutes for 5 hours’ worth of reading
Outcomes, not outputs! This is what matters in data science. But we easily get distracted by day-to-day operations. So, a reminder is useful. Be it in the form of a funny picture like the one above or the recent book by Bill Schmarzo: The Economics of Data, Analytics, and Digital Transformation.
- ‘Big’ Data Can Be 99.98% Smaller Than It Appears: Intuition tells us that larger samples are more reliable. But we mustn’t forget the importance of how the sample has been selected. To assess the saltiness of a soup, even just a spoonful is enough. But only if it is well stirred! Similarly, the opinion of a sample can be generalised to the population only if the sample is random. With even a tiny selection bias of 0.5%, the opinions of 2.3 million people are no better than a truly random sample of 400. So, keep that in mind and be aware of the ‘big data paradox’: the more data, the surer we fool ourselves. (Bloomberg)
- Introduction to Causality | the science of measuring and optimizing cause & effect: A couple of years ago there seemed to be a Pavlovian reflex about causality triggered by the word “correlation”. Unfortunately, this reflex is often not backed by an understanding of when to consider a causal analysis and what the risks of non-causal decision making are. And that understanding is crucial. So be aware of the confounders. They are not always as obvious in real life as they are in funny spurious correlation examples. (Bayes Server)
- Test Your Data Until It Hurts: Data quality is a hot topic. But how does it look in practice? What does ‘good data’ look like? What should you check? How do you set the thresholds? This is not something that you can brainstorm from behind a desk. You need to get your arms elbow deep in the data. There are two important points in the article: first, set up the quality control mechanisms independently of the data pipelines, and second, expect the tests to fail in the first couple of weeks. You should even set them up so that they will fail! Don’t worry about false positives. Worry about false negatives! (Micha Kunze @ Towards Data Science)
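For the curious, the arithmetic behind the first item can be sketched in a few lines. This is only a rough illustration, assuming the figures follow Xiao-Li Meng's effective sample size approximation, n_eff ≈ f / (1 - f) / ρ², where f is the fraction of the population that was sampled and ρ is the correlation between being in the sample and the answer given. Reading the 0.5% as ρ = 0.005 and using a population of about 231 million are my assumptions, not numbers stated above.

```python
def effective_sample_size(n: int, population: int, defect_correlation: float) -> float:
    """Approximate size of a truly random sample with the same error as a biased one.

    Uses the approximation n_eff ~ f / (1 - f) / rho^2, where f = n / population.
    """
    f = n / population  # fraction of the population that ended up in the sample
    return (f / (1 - f)) / defect_correlation ** 2


n = 2_300_000    # size of the biased sample (from the article summary)
N = 231_000_000  # ASSUMPTION: size of the population being surveyed
rho = 0.005      # ASSUMPTION: the "0.5% selection bias" read as a correlation

n_eff = effective_sample_size(n, N, rho)
reduction = 1 - n_eff / n  # how much "smaller" the big sample effectively is

print(f"effective sample size: about {n_eff:.0f}")
print(f"effective reduction:   {reduction:.2%}")
```

Under these assumptions the 2.3 million responses shrink to roughly 400, which is where the 99.98% in the headline comes from. Halving ρ quadruples the effective sample size, so the result is very sensitive to how biased the selection really is.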
Every Thursday night when I’m writing up my reading list, I ask myself not to leave it till next Thursday evening. And here I am again. But the good thing is that I don’t have to worry about it for another week now!