In this short post, I want to highlight for you some clever things that Yoshua and his collaborators did to win a Machine Learning competition from a field of 381 competing teams. Perhaps these ideas will be useful for your own work.
In a world where powerful Deep Learning frameworks (e.g., TensorFlow, PyTorch) are a free download away, their competition-winning approach demonstrates nicely that your edge may come from how well you model the specifics of your problem.
(Caveat: This work was done in 2015. Given all the advancements in Deep Learning and computing hardware since then, Yoshua and team would probably solve the problem differently if the competition were held…
In Part 1, we looked at how to use Causal Inference to draw the right conclusions — or at least not jump to the wrong conclusions — from observational data.
We saw that confounders are often the reason why we draw the wrong conclusions and learned about a simple technique called stratification that can help us control for confounders.
In this article, we present another example of how to use stratification and then consider what to do when there are so many confounders that stratification becomes messy.
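To make the stratification idea concrete, here is a minimal sketch in plain Python. The scenario, names, and numbers are all invented for illustration (they are not from the original post): we ask whether a promotional email lifted purchase rates, with loyalty segment as a confounder, because the email went mostly to new customers who buy less regardless.

```python
# (segment, got_email, bought, customer_count) -- synthetic aggregates,
# invented for illustration.
rows = [
    ("new",   True,  True,  18), ("new",   True,  False, 72),
    ("new",   False, True,   1), ("new",   False, False,  9),
    ("loyal", True,  True,   7), ("loyal", True,  False,  3),
    ("loyal", False, True,  54), ("loyal", False, False, 36),
]

def purchase_rate(keep):
    """Purchase rate among the rows selected by the predicate `keep`."""
    bought = sum(n for seg, t, b, n in rows if keep(seg, t) and b)
    total = sum(n for seg, t, b, n in rows if keep(seg, t))
    return bought / total

# Naive comparison over the whole population: the email looks harmful.
naive = {t: purchase_rate(lambda seg, got, t=t: got == t) for t in (True, False)}
# naive[True] == 0.25, naive[False] == 0.55

# Stratified comparison: within each loyalty segment, the email helps.
stratified = {
    seg: {t: purchase_rate(lambda s, got, seg=seg, t=t: s == seg and got == t)
          for t in (True, False)}
    for seg in ("new", "loyal")
}
# new:   0.20 (email) vs 0.10 (no email)
# loyal: 0.70 (email) vs 0.60 (no email)
```

The naive comparison and the stratified one point in opposite directions because the treatment is concentrated in the low-purchasing segment; controlling for the confounder by comparing within strata recovers the right conclusion.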
Let’s say you work at a multi-channel retailer. You have been analyzing customer-level sales data and notice the…
I used to work at a global management consulting firm many years ago. As a new associate, when I presented the results of my work, I’d often be stopped in my tracks with, “That is interesting. But what is the so what here?”
“So what” was shorthand for several related things.
New associates quickly developed the habit of considering the so-what of their findings before presenting anything. While painful and humbling at first, this turned out to be a very useful habit. …
Isaac Asimov is one of my favorite writers. I recently finished reading It’s Been a Good Life, a compendium of excerpts from his letters, speeches and unpublished writing, curated by his wife Janet Jeppson Asimov.
The book is worth reading in its entirety: it is full of insights, candid self-reflections, pithy statements of his life philosophy, and accounts of pivotal life events. I picked a few below that particularly resonated with me; if they click with you as well, please do read the book.
On the joy of reading (reflecting on his childhood growing up in Brooklyn):
To those who are not bookworms, it must be a curious thought that someone would read and read, letting life with all its glory pass by unnoticed, wasting the carefree days of youth, missing the wonderful interplay of muscle and sinew. There must seem something sad and even tragic about it, and one might wonder what impels a youngster to do it. …
Analysts and data scientists operating in the business world are awash in observational data: data generated in the normal course of business operations. This is in contrast to experimental data, where subjects are randomly assigned to different treatment groups, and outcomes are recorded and analyzed (think randomized clinical trials or A/B tests).
Experimental data can be expensive or, in some cases, impossible/unethical to collect (e.g., assigning people to smoking vs non-smoking groups). Observational data, on the other hand, are very cheap since they are generated as a side effect of business operations.
Given this cheap abundance of observational data, it is no surprise that ‘interrogating’ this data is a staple of everyday analytical work. And one of the most common interrogation techniques is comparing groups of ‘subjects’ — customers, employees, products, … — on important metrics. …
When you set out to solve a data science problem, it is very tempting to dive in and start building models.
Don’t. Create a common-sense baseline first.
A common-sense baseline is how you would solve the problem if you didn’t know any data science. Assume you don’t know supervised learning, unsupervised learning, clustering, deep learning, whatever. Now ask yourself, how would I solve the problem?
Experienced practitioners do this routinely.
They first think about the data and the problem a bit, develop some intuition about what makes a solution good, and think about what to avoid. …
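Two such baselines can be sketched in a few lines of Python. The helper names and the toy inputs below are invented for illustration; the point is simply that any model you build afterwards should at least beat these no-data-science answers.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """Classification baseline: always predict the most frequent label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: most_common_label

def last_value_baseline(history):
    """Forecasting baseline: predict that the next value equals the latest one."""
    return history[-1]

# Toy usage (data invented for illustration)
predict = majority_class_baseline(["churn", "stay", "stay", "stay"])
print(predict({"tenure_months": 3}))        # -> stay
print(last_value_baseline([102, 98, 105]))  # -> 105
```

If a deep network barely beats "always predict the majority class," that tells you something important about the problem (or the features) before you invest further.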
This question comes up often.
It is typically asked by starting data scientists, analysts and managers new to data science.
Their bosses are under pressure to show some ROI from all the money that has been spent on systems to collect, store and organize the data (not to mention the money being spent on data scientists).
Sometimes they are lucky — they may be asked to solve a very specific and well-studied problem (e.g., predict which customer is likely to cancel their mobile contract). …
[6/5/2017 update: I was asked if I had a PDF version of the one-liners below. Here it is — Data-Science-One-Liners.pdf]
Experienced data scientists use Unix/Linux command-line utilities (like grep, sed and awk) a great deal in everyday work. But starting data scientists, particularly those without programming experience, are often unaware of the power and elegance of these utilities.
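To make the point concrete, here are a few one-liners of the kind the post describes, run against a tiny sample file. The file name, columns, and values are all invented for illustration.

```shell
# Create a tiny sample CSV (name and contents invented for illustration)
printf 'date,region,amount\n2017-01-05,EMEA,10\n2017-01-06,APAC,5\n2017-01-07,EMEA,7\n' > sales.csv

# grep: count the rows mentioning EMEA
grep -c 'EMEA' sales.csv                                    # prints 2

# sed: rewrite a value in the stream
sed 's/EMEA/Europe/' sales.csv > sales_renamed.csv

# awk: sum the amount column, skipping the header row
awk -F, 'NR > 1 { sum += $3 } END { print sum }' sales.csv  # prints 22
```

Each of these would take several lines of code in a general-purpose language; at the command line they compose into pipelines with no setup at all.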
This was a shock to many. Chess fell to AI many years ago, but Go was thought to be safe from AI for a while, and AlphaGo’s success set off a flurry of questions. Is AI much further along than we think? Are robots with human-level intelligence just around the corner?
Experts have lined up on both sides of these questions and there’s no shortage of perspectives. …