In my career, I’ve been fortunate enough to work on some fun problems: I studied the mathematics of rivers during my Ph.D. at MIT, worked on uplift models and open-sourced pylift at Wayfair, and implemented novel homepage targeting models & CUPED improvements at Airbnb. But in all of this work, the day-to-day has never been glamorous — in reality, I’d often spend the bulk of my time finding, learning about, and validating data. …
When I was at Airbnb, I had the wonderful opportunity to work on a new team reporting to Brian Chesky. It was exciting — we were scoping a new product line, so we had to make game-changing decisions every day. But as the team’s data scientist, I became responsible for procuring data to guide our product direction, and this meant a lot of analytics work.
The first week was a grueling test of my ability to context switch: I had to not only find obscure tables and write tons of queries, but also regex through BeautifulSoup scrapes and Qualtrics API…
Some context: I started a company. We posted 17 times on LinkedIn to test out social media as a lead-generation channel. I wanted to know: what can we learn from the performance of these 17 posts?
The data scientist in me naturally screamed excuses.
There's no randomization here! There's every kind of bias here! I need to weight this by a propensity model. But even then, I don't have overlap!
But as Dataframe’s head of product, I killed this inner version of myself and came up with what I think is a reasonable approach to small-data analytics…
I’m going to use the recent Robinhood/GME fiasco as a hypothetical example in sharing a couple of practical ideas around experimentation I’ve picked up over the years. Disclaimer: this is obviously hypothetical. I’m just always looking for good teaching examples, and this felt like a fun use of time. :)
That is, suppose they ran an A/B test to determine the causal effect of their trade restrictions. In this case, the test spec would look something like this:
Treatment: users can’t trade GME.
Control: users can trade GME.
After randomly assigning users to treatment and control, they could then…
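The random assignment step above is typically done with deterministic hash-based bucketing, a standard pattern in experimentation platforms. A minimal sketch (the function and experiment names here are hypothetical, not anything Robinhood actually runs):

```python
import hashlib

def assign(user_id: str, experiment: str = "gme_trade_restriction") -> str:
    """Deterministically bucket a user into treatment or control (50/50)
    by hashing (experiment, user_id). Hypothetical illustration only."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

assign("user_42")
```

Because the split is a pure function of the user ID, a user always lands in the same group on every request, and no assignment table needs to be stored.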
First, a question:
What constitutes “good” data science?
If you were to ask your seasoned data scientist friends, you’d likely get a fantastic soundbite or two. Good data science is communication. Good data science is valuable. Good data science is creative. But from an organizational perspective, these aren’t exactly it.
Why? Because data science work is not just the final product. It’s about the context around it — the products and people using it, the decisions it drives, the insights it provides. It is science, after all, and the scientific method has careful documentation at its center. Imagine if Marie…
If you are a data scientist, it’s your job to extract value from data. But in your career, you’ll likely spend (and have spent) a non-trivial amount of time looking for and getting context around said data. In my experience, this commonly plays out as a wild goose chase — you ask the most senior person on your team who refers you to someone else, who refers you to someone else, who tells you they don’t know what you’re talking about. You then repeat this process until you find someone who can refer you to a useful table.
We’ve all had to write Python code with heavy nesting, like this:
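For example, the same computation as the pipe version below, written as ordinary nested calls (assuming simple illustrative helpers Sum, Abs, and Print):

```python
# Illustrative wrappers around built-ins, so the example is runnable
def Sum(xs): return sum(xs)
def Abs(x): return abs(x)
def Print(x): print(x)

# Heavy nesting: you have to read it inside-out, right to left
Print(Abs(Sum([1, 2, -4])))  # prints 1
```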
Piping (with Pipey) lets you write this using a pipe operator >> as follows:
[1,2,-4] >> Sum >> Abs >> Print
This syntactic sugar is called piping, and it allows you to pass the output of one command as the input of the next without nesting functions inside functions inside functions. It’s not natively supported in Python, so we wrote a library to add it.
Piping orders commands so they follow the flow of logic, making code substantially more readable and declarative. …