And why your data science team needs it.

🗃️ First, how Airbnb’s data discovery tool changed my life.

In my career, I’ve been fortunate enough to work on some fun problems: I studied the mathematics of rivers during my Ph.D. at MIT, worked on uplift models and open-sourced pylift at Wayfair, and implemented novel homepage targeting models & CUPED improvements at Airbnb. But in all of this work, the day-to-day has never been glamorous — in reality, I’d often spend the bulk of my time finding, learning about, and validating data. …


Tips and Tricks

Or: lessons learned by writing ad-hoc queries for Airbnb’s CEO.

Don’t burn down your company with your bad queries. (Image by author)

When I was at Airbnb, I had the wonderful opportunity to work on a new team reporting to Brian Chesky. It was exciting — we were scoping a new product line, so we had to make game-changing decisions every day. But as the team’s data scientist, I became responsible for procuring data to guide our product direction, and this meant a lot of analytics work.

The first week was a grueling test of my ability to context switch: I had to not only find obscure tables and write tons of queries, but even regex through beautifulsoup scrapes and Qualtrics API…


Do it by hand, and don’t overdo it

Some context: I started a company. We posted 17 times on LinkedIn to test out social media as a lead-generation channel. I wanted to know: what can we learn from the performance of these 17 posts?

The data scientist in me naturally screamed excuses.

There was no randomization here! There is every kind of bias here! I need to weight this by a propensity model. And even then, I don’t have overlap!

But as Dataframe’s head of product, I killed this inner version of myself and came up with what I think is a reasonable approach to small-data analytics…


tl;dr You don’t need (and probably can’t use) an A/B test to know that Robinhood churned its user base by restricting GME trading.

I’m going to use the recent Robinhood/GME fiasco as a hypothetical example in sharing a couple of practical ideas around experimentation I’ve picked up over the years. Disclaimer: this is obviously hypothetical. I’m just always looking for good teaching examples, and this felt like a fun use of time. :)

Suppose Robinhood ran an experiment, randomly letting only some users trade GME.

That is, suppose they ran an A/B test to determine the causal effect of their trade restrictions. In this case, the test spec would look something like this:

Treatment: users can’t trade GME.
Control: users can trade GME.
Metric: uninstalls.

After randomly assigning users to treatment and control, they could then…
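
As a rough illustration of the spec above, here is a minimal sketch of random assignment followed by a comparison of uninstall rates between the two groups. Everything in it is an assumption made for illustration: the function names `assign` and `uninstall_rate_diff`, the 50/50 split, the pooled two-proportion z statistic, and the toy user ids. It is not how Robinhood (or anyone) actually ran such a test.

```python
import math
import random


def assign(user_ids, seed=0):
    """Randomly assign each user to 'treatment' or 'control' (50/50)."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return {uid: rng.choice(["treatment", "control"]) for uid in user_ids}


def uninstall_rate_diff(assignment, uninstalled):
    """Return (treatment rate - control rate, z statistic) for uninstalls.

    assignment:  dict mapping user id -> 'treatment' or 'control'
    uninstalled: set of user ids that uninstalled the app
    Assumes both groups contain at least one user.
    """
    counts = {"treatment": [0, 0], "control": [0, 0]}  # [uninstalls, users]
    for uid, group in assignment.items():
        counts[group][0] += uid in uninstalled
        counts[group][1] += 1

    (x_t, n_t), (x_c, n_c) = counts["treatment"], counts["control"]
    p_t, p_c = x_t / n_t, x_c / n_c

    # Pooled two-proportion z statistic.
    pooled = (x_t + x_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se if se > 0 else 0.0
    return p_t - p_c, z


# Toy, hand-built example (hypothetical ids, not real data).
toy_assignment = {1: "treatment", 2: "treatment", 3: "control", 4: "control"}
toy_diff, toy_z = uninstall_rate_diff(toy_assignment, uninstalled={1})
```

A positive difference means the treatment group (users who can’t trade GME) uninstalled at a higher rate than control; the z statistic gives a rough sense of whether that gap is larger than chance alone would explain.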


And how Dataframe can help. 🐳🔥

First, a question:

What constitutes “good” data science?

If you were to ask your seasoned data scientist friends, you’d likely get a fantastic soundbite or two. Good data science is communication. Good data science is valuable. Good data science is creative. But from an organizational perspective, these aren’t exactly it.

✍️ Good data science is, first and foremost, good documentation.

Why? Because data science work is not just the final product. It’s about the context around it — the products and people using it, the decisions it drives, the insights it provides. It is science, after all, and the scientific method has careful documentation at its center. Imagine if Marie…


Introducing metaframe: a markdown-based documentation tool and data catalog for data scientists.

Repo: https://github.com/rsyi/metaframe

If you are a data scientist, it’s your job to extract value from data. But in your career, you’ll likely spend (and have spent) a non-trivial amount of time looking for and getting context around said data. In my experience, this commonly plays out as a wild goose chase — you ask the most senior person on your team who refers you to someone else, who refers you to someone else, who tells you they don’t know what you’re talking about. You then repeat this process until you find someone who can refer you to a useful table.


Why you should be piping and how to do it.

What is piping?

We’ve all had to write Python code with heavy nesting, like this:

print(abs(sum([1,2,-4])))

Piping (with Pipey) lets you write this using a pipe operator >> as follows:

[1,2,-4] >> Sum >> Abs >> Print

This syntactic sugar is called piping, and it allows you to pass the output of one command as the input of the next without nesting functions inside functions inside functions. It’s not natively supported in Python, so we wrote a library to support it.
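
To make the idea concrete, here is a minimal sketch of one way a `>>` pipe operator can be built in plain Python, by overloading the reflected right-shift method `__rrshift__`. The `Pipeable` class is a hypothetical name used only for this illustration; this is not Pipey’s actual implementation or API, just an assumption about how such an operator could work.

```python
class Pipeable:
    """Wrap a function so that `value >> Pipeable(f)` evaluates f(value)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, value):
        # Python calls this for `value >> self` when the left operand
        # does not know how to right-shift by a Pipeable.
        return self.func(value)

    def __call__(self, *args, **kwargs):
        # Keep the wrapped function usable in the ordinary nested style.
        return self.func(*args, **kwargs)


# Hypothetical pipeable wrappers around the built-ins from the example.
Sum = Pipeable(sum)
Abs = Pipeable(abs)
Print = Pipeable(print)

# Same computation as abs(sum([1, 2, -4])), read left to right.
result = [1, 2, -4] >> Sum >> Abs
```

Each `>>` hands the value on its left to the wrapped function on its right, so the chain reads in the order the steps run. One caveat of this approach: a left operand that defines its own `>>` for arbitrary objects would take precedence over `__rrshift__`.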

Piping orders commands so they follow the flow of logic, making code substantially more readable and declarative. …

Robert Yi

Chief Data Officer / Co-founder at Dataframe. Author of whale, pylift. Formerly: data @ Airbnb, Wayfair; Ph.D. @ MIT, physics @ Harvard. twitter.com/imrobertyi
