A Few Reflections on Data Science

Ian Wong · Published in Open House · Jul 8, 2017

I recently did an interview with Peadar Coyle on his blog (link), where we discussed a variety of topics on data science. Sharing here as well since it may interest our readers.

What project that you’ve worked on do you wish you could go back to and do better?

Pretty much any project I’ve worked on in the past :) Two projects stick out though.

Uniform API for Data Fetching

When we fit a model, call it y = f(X), the data (X, y) are often taken for granted to be well-formed. How do you design a service that generates consistent (X, y)? It turns out this is not straightforward, and the answer is specific to the domain and the data-capture systems.

The ideal solution would satisfy both batch & real-time needs; make it easy to ship new features to production; and enable rapid prototyping. While I scraped together the initial version at Opendoor, our amazing team of data engineers has really taken it to the next level. We look forward to reporting back our findings soon.
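
To make this concrete, here is a minimal sketch of what a uniform fetching interface could look like, in Python with pandas. The names (`FeatureSource`, `DatasetBuilder`) and the exact method signatures are illustrative assumptions, not the actual Opendoor system; the point is simply that training and serving share a single feature definition.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

import pandas as pd


class FeatureSource(Protocol):
    """One feature, defined once and reused for both training and serving."""

    name: str

    def fetch_batch(self, keys: pd.DataFrame) -> pd.Series:
        """Point-in-time values for many (entity_id, as_of) rows, aligned to `keys`."""
        ...

    def fetch_realtime(self, entity_id: str) -> float:
        """Current value for a single entity, used at prediction time."""
        ...


@dataclass
class DatasetBuilder:
    """Assembles a consistent (X, y) from the same feature definitions used in serving."""

    features: Sequence[FeatureSource]

    def training_frame(self, labels: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
        # `labels` has columns: entity_id, as_of, y
        keys = labels[["entity_id", "as_of"]]
        X = pd.DataFrame({f.name: f.fetch_batch(keys) for f in self.features})
        return X, labels["y"]

    def serving_row(self, entity_id: str) -> pd.DataFrame:
        # One-row frame with the same columns the model saw during training.
        return pd.DataFrame([{f.name: f.fetch_realtime(entity_id) for f in self.features}])
```

The design choice worth stressing is that `fetch_batch` takes point-in-time keys, so the training matrix never leaks information the live service would not have had.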

Interactive Visualizations of ML Algorithms

A few years ago, I tried putting together a visualizer for random forests using d3. I had the tremendous fortune of working with Mike Bostock for a bit, and was inspired by his ability to make abstract concepts tangible through interactive visualizations. At the time, I was working with these big sets of random forests, and wanted to get a better feel for the model outputs. So I rendered hundreds of decision trees on screen, where hovering over one node highlighted all the other nodes across the trees that split on the same feature. It was pretty neat! But the prototype suffered from performance issues, plus my own technical incompetence.
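
The prototype itself was in d3, but the underlying data is straightforward to pull out of a fitted model. Here is a hedged Python sketch, assuming scikit-learn (which the original prototype did not necessarily use), of grouping forest nodes by the feature they split on; that index is exactly what a hover-to-highlight interaction needs.

```python
# Collect, for a fitted forest, which feature each internal node splits on, so that
# hovering a node can highlight every node across trees that uses the same feature.
from collections import defaultdict

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

nodes_by_feature = defaultdict(list)  # feature index -> [(tree index, node index), ...]
for t, tree in enumerate(forest.estimators_):
    for node, feature in enumerate(tree.tree_.feature):
        if feature >= 0:  # negative values mark leaf nodes
            nodes_by_feature[feature].append((t, node))

# Hovering a node that splits on feature f would highlight nodes_by_feature[f].
```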

More broadly, I’m really excited about better interactive tools for ML algorithms because they’ll help us deeply understand them as tools. During my undergrad studies in electrical engineering, we used to play with circuits a lot. In the labs, you could make a change to the input or to the circuitry, and see the corresponding change in output on the oscilloscope instantaneously. When we run a simple regression, wouldn’t it be great to get immediate feedback on the fitted line if we were to drag, add, or delete a data point?
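
As a toy illustration of that feedback loop, the sketch below (plain NumPy; the function name is mine) refits a least-squares line every time the point set changes, which is the computation an interactive tool would run on every drag.

```python
import numpy as np


def fitted_line(points: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through `points`."""
    x, y = np.array(points).T
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]
print(fitted_line(points))   # initial fit
points.append((3.0, 10.0))   # drag in an outlier...
print(fitted_line(points))   # ...and the line updates immediately
```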

I’m hopeful we’ll see more innovation in the read-eval-print loop for data science.

See demo at 5m20s. One day, we’ll approach Bret Victor’s ideas in data science.

What advice do you have for younger analytics professionals, and in particular PhD students in the sciences?

My view may be slightly biased as I’m a PhD drop-out ;) In grad school, I became increasingly frustrated at the divergence between what’s interesting and what’s impactful. But that’s a whole separate conversation.

For folks looking to enter industry, nothing replaces hands-on practice. I would strongly encourage students to look for internships, participate in Kaggle competitions or Google Summer of Code, and seek open source projects to contribute to. If you’re in school, take a wide variety of classes, especially computer science and project-based courses.

The higher order bit here is that the industry faces a different, evolving set of challenges than academia. The focus is typically on solving a business problem.

Here are additional pointers depending on the reader’s interest.

Business Intelligence & Decision Science

  • The grammar of graphics and tidy data lay a great foundation for reasoning about data (a short sketch follows this list). I personally learned more by working through Hadley’s ggplot2 book than from many of my stats classes at Stanford.
  • Develop business acumen and communication skills. Sometimes academics prefer to stay in the rarefied air of theory and mathematics. Success in the analytics profession requires the ability to (a) meet hard business challenges head on, (b) break them down into smaller, quantifiable sub-problems, (c) analyze them rapidly, (d) present findings in a way that the audience can engage with, and (e) take feedback and iterate.
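
Here is the sketch referenced above: a grammar-of-graphics example in Python, assuming the plotnine package (a ggplot2 port) and a made-up housing data frame. The layered spec reads almost identically in R’s ggplot2.

```python
import pandas as pd
from plotnine import aes, geom_point, geom_smooth, ggplot

df = pd.DataFrame(
    {
        "sqft": [800, 1200, 1500, 2000, 2600],
        "price": [150, 210, 255, 330, 410],
        "beds": ["2", "2", "3", "3", "4"],
    }
)

# Data, aesthetic mappings, and geometric layers are declared separately.
plot = (
    ggplot(df, aes(x="sqft", y="price"))
    + geom_point(aes(color="beds"))
    + geom_smooth(method="lm")
)
plot.save("price_vs_sqft.png")  # or just `plot` in a notebook
```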

Machine Learning & Engineering

  • Code, code, code. As I mentioned in Doing Data Science: ML is founded in math, expressed in code, and assembled into software. Being able to build robust software systems is becoming more important, as tools and algorithms are increasingly available.
  • While a strong grasp of theory will help narrow design choices, nothing beats rapidly exploring hypotheses. This demands coding proficiency, which from experience is a differentiating trait of highly productive data scientists.

Also, don’t let your field tie you down! Beware of the sunk cost fallacy. Though PhDs may have invested years studying a certain field, the techniques investigated through a graduate program may not be transferable to a new domain. The most important quality of the PhD is persistence in doing research. Remember, it’s re-search: search and search again. That’s what defines a great problem solver.

What do you wish you knew earlier about being a data scientist?

There are so many! Here are a few.

How to build great predictive services

While we spend a lot of energy in grad school studying techniques, advanced methods often yield only incremental lift over a simple solution (and in many cases come with complexity that becomes a heavy tax; see also the classic paper on ML tech debt). I think the big focus on modeling techniques contributes to the phenomenon of solutions chasing problems, rather than solutions being designed from the needs of the problem. Here’s a rule of thumb that I’ve come to adopt: “You know that algorithm that all the papers make fun of in their intro? Implement that and forget the rest of the paper.”
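
As a hedged illustration of that rule of thumb (synthetic data, scikit-learn, and model choices that are mine, not a prescription): fit the boring baseline first and make anything fancier earn its added complexity.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

for name, model in [
    ("predict the mean", DummyRegressor()),
    ("ridge regression", Ridge()),
    ("gradient boosting", GradientBoostingRegressor(random_state=0)),
]:
    # Cross-validated mean absolute error; lower is better.
    mae = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5).mean()
    print(f"{name:>18}: MAE = {mae:.1f}")

# Ship the simplest model whose error meets the business need.
```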

Perhaps influenced by schooling, we as data scientists often dream about having these flashes of brilliance that identify a proof! QED! In practice, what delivers results is an error-focused, iterative process of continuous model improvement (see my post on Iterative Model Development). It’s the unglamorous engineering & detective work of starting with the biggest outliers of the model, and reasoning from first principles to eliminate them. Model debugger describes the role better than data scientist. It’s about the toil.
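
A minimal sketch of that loop, assuming pandas and a validation frame with `y_true` and `y_pred` columns (the names are illustrative): rank examples by absolute error and start the detective work from the top.

```python
import pandas as pd


def worst_residuals(df: pd.DataFrame, k: int = 20) -> pd.DataFrame:
    """Expects columns `y_true` and `y_pred`; returns the k largest misses."""
    out = df.assign(abs_error=(df["y_true"] - df["y_pred"]).abs())
    return out.sort_values("abs_error", ascending=False).head(k)

# Each row of worst_residuals(validation_df) becomes a small investigation:
# bad label? missing feature? out-of-distribution input? Fix, refit, repeat.
```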

Forming a perspective based on incomplete data

Intellectual honesty, scientific doubt and a healthy dose of paranoia are generally great things to have. But beware of analysis-paralysis and failure to put a stake in the ground. Decisions need to be made in a timely fashion. In many cases you’re operating with 80% of the information (if you’re lucky!), and your teammates are counting on you for a recommendation.

Earlier in my career, I was reluctant to form and articulate a strong perspective, partly due to skepticism inculcated through school, and partly because it didn’t seem like my job as a data scientist to do so (more on titles being a constraint later). Making an actual policy recommendation seemed so messy relative to the clean code and beautiful plots staring at me on the monitor. But I’ve since learned that this is an abdication of responsibility. Our job is to help the company make data-informed decisions, which means thinking through the implications of an analysis, consulting stakeholders, and coming up with a point of view.

Communication as craft

The value of an analysis is measured by whether it influences decisions. Even the most brilliant analysis becomes ineffective if not delivered to the audience in an accessible manner. This Jeff Atwood blog post explains the concept well.

It’s our job to provide a persuasive, data-informed argument!

Other things!

There are many other things I wish I knew earlier. How do I pick up software engineering skills? What does a great data scientist look like? How do I progress to become better? How do I foster effective debate and engagement of my work? I’ll omit them for now since this is getting long…

How do you respond when you hear the phrase “big data”? What about “AI”?

On Big Data

There’s big data, and there’s Big Data. If you’re referring to the latter, I think it’s a bit passé at this point (with some exceptions).

Turns out that more data beats better algorithms most of the time. As an industry we have worked really hard to make count(x) group by y scale to terabytes of data. But as alluded to earlier, the tools and infrastructure are increasingly commoditized. We’re ready to move on to higher parts of the application stack vs. focusing on the base layer. (e.g., Opendoor! See also the vertical AI piece by Bradford Cross.)

On AI

Turns out machines are tireless and can count much more reliably than human beings. This has implications as we enter the age of abundant data. This can get philosophical quickly (read Homo Deus :)! But there are both benefits and hazards we’ll need to navigate.

What is the most exciting thing about your field?

As technology and education improve and become more accessible, there’ll be an increased supply of data science and machine learning talent. These individuals will become the next generation of builders and leaders. Algorithmic sophistication is going to seep into all parts of our daily lives. The products they create are going to be smarter, easier to use and more personal (Opendoor being an example).

How do you frame a data problem? How do you avoid spending too much time, manage expectations, and know when it’s good enough?

Alan Kay once said: A change in perspective is worth 80 IQ points. Framing a problem well is probably the most important part of the solution.

Within the context of building predictive services, defining the objective function with a clear metric that’s ideally back-testable is half the battle. It provides a foundation for the rest of the work, which entails applying the simplest approach and iterating until performance converges to a threshold set by business needs. The art comes in how to define the ML problem in a way that aligns with business outcomes (ideally tracing all the way through to the top or bottom line).
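
As one hedged way to make the objective back-testable (the column names, Ridge model, and MAE metric below are illustrative assumptions): train only on the past at each step and score on the future, which yields a single number the team can iterate against.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit


def backtest(df: pd.DataFrame, feature_cols: list[str], n_splits: int = 4) -> float:
    """Average MAE over expanding, time-ordered splits: fit on the past, test on the future."""
    df = df.sort_values("date").reset_index(drop=True)
    errors = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(df):
        train, test = df.iloc[train_idx], df.iloc[test_idx]
        model = Ridge().fit(train[feature_cols], train["y"])
        errors.append(mean_absolute_error(test["y"], model.predict(test[feature_cols])))
    return sum(errors) / len(errors)  # one number the team can iterate against
```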

In terms of when is good enough, that depends on the need. Running a company is kind of like trying to solve an NP-hard resource optimization problem. We have to be rigorous about ROI for each initiative that we spend energy on.

In terms of managing expectations, it’s hard. It folds into longer term project and team planning. What is the business problem we’re trying to solve? What does success mean? Where do we need to be today, a quarter from now, a year from now? Who are the stakeholders? How should we provide updates and receive feedback?

You’ve spoken about people not needing to be constrained by titles. Could you expand on this? What sort of additional skills should someone with ML skills be learning? What have you learned working at Opendoor?

Peadar is referring to this sequence of tweets.

The start of the mini tweet-storm.

On Being Boxed in by Titles

Titles should enable, not constrain. We are all problem solvers first. A title acknowledges that an individual is skilled in a certain area. But one shouldn’t let that define their boundaries. When misused, titles can become an escape hatch to avoid doing the things that matter. For instance, a misinformed data scientist may think of “productionizing their insight” as an unnecessary implementation detail, while a misinformed software engineer may think of defining data quality SLAs for data systems as esoteric. In practice, there’s a metric to move, a question to be answered. Titles endow neither immunity nor magical problem-solving powers. What matters is clarifying the job to be done.

We are not Venn Diagrams.

In a PhD program, there’s a tendency to put blinders on and focus on one problem, specified by one professor in one department. In industry, solutions tend to be multi-disciplinary. We need more of a T-shaped individual (deep expertise in one area, complemented by breadth in adjacent ones so they can collaborate effectively with teammates), vs. an I-shaped individual.

A lot of what we do as data scientists is take human intuitions and generalize them, seeing which withstand a backtest or an experiment. To do this well, we need to be open to new ideas and continuously develop new skills. As alluded to earlier, some of the key ones are (a) business intelligence and (b) software engineering (including frontend!).

But I would be remiss not to mention the following:

Scrappy + Pragmatic + Business Acumen > Technical Expertise

The best data scientists are relentlessly resourceful and impact- / solution-oriented. The mindset shifts from “I need to gain skill x” to “I am going to solve problem y”; from “not my job” to “run towards where the impact is”.

About Opendoor

It’s been an incredible journey so far at Opendoor. We are on a mission to empower everyone with the freedom to move by building a seamless, end-to-end customer experience that makes buying and selling a home stress-free and instant. The experience of growing our team, scaling up as a leader and serving thousands of customers has been really rewarding. It’s the perfect blend of crazy-hard technical challenges and creating positive impact in people’s lives.

We’re only getting started at Opendoor! If any of this seems exciting, check out http://opendoor.com/jobs, or email me at ian@opendoor.com.
