Software Engineering Invades Data Science — Notes From DataEngConf
When I entered the lobby of DataEngConf in San Francisco (April 7–8, 2016), the first thing I had to do at the registration desk was to decide if I wanted my badge to be green (Data Science track) or orange (Data Engineering track).
I wondered what the percentage of Engineering badges in Science track talks, and vice versa, would be. Too lazy to get all data-scientific on this (taking pictures and counting attendees at every talk while dealing with some badges being reversed or obstructed and people going in and out), I chose a “qualitative approach”. In every talk I attended, I looked around and saw a healthy mix of green and orange badges.
If the conference’s goal was cross-pollination between the engineering and science aspects of working with data, it lived up to it.
It also raises the question of whether the separation of people working in the field into Data Engineers and Data Scientists is becoming less and less grounded in reality.
I noticed a thread that went through several talks at DataEngConf that I find interesting:
Data Science in the real world is a product design and engineering discipline.
A Data Science team is producing data products that have to run in a world with imperfect information and make decisions that affect people in real ways. The products must not only do something useful, but also gain humans’ trust and make them feel good about using them. This requires a mix of product design, engineering and UX thinking in addition to algorithms and pipelines. As a result, Data Science teams adopt and adapt more concepts and processes from “normal software engineering”.
Most of the talks were fascinating, but I want to highlight the five that are especially relevant to this point.
Franziska Bell — Uber
Franziska Bell, the data science lead of the intelligent real-time monitoring team at Uber, described Argos, Uber’s internal monitoring system, which sounds like “a monitoring system to end all monitoring systems”. It tracks millions of time series and decides when to alert people without requiring manual configuration of thresholds. It does this by learning the normal fluctuations (day of week, hour of day, seasonality, etc.) and trying to distinguish between a “normal outlier” (e.g., a Giants game) and an outage.
This data product wakes up DevOps people and engineers at night. Too many false positives will cause the users to ignore the alerts. Too many false negatives will make the whole organization lose its trust in the system. What rates of false negatives and false positives are acceptable for an outage alerting system in a high-growth company with a particular size and geographic distribution of teams is a product design question. So far, 7 out of 10 alerts are true positives, meaning that they flag real outages, and Argos seems to be keeping its users’ trust.
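The idea of learning normal fluctuations instead of hand-setting thresholds can be illustrated with a small sketch. This is not how Argos works internally (the talk summary above does not say); it is a minimal example, assuming a baseline keyed by (weekday, hour) and a median/MAD deviation score. The function names and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import median


def fit_baseline(history):
    """Learn a per-(weekday, hour) baseline from (weekday, hour, value) samples.

    Returns {(weekday, hour): (median, median_absolute_deviation)}.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)

    baseline = {}
    for key, values in buckets.items():
        center = median(values)
        mad = median(abs(v - center) for v in values)
        baseline[key] = (center, mad)
    return baseline


def is_anomalous(baseline, weekday, hour, value, threshold=5.0):
    """Flag a value that deviates from its own time slot's norm.

    `threshold` (hypothetical) is the number of MADs tolerated; raising it
    trades false positives for false negatives.
    """
    center, mad = baseline[(weekday, hour)]
    scale = max(mad, 1e-9)  # avoid division by zero for flat series
    return abs(value - center) / scale > threshold


# The same value can be normal in one time slot and alarming in another.
history = [(0, 9, v) for v in (100, 102, 98, 101, 99)]   # Monday 09:00
history += [(5, 3, v) for v in (10, 12, 8, 11, 9)]       # Saturday 03:00
baseline = fit_baseline(history)
print(is_anomalous(baseline, 0, 9, 100))  # False
print(is_anomalous(baseline, 5, 3, 100))  # True
```

The single `threshold` knob is where the product design question shows up: it decides how many engineers get woken up for nothing versus how many outages go unnoticed.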
Michael Manapat — Stripe
Michael Manapat manages Stripe’s Machine Learning Products team. He spoke of his team’s experience building the fraud detection system.
Michael’s team had to make sure that Stripe’s customers could live happily with their data product and trust it. Two questions they had to address: how to communicate to the customer (the merchant) why a transaction was predicted to be fraudulent, and whether to give the customer levers to control their risk tolerance and, if so, how to communicate what those levers mean.
Since the fraud detection system kept evolving, the machine learning model had to be managed like any other code artifact. The team introduced versioning and worked on making each version of the model reproducible, which means saving not only the source, but also the training set a version was trained on.
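To make the reproducibility point concrete, here is a minimal sketch of a version record that pins a model to both its source revision and the exact training set. It is an illustration only, not Stripe's tooling; the record layout, field names, and functions are hypothetical.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 hex digest of the training rows, independent of row order."""
    digest = hashlib.sha256()
    # Serialize each row canonically, then sort so ordering does not matter.
    for line in sorted(json.dumps(row, sort_keys=True) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def make_version_record(code_revision, training_rows, hyperparameters):
    """Bundle what is needed to retrain the same model later (hypothetical layout)."""
    return {
        "code_revision": code_revision,
        "training_set_sha256": fingerprint(training_rows),
        "hyperparameters": hyperparameters,
    }


rows = [
    {"amount": 120, "country": "US", "fraud": 0},
    {"amount": 9500, "country": "US", "fraud": 1},
]
record = make_version_record("abc123", rows, {"max_depth": 4})
print(json.dumps(record, indent=2))
```

Storing the digest next to the code revision lets anyone check later that a retrained model really saw the same data, provided the rows themselves are archived somewhere.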
Another aspect of managing a model in production: how do you keep training a model on data from a world that the model itself has affected? Stripe addressed this by letting through some transactions that were predicted to be fraudulent in order to gain counterfactual evidence, and then weighting these samples when training the next version.
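One common way to do this kind of weighting is inverse propensity weighting: if a flagged transaction is let through with probability p, each such observed outcome stands in for 1/p flagged transactions. The summary above does not say which scheme Stripe used, so treat the following as a sketch under that assumption; the function names and the `explore_probability` parameter are hypothetical.

```python
import random


def decide(score, threshold, explore_probability, rng=random):
    """Return (flagged, allowed) for one transaction.

    Flagged transactions are still allowed with a small probability so that
    their true outcome can be observed later.
    """
    flagged = score >= threshold
    allowed = (not flagged) or rng.random() < explore_probability
    return flagged, allowed


def training_weights(transactions, explore_probability):
    """Compute a sample weight per transaction for the next training run.

    Each transaction is a dict with boolean keys "flagged" and "allowed".
    Blocked transactions have no observed outcome and get weight 0.
    """
    weights = []
    for t in transactions:
        if not t["allowed"]:
            weights.append(0.0)  # outcome never observed
        elif t["flagged"]:
            # Represents all the flagged transactions that were blocked.
            weights.append(1.0 / explore_probability)
        else:
            weights.append(1.0)
    return weights


log = [
    {"flagged": False, "allowed": True},
    {"flagged": True, "allowed": False},
    {"flagged": True, "allowed": True},
]
print(training_weights(log, explore_probability=0.25))  # [1.0, 0.0, 4.0]
```

Most training libraries accept per-sample weights, so the output can usually be passed straight into the fitting step.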
Tommy Guy — Microsoft
Tommy Guy of Microsoft’s Analysis and Experimentation team described how the team uses a concept borrowed from “normal software engineering”: asserts. They created a domain-specific language that allows a data scientist to explicitly state assumptions about her input data. Tommy called it Data Asserts. Some examples of data asserts: two data sets can be joined, field X has no null values, field Y contains distinct values, field Z is normally distributed with a certain mean, and my favorite, sanity of timestamps.
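The talk did not show the DSL's syntax, so the following is only a rough Python analogue of a few of the asserts listed above. It is a minimal sketch, assuming rows are plain dicts; every function name here is hypothetical.

```python
from datetime import datetime


def assert_no_nulls(rows, field):
    """Every row must have a non-null value for `field`."""
    bad = [i for i, row in enumerate(rows) if row.get(field) is None]
    assert not bad, f"{field} is null in rows {bad}"


def assert_distinct(rows, field):
    """No two rows may share a value for `field`."""
    values = [row[field] for row in rows]
    assert len(values) == len(set(values)), f"{field} contains duplicates"


def assert_joinable(left, right, key):
    """Every key on the left side must have a match on the right side."""
    right_keys = {row[key] for row in right}
    missing = sorted({row[key] for row in left} - right_keys)
    assert not missing, f"no match on {key} for {missing}"


def assert_timestamps_sane(rows, field, earliest, latest):
    """Timestamps must fall inside a plausible window."""
    bad = [row[field] for row in rows if not earliest <= row[field] <= latest]
    assert not bad, f"{field} outside [{earliest}, {latest}]: {bad}"


users = [{"user_id": 1}, {"user_id": 2}]
events = [
    {"user_id": 1, "ts": datetime(2016, 4, 7, 9, 0)},
    {"user_id": 2, "ts": datetime(2016, 4, 8, 17, 30)},
]

assert_no_nulls(events, "ts")
assert_distinct(users, "user_id")
assert_joinable(events, users, "user_id")
assert_timestamps_sane(events, "ts", datetime(2016, 1, 1), datetime(2017, 1, 1))
print("all data asserts passed")
```

Running such checks at the top of an analysis makes a violated assumption fail loudly instead of quietly skewing the result.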
Sharath Rao — Instacart
Sharath Rao is a data scientist/engineer at Instacart where he works on Recommendation Systems, Search Relevance and NLP. His talk covered building a product recommendation engine that helps users discover new products.
It was interesting to hear how much careful thought went into defining the success metrics: what are we trying to optimize, and how do we know if it’s working?
In addition to the relatively well-known product recommendation use case, Sharath’s system performs a role unique to Instacart: recommending replacements when the exact product that a customer ordered is missing from the store. The recommender has to communicate with (and gain the trust of) two groups of users: Instacart’s customers and Instacart’s shoppers, the employees who go to the stores and pick the products.
Customers are given the option to let the shopper pick the replacement or to search for and choose the desired replacement themselves.
Giving users control over the replacements actually made them more likely to let the shoppers (with the recommender system’s help) pick replacements. A consistent history of successful replacements strengthened their trust in the system.
Sameera Poduri — Jawbone
Sameera is on Jawbone’s Data Science team, and it was particularly nice to hear her talk because I used to be a part of the great team of Data Scientists and Engineers who built some of the products she talked about. Jawbone’s Smart Coach addresses a common complaint about fitness trackers: that just recording and displaying data is not useful enough. Smart Coach serves insights and concrete behavior change recommendations to UP users.
Since this data product interacts with humans and tells them what to do, a lot of findings from behavioral science are used to design the interactions. For example, when the app suggests a change in their routine, users are always given a way to say no. This preserves their autonomy and makes them more likely to accept the recommendation.
The pace of experimentation went up when the team designed a pipeline that allowed the writers and product managers to come up with a new recommendation idea, target it to a particular segment of users, and make it appear in the app.
Looks like Clarke’s Third Law adapted to Data Science should read:
Any sufficiently advanced Data Science team is indistinguishable from a software product development team.
— — —
Eugene leads Data Science at Directly, an on-demand customer support service used by AirBnB, Pinterest, Nextdoor, and others.
You should follow him on Twitter here: https://twitter.com/eugmandel