How I scaled Machine Learning to a Billion dollars: Strategy

Saurabh Bhatnagar
14 min read · Apr 25, 2018


Rent The Runway is valued at $700 million and is well on its way to becoming a unicorn. I joined in 2012 to bring data science to what was then a small company of fewer than 50 people in the main office, supporting 2 million customers. At peak, we had two junior data scientists besides me. When I left in February this year, RTR was serving 8 million customers. The membership program that I was a founding member of, and returned to last year, had grown to nearly half of overall revenue. For comparison, a membership-only competitor currently has about 70 data scientists.

Scaling is never easy, not even for Amazon and Google. But it is certainly harder to do so thoughtfully and with fewer resources.

If you work for GooFaceApplezon, this series likely won’t be relevant to you. If you don’t, read on.

In this post, I will discuss strategy. The next one will be about choosing infrastructure and tech. And finally, I’ll talk about models, software and, importantly, the unique challenges of ML tests. You will learn something you can apply to your own work, depending on your seniority and role.

Data Science for retail?

This was not an obvious question in 2012. But RTR execs knew they were a different breed.

Like a traditional e-commerce company, RTR has inventory; customers come on site and need recommendations, pricing, coupons and so on, all the way to checkout. Unlike a traditional e-commerce company, however, that isn’t the end of the story.

The dress needs to be shipped out in time for the event date. To do this we have to get the dress back from the previous customer (who has some probability of being late), dry clean it (probability of failure), and then make sure we can take the order for the next person. Oh, and incidentally, we happen to be the largest dry cleaner in the world! That means supply-side logistics with inventory flowing between various stations.

Often called the Netflix of Fashion, one notable difference is that the stakes are much higher. If you watch the first 15 minutes of a movie and don’t like it, you skip it and your night is not ruined. If the dress doesn’t work for a customer for that wedding she is going to (or is the bride in), she likely won’t come back.

original image from Sandy Woodruff

Rent The Runway is a fashion-technology-engineering-supply chain-operations-reverse logistics-dry cleaning-analytics business.

Such problems need custom solutions to scale. We needed artisanal, hand-rolled ML. I was brought in because of my previous experience at Barnes & Noble.

Machine Learning at RTR

Here are a few data products that I coded and shepherded.

Carousel recommendations based on a user’s style for our membership program

As far as I know, the personalized ordering of carousels and products for fashion is unique to us (with a hat tip to Netflix and Spotify). It allows for near-realtime recommendations and fast personalization. To boot, I launched this in 30 working days by leveraging past work!

Personalized event recommendations for our à la carte business (done with Anthony, a very talented data scientist, now at Google Maps)

‘You may also like’ recommendations (with Anna, now doing ETL at Spotify)

‘Women Like Me’ sort to help with fit (with Kaleigh, now at Google)

Demand forecasting, inventory management, price prediction (many folks, most notably Anthony and Rob, one of the fastest learners I have met)

Search for a dress by image (with Gabe, work/travel balance guru, and Sandy, now at Google) and the browser extension (with Nizar, now at Betterment, and Sandy)

Using AI to pull in Instagram posts and refine them with humans before showing them as dress reviews (with Hindi, Caroline and Sam; read about it at https://sanealytics.com/2016/09/09/human-in-a-i-loop/)

And many more like inventory buying, queue solvers, warehouse allocation algorithms, etc. And all that runs every day without needing a lot of upkeep.

Complexity

The nice thing about working on something cool is that everyone wants to help. This is a double-edged sword.

Let’s say a certain data project has only two people (one data scientist and one product person). There only needs to be one line of communication. Things are simple, two minds as one.

Let’s add one more person (say, infra). Now there need to be three lines of communication. Everyone needs to be in step with everyone else.

How about with 4 people? It turns out we need 6 lines of communication. This is a bit of an exaggeration because not everyone really needs to talk to everyone else, but there are still complex dependencies. I’m a computer scientist first, so unfortunately I do big-O analysis for fun. Now, what if you add one more person or team?
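
For the curious, the number of pairwise lines of communication among n people is n(n-1)/2, which grows quadratically. A quick sketch:

```python
def lines_of_communication(n: int) -> int:
    """Pairwise lines of communication among n people: n choose 2."""
    return n * (n - 1) // 2

for n in range(2, 8):
    print(n, "people ->", lines_of_communication(n), "lines")
# 2 -> 1, 3 -> 3, 4 -> 6, 5 -> 10, 6 -> 15, 7 -> 21
```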

Take-home lesson: The complexity of communication grows quadratically with the number of people and teams involved

Strive for simplicity.

The same analysis can be done for codebases, servers, etc. You will see a recurring theme of preferring simple over complex. As far as teams go, your natural tendency will be to align with other engineers, but if you have to prioritize, make sure you spend more time with the product folks.

What should you work on?

Every process begins on the data side of the equation. What is worth solving is more important than how well you solve it. If it’s the right thing, the team will be behind you for the hack and the long haul to help figure out a better solution. If not, this is not worth pursuing. Find a new problem (sorry neural stain identification to speed up dry cleaning).

Once you are done understanding and analyzing what the problems are, you need to pitch them to the business stakeholders. I’m serious. Treat it like a VC pitch. It needs to be something that shows what the potential upside is (you have data, after all). Next, a quick proof-of-concept demo. It need only run on your laptop. Finally, you will get resources, time on the roadmap and fellow hobbits for the journey.

If you haven’t had to do this, it’s because someone has already fought this round for you. Thank them for they are noble.

For user-level recommendations five years ago, I had to stand up an MPP data warehouse (MySQL -> Vertica), write some hairy code to ingest large text files, and show that there were actually discrete clusters of users. I made an R Shiny app demo to show what their recommendations would be. I then tested it over email to show a huge lift (double digits) in click-through rate and get folks excited.

Ancient proverb: You can’t do data science without data

Once the product is on the roadmap, align the team on the metric you want to measure. Your hypothesis is that moving this metric will move the dollars. Make sure you state it this way and everyone nods on that ONE metric. This exercise eliminates complexity because you can’t tune your algo for everything. And the business needs to follow why you’re harping on about recall.

Oh, and BTW, the English language isn’t very helpful here, but plain English is what the message needs to be in (and repeated over and over again). I’ve learnt to stay away from accuracy. And I can never explain AUC in simple English. So that rules both of them out.

What you tune for internally is your problem. But this is the metric you are asking to be held to, and you need to validate that improving it actually helps the business (dollars).
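
To make that concrete, recall is one of the few metrics that survives translation into plain English: of the dresses she actually rented, how many did we put in front of her? A minimal sketch, with made-up item IDs (not our actual pipeline):

```python
def recall_at_k(recommended, rented, k=20):
    """Of the items the customer actually rented, what fraction
    showed up in our top-k recommendations?"""
    if not rented:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in rented)
    return hits / len(rented)

# She rented 4 styles; our top-5 carousel caught 2 of them -> recall 0.5
print(recall_at_k(["s1", "s2", "s3", "s4", "s5"], {"s2", "s4", "s7", "s9"}, k=5))
```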

Protip: Obsess over this but not too much… any metric you choose will be wrong over time. So pick one and move on.

Strategies for KISS (Keep It Simple Stupid)

This is fantastic. You finally have a well-defined problem, an objective and a team. This is almost like Kaggle. Now it’s time to impress everyone with that Reinforcement Learning Variational GAN you have been aching to try out.

No

Linear model first. I count logistic regression, or its Bayesian cousin, in the same breath. This needs to be the baseline you will beat over time (KISS approved).

Why is this important? Because in production, you have to build a lot more around it (we’ll come to that in the third post in the series). So you need to deploy something end to end. That means getting data, training the model, predicting, writing checks, shipping those predictions to the right services and measuring the impact via automated reports. There is simply too much to build.

Another little secret: when you’re competing against humans, linear models are already a big enough step up. They might give you the 80 for free; you’ll have to fight a lot harder for the remaining 20. And if you have been successful in creating a virtuous data cycle, even this linear model will get better over time simply because your business is growing.
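
What ‘linear model first’ can look like in practice is roughly this — a minimal sketch on synthetic data with hypothetical features; none of it is RTR’s actual code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Hypothetical features: engagement, item popularity, size match, price band
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))
y = (X @ np.array([0.8, 0.5, 1.2, -0.3]) + rng.normal(size=10_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline recall:", recall_score(y_test, baseline.predict(X_test)))
# Whatever fancier model comes later has to beat this number, end to end, in production.
```

The point is not the model; it is that the whole thing, data in to predictions out, can be shipped and measured in days.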

This brings me to another point. Data science, as it stands right now, is a research practice. Things take time, and they often fail. Most companies pump a lot of money into hiring expensive PhDs but don’t see the outcome. So they get frustrated and scale back on data science. Set appropriate expectations.

You need to separate research from production, which is business-speak for ‘cannot fail’. Linear models are easy to debug, rarely fail and scale very well. They also motivate you to go back later and demonstrably make things better (in terms of revenue) with the right model for the problem.

And besides, you don’t even know if the metric you want to improve has ANY impact on business. Do you truly understand the problem yet? Life is cruel.

Feedback

Your model and its impact are only as good as the data it has. To this end, treat UX and product as an extension of what you can optimize. Your goal should be to validate the metrics and build feedback loops that give you more data.

This means you might have to pause a project, too. For example, I had a novel approach to fit recommendations, demoed it with data extracted via NLP, and got the green light and a small team. We got it working and discovered that we had signal, but it wasn’t strong enough. We could have still launched, but we decided to hold back and collect more data. We redesigned the survey Unlimited members take when returning a product to gather this data directly.

Focus on feedback loops to give you more data.

Team

You have a success. Congratulations! After the high fives are over, where do you invest?

This is where a lot of companies go wrong, and grow the team too quickly.

One data product will have user interfaces on the web, in the app, in email, etc., all needing multiple engineers and UX designers to build. It will need more analysts to understand if it’s driving revenue, and it will generate more data that the ETL team will need to consume and sanitize. And that’s just the support system for one little algorithm you coded up.

I argue that you need more ETL folks, analysts and engineers than data science/ML people. This answer won’t be appropriate for all stages, so take it with a grain of salt.

Another issue is that, apart from long-running research (dress wear over time), data work tends to be spotty. A company of 20 engineers can run about 5 projects in a quarter. Only one of those will likely be the bet on machine learning. There are always other competing priorities.

You might have engineers and analysts who want to get into data science. It is always satisfying to mentor and figure out a problem together. However, everyone wants to work on recommendation systems or image processing. So it takes a lot of conversations to find a problem their interests align with, that they are excited to work on, and that is a business question with potential impact. Too hard a problem (predicting churn, which requires understanding a lot of variables) can kill the enthusiasm of a fledgling data scientist. If you look at the list above, all the projects have a co-pilot who maintains the product once it’s out in the wild. Focus on working with only one person per data product (keep N small).

Downtimes are also a good time to go back and improve those models. Go ahead and try those neural nets now. But most of your time should go into finding the next most important problem to work on. Don’t get attached to a problem. There is LOTS of low-hanging fruit. Quantify the impact in dollars and see if it’s worth pitching.

A big caveat is hiring. It is unlikely you will find one person who can do everything. I was the single point of failure for RTR for many years, which is obviously undesirable. It is easier to find folks with different skills so that the team as a whole has all the required talent. Again, these skills won’t be obvious when you start, so don’t grow too quickly. But definitely try to build redundancy as you scale.

I ran into this issue when hiring someone to lead the ETL team for a relatively unknown RTR at the time. It took me a year to find the right candidate, who happened to be an extremely talented, homegrown aspiring data scientist who found that she enjoyed ETL more. The answer was always team members with complementary skills.

So maybe look for both. You might be forced into one of those options if you are not known for data science. One funny thing is that graphic designers want to work at Apple, which already has great UX, not at, say, Amazon. People don’t always see the open green fields.

Partner with non-engineers. I started an internal RTR DeepDress team comprising everyone in the company curious about applying AI to fashion problems. The browser extension, the Instagram work and some other product ideas came from that. UX designers were more than happy to work on this because they were involved at ideation.

Good ideas come from everywhere. Refine.

Good fences make good neighbors

RTR’s backend is standardized on Java. I had originally written my models in C++ with R for analysis (Python data munging barely existed in 2012/13). I then moved to writing them in Java or Scala (my preference at the time) to integrate with the products engineering was building. The upside was that all I’d be writing was a class, and the data glue, etc. was all in engineering’s domain. So the barrier to entry was small.

That was a mistake.

For one, every model needs to be re-coded into Java, which is not fun. Second, once deployed, you lose control over it. Changing the model now means testing the entire service, and that’s not going to happen unless it is someone’s project.

The correct way is to treat ML as SaaS, even though it’s an internal team. I’ve slept better since.

So we need SLAs. Notably, the top few are (a sketch of how they might be written down follows the list):

  • Uptime SLA: Engineering might want 99% uptime, but remember ML is ‘research’ and we want to change it often, so we won’t be hitting that. Say the uptime SLA we commit to is 90%.
  • Latency SLA: The engineering service calls the ML service, but say the prediction is taking too long to come back. How long is too long? Let’s settle on 200ms. We might need to hit this one more often (say, 95% of the time).
  • Prediction SLA: How often do the recommendations get refreshed? The ‘Women Like Me’ model?
  • Realtime SLA: Does this data product need to be realtime? What would you lose if it were hourly? Daily?
  • The 2am question: Who wakes up when this fails? What do they do? This is a particularly important question for a team of me plus two.
  • Concurrency SLA: How many simultaneous requests does the service need to handle?
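
As referenced above, here is one way such a contract might be written down so it lives next to the code rather than in a meeting’s memory; the numbers and names are illustrative, not RTR’s actual format:

```python
# Hypothetical SLA contract for one ML service, checked into the repo
# alongside the model code so everyone can see what was agreed.
RECS_SERVICE_SLA = {
    "uptime": 0.90,                   # ML is 'research'; we change it often
    "latency_ms_p95": 200,            # engineering calls us; answer within 200ms, 95% of the time
    "prediction_refresh": "daily",    # how stale can recommendations get?
    "realtime": False,                # hourly/daily batch is acceptable
    "pager": "ml-team",               # who wakes up at 2am; the runbook says: restart
    "max_concurrent_requests": 500,   # made-up number for illustration
}
```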

This looks pretty horrible. No CTO will sign off on this contract, and remember, she is one of your investors, so you need her buy-in!

The first strategy is separation of concerns (KISS approved).

For Engineering, this means that they don’t need to understand the model. The runbook says: restart. And upon startup, the service will pick up the previous version of the model, which presumably worked a few hours ago. No one should have to debug ML code at 2am. Also, incidentally, you get to sleep in because it’s not really a problem if it heals itself.
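
A minimal sketch of that startup behaviour, assuming a hypothetical artifact directory and pickled models; the real plumbing depends on your stack:

```python
import pickle
from pathlib import Path

MODEL_DIR = Path("/srv/models/recs")  # hypothetical artifact directory

def load_latest_working_model():
    """On startup, try model artifacts newest-first and keep the first one
    that loads cleanly. A bad deploy just means we serve yesterday's model."""
    for artifact in sorted(MODEL_DIR.glob("model_*.pkl"), reverse=True):
        try:
            with artifact.open("rb") as f:
                return pickle.load(f)
        except Exception:
            continue  # corrupted or half-written artifact; fall back to the previous one
    raise RuntimeError("no usable model artifact found")
```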

For the ETL team, this means that we will create the data transformation jobs, schedule them and hand them off. Any ongoing maintenance from that point is theirs.

The second strategy is to code for reliability via checks and tests. If any check fails, the model deploy should fail. Sorry for the teaser, but I will talk about this in part 3.

Third, figure out graceful fallbacks. And fallbacks to those fallbacks. For example, if someone comes to the homepage and we haven’t computed her recommendations yet, what should she see? For a new item, who should it be recommended to? If the service is down, what should customers see?
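
Here is a sketch of what such a fallback chain might look like; the helper functions are hypothetical stand-ins for whatever your services actually expose:

```python
SITEWIDE_BESTSELLERS = ["style_101", "style_202", "style_303"]  # static, always-safe default

class ServiceUnavailable(Exception):
    pass

def get_personalized(user_id: str) -> list:
    # Hypothetical stand-in: in reality this hits the ML service or a cache.
    return []

def get_segment_popular(user_id: str) -> list:
    # Hypothetical stand-in: popular items among users like this one.
    return ["style_404", "style_505"]

def recommendations_for(user_id: str) -> list:
    """Personalized if we have them, otherwise degrade gracefully."""
    try:
        recs = get_personalized(user_id)       # precomputed for known users
        return recs if recs else get_segment_popular(user_id)
    except ServiceUnavailable:
        return SITEWIDE_BESTSELLERS            # service down: fall back to a static list
```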

Caching is a strategy here, but there are major caveats. For example, you want the engineering service to cache the unknown user’s recommendations every now and then, but once the recos are computed and the user is known, you need to tell the engineering service to invalidate that cache.

Another problem is synchronization amongst servers. If one server treats her as an unknown user and another has her previous recos, a refresh of the same page would show her a different set of products, an undesirable experience.

Partner with your engineering buddies and involve them. This is fun stuff and ultimately they have to maintain it.

Fourth, deploy daily, hourly, continuously. Your automated tests and checks are only as good as how often you run them. A small team is unlikely to re-run the model from last year, and will absolutely hate debugging an issue that originally happened two months ago but that no one noticed until now. A continuous deploy will catch the problem when it happens, and will revert thanks to the fallback mechanism above.

Another reason is to catch the more insidious problems. Sometimes upstream data, a new feature, or something else unknown to you will break your model. Catch these breaks when they happen.
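
I’ll get into checks properly in part 3, but as a flavour of the kind of guardrail a continuous deploy can run, here is a hedged sketch that refuses to ship if today’s predictions look nothing like yesterday’s (the threshold and names are made up):

```python
import numpy as np

def sanity_check(todays_scores: np.ndarray, yesterdays_scores: np.ndarray,
                 max_mean_shift: float = 0.10) -> None:
    """Fail the deploy loudly if the prediction distribution moved too much.
    A crude guardrail against silent upstream data changes."""
    if np.isnan(todays_scores).any():
        raise ValueError("NaNs in today's predictions; refusing to deploy")
    shift = abs(todays_scores.mean() - yesterdays_scores.mean())
    if shift > max_mean_shift:
        raise ValueError(f"mean prediction shifted by {shift:.3f}; refusing to deploy")

# If this raises, the deploy fails and the service keeps serving yesterday's model.
```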

Fifth, whatever tech/model/language you choose will be out of fashion next month. Data science is a fast-moving field; even within deep learning, there were at least 5 different libraries last year. So design for parts to be switchable. We have things running in R/C++, Python and Scala. I’ll talk about this more in the second and third parts of the series.

Sixth, ML can’t solve everything. Hopefully you caught the edge cases during the model build phase (large sizes, low inventory, little history, etc.). Sometimes you will rightly ignore a problem because it is not big enough, or you will put guardrails around it (ideal). Either way, it needs to be surfaced on a report.

This brings me to the seventh, and final, strategy for this post: what gets measured gets fixed. Alternatively, make it someone else’s problem. A one-off, complicated data product that depends on data only that product uses is a recipe for disaster. Keep the data model consistent with what someone else is using (the view of customers, products, etc.). This way, when they find an issue and fix it, your problem gets fixed as well.

At the very least, make a report that tracks the metric you agreed on alongside the actual business impact, and check whether the metric still tracks it. This should live on a report that folks regularly use, so you don’t have to.

I am now working on my own retail AI startup, Virevol. We are currently hiring for product, sales, UX, front-end engineering and ML (of course). I usually write at https://sanealytics.com/ and tweet at https://twitter.com/analyticsaurabh.

A very short version of this was given as a talk at GTC 2018, NVIDIA’s AI conference. With encouragement from Nikolai Yakovenko, I have turned it into a series of blog posts.

See you next time…


Saurabh Bhatnagar

Making palatable, meaningful data products from terabytes of data #ai #ml #stats. Sr Data Scientist at Rent The Runway, previously Barnes & Noble, Unilever,..