Optimizing Websites with Data Science

Lessons learned from millisecond-latency production environments.

Josh Izzard
Jul 30, 2017 · 6 min read

Intro

Red Ventures is a digital marketing company that applies data to customer acquisition on behalf of our partners. With the growth of our data science team over the last few years (resulting in an award-winning submission to last year’s NCTA Awards), we’re now using hundreds of data points to make real-time decisions with predictive models at each step of our marketing funnel. Integrating predictive models into each part of our funnel has been a learning process, though, with interesting engineering insights coming out of each integration. This post describes how we optimize our customer experience websites with data science, and shows how we overcame some technical hurdles by adapting our data and modeling infrastructure.

Some RV Context

At Red Ventures, we manage the customer’s entire lifecycle on behalf of our partners. This lifecycle, or funnel, greatly simplified, looks something like this:

  1. Searching for a product (Googling “internet in my area”).
  2. Visiting a website we run and optimize (I promise ours look prettier than the picture 😅).
  3. Calling a phone number on the website and getting routed to our sales center.
  4. Ordering from one of our sales agents.
An example customer funnel.

When a customer shows up on one of our websites, historically they would land in an A/B test that one of our Customer Experience analysts had set up. A/B testing allowed us to learn our way into website creative that was “on average” better for a segment of the population as a whole.

However, as data scientists, we believe that for each person, there’s a single website that is right for that person at that point in their buying process. You’re a “segment of one”, so to speak. So we set up a modeling infrastructure that would allow us to train models and serve personalized websites for each new visitor.


Modeling Website Creative

A simplified view of our “modeling cycle” looks like this:

How data flows through our modeling process.

To provide more details about each step:

  1. An in-house application lives on the website and captures (1) customer attributes like device OS, screen size, and geographic location, and (2) which creative was shown to the customer. This data lands in our reporting databases.
  2. These attributes then become the features that our machine learning models train on. We have a CI/CD system that kicks off model rebuilds every night, so that Tuesday’s models incorporate Monday’s data. These models are built in R and Python by our data scientists.
  3. R and Python are great languages for model building, tuning, and validation, but they’re not easy to integrate into an HTML- and JavaScript-based website because of the latency demands and the differing runtimes. We’ve solved both problems by serializing our models into a format called PMML (Predictive Model Markup Language), which we then deploy to a Scala-based production scoring environment.
  4. The websites can then talk to our models with a simple HTTP POST request, and it’s super fast, which makes our front-end developers happy 😃.
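To make step 4 concrete, here’s a minimal sketch of the kind of JSON payload a front end might POST to the scoring service. The endpoint, field names, and creative IDs are hypothetical, not our actual schema:

```python
import json

def build_score_request(attrs, creative_id):
    """Assemble a JSON body a front end might POST to the scoring service.

    `attrs` holds customer attributes captured on the website (step 1);
    field names here are illustrative, not our production schema.
    """
    payload = {
        "features": {
            "device_os": attrs.get("device_os"),
            "screen_width": attrs.get("screen_width"),
            "geo": attrs.get("geo"),
        },
        "creative_shown": creative_id,
    }
    return json.dumps(payload)

body = build_score_request(
    {"device_os": "iOS", "screen_width": 375, "geo": "NC"},
    creative_id="hero_v2",
)
print(json.loads(body)["features"]["device_os"])  # iOS
```

In production this body would be sent as an HTTP POST to the Scala scoring environment, which responds with the creative decision.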

The same visual, but with the relevant technology stacks filled in, looks like this:

Our modeling cycle, with tech filled in.

However, you’ll note we added a scary-looking “ETL” step between the website and our SQL Server reporting databases. More on this below.


ETL, Real-Time Scoring, and Bad Model Decisions

One of the biggest challenges of real-time predictive model scoring is this: the data your models train on can differ from the data your models score in real time. Arguably this is the single biggest thing to get right if you want to operate real-time model scoring successfully; figure out the latency problems later. Without consistency between the data your models are requested with and the data they train on, you’ll chase your tail endlessly.

What kinds of things can be different between these two data sources? Here are a few:

  • Some fields that you expect to be non-null from the training data may be null real-time due to a bug in the system providing that data.
  • The format and structure of the data may change: our models are requested with JSON that may be nested and cannot be transformed directly into the flat row of training data.
  • Nulls are handled differently between SQL Server and JSON.
  • Transformations that are applied in the training process (ex: “impute missing values in numeric columns with the median”) must be applied real-time by our PMML model scoring engine.
Real-time scoring versus batch model building.
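A minimal sketch of closing this gap at score time: flatten the nested request JSON into the flat column names the model trained on, and apply the same median imputation the training pipeline used. Field names and median values here are illustrative assumptions, not our real schema:

```python
import json

# Medians computed during training; the scoring path must reuse
# these exact values, or train/serve behavior will diverge.
TRAINING_MEDIANS = {"screen_width": 1280, "session_seconds": 42}

def flatten(obj, prefix=""):
    """Flatten nested JSON into the flat column names used in training."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row

def prepare_row(event_json):
    row = flatten(json.loads(event_json))
    # Impute missing numerics exactly as the training pipeline did
    # (this is also what a PMML scoring engine does for us declaratively).
    for col, median in TRAINING_MEDIANS.items():
        if row.get(col) is None:
            row[col] = median
    return row

row = prepare_row('{"device": {"os": "Android"}, "screen_width": null}')
print(row["screen_width"])  # 1280 (null imputed with the training median)
```

This is exactly the category of logic that PMML lets us declare once and share between training and scoring, instead of hand-maintaining it in two codebases.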

This discrepancy meant that our model build process had to unwind the ETL shown in the diagram above: it had to mirror the ETL backwards to arrive at the untransformed data that would come off the website in real time, so that our real-time models knew what data format to expect.

But this process wasn’t scalable. To combat these difficulties, we looked at the root cause of the problem: fundamentally, our data store for the scoring environment was different from our data store for the training environment. We were trying to keep JSON events from a website in sync with rows in an RDBMS. Once we expressed it this way, a path forward was clear: store the exact same data in the real-time scoring data store as in the model training data store.


The Solution: Scoring Data = Training Data

Our criteria for a data store were simple: put the same piece of data in, and access it very quickly for model scoring or in batch for model training. We’ve arrived at a combination of DynamoDB, S3, and Spark that meets these criteria.

Dynamo and S3 can both accept event-level data, and Spark (specifically, Spark SQL) lets us ship code to the data on S3 and query it with performance similar to a traditional RDBMS. Dynamo solves our low-latency criterion, and S3 is what we use for offline batch model training. It’s horizontally scalable, so we don’t run into problems querying multi-billion-row SQL tables, and our DBAs can go back to sleep 😄.
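The acceptance criterion (“same piece of data in, fast key lookup out, plus batch access”) can be sketched with an in-memory stand-in for the Dynamo/S3 pair. This is a toy model of the idea, not the real AWS clients:

```python
class FeatureStore:
    """Toy dual store: a key-value table for real-time lookups
    (Dynamo's role) and an append-only event log for batch model
    training (S3's role)."""

    def __init__(self):
        self._kv = {}    # visitor_id -> latest features (low-latency reads)
        self._log = []   # every event, for offline model builds

    def put(self, visitor_id, features):
        # A single write path feeds both stores, so scoring and
        # training always see the exact same records.
        self._kv[visitor_id] = features
        self._log.append({"visitor_id": visitor_id, **features})

    def lookup(self, visitor_id):
        # Real-time scoring path: single-key read.
        return self._kv.get(visitor_id)

    def scan(self):
        # Batch training path: full event history.
        return list(self._log)

store = FeatureStore()
store.put("v1", {"device_os": "iOS", "screen_width": 375})
```

The key property is that `put` is the only write path: there is no separate ETL whose output can drift away from what the scoring side reads.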

When complete, our new and improved modeling cycle will look like this:

  • The website emits data to our data pipeline, which stores it in both Dynamo and S3 for model scoring and building.
  • Offline model building jobs will query this data out of S3 using Spark, and deploy models to our production scoring environment.
  • When a model is requested, the production scoring environment looks up the necessary features for the model in Dynamo, and returns an optimized decision — with latency in the tens of milliseconds to keep our front-ends happy 😁.
Modeling cycle — new and improved.
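Putting those three bullets together, the score-time path is just a feature lookup plus a model call. A hypothetical sketch, where a trivial rule stands in for the deployed PMML model and a dict stands in for the Dynamo table:

```python
# Hypothetical in-memory stand-in for the Dynamo feature table.
FEATURES = {"v1": {"device_os": "iOS", "screen_width": 375}}

def score(features):
    # Stand-in for the deployed model: prefer the mobile creative on
    # small screens. The real model is trained offline in R/Python,
    # serialized to PMML, and scored in the Scala environment.
    return "mobile_hero" if features["screen_width"] < 768 else "desktop_hero"

def decide(visitor_id):
    features = FEATURES.get(visitor_id)
    if features is None:
        return "default_creative"  # cold start: no features captured yet
    return score(features)

print(decide("v1"))  # mobile_hero
```

The whole request path is a key-value read and an in-memory model evaluation, which is how the tens-of-milliseconds latency target stays reachable.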

Wrap-Up

To summarize: doing real-time optimization of a customer’s sales funnel presents some interesting engineering challenges beyond the usual data problems every data scientist deals with. We’ve largely solved the serving side by making our model decisions available from an HTTP endpoint, but there were deeper data issues to work through. Only after we stopped trying to keep two sets of ETL consistent with one another, and started building infrastructure that uses the same data for real-time scoring as for batch model rebuilding, did we really start realizing the value of deploying predictive models at every step of our customer’s funnel.

If you’re interested in building predictive models at scale and working with an awesome team, come check us out at RedVentures.com.

Red Ventures Data Science & Engineering

Learnings from the Red Ventures Data Team. The views expressed are those of the authors and don’t necessarily reflect those of Red Ventures. https://www.redventures.com/

Josh Izzard

Written by

Data Scientist @Red Ventures

