Uday Babbar, Senior Data Engineer | Limin Yao, Senior Data Engineer | Fan Jiang, Senior Data Engineer | Lucile Lu, Staff Data Scientist | Juzheng Li, Engineering Manager
The first two parts (here and here) covered the fundamental constructs that make up Phoenix. To briefly recap: we use remote configuration (via Levers) and ground control to tailor the behavior of a given variant to the end user. We then collect metrics and compute insights to establish the relative performance of variants using the results workflow, which is the subject of this post.
Designing from first principles
Tinder is a complex and dynamic product — the complexity stemming from the expansive product surface area and dynamism due to the rapid evolution of our product offerings.
Each product component can behave differently in different contexts (geos) and interacts with other components in complex ways. A system of this nature therefore needs a diverse and exhaustive set of dimensions to quantify its state of wellbeing, or “health”, at any given point (or an OEC that appropriately captures all those dimensions, which we may cover in a future post :)). The rapid evolution, on the other hand, matters because you need to test more and test faster for the product to evolve toward its promise. Mike Moran (1) captured the same idea accurately, if slightly inelegantly:
“If you have to kiss a lot of frogs to find a prince, find more frogs and kiss them faster and faster.”
Though we don’t have a lot to do with either frogs or princes, we do care deeply about our members. And the path to maximizing our member community wellbeing goes through a framework that removes/reduces barriers to product innovation and tracks and improves the ecosystem's wellbeing at all times.
Building a safety net
Different teams can be working on different or related areas at any given time, while the organization as a whole focuses on distinct product areas at different times. Keeping track of the novel surface area that is continuously being added to the product should also not be prohibitive. Experimentation is supposed to serve as a safety net that enables and expedites this controlled evolution of the product.
What does all that dictate about the design?
The barrier to onboard a new metric should be low as far as the pipeline is concerned. This involves reducing the effort involved in integration and minimizing the turnaround time from integration to results being consumed on the UI. For us, the existing preferred tool for experimentation analysis was Redshift. Thus we sought to provide appropriate interfaces that push data through the pipeline once it’s defined as a rollup using a Redshift query. This allowed experiment owners to create a diverse set of metrics — from custom time windows to measure short-long term retention to very granular revenue metrics — easily. As long as they conformed with the basic contract, the custom metrics propagated seamlessly to the result page.
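The post does not spell out what the "basic contract" contains. Purely as an illustration, the sketch below assumes a hypothetical contract in which every rollup row carries a member id, a date, a metric name, and a numeric value; the column names and the `contract_violations` helper are assumptions, not the actual Phoenix schema.

```python
from datetime import date

# Hypothetical contract: column name -> expected Python type.
# These names are illustrative only; the real contract is not given in the post.
REQUIRED_COLUMNS = {
    "member_id": str,
    "metric_date": date,
    "metric_name": str,
    "value": float,
}


def contract_violations(row):
    """Return a list of human-readable problems for one rollup row."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
    return problems


good_row = {
    "member_id": "m1",
    "metric_date": date(2021, 1, 1),
    "metric_name": "d7_retention",
    "value": 1.0,
}
bad_row = {k: v for k, v in good_row.items() if k != "value"}
```

A check along these lines could run at integration time, so that a malformed rollup is rejected before it reaches the results page.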
Reducing the barrier to creating a metric, however, creates other issues such as metric proliferation, which invites selection bias: people tend to pick and choose results for certain metrics to substantiate a preconceived idea about the effects of a feature, or simply declare success when the available evidence supports no conclusion. We handle this with a mixture of embedded system invariants (multiple comparison correction) and experimentation guidelines (define success narrowly and concretely by pre-stating the target metrics that form the core of the hypothesis), which we touch upon later.
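The post does not say which multiple comparison correction is used. As one common option, here is a minimal sketch of the Benjamini-Hochberg procedure; the metric names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Map each metric to True if it stays significant after BH correction.

    p_values: dict of metric name -> uncorrected p-value.
    """
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            cutoff_rank = rank
    # Every metric ranked at or below the cutoff is declared significant.
    return {
        name: rank <= cutoff_rank
        for rank, (name, _) in enumerate(ranked, start=1)
    }


# Hypothetical p-values for four metrics in one experiment.
p_values = {
    "retention_d7": 0.004,
    "revenue_per_member": 0.03,
    "swipes": 0.04,
    "matches": 0.20,
}
significant = benjamini_hochberg(p_values)
```

In this made-up example, three metrics fall below 0.05 before correction, but only `retention_d7` survives it, which is the kind of cherry-picking the correction is meant to guard against.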
Because the surface area under innovation is large, we expect steady-state experiment throughput to be high. The concept of an experiment family helps in this regard. It promotes concurrency and uses the existing member base efficiently by allowing members to be in multiple experiments and by enabling experiments to run in parallel on separated member traffic. To support the elevated experiment throughput on the computation side, we batch semantically similar metrics together for processing (per-metric-batch processing as opposed to per-experiment or per-metric processing).
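One way to picture traffic separation is the sketch below. It assumes, as an interpretation rather than a description of Phoenix internals, that experiments within a family own disjoint bucket ranges while different families hash members independently. The family names, experiment names, and the `bucket` and `assign` helpers are all hypothetical.

```python
import hashlib


def bucket(member_id, family, buckets=100):
    """Deterministically map a member to a bucket, salted by family name."""
    digest = hashlib.sha256(f"{family}:{member_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def assign(member_id, families):
    """Return the experiments a member falls into, at most one per family.

    families: {family_name: [(experiment_name, start, end_exclusive), ...]}
    """
    assigned = []
    for family, experiments in families.items():
        b = bucket(member_id, family)
        for name, start, end in experiments:
            if start <= b < end:
                assigned.append(name)
                break  # bucket ranges within a family are disjoint
    return assigned


# Hypothetical setup: two families, each splitting buckets 0..99.
FAMILIES = {
    "onboarding": [("exp_a", 0, 50), ("exp_b", 50, 100)],
    "paywall": [("exp_c", 0, 100)],
}
example = assign("member-42", FAMILIES)
```

Under these assumptions a member can be in `exp_c` together with either `exp_a` or `exp_b`, but never in both `exp_a` and `exp_b`, because those two share a family.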
It was our aim to capture these intentions while designing and implementing the Results workflow (or Metrics system). Let us now discuss the nature of the pipeline so that we get a better idea about how the pipeline further supports the aforementioned intentions.
We use an offline Spark batch processing pipeline to generate experiment results daily. The Spark cluster runs on AWS on-demand EMR instances.
The pipeline can be segmented into 3 major components:
- Experiment Assignment — deals with resolving different identifiers that are used to carry out an experiment and generating experiment assignment information
- Metrics — deals with generating metrics from all supported sources (s3 and Redshift)
- Aggregation / Stats — deals with generating summary/time series results
In total, over 80 Spark jobs are scheduled in the experimentation DAG and run daily. Segmentation into the components above helps with checkpointing and maintainability. For instance, the experiment assignment component started with 2 jobs and has 25 now. Supporting different experimentation denominations adds complexity. For example, pre-auth and post-auth experiments operate on different units, with differing characteristics and different information available about them. Another example is bot removal, which the preliminary version of the component did not do. This and the complexity from similar use cases have been absorbed into the experiment assignment component over time.
For data storage, the offline pipeline primarily uses S3. As mentioned before, we import metrics from Redshift, so there is an interface to import metrics from Redshift to S3 (using the Databricks Spark-Redshift interface). For serving results, we export the computed results to an OLTP store (RDS in our case).
We use Fireflow for scheduling, dependency management, and basic reporting/alerting of jobs. Fireflow consists of two major components: Airflow as the core engine, and a Fireflow service layer that provides seamless integration with major AWS data services such as EMR applications, S3, MySQL, and Redshift. Using Grafana, we also provide real-time experiment assignment monitoring that gives experiment owners immediate visibility into experiment progress.
We generate two kinds of results, a time series view and a summary stats view, described below:
The time series view provides useful qualitative data, especially relevant for observing long-term trends (as in holdouts). It comes in two flavors:
- Daily per member/day
A sample time series result for a metric (name and values hidden)
We use this view to provide insight into real-time and daily experiment progression. It is pivotal in detecting experimentation setup errors early in the workflow.
The underlying motivation of the summary stats view is to provide signal about the relative performance of competing variants: it qualifies that performance with significance statistics and quantifies it with lift and mean. The view assists in selecting the winning variant by letting you configure a reference variant against which all remaining variants are compared.
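To make mean, lift, and significance concrete, here is a minimal sketch that compares one variant against a reference. The post does not state which statistical test Phoenix uses; this sketch assumes a two-sided z-test on the difference of means, which is only reasonable for large samples. The function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def compare_to_reference(reference, variant):
    """Return mean, lift, and a two-sided p-value for variant vs. reference.

    Assumes a normal approximation (large samples) and a non-zero
    reference mean; both are simplifications made for this sketch.
    """
    ref_mean, var_mean = mean(reference), mean(variant)
    lift = (var_mean - ref_mean) / ref_mean
    # Standard error of the difference between the two sample means.
    std_err = sqrt(
        variance(reference) / len(reference) + variance(variant) / len(variant)
    )
    z = (var_mean - ref_mean) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"mean": var_mean, "lift": lift, "p_value": p_value}


# Tiny made-up samples, just to exercise the function.
result = compare_to_reference([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

Changing the reference variant only changes which sample is passed as `reference`; every other variant is then re-expressed relative to it.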
It is imperative to track the impact of a feature on different cohorts because of potential heterogeneous treatment effects. Kohavi et al. (2) motivate this idea with an example in which Bing mobile ads exhibited different click-through rates on different platforms, as shown in the image below:
It’s tempting to formulate stories around the loyalty of users (Windows Phone with Bing) and how those populations differ; an investigation, however, revealed that the difference was due to inconsistent click-tracking methodologies between the operating systems (Twyman’s law).
The first step in addressing heterogeneity is therefore to measure it, which the summary stats view enables. It selects (filter), segments (generates distribution), and quantifies relative performance (mean, lift) between competing variants in an experiment.
- Filter: This allows you to select a desired population.
- Group By: This allows you to segment the population selected by the filter.
- Relative variant performance: This is enabled by configurable reference variant selection and provides lift and significance with respect to the selected reference variant.
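The three operations above can be sketched over a handful of in-memory rows. This is an illustration under assumed field names (`variant`, `platform`, `country`, `value`), not the actual Phoenix schema or query layer, and it leaves out significance to keep the focus on filtering and segmentation.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows, where, group_by, reference):
    """Filter rows, segment them, and compute mean and lift per variant.

    Returns {segment: {variant: {"mean": ..., "lift": ...}}}. Segments with
    no rows for the reference variant are skipped.
    """
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        if where(row):  # Filter: keep only the desired population
            values[row[group_by]][row["variant"]].append(row["value"])

    summary = {}
    for segment, by_variant in values.items():  # Group By: one entry per segment
        ref_values = by_variant.get(reference)
        if not ref_values:
            continue
        ref_mean = mean(ref_values)
        summary[segment] = {
            variant: {"mean": mean(vals), "lift": mean(vals) / ref_mean - 1}
            for variant, vals in by_variant.items()
        }
    return summary


# Made-up rows: one metric value per member.
rows = [
    {"variant": "control", "platform": "ios", "country": "US", "value": 10.0},
    {"variant": "treatment", "platform": "ios", "country": "US", "value": 12.0},
    {"variant": "control", "platform": "android", "country": "US", "value": 20.0},
    {"variant": "treatment", "platform": "android", "country": "US", "value": 19.0},
    {"variant": "treatment", "platform": "ios", "country": "BR", "value": 50.0},
]
summary = summarize(
    rows,
    where=lambda r: r["country"] == "US",
    group_by="platform",
    reference="control",
)
```

In this toy data the treatment shows a positive lift on iOS and a negative one on Android, which is exactly the kind of heterogeneity a single aggregate number would hide.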
A pragmatic balance — processes and automation
While designing any system, our approach is to make things easy to use correctly and difficult to use incorrectly. As an extension to that tenet, we can seek to embed all sorts of invariants within the system to prevent unintentional incorrect usage. That would have been fine if there weren’t prohibitive costs associated with implementing some invariants (e.g. correction for peeking). That is still the long-term goal but as a way to provide continuous and tangible value, we try to strike a balance between processes (experimentation workflow that involves a human) and automation (embedding the invariant within the system). An example of such a tradeoff would be the following:
It’s recommended that the formulation of the hypothesis should be as narrow and precise as possible. In the wise words of Mr. Feynman (3),
“I must also point out that you cannot prove a vague theory wrong.…If the process of computing the consequences is indefinite, then, with a little skill any experimental result can be made to look like an expected result. ”
He goes on to demonstrate the idea with a delightful example (available in the references). Coming back to hypothesis formulation: for most cases, it should be precise, with 2–3 success metrics defined in the experiment creation workflow (process). Enforcing this as a system invariant would add complexity, since some tests might just be learning tests with relaxed guidance for hypothesis formulation. So it’s more pragmatic to enforce it as an experimentation best-practice process.
On the other hand, to discourage peeking and keep novelty effects from biasing the conclusion, we have embedded color-coding mechanisms that show statistical significance only once 2 weeks have passed since the experiment start date (embedded system invariant). This helps teams use a predetermined experiment duration for evaluating statistical significance. Other companies such as Google, LinkedIn, and Microsoft use a similar approach to avoid peeking pitfalls (4). Setting a time horizon doesn’t, however, mean that we don’t try to detect and address oddities earlier in the funnel. We use real-time assignment monitoring dashboards to safeguard against instrumentation errors and experiment mis-setup (again, an embedded system invariant).
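A minimal sketch of such a gate is shown below. Only the 2-week threshold comes from the text; the color names, the `alpha` default, and the `significance_color` function are assumptions made for illustration.

```python
from datetime import date, timedelta

MIN_RUNTIME = timedelta(weeks=2)  # threshold taken from the text above


def significance_color(start_date, today, p_value, lift, alpha=0.05):
    """Return a display color for one metric cell on the results page.

    Hypothetical color scheme: "neutral" hides significance, "green" marks a
    significant positive lift, "red" a significant negative lift.
    """
    if today - start_date < MIN_RUNTIME:
        return "neutral"  # too early: do not reveal significance yet
    if p_value >= alpha:
        return "neutral"  # enough runtime, but no significant difference
    return "green" if lift > 0 else "red"


start = date(2024, 1, 1)
early = significance_color(start, date(2024, 1, 10), p_value=0.001, lift=0.08)
on_time = significance_color(start, date(2024, 1, 15), p_value=0.001, lift=0.08)
```

Because the gate sits in the rendering logic, an early result with a tiny p-value looks the same as an inconclusive one, which removes the temptation to stop the experiment on it.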
There are some exciting things we’re building on top of this base that we hope to share in the near future. We hope this gave you a good idea of what makes up the results workflow of Phoenix and the motivations and tenets that shaped its design.
Acknowledgments: We have worked closely with our Director of Analytics, Jennifer Flashman, and Senior Data Scientist Patric Eck. Also, our very own Jay Han has provided us with tremendous support on engineering architecture and implementation best practices.
References
1. Mike Moran, Do It Wrong Quickly (2007)
2. Kohavi, Ron; Tang, Diane; Xu, Ya. Trustworthy Online Controlled Experiments (Kindle Location 1639). Cambridge University Press. Kindle Edition.
3. Lecture 7: Seeking New Laws | The Character of Physical Law | Richard Feynman
“I must also point out that you cannot prove a vague theory wrong.…If the process of computing the consequences is indefinite then, with a little skill any experimental result can be made to look like an expected result. For example, x hates his mother. The reason is of course she didn’t care or caress him or love him enough when he was a child. Actually, if you investigate, you find out that as a matter of fact, she did love him very much. Well, then it was because she was overindulgent when he was young. So by having a vague theory it’s possible to get either result. (Audience applauds) Now wait, the cure for this is the following, if it were possible to state ahead of time how much love is overindulgent, exactly. Then there would be a perfectly legitimate theory against which you can make tests. It is usually said when this is pointed out how much love and so on, oh! You’re dealing with psychological matters — things can’t be defined so precisely. Yes, but then you can’t claim to know anything about it.”
4. Kohavi, Ron; Tang, Diane; Xu, Ya. Trustworthy Online Controlled Experiments (Kindle Location 1396). Cambridge University Press. Kindle Edition.