At Change.org we leverage Elixir to run tens of thousands of parallel multi-armed bandit optimization experiments. The algorithms powering these experiments make decisions about what copy to present to users in various parts of our product. Our bandit service delivers on both of these requirements. Thanks to Elixir, it is able to run an arbitrary number of parallel experiments, automatically deciding which copy to use when, at blazing speed with very few servers.
The bandit service increments counters for pulls and rewards that occur with the variants that belong to its experiments. The faster the counters are the more up to date information is used when making decisions. When a variant is used we call this a pull. For example, if we use a variant to make a Facebook share, we record a pull for that variant. If something good happens with the variant we call that a reward. For example, if someone signs a petition as a result of clicking through a share that was using a variant in our experiment, that variant is rewarded.
For our Elixir-based bandit service we chose Postgres for greater data durability and rich SQL access patterns. The main gotcha of using Postgres is it will lock rows for updates. This makes atomic counters challenging if you have the potential for many updates to the same counter. Updates to the same row will queue and wait for existing updates to finish. With many requests at scale this will cause request queueing as the database struggles to finish all the updates synchronously. The mitigation solution we chose was to emit events in the pull and reward endpoints and use Broadway batching to batch writes to Postgres that would otherwise cause locking. Emitting events and then processing them adds a small delay and means some decisions are made with data that might not be as up to date as it could be. Looking at our pipeline we can see that, on average, processed messages are only 9 seconds old. This delay does not affect our use case and is well worth it for the benefits we get from Postgres in terms of durability and query ability.
Our EDA (Event Driven Architecture) message bus is currently built using SNS and SQS with filter subscriptions. Each SQS queue contains events of a given type. Services process messages on their queues. The bandit service is a principled EDA service, meaning that it writes to the database only as a result of an event. When a web request is received, the only thing that happens is an SNS message is emitted. Those messages make their way to their respective SQS queues. We use Broadway to process messages from SQS and batch writes to Postgres to avoid row write locking.
In production, the bandit service is deployed in EKS using Terraform. We maintain eight pods in production and have never had to scale further(even during the record traffic levels of March 2020). Each Kubernetes pod sits behind a load balancer and hosts an API and two broadway pipelines, one for each of the events the service is interested in (“pulls” and “rewards”). Deploying these side by side means we can scale to meet the demand of either the API or the queue processing.
Web response times average 25ms, of which 20 is spent waiting for SNS to guarantee delivery — meaning the Elixir side of the request is only 5ms! In the future we look forward to a specialized message bus provider like Kafka that will have very fast writes.
We segment our testing space by petition (run an experiment per petition) but the bandit service offers targeting based on a dynamic set of targeting variables. For example if you wanted to segment based on petition and locale, that could be done. There would be an experiment for every petition, locale pair. Segmentation is a powerful tool but if there are too many very small tests, the service will have trouble converging on the optimal variant for each test.
The scale of running tens of thousands of parallel multi-armed bandit experiments requires a solution that is performant, automated and fast in its decision making. Our bandit service provides a model-free optimization method that can be used in many applications to optimize problems that can be stated in terms of pulls and rewards. Elixir has proven itself to be extremely resilient and highly performant. Integrating Broadway and using its batching feature was easy and intuitive.