API throttling with Oracle Responsys
More than a year ago, our CTO decided to move us to a new ESP (Email Service Provider): a professional tool that could keep up with the company’s growth and with Marketing’s growing need to reach our customers. He chose Responsys, and I was on the team that helped with the integration, which went smoothly. You can read more about it here.
What is Responsys?
Responsys is a SaaS product by Oracle that offers a GUI tool to orchestrate your content across channels through user segmentation. Long story short, you provide Responsys with your business metrics and then use their system (through the admin UI or via an API) to tap into those metrics and reach out to your customers over email, SMS, and push notifications.
The API is called the “Interact API” and can be accessed over REST (JSON over HTTP) or SOAP (XML). For example, when a new customer signs up, we send Responsys the signup date, the customer’s name, and whether they came from an email campaign or arrived organically. That is one API call, which adds one record to a list and subscribes the customer to promotional emails.
We use the API to push data updates or to send an email, for example. But every call we made handled a single record: “send this email to that customer” rather than “send this email to this list of customers”.
As traffic grew, we saw more and more API throttling errors. Responsys restricts the API call rate to a maximum number of calls per minute. If you go beyond that rate, the call fails and you’re supposed to wait a few seconds before trying again.
All of our calls are handled by async jobs (Delayed::Job), so if there’s an exception the task is run again a few minutes later and will hopefully go through. That works for a while, until one morning you wake up to a huge backlog of jobs that keep failing and you know it’s time to work on a new solution.
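To make the "wait a few seconds before another attempt" rule concrete, here is a minimal sketch of a wait-and-retry wrapper. The ThrottledError class, the method name, and the default wait time are all assumptions made up for illustration; they are not part of the Responsys client or of our code.

```ruby
# Hypothetical error class standing in for whatever the API client raises
# when a call is rejected for exceeding the rate limit.
class ThrottledError < StandardError; end

# Runs the block, and when it raises ThrottledError, waits and tries again
# until max_attempts is reached. `sleeper` is injectable so the sketch can
# be exercised without real delays.
def with_throttle_retry(max_attempts: 3, wait_seconds: 5, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue ThrottledError
    raise if attempts >= max_attempts # give up and let the job system handle it
    sleeper.call(wait_seconds)
    retry
  end
end

# Usage: the first two attempts are throttled, the third goes through.
waits = []
result = with_throttle_retry(sleeper: ->(s) { waits << s }) do |attempt|
  raise ThrottledError, "rate exceeded" if attempt < 3
  "sent on attempt #{attempt}"
end
puts result
```

Retrying like this only helps with occasional spikes. When every job makes its own call, the retries themselves count against the same per-minute limit.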
The throttling rules
Responsys has an API throttle policy that limits the number of calls per minute. The rules are:
- The main factor is the number of calls per minute allowed for that specific method. It can be different for each method of the API. It can be 30, 100, 1000…
- The number of records per call: 200 maximum.
- No throttling for data retrieval methods.
You will need to reach out to Customer Support and ask for your account’s own throttling rates.
With that, we had everything we needed to transform our code from “one call = one record” into a batched solution where one call = up to 200 records.
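As a rough illustration of what "one call = up to 200 records" means, the sketch below cuts a list of records into call-sized batches. The method name and the record shape are hypothetical; only the 200-record limit comes from the rules above.

```ruby
MAX_RECORDS_PER_CALL = 200 # per-call record limit from the throttling rules

# Splits a list of records into batches small enough for a single API call.
def batches_for(records, max_records: MAX_RECORDS_PER_CALL)
  records.each_slice(max_records).to_a
end

# 450 pending records become three calls instead of 450.
records = (1..450).map { |id| { customer_id: id } }
batches = batches_for(records)
puts batches.map(&:size).inspect # prints [200, 200, 50]
```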
Let’s dig into the technical solution we came up with.
Our first instinct was Redis, an in-memory key-value datastore that is fast, offers easy-to-use data structures such as lists, sets, and hashes, and works well with concurrency patterns.
A simple prototype in Ruby working with a Redis server showed us it was the way to go, but a couple of things stood in the way, such as error handling and monitoring. Building a custom system would have been appealing for several reasons:
- Interesting usage of Redis core features
- Commands that are designed for concurrent tasks
- Tailor-made to fit the throttling rules of Responsys exactly
But then we thought: let’s not reinvent the wheel, and let’s find a more stable foundation to build upon.
What we found was an open-source solution built on Sidekiq, an async job processing system that itself runs on Redis: https://github.com/gzigzigzeo/sidekiq-grouping.
It acts as a middleware that:
- Listens to jobs being enqueued. If the job is from a batch worker, it is stored in a temporary queue.
- Runs a process that goes over the queues and flushes the jobs that have been waiting X seconds.
- Aggregates the flushed jobs into one job, which is passed to Sidekiq for processing.
- Adds a new tab to the Sidekiq dashboard to monitor the temporary queues.
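The store-then-flush behaviour in the list above can be sketched in plain Ruby. This is only an in-memory simplification written for this article: the BatchBuffer class and its method names are hypothetical, the real gem keeps its temporary queues in Redis, and its exact flush conditions may differ.

```ruby
# In-memory stand-in for the temporary queues.
class BatchBuffer
  def initialize(flush_interval:)
    @flush_interval = flush_interval
    @queues = Hash.new { |hash, key| hash[key] = [] }
    @first_enqueued_at = {}
  end

  # Stores one job's arguments instead of running the job right away.
  def push(queue_name, args, now: Time.now)
    @first_enqueued_at[queue_name] ||= now
    @queues[queue_name] << args
  end

  # Returns { queue_name => [args, ...] } for every queue whose oldest job
  # has waited at least flush_interval seconds, and empties those queues.
  def flush_due(now: Time.now)
    due = @first_enqueued_at.select { |_, t| now - t >= @flush_interval }.keys
    due.each_with_object({}) do |name, flushed|
      flushed[name] = @queues.delete(name)
      @first_enqueued_at.delete(name)
    end
  end
end

buffer = BatchBuffer.new(flush_interval: 60)
start = Time.at(0)
buffer.push("signup", ["customer-1"], now: start)
buffer.push("signup", ["customer-2"], now: start + 10)
early = buffer.flush_due(now: start + 30) # nothing has waited 60 seconds yet
batch = buffer.flush_due(now: start + 60) # both signup jobs come out together
puts early.empty?
puts batch["signup"].length
```

Each flushed array would then be handed to a single job, which is what turns many enqueued jobs into one API call.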
It’s almost exactly what we needed! The gem takes care of batching the API calls once per minute, and it uses Sidekiq, which handles errors and provides a dashboard as well as Ruby APIs. It also works fine with Sidekiq Pro.
However, a few changes were needed to make it fit Responsys:
- One API call can’t interact with multiple Responsys objects at once because of API limitations, so we needed to change the way the middleware groups jobs to make the batches easier to process. The solution is a queue option passed as the first argument of a job. A given worker can have jobs with different queue options. A queue option is a String formatted like this: “responsys_interact_object_location:call_option_name”. call_option_name can be a constant that names a predefined set of arguments to pass to the API client method (for example, add_to_profile_extension_table -> *[match_column = “RIID”, insert_on_match = true, update_on_match = “NO_UPDATE”]).
- The number of records per call needs to be abstracted away so that there’s no extra logic in the Sidekiq workers. We added two Sidekiq options (max_records_per_call and max_calls_per_minute), and the middleware splits the output data into a two-level array (an array of calls, each holding an array of records) based on the values of the worker’s options.
- If a worker has jobs for more than one queue option at flush time, the number of calls is spread over the different queue options.
For all these reasons, we forked the project and applied our changes. You’ll find everything, along with some examples, in this pull request: https://github.com/thredup/sidekiq-grouping/pull/1.
We can now announce that it is running in production and has fixed all of our throttling issues.
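The three changes above can be illustrated with a small standalone sketch. It is not the code from the fork: the method names, the round-robin way of spreading calls, and the handling of leftover calls are assumptions chosen for illustration. Only the queue option format and the two option names (max_records_per_call and max_calls_per_minute) come from the description above.

```ruby
# Splits "object_location:call_option_name" into its two parts.
def parse_queue_option(queue_option)
  location, call_option = queue_option.split(":", 2)
  { location: location, call_option: call_option }
end

# jobs is an array of [queue_option, record] pairs, mirroring the idea that
# the queue option is the first argument of each job.
# Returns [planned, leftover]: both map a queue option to a two-level array
# (array of calls, each call an array of records).
def plan_calls(jobs, max_records_per_call:, max_calls_per_minute:)
  pending = jobs.group_by(&:first).transform_values do |group|
    group.map(&:last).each_slice(max_records_per_call).to_a
  end
  planned = {}
  budget = max_calls_per_minute

  # Assumption: hand out the call budget round-robin so that every queue
  # option gets a share of this minute's calls.
  while budget.positive? && pending.values.any? { |calls| !calls.empty? }
    pending.each do |queue_option, calls|
      break if budget.zero?
      next if calls.empty?
      (planned[queue_option] ||= []) << calls.shift
      budget -= 1
    end
  end

  # Assumption: calls that do not fit in the budget wait for the next flush.
  leftover = pending.reject { |_, calls| calls.empty? }
  [planned, leftover]
end

jobs = (1..5).map { |id| ["list_a:add_to_list", id] } +
       %w[x y z].map { |id| ["table_b:add_to_profile_extension_table", id] }
planned, leftover = plan_calls(jobs, max_records_per_call: 2, max_calls_per_minute: 3)
puts planned.inspect
puts leftover.inspect
```

With a budget of three calls, the first queue option gets two calls and the second gets one; the remaining records stay pending.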
Feel free to comment if you have any questions or remarks about the article or if you want to share an API throttling solution you’ve experienced in the past.