Geographically distributing workloads in Python using Celery, RabbitMQ, and Redis

Martyn Verhaegen
Published in Qwyk TechIntel
Jun 27, 2018

Among other things, Qwyk, on an abstract level, is in the business of aggregating data from a large number of sources and transforming it into something immediately useful for our customers. Unlike some other companies in a similar line of business, we mostly use an On-Demand paradigm when it comes to data, choosing to collect something only when it's immediately needed, rather than casting a very wide net periodically and then producing results from our own data stores when a request comes in.

There are pros and cons to either model, but generally I feel the pros of an On-Demand model enable us to provide significantly higher-quality results to our customers, and we can do things to (at least partially) mitigate the cons, shifting the balance.

Pros and Cons

One of the cons of an On-Demand model for data acquisition is latency. When serving stored data we have much finer control over the things that impact latency: we can replicate our assets geographically so as to bring them closer to our consumers, scale up our backend, cache data in a memory store such as Redis or Memcached, and tons of other things. We can do similar things in an On-Demand model, but a significant cause of latency remains: the round-trip time to the external data source.

A Distributed Solution

An interesting attribute of the data we fetch is that what we obtain from the origin is often of a much larger physical size than the data we output at the end of the process (usually by a factor of 10). We do a lot of stripping and parsing to get to that point, producing something structured. Herein lies a problem, but also the solution:

If we place the collection and parsing workers closer to the source, we can minimize the amount of data we need to transport over the net, reducing latency a little (and sometimes significantly).
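As a toy illustration of that size difference (the field names below are made up, not our actual schema): the raw document a source returns might be tens of kilobytes, while the structured record we actually ship back is a few hundred bytes.

    import json

    def parse(raw_body):
        """Strip a verbose source document down to the fields we actually need."""
        record = json.loads(raw_body)  # the full, verbose source payload
        return {                       # the small, structured output we transport
            "origin": record["origin"],
            "destination": record["destination"],
            "rate": record["rate"],
        }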

Here is how the distributed setup looks:

Schematic of a distributed worker solution. Keep in mind that the workers can scale, so in reality we might have 3 in FRA, 2 in NYC, and 4 in SIN, or something along those lines.

I’ll quickly describe the components and how they interact.

  • The Gateway is the web-facing point where data requests are received and processed. It takes care of parsing the request data, logging, and routing individual requests to the RabbitMQ server using Celery, as well as retrieving the results from Redis and serving them back to the client.
  • RabbitMQ is an OSS message broker implementing AMQP. It takes care of receiving new tasks from the gateway, queueing them, and routing them to the correct workers via the regional queues we’ve implemented. It also helps us monitor the queues, indicating when we might need to scale workers in a specific region.
  • The Worker is where, well, the work happens. It sits geographically close to the sources it’s assigned to and implements Celery to take tasks assigned to it by the RabbitMQ broker, fetching the requested data from the external sources and parsing it into our requested output format. When it completes a task, it submits the result to the Redis datastore.
  • Redis is an OSS in-memory datastore which receives results from the workers and allows the gateway to retrieve them. We use Redis because it’s blazing fast and allows us to auto-drop results after a specific amount of time, since we don’t need to persist data there. The gateway takes care of any persistence needs we have using MySQL and MongoDB databases.
  • Celery, while not displayed in the diagram, is vital: it implements the communication between the gateway and the workers through RabbitMQ and Redis (a minimal code sketch follows this list).
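To make the wiring a little more concrete, here is a minimal sketch of what such a setup could look like in Celery. The hostnames, URLs, and queue names are illustrative placeholders, not our production configuration:

    from celery import Celery
    import requests

    app = Celery(
        "collector",
        broker="amqp://user:pass@rabbitmq.example.com//",  # RabbitMQ queues the tasks
        backend="redis://redis.example.com:6379/0",        # Redis holds the results
    )

    # Results are transient: let Redis auto-drop them after an hour.
    # (Fixed task-to-queue assignments could also be declared via task_routes.)
    app.conf.result_expires = 3600

    @app.task
    def fetch_source(source_url, query):
        # Fetch the (large) raw document close to its source...
        raw = requests.get(source_url, params=query, timeout=30)
        # ...and ship only the (small) parsed result back through Redis.
        return parse(raw.text)  # parse() as sketched earlier

    # A worker in a given zone subscribes only to that zone's queue, e.g. Frankfurt:
    #   celery -A collector worker -Q region.fra

Because each worker only subscribes to its own regional queue, RabbitMQ will never hand a Frankfurt task to a Singapore box.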

What does that look like practically?

All these pieces, put together, handle any request in roughly this way:

  1. We receive a request at the gateway for data from, say, sources A through to F.
  2. The Gateway looks up the assigned worker zones for each requested source and pushes jobs onto their respective RabbitMQ queues.
  3. In each zone Workers will pick up tasks assigned to them, fetch the results from the source, parse them into our requested format and submit them back to Redis.
  4. At the Gateway, the client will either block (synchronous request) until the results have come in¹ or poll asynchronously, being served results as they come in from the workers (a dispatch-and-poll sketch follows below).

¹ We only allow synchronous requests for a single task to avoid queue and HTTP blocking. For multiple sources, clients are assigned a request ID they can use to poll the gateway incrementally for results.
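On the gateway side, the dispatch-and-poll loop could look roughly like this. Again a sketch: the SOURCE_URLS and QUEUES lookups are hypothetical stand-ins for our actual routing tables.

    import time

    from collector import fetch_source  # the task sketched above

    # Hypothetical lookups: where each source lives, and which regional
    # queue (and therefore which worker zone) should handle it.
    SOURCE_URLS = {"A": "https://a.example.com", "B": "https://b.example.com"}
    QUEUES = {"A": "region.fra", "B": "region.nyc"}

    def gather(sources, query):
        """Dispatch one task per source and yield results as workers finish."""
        pending = {
            s: fetch_source.apply_async(args=(SOURCE_URLS[s], query), queue=QUEUES[s])
            for s in sources
        }
        while pending:
            for source, result in list(pending.items()):
                if result.ready():
                    yield source, result.get(timeout=5)  # parsed payload from Redis
                    del pending[source]
            time.sleep(0.2)  # naive polling; callbacks would also work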

What exactly are the benefits?

From geographic distribution:

  • Reduced latency: we only move over the net the data we actually need, rather than the raw source data. (The broker/worker machinery adds some overhead of its own, but that’s a minor tradeoff.)
  • We can scale regions independently, giving us finer-grained control. Workers only consume tasks from the queues they’re assigned to, so we can put more, or more performant, servers in a region where we expect higher workloads.

From using a task worker based system:

  • We can use cheap commodity servers, scaling out rather than up if we need to. Previously all of this was handled within a single server using many different HTTP requests, and while we could scale out, that would bring new problems such as persistence between the servers. In this model the broker/worker paradigm takes care of all of that and can autoscale based on workload or time.
  • We get a much better idea of throughput, which helps with scaling. Monitoring overall is much improved as well.
  • We’re able to handle requests asynchronously, rather than having multiple HTTP requests waiting at the server until their results are returned.
  • We can use socket transport to asynchronously send results back to the client as tasks finish (a rough sketch follows below).
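As a rough sketch of that last point, assuming the third-party websockets package (one of several possible transports, not necessarily what we use) and reusing the gather() generator from above:

    import asyncio
    import json

    import websockets  # third-party package: pip install websockets

    async def push_results(websocket):
        # gather() is the polling generator sketched earlier; each parsed
        # result is pushed to the client the moment its task completes.
        # (Note: gather() sleeps between polls, which blocks the event loop;
        # a production server would poll without blocking.)
        for source, payload in gather(["A", "B"], query={}):
            await websocket.send(json.dumps({"source": source, "data": payload}))

    async def main():
        async with websockets.serve(push_results, "localhost", 8765):
            await asyncio.Future()  # run forever

    asyncio.run(main())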

