How to use Elixir’s GenStage.Flow for image resizing
Image resizing is an important (but rarely thought about) feature of many websites. We all want pages to load fast and look good, and that means sending images at the size they’re going to be displayed, optimised to minimise file size. At carwow, one of the big sources of images is stock photos of cars.
We receive batches of these every couple of weeks as high-resolution PNGs (about 2MB each), which need to be resized and optimised into 10–20KB JPEGs. Previously, we had a simple Ruby script that ran through each image, resized it with ImageMagick, and uploaded the result to S3, but this was incredibly slow. This seemed like the perfect opportunity to try Elixir!
At first, it didn’t seem like any of the OTP abstractions were a particularly good fit for this kind of pipeline process; we could use Task to run many operations in parallel, but this doesn’t really help with capping resource usage — which is pretty important when Heroku kills dynos that use too much memory!
Then we came across the announcement of GenStage last July, which solves a lot of these problems for us. With Flow, the whole job can be written as a single pipeline of steps, and Flow will run it in parallel, up to 32 tasks at a time.
Running through that pipeline step by step, we are:
- Fetching a list of new images which need to be resized
- Converting this enumerable into the data source for a Flow, configured to run 32 stages (Elixir processes which will do the work later on), and for each stage to only fetch 1 item to process at a time
- Ignoring any which have already been processed
- Resizing the images
- Uploading the images to S3
- Getting rid of the temporary files we used
- Actually running the Flow we just set up, and blocking the current process until it has finished working
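The steps above can be sketched roughly as follows. This is only an illustration, assuming the flow Hex package is fetched with Mix.install; the ImageJobs module and every helper in it are hypothetical stand-ins for the real listing, ImageMagick and S3 code, and the clean-up step uses Flow.map with a function that returns its input.

```elixir
# Sketch only: assumes the `flow` Hex package is available from Hex.
Mix.install([{:flow, "~> 1.2"}])

defmodule ImageJobs do
  # Hypothetical stand-ins for the real work; none of these touch
  # ImageMagick or S3, they only show the shape of each step.
  def new_images, do: ["a.png", "b.png", "c.png"]

  def already_processed?(path), do: path == "b.png"

  # Pretend to resize: just swap the extension.
  def resize(path), do: Path.rootname(path) <> ".jpg"

  # Pretend to upload: report the file and pass it on.
  def upload(path) do
    IO.puts("uploaded #{path}")
    path
  end

  # Pretend to delete temporary files.
  def cleanup(path), do: path

  def run do
    new_images()
    # 32 stages, each asking for a single item at a time
    |> Flow.from_enumerable(stages: 32, max_demand: 1)
    |> Flow.reject(&already_processed?/1)
    |> Flow.map(&resize/1)
    |> Flow.map(&upload/1)
    |> Flow.map(&cleanup/1)
    # Blocks the calling process until every item has been handled
    |> Flow.run()
  end
end

ImageJobs.run()
```

Because each helper takes one path and returns one path, swapping a stub for real work should not change the shape of the pipeline.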
The interesting step is the second one, where we configure how Flow should distribute work. By default, Flow pulls batches of up to 1000 jobs into each process (the max_demand option), and asks the previous step for more when it is down to 500 pending jobs (min_demand).
That is great when each job is small and the overhead of fetching jobs from the previous stage outweighs the cost of running them, but that’s not the case here: each step in our flow is pretty slow, involving either I/O or running ImageMagick.
We also don’t necessarily have 1000 jobs to run every time, so in its default configuration the first process to ask could take the whole batch, and Flow might not do any work in parallel at all!
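A small experiment can make the effect of the demand settings visible by counting how many stage processes actually receive work. This is a sketch, assuming the flow Hex package; DemandDemo and stages_used are hypothetical names, and Process.sleep stands in for slow work such as I/O or ImageMagick. With the default demand, a short list will typically land on a single stage, while max_demand: 1 lets the jobs spread across stages.

```elixir
# Sketch only: assumes the `flow` Hex package is available from Hex.
Mix.install([{:flow, "~> 1.2"}])

defmodule DemandDemo do
  # Hypothetical helper: runs `jobs` through a Flow built with `opts`
  # and returns the number of distinct stage processes that did work.
  def stages_used(jobs, opts) do
    jobs
    |> Flow.from_enumerable(opts)
    |> Flow.map(fn _job ->
      # Stand-in for a slow job
      Process.sleep(10)
      # Emit the pid of the stage that handled this job
      self()
    end)
    # A Flow is enumerable, so its results can be collected with Enum
    |> Enum.uniq()
    |> length()
  end
end

# Default demand: each stage asks for up to 1000 jobs at once
IO.inspect(DemandDemo.stages_used(1..20, stages: 4), label: "default demand")

# One job at a time per stage, as in the image pipeline
IO.inspect(DemandDemo.stages_used(1..20, stages: 4, max_demand: 1), label: "max_demand: 1")
```

The exact counts depend on scheduling, so treat the printed numbers as an indication rather than a guarantee.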
However, this is mentioned right at the top of the Flow docs, so we had the right configuration set up in no time.
Overall, using Flow removed loads of (probably pretty bad) code we’d written ourselves, and only required a little bit of adjustment to get incredible speed gains, particularly over the old Ruby version.