Life on the Render Farm
What’s in it for us?
Why in the world would Cybera want to get involved with independent film? Well, it turns out there are several reasons:
- We know that a general-purpose Infrastructure-as-a-Service cloud performs perfectly well with general-purpose applications running on it. Things like web services or off-site backup servers are able to share the cloud resources quite happily. But resource-intensive applications, like scientific simulations or data analytics, are more demanding, and their performance suffers if they have to share compute resources with other applications. Graphics rendering is a very CPU-intensive operation, and building a rendering farm in our cloud would allow us to study the effects of very demanding applications on the other cloud tenants, and on the cloud itself.
- We spend a lot of time trying to foster a new ‘digital economy’ in Alberta. We like to imagine how the economy of rural Alberta could be improved if everyone had access to the kinds of resources we used in this project. Imagine Vulcan, Alberta rendering the next Peter Jackson film. Imagine vast quantities of environmental data being analysed in High Prairie. This project is a good example of what ordinary citizens could achieve if cyberinfrastructure like this were as cheap and available as any other utility, such as water and electricity.
- We’re curious to see what becomes possible when constraints are removed. What will people come up with when they can have unlimited network bandwidth? Or all the computing power they want, when they want it? In this case, a made-in-Alberta film was created that would not have been feasible otherwise.
- It was fun.
A simple approach
The approach we took to building our rendering farm had the virtue of simplicity. We wanted something that wouldn’t take long to build, and could be done by almost anyone with modest technical skill. A small Python application installed on a laptop had the job of taking the scene to be rendered, slicing it up into ranges of frames, and handing each range to a different rendering server. The rendering servers would take each frame in the range, and render it as a .png image. When the entire range was complete, the set of images would be uploaded into our object storage system. Matthew could then retrieve the images for assembling into the final animation. What could be easier?
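The slicing step can be sketched in a few lines of Python. This is only an illustration of the idea, not the controller we actually ran: the function name `split_frames` and the frame numbers are made up for the example, and it assumes the scene is split into contiguous ranges of roughly equal size, one per server.

```python
def split_frames(first_frame, last_frame, num_servers):
    """Split an inclusive frame range into contiguous chunks, one per server.

    Hypothetical helper: any leftover frames are spread over the first chunks.
    """
    total = last_frame - first_frame + 1
    base, extra = divmod(total, num_servers)
    chunks = []
    start = first_frame
    for i in range(num_servers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more servers than frames: the rest get nothing
        chunks.append((start, start + size - 1))
        start += size
    return chunks


# Example: 10 frames handed out to 3 rendering servers.
chunks = split_frames(1, 10, 3)
print(chunks)
```

Each tuple is the first and last frame of one server's range. Note that the split only balances the number of frames, not the time each frame takes to render.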
The problem with this simple approach is that the rendering job is not complete until the slowest server finishes its last frame. We discovered that some frames rendered very quickly, sometimes in just a few minutes, while others took hours. The result was that some rendering servers zipped through their chunk of the work in no time at all, and sat idle while another server took a geological age to grind through its work. We developed techniques to feed idle servers new chunks of work to keep them busy, but it was a tedious and manual chore.
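A toy calculation shows how quickly this gets out of hand. The per-frame times below are invented for illustration (they are not measurements from our farm), and the helper `chunk_finish_times` is hypothetical; it simply adds up the render time each server would face if frames were handed out in equal-sized contiguous ranges.

```python
def chunk_finish_times(frame_minutes, num_servers):
    """Total render time per server for contiguous, equal-sized chunks."""
    base, extra = divmod(len(frame_minutes), num_servers)
    totals = []
    start = 0
    for i in range(num_servers):
        size = base + (1 if i < extra else 0)
        totals.append(sum(frame_minutes[start:start + size]))
        start += size
    return totals


# Hypothetical per-frame render times in minutes (not measured data).
frame_minutes = [5, 5, 5, 5, 120, 180, 5, 5, 5]

totals = chunk_finish_times(frame_minutes, 3)
job_minutes = max(totals)                     # job ends when the slowest server ends
idle_minutes = [job_minutes - t for t in totals]
print(totals, job_minutes, idle_minutes)
```

With these made-up numbers, two of the three servers are done after 15 minutes and then wait 290 minutes for the third, even though every server was given the same number of frames.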
A better way
A better way to build the rendering farm would have been to invert the relationship between the controller and the rendering servers. Instead of the controller saying ‘here, work on this’, it would be better to have the renderers say ‘I’ve finished rendering that frame, give me another one to work on’. This arrangement is known as a ‘distributed task queue’, a popular solution for problems like this. It would have been (a little) more complicated to build, but would have allowed the renderers to keep themselves busy 100% of the time. Much more efficient, and easier to manage.
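Here is a minimal sketch of that pull-based arrangement, using only the Python standard library. It is not something we built: threads stand in for rendering servers, `render_frame` is a hypothetical placeholder for the real renderer, and a real farm would need a queue that works across the network rather than the in-process `queue.Queue` used here.

```python
import queue
import threading


def render_frame(frame):
    """Hypothetical stand-in for the real renderer; returns an output file name."""
    return f"frame_{frame:04d}.png"


def worker(name, tasks, results):
    """Keep pulling single frames until the queue is empty."""
    while True:
        try:
            frame = tasks.get_nowait()
        except queue.Empty:
            return  # nothing left to render
        results.put((name, frame, render_frame(frame)))


def run_farm(frames, num_workers):
    tasks = queue.Queue()
    results = queue.Queue()
    for frame in frames:
        tasks.put(frame)

    threads = [
        threading.Thread(target=worker, args=(f"renderer-{i}", tasks, results))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    rendered = []
    while not results.empty():
        rendered.append(results.get())
    return sorted(rendered, key=lambda item: item[1])  # order by frame number


rendered = run_farm(range(1, 11), num_workers=3)
for name, frame, image in rendered:
    print(name, frame, image)
```

Because each renderer asks for one frame at a time, a slow frame only ties up the renderer working on it; the others carry on pulling work until the queue is empty. Which renderer ends up with which frame varies from run to run.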
What are the numbers?
Our rendering farm used 456 CPUs pretty much 24 hours a day for 10 days. As I mentioned above, we weren’t 100% efficient, but still managed to do 2,245 hours of rendering in that time, consuming 53,881 hours of CPU time. These are the kinds of results we like to see: we got the job done, and we learned better ways of doing things next time.
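For a rough sense of how far from 100% we were, the figures above can be turned into a utilization estimate. This is back-of-the-envelope arithmetic, and it assumes that ‘pretty much 24 hours a day’ means exactly 24 hours on every one of the 10 days, which overstates the available capacity slightly.

```python
cpus = 456
hours_per_day = 24          # assumption: treated as exactly 24
days = 10
cpu_hours_used = 53_881     # CPU time consumed, from the figures above

capacity_hours = cpus * hours_per_day * days
utilization = cpu_hours_used / capacity_hours

print(f"capacity: {capacity_hours} CPU-hours")
print(f"utilization: {utilization:.1%}")
```

Under that assumption, the farm had 109,440 CPU-hours available and used a little under half of them.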