Folding@work

Grégoire Seux
Published in Criteo R&D Blog
4 min read · Oct 23, 2020

In mid-March, as France prepared for its lockdown, a few Criteos were wondering how they could be useful in the midst of a global pandemic.

Several discussions on our internal chat came down to the same question:

What can WE do?

I’m lucky to work with talented and generous people. Of course none of us is a doctor, a nurse or even an essential worker, but we all have the motivation.

And then, someone asked:

That felt like the obvious thing to do: we have servers (a lot), spare capacity and technical skills. After a few minutes we had a prototype running the folding@home client on our container platform. Seconds later, we had 10 instances running using CPUs and GPUs from our servers and we started folding.

What is folding@home?

Folding@home is a project started nearly twenty years ago, with the same idea as the venerable SETI@home: using spare (personal) compute power to contribute to scientific research. The project focuses on simulating the 3D structure of proteins and has already led to significant progress and discoveries over the years.

How did it go?

To scale up from 10 folding@home instances to something significant, we had to ask our SRE VP if we could use company resources for something completely unrelated to our business. What came as a surprise was that such a discussion had already been under way. All it took was a spark to conclude it. And we got the GO!

With the enthusiasm and motivation quickly came two questions: how far can we push our GPU servers, and why aren’t we receiving more work to keep them busy?

Data center design requires engineers to make an assumption about how much electricity and cooling each rack needs. It’s rare for every server in a rack to run at 100% at the same time, so racks are provisioned to deliver far less power than that worst case would require. Running on our container platform meant that we could spread folding@home instances, which are heavy consumers, across multiple racks. Problem solved!
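To make the idea concrete, here is a minimal sketch of spreading power-hungry instances across racks without exceeding any rack’s power budget. It is an illustration only, not the scheduler of our container platform: the `Rack` class, the `spread_instances` function, and every wattage figure are hypothetical, and the greedy "most headroom first" strategy is just one simple way to do it.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    power_budget_w: float  # hypothetical: power the rack is provisioned to deliver
    baseline_w: float      # hypothetical: draw of the workloads already running
    instances: int = 0     # folding instances placed on this rack so far

    def headroom_w(self, per_instance_w: float) -> float:
        """Power still available once placed instances are accounted for."""
        return self.power_budget_w - self.baseline_w - self.instances * per_instance_w


def spread_instances(racks: list[Rack], count: int, per_instance_w: float) -> int:
    """Greedily place up to `count` instances, one at a time, on the rack
    with the most remaining headroom. Returns how many were placed."""
    placed = 0
    for _ in range(count):
        best = max(racks, key=lambda r: r.headroom_w(per_instance_w))
        if best.headroom_w(per_instance_w) < per_instance_w:
            break  # no rack can take another instance within its budget
        best.instances += 1
        placed += 1
    return placed
```

Placing one instance at a time on the least loaded rack keeps the extra draw evenly distributed, and the loop stops as soon as no rack has room left, so a request that is too large is only partially satisfied rather than pushing a rack over its budget.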

After throwing some capacity in (around 20 GPUs), we wondered why they were not busy all the time (only the equivalent of 4 full GPUs was actually active). The explanation was simple: the folding@home project was overloaded by the explosion in the number of new contributors, which brought its “work assignment” servers to their knees.

Every time we double our servers resources, the number of Donors trying to help goes up by a factor of 4
What a great issue to have!

Over the following weeks the situation improved of course and most of our instances started to contribute full time to the project.

A few weeks later, we were planning the decommissioning of an old data center when we realized there would be 2 weeks between the removal of all applications and the moment we’d pull the plug on the data center. It meant that 10 thousand CPUs would sit unused for 2 weeks 🤔.

A simple click in a UI and we had 3500 instances of the folding@home client running in this data center, consuming the equivalent of 7k CPUs (the load this data center carried when used for normal business). It also showed us that our container platform could support an application with that many instances, a valuable lesson that will help us with our largest application.

Folding on CPUs is far slower than on GPUs, to the point that CPUs and GPUs are apparently not given the same kind of tasks by the folding@home project. So the overall contribution of this decommissioned data center was not that large compared to the 40 GPU units that were in use at that time.

The data center eventually closed and we went back to folding on GPU units. That was fun though, and we decided to explore that part further: could we automatically leverage the unused capacity of our clusters to run folding@home? Doing this while staying within power constraints (and keeping strong performance for our applications) is a real challenge, and we created an internship position for this exciting project (which was filled immediately)!

So, where are we now?

We have 80 GPUs folding full time for the project, which makes Criteo a top contributor to the folding@home project. This now allows us to fully focus on the “Covid-19 Moonshot project”.

6000 GPUs! This makes us stay humble :)

Criteo is one contributor among many others, and we all do a lot to advance this project. Helping a distributed computation project is a learning experience and a way for us to participate in the worldwide efforts to fight Covid-19. We are happy and proud to be able to do our share!
