Python, Celery, RAM intensive tasks, and a memory leak

I built a web scraper not too long ago. It targets a site with thousands, if not millions, of URLs. Seeing as I was feeling a bit lazy, I added Celery to the scraper, with RabbitMQ as the broker.

For anyone looking to get Celery and RabbitMQ up and running with a client/server setup in order to do simple batch processing, I’d say check out Avil Page’s Scaling Celery — Sending Tasks to a Remote Server.
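To give a rough idea of what that setup looks like, here’s a minimal sketch (not taken from that post); the tasks.py module name, the broker URL, and the scrape_url task are all placeholders for illustration.

```python
# tasks.py -- minimal Celery app wired to a RabbitMQ broker.
# Module name, broker URL, and the task body are illustrative placeholders.
import requests
from celery import Celery

app = Celery("scraper", broker="amqp://guest:guest@localhost:5672//")

@app.task
def scrape_url(url):
    """Fetch one page; real parsing and storage would go here."""
    response = requests.get(url, timeout=30)
    return len(response.text)
```

Start a worker with celery -A tasks worker --loglevel=info, and any client that can reach the broker can submit tasks to it.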

I’d submit hundreds of URLs and off they’d go, with Celery steadily increasing the amount of RAM it had allocated. That’s kind of a problem on this server, seeing as I’m limited to a very small amount of RAM (under 8 GB). Hey, it was just lying around!
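The submitting side was nothing fancy; roughly something like this, assuming the tasks.py sketch above (urls.txt is a made-up input file):

```python
# submit.py -- queue a batch of URLs from the client side.
from tasks import scrape_url

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    scrape_url.delay(url)  # each call pushes one task onto RabbitMQ

print("Queued %d urls" % len(urls))
```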

Unfortunately, that steady memory growth is a known issue with Python and Celery. So what’s the solution? Chase Seibert has a fantastic write-up on the issue, and I encourage you to read it.

The solution? Roll the Celery workers after a certain number of tasks have completed.

“For celery in particular, you can roll the celery worker processes regularly. This is exactly what the CELERYD_MAX_TASKS_PER_CHILD setting does. However, you may end up having to roll the workers so often that you incur an undesirable performance overhead.”
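In the Celery 3.x configuration style that the quote refers to, that looks like the following (20 is just an example value):

```python
# celeryconfig.py -- Celery 3.x-style setting name, as quoted above.
# Each worker process is recycled after it has run 20 tasks,
# releasing whatever memory it had accumulated.
CELERYD_MAX_TASKS_PER_CHILD = 20
```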

For newer versions of Celery, the equivalent command-line option is --max-tasks-per-child=20. The right number depends on how many tasks you run, how much memory each task uses, and how Celery responds to that allocation; it may take some trial and error to settle on.
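Concretely, that can go on the app config (worker_max_tasks_per_child is the lowercase Celery 4+ spelling of the same setting) or on the worker command line; a sketch, reusing the hypothetical tasks.py app from earlier:

```python
# Celery 4+/5 equivalents of CELERYD_MAX_TASKS_PER_CHILD.
from tasks import app  # the hypothetical app from the earlier sketch

# Recycle each worker process after it has run 20 tasks.
app.conf.worker_max_tasks_per_child = 20

# Or pass it when starting the worker instead:
#   celery -A tasks worker --max-tasks-per-child=20
```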

Happy submitting!