Python, Celery, RAM intensive tasks, and a memory leak
I built a web scraper not too long ago. It goes after a website that has thousands, if not millions, of URLs. Seeing as I was feeling a bit lazy, I added Celery with RabbitMQ as the broker.
For anyone looking to get Celery and RabbitMQ up and running with a client/server setup in order to do simple batch processing, I’d say check out Avil Page’s Scaling Celery — Sending Tasks to a Remote Server.
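To make that concrete, here’s a minimal sketch of the kind of setup I’m describing. The module name, broker URL, and fetch logic are illustrative stand-ins, not my actual scraper code:

```python
import requests
from celery import Celery

# Broker URL assumes a local RabbitMQ running with default credentials.
app = Celery("scraper", broker="amqp://guest:guest@localhost:5672//")

@app.task
def scrape(url):
    """Fetch one URL; each task holds the response in memory while it runs."""
    response = requests.get(url, timeout=30)
    return len(response.text)
```

Queueing work from the client side is then just a matter of calling `scrape.delay(url)` for each URL.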
I’d submit hundreds of URLs and off they’d go, with Celery steadily increasing the amount of RAM it had allocated. That’s kind of a problem on this server, seeing as I’m limited to a very small amount of RAM (<8 gigs). Hey, it was just lying around!
Unfortunately, this is a known issue with Python and Celery. So what’s the solution? Chase Siebert has a fantastic write up on the issue, and I encourage you to read it.
The solution? Roll the Celery worker processes after a certain number of tasks have completed.
“For celery in particular, you can roll the celery worker processes regularly. This is exactly what the CELERYD_MAX_TASKS_PER_CHILD setting does. However, you may end up having to roll the workers so often that you incur an undesirable performance overhead.”
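If you’re on an older (3.x-era) Celery where that upper-case setting name applies, it goes in your settings module. The file name and the value 20 are just placeholders here:

```python
# celeryconfig.py (hypothetical), using the Celery 3.x-style setting name
CELERYD_MAX_TASKS_PER_CHILD = 20  # recycle each worker process after 20 tasks
```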
For newer versions of Celery, you should pass --max-tasks-per-child=20 to the worker. The right number depends on how many tasks you run, how much memory each task uses, and how Celery responds to that allocation. It may take some trial and error to get right.
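The same knob is also exposed as a config setting in Celery 4+ under a lower-case name, if you’d rather not pass the flag on every invocation. Again, 20 is just a starting point to tune from:

```python
# Equivalent to the --max-tasks-per-child CLI flag on Celery 4+.
app.conf.worker_max_tasks_per_child = 20
```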