Python “lxml” Memory Leak
A memory leak issue in the Python “lxml” package and a workaround found on the Internet
Background
- I’m building a Flask web service that scrapes stock data
- The service runs on a 512MB memory instance
- The service runs in a Docker container
- It uses “lxml” with XPath to parse HTML strings
“lxml” Memory Leak
I wrote a web service in Docker with schedulers that scrape data every day. Every few days, the service got killed by the OS because of OOM (out of memory). To trace memory usage, I used “memory_profiler” and set the scheduler to run every minute on my local machine.
After a period of time, the service still held on to the memory. Normally, garbage collection periodically releases unused memory, but that did not seem to be happening here. I googled it and found lots of threads about “lxml” memory leaks.
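To make that kind of growth visible without extra dependencies, here is a minimal sketch that uses the standard library “resource” module as a stand-in for “memory_profiler” (it is not what I ran originally). The names trace_runs and job are hypothetical, and it assumes a Unix-like system. It records the process's peak resident memory after each run; if the peak keeps climbing even after a forced garbage collection, memory is not being given back.

```python
import gc
import resource


def peak_rss():
    """Peak resident set size of this process so far.

    Units are platform dependent: KiB on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def trace_runs(job, runs=5):
    """Run job() several times and record the peak RSS after each run.

    `job` is a hypothetical zero-argument callable, e.g. one scrape cycle.
    """
    samples = []
    for _ in range(runs):
        job()
        gc.collect()  # rule out garbage that simply has not been collected yet
        samples.append(peak_rss())
    return samples


if __name__ == "__main__":
    # Stand-in job; replace it with the real scraping function.
    for i, rss in enumerate(trace_runs(lambda: [0] * 100_000), start=1):
        print(f"run {i}: peak RSS = {rss}")
```

Because this is a high-water mark, it never goes down. A flat line after the first few runs is the healthy case; a steadily rising one points at a leak.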
Solution: Run the “lxml” Function in a Sub-Process
I found this article on Reddit. The author runs the “lxml” function in a sub-process and terminates the sub-process once the function has finished; the memory that was never freed is released when the sub-process terminates.
import multiprocessing
from lxml import etree

def lxml_func():
    ...
    tree = etree.HTML(response.text)
    ...
    return result

results_queue = multiprocessing.Queue()

def subprocess_function():
    results_queue.put(lxml_func())

parse_process = multiprocessing.Process(target=subprocess_function)
parse_process.daemon = True
parse_process.start()
result = results_queue.get()  # blocks until results are available
parse_process.terminate()
The memory leak still happens inside the sub-process, but the main process’s memory usage stays stable. Just make sure you terminate the sub-process after the “lxml” function has finished.
It’s kind of a workaround, but I need “lxml” to parse data using xpath.
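The same idea can be wrapped in a small reusable helper. This is only a sketch under a few assumptions: run_in_subprocess, _worker, and extract_links are hypothetical names, the "fork" start method is used (so it is Unix-only), and the return value must be picklable. Compared with the snippet above, it adds a timeout and reports errors raised in the child instead of blocking forever.

```python
import multiprocessing as mp


def _worker(conn, func, args):
    """Runs in the child: call func and send back the result or the error."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_subprocess(func, *args, timeout=60):
    """Run func(*args) in a short-lived child process and return its result.

    Whatever memory the child leaked goes back to the OS when it exits.
    """
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(child_conn, func, args), daemon=True)
    proc.start()
    child_conn.close()  # the parent only reads from the pipe
    try:
        if not parent_conn.poll(timeout):
            raise TimeoutError("sub-process did not finish in time")
        status, payload = parent_conn.recv()
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
        parent_conn.close()
    if status == "error":
        raise RuntimeError(f"sub-process failed: {payload}")
    return payload


def extract_links(html):
    """Hypothetical parser; lxml is imported only where it is used."""
    from lxml import etree

    tree = etree.HTML(html)
    # Convert lxml's string results to plain str so they can be pickled.
    return [str(href) for href in tree.xpath("//a/@href")]
```

With this in place, a scheduled job would call something like run_in_subprocess(extract_links, response.text) instead of calling the parser directly.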
Summary
The Python “lxml” package has memory leak issues. Running the leaking function in a sub-process means the occupied memory is released when the sub-process terminates.