“Web scraping considered dangerous”: Leaking files from the spider’s host

Claudio Salazar
Published Jul 15, 2019

This is the next post in the series “Web scraping considered dangerous”. You can read the previous post here and, as an update, my pull request fixing the FormRequest.from_response behaviour was merged!

This post is again based on scrapy (version 1.6.0), and I’ll show two techniques to leak files from the spider’s host. However, it’s not that easy, since the website must meet certain requirements for the exploitation to succeed. Let’s get to the facts!

Website’s structure premise

Websites return data in multiple formats (HTML, XML, JSON, CSV, plain text, etc.), and for the exploitation described in this post there’s a tight relationship between the data format and the website structure. I’m going to simplify the scenario by choosing the plain text format since:

  1. It has no fixed structure (unlike, e.g., JSON), so the spider is more tolerant when processing different data (expected content vs injected content).
  2. Most files can be read as plain text files :)
  3. It will be clearer later.

Based on my choice of data format, the website must meet the following requirement:

It must have an endpoint returning plain text data that is used to build the next requests. It doesn’t matter what data format the rest of the endpoints return.

For example, it could be a website with a category sitemap endpoint that returns a list of categories, and the spider then requests each of these categories.

Case 1

Continuing with the previous idea, our first case follows the premise exactly:

  1. The /sitemap endpoint contains a list of the categories in the website. It’s a plain text list, with one category per line.
  2. /category?name=<category> returns the number of products in the category.

Using Flask, we can create a web application running at http://dangerous.tld and the underlying code will look like:

If we want to get the number of products, we will need to create a scrapy spider like this:

This spider goes to the /sitemap endpoint, gets the list of categories and requests the page of each one, yielding the number of products in each category. This is the output of the spider:

Everything works well, but what about the website becoming a malicious actor?

Our old friend, redirect!

As mentioned in the previous post, there’s an issue with the OffsiteMiddleware. In my own words:

However, an interesting behavior happens when there’s a request to a page in an allowed domain but redirects to a not allowed domain, since it won’t be filtered and will be processed by the spider.

OK, as a malicious actor I could modify the web application to redirect the /sitemap endpoint to a URL of my choice. When the spider runs, it will request the /sitemap endpoint and follow the redirect. Then, in its parse method, it will split the response into lines and exfiltrate the contents to /category?name=<line> , using as many requests as there are lines in the response.
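To make the exfiltration step concrete, here is a minimal sketch in plain Python (no scrapy required) of what a parse method like the one described above ends up doing with whatever body it receives. The helper name build_category_urls and the sample bodies are hypothetical, and the URL encoding is an assumption; only the /category?name=<line> shape comes from this post.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "http://dangerous.tld"  # the website used throughout this post


def build_category_urls(body):
    """Turn each non-empty line of a response body into a /category URL.

    Hypothetical helper: it mirrors the "split into lines, request each one"
    logic described in the post, not the actual spider code.
    """
    urls = []
    for line in body.splitlines():
        if not line.strip():
            continue  # skip blank lines
        query = urlencode({"name": line})
        urls.append(urljoin(BASE_URL, "/category?" + query))
    return urls


# Expected input: a plain text list of categories.
print(build_category_urls("books\nmusic\n"))

# Injected input: if /sitemap redirects to a local file, every line of that
# file becomes a request parameter sent back to the website.
leaked = "root:x:0:0:root:/root:/bin/bash\n"
print(build_category_urls(leaked))
```

The function has no way to tell a category name from a line of a local file, which is why a plain text endpoint makes the leak so easy.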

What should the URL of my choice be? It’s a kind of SSRF (or, as I’ve dubbed it, “Spider-Side Request Forgery”), so the options are:

  1. If the spider is running on the cloud, the URL could be a metadata URL.
  2. Local services (as I did in the previous post)
  3. LAN servers

Something else? scrapy tries to mimic a lot of browser behavior, but it’s not a real browser. How does it support requesting https or s3 URLs? It uses downloader handlers, and there’s also a file handler. This handler adds support for the file protocol, so file:///etc/passwd is a valid URI that gets translated to /etc/passwd on the spider’s filesystem.
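As a rough illustration of that translation, the standard library can split a file URI into its parts. This is only a sketch of the idea for POSIX-style paths, not scrapy’s file handler, and the helper name file_uri_to_local_path is made up for this example.

```python
from urllib.parse import unquote, urlparse


def file_uri_to_local_path(uri):
    """Return the filesystem path a file:// URI points at (POSIX-style).

    Hypothetical helper for illustration; scrapy has its own implementation.
    """
    parts = urlparse(uri)
    if parts.scheme != "file":
        raise ValueError("not a file URI: %r" % uri)
    # The path component is percent-decoded to get the on-disk path.
    return unquote(parts.path)


print(file_uri_to_local_path("file:///etc/passwd"))  # prints /etc/passwd
```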

Can we redirect /sitemap to file:///etc/passwd and exfiltrate that file? Yep, we’re turning this into SSRF+LFI.

Now, we modify our malicious app to look like this:

When the spider runs, we will receive the following output on the server:

With our malicious app, we’ve successfully exfiltrated /etc/passwd from the spider’s host. As the category endpoint returns a JSON response without caring about the input, the spider leaks the file and doesn’t raise any error. What other sensitive files are available?

  1. /proc/self/environ and friends
  2. Files in /etc
  3. Files like /.dockerenv if they exist

With this spider, getting a private SSH key would take two spider runs: one to fetch /etc/passwd and learn the local user, and a second to try /home/<user>/.ssh/id_rsa . I did something similar in 2014 in the “Exploiting the scraper” post (one spider run, but a different kind of spider).

In fact, how much information you can leak, and how many spider runs it will require, depends on the combination of website structure and spider. In the worst case, you could create a malicious app that sets up a vicious cycle and exfiltrates a large number of files in one spider run.

It’s important to note that, as scrapy makes asynchronous requests, we can’t guarantee the integrity of the exfiltrated file. For example, if we exfiltrate a file like a private SSH key that contains around 27 lines, we will receive its content shuffled and will have to reorder it somehow (I guess by brute force) to get the original file.

Case 2

Now, we’re going to change the web application. Sometimes we have to visit a page to get server-side information needed for next steps. In this case, the web application will generate a token at /token_sitemap that we need to add to the next requests to /token_category. Here’s the new application:

The template first.html is simple: it echoes the url parameter in an anchor tag:

A spider for this new website would look like this:

In parse , it uses an XPath selector to get the full URL (including the token) from the response body and requests it. The spider works and gets the number of products in the category.
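The extraction step can be sketched with the standard library alone. This is not the spider’s code: it uses ElementTree’s small XPath subset instead of scrapy selectors, it only handles well-formed markup, and the response body, token value and function name below are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body: the token value and the exact markup are
# assumptions, chosen only to show an anchor tag carrying the next URL.
BODY = (
    "<html><body>"
    '<a href="http://dangerous.tld/token_category?token=abc123">next</a>'
    "</body></html>"
)


def extract_next_url(body):
    """Return the href of the first anchor tag, or None if there is none."""
    root = ET.fromstring(body)
    anchor = root.find(".//a")  # first <a> element anywhere in the tree
    return anchor.get("href") if anchor is not None else None


print(extract_next_url(BODY))
```

Whatever the server puts in that href is what the spider requests next, and that is exactly what the malicious version of the application abuses.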

Let’s be malicious now!

In our application, at the token_sitemap endpoint, we’re passing a URL to the template. What if we change this URL to file:///etc/passwd? This is the change:

Running the spider (always in DEBUG mode) I get the following output:

Nothing happened. If we check the endpoint manually, this is the response:

The URL is correctly extracted by the spider and the request is created, but it never reaches the malicious webserver and there is no error message. We have to dig into it.

Debugging what’s happening

My first suspect was the OffsiteMiddleware: in the redirect case we avoid it, but maybe we’re hitting it now and that’s why the request to file:///etc/passwd is not made. It’s weird too, because this middleware usually logs offsite messages, but not in this case. The full middleware file is here, but we’re going to review only the interesting parts:

The result argument contains our Request object with url=file:///etc/passwd . At line 4 there’s a check: the first condition, x.dont_filter, is False, and the logic of the second condition, should_follow(), is at line 18.

In short, should_follow() verifies whether the hostname of the url matches the host_regex regular expression. This regular expression is built from the allowed_domains attribute of the spider, so it checks whether the url hostname is dangerous.tld or a subdomain of it. From the python console, we have this:

Our url=file:///etc/passwd doesn’t have a hostname, so the check fails. The flow continues at line 7, and line 8 checks the hostname of our url again; as it’s None, it won’t enter the if block, and that’s why scrapy doesn’t emit the offsite warning. Either way, our request is discarded.

Can a file protocol URI have a hostname? I didn’t know, but according to the RFC, yep. Taking this into consideration, our new url becomes file://dangerous.tld/etc/passwd .
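The hostname check can be approximated with the standard library. This is a sketch modelled on the behaviour described above, not the middleware’s source: the exact shape of the regular expression is an assumption, and should_follow here is a standalone function rather than the real method.

```python
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["dangerous.tld"]  # the spider's allowed_domains

# Assumed shape of the host regex: an allowed domain or any subdomain of it.
HOST_REGEX = re.compile(
    r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in ALLOWED_DOMAINS)
)


def should_follow(url):
    """Approximate the offsite check: True if the URL's hostname is allowed."""
    host = urlparse(url).hostname or ""  # hostname is None for file:///...
    return bool(HOST_REGEX.search(host))


for url in (
    "http://dangerous.tld/category?name=books",
    "file:///etc/passwd",
    "file://dangerous.tld/etc/passwd",
):
    # Show the parsed hostname next to the decision for each URL.
    print(url, urlparse(url).hostname, should_follow(url))
```

The check only looks at the hostname, never at the scheme, so a file URI that carries an allowed hostname passes while its path still points at /etc/passwd.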

Running the spider again, we can see the log exfiltrating /etc/passwd bypassing OffsiteMiddleware 🏆 :


When you do web scraping, it’s a one-way flow: the spider extracts information from the website and not the other way around. However, given the right combination of factors, malicious actors might be able to exfiltrate data from spiders’ hosts, compromising private information in an unexpected way.

In my opinion, the lessons from this post are:

  1. Run your spiders in isolated environments.
  2. There’s a thing about the file protocol. It’s useful in cases like starting a spider with a local file as initial input to test something, but apart from that, it’s not widely used. What about a change of scheme from https to file, as seen in the redirect exploitation? Is that allowed by browsers? And how do I protect myself from it: if I scrape a website using only https URLs, is there a setting to accept only http/https URLs and reduce my attack surface?
  3. The redirect issue is a serious concern.
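One option that may help with the question in point 2 is scrapy’s DOWNLOAD_HANDLERS setting, which allows a built-in download handler to be disabled by mapping its URI scheme to None. The fragment below is a sketch for a project’s settings.py, assuming the spider only ever needs http and https; whether it covers every path discussed in this post should be verified against the scrapy version in use.

```python
# settings.py (sketch): disable handlers for schemes the spider never needs.
# Mapping a scheme to None turns its built-in download handler off.
DOWNLOAD_HANDLERS = {
    "file": None,
    "s3": None,   # optional: only if the project never fetches s3:// URLs
    "ftp": None,  # optional: only if the project never fetches ftp:// URLs
}
```

With the file handler disabled, a request to a file:// URL should fail instead of reading from the local filesystem. It does nothing for the other SSRF targets (metadata URLs, local services, LAN servers), so running spiders in an isolated environment still matters.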

I’m going to report these issues and collaborate to find a solution with the scrapy development team. I hope that for the next post I’ll have some news about these concerns and how we fix them.