A secret note to Bug hunters about URL structure and its parsers.

Simgamsetti Manikanta
Mar 14 · 3 min read

I wrote this article because there is no proper resource about the URL structure on the internet.

Image for post
Image for post

Introduction-Uniform Resource Locator (URL)

We all are familiar with the internet, so we are also familiar with URLs. We can easily recognize a string whether it is URL or not by seeing the ‘Protocol Scheme’ followed by “://” and then sequence of characters separated by “dot”.


Q. OK! But how the browser recognizes that the given input string at the address bar is the URL?

A. Actually, the browser first looks for the URI instead of the URL.

oh Wait!…. What is the URI?

Before we understand about the URL we should know about the Uniform Resource Identifier (URI). It worthy to directly go to the URI syntax instead of going into depth of its characteristics and rules.

Image for post
Image for post
RFC3986 URL Structure 1
URI Components

Scheme: In this we have to define the protocol. (ex: http, ftp, ldap etc.,)

hier-part: authority (This is so complicated )

Path: which defines the resource location

query: It defines the query parameters

Now you will get a doubt. What is the difference between the URI and URL.

Actually, the secret is in the Authority part.


Image for post
Image for post
RFC3986 URI components

This is the part I missed when I was learned about the URL.

Userinfo: Which is nothing but the username, password fields which are separated by the “:” (ex: username:password)

And the remaining parts we already know.

So that the actual URI is look like below.

Image for post
Image for post
URI structure or URL Structure

So it means “Uniform Resource Locator” refers to the subset of URIs.

Secret behind the URL parsers

Lets investigate how browsers parse URLs.

browser URL parser analysis

This will give the four results.

https://evil.com.#.example.com -> evil.com
https://evil.com./.example.com -> evil.com
https://evil.com.?.example.com -> evil.com
https://evil.com.\.example.com -> evil.com

So what we can achieve from the above observation is you can bypass the URL redirect validation

#if the application redirects only that subdomains via "redirect_url" and the back-end logic is implemented by matching the only the last "." separated words using regex match
GET /api/logi?redirect_url="https://evil.com.#.example.com" HTTP/1.1
Host: example.com
HTTP/1.1 302 Found
Location: https://evil.com.#.example.com

The above example may look silly but the different URL parsers are giving the different host value. Because of the lack of consistence in the regex pattern two different parsers interpret the URL differently.

Sample desync parsers example:

Image for post
Image for post
URL parser desync
http://example.com &@google.com# @attacker.com

The above URL parsed result is varies by the different URL parsers in the python3.

This desync behaviour of parsers causes serious vulnerabilities in the web applications.

Real-world bugs due to URL parsers desync

Universal XSS in the chrome browser

CORS bypass on google domain

PHP SSRF bypass


I recommend you to analyze the URL parsers behaviour during the web application pen-testing. So that we can construct our payloads to bypass the CORS and SSRF protection.


A New Era of SSRF

Thanks for reading. If you like this write-up please follow me and stay tune for more hacking techniques.


Cyber Hygiene Facilitators

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store