Exploring Sherlock

Aashutosh Poudel
Published in Analytics Vidhya
Jan 15, 2020 · 7 min read

A tool for searching usernames across social networks.

This post is my attempt at understanding an open-source project. The project itself and the code belong to the authors and the collaborators. I am just trying to make sense of how it works and which frameworks and libraries are actually used in this project. The GitHub repo for Sherlock is here.

So, let’s start with what Sherlock does. It’s a simple tool that searches for a particular username on more than 300 websites. One of my friends told me about this tool, and we discussed how it could have been implemented. After reading this wiki, I learned that our initial guess was correct: Sherlock relies on the status codes and responses it gets from hitting the URLs of these different sites. There are three ways Sherlock determines whether a user is registered with a particular username on a site.

  1. Examine the HTTP status code: a valid (claimed) username does not return a 404 error in its response.
  2. Examine the response URL: if the site redirects the request to its main page or a sign-up page, the returned response URL indicates that no user exists for the queried username.
  3. Examine the error message: check the response body for a site-specific error message, or look for any redirects.

A detailed discussion of the above methods, how the sites are stored, and how the usernames are queried can be found here. The project’s wiki and the list of scanned sites can be found here and here, respectively.
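Just to make this concrete, here is a rough sketch of the three checks using the requests library. The site entries, errorType values, and error strings below are made up for illustration and are not taken from Sherlock’s actual data files.

    import requests

    # Hypothetical site entries; Sherlock stores the real ones in a JSON data file.
    SITES = {
        "StatusCodeSite": {"url": "https://example.org/{}", "errorType": "status_code"},
        "RedirectSite": {"url": "https://example.net/{}", "errorType": "response_url"},
        "MessageSite": {"url": "https://example.com/{}", "errorType": "message",
                        "errorMsg": "This page could not be found"},
    }

    def username_exists(site, username):
        url = site["url"].format(username)
        # Disable redirects so a redirect to the main page or sign-up page stays visible.
        resp = requests.get(url, allow_redirects=False, timeout=10)

        if site["errorType"] == "status_code":
            # Method 1: a claimed username should not produce an error status code.
            return resp.status_code < 300
        elif site["errorType"] == "response_url":
            # Method 2: a redirect away from the profile URL means no such user.
            return not resp.is_redirect
        else:
            # Method 3: a known error message in the body means no such user.
            return site["errorMsg"] not in resp.text

    for name, site in SITES.items():
        print(name, username_exists(site, "someuser"))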

Photo by Hitesh Choudhary on Unsplash

The GitHub repo for the Sherlock project can be found here. Now, let’s get right into the code.

Going through the readme we can easily see that it’s a command-line application, requires Python to run, and can be run with Docker. So, let’s quickly look at the requirements.txt file to get an idea about what drives Sherlock under the hood. The following is a list of all the packages used by Sherlock.

  • beautifulsoup4>=4.8.0
  • bs4>=0.0.1
  • certifi>=2019.6.16
  • colorama>=0.4.1
  • lxml>=4.4.0
  • PySocks>=1.7.0
  • requests>=2.22.0
  • requests-futures>=1.0.0
  • soupsieve>=1.9.2
  • stem>=1.8.0
  • torrequest>=0.1.0

Also, let’s take a small detour and get to know these packages better.

  • beautifulsoup4: Beautiful Soup helps us parse and modify HTML and XML files. It allows us to manipulate pages in a way similar to how the document object in JavaScript lets us operate on a web page. It transforms the original HTML into a tree of Python objects, the most common of which are Tag, NavigableString, BeautifulSoup, and Comment. Tag objects let us work with HTML or XML tags together with their names and attributes. NavigableString refers to any text inside a tag. The BeautifulSoup object contains the entire parsed document. We can iteratively walk through the document with the next_ and previous_ attributes, or declaratively use find and find_all to locate elements, strings, and tags by name, regular expression, list, or dictionary of attributes. The full documentation can be found here.
  • bs4: So bs4 is just a dummy package that installs the Beautiful Soup package.
  • certifi: Certifi ships the root certificates used for verifying the identity of TLS hosts. That is, it makes sure the SSL certificate of a given website is valid and was issued by one of the CAs whose root certificates are present in the certifi CA bundle.
  • colorama: ANSI escape sequences are used for terminal control operations such as setting colors and cursor positions. Colorama makes these sequences work on MS Windows as well.
  • lxml: A library for processing XML and HTML documents. It provides Pythonic bindings for the C libraries libxml2 and libxslt. It is based on the ElementTree API, which represents XML documents as trees of Element objects.
  • PySocks: The SOCKS protocol exchanges network packets between a client and a server through a proxy server. More information about SOCKS proxies can be found here. PySocks routes Python socket traffic through a SOCKS proxy, which acts as a tunnel, relaying the traffic without modifying it.
  • requests: Requests is a complete HTTP library for Python. Easy-to-follow documentation can be found here. Requests can be used to make all sorts of HTTP requests, read the response, pass parameters in the URL, handle raw/binary/JSON content, send custom headers or JSON data, check response status codes, send and receive cookies, etc.
  • requests-futures: Requests-futures provides the same functionality as the requests library but adds asynchronous execution, returning Future objects instead of Response objects (a short example follows after this list).
  • soupsieve: Soupsieve is a CSS selector library that provides selecting, matching, and filtering using modern CSS selectors.
  • stem: Stem is a Python library for talking with Tor. It is used to connect to the Tor process using the Tor control protocol.
  • torrequest: A wrapper around requests and stem for making HTTP requests over Tor.
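Since requests-futures is what lets Sherlock query many sites concurrently, here is a tiny illustrative sketch (with placeholder URLs, not Sherlock’s actual code) of firing off several requests at once and collecting the responses as the futures complete.

    from requests_futures.sessions import FuturesSession

    session = FuturesSession(max_workers=5)

    # Placeholder URLs; Sherlock builds the real ones from its site list and the username.
    urls = [
        "https://example.org/someuser",
        "https://example.net/someuser",
        "https://example.com/someuser",
    ]

    # Each get() returns immediately with a Future instead of a Response.
    futures = [session.get(url, allow_redirects=False, timeout=10) for url in urls]

    for url, future in zip(urls, futures):
        response = future.result()  # blocks only until this particular request finishes
        print(url, response.status_code)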

Now that we know which libraries are used and for what purpose, we can start looking at how they are used in the project itself.

As we look into the project files, we can safely ignore the docker, travis, code_of_conduct, and contributing files, and the tests folder for now. Let’s first look into the two data files: data.json and data_bad_site.json.

  • In the data.json file, errorType tells Sherlock which piece of information to look for in the response in case of an error. An error type of status code means it looks at the HTTP status code, whereas an error type of message means it looks for a specific error text in the response. Also, the URL format used for checking the username contains {} as a placeholder for the username (a small sketch follows after this list).
  • data_bad_site.json contains the sites that aren’t supported by Sherlock’s current detection algorithms. The file removed_sites.md documents the detailed reasons why those sites aren’t supported.
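To make the placeholder idea concrete, here is a rough, hypothetical sketch of what such an entry and the username substitution might look like. The field names and values are illustrative only and are not copied from data.json.

    # Hypothetical entry modelled on the description above, not copied from data.json
    site_entry = {
        "errorType": "status_code",           # which detection method to use for this site
        "url": "https://example-site.com/{}"  # {} is the placeholder for the username
    }

    username = "someuser"
    profile_url = site_entry["url"].format(username)
    print(profile_url)  # https://example-site.com/someuser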

Ok, now we are down to the three final files: site_list.py, sherlock.py, and load_proxies.py.

  • load_proxies.py: The functions in this file are well documented in the code. It provides functions for extracting proxies from a CSV file, checking them against wikipedia.org, and returning a list containing only the working proxies.
  • site_list.py: This script takes a command-line argument -r that updates the Alexa rank of the sites present in data.json. Each request runs in a separate thread, and the results are written to the sites.md file. It uses the XML ElementTree API to find the REACH tag and extract its RANK attribute from the response returned by the Alexa API (a small parsing sketch follows after this list). A typical response from the Alexa API looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Need more Alexa data?  Find our APIs here: https://aws.amazon.com/alexa/ -->
<ALEXA VER="0.9" URL="elwo.ru/" HOME="0" AID="=" IDN="elwo.ru/">
  <SD>
    <POPULARITY URL="elwo.ru/" TEXT="252350" SOURCE="panel"/>
    <REACH RANK="235507"/>
    <RANK DELTA="-115898"/>
    <COUNTRY CODE="RU" NAME="Russia" RANK="22456"/>
  </SD>
</ALEXA>
  • sherlock.py: The core module of this project. Let’s explain the major parts of this file one by one.
  1. main function: Initializes colorama for printing colored text and backgrounds in the terminal. Defines command-line options for Tor, proxies, printing, sorting by rank, color, output folder, verbosity, etc. Defines a global proxy_list that gets modified inside a function. Checks for conflicting arguments, then sets up and verifies the proxies, or decides whether Tor should be used to make requests. Checks whether a URL is provided for loading the site list. Loads the site data from the JSON file, either the complete list or just the subset passed via the site_list command-line argument, and verifies that the requested sites are supported. Also checks whether the list should be sorted by site rank. Then sets up the folder and files for storing the fetched information. Finally, the sherlock function is called with all the required arguments and performs the actual search for the username. The returned results dictionary is scanned for usernames that were actually detected, and those are written to the output file. Additionally, if specified, the results are written to a CSV file.
  2. Utility functions for checking the timeout validity, printing valid and invalid results, info, errors, etc.
  3. ElapsedFuturesSession class: This class adds the total response time to each request. A Future is something like a Promise in JavaScript, used for asynchronous calls. The class overrides the request method of the base class, FuturesSession, defines a timing function that gets called as a response hook, checks whether any other hooks are registered under the response key, and inserts the timing function as the first one to be executed (a rough sketch of this hook idea follows after this list).
  4. sherlock function: Initializes Tor if required. Sets the maximum number of threads in the session to 20. Creates the extended ElapsedFuturesSession object. Adds the User-Agent header, plus any extra headers present in the site_data, before making the request. Skips the request if the username fails the regex check for the given site. Checks the errorType for every site to decide how much information needs to be gathered for detection, i.e., whether to allow redirects and whether to use the HEAD or GET HTTP method. Finally, makes the requests with the specified options, headers, timeouts, URLs, etc., and stores the Futures returned by the request calls. Extracts the response time, error type, and response using the get_response function, then the status code and response text from the response. Checks whether the status code is of type 2XX, whether the error message in the response matches what is defined in data.json, or whether there is a redirect, and updates the exists status accordingly. Ultimately, saves the status, response text, and elapsed time in the results and returns the final result as a dictionary.
  5. get_response function: Returns the response, the error type, and the total response time. Handles all kinds of errors related to connections, proxies, and the number of retries.
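To illustrate the REACH/RANK extraction mentioned for site_list.py above, here is a minimal sketch using the standard library’s ElementTree; the XML is the sample response shown earlier (with the declaration omitted, since ElementTree rejects string input that carries an encoding declaration), and the exact code in site_list.py may differ.

    import xml.etree.ElementTree as ET

    # Sample Alexa response from above, trimmed and without the XML declaration.
    xml_response = """<ALEXA VER="0.9" URL="elwo.ru/" HOME="0" AID="=" IDN="elwo.ru/">
      <SD>
        <POPULARITY URL="elwo.ru/" TEXT="252350" SOURCE="panel"/>
        <REACH RANK="235507"/>
        <RANK DELTA="-115898"/>
      </SD>
    </ALEXA>"""

    root = ET.fromstring(xml_response)
    reach = root.find(".//REACH")      # locate the REACH tag anywhere in the tree
    rank = int(reach.attrib["RANK"])   # pull out its RANK attribute
    print(rank)                        # 235507

And to make the response-hook idea from point 3 concrete, here is a rough sketch, not Sherlock’s actual implementation, of how a FuturesSession subclass can record the elapsed time of each request by inserting a timing hook ahead of any other response hooks. The class name TimedFuturesSession is made up for this example.

    from time import monotonic
    from requests_futures.sessions import FuturesSession

    class TimedFuturesSession(FuturesSession):  # hypothetical name, not Sherlock's class
        def request(self, method, url, hooks=None, *args, **kwargs):
            hooks = hooks or {}
            start = monotonic()

            def timing(response, *args, **kwargs):
                # Attach the elapsed time to the response object itself.
                response.elapsed_seconds = monotonic() - start

            # Make sure the timing hook runs first, before any other response hooks.
            existing = hooks.get("response", [])
            if callable(existing):
                existing = [existing]
            hooks["response"] = [timing] + list(existing)

            return super().request(method, url, *args, hooks=hooks, **kwargs)

    session = TimedFuturesSession(max_workers=5)
    future = session.get("https://example.org", timeout=10)
    response = future.result()
    print(response.status_code, response.elapsed_seconds)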
