Widespread Misconfiguration Exposes Stored Source Code on Thousands of Servers

Cameron Stokes
Published in Hive Intelligence
Apr 10, 2018

In the last couple of years, many inexperienced web application developers have been found to inadvertently leave key components of their Git repositories publicly accessible — potentially giving anyone access to sensitive source code, access keys, passwords and more. In this article, Hivint Technical Specialist Cameron Stokes outlines how he developed his own tool to research this problem further, and his findings as to how common publicly exposed repositories are ‘in the wild’.

What is Git and what are the potential security issues?

Git is a distributed version control system commonly used to address the challenge of having multiple developers working on the same project simultaneously. It allows source code versions to be managed in a logical manner and tracks changes through different ‘forks’ and ‘branches’. A Git repository contains all this information and, because Git is distributed, any host that synchronises to the repository also receives a copy of the historical information. One of the common uses of a Git repository is to manage websites. These can span from single page “brochureware” sites with static content, all the way through to large enterprise Content Management System (CMS) installations. Using Git to manage these is an efficient way to synchronise development and production systems and is not inherently insecure; however, publicly exposed Git repositories can lead to serious security issues.

Once an attacker (or penetration tester) gains access to a Git repository, they also gain access to the secrets within, which can include source code, sensitive configuration files, and even credentials.

An example of the security implications can be seen in this Twitter thread by Hanno Böck, which showed that the telecommunications company T-Mobile Austria had exposed credentials in configuration files that could be obtained by retrieving exposed Git repositories: https://twitter.com/hanno/status/982530027135922179

As such, scanning for and retrieving exposed Git repositories is a key component of a web application penetration test. However, many of the publicly available tools to achieve this are relatively slow and not suited to large repositories. To address this (and to learn a bit more about the problem), I developed my own tool that achieved the same results much faster.

How widespread is the problem of publicly exposed Git repositories?

Over time, I noticed that publicly exposing Git repositories was a common mistake for developers to make — but the question was, how common? To try and answer this, I undertook a scan of the internet, and was fairly surprised to find some large companies exposing fairly sensitive information in the Git configuration files alone.

When I first encountered this issue, I found a blog post from Internetwache.org. It indicated that a substantial number of repositories were available when analysing the Alexa top 1 million domains at the time, even if the percentage is quite low (9700, or less than 1%).

Some of the sites identified as being vulnerable were surprising, with a number residing in the government and banking sectors. Another alarming finding was that exposed Git ‘config’ files were often found to contain HTTP basic authentication credentials. As these credentials are stored in plaintext, any Git repositories where write access is permitted with those credentials could be targeted by attackers inserting malicious backdoors into the codebase.

Pillaging Git repositories is nothing new — penetration testers and attackers have been doing so for a number of years. There are many freely available tools to assist with dumping Git repositories, as well as others for analysing and extracting secrets from the dumps. Tools are not required to retrieve repositories when the .git folder allows directory indexing, as it’s possible to simply recursively download all files.

However, retrieving repositories where indexing is not available is a slightly more complicated process. It is still possible, because the .git/index file keeps track of most of the important repository objects, and this is where the dumping tools are required. It can also be considerably slower, especially when used against larger repositories.

My own tool for retrieving exposed repositories is born…

To solve this issue, I developed a tool in Golang that performs the same tasks as the existing tools that can be used for retrieving exposed repositories, but in a much faster timeframe. My version simply downloads the index file (which is a catalogue of Git ‘object’ files), parses it, and then uses multiple threads to retrieve any referenced files recursively. It’s like Gobuster for exposed Git repos.
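The download-parse-fetch approach described above can be sketched roughly as follows. This is not the author's tool: the names `parseIndex`, `objectPath` and `fetchAll` are hypothetical, the parser assumes a version 2 index, and only the blob objects listed in the index are fetched, from the standard loose-object location `.git/objects/<first 2 hex chars>/<remaining 38>`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/hex"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"sync"
)

type indexEntry struct {
	SHA1 string // hex object ID of the blob
	Path string // file path in the working tree
}

// parseIndex reads a version 2 Git index; versions 3 and 4 are not handled.
func parseIndex(data []byte) ([]indexEntry, error) {
	if len(data) < 12 || string(data[:4]) != "DIRC" ||
		binary.BigEndian.Uint32(data[4:8]) != 2 {
		return nil, errors.New("not a version 2 Git index")
	}
	count := binary.BigEndian.Uint32(data[8:12])
	var entries []indexEntry
	pos := 12
	for i := uint32(0); i < count; i++ {
		// Fixed part: 40 bytes of stat data, 20-byte SHA-1, 2 bytes of flags.
		if pos+62 > len(data) {
			return nil, errors.New("truncated index entry")
		}
		nameEnd := bytes.IndexByte(data[pos+62:], 0)
		if nameEnd < 0 {
			return nil, errors.New("unterminated path name")
		}
		entries = append(entries, indexEntry{
			SHA1: hex.EncodeToString(data[pos+40 : pos+60]),
			Path: string(data[pos+62 : pos+62+nameEnd]),
		})
		// Each entry is NUL-padded to a multiple of 8 bytes.
		pos += (62 + nameEnd + 8) &^ 7
	}
	return entries, nil
}

// objectPath maps an object ID to its loose-object location under .git.
func objectPath(sha string) string {
	return ".git/objects/" + sha[:2] + "/" + sha[2:]
}

// fetchAll downloads the raw (zlib-compressed) objects; workers must be >= 1.
func fetchAll(base string, entries []indexEntry, workers int) map[string][]byte {
	out := make(map[string][]byte)
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, workers) // limits concurrent requests
	for _, e := range entries {
		wg.Add(1)
		sem <- struct{}{}
		go func(e indexEntry) {
			defer wg.Done()
			defer func() { <-sem }()
			resp, err := http.Get(base + "/" + objectPath(e.SHA1))
			if err != nil {
				return
			}
			defer resp.Body.Close()
			body, err := io.ReadAll(resp.Body)
			if err != nil || resp.StatusCode != http.StatusOK {
				return
			}
			mu.Lock()
			out[e.SHA1] = body
			mu.Unlock()
		}(e)
	}
	wg.Wait()
	return out
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: gitdump <base-url>")
		return
	}
	resp, err := http.Get(os.Args[1] + "/.git/index")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	entries, err := parseIndex(data)
	got := fetchAll(os.Args[1], entries, 16)
	fmt.Printf("retrieved %d of %d objects (parse error: %v)\n", len(got), len(entries), err)
}
```

A fuller tool would presumably also follow the refs to commit and tree objects to recover history, and would decompress each object to restore the original files; both steps are left out here to keep the sketch short.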

From my experience, I knew that the most interesting results would be obtained through analysing accidentally exposed website back ends and development sites, not generally found on the Alexa listed domains. To further the research undertaken by Internetwache.org, I decided to scan the entire public IPv4 space to identify any servers that have an exposed Git repository. The scan was performed as follows:

1. Masscan was used to identify servers with either port 80 or 443 open;

2. For all servers identified in the step above, an (unreleased) custom tool was used to send a HEAD request for http(s)://ip.ad.dre.ss/.git/index; and

3. If the HEAD request showed that it was likely a Git repository, a GET request was sent for http(s)://ip.ad.dre.ss/.git/config.
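Steps 2 and 3 could look something like the sketch below. The custom tool itself was not released, so everything here is an assumption: the names `probe` and `looksLikeGitConfig` are hypothetical, a 200 response to the HEAD request is treated as "likely a Git repository", and the config check simply looks for section headers that Git normally writes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// looksLikeGitConfig checks for section headers that almost every
// .git/config contains. The exact heuristic is an assumption of this sketch.
func looksLikeGitConfig(body string) bool {
	return strings.Contains(body, "[core]") || strings.Contains(body, "[remote ")
}

// probe sends a HEAD request for .git/index and, only if that succeeds,
// a GET request for .git/config. It returns the config text when found.
func probe(client *http.Client, base string) (string, bool) {
	head, err := client.Head(base + "/.git/index")
	if err != nil {
		return "", false
	}
	head.Body.Close()
	if head.StatusCode != http.StatusOK {
		return "", false
	}
	resp, err := client.Get(base + "/.git/config")
	if err != nil {
		return "", false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", false
	}
	// Cap the read so a large HTML error page cannot exhaust memory.
	raw, err := io.ReadAll(io.LimitReader(resp.Body, 64<<10))
	if err != nil || !looksLikeGitConfig(string(raw)) {
		return "", false
	}
	return string(raw), true
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: gitprobe <base-url>")
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	if cfg, ok := probe(client, os.Args[1]); ok {
		fmt.Println("exposed repository, config follows:")
		fmt.Println(cfg)
	} else {
		fmt.Println("no exposed repository detected")
	}
}
```

Sending the cheap HEAD request first keeps traffic low across a large address range, and checking the config body for predictable values filters out servers that answer 200 for every path.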

Across all servers that were identified to have port 80 or 443 open, approximately 0.2% were found to be hosting a vulnerable Git repository. Although this may seem like a small number, it equates to approximately two exposed Git repositories for every thousand servers that have either port 443 or port 80 open on the Internet. This equates to just short of 200,000 servers.

It is also worth noting that this was a simple root directory search and that there are likely to be significantly more repositories that are not stored in the root directory (for example, if the repository was stored in a ‘dev’ directory on the web server, the path would look like https://ip.ad.dre.ss/dev/.git). This scan also did not reveal any potentially vulnerable sites that are on shared hosting or require a host header for access.

Retrieving the config file helped to identify the hosting repository and to gain insight into what the repository may contain, as the config file is used to configure repository options such as upstream servers, and to store certain variables used by Git to manage the repository (and in some cases passwords). Targeting this file was also useful to filter out false positives due to the presence of predictable values. One of the more sensitive configurations that it is possible to set in this file is basic authentication to remote repositories (for example, http://username:password@github.com/user/repo.git). In the retrieved config files, approximately 2500 contained basic authentication credentials. As previously noted, this could lead to the compromise of a repository, as an attacker can easily commit malicious backdoors into the codebase.
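As a rough illustration of how such credentials might be spotted in a retrieved config file, the sketch below scans for `url = ...` lines and reports those whose URL carries a password. The function name `remotesWithPasswords` is hypothetical and this is not the parsing code used for the research; the sample config reuses the placeholder URL from the paragraph above.

```go
package main

import (
	"bufio"
	"fmt"
	"net/url"
	"strings"
)

// remotesWithPasswords returns the remote URLs in a Git config that embed
// a password in the userinfo part (scheme://user:password@host/...).
// Passwords are masked in the returned strings.
func remotesWithPasswords(config string) []string {
	var found []string
	sc := bufio.NewScanner(strings.NewReader(config))
	for sc.Scan() {
		key, value, ok := strings.Cut(strings.TrimSpace(sc.Text()), "=")
		if !ok || strings.TrimSpace(key) != "url" {
			continue
		}
		// scp-style remotes (git@host:path) do not parse as URLs and are skipped.
		u, err := url.Parse(strings.TrimSpace(value))
		if err != nil || u.User == nil {
			continue
		}
		if _, hasPassword := u.User.Password(); hasPassword {
			found = append(found, u.Redacted())
		}
	}
	return found
}

func main() {
	// Hypothetical config built around the placeholder URL from the article.
	sample := "[core]\n\tbare = false\n" +
		"[remote \"origin\"]\n\turl = http://username:password@github.com/user/repo.git\n"
	for _, r := range remotesWithPasswords(sample) {
		fmt.Println(r)
	}
}
```

The result is printed with `URL.Redacted`, which masks the password, so that a report built from scan results does not itself become a list of working credentials.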

Basic Authentication Credentials can be located in Git config files

In addition to sensitive information being hosted within the publicly reachable Git config files, the names of upstream or origin repositories in several of these indicated that some servers likely contained sensitive information. Several government addresses were identified as well as a handful of other unnamed entities that very likely should know better.

Government websites were also found to be vulnerable
It is unlikely the bankingsystem repo was intended to be made public

Several repository names also indicated content that might be of high value to a potential attacker, such as the repository named ‘investigation’ in the following example:

A repository was located that was titled ‘investigation’

In the process of performing these scans, I was contacted by Andrew Morris from GreyNoise, who almost immediately identified that I was scanning for exposed Git repos. After a brief discussion he kindly provided some historical data about other entities scanning for Git repositories on a large scale. Over the past 90 days, IP addresses geolocated to Korea, the Netherlands, the USA, China, Russia and Spain have sent GET requests for Git config files to large portions of the internet. Interestingly, 3 addresses with an RDNS pointing to the NYU Internet Census were tracked collecting config files directly. 22 unique IP addresses were tracked potentially scanning for repositories (excluding my own), with a range of different user agents. This information shows that there are definitely entities out there that already have this information or are already looking for it.

During the process of parsing the collected config files, I have independently contacted organisations that had particularly sensitive information exposed, or where I have contacts that would take the matter seriously. Given the substantial volume of data, it won’t be possible to contact all affected hosts.

What is the solution to this problem?

Simple — don’t expose any Git repository files. Even if you believe your Git repository does not contain sensitive information, you must ensure that none of the files in the .git folder are publicly exposed. How to do this will depend on the hosting server, but keep in mind that simply disallowing directory indexing is not enough. Some more guidance depending on the platform is provided below.

Apache:

Include this line in the Apache config:

RedirectMatch 404 /\.(svn|git|hg|bzr|cvs)(/|$)
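To see which request paths a rule of this kind covers, the pattern can be tried against a few sample paths. The sketch below uses the pattern as it is conventionally written, with a single backslash escaping the dot, and runs it through Go's regexp package rather than Apache itself, so it is only an approximation of what the server does; the sample paths are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// blocked is the RedirectMatch pattern with a single backslash before the
// dot. Go's RE2 engine and Apache's regex engine agree on this simple syntax.
var blocked = regexp.MustCompile(`/\.(svn|git|hg|bzr|cvs)(/|$)`)

func main() {
	// Hypothetical request paths, including a repository in a subdirectory.
	paths := []string{
		"/.git/config",
		"/dev/.git/index",
		"/.git",
		"/.svn/entries",
		"/.gitignore",
		"/index.html",
	}
	for _, p := range paths {
		fmt.Printf("%-18s blocked=%v\n", p, blocked.MatchString(p))
	}
}
```

Because the pattern is not anchored to the start of the path, repositories in subdirectories such as /dev/.git are covered as well. Files like /.gitignore are not matched, since the directory name must be followed by a slash or the end of the path.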

IIS:

Use the hidden segment feature in the web.config file:

<configuration>
  <system.webServer>
    <security>
      <requestFiltering>
        <hiddenSegments>
          <add segment=".git" />
          <add segment=".svn" />
        </hiddenSegments>
      </requestFiltering>
    </security>
  </system.webServer>
</configuration>

Conclusion

Exposed Git repositories have been around for a long time and the risks they pose are unlikely to be addressed without more awareness and action amongst administrators and the security community. We should place greater emphasis on ensuring development artifacts don’t enter production systems, given the sensitive information they can expose and the relatively simple techniques available for collecting it.
