Detecting Cyber Attacks in the Python Package Index (PyPI)
13 Oct 2018
There are malicious packages lurking in the Python Package Index (PyPI) repository. Using a custom-written automatic scanning tool, I was able to identify eleven different malicious packages based on the content of their installer scripts. Many of these malicious packages were typo-squatting legitimate packages, creating a real possibility that programmers would inadvertently execute malicious code on their machines. Future work may identify further malicious packages in PyPI.
In the Fall of 2017 I was looking for a research project in the information security field that would also include aspects of software engineering. At the time SKCIRT (Slovakia’s National Security Authority) had just published an advisory on 10 Python packages that were typo-squatting popular packages on PyPI (see: SKCIRT Advisory and Ars Technica Article). As I read through the report, I was surprised by how easy it is to pull off this type of attack and wondered whether there was an easy way to detect it. However, before I dive into how to potentially detect this type of attack, let’s look at how this attack works.
Anatomy of the Attack
A typo-squatting attack proceeds as follows:
- The attacker creates a fake Python package with a name similar to an existing package.
- The attacker adds malicious code to the setup.py file of the Python package. The setup.py file is executed when the package is installed.
- The attacker uploads the package to PyPI and waits for victims to install it.
- When a victim installs the package using “pip install” the malicious code in setup.py executes.
Besides setup.py, the malicious code could also be added to the functional code of the package itself. This is more difficult to execute, since the attacker has to understand the code in the package in order to add the malicious code without breaking the package's functionality.
Most attacks on PyPI so far have involved setup.py (or code called from setup.py). However, earlier this year a package called ssh-decorate was modified to steal ssh credentials as part of the functional code of the package (see: ssh-decorate incident).
Detecting the Attack
The Python community has mostly focused on prevention techniques, such as checking for and blocking typo-squatted package names. Package signing is not a good option for preventing this type of attack: it verifies only the identity of a package author and says nothing about whether that author, even one with a verified identity, has malicious intent. An author reputation system (centralized or crowd-based) could possibly be added to the signing process to mitigate attacks.
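To make the name-checking idea concrete, here is a minimal sketch of a similarity check against a list of well-known package names. The list, the threshold, and the function name are assumptions chosen for illustration; they are not what PyPI actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical sample of popular names; a real check would use a far larger list.
POPULAR_PACKAGES = ["django", "requests", "urllib3", "numpy", "setuptools"]


def possible_typosquat(candidate, known=POPULAR_PACKAGES, threshold=0.8):
    """Return the known names that `candidate` closely resembles without matching."""
    cand = candidate.lower()
    hits = []
    for name in known:
        if cand == name.lower():
            # Exact match: this is the known package itself, not a look-alike.
            continue
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        if SequenceMatcher(None, cand, name.lower()).ratio() >= threshold:
            hits.append(name)
    return hits
```

With this threshold (an assumption, not a tuned value), a name such as "djanga" is flagged as resembling "django", while "django" itself and an unrelated name such as "flask" are not. A check like this only covers names; it says nothing about what the package's code does.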
Despite these prevention efforts, bad actors still manage to publish their malicious code in the repository. At this point, it must be detected and removed. Detection efforts so far have been primitive, based on an ad-hoc detect-and-report system. If a developer stumbles across bad code, the PyPI administrators are notified and the package is removed.
For my research I looked at automated detection techniques rather than prevention. The automated detection techniques can be used in conjunction with the existing prevention techniques to provide multiple layers of defense against these attacks. The question on my mind was: “If there was a malicious Python package on PyPI, is there an automated way to tell if it is malicious (preferably without installing it)?” My project was to author such an automated detection tool.
There are two broad approaches to authoring such an automated tool:
- Dynamic Analysis: Install the package in a sandbox and look for indicators of malicious code.
- Static Analysis: Analyze the code without executing it to check for indicators of malicious code.
Using dynamic analysis (executing the install in a sandbox) is likely more accurate, but slower. It also requires very careful setup to ensure no sandbox escapes. My preference was to develop a tool that could be run and within a couple of seconds provide an evaluation of potentially malicious code in a Python package.
I decided to implement the detection tool using static code analysis, since dynamic analysis cannot provide the required performance. Certainly, there are ways a bad actor could purposefully avoid the patterns used in static code analysis, but doing so would make it more difficult for bad actors to include malicious code in a setup.py script.
The tool would also provide a first measurement of how prevalent malicious code is in the Python repository. Because the packages in PyPI predate any automated detection mechanism, the initial scan offered valuable insight into how much interest bad actors have in typo-squatting attacks.
Static Code Analysis Detection Strategy
The main pattern used for detecting malicious code in the Python installer code (setup.py) is code that attempts to establish an outbound network connection. Most malicious code attempts to exfiltrate data, check in with “command and control”, or both. Both of those operations generally require an outbound connection.
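As a rough illustration of this pattern, the sketch below parses installer source with Python's ast module and reports imports of network-capable modules, without executing the code. The module list and function name are assumptions for illustration, not the actual rules of my tool.

```python
import ast

# Assumed list of modules that commonly signal network activity.
NETWORK_MODULES = {"socket", "urllib", "urllib2", "http", "httplib",
                   "requests", "ftplib", "smtplib"}


def find_network_imports(source):
    """Return sorted (line, module) pairs for imports of network-capable modules."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x".
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare on the top-level package, so "urllib.request" matches "urllib".
            if name.split(".")[0] in NETWORK_MODULES:
                hits.append((node.lineno, name))
    return sorted(hits)


# The sample is only parsed, never run.
SAMPLE_SETUP = """\
from setuptools import setup
import os, socket
from urllib.request import urlopen

setup(name="example")
"""
```

Calling find_network_imports(SAMPLE_SETUP) reports the socket import on line 2 and the urllib.request import on line 3. An import alone does not prove that a connection is made, so a hit is only a signal for closer inspection. Note also that ast.parse raises SyntaxError on source written for a different Python major version, which a real scanner would need to handle.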
My assumption was that there should not be many legitimate packages that make outbound connections as part of the installation process. Although Python packages occasionally (and legitimately) download additional resources as part of the install process, these are in the minority. Detecting outbound network connections provides an efficient way to find malicious packages, albeit with some false positives that can be dismissed through manual inspection.
The detection tool was implemented in Python and uses the Abstract Syntax Tree (AST) library to parse Python source code. The main patterns the tool looks for in the source code are outbound network connections, strings executed as code, and obfuscation techniques like base64 encoding.
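The other two patterns, strings executed as code and base64-style obfuscation, can be sketched in the same way by walking the AST and looking at call names. The name sets and labels below are assumptions for illustration and are not taken from the tool itself.

```python
import ast

# Assumed name sets; a real scanner would likely track many more.
DYNAMIC_EXEC = {"exec", "eval", "compile", "__import__"}
DECODERS = {"b64decode", "b32decode", "b16decode"}


def call_name(node):
    """Return the called name for calls like exec(...) or base64.b64decode(...)."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def find_suspicious_calls(source):
    """Return sorted (line, kind, name) findings without executing the source."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if name in DYNAMIC_EXEC:
            findings.append((node.lineno, "dynamic-exec", name))
        elif name in DECODERS:
            findings.append((node.lineno, "obfuscation", name))
    return sorted(findings)


# The sample is only parsed, never run.
SAMPLE_OBFUSCATED = """\
import base64
payload = base64.b64decode("cHJpbnQoJ2hpJyk=")
exec(payload)
"""
```

For SAMPLE_OBFUSCATED this flags the b64decode call on line 2 and the exec call on line 3. Matching on names alone is easy to evade (for example by aliasing a function), so this is a heuristic that raises findings for review rather than a verdict.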
In my initial scan of PyPI (which included approximately 123,000 packages at the time), I detected 11 packages containing malicious code and reported them privately to the PyPI maintainers earlier this year. Judging by the package names, several typo-squatted the popular django package, while several others typo-squatted Python standard libraries.
About half of the 11 packages could be classified as typo-squatting research or experiments; these packages just perform a ping back to a server indicating that the package was installed. Several packages injected code into the .bashrc file when installed on a Linux system. However, the injected code was downloaded from a website that was seemingly abandoned or broken, so I could not determine the purpose of the malware. One package opened a reverse shell through a proxy service, so it was not possible to determine the other end of the connection.
Work continues to improve the tool. Obvious avenues of improvement include detection of malicious code in the functional code of packages and reduction of false positives.