Two years ago, a few colleagues (shoutout to helloarbit, travismcpeak, and coffeetocode) and I were talking about supply chain attacks, which led to this work. A supply chain attack targets a company's dependencies in the hope of leveraging a weakness in the supply chain to damage the target company. For a company that produces software, a supply chain attack typically targets the software used to develop the product.
With this in mind, the software dependencies are packages/libraries in popular languages like Python, Java, Ruby, and Golang. Each language has its own way of packaging and retrieving dependencies, which over the years has led to some ecosystems having more proactive measures against certain types of supply chain attacks than others.
I decided to take a look at what typosquatting in Python looks like, and first looked at using Levenshtein distance between two package names to determine whether one is a typosquat of the other. This can be useful for detecting a package that has already been squatted, and as a tool to prevent packages from being squatted. With this in mind, I wanted to see how many of the top installed Python packages could be squatted by removing underscores (_) or dashes (-).
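As a rough illustration of the distance check described above, here is a minimal sketch of Levenshtein distance in plain Python. The helper name looks_like_typosquat and the threshold of 2 are assumptions made for this example, not values taken from the original project.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def looks_like_typosquat(candidate: str, known: str, max_distance: int = 2) -> bool:
    """Hypothetical helper: flag names that are close to, but not equal to, a known name."""
    distance = levenshtein(candidate.lower(), known.lower())
    return 0 < distance <= max_distance


print(levenshtein("pythonjsonlogger", "python-json-logger"))
print(looks_like_typosquat("pythonjsonlogger", "python-json-logger"))
```

Dropping both dashes from python-json-logger costs two deletions, so a threshold of 2 would catch it; a real detector would need to tune that threshold against false positives on short names.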
One of the most fascinating pieces for me is that the Python Package Index (PyPI) makes data available about each Python package. This data can be very powerful when determining which packages are the most popular in the Python ecosystem. For more on analyzing PyPI downloads, check out their guide here.
The goal of this exercise was to understand how many packages could be squatted and, if possible, to prevent future squats by registering the names myself. Using the data mentioned above, I ran a query to find the top 10,000 Python packages installed using the package installer pip. Having a list of the top 10,000 pip installed packages allowed me to come up with a list of packages that could potentially be squatted by removing any _ or - from the package name, as mentioned above.
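The candidate list described above could be built along these lines. This is only a sketch: the function names are hypothetical, the input list stands in for the real top 10,000 query result, and it only checks for collisions within that list, whereas a real run would also need to check whether the stripped name is already registered on PyPI.

```python
import re


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def squat_candidates(top_packages: list[str]) -> dict[str, str]:
    """Map each squattable name (separators removed) to the real package name."""
    taken = {normalize(name) for name in top_packages}
    candidates = {}
    for name in top_packages:
        lowered = name.lower()
        stripped = re.sub(r"[-_]", "", lowered)
        # Skip names with no separators, and names that already exist in the list.
        if stripped != lowered and normalize(stripped) not in taken:
            candidates[stripped] = name
    return candidates


# Placeholder input; the real list came from the PyPI download data.
sample = ["python-json-logger", "requests", "typing_extensions"]
print(squat_candidates(sample))
```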
I wanted to squat the potential packages so I needed to come up with a strategy for squatting them. There were a few options to choose from:
- Clone the existing packages, change the name to the squatted name and register them.
- Register the packages with a package that does nothing.
- Register the packages with a package that could potentially educate the person installing the squatted package.
Looking at the different options, I chose number three. The goal of the project has always been to be a “Guardian”, and the other options seemed less than ideal for achieving that goal. Number two would have left developers wasting a lot of time trying to understand why their software doesn’t work after installing the squatted package instead of the real one. Number one seemed shady and could lead people to believe this project had malicious plans for the future.
Initially I decided to squat around 1,100 or so packages. In order to do this, I first needed to create the package to push to PyPI and then create some automation around this to make squatting this many packages achievable in a short amount of time.
I decided to create a simple package that fails when installed and prints an error message telling you the name of the package you probably meant to install. Below is an example of what happens when you try to install pythonjsonlogger instead of the real package, python-json-logger:
pip install pythonjsonlogger
Downloading pythonjsonlogger-0.1.1.tar.gz (1.3 kB)
Building wheels for collected packages: pythonjsonlogger
Building wheel for pythonjsonlogger (setup.py) ... error
ERROR: Command errored out with exit status 1:
Complete output (29 lines):
File "/private/var/folders/cy/kc766fxx37b5rf87qxkt8hj00000gp/T/pip-install-kp4jab6f/pythonjsonlogger/setup.py", line 20, in run
raise Exception("You probably meant to install and run python-json-logger")
Exception: You probably meant to install and run python-json-logger
ERROR: Failed building wheel for pythonjsonlogger
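The pip output above suggests a setup.py whose install step raises an exception. A minimal sketch of how such a file might be generated for many names at once is shown below. The template, the AbortInstall class name, and the description text are assumptions for illustration; only the error message and the 0.1.1 version come from the output above.

```python
from string import Template

# Hypothetical template for the placeholder package's setup.py.
SETUP_PY_TEMPLATE = Template('''\
from setuptools import setup
from setuptools.command.install import install


class AbortInstall(install):
    """Fail the install and point the user at the real package."""

    def run(self):
        raise Exception("You probably meant to install and run $real_name")


setup(
    name="$squat_name",
    version="0.1.1",
    description="Placeholder registered to prevent typosquatting of $real_name",
    cmdclass={"install": AbortInstall},
)
''')


def render_setup_py(squat_name: str, real_name: str) -> str:
    """Return the setup.py source for one squatted name."""
    return SETUP_PY_TEMPLATE.substitute(squat_name=squat_name, real_name=real_name)


print(render_setup_py("pythonjsonlogger", "python-json-logger"))
```

Rendering one file per entry in the candidate list, then building and uploading each source distribution, is one way the registration could be automated; the actual automation used for this project is not shown in this post.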
Now that I had the package set up and the automation written, I kicked off the squatting and sat back. With over 1,100 packages registered, I needed to wait and see whether anyone would actually install them by accident. What I found over the next two years is, in my opinion, the most interesting piece of this project.
I originally meant to do a post on this after 6 months or a year, but here we are two years later and I have some data and stories to share. I’ll start with the data on 1,131 packages, and end with the stories.
Top 10 squatted package downloads from July 16, 2018 until August 4, 2020:
In a little over two years there have been 530,950 total pip install commands run on 1,131 packages! This does not include any mirrors or internal package registries that have cloned these packages privately. Malicious packages on PyPI have been known to steal credentials stored on the local file system, such as SSH credentials in ~/.ssh/, GPG keys, or perhaps AWS credentials stored in ~/.aws/credentials. If these typosquat packages had been written with malicious intent, and we assume one machine per install, as many as 530,950 machines could have been compromised over the two-year period.
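To put the totals above in perspective, here is a quick back-of-the-envelope calculation using only the figures and dates quoted in this post. These are plain averages; the per-package breakdown is not reproduced here, and downloads are unlikely to be spread evenly across packages.

```python
from datetime import date

TOTAL_INSTALLS = 530_950   # total pip installs reported above
PACKAGES = 1_131           # number of squatted packages
START = date(2018, 7, 16)  # start of the reporting window
END = date(2020, 8, 4)     # end of the reporting window

days = (END - START).days
installs_per_day = TOTAL_INSTALLS / days
installs_per_package = TOTAL_INSTALLS / PACKAGES

print(f"{days} days in the window")
print(f"about {installs_per_day:.0f} installs per day")
print(f"about {installs_per_package:.0f} installs per package on average")
```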
While the data is incredibly interesting and you can draw your own conclusions about what could have happened if these were malicious packages, I find the encounters and stories the most interesting.
Over the two years I had the following encounters:
- Help installing my package
- Thanks for protecting the Python community
- Text from a friend asking if I owned a certain package
- Researchers finding my work
- Company asking to confirm the license on one of my squatted packages
Help installing my package
Over the course of the last two years I have received numerous emails asking for help installing my package or reporting that my package is broken. Each time I’d simply reply with the correct package they should install.
Thanks for protecting the Python community
A few times I received emails from people who had installed my squatted package accidentally and either read the error printout or saw my package registration clearly stating that it is a package meant to prevent exploitation.
In a few cases it was both a researcher finding my work and thanking me.
I am working on a project for my security class related to attacks on the Python ecosystem. We kept stumbling upon your packages while trying to identify typo-squatting attempts.
I just thought I would say thanks for helping out the community! :)
Text from a friend asking if I owned a certain package
I told a few friends about this work and I don’t quite remember how the interaction went down, but it was something along the lines of:
Friend: Hey! Do you own pythonjsonlogger?
Me: Yeah why?
You know who you are :)
Researchers finding my work
I have really enjoyed each occurrence of a researcher finding my work because it typically involved a conversation around typosquatting. The example above ended up with my work being included in a symposium paper for the University of Maryland CMSC 8180 class called PYed PIPer by Josiah Wedgwood and Aadesh Bagmar.
In the most recent case I met with the creator of pypi-scan, John Speed Myers, on Zoom for about an hour on our work and we talked about potential future collaborations in this area. Most importantly, this conversation got me excited about the topic again and prompted me to finally write this post as well as do another round of squatting 3,000+ more packages to continue to protect the Python ecosystem.
Company asking to confirm the license on one of my squatted packages
The one I find most scary is an email from a large company asking to confirm the license on one of my squatted packages for use.
All in all this was a fun project that I never meant to take two years to actually write about. Over the last year or so, I have discussed this work with the folks at PyPI and they will actually be taking ownership of my packages once I can confirm a final list of which packages are squatted versus the real packages I actually contribute to or own.
All language ecosystems are vulnerable to this type of attack, with some making it harder to achieve through things like package namespaces. This work was very targeted and did not expand into squatting names that projects might adopt as they evolve. Supply chain security is very difficult and has been challenging companies with deep pockets for many years.
Most importantly, THANKS to the folks at PyPI for what they do in making Python packages available to enable folks to develop each day!