Addressing Python Dependency Confusion at Pinterest
Bill Prin | Software Engineer, Engineering Productivity, Python; Devin Lundberg | Software Engineer, Security Lead; and Adam Berry | Software Engineer, Engineering Productivity
Software supply chain security is an incredibly important topic in today’s world. In May 2021, an American oil pipeline system fell victim to a cyberattack and, with the assistance of the FBI, paid over $4.4 million in ransom to the attackers in order to recover its systems.
As a result, US President Joe Biden was forced to declare a state of emergency, and on May 12 issued Executive Order 14028 increasing software security standards for software supply chains. According to VentureBeat, software supply chain attacks increased 300% in 2021.
Earlier that year, security researcher Alex Birsan wrote a viral piece describing how he hacked Apple, Microsoft, and a dozen other companies using a variant of software supply chain exploits called “dependency confusion”.
As a software engineer, you will frequently find useful software distributed online for free on package repositories such as Python’s PyPI or Node.js’s npm. These open source packages can save a ton of time, but downloading code from the internet introduces a multitude of attack vectors.
With dependency confusion, a hacker can give their malicious package the same name as a genuine package and upload it to a public repository in the hope that it’s accidentally downloaded. If this happens, the attacker gains arbitrary code execution, potentially allowing them full access to sensitive data or enabling them to damage the production services that power sites like Pinterest. Pinterest takes security extremely seriously, and mitigating these attacks has become a top priority for the company. Pinterest allocated both internal security experts and third-party security researchers to audit our risk of dependency confusion.
One of the first actions we took to prevent dependency confusion was to verify all our dependencies were “pinned” to a specific version. For example, instead of accepting any version of the “requests” library, the requirement is specified as “requests==2.27.1” in the “requirements.txt” file. Pinning is a hugely important first step because it keeps you from accidentally downloading a malicious version of a package. It is necessary for preventing Python dependency confusion attacks, but it is not sufficient on its own.
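As a rough illustration of the pinning check described above, the sketch below flags requirement lines that are not pinned with `==`. The function name `find_unpinned` and the sample requirements are hypothetical, and the regular expression is a deliberately simple approximation of the requirement syntax, not a full parser.

```python
import re

# Naive pattern: "name==version", optionally with extras such as "requests[socks]".
# Wildcards like "==2.*" are intentionally not treated as pinned.
PINNED = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[0-9A-Za-z.!+_-]+"
)

def find_unpinned(requirements_text):
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw in requirements_text.splitlines():
        # Drop comments and a trailing line-continuation backslash.
        line = raw.split("#", 1)[0].strip().rstrip("\\").strip()
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        spec = line.split(";", 1)[0].strip()  # ignore environment markers
        if not PINNED.fullmatch(spec):
            unpinned.append(line)
    return unpinned

# Hypothetical requirements.txt contents.
sample = """
requests==2.27.1
flask>=2.0
numpy
"""
print(find_unpinned(sample))
```

A check like this could run in continuous integration so that an unpinned requirement fails the build before it reaches production.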
The Python programming language is especially vulnerable to supply chain attacks for a few reasons.
- There is a culture in Python of code reuse. Because of this, people frequently want to add new libraries, often created by unknown third-party developers.
- The most popular package manager, pip, is easy to misconfigure in a way that leaves your organization vulnerable to dependency confusion. pip is an incredibly valuable tool run by a team made up mostly of volunteers. They do a fantastic job in general, which is why the tool is so popular, but it does have this weakness.
The dangers of “pip --extra-index-url”
One major issue that put us at risk of dependency confusion was that our Python “pip” configuration used multiple index endpoints through the `--extra-index-url` flag. Some Pinterest Python artifacts were stored on our own custom repository, open-sourced as Pinrepo, and others were stored in JFrog’s Artifactory.
There is a major danger in the usage of the `--extra-index-url` flag: it will not honor any sort of priority ordering. This has been extensively discussed on GitHub and Stack Overflow. The short summary is that the volunteer team that manages the pip open-source project does not consider repository index prioritization within the scope of the pip tool. They instead recommend using a single server endpoint that manages priorities on the backend.
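One way to find this risky configuration is to scan requirements files and pip config files for the option. The sketch below is a minimal example under assumptions: `find_extra_index_lines` is a hypothetical helper, the URLs are placeholders, and it only looks at the file text it is given (it does not check the `PIP_EXTRA_INDEX_URL` environment variable or command-line usage).

```python
import re

# Matches the flag form used in requirements files ("--extra-index-url ...")
# and the key form used in pip config files ("extra-index-url = ...").
EXTRA_INDEX = re.compile(
    r"^\s*(--extra-index-url\b|extra-index-url\s*=)", re.IGNORECASE
)

def find_extra_index_lines(text):
    """Return (line_number, line) pairs that configure an extra index."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), 1)
        if EXTRA_INDEX.match(line)
    ]

# Hypothetical file contents with placeholder URLs.
sample_requirements = (
    "--index-url https://pypi.org/simple\n"
    "--extra-index-url https://internal.example/simple\n"
    "requests==2.27.1\n"
)
sample_conf = "[global]\nextra-index-url = https://internal.example/simple\n"

print(find_extra_index_lines(sample_requirements))
print(find_extra_index_lines(sample_conf))
```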
Here is a diagram showing what the attack looks like.
Again, keep in mind, pip does not let you set which backend index to prioritize. Therefore, despite the version being pinned, it’s still possible for pip to accidentally download the malicious package, because an attacker can publish a package with the same name and version on the public index. One remediation would be to “squat” every internal package name with an empty package on PyPI. However, the PyPI organization does not endorse this approach and has removed empty packages.
Fortunately, Artifactory supports the concept of virtual repositories, which let you use a single endpoint that then “virtually” dispatches to an appropriate backend repository, including features like prioritization. That means you can configure Artifactory to always prioritize internal repositories, reducing the risk of an external package accidentally getting downloaded instead.
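To make the problem concrete, here is a toy model of the situation, not pip’s actual resolver. It assumes two indexes represented as plain dictionaries and a hypothetical internal package named `pinterest-utils`. The point is only that when candidates from every index are pooled, a pin on name and version cannot tell the genuine file from the attacker’s.

```python
def matching_candidates(indexes, name, pinned_version):
    """Collect (index_name, version) pairs from every index that satisfy the pin."""
    matches = []
    for index_name, packages in indexes.items():
        for version in packages.get(name, []):
            if version == pinned_version:
                matches.append((index_name, version))
    return matches

# Hypothetical index contents: the attacker has uploaded the same name and
# version to the public index.
indexes = {
    "internal": {"pinterest-utils": ["1.0.0", "1.2.0"]},
    "public": {"pinterest-utils": ["1.2.0"], "requests": ["2.27.1"]},
}

candidates = matching_candidates(indexes, "pinterest-utils", "1.2.0")
sources = sorted({index_name for index_name, _ in candidates})
print(sources)  # ['internal', 'public']: both indexes satisfy the pin
```

Because both sources satisfy `pinterest-utils==1.2.0`, nothing in the requirement itself says which one should win.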
At Pinterest, we invested in moving all our packages to repositories that are compatible with Artifactory virtual endpoints. We then set up the virtual endpoint to delegate to backend package indexes in correct priority order, always prioritizing internal packages. By switching to a single endpoint on Artifactory, we can prioritize where pip looks for packages, and always prefer the internal version.
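The prioritization idea can be sketched as a first-match lookup over an ordered list of backends. This is a simplified model and not Artifactory’s actual resolution logic; the repository names `internal-python` and `pypi-remote` and the package name `pinterest-utils` are hypothetical.

```python
def resolve(repositories, name):
    """Return the first repository, in priority order, that has the package."""
    for repo_name, packages in repositories:
        if name in packages:
            return repo_name
    return None

# Ordered from highest to lowest priority: internal packages always win.
repositories = [
    ("internal-python", {"pinterest-utils"}),
    ("pypi-remote", {"pinterest-utils", "requests"}),
]

print(resolve(repositories, "pinterest-utils"))  # internal-python
print(resolve(repositories, "requests"))         # pypi-remote
```

With the ordering enforced on the server side, the client only ever talks to one endpoint and cannot be tricked by a same-named package further down the list.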
Other Important Remediation Steps
There are a few other important remediation steps. One of the biggest is to tie dependency installation to a hash of the contents rather than just the name and version of the package. That way, even if the name and version of the package are exactly the same, the installation will fail if the contents differ from what is expected. This can be done with tools like pip-tools or Poetry.
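The core of hash checking is small enough to sketch with the standard library. The example below is an illustration of the idea only, using made-up byte strings in place of real package files and a hypothetical helper `verify_sha256`; in practice the comparison is performed by the installer against hashes recorded in the requirements or lock file.

```python
import hashlib

def verify_sha256(data: bytes, expected_hex: str) -> bool:
    """Return True only if the SHA-256 digest of data matches the expected value."""
    return hashlib.sha256(data).hexdigest() == expected_hex

# Stand-ins for package file contents (hypothetical).
genuine = b"genuine package contents"
malicious = b"malicious package contents"

# The expected hash is recorded once, from the artifact that was reviewed.
expected = hashlib.sha256(genuine).hexdigest()

print(verify_sha256(genuine, expected))    # True
print(verify_sha256(malicious, expected))  # False
```

Even when the attacker matches the name and version exactly, the digest of their file differs, so the check fails.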
Another recommendation we are in the process of implementing is to name every internal package with the same prefix, such as pinterest-*. That way, we can code logic to never download packages with that prefix from external sources.
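A minimal sketch of such prefix logic follows, assuming the `pinterest-` prefix mentioned above and a hypothetical function name `allowed_from_external`. Names are normalized first (lowercased, with runs of `-`, `_`, and `.` collapsed to `-`, as PEP 503 specifies) so that variants like `Pinterest_Utils` cannot slip past the check.

```python
import re

INTERNAL_PREFIX = "pinterest-"  # assumed naming convention for internal packages

def normalize(name):
    """Normalize a project name following PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()

def allowed_from_external(name):
    """Return False for any package whose name carries the internal prefix."""
    return not normalize(name).startswith(INTERNAL_PREFIX)

print(allowed_from_external("requests"))         # True
print(allowed_from_external("Pinterest_Utils"))  # False
```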
At Pinterest, we’ve made significant improvements to our security infrastructure that will help prevent the type of software supply chain attacks we’ve seen emerge. Still, given the amount of code that’s frequently downloaded and run from the internet, this is a massive attack surface that every company should invest in hardening. Software supply chain attacks are a serious threat to both governments and businesses, and it’s an area that Pinterest engineering will continually monitor and invest in preventing.
Acknowledgements & References
We want to acknowledge the efforts of Josh Koza and the Product Security team, Jasmine Qin and from the Continuous Integration & Testing Team, and Kynan Lalone and Ruth Grace Wong in Cloud Architecture. We are also looking for individuals to help bootstrap our Engineering Productivity: Code & Language Runtime team, who are focused on solving similar problems for languages such as Java, Python, Node, C++ and Go.
One of the most informative blog posts on the topic of dependency confusion was from Twilio: Dependencies, Confusions, and Solutions: What Did Twilio Do to Solve Dependency Confusion. At Pinterest engineering, we found Twilio’s write-up to be highly informative, and we strongly recommend that any company working to prevent dependency confusion review it.
Other useful references on work going on in industry in this space are Google’s SLSA framework, which you can read about in Google’s blog post Introducing SLSA, an End-to-End Framework for Supply Chain Integrity, and the Sigstore project, which aims to push forth better industry standards for signing and verifying software.