Open-sourcing Pinrepo, the Pinterest artifact repository
Baogang Song | Pinterest engineering lead, Cloud Engineering
At Pinterest we practice Continuous Integration religiously. We build every code commit in mainline, which in turn produces tons of build artifacts every day. We need not only to store these artifacts reliably, but also to serve them efficiently and with consistent performance, so that we can scale the engineering team and keep developers productive. Today, we’re open-sourcing our work with the release of Pinrepo, a highly scalable solution for storing and serving build artifacts (binary files and metadata produced by a build process) such as Debian packages, Maven jars and PyPI packages. Pinrepo is:
- Simple: publish and store build artifacts in AWS S3 and serve them through an nginx reverse proxy
- Extensible: easily add support for other formats such as RPM
- Reliable: built on a highly available nginx cluster and the AWS S3 service
- Scalable: the nginx layer scales horizontally, and the AWS S3 backend is highly scalable
- DevOps-friendly: has been running in production for 8 months with virtually no maintenance
Overcoming scaling challenges
The biggest challenge Pinrepo helped us solve was scalability: serving large amounts of data efficiently over time. For instance, scaling became an issue when we were building a major 36MB Python package 50 times a day (1.8GB per day, or 162GB over three months). We’d used different solutions before, but their performance generally deteriorated over time as the number of objects grew.
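The storage figures above follow from simple arithmetic, sketched here in Python. The package size and build rate come from the text; reading “3 months” as 90 days and “M”/“G” as decimal megabytes and gigabytes are assumptions made for this illustration.

```python
# Back-of-the-envelope storage growth for one frequently built package.
PACKAGE_SIZE_MB = 36   # size of a single build, from the text
BUILDS_PER_DAY = 50    # builds per day, from the text
RETENTION_DAYS = 90    # "3 months", approximated as 90 days (assumption)


def storage_gb(size_mb: float, builds_per_day: int, days: int) -> float:
    """Total storage in GB (decimal, 1 GB = 1000 MB) accumulated over `days`."""
    return size_mb * builds_per_day * days / 1000


daily_gb = storage_gb(PACKAGE_SIZE_MB, BUILDS_PER_DAY, 1)
quarterly_gb = storage_gb(PACKAGE_SIZE_MB, BUILDS_PER_DAY, RETENTION_DAYS)

print(f"{daily_gb:.1f} GB/day, {quarterly_gb:.0f} GB per {RETENTION_DAYS} days")
```

This covers a single package; every additional package built at a similar rate adds to the total.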
In addition to scaling, there were also service availability and data durability concerns. The solutions we used previously were based on a single host and disk, and they were slow and crashed often. While there were backup processes, they often broke without notice. Maintaining these services was a nightmare. We needed a reliable service on top of a durable, highly-scalable object storage.
We considered using Amazon S3 directly, but found we couldn’t use it as-is to store and serve build artifacts, because the repository clients can’t talk to S3 unless the bucket is configured as publicly accessible. Requests to a non-public bucket need to be “signed” first, which involves calculating an HMAC over some of the elements of the request and attaching it to the request.
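To make the signing step concrete, here is a minimal Python sketch of a signed S3 GET. It assumes the legacy AWS Signature Version 2 scheme (HMAC-SHA1 over a short “string to sign”); the text does not say which signature version Pinrepo uses. The function name `sign_s3_get` and the credentials, bucket and key in the usage lines are hypothetical.

```python
import base64
import hashlib
import hmac
from email.utils import formatdate


def sign_s3_get(access_key, secret_key, bucket, key, date=None):
    """Return the Date and Authorization headers for a signed S3 GET.

    Assumes AWS Signature Version 2 and an object key that needs no
    URL-escaping; a real client would also handle Content-MD5,
    Content-Type and x-amz-* headers, which are left empty here.
    """
    date = date or formatdate(usegmt=True)  # RFC 1123 date, e.g. "... GMT"
    # Verb, Content-MD5, Content-Type, Date, canonicalized resource.
    string_to_sign = f"GET\n\n\n{date}\n/{bucket}/{key}"
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.b64encode(digest).decode("ascii")
    return {"Date": date, "Authorization": f"AWS {access_key}:{signature}"}


# Hypothetical credentials and object path, for illustration only.
headers = sign_s3_get(
    "AKIDEXAMPLE", "not-a-real-secret", "my-artifacts", "pool/foo_1.0.deb",
    date="Tue, 27 Mar 2007 19:36:42 GMT",
)
print(headers["Authorization"])
```

Off-the-shelf repository clients such as apt, Maven or pip generally issue plain HTTP requests and don’t perform this step themselves, which is the gap described above.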
The problem could be solved by simply adding a cluster of nginx servers in front of AWS S3. We could in theory “fix” all the repository clients so they could talk to S3 directly, but that would require a lot of work and be hard to maintain and to extend to other artifact formats. The nginx layer, on the other hand, could not only sign the requests and forward them to AWS, but also provide a local cache for artifacts, which is a huge performance improvement. The nginx layer is stateless and can easily be scaled horizontally, so it solved the availability problem as well. Plus, there’s no need for an extra backup process.
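The caching idea can be illustrated with a small read-through cache in Python. This is only a sketch of the concept, not how nginx implements its proxy cache: the class name `ReadThroughCache`, the in-memory LRU eviction and the dictionary standing in for S3 are all assumptions made for the example.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve artifacts from a local cache, fetching from the backend on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], max_entries: int = 128):
        self._fetch = fetch            # called only on a cache miss
        self._max_entries = max_entries
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> bytes:
        if path in self._entries:
            self._entries.move_to_end(path)  # mark as most recently used
            self.hits += 1
            return self._entries[path]
        self.misses += 1
        data = self._fetch(path)             # e.g. a signed request to S3
        self._entries[path] = data
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return data


# A dictionary stands in for the S3 backend in this illustration.
fake_s3 = {"pool/foo_1.0.deb": b"deb-bytes", "pool/bar_2.0.deb": b"more-bytes"}
cache = ReadThroughCache(fake_s3.__getitem__, max_entries=1)

cache.get("pool/foo_1.0.deb")   # miss: fetched from the backend
cache.get("pool/foo_1.0.deb")   # hit: served locally
cache.get("pool/bar_2.0.deb")   # miss: evicts the first entry
print(cache.hits, cache.misses)
```

Because each proxy node only holds a cache that can be rebuilt from S3, nodes can be added or replaced without migrating state.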
Our answer: Pinrepo, our internal artifact repository.
We created a cluster of nginx servers behind a load balancer, proxied all the requests to the backend AWS S3 service, and used the nginx module ngx_set_misc to create the AWS service request signatures.
Publishing Artifacts directly to S3
There are many ways to upload artifacts to Amazon S3 and maintain their metadata and layout there, including deb-s3 for Debian packages and the Maven plugin maven-s3-wagon for Maven packages. We couldn’t find any existing tool to upload Python packages to S3, so we wrote our own, pypi-release, to upload and maintain PyPI packages.
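As a rough illustration of the metadata and layout a Python publishing tool has to maintain, the sketch below builds object keys and a per-project index page following the PEP 503 “simple repository” convention. How pypi-release actually lays out objects is not described here, so the `simple/` prefix and the helper names `normalize`, `object_key` and `project_index` are assumptions; the upload itself is omitted.

```python
import html
import re


def normalize(name: str) -> str:
    """Normalize a project name as PEP 503 specifies."""
    return re.sub(r"[-_.]+", "-", name).lower()


def object_key(project: str, filename: str, prefix: str = "simple") -> str:
    """Object key under which a package file would be stored (assumed layout)."""
    return f"{prefix}/{normalize(project)}/{filename}"


def project_index(filenames) -> str:
    """Render the index page that lists every file of one project."""
    links = "\n".join(
        f'    <a href="{html.escape(name, quote=True)}">{html.escape(name)}</a><br/>'
        for name in sorted(filenames)
    )
    return f"<!DOCTYPE html>\n<html>\n  <body>\n{links}\n  </body>\n</html>\n"


# Hypothetical project and files, for illustration only.
files = ["my_pkg-1.1.tar.gz", "my_pkg-1.0.tar.gz"]
print(object_key("My_Pkg", files[0]))
print(project_index(files))
```

A publishing tool following this convention would upload the package file and then regenerate and upload the index page, so that pip pointed at the repository can discover the new version.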
Simple Is Better
By publishing build artifacts directly to Amazon S3, storing them there, and fronting S3 with an nginx cluster, we simplified the technical stack and achieved scalability and durability at the same time. We’ve been running Pinrepo in production for more than eight months with consistent performance and virtually no maintenance.
A simple UI with search support could be a useful addition, though various S3 tools can already be used to find out what’s available in the repositories. Take caution when adding extra features, and keep them out of the critical path so as not to jeopardize the reliability and scalability of the existing simple solution. A staging repository and package promotion would also be nice to have, though they aren’t necessary if you have a staging deployment and a pre-production test environment. Pinrepo could also easily be extended to support more package formats such as RPM.
Check out Pinrepo on GitHub, and let us know what you think!
Baogang Song is an engineering lead on the Internal Development Tools team, part of the Cloud Engineering team at Pinterest.
Acknowledgements: Thanks to Raj Patel, who helped evaluate various existing solutions and formalize the process; Jayme Cox for reviewing the design and advising on implementation details; and to all of the members of the Cloud Engineering team at Pinterest, which drives reliability, speed, and security for the site, and builds the technical building blocks for Cloud Infrastructure and developers.