Migration to Artifactory - Managing Distributed Artifacts at Scale

Roman Roberman
AppsFlyer Engineering
Mar 31, 2020 · 11 min read

As a platform engineer at AppsFlyer for more than two years, I’ve been around to witness our growth (or rather hypergrowth) in artifacts over this time. I came into a reality in which my team was responsible for our artifact management, with more than seven different places where artifacts were maintained, each via a different open source solution (jars, gems, pip packages, Docker images, etc.). For quite some time, we handled this just fine.

However, once our engineering organization started to grow, we realized we’d need to rethink how we manage our artifacts at scale. This post will tell the story of how we migrated from public and third-party artifact management to Artifactory to help us improve our reliability, release and security processes, as well as ultimately provide a single source of truth for our artifacts and binaries.

Before the Migration

With the diversity of languages, frameworks, tools, and tech stacks at AppsFlyer, we found ourselves managing many distributed artifact repositories, such as a Docker registry, npm, AWS S3 buckets for Maven, and additional custom and generic third-party repositories, and this was becoming a wild west to manage. However, it wasn’t just the management aspect that created difficulties; third-party dependencies were also complicating deployment processes and causing volatility. For example, artifacts were often unavailable or deleted, since managing them was out of our control when they were hosted on public resources.

On top of this, we started having difficulties around access control (ACLs) and credential inflation, where managing the credentials for the different repositories started to create real overhead and complexity. We were also missing the most basic role-based access control (RBAC) and permission management for each repo. And the plain clutter simply became overwhelming: it was increasingly difficult to keep track of where artifacts were even located and pulled from.

What We Needed

We realized that we needed a centrally-managed, single source for all of our engineering artifacts. We started to define the different requirements we’d expect the solution we’d select to include:

  • High availability, multi-site capabilities, redundancy
  • Support for major artifact repositories (e.g. Docker registry, npm, pip, etc)
  • Fine-grained access control, governance and audit trail

With these we were hoping to achieve improved deployment reliability, a single source of truth that would be agreed upon by our entire engineering organization, and simpler, more streamlined management operations for all of our artifacts.

Tools of the Trade

Once we understood that we were growing and needed a better artifact management system, we realized the DIY approach was no longer scalable and that we’d probably need a tool or platform built with these specific challenges in mind.

After some research, we decided to test drive JFrog’s Artifactory, for the most part because we were familiar with this product from our local ecosystem and previous roles and could minimize the learning curve by choosing this tool. Sonatype’s Nexus was the next option on our list to test drive if we weren’t satisfied with Artifactory.

The checklist of challenges that we needed resolved for the solution to be a good fit included:

  • Redundancy through clustering
  • Network & connectivity issues — we didn’t want to have to depend on remote fetching any longer for our artifacts
  • Role-based access control

In addition, we didn’t want to have too much of a learning curve with introducing a new tool — and that’s why we generally just wanted to maintain our current method of operation, but with improvements to availability, access, and governance.

High Availability

We decided to check the robustness of Artifactory’s high availability through its clustering capabilities. We created two clusters and leveraged JFrog’s Mission Control product. Our architecture was designed as follows:

  • One production cluster in the EU region
  • An additional US-based cluster for replication of the EU cluster
  • Mission Control, which is responsible for creating replications for all created repositories
  • Data recovery through syncing the US cluster’s data directory to an S3 bucket

AppsFlyer Artifactory Architecture

How it Works

Everything that is deployed to the EU cluster is immediately replicated to the US cluster (via Artifactory’s push replication), which has the additional important role of syncing the data to the S3 bucket. This is automatically taken care of by Mission Control, which also sets up a cron-based replication job for every newly created repo, in case the event-driven push replication fails. Mission Control essentially gives us a high-level view of both clusters and replicates all the repos from EU to US automatically.
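
While Mission Control handles this for us, the same per-repository push replication can also be described as code. Below is a minimal, hedged sketch using the Artifactory Terraform provider mentioned later in this post (and assuming the provider itself is already configured); the resource and attribute names are illustrative and may differ between provider versions, and the US endpoint and credentials are hypothetical placeholders.

variable "replication_user" { type = string }
variable "replication_password" {
  type      = string
  sensitive = true
}

# Sketch only: event-driven push replication from the EU cluster to the US
# replica, with a cron fallback, for a single repository.
resource "artifactory_push_replication" "docker_eu_to_us" {
  repo_key                 = "docker-local"  # repository on the EU cluster
  cron_exp                 = "0 0 * * * ?"   # hourly fallback sync (Quartz syntax)
  enable_event_replication = true            # replicate on every push

  replications {
    url      = "https://artifactory-us.example.com/artifactory/docker-local"
    username = var.replication_user
    password = var.replication_password
  }
}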

The US replica is responsible for syncing all the data to an S3 bucket for disaster recovery purposes, and the two clusters also enable rapid recovery in the case of a massive failure, by simply redirecting the DNS record and traffic to the healthy, working cluster.

In the event that both clusters fail simultaneously, there is still the S3 filestore from which a new cluster can be bootstrapped fairly rapidly; bootstrapping a new two-node cluster was clocked at under five minutes.

Dependency Management

Imagine a mission-critical service that depends on an old artifact sitting in a remote repository, and one day, out of the blue, the artifact just gets deleted, or, just as bad, the remote repository that holds it is no longer available or accessible.

Artifactory enabled us to overcome artifact volatility by creating a remote repository that acts as a proxy to the third-party repository you want to access artifacts from. What is really cool is that when you download an artifact through an Artifactory remote repository, it caches the artifact, meaning that for each artifact you only need to go outside your local network once; it is perpetually available thereafter in the Artifactory cache.
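
To illustrate, such a caching proxy can also be declared as code with the Terraform provider mentioned later in this post. This is a hedged sketch (assuming the provider is configured); the repository key and upstream URL are just examples, and attribute names may vary between provider versions.

# Sketch only: a remote repository that proxies a public registry and caches
# every artifact it fetches, so later pulls never have to leave our network.
resource "artifactory_remote_repository" "npm_remote" {
  key          = "npm-remote"                 # repository key in Artifactory
  package_type = "npm"
  url          = "https://registry.npmjs.org" # upstream public registry
}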

Governance and Access Control

Another important feature was the management of the platform itself and setting roles and permissions for the different functions in the engineering organization. Through the unified platform it is now possible to control all of our artifact repositories in one place, as well as provide access by function through the same endpoint.

This is done with the permissions module in Artifactory. Take, for example, the Docker repository, which is our biggest and busiest repository; we separate each team’s images by naming convention, with a per-team path.

Let’s say TeamA has a Docker image that’s located at: artifactory.appsflyer.com:1234/<TeamA>/<image-name>:<tag>

TeamB won’t have access to this Docker image.

A team’s permissions are defined based on the repositories and artifacts it requires access to, and the same naming syntax, along with its access control, is applied to every other team as well.

In this way, TeamA is only able to deploy or delete Docker images under its own path (i.e. only images that are under TeamA’s path).
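
For illustration, here is a hedged sketch of what such a per-team permission could look like as an Artifactory permission target, expressed with the Terraform provider described later in this post. The resource schema, group name, and repository key shown here are assumptions and may differ between provider versions.

# Sketch only: restrict TeamA to its own path inside the shared Docker
# repository. The group name and repository key are hypothetical.
resource "artifactory_permission_target" "team_a_docker" {
  name = "team-a-docker"

  repo {
    repositories     = ["docker-local"] # the shared Docker repository
    includes_pattern = ["TeamA/**"]     # only TeamA's image path

    actions {
      groups {
        name        = "team-a"
        permissions = ["read", "write", "delete"]
      }
    }
  }
}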

Test Drive

First, we wanted to check if Artifactory provides access to all the different repositories we typically use, such as Maven, Docker registry, npm, and our own custom S3 buckets.

Second, we wanted to see whether Artifactory could handle our current traffic and be prepared for growth as well, since our traffic is continuously increasing and we wanted to ensure it could withstand the load.

And finally, we also wanted to check the cluster’s resilience using the following methodology:

  • Test redundancy by pushing to the first node and pulling from a second node, where the expected behavior is that the artifact is immediately available on all nodes within the cluster.
  • Test failure and recovery capabilities by checking what happens when a node fails, i.e. the mean time to recovery (MTTR): how long does recovery take, including bringing the node back up and placing it back in the cluster?

Now that we knew what we wanted to benchmark, we set to work actually performing the PoC. The purpose was to benchmark Artifactory’s performance vs. our old registries.

We conducted the PoC as follows:

Two rounds of pulling from the Artifactory Docker repository

  • First round was 50 instances (for benchmarking)
  • Second was 300 instances (for load testing)

Results were compared to our existing registries

Built-In Integrations

We quickly ticked the first item on our list when we found that Artifactory has out-of-the-box integrations with all of the registries we currently work with, as well as with our custom repositories. It even added registry types that we hadn’t been working with directly and that were now accessible through the platform, such as PyPI.

Testing Resilience

The first part of testing the platform’s resilience was testing the redundancy of the cluster. We first did so within a single cluster, by turning off one node and then checking whether artifacts that were deployed directly to that node were accessible from a different node in the cluster that was still up.

Check: The artifacts were indeed accessible.

We then performed the same test across clusters: we pushed data to the EU cluster, pulled it from the US cluster, and found that:

Check: It too was accessible.

Next, we wanted to benchmark the MTTR of the node we had turned off. When it was turned on again, we found that it took on average ~40 seconds for the dead node to be fully operational again.

50 Concurrent Nodes Against One Registry

50 Concurrent Nodes Against One Artifactory with Local FS Storage

Load Testing the Cluster

The last part of the PoC was to load test a single Artifactory node. We tested this using the Docker registry: we pushed and pulled from 50 concurrent machines (a typical load we could expect in our engineering organization), first against our old registry and then against the Artifactory cluster.

The actions performed against the Artifactory nodes were ~2.5x faster for the three Docker images tested, which ranged from 80 MB to 900 MB.

We then performed the same test with 300 concurrent nodes against each registry, ours and Artifactory’s (this was to see how Artifactory performed with expected growth), and were surprised to discover that Artifactory’s performance advantage was maintained despite the added load.

Load Test — 300 Concurrent Nodes Against One Artifactory with Local FS Storage (c5.4xlarge)

Following the PoC, we found that Artifactory has basically all of the features we were looking for, with built-in resilience and governance mechanisms, and that it even delivers significant performance improvements, as well as access to new types of repositories such as PyPI. As a result, we decided we did not require any further tool testing and moved ahead with implementing Artifactory in production.

Great success!

Migration

We were actually pretty impressed by how easy the migration was. The UI itself has embedded documentation, such as the “Set Me Up” wizard and walkthrough, which essentially got us up and running pretty quickly.

We started by creating remote repositories that pointed to our old repositories, as the first step in connecting our primary repositories to Artifactory. The next move was to block writes to the old repositories: artifacts were still being pulled from them, while new changes were pushed to the Artifactory-hosted repositories.

The next step was to start pulling from Artifactory (all the artifacts from the old repositories were still available through the remote repositories). Once pulled through the remote repositories, they were cached in Artifactory. After we managed to migrate all of our existing artifacts to Artifactory, we closed the old repositories entirely.
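
As a rough sketch of that first step, a remote repository proxying one of the old registries might look like the following with the Terraform provider discussed later in this post; the repository key and URL here are hypothetical placeholders.

# Sketch only: proxy the old internal registry during the migration, so
# existing artifacts stay pullable and are cached into Artifactory on the
# first pull, while the old registry itself is blocked for writes.
resource "artifactory_remote_repository" "legacy_docker" {
  key          = "legacy-docker-remote"
  package_type = "docker"
  url          = "https://old-registry.internal.example"
}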

Full Visibility

After we had the repositories set up, we also wanted to make sure we had full visibility, so we set up operational dashboards to monitor everything: load, uptime, disks, request counts, and even JMX metrics collected with Jolokia and sent via Telegraf to our metrics server, which are then presented in Grafana.

AppsFlyer Grafana Operational Dashboards for Artifactory

Next, we shipped and parsed all Artifactory logs into our ELK stack, in order to have full search capabilities.

AppsFlyer ELK Logs for Artifactory

And it’s always a bonus when you get to remove lines of code, and that’s just what this migration made possible.

One Line of Config — Easy

TL;DR: ~One Year Later

We migrated to Artifactory for artifact management from third-party public registries nearly a year ago, and we are incredibly pleased with the results: improved deployment reliability (no more failed builds due to deleted artifacts or unavailable services), a single source of truth for all of our mission-critical artifacts, and out-of-the-box governance, resilience, and integrations with additional registries, not to mention improved performance.

For the most part the migration process was pretty seamless; most of the work involved replacing one URL with another. But, typically, engineers don’t want to change anything in their code that is not a real necessity (if it ain’t broke, don’t fix it), so another part of the process was gaining the buy-in of our engineers to make the move to Artifactory, and demonstrating how migrating could really benefit them. The turning point that enabled engineers to gain trust in Artifactory was when the first adopters reported significant improvements in speed and reliability. After these proven successes, adoption by the rest of the teams was fairly quick.

Looking Ahead

With the success of the first phase behind us, we are now looking ahead to make the platform even more robust. We started by building new functionality including some nice automation.

This was made possible thanks to the popularity of Artifactory: we found an excellent open source Terraform provider for Artifactory written by Atlassian, and even built modules on top of this provider leveraging its resources. We used the provider to enable self-service for incoming requests to create a repo. Such a request invokes GitLab CI, which creates the repository upon merge request and provides full logs, and the best part is that it is fully aligned with our naming conventions and testing practices, and is done via one line of simple config.

Code sample of using a Terraform module built on the Atlassian Terraform provider:
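
A minimal sketch of what such a one-line module call can look like (the module name, source path, and inputs are hypothetical rather than our exact internal module):

# Hypothetical self-service repo request: one module block per repository.
# The module wraps the provider's repository and permission resources behind
# our naming and testing conventions.
module "team_a_docker_repo" {
  source       = "./modules/artifactory-repo" # hypothetical internal module
  name         = "team-a"                     # team / repository prefix
  package_type = "docker"
}

A merge request containing a block like this is what GitLab CI plans and applies in order to create the repository.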

Next up, we plan to migrate our filestore to AWS S3, with a large cache directory on the nodes, which will make the nodes stateless and streamline future upgrades. We also plan to deepen our usage of Atlassian’s Terraform provider to automate permissions in Artifactory.
