Stemma: Palantir’s Distributed Git Server

Welcome to the new Palantir tech blog on Medium! This post is just the first in a series on Palantir software, engineering practices, and what we’re contributing to the open source community. Stay tuned!

Stemma is a distributed, highly-available Git server, featuring a fine-grained permissioning model and simple integration points with external services. Stemma installations are an integral part of the data integration backbone of Palantir deployments, both in cloud and on-premise environments.

This blog post describes the motivations behind this project and explains some core design choices and implementation details. In a nutshell, Stemma implements a distributed Git server in only a few thousand lines of code by leveraging JGit’s DfsRepository framework and Palantir’s open-source distributed and transactional key-value store, AtlasDB.

Our requirements for a Git server

Palantir’s work spans many different countries and industries, each with their own unique business needs and data infrastructure. This means our project teams need to manage complex data integration and analytics pipelines with data transformation code written in different languages (Java, Scala, Python, R, SQL, etc.) as well as product configuration for dozens of different software components. We track this code and configuration in Git repositories.

We have several unique requirements for these Git repositories:

Deployability. Our software is typically run in per-customer installations (as opposed to “as-a-service”), either in dedicated Palantir Cloud instances or on-premise hardware. This requirement rules out standard cloud-only Git services, such as Amazon CodeCommit or GitHub.

Fault tolerance and scalability. These installations are made up of a number of highly-available micro-services. Since service instances are typically stateless and redundant (in order to facilitate load-balancing and fail-over), their backing code and configuration repositories need to support concurrent remote access and (ideally) be highly-available, too. We initially experimented with NFS-backed vanilla Git, but based on our experience deploying products in a wide variety of customer networks, we’ve learned that NFS can be very difficult to manage and tune.

Fine-grained access control. Many different users access our Git repositories on a daily basis: customers and Palantir employees, technical and non-technical, data producers and data consumers, automated and human, readers and writers. This means that each access is governed by a rich and granular set of access controls that we manage via an authentication/authorization system.

Integration with external services. We install our products in a wide variety of environments, which have to be compatible with our customers’ existing systems, whatever those might be. We prefer an event-driven architecture for notifying these systems and triggering downstream actions: pushing data through a transformation pipeline, synchronizing permission configurations, or reconfiguring and restarting services. This means the Git server needs to provide an extensible notification mechanism, triggered by Git state changes like pushes, merges, code reviews, and more. (Notably, a fixed set of webhooks, similar to GitHub’s webhook model, does not suffice since we often cannot control or extend the APIs served by our customers’ systems.)

It quickly became clear that none of the off-the-shelf Git solutions (including GitLab, GitHub, and Gogs) would meet all of these requirements, so we decided to build our own Git server: Stemma. (A long-standing tradition at Palantir is for the product teams to name new products and tooling. In this case, Stemma is based on the analogy between software versioning in Git and the concept of a stemma in textual criticism: diagrams showing the relationship between different manuscript versions.)

Architectures for distributed Git

At a high level, the Stemma architecture is simple: a number of stateless Stemma nodes serve Git’s HTTP-based “smart protocol”. Shared state is maintained by a storage backend, and authentication/authorization is delegated to a dedicated service. Load-balancing and failover amount to choosing one responsive node per request.

The Git data model

Digging a little deeper, let’s discuss the interaction between the Stemma nodes and the storage backend. How is the state of the distributed Git repository represented, and how do Stemma nodes interact with this state? To understand the complications and options, we’ll take a closer look at Git’s data model and communication protocol. Any Git implementation has to manage a combination of mutable and immutable data:

Git objects include commits, trees (i.e., directories), and blobs (i.e., files). Every such object is identified by a probabilistically unique hash. Intuitively, the state of a Git repository is represented as a graph of commits, each of which typically points to a tree of directory and file objects. Objects are immutable in the sense that every modification in a Git repository is encoded by adding new objects with distinct, fresh hashes to the object graph.

Git refs include branches and tags: human-readable names that point to commits. Refs are mutable in the sense that users can change their “destination”. For instance, a user can delete and subsequently create a branch with the same name but pointing to a different commit.

Git packfiles are compressed collections of objects, indexed by object hashes. Intuitively, each packfile represents a subset of the object graph. Packfiles are used extensively in Git’s smart protocol to exchange a minimal delta between two Git graphs. Additionally, many Git implementations (JGit, amongst others) use packfiles as a serialization and storage format for Git graphs. Packfile-based representations of the object graph can choose to treat packfiles as mutable or as immutable:

  • If they are immutable, then each mutation of the graph (e.g., a git push) yields an additional packfile and the graph is conceptually the union of all packfiles.
  • If they are mutable, then each mutation of the graph may augment the packfiles containing the relevant portions of the graph.
  • A third option is to consider each packfile immutable and occasionally combine a subset of packfiles into a new packfile, thereby potentially compressing their objects into a more compact representation. The whole object graph is then given by the set of all packfiles, minus the combined packfiles, plus the resulting new packfile.
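
To make the immutable-packfile options above a bit more concrete, here is a minimal sketch (in simplified Java; the types and method names are placeholders, not JGit’s API) of a repository’s object graph viewed as the union of immutable packfiles:

import java.util.List;
import java.util.Optional;

// Sketch only: a repository's object graph as the union of immutable packfiles.
final class PackfileUnionSketch {

  interface Packfile {
    // Each packfile is indexed by object hash; the lookup misses if the object
    // lives in a different packfile.
    Optional<byte[]> lookup(String objectHash);
  }

  // Conceptually, every push appends one more immutable packfile to this list.
  private final List<Packfile> packfiles;

  PackfileUnionSketch(List<Packfile> packfiles) {
    this.packfiles = packfiles;
  }

  // Resolving an object means scanning packfile indexes until one contains it.
  Optional<byte[]> readObject(String objectHash) {
    for (Packfile pack : packfiles) {
      Optional<byte[]> hit = pack.lookup(objectHash);
      if (hit.isPresent()) {
        return hit;
      }
    }
    return Optional.empty();
  }
}

Compaction (the third option) would replace a subset of this list with a single, more compactly delta-encoded packfile without changing the set of resolvable objects.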

The main challenge for any distributed Git implementation is the management of mutable data (i.e., refs and packfiles) under concurrent access. There are two alternative approaches for tackling this challenge:

Architecture A: Git servers store immutable data in a distributed filesystem and mutable data in a (distributed) transactional database.

Since Git objects are stored by unique hash, they can be written safely to a distributed filesystem without much concern for conflicting reads or writes. Concurrent writes of refs and packfiles are managed by a database with suitable atomicity and isolation properties. For example, Google’s internal Git implementation is backed by GFS (Google Filesystem) and Bigtable.

Architecture B: Git proxies federate requests to a network of redundant, vanilla Git servers.

Each proxy acknowledges requests only once a quorum of the replica repositories have acknowledged. The main challenge of this approach lies in implementing failover, balancing, and recovery of replicas correctly. Concurrent access is handled locally by each replica repository. This model is implemented by GitHub’s DGit and Git Ketch.

Implementation with JGit and AtlasDB

We decided to build Stemma based on Architecture A due to its conceptual and implementation simplicity. Stemma stands on the shoulders of open-source giants: first, JGit is a complete Java implementation of the Git protocol and exposes the DfsRepository extension point for distributed store implementations. Second, by relying on the recently open-sourced AtlasDB as a distributed transactional database layer for all shared state, we’re able to delegate the complexity of the distributed system to an existing, well-tested backend (analogous to Google’s decision to base their Git implementation on GFS and Bigtable). The AtlasDB APIs provide interfaces for storing and querying both tabular and stream data. Stemma uses the former for structured metadata (e.g., the list of packfile IDs and the set of refs per repository) and the latter for packfile blob storage. In production systems, we typically deploy AtlasDB on top of a Cassandra cluster that’s sized according to the expected performance and redundancy requirements.
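
To make the pseudo-code below easier to follow, here is a rough sketch of the transaction-facing store interface it assumes. The interface and its names (AtlasStore, PackDescription, GitRef, and so on) are shorthand invented for this post, not AtlasDB’s actual API; in reality the metadata lives in AtlasDB tables and the packfile bytes in AtlasDB streams, as described above.

import java.io.InputStream;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Illustrative shorthand for the shared state Stemma keeps per repository.
interface AtlasStore {

  // Runs the given work inside a single serializable AtlasDB transaction.
  <T> T withTransaction(Function<Tx, T> work);

  interface Tx {
    // Tabular metadata: which packfiles make up the repository...
    List<PackDescription> listPacks();
    void storePacks(Collection<PackDescription> toSave);
    void deletePacks(Collection<PackDescription> toDelete);

    // ...and the current refs (branch or tag name -> commit hash).
    Set<GitRef> listRefs();
    GitRef getRef(GitRef ref);
    boolean putRef(GitRef ref);
    boolean replaceRef(GitRef expected, GitRef updated);
    boolean deleteRef(GitRef ref);

    // Stream storage: the packfile bytes themselves.
    InputStream loadPack(PackDescription pack);
  }

  // Minimal placeholder value types for the sketch.
  record PackDescription(String packId, long sizeInBytes) {}
  record GitRef(String name, String targetCommitHash) {}
}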

The core JGit interfaces left for us to implement concern storage and retrieval of refs (DfsRefDatabase) and objects/packfiles (DfsObjDatabase). The following pseudo-code sketches their implementations (AtlasDB interfaces are conceptually accurate but syntactically simplified to de-clutter the presentation):

class AtlasDfsObjDatabase {
  void commitPackImpl(
      Collection<DfsPackDescription> toSave, Collection<DfsPackDescription> toDelete) {
    // Atomically record the new packfiles and drop the obsolete ones.
    atlasStore.withTransaction { tx ->
      tx.storePacks(toSave)
      tx.deletePacks(toDelete)
    }
    // Warm the local disk cache with the packfiles we just wrote.
    localDiskCache.storePacks(toSave)
  }

  List<DfsPackDescription> listPacks() {
    return atlasStore.withTransaction { tx -> tx.listPacks() }
  }

  ReadableChannel openFile(DfsPackDescription desc, PackExt ext) {
    // Serve from the local disk cache; on a miss, stream the packfile from AtlasDB.
    return localDiskCache.getOrLoad(desc, () ->
        atlasStore.withTransaction { tx -> tx.loadPack(desc).stream() }
    )
  }
}

class AtlasDfsRefDatabase {
  RefCache scanAllRefs() {
    return atlasStore.withTransaction { tx -> tx.listRefs() }
  }

  boolean compareAndPut(Ref expectedExistingRef, Ref refToUpdateTo) {
    return atlasStore.withTransaction { tx ->
      Ref existingRef = tx.getRef(expectedExistingRef)
      if (existingRef == null) {
        if (expectedExistingRef.getStorage() == NEW) {
          return tx.putRef(refToUpdateTo)
        } else {
          return false // ref was deleted by another user
        }
      } else {
        if (existingRef == expectedExistingRef) {
          return tx.replaceRef(expectedExistingRef, refToUpdateTo)
        } else {
          return false // ref was mutated by another user
        }
      }
    }
  }

  boolean compareAndRemove(Ref refToDelete) {
    return atlasStore.withTransaction { tx -> tx.deleteRef(refToDelete) }
  }
}

With this implementation, the complexity of the distribution logic is offloaded entirely to AtlasDB’s transaction mechanism. For example, the storage backend guarantees that interleaved concurrent calls — whether to the same or to different Stemma nodes — to AtlasDfsObjDatabase#listPacks and AtlasDfsObjDatabase#commitPackImpl are serializable, which means users aren’t able to observe or store inconsistent or partial state changes.

Every Stemma node maintains a local cache of packfiles. Because JGit assigns packfiles stable, unique identifiers, this cache is very easy to maintain: whenever an uncached packfile is requested, it’s retrieved from AtlasDB and stored on disk. And because packfiles are immutable, the disk cache never requires invalidation or updates.
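
A minimal sketch of such a cache is shown below; the class and the way packfile bytes are streamed from the backend are our simplification, not Stemma’s actual code.

import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.function.Supplier;

// Sketch of a write-once packfile cache: safe without invalidation because a
// given packfile identifier always refers to the same immutable bytes.
final class LocalDiskPackCache {
  private final Path cacheDir;

  LocalDiskPackCache(Path cacheDir) throws IOException {
    this.cacheDir = Files.createDirectories(cacheDir);
  }

  // Returns the cached file for this packfile id, loading it from the backend
  // (e.g., an AtlasDB stream) on the first miss.
  Path getOrLoad(String packId, Supplier<InputStream> loader) {
    Path cached = cacheDir.resolve(packId + ".pack");
    if (Files.exists(cached)) {
      return cached; // immutable, so a cache hit never needs revalidation
    }
    try (InputStream in = loader.get()) {
      Path tmp = Files.createTempFile(cacheDir, packId, ".tmp");
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      // Move into place only once fully written; a concurrent loader would
      // simply overwrite the file with identical bytes.
      Files.move(tmp, cached, StandardCopyOption.REPLACE_EXISTING);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return cached;
  }
}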

Packfile compression & garbage collection

The implementation above creates one new packfile per write operation (i.e., git push) in the repository. Typically, each such packfile is small and contains only the objects committed between successive git push operations. For the Git server, there is a tradeoff between maintaining a large number of small packfiles on the one hand, and a small number of large packfiles on the other hand.

  • The set of small (uncompressed) packfiles generated by a sequence of mutations usually contains highly redundant information since all file objects are stored in their entirety. In contrast, packfile compression delta-encodes incremental changes to the same file object and is therefore vastly more space-efficient.
  • If packfiles are compressed too eagerly, cache misses become frequent and require loading large new packfiles. Once those files have been retrieved, however, scanning only a few efficiently compressed packfiles and their indexes is fast.

We tune Stemma between the two extremes by periodically coalescing smaller packfiles into larger ones with JGit’s DfsGarbageCollector. As above, AtlasDB’s transaction semantics allow us to atomically swap out the obsolete packfiles for the repackaged new ones, without jeopardizing the consistency of the Git repository under concurrent access. Right now, we haven’t observed sufficient variance in Stemma usage to generate conclusive recommendations for compression strategies under different access characteristics, but we’ll share those findings as we learn more.
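
The sketch below shows roughly what this compaction step looks like; the hourly schedule and the way the repository handle is obtained are placeholders rather than Stemma’s actual policy.

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.eclipse.jgit.internal.storage.dfs.DfsGarbageCollector;
import org.eclipse.jgit.internal.storage.dfs.DfsRepository;
import org.eclipse.jgit.lib.NullProgressMonitor;

// Sketch of periodic packfile compaction for a DfsRepository-backed repository.
final class PackCompactionTask {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void scheduleFor(DfsRepository repository) {
    scheduler.scheduleAtFixedRate(() -> compact(repository), 1, 1, TimeUnit.HOURS);
  }

  private void compact(DfsRepository repository) {
    try {
      // DfsGarbageCollector rewrites the object graph into fewer, delta-compressed
      // packfiles; the resulting pack swap is committed via commitPackImpl and
      // therefore via a single AtlasDB transaction, as described above.
      new DfsGarbageCollector(repository).pack(NullProgressMonitor.INSTANCE);
    } catch (IOException e) {
      // A failed compaction leaves the old packfiles in place; retry next round.
    }
  }
}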

Access Control

Stemma integrates with authorization and authentication services to implement access control to repositories and refs. We currently support repository-level read permissions and ref-level write permissions. For example, users or groups may be granted permission to read the contents of a repository without being able to push to some of its branches.

By virtue of delegating authentication to a dedicated service, the implementation is very simple. Repository-level read authorization is implemented via JGit’s RepositoryResolver interface:

Repository open(HttpServletRequest httpRequest, String repositoryName) {
  AuthToken token = extractToken(httpRequest.getHeader(HttpHeaders.AUTHORIZATION))
  if (!authService.canReadResource(repositoryName, token)) {
    throw new ServiceNotAuthorizedException()
  }
  // ...
}

Ref write access is restricted via a JGit PreReceiveHook that checks permissions on every push:

class BranchAuthPreReceiveHook implements PreReceiveHook {
  AuthToken token // injected on instance creation
  Repository repo // injected on instance creation

  void onPreReceive(ReceivePack rp, Collection<ReceiveCommand> commands) {
    for (ReceiveCommand command : commands) {
      switch (command.getType()) {
        case CREATE:
          boolean canCreate = authService.canCreateResource(repo, token)
          // ...
        case UPDATE:
          // ...
      }
    }
  }
}

Integration with external services

Integration with external services is primarily achieved via Stemma webhooks, a set of HTTP/REST interfaces akin to GitHub webhooks. When one of a pre-defined set of state changes occurs (e.g., a push or merge), Stemma notifies subscribed services by calling the corresponding API endpoints. This model supports standard continuous integration workflows, such as running pre-merge tests, as well as Palantir-specific notifications like triggering updates of data integration pipelines upon changes in data transformation code.
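
For illustration, a push notification along these lines can be emitted from a JGit PostReceiveHook once a push has been accepted; the JSON payload shape and the list of subscriber URLs below are hypothetical, not Stemma’s actual webhook contract.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Collection;
import java.util.List;

import org.eclipse.jgit.transport.PostReceiveHook;
import org.eclipse.jgit.transport.ReceiveCommand;
import org.eclipse.jgit.transport.ReceivePack;

// Sketch of a push webhook: after a push is accepted, POST a small JSON payload
// describing each ref update to every subscribed endpoint.
final class PushWebhookHook implements PostReceiveHook {
  private final List<URI> subscriberEndpoints;
  private final HttpClient httpClient = HttpClient.newHttpClient();

  PushWebhookHook(List<URI> subscriberEndpoints) {
    this.subscriberEndpoints = subscriberEndpoints;
  }

  @Override
  public void onPostReceive(ReceivePack rp, Collection<ReceiveCommand> commands) {
    for (ReceiveCommand command : commands) {
      String payload = String.format(
          "{\"ref\":\"%s\",\"before\":\"%s\",\"after\":\"%s\"}",
          command.getRefName(), command.getOldId().name(), command.getNewId().name());
      for (URI endpoint : subscriberEndpoints) {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        // Fire-and-forget delivery; retries and error handling are omitted here.
        httpClient.sendAsync(request, HttpResponse.BodyHandlers.discarding());
      }
    }
  }
}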

Additionally, Stemma puts us in a position to implement custom notification plugins specific to external services that do not implement a set of standard webhook APIs. This flexibility dramatically simplifies integration with existing customer systems.


Authors: Ben D, Grace W, Jared N, Jimin S, Mark E, Robert F