Building a Scalable Protocol Buffers/gRPC Artifact Pipeline

Ibotta uses gRPC and Protocol Buffers to rapidly innovate in our microservice-driven environment. gRPC provides low-latency, bi-directional binary streaming via HTTP/2. gRPC also relies on reusable, validatable, language-agnostic Protocol Buffers messages.

Because of these advantages, many of our new services are using gRPC for their inter-service communication. We’re also using Protocol Buffers outside of gRPC for predictable serialization/deserialization of messages on streams and queues.

However, our initial implementation for packaging Protocol Buffers messages and gRPC services could not handle the influx of interest from new technologists (engineers, data scientists and analysts). This article describes the problems we faced with the initial implementation and how our new pipeline significantly decreased the time to ship Protocol Buffers and gRPC artifacts to production.


The Initial Implementation

During our initial exploratory phase of using gRPC, our strategy was to place services in the same package as our common Protocol Buffers (aka Protobuf) message definitions.

ibotta_schemas
├── services
│   └── my_service
│       ├── messages.proto   <- service-specific Protobuf messages
│       └── service.proto    <- gRPC service definition
└── system
    └── entity.proto         <- shared Protobuf messages

This methodology allowed for only one versioned package at a time (we called it ibotta_schemas) and resulted in a new version of the artifact any time a core message or service mutated. At one point we saw the shared version published more than ten times in one day! This rapid release cycle made it extremely difficult for users of ibotta_schemas to keep up to date with releases that often didn't impact their service.

With just a few services, this pattern worked well enough for us. It allowed the protoc compiler to easily discover shared messages, but we realized that it would not scale as we added many more services.

Turn Back Now?

If you’re just starting to experiment with gRPC or you only plan to have a few services, the pipeline I describe below might be overkill.

This article is aimed at readers who need to create a pipeline to deploy artifacts for many gRPC services, developed by a variety of technologists and compiled for a diverse set of languages. If that sounds like you, please read on. If it doesn't, go experiment and come back to this article when you're ready to move forward with wider adoption.


Pipeline Requirements

Our pipeline had three main requirements.

  1. Significantly reduce boilerplate to create and deploy language-specific artifacts for a new gRPC service
  2. Enable gRPC services and Protocol Buffers messages to rely on shared .proto files stored in base packages
  3. Build, version and publish gRPC artifacts in a variety of languages (Ruby, Java, Node, Python, etc.)

The first of these goals was relatively straightforward to approach. Once we had a common pattern in place, we could introduce a Yeoman Generator to create the needed files in the correct place. However, goals two and three were less straightforward. The following sections describe how we attacked those problems.

Depending on common .proto Files

Sharing reusable messages is one of the strengths of Protocol Buffers. For instance, this is a common message that all services use to talk about an “entity” such as a Shopper, Offer, Bonus, etc. that has meaning across our system’s services.

// entity.proto
syntax = "proto3";

package ibotta_pb.system;

// Identifies an entity across systems.
message EntityUri {
  // Uniquely identifies the entity across all systems.
  string uri = 1;

  // Human-readable name of this object.
  // Should never be considered unique for any purposes
  // and should only be used for reference.
  // [Optional]
  string name = 2;
}

For instance, a Shopper’s EntityUri might look like 'shopper~123' or an Offer’s EntityUri might look like 'offer~456'. Using a shared EntityUri message helps to standardize these messages across many microservices. This message definition is stored in a common package that we’ll call ib_core.

Let’s say that I’m generating a new gRPC service that wants to talk about a Shopper. Its internal message would look something like this:

// my_service.proto
syntax = "proto3";

package ibotta_pb.my_service;

import "ibotta_pb/system/entity.proto"; // <- located in "ib_core"

message MyServiceRequest {
  ibotta_pb.system.EntityUri shopper_uri = 1;
}
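To complete the picture, the gRPC service definition lives alongside these messages. Here is a minimal sketch of what such a definition could look like; the service, RPC, and response names (and the import path for the request message) are illustrative, not our actual definitions:

```proto
// service.proto (sketch -- names and paths are illustrative)
syntax = "proto3";

package ibotta_pb.my_service;

import "ibotta_pb/my_service/my_service.proto";
import "ibotta_pb/system/entity.proto"; // also located in "ib_core"

service MyService {
  // Unary RPC that accepts the request message defined above.
  rpc GetShopper (MyServiceRequest) returns (MyServiceResponse);
}

message MyServiceResponse {
  // Echoes the shared entity message back to the caller.
  ibotta_pb.system.EntityUri shopper_uri = 1;
}
```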

Importing the entity definition from ib_core exposes a few problems.

  1. The protoc compiler for our service needs to have access to the raw entity.proto file to generate the language-specific files for our gRPC service.
  2. Changes to the core file definition should rebuild the gRPC service artifacts. For example, if we added a new field to EntityUri we’d want to rebuild and publish the artifacts for my_service so it can take advantage of that new field.

We recognized this is a dependency management problem and we needed a powerful tool like npm, Bundler, or Maven. Managing inter-dependencies between our core packages and our gRPC services with a bespoke solution was asking for subtle problems in the future.

Additionally, we wanted a common build and deployment system so that technologists could simply create their gRPC service definitions and have the proper artifacts uploaded to our private Artifactory in a few minutes.

Using a Project Monorepo

Because we wanted to share an opinionated build and deployment pipeline, a monorepo made a lot of sense for the project.

If you’re not familiar with the concept of a monorepo, I highly recommend reading Markus Oberlehner’s article, Monorepos in the Wild. Using his terms, this specific use of the monorepo pattern is a Project Monorepo.

Making large scale refactorings across all related packages can be done very quickly if every package is maintained in one single repository. By contrast changing an API which affects all packages spread across multiple repositories, means making a separate commit in every one of those affected repositories.
- Markus Oberlehner, from "Monorepos in the Wild"

This is especially valid for this project because these “refactorings” are mutations in common packages relied upon by many gRPC service definitions. Having all the code in one place helps to make changes in one place, instead of across many repositories.

ibotta_schemas                <- root GitHub repository
├── README.md
└── packages                  <- "sub-repos" inside the monorepo
    ├── my_grpc_service_1
    ├── my_grpc_service_2
    └── ib_core

In a non-monorepo environment, each sub-folder below packages would be its own GitHub repository, with its own versioning and build/deployment pipeline. The downside to that approach, as discussed above, is managing dependencies, releases and builds across many repositories.

Lerna for Monorepo Management

To manage versioning and releasing of packages within the monorepo, we use a powerful tool called lerna.

lerna does a great job of independently versioning the "sub-repos" in our monorepo. For instance, consider a dependency graph where two services both depend on messages in ib_core.

Both gRPC services depend on the Protocol Buffers messages defined in ib_core
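lerna discovers this graph from each package's package.json. A minimal sketch of what my_grpc_service_1's manifest might contain (the field values here are illustrative):

```json
{
  "name": "my_grpc_service_1-protofiles",
  "version": "1.0.0",
  "dependencies": {
    "ib_core-protofiles": "^1.0.0"
  }
}
```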

At their base state for this example, all of the packages are at version 1.0.0.

# code example here: https://git.io/vhC1Q
packages
├── ib_core @1.0.0
├── my_grpc_service_1 @1.0.0
└── my_grpc_service_2 @1.0.0

When backwards-compatible changes are made in my_grpc_service_1, it's the only package that needs a new release.

# code change made: https://git.io/vhCMI
# release of my_grpc_service_1: https://git.io/vhCMY
packages
├── ib_core @1.0.0
├── my_grpc_service_1 @1.1.0 <- release
└── my_grpc_service_2 @1.0.0

However, when ib_core changes, all three packages need a release: one for the core package update and one each for the dependent gRPC service definitions.

# code changes made: https://git.io/vhCMW
# releases of all three packages: https://git.io/vhCMu
packages
├── ib_core @1.1.0 <- release
├── my_grpc_service_1 @1.2.0 <- release
└── my_grpc_service_2 @1.1.0 <- release

lerna manages the versioning and release process for us with a user-friendly CLI tool. For instance, when all three packages needed a release, this was the output from running lerna publish:

> lerna publish
lerna info version 2.11.0
lerna info versioning independent
lerna info Checking for updated packages...
lerna info Comparing with my_grpc_service_1-protofiles@1.1.0.
lerna info Checking for prereleased packages...
? Select a new version for ib_core-protofiles (currently 1.0.0) Minor (1.1.0)
? Select a new version for my_grpc_service_1-protofiles (currently 1.1.0) Minor (1.2.0)
? Select a new version for my_grpc_service_2-protofiles (currently 1.0.0) Minor (1.1.0)
Changes:
- ib_core-protofiles: 1.0.0 => 1.1.0 (private)
- my_grpc_service_1-protofiles: 1.1.0 => 1.2.0 (private)
- my_grpc_service_2-protofiles: 1.0.0 => 1.1.0 (private)
? Are you sure you want to publish the above changes? Yes

lerna updates the version (tracked in each "sub-repo"'s package.json file) and creates git tags on that commit.

commit 6ef8f4618c6423d138b3b73e5a3f4f34528744df (HEAD -> master, tag: my_grpc_service_2-protofiles@1.1.0, tag: my_grpc_service_1-protofiles@1.2.0, tag: ib_core-protofiles@1.1.0, origin/master, origin/HEAD)
Author: Ben Limmer <hello@benlimmer.com>
Date:   Fri Jun 1 11:39:08 2018 -0700

    Publish

     - ib_core-protofiles@1.1.0
     - my_grpc_service_1-protofiles@1.2.0
     - my_grpc_service_2-protofiles@1.1.0

The example above had only one core package but, in reality, we have many more core packages that our services can use, creating a dependency graph that looks more like this:

A slightly more complex message dependency graph

lerna easily manages these dependencies and the versioning/release process as the schema mutates. This powerful, independent versioning allows for rapid development of core packages and gRPC services alike.

Building and Packaging Artifacts

The other challenge in developing our gRPC pipeline was the build and deployment of language-specific artifacts (gems for Ruby, jars for Java, eggs for Python, packages for Node, etc.).

The protoc compiler tool does a great job at taking .proto files and turning them into the appropriate language-specific files (.rb files, .java files, .py files, .js files, etc.), but it doesn’t handle packaging up the artifacts for consumption by the services that use them.

To solve this problem, we use the excellent docker-protoc project from Namely Labs and a custom script to handle packaging/publishing the built artifacts.

docker-protoc

The docker-protoc container encapsulates the multitude of dependencies required to build language-specific files from .proto source files. Using this container provides us with a reliable, consistent interface for the compilation step.

Additionally, their container enables linting of our .proto files via protoc-gen-lint and generation of documentation via protoc-gen-doc.

Code comments in .proto source files are used to generate documentation in HTML or Markdown

If you haven’t read their post, “How we build gRPC Services at Namely”, I highly recommend taking a look. This post was hugely inspirational to us when solving this problem.

Invoking the docker-protoc Container with the Correct Includes

The protoc compiler needs access to all .proto files that are import-ed by your messages. From our example above, our gRPC service imports a file from a core package like this:

import "ibotta_pb/system/entity.proto";

If you don’t explicitly include a reference to that file, you’ll see the protoc compiler fail with an error.

ibotta_pb/system/entity.proto: File not found.
ibotta_pb/my_grpc_service_1/messages.proto:
Import "ibotta_pb/system/entity.proto" was not found or had errors.

To resolve this problem, we leverage the way that lerna uses npm to link up dependent packages. From our previous example, where my_grpc_service_1 depends on ib_core, this is the folder structure.

my_grpc_service_1
└── node_modules
    └── ib_core-protofiles -> ../../ib_core
        └── src

Notice that lerna uses a symlink in our monorepo to point at the ib_core-protofiles package it depends on. So, in our script, we iterate over the node_modules directory and pass each package through to the docker-protoc compilation run as a -i include.
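The gathering step can be sketched as a small shell loop. This is an illustrative reconstruction, not our exact production script; the fixture directory below stands in for a real package with a symlinked dependency:

```shell
#!/bin/bash
set -e

# Illustrative fixture: a package whose node_modules contains one
# (normally symlinked) dependency named ib_core.
PROJECT_ROOT="$(mktemp -d)"
PACKAGE_DIR="$PROJECT_ROOT/packages/my_grpc_service_1"
mkdir -p "$PACKAGE_DIR/node_modules/ib_core/src/main/proto"

dependency_includes=""
dependency_mounts=""

for dep in "$PACKAGE_DIR"/node_modules/*/; do
  name="$(basename "$dep")"
  # Each dependency contributes a protoc include path...
  dependency_includes="$dependency_includes -i node_modules/$name/src/main/proto"
  # ...and a volume mount so the symlink target is visible inside Docker.
  dependency_mounts="$dependency_mounts -v $PROJECT_ROOT/packages/$name:/$name"
done

echo "includes:$dependency_includes"
echo "mounts:  $dependency_mounts"
```

The two accumulated strings are then spliced into the docker run command shown below.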

Our docker run command looks like this.

# Flag reference:
#   $dependency_mounts       mounts the dependent packages into the container
#   -v $PACKAGE_DIR:/defs    the .proto files being compiled
#   namely/protoc-all:1.11   image label matches the gRPC version
#   $dependency_includes     node_modules include paths
#   -d src/main/proto        location of the .proto files
#   -o $BUILD_PATH           output location
#   -l $BUILD_LANG           ruby, node, python, etc.
docker run --rm \
  $dependency_mounts \
  -v $PACKAGE_DIR:/defs \
  namely/protoc-all:1.11 \
  -i src/main/proto $dependency_includes \
  -d src/main/proto \
  -o $BUILD_PATH \
  -l $BUILD_LANG \
  --with-docs

With the example above, $dependency_includes expands to a string like this.

-i my_grpc_service/node_modules/ib_core/src/main/proto

And $dependency_mounts expands like this.

-v $PROJECT_ROOT/packages/ib_core:ib_core

We have to mount the symlinked dependencies through explicitly, since Docker struggles with symlinks that point outside the mounted directory.

By gathering, mounting-through and including the dependencies, the protoc compiler can produce the requested language-specific files.

build
└── ruby
    └── ibotta_pb
        └── my_grpc_service_1
            └── messages_pb.rb

Note that even though we included a reference to the imported packages, they are not copied into the resulting build files. We’ll handle this problem below.

Packaging the Files into Artifacts

Once we have the language-specific files, we need to package them up per the language and use-case standards. We'll use Ruby as the example here, but the same idea applies to all the languages we support.

For Ruby, our first step is to move the built files into a lib directory, per RubyGems conventions. Our build directory now looks like this.

build
└── ruby
    └── lib
        └── ibotta_pb
            └── my_grpc_service_1
                └── messages_pb.rb
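That move is a one-liner in the build script. A sketch (paths are illustrative), starting from the protoc output layout shown earlier:

```shell
#!/bin/bash
set -e

# Illustrative fixture: the protoc output layout from the previous step.
BUILD_DIR="$(mktemp -d)/build"
mkdir -p "$BUILD_DIR/ruby/ibotta_pb/my_grpc_service_1"
touch "$BUILD_DIR/ruby/ibotta_pb/my_grpc_service_1/messages_pb.rb"

# Move the generated files under lib/, per RubyGems conventions.
mkdir -p "$BUILD_DIR/ruby/lib"
mv "$BUILD_DIR/ruby/ibotta_pb" "$BUILD_DIR/ruby/lib/"

find "$BUILD_DIR" -type f
```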

Now we need to generate a .gemspec file for our service. To make this more manageable, we use a tool called gomplate, a simple wrapper on top of Go templates. This lets us fill in the proper variables for each package we build.

Gem::Specification.new do |spec|
  spec.name    = 'ibotta_pb-{{ getenv "ARTIFACT_NAME" }}'
  spec.version = '{{ getenv "ARTIFACT_VERSION" }}'
  spec.authors = ['Ibotta']

  # etc.

  spec.add_dependency 'google-protobuf', '~> 3'
  spec.add_dependency 'grpc', '~> {{ getenv "GRPC_VERSION" }}'

  {{ if (datasourceExists "dependencies") }}
  # Protobuf Base Dependencies
  {{- range $row := ( datasource "dependencies" ) }}
  spec.add_dependency 'ibotta_pb-{{ index $row 0 }}', '>= {{ index $row 1 }}', '< {{ index $row 2 }}' {{ "\n" }}
  {{- end }}
  {{- end }}
end

The datasource portion of the template above is how the consuming Ruby service pulls down the dependent base packages at runtime. Our build script creates a basic CSV of any dependencies (derived from the node_modules directory).

# dependency name, symlinked version, next major version
ib_core, 1.1.0, 2
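A sketch of how that CSV could be produced from the node_modules directory. The version parsing and the -protofiles suffix handling here are illustrative assumptions, not our exact script:

```shell
#!/bin/bash
set -e

# Illustrative fixture: a symlinked dependency with its package.json.
PKG="$(mktemp -d)"
mkdir -p "$PKG/node_modules/ib_core-protofiles"
printf '{\n  "name": "ib_core-protofiles",\n  "version": "1.1.0"\n}\n' \
  > "$PKG/node_modules/ib_core-protofiles/package.json"

DEPS_CSV="$PKG/dependencies.csv"
: > "$DEPS_CSV"

for dep in "$PKG"/node_modules/*/; do
  name="$(basename "$dep")"
  name="${name%-protofiles}"   # ib_core-protofiles -> ib_core
  # Pull the symlinked version out of package.json.
  version="$(sed -n 's/.*"version":[ ]*"\([^"]*\)".*/\1/p' "$dep/package.json")"
  major="${version%%.*}"
  next_major=$((major + 1))    # upper bound for the gemspec range
  echo "$name, $version, $next_major" >> "$DEPS_CSV"
done

cat "$DEPS_CSV"
```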

Since it’s a bit tough to read, here’s how the rendered template would look with our previous example.

Gem::Specification.new do |spec|
  spec.name    = 'ibotta_pb-my_grpc_service_1'
  spec.version = '1.2.0'
  spec.authors = ['Ibotta']

  # etc.

  spec.add_dependency 'google-protobuf', '~> 3'
  spec.add_dependency 'grpc', '~> 1.11'

  # Protobuf Base Dependencies
  spec.add_dependency 'ibotta_pb-ib_core', '>= 1.1.0', '< 2'
end

The consuming service pulls down the ib_core Ruby package to use at runtime. Since the protoc compiler does not deep-copy dependencies into the resulting build, we need to provide those built files at runtime (or at compile time for compiled languages).
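From the consuming service's side, this is just an ordinary gem dependency; Bundler resolves the ib_core base package transitively through the gemspec constraints above. A sketch of a consumer's Gemfile (the Artifactory URL is a placeholder):

```
# Gemfile of a consuming Ruby service (illustrative)
source 'https://artifactory.example.com/artifactory/api/gems/gems-local' do
  gem 'ibotta_pb-my_grpc_service_1', '~> 1.2'
end
```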

We pass these common parameters (ARTIFACT_NAME, ARTIFACT_VERSION, etc.) to each of our templates so that each language can produce the appropriate file (build.gradle, setup.py, etc.).

Now our build folder looks like this.

build
└── ruby
    ├── ibotta_pb-my_grpc_service_1.gemspec
    └── lib
        └── ibotta_pb
            └── my_grpc_service_1
                └── messages_pb.rb

The final step is to build the gem and publish it to our Artifactory. Our script runs gem build and gem push to do this.

Our script has switches for each language we support in the pipeline, and we follow the same pattern described above for each supported language.


Impact on Ibotta Technologists

We’re finding Protocol Buffers and gRPC to be extremely powerful tools for us at Ibotta, particularly as we continue our move to a microservice-style architecture.

By creating this pipeline, a technologist can use our Yeoman Generator to quickly generate the necessary files to import common proto dependencies and publish the language-specific artifacts for their use-case.

Our Yeoman Generator creates a new service in seconds

These conventions allow our technologists to focus on solving for their business-case right away, a huge efficiency improvement compared to our initial solution.

Lessons Learned

Beyond the benefits to the technologists using the pipeline, there are a few overall lessons we learned from tackling this project.

Protocol Buffers Artifacts are a First-Class Dependency

Protocol Buffers messages allow for backwards-compatible schema evolution. Nevertheless, they should be published and released according to Semantic Versioning. Using an artifact store like Artifactory and language-specific dependency managers (npm, bundler, etc.) in the consuming services makes handling mutations in the schema more transparent and observable.

Naming is Hard

Naming is often one of the hardest problems in programming. As I alluded to above, we have several base packages that share our common Protocol Buffers messages. As part of this project, we did a Card Sorting exercise to take inventory of our existing shared messages and separated them into four base packages. We tried to be aware of Conway's Law when separating the packages, but we know we likely won't get it perfect the first time.

To hedge our bets with getting the naming and organization correct, we opted to use a single namespace for Protocol Buffer messages and services, ibotta_pb . This gives us flexibility to change the containing package without rewriting all services using shared messages. Take some time when writing your initial Protocol Buffers messages and monorepo packages, but know that there will always be downsides to the convention you choose. Leave yourself room to adapt as your problem domain changes.

(Probably) Don’t Use Bash

When we started this project, we didn’t foresee packaging of artifacts being this complex, and wrote this pipeline largely in a bash ‘deploy’ script. In hindsight, a more testable and widely-understood language like Ruby would have been a much better choice.

We will likely rewrite the script in the coming months; for now, at least the complex problems are solved in Bash, and we can translate them to another, more robust language.


If you have any questions, feel free to use the comment thread below! We’re happy to share our experience with you.

We’re Hiring!

If these kinds of projects and challenges sound interesting to you, Ibotta is hiring! Check out our jobs page for more information.