Releasing Substation v1.0

Josh Liburdi
Published in Brex Tech Blog
Mar 26, 2024

Earlier this month, the Security Engineering team at Brex released Substation v1.0, the next evolution of our cloud-native, event-driven data pipeline toolkit. This milestone comes nearly two years after the first public release, three years after the first internal deployment, and after many petabytes of data processed. This post is part announcement and part retrospective on how the project has changed over the years.

Substation v1.0

First and foremost, Substation v1.0 improves on previous releases with:

  • redesigned Terraform modules that apply least-privilege access automatically (see Over-Abstraction with Terraform below)
  • simpler, more atomic Jsonnet configurations, including a new meta switch transform (see Complex Jsonnet Configurations below)
  • a leaner codebase, with redundant features pruned and exported functions consolidated (see Waiting to Refactor below)

Previous releases have always been stable and ready for production use, but this release was necessary to fix and improve upon design decisions made years ago. The goal of shipping v1.0 was to intentionally “break” the project (in the context of Semantic Versioning) so thoroughly that it will be easier to use and maintain for many more years to come. To understand that choice better, the rest of this post is a retrospective on how we got to this point and lessons learned along the way.

Project History

It’s hard to believe, but Substation began back in 2020. Here’s a brief history of how the project has evolved since then.

Late 2020:

  • We sent audit and security event logs directly to our SIEM, just like nearly every other security team in existence.
  • After implementing our threat detection methodology and systems, I saw an opportunity to level up our security monitoring, threat detection, and incident response capabilities with an event logging system, but nothing that existed seemed like the right fit.

Early 2021:

  • Designs for an event logging system were drafted and experiments were conducted to test ideas. This is when key decisions were made, including using Go instead of Python, running on native AWS services like Kinesis and Lambda instead of Kafka and Kubernetes, and deploying using containers.
  • An MVP was completed within a few months, and production deployments started shortly thereafter.
  • At this point, the project had no name.

Mid 2021:

  • Workflows for deploying infrastructure as code and configurations as code were improved upon and standardized across the Security team.
  • Several production data pipelines existed, but their architecture was not customizable.

Late 2021:

  • 95%+ of all audit and security event logs were sent through a data pipeline, and the project was refactored as a toolkit with added support for more AWS services. Deployments were highly reliable, and plans for open sourcing began.
  • At this point, the project was given the internal name YASP.

Early-Mid 2022:

  • YASP went through an internal open source review and was named Substation.
  • The project was open sourced in April 2022. The team got used to working on the codebase in public. (We didn’t, and still don’t, maintain a private fork.)

Late 2022:

  • We formally announced and shared the project in October 2022 with the blog post Announcing Substation. At the time, we managed 30 data pipelines and spent less than 1 hour per week on operations and maintenance.

Early-Late 2023:

  • v0 went through many iterations, including SemVer “breaking” changes.
  • Areas of friction in the user experience and opportunities for improvement were identified. (See the Lessons Learned section below for more detail.)

Late 2023:

  • v1.0 was designed, tested, and iterated on (again, in public).

Early 2024:

  • Large-scale internal deployments were migrated from v0 to v1.0. Additional bug fixes and refactors were made before the next release.
  • v1.0 was released in March 2024. There are still very few projects (or products) like it for security teams. As of this writing, we manage 42 data pipelines and process ~6 TB of data each day while still spending less than 1 hour per week on operations and maintenance.

Lessons Learned

As with most software, mistakes were made; specifically, decisions made early in development created “papercuts” that affected the user experience. None of them impacted the reliability of the system (we have two production pipelines still running the YASP source code that were deployed two years ago and haven’t been touched since), but they were annoying enough to fix. Here are some of the biggest mistakes and how we fixed them in v1.0.

Over-Abstraction with Terraform

Substation shipped on day one with infrastructure-as-code support using Terraform modules, which is great because it means that infra custodians can consistently deploy data pipelines. Unfortunately, these modules were “over-abstracted” in some critical areas. Here’s one of the most problematic examples, assigning IAM roles and permissions:

module "example_kinesis" {
source = "../modules/kinesis"
kms_key_id = module.yasp_kms.arn
stream_name = "example_kinesis"
autoscaling_topic = aws_sns_topic.autoscaling.arn
}

module "example_lambda" {
source = "../modules/lambda"
function_name = "yasp_example"
description = "YASP Lambda that is triggered from a Kinesis Data Stream."
appconfig_id = aws_appconfig_application.yasp.id
kms_arn = module.yasp_kms.arn
image_uri = "${module.yasp_ecr.repository_url}:latest"

env = {
"AWS_MAX_ATTEMPTS" : 10
"AWS_APPCONFIG_EXTENSION_PREFETCH_LIST" : "/applications/yasp/environments/prod/configurations/yasp_example"
"YASP_HANDLER" : "KINESIS"
"YASP_DEBUG" : 1
}
}

resource "aws_lambda_event_source_mapping" "example_lambda" {
event_source_arn = module.example_kinesis.arn
function_name = module.example_lambda.arn
maximum_batching_window_in_seconds = 30
batch_size = 100
parallelization_factor = 1
starting_position = "LATEST"
}

# Every Lambda should have AppConfig read access.
module "example_appconfig_read" {
source = "../modules/iam"
resources = ["${aws_appconfig_application.yasp.arn}/*"]
}

module "example_appconfig_read_attachment" {
source = "../modules/iam_attachment"
id = "example_appconfig_read"
policy = module.example_appconfig_read.appconfig_read_policy
roles = [
module.example_lambda.role,
]
}

# Every Lambda should have KMS read access.
module "example_kms_read" {
source = "../modules/iam"
resources = [
module.yasp_kms.arn,
aws_kms_key.xray_key.arn,
]
}

module "example_kms_read_attachment" {
source = "../modules/iam_attachment"
id = "example_kms_read"
policy = module.example_kms_read.kms_read_policy
roles = [
module.example_lambda.role,
]
}
# Every resource that interacts with an encrypted resource needs KMS write access.
module "example_kms_write" {
source = "../modules/iam"
resources = [
module.yasp_kms.arn,
aws_kms_key.xray_key.arn,
]
}

module "example_kms_write_attachment" {
source = "../modules/iam_attachment"
id = "example_kms_write"
policy = module.example_kms_write.kms_write_policy
roles = [
module.example_lambda.role,
]
}

# Only specific Lambda need read permission for the Kinesis Data Stream.
module "example_kinesis_read" {
source = "../modules/iam"
resources = [
module.example_kinesis.arn,
]
}

module "example_kinesis_read_attachment" {
source = "../modules/iam_attachment"
id = "example_kinesis_read"
policy = module.example_kinesis_read.kinesis_read_policy
roles = [
module.example_lambda.role,
]
}

This is, objectively, a terrible way to implement the principle of least privilege and consistently led to resource access errors whenever we deployed or updated a data pipeline. Now with v1.0, the same configuration is:

module "example_kinesis" {
source = "../modules/kinesis_data_stream"

kms = module.substation_kms
config = {
name = "substation_example"
autoscaling_topic = aws_sns_topic.autoscale_default_topic.arn
}

access = [
# Reads from the Kinesis stream.
module.example_lambda.role.name,
]
}

module "example_lambda" {
source = "../modules/lambda"

kms = module.substation_kms
appconfig = aws_appconfig_application.substation

config = {
name = "substation_example"
description = "Substation Lambda that is triggered from a Kinesis Data Stream."
image_uri = "${module.substation_ecr.repository_url}:v1.0"
image_arm = true

env = {
"SUBSTATION_CONFIG" : "http://localhost:2772/applications/substation/environments/prod/configurations/substation_example",
"SUBSTATION_LAMBDA_HANDLER" : "AWS_KINESIS_DATA_STREAM",
}
}
}

resource "aws_lambda_event_source_mapping" "example_lambda" {
event_source_arn = module.example_kinesis.arn
function_name = module.example_lambda.arn
maximum_batching_window_in_seconds = 30
batch_size = 100
parallelization_factor = 1
starting_position = "LATEST"
}

The two configurations are functionally equivalent, but the v1.0 configuration is much saner. Instead of making users assign access correctly by hand, resources now have an access variable that applies least privilege to any supplied IAM role. This makes the configuration more readable and significantly reduces the chance that someone will forget to grant the appropriate access.

Complex Jsonnet Configurations

Substation uses Jsonnet as its configuration language (still a good choice), but the initial implementation left a lot to be desired due to requirements imposed by the source code. Users could have written their own Jsonnet functions to mitigate most of these issues, but reducing the built-in complexity was still worth addressing.

A common pattern used internally at Brex is to check for the existence of a JSON value, which was done like this in the original release:

sub.interfaces.processor.copy(
  settings={
    key: 'foo',
    set_key: 'bar',
    condition: sub.interfaces.operator.all([
      sub.patterns.inspector.length.gt_zero(key='foo'),
    ]),
  },
),

The equivalent in v1.0 is:

sub.transform.object.copy({ object: { source: 'foo', target: 'bar' } }),

The source code was updated to only execute a transform if the JSON value exists. Our internal deployments rely on ~25,000 lines of Jsonnet configuration (~150,000 lines when compiled to JSON), so this change significantly reduces “lines of config” and dramatically improves overall readability.

In the spirit of improving readability, every transform function was updated to be more atomic and easier to understand at a glance. This is the same function in both versions:

// v0:
sub.interfaces.processor.domain(
  options={ type: 'tld' },
  settings={ key: 'domain', set_key: 'tld' },
),

// v1.0:
sub.transform.network.domain.top_level_domain({ object: { source: 'domain', target: 'tld' } }),

In the first release, most functions were abstract and configurable. Confusingly, every function also took both options and settings variables, with no clear distinction between the two. Now the same function is easier to reason about at a glance, belongs to a category of similar functions (“network”), and requires fewer characters to type. If you want it even shorter, you can use the shorthand function:

sub.tf.net.domain.tld({ obj: { src: 'domain', trg: 'tld' } }),

Most of these changes removed complexity, but in some cases we added complexity to create new opportunities. For example, conditions were removed from every transform function and replaced by a meta switch transform that allows for any combination of if-elif-else logic — this wasn’t impossible to do before, but it was a major pain to implement.
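
As a rough sketch of what this enables (the function and field names below are assumptions extrapolated from the patterns shown above, not a verbatim copy of the released API), a switch is a list of cases, each pairing a condition with a transform, and a case without a condition acts as the final else branch:

// Illustrative sketch only: the exact names of these functions and
// fields are assumptions and may differ from the released API.
sub.transform.meta.switch({ cases: [
  // if: events whose 'status' equals 'error' are tagged as failures.
  {
    condition: sub.condition.string.equal_to({ object: { source: 'status' }, value: 'error' }),
    transform: sub.transform.object.insert({ object: { target: 'outcome' }, value: 'failure' }),
  },
  // else: every other event is tagged as a success.
  {
    transform: sub.transform.object.insert({ object: { target: 'outcome' }, value: 'success' }),
  },
] }),

Because cases are evaluated in order, any combination of if-elif-else logic composes from a single list instead of scattering conditions across individual transforms.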

Waiting to Refactor

For better or worse, in the original release we often added code instead of removing or refactoring it. This wasn’t a completely horrible decision, since users could simply ignore the features they didn’t need, but it did leave us with a bloated codebase that needed pruning before the v1.0 release. This is apparent if you look at the history of the project’s exported functions: for example, v0.9.2 had 13 exported functions related to transforming data, while v1.0 has only 2 that serve the same purpose.

Related to this, a word of caution for anyone attempting to refactor a codebase of 10,000+ lines of code (Substation is now 27,000+ lines of Go): if you try to do it all at once, then you might not survive. It’s better to iteratively merge changes to the main branch instead of refactoring an entire project in a single dev branch.

What’s Next

From my point of view, the release of Substation v1.0 is a bookend; the project was production-grade before, but now it’s a “shipped product.” We’ll continue to build on and support it, but we no longer plan to maintain a public roadmap — instead we’re moving to a model of iterative development that tracks changes using SemVer. (As of this writing, the project is already on v1.1.1!) If you haven’t used it yet, then give it a try and reach out to the team on the GitHub repository if you have questions or feedback!
