This is the second installment of Building a Delivery Ecosystem; see Part 1 for the first installment.
In the first installment of this post, we described a delivery ecosystem as a collection of products and services that software engineers rely on to deliver value to their customers. When delivery ecosystem products are defined and managed as code (for example, infrastructure-as-code), we can apply software engineering practices such as Domain-Driven Design and Event Storming to determine product boundaries and build shared understanding. Using the key delivery ecosystem journeys described in Part 1, initial product epics and stories can be created to build or evolve a delivery ecosystem using agile development practices, including user stories and test-driven development.
With cloud computing, it’s important to avoid over-engineering the delivery ecosystem, since costs are typically tied to a consumption model; a cloud-based delivery ecosystem is not a fixed asset in your data center. To avoid over-engineering, we can apply the software engineering practice of YAGNI, or the concept of simple design, to the delivery ecosystem. Every feature of the delivery ecosystem can be built incrementally and iteratively to serve well-defined and measurable customer needs, with mindful Last Responsible Moment decision making and feature prioritization.
Prioritizing delivery products & capabilities
Core compute and network connectivity are typically among the first needs of an engineering team, followed by other capabilities (such as observability) to support ongoing development. The delivery of features needed by engineering teams can be prioritized by journey: Path to Production (P2P), Path to Repair (P2R), Path to Compliance (P2C), and organized by product or context boundaries. Potential delivery product boundaries and capabilities, in order of implementation when building a new delivery ecosystem, are suggested below. In addition to the order of implementation, we include some useful attributes to consider, along with notes on change friction. After all, our only certainty about technology is that it will change; it is important to optimize for and account for the cost of change. Cost of change and the ability to evolve are essential architecture concerns of any investment in technology.
Core compute describes the resources providing CPU and memory capacity. They may be established by any combination of:
- Infrastructure-as-a-Service (IaaS), whether virtual machines, containers, or serverless.
- On-Premise compute (data center hardware).
When examining core compute options, consider:
- Elasticity to scale and shrink on-demand.
- Ability to automate, so the resource can be provisioned and configured through a standard interface.
- Deployability and rapid self service provisioning.
- Immutability, replacing a resource with a completely new one rather than modifying it in place.
- Testability, including built-in interfaces that can be used to automatically test and validate the resource.
A compute resource provider that supports infrastructure-as-code capability can address the attributes above and create a foundation for configuration management.
High change friction. Provisioning compute resources tends to be vendor-specific, whether public cloud, private cloud, virtual machines, or physical servers. Change friction can be reduced through the use of tools such as Terraform, which provide a general approach to configuration management. Infrastructure-as-code primitives will vary by vendor but the architecture and design of the delivery ecosystem can be preserved to some degree from vendor to vendor, especially in the case of cloud computing. Tests can check for functionality with vendor-agnosticism in mind. Investing the time and effort into developing tests for compute resource functionality will help evaluate the differences between vendors and operating systems and identify how they can affect services upstream.
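The vendor-agnostic tests described above can be sketched as checks against a resource description exported from any provider. A minimal sketch in Python, assuming a hypothetical field layout (`scaling_group`, `provisioned_by`, `update_strategy` are illustrative names, not any vendor's schema):

```python
# Vendor-agnostic checks that a provisioned compute resource meets the
# attributes discussed above (elasticity, automate-ability, immutability).
# The dict shape is hypothetical; adapt it to your provider's export format.

def validate_compute_resource(resource: dict) -> list[str]:
    """Return a list of failed checks for a provisioned compute resource."""
    failures = []
    # Elasticity: the resource should belong to a group that can scale.
    if not resource.get("scaling_group"):
        failures.append("no scaling group: cannot grow/shrink on demand")
    # Automate-ability: provisioning must happen through a standard interface.
    if resource.get("provisioned_by") != "api":
        failures.append("not provisioned through an API")
    # Immutability: updates replace the resource, never mutate it in place.
    if resource.get("update_strategy") != "replace":
        failures.append("resource is mutated in place rather than replaced")
    return failures

# The same checks run unchanged against descriptions from any vendor.
resource = {"scaling_group": "asg-web", "provisioned_by": "api",
            "update_strategy": "replace"}
assert validate_compute_resource(resource) == []
```

Because the assertions target attributes rather than vendor primitives, the same suite highlights exactly which expectations break when moving between providers.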
Deployment pipelines automate the creation and promotion of all products, including those that are part of the delivery ecosystem. Consider a continuous delivery framework that supports:
- Integrability with source code management.
- Automate-ability and audit-ability of the build, deploy, and test process.
- Flexibility to create pipelines for all products, including delivery ecosystem products.
While there are many resources elaborating on continuous delivery for services, deployment pipelines for delivery resources can be difficult to construct. Managing infrastructure-as-code in a pipeline is a critical capability for service development and operations, since minute differences in delivery products (e.g., network policy or connectivity) can have larger impacts on the services they host. In general, delivery ecosystem deployment pipelines might have stages to:
- Check Configuration & Syntax.
- Integration Test.
- Security Compliance Test.
- Monitoring Test.
- Performance Test.
During the development process, patterns for services’ deployment pipelines will become clearer. At this point, a library can be created which reflects these shared patterns and standardizes the required tests for continuously delivering a service.
High change friction. When choosing a CI framework, syntax specifics and functions may not be available or equally robust in every framework (for example, sharing artifacts between jobs). These features may require additional effort to convert. However, core pipeline steps (such as running shell commands) can be encapsulated into a task framework to reduce change friction. Treating deployment pipelines as “dumb pipes” with intelligence encapsulated in code external to the pipeline can minimize the effort involved in adopting new build pipeline tooling.
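The “dumb pipes” idea can be sketched as a small task registry: pipeline logic lives in plain functions, and the CI tool only ever invokes a task by name. This is a hypothetical sketch, not any particular CI framework's API:

```python
# "Dumb pipes": the CI pipeline only calls run("<stage>"); all intelligence
# lives in registered task functions, external to the pipeline definition.

TASKS = {}

def task(name):
    """Register a function as a named pipeline task."""
    def register(fn):
        TASKS[name] = fn
        return fn
    return register

@task("check-syntax")
def check_syntax():
    # Real tasks would shell out to linters, terraform validate, etc.
    return "syntax ok"

@task("integration-test")
def integration_test():
    return "integration ok"

def run(stage: str) -> str:
    """The single entry point a CI vendor's pipeline step invokes."""
    return TASKS[stage]()
```

Switching CI vendors then means re-pointing each stage at `run(...)` rather than porting vendor-specific pipeline syntax.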
Secrets management includes storage and orchestration of all secrets required to deliver value through the delivery ecosystem (such as cloud provider account credentials). When choosing a secrets management framework, it is useful to have:
- Access control for specific secrets.
- Operability and scalability of management.
- Inject-ability & renewability of secrets.
There are many patterns for retrieving secrets from a secrets management store. They can be injected as environment variables or retrieved at service start-up. Regardless of the approach, it is important to use secrets management with access control to restrict access to certain secrets in production and between products. The principle of least privilege applies!
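The environment-variable injection pattern can be sketched with a fail-fast check at start-up. The variable name below is illustrative, and the injection step stands in for whatever the secrets manager actually does at deploy time:

```python
import os

# Start-up secret injection sketch: the service reads secrets from its
# environment (placed there by the secrets manager) and fails fast when one
# is missing, rather than limping along with partial configuration.

def require_secret(name: str) -> str:
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing required secret: {name}")
    return value

# Stand-in for the secrets manager injecting the value at deploy time.
os.environ["DB_PASSWORD"] = "injected-at-deploy-time"
db_password = require_secret("DB_PASSWORD")
```

Failing fast on a missing secret keeps misconfiguration visible at deploy time instead of surfacing later as a confusing runtime error.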
Low change friction. This depends on the framework available and the operational responsibility you are willing to assume. Some frameworks are open source and can be deployed and self-managed. Other secrets management SaaS offerings can reduce operational complexity at the cost of vendor lock-in. Similar to pipelines, encapsulating secrets lifecycle tasks can reduce the change friction across frameworks.
Identity & Authorization
We need an identity and authorization framework to not only address service-to-service communication but also holistically manage delivery ecosystem accounts. It is difficult to implement a framework that addresses IaaS and SaaS accounts, Attribute-Based Access Control (ABAC) and Role-Based Access Control (RBAC), SaaS integration for single sign-on or federated identification, and token-based access control for application and API authorization. Developing this product early helps reduce the pain of managing access control across many IaaS and SaaS components. Similar to infrastructure-as-code, roles and service accounts can be managed using a build pipeline. This eliminates the need for “root” access for any human user and leverages version control for tracking changes to access.
In a service or microservice based architecture, APIs and events or messaging interfaces provide access to resources. Those resources are typically protected via a token-based Identity and Authorization implementation such as OpenID Connect. Service-to-Service authorization can also be controlled by network policy, such as certificates and service mesh. Distributed applications are often secured via a combination of network policy and token-based identity and authorization.
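A sketch of the token-based authorization step, assuming the token's signature has already been verified by an OpenID Connect library (omitted here): the receiving service checks the claims it cares about. Claim names follow JWT conventions (`aud`, `exp`, `scope`); the validation logic itself is illustrative, not a complete implementation.

```python
import time

# Claims-based authorization sketch for service-to-service calls. Signature
# verification is assumed to have happened already; this only inspects claims.

def authorize(claims: dict, expected_audience: str, required_scope: str) -> bool:
    if claims.get("aud") != expected_audience:
        return False              # token was issued for a different service
    if claims.get("exp", 0) <= time.time():
        return False              # token has expired
    # JWT scope is conventionally a space-delimited string of grants.
    return required_scope in claims.get("scope", "").split()

claims = {"aud": "orders-api", "exp": time.time() + 300, "scope": "orders:read"}
assert authorize(claims, "orders-api", "orders:read")
```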
High change friction. To accomplish federated identification needs for both service APIs and cloud accounts, there may be a higher potential for vendor lock-in. While many vendors support open protocols such as OAuth, standard implementations such as OpenID Connect can reduce friction, and additional integrations will depend on the framework. Discussion around a common identity and authorization framework must begin early, as it often requires procurement.
Managing distributed services and their dependencies can be difficult. Containers virtualize the operating system and isolate application resources, which allows services to be deployed immutably and their dependencies to be organized. Changes to code are repackaged into a new container image and deployed.
From a compute resource standpoint, leveraging containers for service deployment can improve the density of services per compute server, reducing cost and optimizing resources. When there are multiple instances of services and their corresponding containers, a container orchestrator can wrangle them into more useful constructs. Container orchestrators have built-in schedulers that help manage and deploy various container instances, schedule tasks on available resources, and expose specific services. There are both proprietary and open-source container orchestrators that offer container management (such as Kubernetes, Cloud Foundry, Nomad, Marathon).
A container orchestrator should have some form of:
- Service discovery to resolve a service from another service or from outside the orchestrator.
- Self-service to deploy to the orchestrator and leverage its features.
- Resiliency when a failure is introduced into the system.
- Debug-ability for checking the status of the application and troubleshooting issues.
- Elasticity from add-ons that will grow and shrink the cluster and its services based on resource usage.
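The scheduling behavior behind these attributes boils down to reconciliation: compare desired state with observed state and emit the actions needed to converge. A toy sketch of that loop (real orchestrators do far more, including placement, health checking, and rollout):

```python
# Toy reconciliation loop: the essence of an orchestrator's scheduler.
# `desired` and `observed` map service names to replica counts.

def reconcile(desired: dict, observed: dict) -> list[str]:
    actions = []
    for service, want in desired.items():
        have = observed.get(service, 0)
        if have < want:
            actions.append(f"start {want - have} x {service}")
        elif have > want:
            actions.append(f"stop {have - want} x {service}")
    return actions

# One web replica is running but three are desired: start two more.
assert reconcile({"web": 3}, {"web": 1}) == ["start 2 x web"]
```

Declaring desired state and letting a loop converge toward it is also what gives orchestrators their resiliency: a failed instance simply reappears as a deficit on the next pass.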
Low change friction. Container orchestrators have become quite popular, with low to moderate change impact depending on the orchestrator and runtime in use. Many of them leverage similar patterns for service discovery, scalability, observability, and resilience but may not offer exactly matching features. For example, there may be some impact if autoscaling features available in one orchestrator differ in another.
Container Image & Artifact Registry
Product artifacts in a delivery ecosystem should be managed immutably, which means they are updated by replacement rather than in place. Container technologies lend themselves well to this model. However, we need a registry to store both container images and any internal and third-party libraries in use. Consider registries with the following attributes:
- Built-in security scanning.
- High availability.
- Support for a diverse set of stacks.
- Access control for container images (to allow for separate read/write).
- Automate-ability for image/library security acceptance.
It is likely that the container image registry will be different from the artifact registry.
Low change friction. Container registries generally involve two functions: read and write. These do not change as long as the container runtime is unaffected. Similarly, the change friction of artifact registries may be low if dependency management is unchanged. The primary sources of friction are building and compiling images, updating endpoints, and the updates required to address security vulnerabilities in any of the artifacts.
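One small, concrete way to enforce the immutability discussed above at the registry boundary is to require that deployments reference images by content digest rather than a mutable tag, so the artifact that was scanned and tested is exactly the artifact that runs. A minimal sketch of such a policy check:

```python
# Policy check sketch: deployment manifests must pin images by digest
# (name@sha256:...) instead of mutable tags like :latest.

def is_immutable_reference(image: str) -> bool:
    """True when the image is pinned by content digest."""
    return "@sha256:" in image

assert is_immutable_reference("registry.example.com/app@sha256:" + "0" * 64)
assert not is_immutable_reference("registry.example.com/app:latest")
```

A check like this fits naturally as an early stage in the deployment pipelines described earlier.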
By using immutable image artifacts for virtual machines and containers, we can further automate vulnerability management. While more dynamic components of our ecosystem may challenge current security scanning workflows and processes, we can apply “as-code” approaches to more nimbly address critical vulnerabilities. Constructing a deployment pipeline for a virtual machine image not only enables functional testing but also security gatekeeping. We can apply similar ideas to container images. Vulnerability management should address:
- Agility to reduce manual intervention during risk analysis and review.
- Immutability for patching by releasing new images and minimizing downtime.
- Ephemerality to shorten the lifetime of an instance and reduce the attack surface.
- Integrability to interface within a deployment pipeline.
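The agility and integrability attributes can be sketched as an automated security gate in a deployment pipeline: scan findings are compared against a policy threshold, removing manual intervention from routine risk review. The findings shape below is hypothetical, not any particular scanner's output format:

```python
# Security gate sketch: fail the pipeline when any scan finding meets or
# exceeds the policy threshold. Findings use a hypothetical scanner shape:
# a list of dicts with a "severity" key.

SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}

def gate(findings: list[dict], fail_at: str = "high") -> bool:
    """Return True when the build may proceed."""
    threshold = SEVERITY_RANK[fail_at]
    return all(SEVERITY_RANK[f["severity"]] < threshold for f in findings)

assert gate([{"severity": "low"}, {"severity": "medium"}])
assert not gate([{"severity": "critical"}])
```

Keeping the threshold as code makes the risk policy reviewable and versioned alongside everything else in the ecosystem.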
Low change friction. In general, security scanning will depend on the framework. Consider reducing change friction by leveraging tools with industry benchmarks, such as Center for Internet Security (CIS). These benchmarks should remain agnostic of the underlying virtual machine or container image. Changes to the vulnerability management framework often affect the risk management process rather than deployment.
Observability includes monitoring, dashboards, alerts, log aggregation, and tracing. Service troubleshooting benefits from early implementation of tracing and log aggregation to debug end-user issues. Monitoring should provide the inputs for actionable alerts and purposeful dashboards.
Observability can be achieved with the following attributes:
- Standardization in the form of a contract a service must adhere to in order for it to be observed.
- Self-service to support new custom metrics and alerting.
- Ability to automate alerts and adjust thresholds without friction.
- Traceability for finding metrics sources and correlating events in the system.
- Operability to support sudden changes in metrics and tracing volume.
As a product, observability encompasses a large set of features. As the number of services increases, new patterns emerge to address scale. We might start with tracing frameworks or metrics libraries and evolve to proxy sidecars for tracing requests and exporting metrics. Standardizing early by establishing health and metrics endpoints and formats provides a foundation to scale.
Low change friction. If observable business capabilities and delivery ecosystem products produce metrics in open-standard format, such as Prometheus or OpenMetrics, changing metrics aggregators can be low friction. Similarly, standardizing logging to output in structured, parsable formats (such as using fluentd for log message transformation) can help reduce the change friction when changing log aggregators. Alerting can be implemented with low friction open-source libraries but integrations to alerting mechanisms (pager tooling or regular communication channels) may vary. Dashboards tend to be vendor-specific and customized to the particular service, thus having a high change friction.
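As a sketch of the standardization point, a service can expose metrics in the Prometheus text exposition format so that any compatible aggregator can scrape it without service changes. The metric name below is illustrative, and this renders only simple gauges, a small subset of the full format:

```python
# Render metrics in (a subset of) the Prometheus text exposition format:
# a "# TYPE" comment line followed by "name value" for each gauge.

def render_metrics(metrics: dict[str, float]) -> str:
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

output = render_metrics({"http_requests_in_flight": 7.0})
assert "http_requests_in_flight 7.0" in output
```

Serving this text from a conventional endpoint such as `/metrics` is the contract that lets aggregators be swapped with low friction.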
Network as a product can be divided into three sub-domains:
- Connectivity.
- Service Discovery.
- Network Policy.
The network can often be difficult to implement quickly for service-based architectures. Certain frameworks, such as container orchestrators, have addressed the friction of deploying and configuring networks with constructs specific to the framework. For example, we often use internal DNS resolution of services within a container orchestrator. To support the scale and diversity of services, consider a networking implementation with the following attributes:
- Modularity to provide networking abstractions for each service, since the network is often a monolithic service.
- Manageability of network-as-code helps configure individual rulesets and reflect the intent of configuration.
- Deployability and rapid configurability of networking components reduces the friction required to start a new set of services.
In the case of connectivity, infrastructure-as-code approaches to managing public cloud networking or data center routing configurations help ensure agility and compliance of services. Connectivity means that two services are able to communicate with each other. For example, if a manually configured routing table rule on a network router disappears, it may direct the network traffic of one service into a black hole and impact that service’s connectivity. Services may require adherence to compliance regulations (e.g., Payment Card Industry or medical) outlining physical or logical isolation of data and resources. This is called network segmentation and can be accomplished in a variety of ways beyond routing table rules, such as public cloud security groups, which segment by intent rather than by address space.
Service discovery refers to the ability to resolve a service at a human-understandable alias. This can be done by a variety of mechanisms, including load balancing, path-based routing, DNS registration, and more. Sometimes, registering a service can be very difficult and requires a sequence of tickets, processes, and workflows to correlate DNS, load balancing, and a pool of IP addresses. Dynamic service discovery facilitates the deployability and manageability of services. It is helpful to use software load balancers and highly automated DNS registration to allow self-service of secure, highly available service endpoints.
We consider mechanisms that implement rulesets to allow or deny communication between services as network policy. Container orchestrators often have their own tools for enforcing network policy (e.g., Calico). They tend to be easily automated and intent-driven. For IaaS providers, we can use network security groups to automate more specific rulesets on network policy. We can also use Access Control Lists (ACLs) at a minimum to provide least-privilege access. For a data center, firewalls can also enforce rules for inbound traffic. Network policy is often cited in relation to compliance. The method of enforcing network policy might differ but accomplishes similar outcomes. Managing these rules “as-code” mitigates the risk caused by conflicting rules from multiple levels of abstraction and allows services to self-manage their particular communication intent. For finer-grain traffic management, service meshes have started to address some scaling concerns in the container orchestration space. We might choose a service mesh to scale service discovery, load balancing, encryption, service-level authentication and authorization, support for circuit breaking, and other capabilities via a unified library or provider. We can also use many other libraries not associated with service mesh to support these functions.
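Network policy “as code” can be sketched as a declared set of communication intents evaluated with a default-deny stance, reflecting least privilege. The rule shape and service names below are hypothetical, not tied to any particular provider:

```python
# Network policy as code sketch: each rule declares an intended communication
# path; anything not declared is denied by default (least privilege).

RULES = [
    {"source": "web", "destination": "orders-api", "port": 443},
    {"source": "orders-api", "destination": "orders-db", "port": 5432},
]

def is_allowed(source: str, destination: str, port: int) -> bool:
    return any(r["source"] == source and
               r["destination"] == destination and
               r["port"] == port
               for r in RULES)

assert is_allowed("web", "orders-api", 443)
assert not is_allowed("web", "orders-db", 5432)   # default deny
```

Because the rules are data under version control, conflicting intents from different abstraction levels become visible in review rather than at incident time.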
High change friction. Due to the wide array of capabilities that fall under networking, change friction will be heavily vendor- or tool-dependent. For example, changing a load balancer technology may have more change friction than changing a network policy provider. Furthermore, we might have even higher change friction by adopting cross-domain approaches, such as transitioning to a service mesh from market standard libraries and tools serving tracing, traffic management, and encryption needs. Assembling these capabilities on an as-needed basis allows for more freedom to change tooling or approach.
Messaging, Events, & Persistence
As services grow and begin to publish events, an ecosystem emerges of other services that need to be triggered by those events. While services may not be formally organized in an event-driven architecture, events allow services to execute as needed rather than stand staged at the ready. From a delivery ecosystem standpoint, services constructed in this manner enable on-demand resource optimization, as services scale based on event volume rather than expected usage.
Depending on the service architecture, decisions around messaging, events, and persistence will vary widely. Choices directly depend on product needs and the appetite for operational complexity. System qualities to consider for messaging or persistence include:
- Agility in automation to build, backup, and restore quickly.
- Observability for securely troubleshooting events, transactions, or issues.
- Scalability to support initial and short-term scale.
- Operability of resilience to assess self-healing capability.
Messaging can be in the form of a queue or event stream and persistence can be a database or cache. It is tempting to build for long-term scale, meaning a very high volume of events or transactions. However, it comes with high operational complexity and lower availability, especially since technologies used for massive scale tend not to have features for handling low-volume use cases. In other situations, a poorly configured messaging or persistence technology will not scale or be performant. For example, a misconfigured database may limit the number of writes and prevent services from scaling under load.
As per the Extreme Programming Principle “You aren’t going to need it” (YAGNI), build persistence only when the service requires it. This tends to be a fairly challenging product to manage, as it requires a greater level of observability and resiliency engineering. However, in the case of event sourcing, there may be some discussion around the implementation of persistence to capture events. If this is required for a greater concern regarding auditability or observability, it may be worth considering an initial implementation investment.
High change friction. There are many open source messaging and persistence technologies, some of which are available as managed SaaS offerings on public clouds. Vendor-specific messaging and persistence offerings may have high change friction, as interfaces and libraries are often specific to the vendor implementation. Furthermore, fundamental assumptions about messaging with one public cloud vendor will differ from another, which may affect the architecture of the services. Persistence technologies often have high change friction, especially if the schemas or data are inconsistent. Including an anti-corruption layer, a layer that delegates communication, can ease the friction.
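The anti-corruption layer mentioned above can be sketched as a thin translation function that maps a vendor-specific message envelope into the internal event model, so services never depend on the vendor's schema directly. Both message shapes here are hypothetical:

```python
# Anti-corruption layer sketch for messaging: the only code that knows the
# vendor's envelope shape is this translator; services consume the internal
# event shape. Field names on both sides are illustrative.

def to_internal_event(vendor_message: dict) -> dict:
    """Translate a vendor envelope into the internal event shape."""
    return {
        "event_id": vendor_message["MessageId"],
        "event_type": vendor_message["Attributes"]["type"],
        "payload": vendor_message["Body"],
    }

msg = {"MessageId": "abc-123",
       "Attributes": {"type": "order.created"},
       "Body": '{"order": 42}'}
event = to_internal_event(msg)
assert event["event_type"] == "order.created"
```

Swapping messaging vendors then means rewriting one translator instead of every consuming service.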
Evolutionary delivery ecosystems and teams
In a delivery ecosystem, we must prioritize application and delivery features based on the needs of the customer, business capabilities, and security stakeholders. YAGNI and all foundational software engineering practices apply to build a low-friction and efficient delivery ecosystem. Building the delivery products incrementally can map to specific value, reduce operational complexity, and optimize for resources.
To begin the journey of building our delivery ecosystem, we typically start with a single “Hello World” service and determine what is needed to deploy it. Core compute, deployment pipelines, secrets management, identity & authorization, container orchestration, and container image & artifact registries provide the initial foundations to support a simple service in a non-production environment. Before promoting code to production, vulnerability management, observability, and networking may need to be ironed out in order to address security and operations concerns.
As more services begin to communicate with each other, networking and observability patterns may grow more complex with network policy and tracing. Events can be addressed using a messaging component but the services remain stateless. Gaining infrastructure-as-code, resiliency engineering, and operations proficiency by managing the delivery ecosystem configuration and state provides a foundation for addressing concerns with persistence. Establishing simple capabilities allows flexibility to change as the environment scales.
A capability is not complete unless it is automated and deployed via a build pipeline.
As a result, simpler capabilities can be implemented with a focus on quality and manageability.
Starting with a single service often highlights that all teams need to adjust to new ways of working and related tooling. Some of these capabilities may already exist in business product teams but are often new to existing data center and operations teams. To support a scalable, resilient service architecture, data center engineering and operations begin to align around product-mode rather than project-mode, allowing them to evolve more quickly to address business product teams’ needs. These teams can build a basis for knowledge transfer through feature development, issue tracking, and shared documentation. This often improves operational readiness and builds the confidence to stabilize issues and support all changes.
Creating an efficient and flexible delivery ecosystem is worth the investment and transformation.
Supporting a services architecture can be challenging, both culturally and technologically. With an evolvable delivery ecosystem, we can build and change different services that deliver business value and grow teams beyond the traditional data center boundaries. While the list above is not a comprehensive outline of every product or capability required for delivery, it covers an initial set that supports the scalability and agility of an ever-changing set of services. Just as we incrementally improve a product to react to the market, we build and improve our delivery ecosystem to adapt to varying workloads and demands.