DevOps Practices for Crypto Infrastructure, Part I: Version Control, Full Stack Automation, and Secrets Management
When standing up services that have cryptographic interactions with a blockchain, the DevOps infrastructure and practices you employ dictate a lot about the security and reliability of those services. In this two-part series of posts, I will introduce core DevOps principles that can help guide crypto infrastructure creation. I’ll also share DevOps infrastructure approaches that have worked well for me and that could help other teams looking to stand up crypto as-a-service offerings.
Cloud vs Roll-Your-Own
For many infrastructure elements, you must choose whether to go with a cloud provider such as AWS, Azure, or Google, or to roll your own in a colocated data center with self-managed software. In crypto and blockchain, there are some specific requirements, particularly relating to key security, which may factor into requirements around hardware security modules (HSMs), physical servers, and tiers of colocated data centers (more on this later).
But in general, all other things being equal, if one of the three major cloud providers offers something as a service, there are a lot of reasons to choose it over rolling your own with purchased hardware and self-managed software.
In my experience, it is very easy to underestimate the investment and labor required to self-manage infrastructure elements well over time. Especially when the software is open source, the temptation is always to just pull down the software and start running it, with focus landing on the cost of the hardware instead of the cost of the cloud service. Startup teams almost always underestimate the DevOps staff required to manage, upgrade, performance-tune, patch, and evolve this infrastructure over time, and it becomes baggage as the team and the company grow. For any piece of infrastructure, you really have to ask yourself whether this is the best use of your team’s time.
In most cases, you will want to focus your energy on the things that only you can do, and purchase services where possible from a reputable cloud provider. The three major cloud providers (AWS, Azure, Google) all have large, highly specialized teams surrounding each of their as-a-service offerings. As a smaller company, there is no way you are going to do a better job with management and security than these cloud provider teams for base/commodity offerings.
My take: go with a cloud provider, or (better yet) more than one cloud provider, and focus on building and running the things that you can’t purchase as a service and that are unique to your offering.
Version Control
In recent years, the idea of infrastructure-as-code has become a leading principle in DevOps. This is part of a larger evolution of DevOps that continues to shift the discipline towards looking more and more like a software development practice. A core part of any software development practice is storing all your software artifacts in a version control repository. Artifacts can include source code, configuration files, data files, and in general any of the inputs needed to build your software and your infrastructure environment. It seems like a given, but I have seen operational environments where not all of the artifacts necessary to build the environments were stored in source control.
The benefit of storing everything under version control is that you have a unique version for any given state of the artifacts used to build your environments. This allows for repeatable builds of environments, the implementation of change-control processes around these artifacts, and the ability to roll back to any previous known-good state if there are issues. High-quality, cost-effective cloud-based services such as GitHub make this an easy choice as a foundation for DevOps activity.
Full Stack Automation
One of the best things about using the cloud for your infrastructure is the programmability and APIs that the cloud vendors provide. These APIs can be used to automate the entire application stack from base layer network, DNS, storage, compute, up to operating systems and serverless functions, and all the way through to the custom code in your application. Taking an infrastructure-as-code approach means having software artifacts in your source code repository and a build process that can create an entire application environment in a fully automated way. This automation can be used to drive the initial build and incremental change to development, test, and production environments.
There are good tooling options these days to achieve this kind of infrastructure automation. At the base infrastructure level, there are solutions native to each cloud provider, such as AWS CloudFormation or Google Cloud Deployment Manager. We are fans of Terraform, as it allows for the management of infrastructure in AWS, Azure, and Google from the same codebase with provider-specific modules and extensions. Once the base-level infrastructure has been provisioned, Packer images combined with configuration management tools like Ansible, Chef, or Puppet can be used to configure host-based services.
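As a sketch of what driving this from code can look like, here is a minimal Python wrapper around the Terraform CLI. The environment names and the `environments/<env>.tfvars` file layout are illustrative assumptions, not part of any standard; the Terraform subcommands and flags themselves are real.

```python
import subprocess

def terraform_commands(env: str, auto_approve: bool = False) -> list:
    """Build the Terraform CLI invocations for one environment.

    The environments/<env>.tfvars layout is an assumption for
    illustration; adapt it to your repository's conventions.
    """
    apply_cmd = ["terraform", "apply", f"-var-file=environments/{env}.tfvars"]
    if auto_approve:
        apply_cmd.append("-auto-approve")  # non-interactive, for CI pipelines
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "workspace", "select", env],
        ["terraform", "plan", f"-var-file=environments/{env}.tfvars"],
        apply_cmd,
    ]

def run_pipeline(env: str) -> None:
    """Run each step in order, failing fast if any step errors."""
    for cmd in terraform_commands(env, auto_approve=True):
        subprocess.run(cmd, check=True)
```

Because the same command sequence is parameterized only by the environment name, dev, test, and prod are guaranteed to be built by the same process rather than by hand.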
There are a lot of benefits to be had from automating the full application stack. Automation eliminates manual errors and makes the process repeatable. It can drive the same stack into dev, test, and prod, minimizing the chances of environmental differences leading to surprises. Automation also supports blue/green production deploys, in which an entire new environment is built with updated code and traffic is then cut over from the existing environment to the new one in a controlled fashion. It is also easy to roll back in this model if there is a problem with the new environment.
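The cutover itself is often expressed as a weighted traffic split between the old (“blue”) and new (“green”) stacks, for example via DNS weights or load-balancer target-group weights. A minimal sketch of that staging logic, with illustrative step sizes:

```python
# Staged blue/green traffic shifting. The step sizes are illustrative;
# in practice the weights would be applied via weighted DNS records or
# load-balancer target groups rather than returned as a dict.
CUTOVER_STEPS = [10, 25, 50, 100]  # percent of traffic sent to "green"

def traffic_weights(step: int) -> dict:
    """Return the blue/green traffic split at a given cutover step."""
    green = CUTOVER_STEPS[step]
    return {"blue": 100 - green, "green": green}

def rollback() -> dict:
    """Rolling back is just pointing all traffic at the old stack."""
    return {"blue": 100, "green": 0}
```

Each step gives you a checkpoint to watch error rates before sending more traffic to the new environment, and rollback is a single weight change rather than a redeploy.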
Full stack automation also lends itself to the switch from thinking about servers as unique elements with individual character to managing servers as interchangeable elements. It becomes a straightforward proposition to rip and replace troublesome infrastructure and to use tightly-focused servers rather than sprawling snowflakes that acquire dozens of responsibilities and take on a life of their own.
Secrets Management
When you have an automated environment it is very important that the secrets that are part of your application are managed carefully. Secrets could include service passwords, API tokens, database passwords, and cryptographic keys. The management of crypto keys is particularly critical for crypto infrastructure where private keys are present, such as exchange infrastructure and validators on proof of stake networks. Read my recent blog to learn more about crypto key management using multisig accounts and offline keys.
However, a lot of the same principles apply across infrastructure, application, and crypto secrets. You want to make sure these secrets are not in your source code repo, but rather are obtained at build time or, better yet, at runtime in each environment in which your application runs.
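For example, a service can fail fast at startup if its secrets were not injected into the runtime environment by the deploy tooling. The secret names below are purely illustrative:

```python
import os

# Illustrative secret names; the real list depends on your application.
REQUIRED_SECRETS = ["DB_PASSWORD", "API_TOKEN"]

def load_secrets(environ=os.environ) -> dict:
    """Pull secrets from the runtime environment, failing fast if any
    are missing, so no secret material ever lives in the source repo."""
    missing = [name for name in REQUIRED_SECRETS if name not in environ]
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SECRETS}
```

Failing at startup with an explicit error is much easier to diagnose than a service that boots and then fails on its first database call.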
Software and platform-native tools that help protect secrets in production environments include AWS KMS/CloudHSM and Azure Key Vault, or HashiCorp Vault if you are looking for something cross-platform. Some very sensitive secrets, such as crypto private keys, can benefit from hardware key management systems such as the YubiHSM 2 or Azure Dedicated HSM, which is based on SafeNet Luna hardware. The downside is that hardware solutions are generally less cloud-friendly than software ones and, while they may improve key security, some aspects of security are worsened by taking a hardware approach over a more automatable, cloud-native software approach. The infrastructure cost and the surface area that needs to be managed can also be far higher with a hardware-centric approach.
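To make the pattern behind KMS-style services concrete, here is a toy sketch of envelope encryption: the master key never leaves the (mock) KMS, and each payload is encrypted with a fresh data key that is stored in wrapped form alongside the ciphertext. XOR stands in for a real cipher purely for demonstration; never use this for actual encryption.

```python
import os

def _xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for a real cipher such as AES-GCM. NOT secure.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class MockKMS:
    """Illustrative stand-in for a KMS/HSM: the master key is created
    inside this object and is never exposed to callers."""

    def __init__(self):
        self._master = os.urandom(32)

    def generate_data_key(self):
        """Return a fresh data key plus the same key wrapped (encrypted)
        under the master key, mirroring KMS GenerateDataKey."""
        plaintext_key = os.urandom(32)
        return plaintext_key, _xor(plaintext_key, self._master)

    def decrypt_data_key(self, wrapped_key: bytes) -> bytes:
        return _xor(wrapped_key, self._master)

def encrypt(kms: MockKMS, payload: bytes):
    key, wrapped = kms.generate_data_key()
    # Store the ciphertext together with the wrapped key; discard the
    # plaintext key immediately after use.
    return _xor(payload, key), wrapped

def decrypt(kms: MockKMS, ciphertext: bytes, wrapped: bytes) -> bytes:
    return _xor(ciphertext, kms.decrypt_data_key(wrapped))
```

The point of the pattern is that application hosts only ever handle short-lived data keys, while the long-lived master key stays inside the KMS or HSM boundary.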
Intel SGX is a promising hardware technology that allows processes to run in secure enclaves. A process running in a secure enclave is isolated from the operating system: even with root privileges on the host, you cannot read the memory of a process running in an SGX enclave. I am excited by the use of SGX enclaves combined with, e.g., HashiCorp Vault to improve the security of software and cloud-native secrets management. SGX is available today via Azure Confidential Computing, but has the downside of requiring coding to the SGX APIs. We eagerly await further developments of the AWS Nitro architecture, which we believe will greatly improve the security of software and cloud-native secrets management. Nitro is the AWS approach to providing hardware support for isolating customer workloads on shared infrastructure.
Topics to Cover in Part II
There are many aspects to consider when thinking about secure and reliable infrastructure for crypto-based applications, and we’ve only touched on a handful of areas in this article. Here are some additional areas I cover in part II:
- Authentication
- Authorization and Roles
- Networking
- Monitoring
- Logging
Looking for further information about infrastructure for crypto-based applications? Contact us today
Originally published at https://www.purestake.com on August 5, 2019.