DevOps Practices for Crypto Infrastructure, Part II: Authentication, Authorization, Networking, Monitoring, and Logging
Picking Up Where We Left Off
In part one of this two-part series, I discussed core DevOps principles that helped guide our crypto infrastructure here at PureStake, and discussed some unique considerations around version control, full stack automation, and secrets management. If you didn’t have a chance to read the first post yet, you can find it here.
In this second post, I will continue calling out principles and examining different areas that are important to consider when setting up and running secure and reliable crypto infrastructure.
Authentication and Authorization
One of the most important aspects of ensuring the security of your infrastructure is having the right authentication systems in place.
For logging into infrastructure and servers, I favor centralized/federated authentication directories over local ones. It is very important that DevOps staff have unique user accounts for logging into infrastructure, rather than using shared accounts. Unique accounts provide a record of who logged into what, which is essential to understanding what is happening in your environment. Shared accounts, including direct use of the Administrator or root accounts on servers, become very challenging when you have turnover in your staff or, in the worst case, if there has been an incident. It’s much cleaner to revoke access, assign rights, review past history, and understand what is happening with a centralized directory.
For the scope of authentication, I recommend a full separation of the corporate IT environment and the production infrastructure environments. Fully separate directories are the best approach, even if your directory supports different groups and roles. This greatly reduces human error, resulting in too much — or incorrect — access.
However, that doesn’t mean that you shouldn’t use groups. Grouping users who need access to the infrastructure and assigning the appropriate roles to them is critical to being able to manage access in a reasonable way in crypto and other environments. It is too complicated and too easy to make a mistake when assigning rights to individual users. Even for a small team, having at least a few roles will be appropriate, such as a role with full access for select senior DevOps staff, a role with limited access for junior DevOps staff, and perhaps a monitoring only role for managers and other technical staff. The principle to keep in mind is that of least privilege, which states that users and groups should have as few rights as possible to do their job with a mechanism/process to escalate that can be logged/monitored. This also supports a closely-related concept of blast radius minimization. Having users with roles that employ the concept of least privilege will minimize the blast radius associated with an incident where user credentials or accounts have been compromised.
Using traditional passwords as a way to log into crypto infrastructure is not a good security practice. Where passwords must be used, I recommend the use of a password manager such as Dashlane, which can be set up in a corporate configuration with shared groups, role-based access, and where a unique strong password can be used for each system. Crypto environments require more security than this. At a minimum, all accounts must require two-factor authentication, where the first factor can be a traditional strong password, and an authenticator app is the second factor. A better setup replaces the authenticator app with a physical hardware device.
For identity management in Windows environments, Active Directory is the logical choice. For Linux environments, OpenLDAP and Kerberos serve a similar function. Each cloud vendor has their own identity management scheme including AWS IAM, Azure AD, and Google Cloud IAM, each with their own nuances. Google authenticator works very well as a second factor in 2FA setups. For a physical device second factor, YubiKey is an inexpensive option that plugs into a USB port on your computer. Requiring the YubiKey as one of the authentication factors means that the device must physically be in the possession of the user at the time of login.
Logging
A well-run infrastructure has good mechanisms in place to manage server, application, and as-a-service logs. Logs are not only useful for troubleshooting infrastructure issues, but also provide the basis for audit control and intrusion detection. You need reliable logs to understand what has happened in your environment.
The most important practice is to ship logs off the servers, containers, and other infrastructure elements to isolated, tamper-proof locations. Authorization roles should be employed to isolate these log collection points to make them as tamper-proof as possible. Then the logs can be loaded into query optimized data stores to facilitate visibility, troubleshooting, and monitoring scenarios.
In particular, when running crypto nodes, sometimes logs are the only way to understand what is happening on the nodes. Critical error messages and log entries related to the crypto network protocol can be the only way you can understand that a node is running well or poorly.
In a Windows environment, events can be forwarded to an event collector. In a Linux environment, rsyslog works well for forwarding syslog to regional and ultimately centralized data stores. For log-based searching, troubleshooting, and time series analysis, Splunk is a Cadillac solution: tons of functionality, but at a very high price. An alternative to Splunk is the open source ELK stack (Elasticsearch, Logstash, Kibana) which has gotten a lot better over time and offers a much less expensive way to search and troubleshoot infrastructure based on log data.
Monitoring and Alerting
If you want to run reliable crypto infrastructure, you have to know when services are not running well. The principle here is that everything fails — but early detection allows infrastructure element failures to be remedied quickly.
With good redundancy in design, individual element failures ideally have little-to-no end user impact. For cases where there are failures that lead to end user service impact, strong automation will minimize the time to restore services.
Focusing on early detection, the best way to accomplish that is through the extensive use of monitoring at the different layers of the stack and from different locations. If you are in a colo environment and managing hardware, the monitoring of that hardware will likely require vendor specific tools and possibly the collection of SNMP traps. For cloud environments, the providers offer native monitoring that is integrated with their service offerings. As an example, AWS offers CloudWatch for monitoring AWS based services.
There are a lot of elements in a crypto infrastructure that need to be monitored. It’s important to choose a platform which will serve as the place where monitoring data is sent, where alerting thresholds are set and where alerts are managed. As different monitoring checks are added over time, they can feed into that system. It is extremely difficult to manage alerts, maintenance downtime, and inventory completeness if you have multiple places to go to manage these items.
At the lower end of the stack, you will want to put basic checks in place for OS-level resources like CPU, memory, and disk. Basic network checks would include ping/ICMP, TCP port exhaustion, and TCP service checks. Security events such as those that come off IDS and IPS systems could be fed in here as well. Application-level checks can include HTTPS checks that hit a URL and look for status or error codes and messages.
For crypto-specific infrastructure checks, consider that the base crypto infrastructure consists of nodes. Crypto nodes often expose status and query interfaces via a REST API, so querying that API on a regular basis to look for status and error codes is a good start, but you should be careful that you are not exposing that API to the wider internet. Other checks specific to crypto nodes include looking at block height on nodes, and making sure that it has an expected value. Nodes should be producing blocks on a regular cadence and, depending on the node, role may be helping to support the consensus mechanism of the network. Using monitoring to look for deviations from normal block production or consensus participation behavior is a good early warning indicator of trouble.
Once you have a view from inside of your environment, it is equally important to get a point of view from outside of your environment. This means taking the perspective of your customers and seeing if your services are performing well from their viewpoint. I’ve experienced situations where all services are green from the internal point of view, but a WAN or internet issue means that certain customers are not able to use the service. A common cause is a physical line cut that creates bad network paths to your service until traffic is rerouted. Using a cloud provider with multiple external points of presence can help provide this outside in view of your services.
From an open source monitoring tools perspective, the old workhorse is Nagios and Nagios variants such as Checkmk, which I have used for years to monitor production environments. These tools are starting to show their age, but they are battle-tested and reliable. A newer option getting good traction is Prometheus with its more modern-looking Grafana-based visualizations. For a greenfield environment, Prometheus is a good choice.
Nagios/Prometheus work in a poll model, where servers provide data on a port and a centralized service routinely collects the data and makes it available. DataDog is an example of an alternate model where the data is streamed from the server itself with an agent to a centralized location. For alerting operational staff when there are critical alarms, I have always found PagerDuty to be a good choice, but OpsGenie or VictorOps will provide similar functionality. For external cloud based availability monitoring, ThousandEyes is a good choice, and something like Pingdom will get you basic external coverage for a low-cost entry point.
Concluding Thoughts
The two posts in this series have only scratched the surface of crypto DevOps practices. Other areas that may be the subject of future posts include networking and vpcs, blue / green deployments, docker vs vms, load balancing and failover strategies, ids / ips, storage management for blockchain nodes, crypto key management strategies, personal security best practices for DevOps staff, and other topics. Employing good practices across all of these areas are an important part of what it takes to provide secure and reliable crypto infrastructure.
Originally published at https://www.purestake.com on August 5, 2019.