The Celo Validator Community: Security Audits and Lessons Learned
cLabs, the team working on Celo, created a validator challenge before the launch of the Celo platform to give validators operational experience and to establish security best practices for the Celo community. The Great Celo Stake Off (“TGCSO”) ran on Baklava, Celo’s incentivized testnet. TGCSO operated in phases, with each phase focusing on a different part of the protocol or infrastructure.
cLabs worked with MultiSig, a blockchain security consulting company, to audit the validator nodes on the Celo testnet. We believe that this was the first Proof of Stake (PoS) network to conduct a comprehensive audit to encourage validators to focus on their security practices. MultiSig provided this support during Phase 3, “Become a Master Validator,” of TGCSO. MultiSig and cLabs performed audits between February 10, 2020 and February 18, 2020.
This blog post gives a high-level overview of the scope and results of the audits. We will also delve deeper into some of the findings.
Results and Impact
We audited a total of 56 validators. 36 validators received an audit score of 80 or more and received the "Master Validator" badge. The highest score any team received was 130 and the lowest was 20.
We hope that this audit improved the overall security posture of the Celo network. In addition, the audit contributed to the security of the wider blockchain ecosystem, since many validators were part of other blockchain networks.
Several validators complimented the audit team for helping them take action to secure their infrastructure. The validators acknowledged that their security posture and knowhow were strengthened after the audit.
We realized that Celo had a robust and thriving ecosystem. We were overwhelmed with the friendliness of the validator community even though these validators were semi-anonymous and situated around the globe.
Scope And Testing Methodology
On the Celo network, validators were running three nodes to support the PoS infrastructure: Validator, Proxy, and Attestation Service. In addition, validators were expected to run a standby validator for redundancy. We used industry-standard security best practices to review the configurations on these nodes.
Each audit included three steps that were performed during a screen share session between the auditor and the validator.
Automated data collection: An automated scan was performed using a custom audit script developed by cLabs. Note: The script did not perform any intrusion or penetration-testing activity. It collected only configuration data and no sensitive data such as passwords, secrets, or user files. In addition, the validators had the option to keep the data locally and not send it back to us. For those who agreed, the audit data was sent over a secure connection.
Manual testing: The validators were asked to perform some manual tests during the audit screen share sessions.
Survey questions: Survey questions regarding infrastructure were asked as a part of this audit.
We used a walkthrough document that helped us conduct these three steps and can be found here.
We also created an internal playbook that contained commands, tools, and other logistical steps for the audits. We wanted to ensure that the script files and the audit output files were not tampered with, so we used the sha256sum utility to detect tampering. We also wanted to ensure that the tool itself was not tampered with. To check this, we asked each validator to compute the sha256sum of a phrase of our choosing, unique to that validator, and compared their result with the result on our end.
The following provides an example:
> echo ChimingKittens>ctest.txt; sha256sum ctest.txt ; rm ctest.txt
> sha256sum baklava-opsec-audit.tar.gz
> sha256sum P-DEEP-100-localhost.localdomain.tar.gz
The output files were named in the format <[V|P|A]-ValidatorID-hostname>.tar.gz, where V stands for Validator, P for Proxy, and A for Attestor.
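As a minimal sketch of how the one-letter prefix in these file names could be mapped back to a node role, consider the helper below. The function name audit_role is hypothetical and is not part of the cLabs audit script. It classifies only the prefix and does not try to split the validator ID from the hostname, because both may contain hyphens.

```shell
#!/bin/sh
# Hypothetical helper (not from the cLabs audit script): map the prefix of
# an audit output archive to its node role. Prints the role and returns 0,
# or returns 1 when the name does not look like
# <[V|P|A]-ValidatorID-hostname>.tar.gz.
audit_role() {
    audit_name=${1##*/}              # strip any leading directory path
    case "$audit_name" in
        V-?*-?*.tar.gz) echo "Validator" ;;
        P-?*-?*.tar.gz) echo "Proxy" ;;
        A-?*-?*.tar.gz) echo "Attestor" ;;
        *) return 1 ;;
    esac
}

# Example run on the two archive names shown in the post.
for f in P-DEEP-100-localhost.localdomain.tar.gz baklava-opsec-audit.tar.gz; do
    if role=$(audit_role "$f"); then
        echo "$f: $role node"
    else
        echo "$f: not an audit output archive"
    fi
done
```

The second name is the audit script bundle itself, so it is reported as not matching the output naming pattern.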
Data Collection and Analysis
We analyzed all the collected output files using AWS. Once we downloaded the files, we performed the sha256sum test to ensure the files we received were not tampered with.
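The comparison step can be sketched as follows, assuming the GNU sha256sum utility used elsewhere in this post. The function name verify_digest is hypothetical, and the digest in the example is the standard SHA-256 test vector for the string "abc", not a value from the audits.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's SHA-256 digest with the digest
# reported by the sender.
# Usage: verify_digest <file> <expected-hex-digest>
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_digest() {
    [ -r "$1" ] || return 2
    # Read from stdin so the output is "<digest>  -" whatever the file name.
    vd_actual=$(sha256sum < "$1" | cut -d ' ' -f 1)
    # Accept upper-case hex from the sender by normalising to lower case.
    vd_expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$vd_actual" = "$vd_expected" ]
}

# Example: a temporary file containing "abc" (no trailing newline).
tmp=$(mktemp)
printf 'abc' > "$tmp"
if verify_digest "$tmp" ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad; then
    echo "digest matches"
else
    echo "digest MISMATCH: file may have been altered"
fi
rm -f "$tmp"
```

A matching digest only shows that the file received is the file that was hashed. It still relies on the expected digest having been obtained through a trusted channel, such as the screen share session.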
We then used a scoring checklist template to score each node that the validators were running.
In addition, we peer reviewed each other's scores to ensure no mistakes were made. Once the results were announced, we gave the validators 48 hours to challenge their scores, and we updated scores where we identified mistakes or supporting documentation that had been missed during the audits.
Findings and Recommendations
Key Security Strengths
Secure Datacenter Setups: Some teams took their security seriously and set up a high-tech facility for their infrastructure. This included placing physical servers in locked areas that only they could access using PINs, keys, keycards, and biometric data. They had redundancies for power, cooling, and network connections. Some facilities also provided validators with live access to CCTV and motion-controlled alerting systems.
High Availability: The validators used different setups to maintain high availability of their nodes. One of the validators used the VMware high availability utility instead of a master/slave node as their standby failover system. This service provided real-time failover in case of disruptions.
A Passionate, Cooperative, and Dedicated Community: We were overwhelmed with the friendliness of the validator community even though these validators were semi-anonymous and situated around the globe. We observed the community help each other with guidance and troubleshooting. Many were dedicated in following cLabs security guidelines to support the Celo ecosystem.
In addition, we observed many teams without professional technical backgrounds learn technical skills to create a secure professional grade infrastructure.
Many validators went above and beyond to demonstrate that their infrastructure was sufficiently hardened against potential security risks. We witnessed an overwhelming sense of pride from validators as they demonstrated and exhibited their infrastructure.
Scores Vs Security Posture: Some of the validators were more concerned with raising their Lynis audit script scores than with improving their security posture, as they were interested in getting the bonus points. There were only a few cases of such behavior, and we believe that the Celo community is moving in the right direction in its security awareness and implementation.
We recommend that the Celo validator community focus on security controls and posture. While the audit was designed to be inclusive, many validators had unique setups. Such validators should consider additional security measures and controls to defend themselves.
Technical Knowhow: Some validators did not have the technical knowledge and expertise required to support a PoS ecosystem such as Celo. We recommend that validators get professional support to help secure their infrastructure and assets.
Security Controls and Lessons Learned
The audit involved assessing the validators based on the following categories of tests:
- Physical Security
- Endpoint Configuration Management
- User Management
- Network Security
- Container Security
- Key Management
- Redundancy and Availability
Please refer to the scoring checklist template to see the details of configurations we tested for.
The following section provides brief notes and observations on some of the categories.
Physical Security:
- Diverse configurations: We observed diverse configurations in infrastructure, with validators using traditional co-located data centers. One validator had built a highly secure data center in their basement, with locked cages, motion detection cameras, and an alerting system.
- Missing data center CCTV and alerting: Several teams lacked live CCTV access and an alerting system to notify them when their servers were accessed. This would be an important control for a secure validator setup.
Endpoint Configuration Management:
- Cloud Vs Datacenter: Our configuration checks were the same for nodes in the cloud and in the data center. Some validators actually had to rebuild their nodes (in the cloud) because they tried to change the boot loader (GRUB) configuration. In the next iteration of this audit, we would like to provide different guidance to each of these groups and check different controls based on whether they are running in a data center or in the cloud.
- PAM modules: One of our checklist requirements was to have PAM (Pluggable Authentication Modules) configured. Some teams had decided against using PAM as they were using public key based authentication. However, we still recommend using PAM configurations to protect users and authentication mechanisms, as these can help protect against other kinds of attacks as well.
Network Security:
- Physical Firewall: We observed a few teams using a dedicated physical firewall to protect their infrastructure.
- Firewall Misconfigurations: One of the recommendations was to block inbound internet connections to the RPC ports (8545/8546). Exposing these ports was a common vulnerability in Ethereum that allowed external parties to steal funds. During the audits, we realized that some validators had actually exposed the RPC port (8545) to the public instead of closing it. We immediately raised a red flag and notified the community.
- DDoS Services: The most common DDoS protection service we found was Cloudflare, although most of the cloud providers offered some sort of basic DDoS protection service. We discovered some controls together with the validators during the audits. For instance, Google Cloud Platform had a DDoS protection service called Google Cloud Armor. We had fun interacting with the validators and shared security best practices with them during the audits.
- Fail2Ban: The requirement to use Fail2Ban was a contentious issue. Some teams had secured their SSH logins using custom methods such as using a non-standard port and restricting the number of SSH connections. While we agreed that such methods were secure, the three of us could not test every unique configuration, so we opted for a standard check using Fail2Ban. This also helped non-security-oriented teams learn about protecting their SSH ports.
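To illustrate the RPC port finding above, the sketch below filters the output of `ss -tln` for listeners on ports 8545 and 8546 that are bound to a non-loopback address. This is an assumption-laden local heuristic, not the check performed by the cLabs audit script: the function name exposed_rpc_listeners is hypothetical, it assumes the column layout printed by `ss -tln` (local address in the fourth field), and a listener on 0.0.0.0 may still be blocked by a firewall further upstream.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -tln`-style output on stdin and print the
# local addresses of RPC listeners (8545/8546) not bound to loopback.
exposed_rpc_listeners() {
    awk '
        $1 == "LISTEN" {
            addr = $4
            port = addr
            sub(/^.*:/, "", port)    # keep only the text after the last colon
            host = substr(addr, 1, length(addr) - length(port) - 1)
            if ((port == "8545" || port == "8546") &&
                host !~ /^127\./ && host != "[::1]" && host != "localhost")
                print addr
        }'
}

# Example with sample input; on a real node one would run:
#   ss -tln | exposed_rpc_listeners
exposed_rpc_listeners <<'EOF'
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     127.0.0.1:8545      0.0.0.0:*
LISTEN  0       128     0.0.0.0:8545        0.0.0.0:*
LISTEN  0       128     [::]:8546           [::]:*
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
EOF
```

Any address printed is a candidate for review. The safer default is to bind the RPC interface to loopback and to block inbound connections to these ports at the firewall.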
Redundancy and Availability:
- Standby Validator: Validators advised us that many of them had set up their failover infrastructure in different geographic locations. We noticed some teams distribute their nodes across different service providers as well.
- System Monitoring and Alerting: Some validators had system monitoring and alerting turned on for the availability and functioning of their nodes. Many did not have monitoring and alerting for their security tools such as Fail2Ban and IDS (Snort and OSSEC). While we were aware that these systems generated false positives and created a lot of noise by default, we also wanted teams to learn these systems and mature their processes to create a robust detection and prevention system.
This audit was the first of its kind for PoS networks. Overall, we feel that the community was sophisticated enough to create a highly secure infrastructure during the course of this challenge.
The community was very welcoming and receptive to this audit exercise. We are extremely grateful to have been part of such an exercise and hope to have contributed to a higher level of security not only for Celo but for the entire blockchain ecosystem.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Author: Deepak Nuli
Contributors: Jon Tomczak, James Nettesheim
Editors: Claire Belmont, Tim Moreton, Brynly Llyr
About the auditors:
Deepak Nuli: Deepak is the founder of MultiSig Consulting, a security consulting and research company in the cryptocurrency industry. He has been involved in the field of information security for over 15 years with expertise in incident response, security operations, and social engineering.
Jonathan Tomczak: Jonathan's background stems from game engine programming and design. With most of his programming foundation in C and C++, Jonathan has taken his knowledge and applied it to find solutions to aid the security world.
James Nettesheim: James is a Security Advisor to cLabs and has spent the last 15 plus years securing infrastructure for critical networks throughout the world and responding to security incidents and intrusions of all varieties. James enjoys contributing to the security community at large and enjoys learning and sharing knowledge at conferences, meetups and other gatherings.