Thanos Development Story — 1

Thanos Infrastructure Optimization

Steven Lee
Tokamak Network
4 min read · Aug 21, 2024


The TOP project team plans to launch the Thanos mainnet in December and unveiled the new test network, Thanos-Sepolia, to the world on July 1st. Thanos solves many of the technical challenges left by the existing Titan network. Briefly, the improvements are as follows.

Native token support

  • Thanos supports TON (ERC-20) as the network’s gas token.

Predeployed UniswapV3, USDC Bridge support

  • Thanos predeploys the most basic contracts so they are available from the start.

Optimized SDK support

  • Thanos provides an SDK that helps users minimize the fees they pay for deposits and withdrawals.

Blob Batch support

  • Thanos can post batches as blobs to reduce L1 data fees, and this can be adjusted flexibly to minimize gas costs depending on network usage (see the configuration sketch below).
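
As an illustration of the kind of knob involved, assuming Thanos uses the standard OP Stack op-batcher, the batch submitter’s data-availability setting could look roughly like this; the service name and surrounding configuration are hypothetical, and only the data-availability flag comes from the upstream batcher.

    # Hypothetical docker-compose fragment for the batch submitter; not the
    # actual Thanos deployment configuration.
    services:
      op-batcher:
        environment:
          # "blobs" posts batch data as EIP-4844 blobs instead of calldata.
          # Operators can switch this back to "calldata" when blob space is
          # expensive, which is how L1 data costs can be tuned to network usage.
          OP_BATCHER_DATA_AVAILABILITY_TYPE: blobs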

The Thanos development story is organized around the parts each team member was responsible for developing and is planned as a series of seven stories. Let’s start with Infrastructure Optimization.

Previous architecture

Before introducing the new and improved architecture, we first need to explain the previous one. The previous architecture is shown in the picture above. All applications run on Kubernetes, and the pods to the left of center are the Core Network Pods. Two main systems exist to monitor these Core Pods: Prometheus and Grafana for metrics, and Elasticsearch and Kibana for log collection. So what were the problems with the existing architecture?

  • EFS is used as the volume for Elasticsearch.
  • EFS is used as the volume for Prometheus.

If you are not sure why these two points are a problem, let me explain a little more. EFS is AWS’s NFS service, and NFS itself is the problem. By default, Elasticsearch and Prometheus do not support NFS. To be precise, it can be used, but it comes with many problems, the biggest of which is the slow speed of NFS. The logs and metrics of every service need to be stored, and NFS simply does not offer the speed required for this. Although AWS’s EFS is very fast for an NFS, it has a write throughput limit of roughly 1 Gbps (125 MB/s) in the Seoul region. Considering that SSDs average over 1 GB/s, that is a speed difference of about 8 times. In practice, it took more than 20 seconds to query Elasticsearch data (about 3,000 log entries) through Kibana.

New architecture

The project arrived at the architecture shown above after a review with AWS. Here are the points that changed.

  • Move Elasticsearch out of the cluster and run it on EC2 + EBS
  • Remove persistent storage from Prometheus

Let’s start with Elasticsearch. Query performance improved significantly by replacing EFS with EBS: the time it takes to query 3,000 log entries dropped to under 3 seconds. This is surprisingly low latency given that Elasticsearch runs on nothing more than a t3.medium instance with a gp3 EBS volume.

Prometheus originally stored one year of metric data in EFS. However, from operating the network we learned that the team currently has no use for historical metric data. Since alerting only requires real-time metrics, we decided to store metrics in the temporary storage of AWS Fargate (all pods run on Fargate).

Implementation details

Elasticsearch

The picture above shows the EC2 and DNS settings from the Terraform code that provisions Elasticsearch. The instance lives in the same VPC as the EKS cluster so it can communicate with Kibana, and a private domain is registered so it can be reached by name. We also attached a security group that only allows internal IPs within the VPC.
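
Since the screenshot does not reproduce here, the sketch below shows roughly what such Terraform can look like. It is not the actual Thanos code: the variable names, private domain, port, and sizes are hypothetical placeholders.

    # A minimal sketch of the kind of Terraform described above; names and
    # values are hypothetical placeholders, not the actual Thanos code.

    resource "aws_security_group" "elasticsearch" {
      name   = "elasticsearch-internal"
      vpc_id = var.eks_vpc_id # same VPC as the EKS cluster

      ingress {
        from_port   = 9200
        to_port     = 9200
        protocol    = "tcp"
        cidr_blocks = [var.vpc_cidr] # only internal IPs within the VPC
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }

    resource "aws_instance" "elasticsearch" {
      ami                    = var.ubuntu_ami_id
      instance_type          = "t3.medium"
      subnet_id              = var.private_subnet_id
      vpc_security_group_ids = [aws_security_group.elasticsearch.id]
      key_name               = var.ssh_key_name
    }

    # Separate gp3 data volume; it is mounted by the provisioning step shown below.
    resource "aws_ebs_volume" "elasticsearch_data" {
      availability_zone = aws_instance.elasticsearch.availability_zone
      type              = "gp3"
      size              = 100
    }

    resource "aws_volume_attachment" "elasticsearch_data" {
      device_name = "/dev/xvdf"
      volume_id   = aws_ebs_volume.elasticsearch_data.id
      instance_id = aws_instance.elasticsearch.id
    }

    # Private hosted zone so Kibana inside the cluster can reach Elasticsearch by name.
    resource "aws_route53_zone" "internal" {
      name = "thanos.internal" # hypothetical private domain

      vpc {
        vpc_id = var.eks_vpc_id
      }
    }

    resource "aws_route53_record" "elasticsearch" {
      zone_id = aws_route53_zone.internal.zone_id
      name    = "elasticsearch.thanos.internal"
      type    = "A"
      ttl     = 300
      records = [aws_instance.elasticsearch.private_ip]
    }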

The code above shows how Elasticsearch is installed using Terraform. Commands are sent over SSH to the newly created EC2 instance to install Docker, mount the EBS volume, and update the necessary permissions. Afterwards, Elasticsearch is started through docker-compose.
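
For readers who cannot see the screenshot, here is a rough sketch of that provisioning step, building on the hypothetical resources from the previous snippet; the exact packages, device name, and compose file in the real code will differ.

    # Rough sketch of the SSH provisioning step; assumes the hypothetical
    # resources defined in the previous snippet.
    resource "null_resource" "install_elasticsearch" {
      depends_on = [aws_volume_attachment.elasticsearch_data]

      connection {
        type        = "ssh"
        host        = aws_instance.elasticsearch.private_ip
        user        = "ubuntu"
        private_key = file(var.ssh_private_key_path)
      }

      # Copy a docker-compose.yml that defines the Elasticsearch container.
      provisioner "file" {
        source      = "${path.module}/files/docker-compose.yml"
        destination = "/home/ubuntu/docker-compose.yml"
      }

      provisioner "remote-exec" {
        inline = [
          # Install Docker and docker-compose from the Ubuntu repositories.
          "sudo apt-get update -y",
          "sudo apt-get install -y docker.io docker-compose",
          # Format (first provisioning only) and mount the attached EBS volume;
          # on Nitro instance types the device may appear as /dev/nvme1n1 instead.
          "sudo mkfs -t ext4 /dev/xvdf",
          "sudo mkdir -p /var/lib/elasticsearch",
          "sudo mount /dev/xvdf /var/lib/elasticsearch",
          # The Elasticsearch container runs as UID 1000; give it the data directory.
          "sudo chown -R 1000:1000 /var/lib/elasticsearch",
          # Start Elasticsearch in the background.
          "sudo docker-compose -f /home/ubuntu/docker-compose.yml up -d"
        ]
      }
    }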

Prometheus

The picture above shows the Prometheus Helm config. By setting retentionSize to 10GB, we prevent the pod from being evicted due to a lack of storage (Fargate temporary storage provides 25GB).
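
For illustration, the relevant values could look roughly like this, assuming the widely used kube-prometheus-stack chart; the chart and surrounding values actually used by the team may differ.

    # Hypothetical excerpt of the Helm values, assuming the kube-prometheus-stack
    # chart; everything except the idea of capping retentionSize is illustrative.
    prometheus:
      prometheusSpec:
        # Cap on-disk usage well below Fargate's ephemeral storage (25GB) so the
        # pod is never evicted for running out of storage.
        retentionSize: 10GB
        # No storageSpec / persistent volume is configured, so Prometheus writes
        # to the pod's ephemeral storage, which is enough for real-time alerting.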

Result

As a result of the infrastructure optimization, we improved usability while saving approximately 192 USD per month. Detailed results can be found on the page below.

https://tokamak.notion.site/AWS-costs-after-improving-storage-aac435ec7b9649b596515fd0c6a0840c

Conclusion

To develop the Thanos network, the project solved many infrastructure and protocol challenges. I hope that the technologies shared in future development stories will be of great help to readers. Thank you.
