Designing and scaling a PetaByte Scale System — Part 1

Tarun Kumar · Airtel Digital · Jul 16, 2020 · 4 min read

Our single application required storing 6–7 petabytes of compressed data annually, and we needed to retain 6 years of data.

Our initial bandwidth requirement was 48 GB/sec, which we reduced to 15–18 GB/sec by using better compression techniques and metadata.
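As a back-of-the-envelope check of these figures (a minimal sketch; the annual volume, retention period, and bandwidth numbers are the ones quoted above):

```python
# Back-of-the-envelope sizing from the figures above: 6-7 PB of compressed
# data per year, retained for 6 years, with bandwidth reduced from
# 48 GB/sec to 15-18 GB/sec.

annual_compressed_pb = 7          # upper end of the 6-7 PB/year range
retention_years = 6
total_storage_pb = annual_compressed_pb * retention_years
print(f"Total compressed storage to retain: ~{total_storage_pb} PB")  # ~42 PB

raw_bandwidth_gbps = 48           # initial requirement
optimized_bandwidth_gbps = 18     # after better compression and metadata
reduction = raw_bandwidth_gbps / optimized_bandwidth_gbps
print(f"Bandwidth: {raw_bandwidth_gbps} -> {optimized_bandwidth_gbps} GB/sec "
      f"(~{reduction:.1f}x reduction)")
```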

The first question that comes to mind when designing a petabyte-scale system is how to design the storage layer. There are two common solutions: the Hadoop Distributed File System (HDFS) and object storage.

Hadoop File System vs. Object Storage

Single point of failure: The biggest reason for choosing object storage was that Hadoop has a single point of failure. The NameNode keeps all metadata in memory, and as the file count grows, pressure on the NameNode increases. As a rule of thumb, the NameNode requires about 1 GB of memory per million files, so 500 million files would need roughly 500 GB of memory, and there is a limit to how far memory can scale in a single system. There is a newer concept, NameNode federation, which segregates the metadata into multiple namespaces, but we expected more than 500 million objects in our system within a single namespace.
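That rule of thumb translates into a simple capacity check (a minimal sketch; the 1 GB per million files figure is the rule of thumb quoted above, and the file counts are illustrative):

```python
# NameNode memory estimate using the "1 GB per million files" rule of thumb.

def namenode_memory_gb(file_count: int, gb_per_million_files: float = 1.0) -> float:
    """Estimate the NameNode heap needed to hold metadata for file_count files."""
    return (file_count / 1_000_000) * gb_per_million_files

for files in (100_000_000, 500_000_000, 1_000_000_000):
    print(f"{files:>13,} files -> ~{namenode_memory_gb(files):,.0f} GB of NameNode memory")

# 500 million files -> ~500 GB of heap on a single NameNode, which is why a
# single-namespace HDFS deployment did not fit our growth projections.
```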

Scalability: HDFS does not allow independent scaling. In Hadoop, compute power and storage capacity need to scale in sync; you cannot scale storage and compute independently. With object storage, you can increase storage capacity as and when you need it.

Cost: Object storage is generally more cost-effective, at a fraction of the cost of a Hadoop cluster.

Elasticity: Another big benefit of separating the storage and compute infrastructure is that you can scale compute resources on demand, up to the limit supported by the object storage system, and free up compute for other tasks when it is not in use.

We used Object Storage for our application.
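For illustration, here is a minimal sketch of reading and writing objects through an S3-compatible API with boto3. The assumption that the store exposes an S3-compatible endpoint, along with the endpoint URL, credentials, bucket, key, and file names, are placeholders rather than details of our deployment:

```python
import boto3

# Client pointed at the object store's S3-compatible endpoint (placeholder values).
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Write a compressed blob as an object (bucket, key, and local file are hypothetical).
s3.put_object(
    Bucket="events",
    Key="2020/07/16/part-00000.parquet.gz",
    Body=open("part-00000.parquet.gz", "rb"),
)

# Read it back.
obj = s3.get_object(Bucket="events", Key="2020/07/16/part-00000.parquet.gz")
payload = obj["Body"].read()
```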

Our object storage solution can scale up to 72 PB of storage across 9 racks.

Our configuration at the beginning, which will scale over time as space fills up:

Figure: single-site layout

Figure: routing across multiple sites

1. A 3-site system.

2. Each site/rack has 3 system nodes and 6 storage nodes.

3. Connection threshold: the maximum number of connections per system node is 400, for a total of 1,200 per rack, or 3,600 across the 3-rack system.

4. Network bandwidth: each rack delivers up to 6 GB/sec of throughput, or 18 GB/sec across the 3-rack system.

5. Scalability: our object storage solution supports scale-out up to nine racks. Scale-up is offered through its flexible capacity feature, which lets you start with as little as one populated sled column (one sled in each Storage Enclosure Basic) and increase the number of populated columns over time; in other words, you can replace all the empty sleds in an entire column with populated sleds. Overall, it can scale up to 72 PB of raw storage.

6. Data replication: geo-spreading (replication) is an object storage capability that spreads an object across three different locations in such a way that if one location is unavailable, the object is still available at the other two. This is unlike traditional architectures, where 2-site replication means the data is copied to both locations to cope with system unavailability. A geo-spread system is composed of multiple racks (instances) managed as a single namespace or failure domain. The system breaks an object into chunks of data and parity and spreads those across the three locations, with enough parity to rebuild the object even if one location (and the third of the chunks on it) is unavailable. This is done with strong consistency, so there is no chance of reading stale data. In a 3-geo configuration, if one site becomes unavailable, the data is still available at the two remaining sites and new objects are written to them. When the formerly unavailable third site comes back, data is again spread across all three sites.
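To make the availability math behind geo-spreading concrete, here is a toy placement sketch. The 8 data + 4 parity split and the round-robin placement are illustrative assumptions, not the product's actual erasure-coding scheme:

```python
# Toy illustration of geo-spread placement: an object is split into data and
# parity chunks distributed across 3 sites, with enough parity that losing
# any single site still leaves enough chunks to rebuild the object.
# The 8+4 split below is illustrative only.

DATA_CHUNKS = 8
PARITY_CHUNKS = 4
SITES = ["site-A", "site-B", "site-C"]

chunks = [f"chunk-{i}" for i in range(DATA_CHUNKS + PARITY_CHUNKS)]

# Round-robin placement: each site ends up holding 4 of the 12 chunks.
placement = {site: [] for site in SITES}
for i, chunk in enumerate(chunks):
    placement[SITES[i % len(SITES)]].append(chunk)

for site, held in placement.items():
    print(site, held)

# Losing one site leaves 8 chunks -- exactly the number of data chunks needed
# to rebuild the object, so reads keep working with strong consistency.
surviving = sum(len(held) for site, held in placement.items() if site != "site-C")
assert surviving >= DATA_CHUNKS
print(f"Chunks surviving loss of one site: {surviving} (need {DATA_CHUNKS})")
```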

When choosing object storage as the storage layer, look very closely at the network bandwidth requirement and the parallel connection count, as they can become bottlenecks in your system down the line. You can scale the compute infrastructure only as far as the underlying storage layer supports.
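As a sketch of how a client can respect that connection threshold, the snippet below caps in-flight uploads with a bounded thread pool. The 400-connection limit comes from the spec above; `upload_one` is a hypothetical placeholder for the actual PUT call:

```python
# Cap client-side concurrency so a single client never exceeds the store's
# per-node connection threshold (400 connections per system node).
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS_PER_NODE = 400

def upload_one(key: str) -> None:
    # Placeholder: in a real client this would be an HTTP PUT to the store.
    pass

keys = [f"2020/07/16/part-{i:05d}.gz" for i in range(10_000)]

# The pool size bounds the number of in-flight requests (and thus open
# connections) from this client; split the budget across clients if many
# writers talk to the same system node.
with ThreadPoolExecutor(max_workers=MAX_CONNECTIONS_PER_NODE) as pool:
    list(pool.map(upload_one, keys))
```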

Part 2: https://medium.com/airtelxlabs/designing-and-scaling-a-petabyte-scale-system-part-2-365c28e9aecb
