What is Decentralized File Storage and Why It’s More Important Than You Think…

Edd Smith
Published in Aleph.im
10 min read · Oct 24, 2022

The rise of mobile applications, eCommerce, and the Internet of Things (IoT) has led to the creation of an astounding amount of data in every industry.

Currently, all of this data is processed and stored in large servers owned primarily by one of these three companies: Amazon, Google, or Microsoft. These centralized cloud services drive efficiency in operations through instant infrastructure availability and scalability.

The caveat is that centralized storage creates single points of failure, exposing users to network outages and other risks. So, in this article, we’ll look at these risks and explain how decentralized file storage systems work to mitigate them.

But first things first, let’s look at how file storage systems evolved through the years to better understand why the internet needs decentralized file storage systems.

The Evolution of File Storage Systems

The earliest data storage systems were stiff pieces of paper called punch cards. They used holes to represent the presence and absence of information and were initially developed to store data for the 1890 US Census.

However, their usage and popularity grew, which led to a demand for more storage. It was then that IBM came up with cards that had rectangular holes spaced tightly to accommodate more data. For almost four decades, this remained the most popular way to store, sort, and report data.

Later, in 1956, IBM created the first hard drive, which could store about 5 MB of data. It was later supplemented by floppy disks in the 1970s and compact discs in the 1980s, which reigned in the storage realm for quite a few years.

It wasn’t until the late 90s and early 2000s that we started seeing modern forms of portable storage, such as flash memory, capable of holding at least a few GB of data in a much smaller form factor.

In 2006, portable storage gave way to global online storage with the rise of cloud storage solutions. As a result, today we have the likes of Amazon Web Services (AWS), Google Cloud, and Microsoft Azure.

After that, there was no major breakthrough in the cloud storage space until the first half of the 2010s, when a computer scientist named Juan Benet started working on a decentralized storage system called the InterPlanetary File System (or IPFS).

The need for a decentralized file storage system was sparked in part by the inception of Bitcoin and the decentralization movement it started. But rising data privacy concerns, spurred by events like the Cambridge Analytica scandal, and the disappearance of websites due to competition between top web companies also increased demand for decentralized storage services.

Today, innovative projects like Aleph.im, Sia, Storj, and Arweave are strengthening the movement toward decentralized storage. They use trustless P2P software and cryptography to handle and store data instead of relying on trusted intermediaries like AWS, and they rely on encryption, rather than trust in an external entity, to keep data confidential.

Problems with Existing Cloud Storage Solutions

Centralized storage systems may seem like the most secure and efficient way to store data, as they provide instant infrastructure availability at scale. However, they have serious shortcomings that affect a company’s sustainability and put users at risk.

Let’s look at the six that have the biggest impact on you:

Censorship

Internet censorship refers to filtering and blocking content to prevent information from reaching users. Although one of its upsides is that it can help curb misinformation, the biggest downside is that it allows central entities to feed their audience only information that favors their reputation.

What does centralized storage have to do with this?

Most common websites are hosted on centralized clouds and accessed over URLs. This gives an authority two common ways to censor their content without involving the website’s authors: requesting deletion from the cloud provider and censoring the domain used by the website.

When loading a web page, web browsers must resolve the domain name (e.g., aleph.im) at the beginning of the page’s URL using a name server. Name servers are usually provided by the Internet Service Provider (ISP). ISPs must follow regulations put in place by the local authorities, including the censorship of domain names.

Monitoring

Governments are known to regularly monitor and track the activities of citizens in order to crack down on illegal activities.

As storage and automation become cheaper, the recording of users’ communications and behavior is becoming common, even when the users are not under investigation. The data collected can be used as a tool for profiling and targeting specific populations.

One of the primary reasons governments worldwide can access these large swathes of data is that much of it is not encrypted in transit.

This makes it easier for governments to intercept data, as revealed by former NSA contractor Edward Snowden, or to access it using spyware like Pegasus.

Data Breaches

According to a recent IBM report, 83% of organizations studied have had more than one data breach, and 45% of breaches occurred in the cloud.

One reason these breaches are so common is that centralized servers often don’t encrypt user data, making it easy for hackers to steal information. This leaves users prone to identity theft, fraud, and social engineering attacks.

For instance, Facebook’s 2021 data breach resulted in the data of 533 million users being published online on a hacking forum.

This translates to huge financial losses for all stakeholders involved if a breach occurs (the average loss caused by a data breach increased from $3.85 million in 2020 to $4.35 million in 2022, according to the IBM report above).

Server Downtime

Although cloud-based services are designed to provide high levels of service availability at a large scale, they aren’t fail-safe, which inevitably leads to downtimes.

However, the problem isn’t downtime per se. It’s that the cloud storage sector is highly centralized, and the smallest mistake can have devastating consequences.

Take the Amazon S3 fiasco, Google’s global outage, or the OVHcloud data center fire, for instance.

The Amazon event took down many websites, including Quora and Trello, for 4 hours because a team member entered a command incorrectly during a debugging operation.

In another incident, Google suffered an outage due to a lack of storage space in authentication tools.

The OVHcloud data center fire, on the other hand, took down at least a million websites, including those of government agencies and banks. This accident rendered many services inaccessible and created financial losses for businesses.

Rising costs

The global cloud computing market has grown rapidly in terms of storage and costs in the past few years.

According to an Andreessen Horowitz study, some companies exceed their committed cloud spend by at least 200% due to rising storage costs.

The global economic crisis triggered by the 2022 Russian invasion of Ukraine partially contributed to the price spike. (Google announced a price increase in March after the outbreak of the war.)

But a majority of cloud service providers also engage in predatory pricing practices: they employ below-cost pricing to encourage you to use their services, then raise prices substantially after a certain time.

If you then want to switch to another service provider, you have to pay an exorbitant egress cost to move data out of their network or repatriate it to your on-premises environment.

This becomes a real problem for a company’s sustainability as the costs can quickly outpace the profitability.

Lack of data ownership

Centralized service providers have an infamous track record of profiting from user data.

Amazon, for example, has been known to scoop up data from its sellers and launch competing products. Although it stated otherwise in its written statements, interviews with more than 20 employees indicated the above to be true.

In one instance, Amazon employees looked at data on a best-selling car trunk organizer to create and launch a similar product for the company.

Now, with data being the new oil, central entities in control of data can monopolize their business sectors. The ones who lose out are business owners, because they run the risk of being overshadowed and shut down by these giants.

Another worrisome aspect of cloud service providers is that they may de-platform you without prior notice or refuse to renew your contract for arbitrary reasons.

For instance, AWS is said to have de-platformed WikiLeaks in 2010 after U.S. Congressional staffers started asking the company about its relationship with the platform.

But in its official statement, the company denied the claim and argued that the content displayed could put innocent people at risk of being persecuted.

Either way, it only goes to show the immense power that cloud service providers have and how they can use their terms of service to de-platform users.

Decentralized File Storage: The Fix That Can’t Wait

Decentralized file storage is to the cloud what decentralized finance is to traditional banking: a major disruptor that removes the need for trusted intermediaries to store data. It’s a new-age, blockchain-powered storage system that allows users to archive, retrieve, and maintain their own content in a decentralized and distributed fashion.

Let’s look at how these decentralized file storage systems work and why they can prove better than their centralized counterparts.

How Decentralized File Storage Works

The two existing solutions that help store data in a decentralized manner are blockchains and blockchain-based storage systems.

The former are good at securely storing small bits of data such as transaction records, but they are not efficient for larger files. For instance, it would cost about $20 million to store 1GB of data on Ethereum. So, for the purpose of this article, we won’t be focusing on blockchains.
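To get a feel for where a figure of that size comes from, here is a rough back-of-the-envelope sketch. It relies on the fact that writing a new 32-byte storage slot on Ethereum costs roughly 20,000 gas. The gas price (20 gwei) and the ETH price ($1,500) are assumptions chosen purely for illustration, not figures from this article, and the real cost moves with both.

```python
# Rough estimate of the cost of writing 1 GB into Ethereum contract storage.
# ASSUMPTIONS (illustrative only): gas price and ETH price below.

BYTES_PER_SLOT = 32           # one Ethereum storage slot holds 32 bytes
GAS_PER_NEW_SLOT = 20_000     # approximate gas to write a new non-zero slot
GAS_PRICE_GWEI = 20           # assumed gas price
ETH_PRICE_USD = 1_500         # assumed ETH price
GWEI_PER_ETH = 1_000_000_000


def storage_cost_usd(num_bytes: int) -> float:
    """Estimate the USD cost of storing num_bytes in contract storage."""
    slots = -(-num_bytes // BYTES_PER_SLOT)  # ceiling division
    total_gas = slots * GAS_PER_NEW_SLOT
    cost_eth = total_gas * GAS_PRICE_GWEI / GWEI_PER_ETH
    return cost_eth * ETH_PRICE_USD


one_gb = 2**30
print(f"Estimated cost for 1 GB: ${storage_cost_usd(one_gb):,.0f}")
```

Under these assumptions the estimate lands at roughly $20 million, and it scales linearly with both the gas price and the ETH price.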

Instead, we’ll look at how blockchain-based systems like Aleph.im store files. The working principle is simple and can be explained in four steps:

  1. The developer monetarily incentivizes storage resource nodes (computers) to join their channel so they can store their data. Think of a channel as a private chat room. When a node joins a channel, it can see the storage history, the latest file version, and the edits previously made to it.
  2. The nodes then upload the data on Aleph.im and prove that they’ve stored a full copy of the data. To ensure that the storage nodes are doing their job, core channel nodes step in and run quality-control checks on a random sample of nodes. Core channel nodes are the validators of the network, and their owners have a stake in the system: to be regarded as a core channel node, the node owner should stake 200,000 ALEPH and entrust the node with at least 500,000 ALEPH. If the storage nodes try to play against the system, the core channel nodes slash their availability score, the score on which the platform bases its rewards for core channel nodes and storage nodes. This way, rogue nodes eventually get left out of the system.
  3. Next, the file is split into chunks, encrypted, and stored across multiple storage nodes to maximize fault tolerance.
  4. If a user or developer wants to retrieve their file, they download it and, if it was encrypted, decrypt it.
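The random quality-control sampling in step 2 can be pictured with a small sketch. This is not Aleph.im’s actual protocol: the `StorageNode` class, the `spot_check` function, the nonce-based challenge, and the halving penalty are all hypothetical, chosen only to show how a checker might catch a node that claims to store data but doesn’t. For simplicity, the checker here holds its own copy of the chunk.

```python
import hashlib
import os
import random


class StorageNode:
    """Hypothetical storage node holding chunks keyed by their SHA-256 hash."""

    def __init__(self, name, honest=True):
        self.name = name
        self.honest = honest
        self.chunks = {}
        self.score = 1.0  # hypothetical availability score

    def store(self, chunk):
        chunk_id = hashlib.sha256(chunk).hexdigest()
        # A dishonest node claims to store the chunk but discards it.
        if self.honest:
            self.chunks[chunk_id] = chunk
        return chunk_id

    def respond(self, chunk_id, nonce):
        """Answer a challenge with SHA-256(nonce + chunk), or None if missing."""
        chunk = self.chunks.get(chunk_id)
        if chunk is None:
            return None
        return hashlib.sha256(nonce + chunk).hexdigest()


def spot_check(nodes, chunk, rng, sample_size=2, penalty=0.5):
    """Challenge a random sample of nodes and cut the score of any that fail."""
    chunk_id = hashlib.sha256(chunk).hexdigest()
    for node in rng.sample(nodes, sample_size):
        nonce = os.urandom(16)  # fresh nonce so answers cannot be precomputed
        expected = hashlib.sha256(nonce + chunk).hexdigest()
        if node.respond(chunk_id, nonce) != expected:
            node.score *= penalty


# Usage: two honest nodes and one that silently drops data.
chunk = b"example file contents"
nodes = [StorageNode("a"), StorageNode("b"), StorageNode("c", honest=False)]
for node in nodes:
    node.store(chunk)

rng = random.Random(0)
for _ in range(20):
    spot_check(nodes, chunk, rng)

for node in nodes:
    print(node.name, node.score)
```

Honest nodes keep their score no matter how often they are sampled, while a node that fails challenges sees its score shrink each time it is picked.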

Why are Decentralized File Storage Systems Better?

To understand exactly how decentralized file storage solutions present an advantage over their centralized counterparts, let’s start by looking at how data is accessed in both systems.

Content-centric addressing

When you visit a website or access any sort of content like an image, PDF, or video over the internet today, chances are you’re doing it through a Uniform Resource Locator (URL), which points to the location the content is served from.

This location-centric way of accessing data seems very convenient, but it’s also very fragile. For starters, there’s no trivial way to verify you’re getting the right data.

It’s entirely possible that you end up on a malicious website due to a typosquatting or bait-and-switch attack, because location-centric addressing means whoever controls a specific location gets to decide what users see when they access that location through the link.

Another disadvantage is that your content could be lost forever due to broken redirects because you can access content only by identifying where it is.

But when it comes to decentralized storage, you access content by identifying what it is. This is made possible by a new form of linking called content identifiers (CIDs), which look something like this:

ipfs://bafybeidoodypolrlzufnng5swfpotyhu7cdjh5vs6sc6jbuz47lprw6wfi

CIDs are labels that point to specific data instead of its location. They’re based on a unique cryptographic hash derived from the content of the data. That means the hash is directly linked to the content and not to the location where the content is stored. If the content changes, the cryptographic hash also changes, so you don’t have to worry about ending up on a malicious website.

Conversely, it is possible to fetch the same content from different locations with guarantees of its integrity. This can help circumvent censorship systems.
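The idea can be sketched in a few lines. This is a simplified illustration, not a real CID implementation: actual IPFS CIDs also encode the hash function and data format, whereas here the identifier is just the hex SHA-256 digest of the content. The `content_id` and `fetch_verified` functions and the two in-memory "hosts" are hypothetical.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Simplified content identifier: the hex SHA-256 digest of the data."""
    return hashlib.sha256(data).hexdigest()


def fetch_verified(cid: str, locations: list[dict[str, bytes]]) -> bytes:
    """Return the first copy whose hash matches the requested identifier."""
    for store in locations:
        data = store.get(cid)
        if data is not None and content_id(data) == cid:
            return data
    raise LookupError(f"no valid copy found for {cid}")


original = b"hello, decentralized web"
cid = content_id(original)

# Two hypothetical locations: one serves tampered bytes, one the real file.
tampered_host = {cid: b"hello, malicious web"}
honest_host = {cid: original}

# The tampered copy is rejected because its hash does not match the identifier.
assert fetch_verified(cid, [tampered_host, honest_host]) == original
```

Because the identifier is derived from the bytes themselves, the reader does not need to trust whichever location happens to serve the file.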

Data Redundancy

To ensure data availability and integrity, the storage nodes will add a level of redundancy to each data chunk. This will enable the system to recreate the file even if some nodes are unreliable or go offline.

Some decentralized storage systems like Storj use erasure coding to achieve this redundancy. On Aleph.im, data is split into chunks and replicated on multiple nodes. This is less computationally intensive and allows use cases like Compute Over Data.
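As a minimal sketch of the replication approach (not Aleph.im’s implementation), the snippet below splits a file into chunks, places each chunk on two nodes, and rebuilds the file after one node goes offline. The chunk size, the round-robin placement, the replication factor, and the use of plain dictionaries as stand-ins for nodes are all assumptions made for illustration.

```python
import hashlib


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def replicate(chunks, nodes, copies):
    """Place each chunk on `copies` distinct nodes, round-robin.

    Returns a manifest: the ordered list of chunk hashes.
    """
    manifest = []
    for index, chunk in enumerate(chunks):
        chunk_hash = hashlib.sha256(chunk).hexdigest()
        manifest.append(chunk_hash)
        for k in range(copies):
            nodes[(index + k) % len(nodes)][chunk_hash] = chunk
    return manifest


def reassemble(manifest, online_nodes):
    """Rebuild the file from whichever online node holds a valid copy."""
    parts = []
    for chunk_hash in manifest:
        for node in online_nodes:
            chunk = node.get(chunk_hash)
            if chunk is not None and hashlib.sha256(chunk).hexdigest() == chunk_hash:
                parts.append(chunk)
                break
        else:
            raise LookupError(f"chunk {chunk_hash[:8]} is unavailable")
    return b"".join(parts)


data = b"decentralized storage keeps several copies of every chunk" * 10
nodes = [{} for _ in range(4)]  # each dict stands in for one storage node
manifest = replicate(split_into_chunks(data, 64), nodes, copies=2)

# Node 1 goes offline; every chunk still has a copy on another node.
assert reassemble(manifest, nodes[:1] + nodes[2:]) == data
```

With two copies per chunk, the file survives the loss of any single node. Losing both nodes that hold the same chunk makes that chunk unavailable, which is why the replication factor matters.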

But if you’re using a centralized storage provider, you need to pay an extra amount to back up your data, as there are no mechanisms in place to restore it if it’s lost due to a hack or other accidents.

Trustlessness

As we mentioned earlier, the biggest advantage of decentralized file storage systems is that they secure data without centralized intermediaries.

They achieve this through two related mechanisms: penalties (slashing, loss of network power and rewards, etc.) and cryptographic proofs of storage like proof of replication (PoRep) and proof of spacetime (PoSt).

But the drawback of PoSt, as seen in the case of Chia, is that it’s generated using plotting and farming, which are write-intensive functions and can wear out cheap SSDs within weeks.

Aleph.im, on the other hand, goes about the verification using core channel nodes (as explained in our blog here) to avoid this damage.

Decentralization is the Future of Data Storage

Blockchain-based storage systems are off to a promising start as they fit the decentralization ethos of Web3 and offer economical and technological advantages over centralized storage systems.

Aleph.im is one of the most prominent entities in this space, as it allows bulk content to be written to blockchains without compromising on low latency. Further, it offers a more efficient and less computationally intensive way to maintain data integrity and verify storage. We’re excited to be part of ushering in the new era of data storage with our suite of decentralized, cloud-based products!

Thanks and keep in touch

Join our live conversation on our Telegram Community Chat.
