Instructions for Saving Endangered Data: It’s time to get decentralized.

The Situation: Endangered Data Woven into a Precarious Web

precarious adj. Dangerously lacking in security or stability: a precarious posture; precarious footing on the ladder. adj. Subject to chance or unknown conditions.

Don’t let important data get destroyed.

The Problem: Identifying Content by its Location

When you use an http:// or https:// link to point to a webpage, image, spreadsheet, dataset, tweet, etc., you're identifying content by its location. The link is an identifier that points to a particular location on the web, which corresponds to a particular server, or set of servers. Whoever controls that location controls the content. That's how HTTP works: it's location-addressed. Even if a thousand people have downloaded copies of a file, meaning that the content exists in a thousand locations, an HTTP link points to a single location. This location-addressed approach forces us all to pretend that the data are in only one location. Whoever controls that location decides what content to return when people use that link. They also decide whether to return any content at all.

If my link is like a set of instructions for finding a book on a particular shelf, then I'm relying on whoever controls that location to:

  • Always be open, 24/7, in case someone wants to read the book.
  • Provide the book to everyone who seeks the book, whether it’s one person or hundreds of thousands of people.
  • Protect the integrity of the book by preventing anyone from tampering with it.
  • Never remove the book from its shelf. If they get rid of it, or even move it, my link is broken and nobody will be able to use my instructions to find the book.

Whoever controls that location also has the power to:

  • Dictate who is allowed to see the book.
  • Move the book without telling anyone.
  • Destroy the book.
  • Charge people money to access the book or force them to watch ads when they walk in the door.
  • Collect data about everyone who accesses my book, using that information however they want.
  • Replace the book with something else. They might not even put a book there. Since my instructions just describe a location, a malicious actor could replace the book with something dangerous, turning the location into a trap!

The Solution: Identify Information by its Fingerprint, not its Location

Files and all of the blocks within them have unique fingerprints called cryptographic hashes.
When looking up files with IPFS, you’re asking the network to find nodes that can return the content corresponding to that unique hash.
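
To make the fingerprint idea concrete, here is a minimal Python sketch. It uses a plain SHA-256 digest from the standard library as a stand-in. Real IPFS identifiers are built differently (files are split into blocks and the hash is wrapped in extra encoding), so the digests below are illustrative and are not valid IPFS hashes. The function name `fingerprint` and the sample data are made up for this example.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a hex SHA-256 digest of the content (a stand-in for an IPFS hash)."""
    return hashlib.sha256(content).hexdigest()


# The same bytes give the same fingerprint no matter where the copy lives.
copy_on_my_laptop = b"endangered climate dataset, 2017 edition"
copy_on_your_server = b"endangered climate dataset, 2017 edition"
tampered_copy = b"endangered climate dataset, 2017 edition (edited)"

print(fingerprint(copy_on_my_laptop) == fingerprint(copy_on_your_server))  # True
print(fingerprint(copy_on_my_laptop) == fingerprint(tampered_copy))        # False
```

The fingerprint depends only on the bytes, so it says nothing about where a copy is stored, and any change to the content produces a different fingerprint.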

How to Do It: Write Content onto IPFS and Publish the Hashes

IPFS is a content-addressed protocol designed as a replacement for HTTP. There are multiple free, open-source software implementations of the protocol. You can use that software to run an IPFS node, add your data to the IPFS network, or hold copies of data that other people have published.

  1. Install an IPFS node on a machine (laptop, desktop, server, etc.) that has internet access.
  2. Add the content to your IPFS node.
  3. Tell your peers the cryptographic hashes (aka fingerprints) for the content you added to IPFS.
  4. Let your peers replicate copies of the content onto their machines by “pinning” the hashes in their IPFS nodes.
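
The steps above can be sketched as a small Python wrapper around the `ipfs` command-line tool. This is a rough sketch under stated assumptions: it assumes the `ipfs` binary is installed, its repository has been initialized, and the node is running. The helper names (`add_command`, `pin_command`, `publish`) and the `./my-dataset` path are hypothetical. Running the block as written only prints the commands; it does not touch the network.

```python
import subprocess
from typing import List


def add_command(path: str) -> List[str]:
    """Step 2: command that adds a file or directory and prints only the root hash."""
    return ["ipfs", "add", "-r", "-Q", path]


def pin_command(content_hash: str) -> List[str]:
    """Step 4: command a peer runs to pin content once they know its hash."""
    return ["ipfs", "pin", "add", content_hash]


def run(command: List[str]) -> str:
    """Run a command and return its trimmed standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def publish(path: str) -> str:
    """Add content to the local node and return the hash to share (step 3)."""
    return run(add_command(path))


# Show the commands without executing them (the path is a placeholder).
print(" ".join(add_command("./my-dataset")))
print(" ".join(pin_command("<hash-from-step-2>")))
```

On a machine with a running node, calling `publish("./my-dataset")` would return the hash to hand to your peers, and each peer would then run the command built by `pin_command` with that hash.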

Writing Content onto IPFS

The first step is to install an IPFS node on your machine and write your content into that node. The IPFS node is how you participate in the peer-to-peer network, reading content from other nodes and providing content to nodes that request it. When you write content into your IPFS node, people will be able to request that content using its hash/fingerprint.

Pinning Data to Save It

IPFS has a notion of pinning content onto your IPFS node. When you “pin” content on your IPFS node, you’re adding the content’s hash (aka fingerprint) to the node’s pin set. As long as you have that hash in the node’s pin set, the node will keep a copy of the corresponding content on your machine.
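
Here is a toy, in-memory model of a pin set, just to illustrate the idea. It is not the IPFS API: the `ToyNode` class and its methods are invented for this example, and SHA-256 hex digests stand in for real IPFS hashes. The sketch assumes two behaviours of real nodes: content you add yourself is pinned by default, and content that is merely cached can be cleaned up by garbage collection unless it is pinned.

```python
import hashlib


class ToyNode:
    """A toy, in-memory stand-in for an IPFS node (not the real API)."""

    def __init__(self):
        self.blocks = {}   # hash -> content held on this machine
        self.pins = set()  # hashes this node has promised to keep

    def add(self, content: bytes) -> str:
        """Store content, pin it, and return its hash."""
        content_hash = hashlib.sha256(content).hexdigest()
        self.blocks[content_hash] = content
        self.pins.add(content_hash)
        return content_hash

    def fetch(self, content_hash: str, peer: "ToyNode") -> bytes:
        """Copy content from a peer into the local cache without pinning it."""
        content = peer.blocks[content_hash]
        # Check that the bytes really match the hash we asked for.
        if hashlib.sha256(content).hexdigest() != content_hash:
            raise ValueError("peer returned content that does not match the hash")
        self.blocks[content_hash] = content
        return content

    def pin(self, content_hash: str) -> None:
        """Add a hash to the pin set so garbage collection keeps its content."""
        if content_hash not in self.blocks:
            raise KeyError("fetch the content before pinning it in this toy model")
        self.pins.add(content_hash)

    def collect_garbage(self) -> None:
        """Drop every cached block whose hash is not in the pin set."""
        self.blocks = {h: c for h, c in self.blocks.items() if h in self.pins}


publisher, friend = ToyNode(), ToyNode()
report = publisher.add(b"annual water quality report")
memo = publisher.add(b"internal memo")

friend.fetch(report, publisher)
friend.fetch(memo, publisher)
friend.pin(report)  # the friend chooses to keep only the report
friend.collect_garbage()

print(report in friend.blocks)  # True: pinned content survives
print(memo in friend.blocks)    # False: unpinned content was cleaned up
```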

Publishing the Hashes

The real power of the distributed web is the fact that anyone can participate. If you publish the hashes for the content that you want to save, anyone who cares about the data can pin their own copies, sharing the burden of storing and serving the data.

Do I have to worry about Bad Content coming onto my machine?

IPFS is a peer-to-peer technology, which tends to bring up concerns about bad content. People want to know “If I run an IPFS node, will that mean people can use my machine to serve bad content without my permission?” and “Will my IPFS node pull bad content onto my machine without my knowledge?” The maintainers of IPFS take this issue very seriously. The IPFS protocol is explicitly designed to ensure that you have complete control over which content comes onto your machine through IPFS. Your IPFS node will only read the content you tell it to read from the network. It will only store the content you tell it to store. This allows you to be confident that bad content won’t accidentally arrive on your machine. If someone on the network publishes bad content, it won’t leak onto your IPFS node. You would have to explicitly request the content in order for it to arrive on your machine, or even to pass through your machine.

Covering Your Bases: Strategies for Making your Content Resilient

In order to truly save your endangered data for the long term, you need to store and distribute the data in ways that are resilient. This requires doing more than just writing your data to IPFS and asking your friends to pin copies of the data onto their machines. You also need to consider issues like redundancy, availability, authenticity, versioning and preservation. Here’s a quick overview of each of those issues with some tips about how to handle them in a decentralized context.

Talk to a Librarian

When grappling with these issues, it’s helpful to look at libraries for inspiration or guidance. Libraries often talk about providing three types of services around their collections: preservation, discovery and access. If you want people to engage with the content you’ve collected, you need to support all three of these things. If you slip in any of these areas, people won’t be able to use your content. This applies to the issue at hand: in order to save your endangered data, you need to cover all three of these bases. You need to preserve the content so that it still exists for people to use. You need to keep metadata about the content so that people can search or browse through the metadata in order to discover what you have in the collection. Finally, you need to give them a way to access the content itself.

Achieving Redundancy

Lots Of Copies Keep Stuff Safe. That’s a foundational idea in any preservation strategy. There’s even a project by that name which helps libraries preserve digital content (alas, it doesn’t use IPFS yet). In order to protect your content, you want to get it pinned in many geographic locations, by many organizations, under multiple jurisdictions.

Ensuring Availability

If you want the data to be available online, redundancy isn’t enough. You need to make sure that some of those copies are actually available on the network, otherwise nobody will be able to access the content. In order for data to be available online at all times, you need to ensure that there are always IPFS nodes connected to the network with copies of your data pinned on them.

Ensuring Authenticity

Once your data are out in the wild, how do we know which data are the real data? Until now, we have relied on location as a proxy for authenticity by saying “If it’s on your server, then it must be the real information that you want me to see.” This is a terrible way of establishing authenticity (the hosts could tamper with data, hackers could change it without anybody knowing, it could accidentally become corrupt, etc.). Nonetheless, that’s how we’ve been establishing authenticity of data on the web for a long time. It’s a strongly ingrained habit that we can’t rely on with distributed technologies. What’s the alternative?

Dealing with Versioning

This is not a one-time process. In most cases datasets change, grow and evolve over time. In order to accommodate those changing, growing, evolving datasets we need ways to keep track of the different versions of content. Thankfully, content-addressing gives you the basic building block that you need in order to track versions gracefully. Powerful versioning tools like git build on that same foundation of content-addressing and use cryptographic hashes to build trees of information to represent history, versions, forks, etc.
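
As a rough illustration of how hashes can link versions together, here is a small Python sketch in the spirit of that approach. It is not how git or IPFS actually store history: the `store` dictionary, the `commit` and `history` helpers, and the JSON record layout are all invented for this example, and SHA-256 hex digests stand in for real hashes.

```python
import hashlib
import json
from typing import Optional


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


store = {}  # hash -> bytes, a toy content-addressed store


def put(data: bytes) -> str:
    """Save bytes under their own fingerprint and return that fingerprint."""
    key = fingerprint(data)
    store[key] = data
    return key


def commit(content: bytes, parent: Optional[str]) -> str:
    """Record a version that points at its content and at the previous version."""
    record = {"content": put(content), "parent": parent}
    return put(json.dumps(record, sort_keys=True).encode("utf-8"))


def history(version: Optional[str]) -> list:
    """Walk parent links from a version back to the first one, newest first."""
    contents = []
    while version is not None:
        record = json.loads(store[version])
        contents.append(store[record["content"]].decode("utf-8"))
        version = record["parent"]
    return contents


v1 = commit(b"station,reading\nA,1.0\n", None)
v2 = commit(b"station,reading\nA,1.0\nB,2.5\n", v1)

print(len(history(v2)))  # 2
```

Because each version record contains the hash of its parent, the hash of the latest version commits to the entire history before it. Changing any earlier version would change every hash after it.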

Preserving Data

Finally, beyond redundancy, availability, versioning, etc. there’s the question of preservation. In order to build a preservation strategy, you need to look at threat models and then figure out how to protect your data from those threats.

Why the Established Tools Aren’t Good Enough

All this talk of decentralization and content-addressing might sound excessive. It’s a major change from the way we’ve been doing things for the past 15 years. As closing observations, we’ll touch on some reasons why it’s not enough to rely on the established tools and technologies.

What’s wrong with just moving the data to a new, trusted location?

Merely moving the data to a new location is not enough because it perpetuates all the problems of location-addressing. It brings all the pain and inconvenience of breaking the location-based links we’ve been relying on but doesn’t bring any of the benefits of switching to a content-addressed approach.

Why isn’t it enough to have everyone download copies of the data?

Lots of Copies Keep Stuff Safe, but simply downloading copies of the data to many locations is basically adopting a decentralized approach without using any of the tools of decentralization. You need a content-addressed approach in order to answer basic questions like “Who has copies of the data?”, “Are these two copies of the data identical?” and in order to communicate things like “Here is the latest version of the data” and “I have the last three versions of the data. Which one do you want?”
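
To show how hashes help answer those questions, here is a minimal Python sketch. It assumes you can read each holder's copy as bytes, and it uses SHA-256 hex digests as a stand-in for IPFS hashes. The holder names, the sample contents and the `group_holders` helper are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def group_holders(copies: dict) -> dict:
    """Map each fingerprint to the sorted list of holders whose copy has it."""
    holders = defaultdict(list)
    for holder, content in copies.items():
        holders[fingerprint(content)].append(holder)
    return {h: sorted(names) for h, names in holders.items()}


# Hypothetical holders and contents, for illustration only.
copies = {
    "university-library": b"dataset v2",
    "volunteer-laptop": b"dataset v2",
    "old-mirror": b"dataset v1",
}

latest = fingerprint(b"dataset v2")
groups = group_holders(copies)

print(groups[latest])  # ['university-library', 'volunteer-laptop']
print(len(groups))     # 2
```

Two copies are identical exactly when their fingerprints match, so a list of holders per fingerprint tells you both who has the latest version and who is still holding an older one.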

Can’t we use the cloud to back up the data?

(Image: design by Chris Watterston)

Can Libraries Save the Day?

Yes, libraries can play a huge role in this. Decentralized technologies are a perfect fit for libraries. This is an amazing opportunity for you to work with your libraries to create a resilient infrastructure for humans to share and hold digital information.

Become a Steward of Your Data

If you’d like to get help with the things discussed in this article, or if you’d like to contribute to IPFS and all the tools that make this possible, go here or email contact@protocol.ai. If you have a use-case in mind but IPFS needs more features or bug fixes, please post issues here.

Working to decentralize the web while striving to meet the world with bravery, generosity and kindness. Program Manager at Protocol Labs, creators of IPFS.

flyingzumwalt
