File encryption and encoding process before being uploaded to the Storj network

How files are encrypted, encoded and concurrently transferred peer-to-peer on the Storj network

Braydon Fuller
5 min read · Oct 5, 2017


Intro

I’m frequently asked to explain the process for transferring files on the Storj network. The steps are completely automated so the actual process is not immediately apparent to an end user, and there isn’t any up-to-date documentation on how it works. This blog post explains the interaction between Client, Bridge and Farmer and exactly what happens to a file when it’s uploaded to Storj, and when it’s retrieved.

As an overview, a Client encrypts and encodes files, a Bridge stores file metadata and monitors the locations of shards, and Farmers store the shards of the files. The Client file encryption and encoding process is implemented in libstorj (with the SIP5 standard) as included in FileZilla.

I will also detail some of the current caveats and limitations. At this time, data can be lost because of network churn, and it's necessary to repair the decay of files — further testing and improvements are still necessary in this area.

Upload Process

  1. The file is encrypted with AES-256-CTR, using a key derived from the Encryption Key seed (the twelve to twenty-four words sometimes called a mnemonic) and an index.
  2. The file is then encoded with Reed-Solomon erasure encoding, expanding the total size to 1 and 2/3 of its original size (technically this ratio is adjustable). A shard size is determined at this point, as a multiple of 2MiB (e.g. 2, 4, 8, 16…).
  3. Each shard is then hashed with SHA256 and RIPEMD160, and the Client asks a Bridge for a location (a Farmer) to store the shard.
  4. The Bridge selects Farmers based on reputation and those least recently used, and asks many of them concurrently if they are willing to store the data. The Farmers respond and are put into a cache of available mirrors for the data. This implementation is currently being deployed; please see an earlier post, Problems with Quasar based publish-subscribe systems…, for more details and background information.
  5. The Client receives in response a Farmer contact from the Bridge, which includes a nodeID, IP address, port, and token. The token is used to authorize the upload from the Client.
  6. Steps 4 and 5 are repeated concurrently for each shard of the file, and data is uploaded to each Farmer at the same time for faster transfers.
  7. When each shard is complete, an Exchange Report is sent to the Bridge with a success or failure status.
  8. An HMAC is generated from the hashes of the shards; this is later used to verify the integrity of the file and that it hasn't been modified.
  9. Once complete, the file metadata is sent to a Bridge, finalizing the upload. The Bridge will then monitor the shards, replicate them, and later also heal missing shards when lost.
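The key derivation, sharding, and integrity steps above can be sketched with stdlib-only stand-ins. This is illustrative, not the libstorj implementation: the real client encrypts with AES-256-CTR and hashes shards as RIPEMD160(SHA256(shard)); here plain SHA-256 stands in for the shard hash so the sketch runs anywhere, the 16-shard cap is an arbitrary example, and the seed derivation follows the BIP39 convention used by mnemonic-based wallets. All function names are hypothetical.

```python
import hashlib
import hmac

MIN_SHARD = 2 * 1024 * 1024  # 2 MiB, the minimum shard size

def derive_seed(mnemonic: str, passphrase: str = "") -> bytes:
    # BIP39-style seed derivation: PBKDF2-HMAC-SHA512, 2048 rounds,
    # salted with "mnemonic" + passphrase.
    return hashlib.pbkdf2_hmac(
        "sha512", mnemonic.encode(), ("mnemonic" + passphrase).encode(), 2048)

def shard_size_for(file_size: int) -> int:
    # Pick a power-of-two multiple of 2 MiB (2, 4, 8, 16 MiB ...),
    # doubling to keep the shard count bounded (the cap of 16 is
    # an illustrative choice, not a protocol constant).
    size = MIN_SHARD
    while file_size // size > 16:
        size *= 2
    return size

def split_into_shards(data: bytes, shard_size: int):
    # Slice the (already encrypted and erasure-encoded) data into shards.
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

def shard_hash(shard: bytes) -> bytes:
    # Stand-in for the protocol's RIPEMD160(SHA256(shard)).
    return hashlib.sha256(shard).digest()

def file_hmac(key: bytes, shard_hashes) -> bytes:
    # HMAC over the concatenated shard hashes, giving one integrity
    # value for the whole file.
    mac = hmac.new(key, digestmod=hashlib.sha512)
    for h in shard_hashes:
        mac.update(h)
    return mac.digest()
```

For a 5 MiB file this yields a 2 MiB shard size and three shards (2 MiB + 2 MiB + 1 MiB), matching the multiples-of-2MiB rule in step 2.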

Download Process

  1. The Client requests the file metadata from the Bridge. This includes the HMAC, the decryption index, the size, and a list of all of the shard hashes. The integrity of the file can be verified before decryption, using the Encryption Key, HMAC and shard hashes. This avoids the potential issues detailed in The Cryptographic Doom Principle blog post by Moxie Marlinspike.
  2. The Client then requests the locations of the shards from the Bridge.
  3. The Bridge then reaches out to about six of the known Farmers storing each shard and asks for a retrieval token. For those that respond, a token and the contact details for the Farmer are sent to the Client.
  4. The Client then downloads each encrypted shard directly from each Farmer to disk, at the position it will occupy in the file, and verifies the hash of the shard.
  5. For each shard, an Exchange Report is sent to the Bridge to report its success or failure. This information can later be used to improve the ability to retrieve files.
  6. If location information wasn't received for some shards, the Client will recover those shards from the Reed-Solomon encoding.
  7. The file is then decrypted and returned to its original size.
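The verify-before-decrypt ordering in step 1 can be sketched as follows. The function names and the pass-through decrypt callback are hypothetical, and the HMAC construction (SHA-512 over concatenated SHA-256 shard hashes) mirrors the simplified upload-side description above rather than any exact wire format; the point is the ordering, which follows the Doom Principle.

```python
import hashlib
import hmac

def verify_file_hmac(key: bytes, shards, expected_hmac: bytes) -> bool:
    # Recompute each shard hash, then the HMAC over all of them,
    # and compare in constant time.
    mac = hmac.new(key, digestmod=hashlib.sha512)
    for shard in shards:
        mac.update(hashlib.sha256(shard).digest())
    return hmac.compare_digest(mac.digest(), expected_hmac)

def download(key: bytes, shards, expected_hmac: bytes, decrypt):
    # Doom Principle ordering: authenticate the ciphertext first, and
    # only touch the decryption routine after the MAC checks out.
    if not verify_file_hmac(key, shards, expected_hmac):
        raise ValueError("HMAC mismatch: file is corrupted or was tampered with")
    return b"".join(decrypt(shard) for shard in shards)
```

Because the HMAC covers the encrypted shards, a tampered or corrupted download is rejected before any decryption is attempted.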

Caveats & Limitations

  • Erasure encoding was added after the encryption step so that Bridges would be able to heal the data without interaction with a Client that holds the Encryption Key. However, this process has not been implemented yet (as of October 2017). It's also not strictly necessary until storage contracts are renewed.
  • Contracts for shards are currently held for a length of 90 days. These can be renewed; however, improvements are needed to bandwidth accounting and to auditing of when shards are no longer available at a Farmer. This will be a more advanced version of the current monitor that creates new mirrors in the event that a Farmer goes offline.
  • Streaming files is not currently well supported. If the first shard is lost, erasure encoding will need to be used to recover it, which could mean downloading more data before the stream can start. This could lead to long delays on the initial load.
  • Erasure encoding on 32-bit systems is currently limited due to the use of memory-mapped files. The maximum file size on 32-bit systems is typically around 2GiB, even though the full 32-bit address space is 2^32 bytes (4GiB).
  • Larger files (greater than 20GB) take much longer to encode and decode, because Reed-Solomon erasure encoding scales quadratically with file size. Other erasure encoding algorithms may become useful in the future to solve these limitations.
  • Smaller files (less than 2MiB) currently lack Reed-Solomon erasure encoding and rely exclusively on mirroring. Once the data is small enough, the metadata necessary to store the locations of the shards can be larger than the file itself, so there is a balance in efficiency to consider.
  • Monitoring, metadata storage, and maintenance of the locations of shards are the responsibility of each Bridge. This data and responsibility are not shared among many different Bridges, and thus the availability of files is dependent upon the accessibility of each individual Bridge.
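The size thresholds above imply two different storage footprints: files under one shard are simply mirrored, while larger files pay the 1-and-2/3 erasure-coding expansion. A rough estimate, assuming the 5/3 ratio described in the upload steps and an illustrative mirror count of three for small files (the mirror count is an example, not a protocol constant):

```python
import math

MIN_SHARD = 2 * 1024 * 1024  # 2 MiB, below which files are only mirrored

def stored_size(file_size: int, mirrors: int = 3) -> int:
    # Files smaller than one shard skip erasure coding entirely and
    # are replicated whole across `mirrors` Farmers.
    if file_size < MIN_SHARD:
        return file_size * mirrors
    # Otherwise Reed-Solomon expands the data to 5/3 of its original
    # size (the adjustable default ratio described above).
    return math.ceil(file_size * 5 / 3)
```

So a 1 KiB file stored on three mirrors occupies 3 KiB network-wide, while a 6 MiB file expands to 10 MiB, which also shows why very small files hit the metadata-overhead balance mentioned above.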

In Conclusion

There are still many improvements that can be made for the persistence of data; however, this shouldn’t discourage anyone from testing Storj as a means of making data securely available and convenient across multiple devices. It’s important to keep additional backups of the data to protect against data loss.

I’ve found myself using Storj while developing, for transferring patches and builds between different systems during cross platform testing. It ends up being the most convenient and secure way to move files around.


Braydon Fuller

Protocol Engineer at Purse working on Bcoin. Previously at Storj and BitPay. I have worked with decentralized systems since 2014, and the web since 1999.