aka "My quest to evaluate distributed storage options for storage and retrieval of the digits of PI".
A mathematical enigma for millennia, the ratio between the circumference of a circle and its diameter (3.14159…) will no doubt continue to be a source of fascination for generations to come.
Although the useful accuracy of this constant (as far as calculations using it are concerned) is probably no more than 10–20 digits, calculating it to a much higher accuracy as a verifiable test of computing power will likely remain popular for some time yet.
Technical uses for the digits range from benchmarking to randomness research, while novelty uses range from works of art to the creation of music!
50 Trillion digits
The current world record for the number of decimal digits of PI calculated stands at 50 trillion! This took around 9 months using software called y-cruncher on a pretty hefty server with large amounts of local storage.
I was intrigued to see whether distributed storage might help store some of the raw data while it is being calculated (this raw storage requirement is several times larger than the result). It turns out, though, that with y-cruncher (and most likely any other software that might be developed), storage access is the main bottleneck in the calculation. So even though distributed storage can be fast (thanks to downloading chunks in parallel), it would only become a viable option if the network were significantly faster than the local storage.
However, already calculated digits are certainly a prime candidate for distributed storage, particularly since the storage required for 50 trillion+ digits (20TB+) is now starting to exceed the maximum HDD capacities readily available.
How PI digits are currently stored
Obviously having all the digits in one massive file isn’t very portable, so it makes sense to split ranges of digits into separate files. This is exactly what the y-cruncher software does.
Since the calculation is done in binary, the most efficient way to store the result is in its native form, and in fact this is one of the options (storing in hex format).
But we typically want the digits in decimal, and it is not possible to take a random chunk of hex/binary digits and convert just that chunk to decimal: the digits need to be converted in a batch. So it makes sense to store them in decimal format if that’s how they are to be retrieved.
Storing one decimal digit (0–9) per byte (0–255) would be incredibly wasteful (only 41.52% efficiency). So y-cruncher employs its own compression method to pack 19 decimal digits into a 64-bit (8-byte) chunk, which is 98.62% efficient. (I would have chosen to pack 12 decimal digits into each 40-bit (5-byte) chunk to achieve 99.66% efficiency, but maybe there was another reason it was more optimal to have 8-byte aligned chunks.)
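To make those efficiency figures concrete, here is a minimal Python sketch. It computes the efficiency of a packing scheme as the information in the digits (log2(10) bits each) divided by the bits used, and shows one way 19 digits can fit in a 64-bit word, which works because 10^19 is smaller than 2^64. The function names and the little-endian byte order are my own assumptions for illustration; this is not y-cruncher's actual file format.

```python
import math
import struct


def packing_efficiency(digits: int, bits: int) -> float:
    """Fraction of the available bits that carries real information."""
    return digits * math.log2(10) / bits


def pack19(digits: str) -> bytes:
    """Pack exactly 19 decimal digits into one 8-byte word.

    Assumption: the digits are stored as a plain unsigned integer in
    little-endian order. The real y-cruncher layout may differ.
    """
    if len(digits) != 19 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 19 decimal digits")
    return struct.pack("<Q", int(digits))


def unpack19(word: bytes) -> str:
    """Recover the 19 digits, restoring any leading zeros."""
    (value,) = struct.unpack("<Q", word)
    return f"{value:019d}"


if __name__ == "__main__":
    for d, b in [(1, 8), (19, 64), (12, 40)]:
        print(f"{d} digits in {b} bits: {packing_efficiency(d, b):.2%}")
    word = pack19("1415926535897932384")
    print(len(word), unpack19(word))
```

Note that unpacking has to pad with zeros back to 19 characters, otherwise a chunk that happens to start with a zero would silently lose digits.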
Obtaining the digits
Fortunately, Google have a website (pi.delivery) dedicated to many things PI, including some novelty uses and serving the digits via an API. It also offers a mirror of the y-cruncher generated files in chunks of 1 trillion digits.
Unfortunately, neither the API (which is limited to 1,000 digits at a time) nor the y-cruncher compressed files of 1 trillion digits (over 400GB each!) are particularly useful for users who want to extract moderate chunks of digits easily, which is one of my reasons for exploring this project!
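To illustrate why a 1,000-digit limit is awkward for anything beyond small lookups, here is a rough Python sketch that splits a digit range into request-sized pieces. It makes no network calls and does not use the real API parameters; `plan_requests` and its arguments are hypothetical names of my own.

```python
from typing import Iterator, Tuple


def plan_requests(start: int, count: int,
                  max_per_request: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs covering `count` digits from `start`.

    Assumption: the only constraint is a maximum number of digits
    per request (1,000, as stated above).
    """
    offset = start
    remaining = count
    while remaining > 0:
        length = min(remaining, max_per_request)
        yield offset, length
        offset += length
        remaining -= length


if __name__ == "__main__":
    # Fetching one million digits would already need 1,000 requests.
    print(sum(1 for _ in plan_requests(0, 1_000_000)))
```

At that rate, one million digits needs a thousand round trips, and the full 50 trillion digits would need 50 billion of them.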
Storing the digits
Before we can consider sharing the digits with potential distributed storage platforms, it would be prudent to first collate the digits in one place. As stated, the 50 trillion digits exceed the storage capacity of 20TB HDDs even in compressed form, so we need to use either multiple HDDs for different ranges of digits or, preferably, a RAID array to give one larger addressable storage volume. An additional benefit of RAID is that it can offer protection from data loss/corruption when a parity configuration is chosen, so that is the option I have used.
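As a sanity check on that storage figure, here is a small Python sketch that estimates the packed size from the 19-digits-per-8-bytes scheme described earlier. It ignores any file headers or other overhead, and `packed_size_bytes` is a hypothetical helper of my own rather than anything taken from y-cruncher.

```python
def packed_size_bytes(num_digits: int,
                      digits_per_word: int = 19,
                      bytes_per_word: int = 8) -> int:
    """Bytes needed when digits are packed into fixed-size words.

    Assumption: a partly filled final word still occupies a full
    word, and no header or metadata bytes are counted.
    """
    words = -(-num_digits // digits_per_word)  # ceiling division
    return words * bytes_per_word


if __name__ == "__main__":
    total = packed_size_bytes(50 * 10**12)
    per_file = packed_size_bytes(10**12)
    print(f"50 trillion digits: {total / 1e12:.2f} TB")
    print(f"1 trillion digits:  {per_file / 1e9:.0f} GB")
```

Under these assumptions the total comes to roughly 21TB, and each file of 1 trillion digits to roughly 421GB, which is consistent with the "20TB+" and "over 400GB" figures above.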
Once downloaded, I can decide if I want to first re-encode the digits into smaller files or a different file format before committing them to distributed storage.
So what next?
As I write, I am still partway through downloading the digits. At over 20TB, I expect this will take several days with a 1Gb/s connection.
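For a rough feel of where that estimate comes from, here is a back-of-the-envelope Python sketch. The `efficiency` parameter is a hypothetical fudge factor of my own for protocol overhead and server-side limits; the values used below are illustrative guesses, not measurements.

```python
def transfer_days(size_bytes: float,
                  link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `size_bytes` over a link.

    `efficiency` is the assumed fraction of the nominal link speed
    actually achieved (1.0 means a perfectly saturated link).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    size = 21e12  # roughly 21TB, as an approximation of "over 20TB"
    for eff in (1.0, 0.5):
        days = transfer_days(size, 1e9, eff)
        print(f"efficiency {eff:.0%}: {days:.1f} days")
```

Even a fully saturated 1Gb/s link needs about two days for this much data, so anything less than ideal throughput quickly stretches that to several days.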
In part 2, I’ll be looking at the raw data once it’s been downloaded and deciding how to proceed from there, including which distributed storage platforms to consider.
I am already familiar with 0chain, Sia & Storj plus very basic experience of Arweave, IPFS & Filecoin but will also consider other platforms if deemed suitable.
About the author
I am an Ambassador and Head of IT for the 0chain project. I have long been a distributed storage enthusiast, having previously written several articles related to this subject.
(EDIT: 18th July - corrected the decimal digit efficiency figures)