As scientific discoveries become increasingly data-driven, the world needs better ways of moving and managing data.
To most of the world, this photo may not be nearly as interesting as the first-ever image of a black hole. To me, however, it’s an incredible visual representation of just how data-driven science and research are today!
And it’s not just any kind of data. All of this is unstructured data — difficult to store, organize, move, and manage for access and collaboration between teams of researchers.
The problem was truly distributed: scientists collected five petabytes (PB) of data in total from telescopes around the world. The now-famous photo of Dr. Katie Bouman with her hoard of disks shows only 64 drives, which, at a typical capacity of 14TB each, would amount to roughly 0.9PB, so that picture shows only a fraction of the data involved! All of this data then needed to be processed in a central location to feed the algorithm that created the black hole image. Since the data was far too large to send over the Internet, the physical hard drives were shipped to processing centers in Germany and Boston.
Many people might be surprised to hear that the data was shipped via FedEx or UPS when we use the Internet to send data from one part of the globe to the other at unprecedented speeds all the time in our daily lives. Why wasn’t this possible for the data used to create the black hole image?
Let’s start by investigating what it takes to move a petabyte of data, let alone five. In 2015, I did the math and calculated that it should take about 11 days to move one petabyte of data over a 10Gbps network link…but that’s in an ideal world. Speaking from real-life experience, it would likely take over a month to move a petabyte over a 10Gbps link.
Unfortunately, four years later, this estimate still holds true. So, physically shipping these hard drives is still the more time-efficient choice, even today.
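The back-of-the-envelope math behind that estimate is worth spelling out. The sketch below assumes a decimal petabyte and a fully dedicated link; the 30% real-world efficiency figure is my own illustrative assumption (standing in for protocol overhead, retransmissions, and shared links), not a number from the original calculation.

```python
# Rough estimate of moving 1 PB over a 10 Gbps link.

PETABYTE_BITS = 1e15 * 8       # 1 PB (decimal) in bits
LINK_BPS = 10e9                # 10 Gbps link capacity
SECONDS_PER_DAY = 86_400

ideal_days = PETABYTE_BITS / LINK_BPS / SECONDS_PER_DAY
print(f"Ideal transfer time: {ideal_days:.1f} days")        # ~9.3 days

# Assumed real-world link efficiency (overhead, retries, contention).
# At ~30%, the estimate stretches to "over a month".
efficiency = 0.30
real_days = ideal_days / efficiency
print(f"Estimated real-world time: {real_days:.1f} days")   # ~31 days
```

Depending on whether you count a petabyte as 10^15 or 2^50 bytes, and how much overhead you assume, the ideal figure lands in the 9–11 day range, which is consistent with the estimate above.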
But physically transporting data is by no means an easier or less labor-intensive process. It’s also a much riskier process when one thinks of the external forces that might conspire to delay or destroy your data en route.
Believe it or not — the disks coming back from the observatories contained the sole copies of the data, because there was simply too much of it to duplicate. The safety of the disks was also potentially subject to such terrors as human error, customs, and, in some cases, extremely specific shipping requirements. Luckily, all shipments of the black hole data went off without a hitch, but as anyone who has used the postal service knows, there are many factors outside our control that could put the data at risk.
In an article in The Atlantic, Don Sousa, the computer-support specialist at Haystack Observatory who managed every shipment of data for the black hole project, recounts the one time he lost a shipment during his 32-year-long career — not due to anything within his control — but to a hijacking in Johannesburg, South Africa.
So here’s my question: Why aren’t we solving these problems with software, instead of FedEx? Imagine an intelligent, scalable system that could send huge amounts of data across the world for collaboration through automated workflows. That’s the kind of high-tech unstructured data management solution that groundbreaking scientific teams need.
In the aforementioned article, Sousa cites money and scale as the science world’s main barriers to using more modern, cloud native solutions, saying, “Too much data and too much money — that’s why we don’t do it that way. Nothing beats the bandwidth of a 747 filled with hard disks.”
However, as the scientific community continues to base discoveries on enormous unstructured datasets, I’m skeptical about how scalable stuffing 747s with disks really is. Sure, nothing beats the bandwidth of a 747, but what about the latency problem? There’s a real opportunity here to reduce latency and accelerate science using modern technologies.
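To make the bandwidth-versus-latency tradeoff concrete, here is an illustrative comparison. Every figure in it (drive capacity, number of drives, flight time) is an assumption chosen for the sketch, not a spec from the EHT project.

```python
# Illustrative "sneakernet" comparison: a cargo flight full of disks
# has enormous throughput but terrible latency.

DRIVE_TB = 14          # assumed per-drive capacity
NUM_DRIVES = 1_000     # assumed cargo load
FLIGHT_HOURS = 10      # assumed transoceanic flight time

payload_bits = NUM_DRIVES * DRIVE_TB * 1e12 * 8
flight_seconds = FLIGHT_HOURS * 3_600

# Effective throughput: the whole payload divided by the flight time.
throughput_tbps = payload_bits / flight_seconds / 1e12
print(f"Effective throughput: {throughput_tbps:.1f} Tbps")  # ~3.1 Tbps

# Latency, though, is the flight itself: the first byte arrives hours
# after takeoff, versus a fraction of a second for a network packet.
print(f"Latency: {flight_seconds / 3_600:.0f} hours vs. ~0.1 s round trip")
```

Under these assumptions the plane really does beat any network link on raw throughput, but nothing arrives until it lands, and a lost or delayed shipment means hours or days of data in limbo.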
This is the opportunity that my company, Igneous, is trying to address. I’ll be the first to admit that current unstructured data management technology isn’t all the way there yet, but I think we’re headed in the right direction. We’re learning from our customers in the scientific research community, such as PAIGE and Altius Institute, and continuing to build out capabilities that scientists will need as more discoveries come from massive unstructured data.