Torrenting the galaxy
This is the story of how a bunch of 1’s and 0’s on a satellite got into a 3D simulation of our Milky Way I constructed that shows the positions of over two million stars around our sun.
The European Space Agency’s Gaia satellite is currently twirling about 1.2 million kilometers beyond the moon. As it spins, it dutifully records the position of every point of light it sees. On September 14th, 2016, the first of many scheduled public releases of that data occurred, in the form of a press conference and about 5200 40MB gzipped CSV files. And that’s compressed CSVs, the total size of the data is 532GB, and over a billion rows, each one another sun.
Each row looks more or less like this:
And Gaia is accurate. For the brightest stars, their positions will be measured to 5 microarcseconds, which is the equivalent of a dime on the moon as viewed from earth. It’s bananas.
Its release garnered little fanfare in the media, 2016 being what it is. Also, despite the miraculous nature of the work being done by this satellite, the main image they released doesn’t really impress at newsfeed-scales.
It’s black and white, and it’s kind of glitchy, and it really belies the amount of information this brave little toaster has gathered for us. One thing this image fails to convey in particular: the fact that over 2 million of these stars now have known 3D coordinates in space, with many more to come. Here’s how I extracted that data from a small ocean of CSV files.
Determining a star’s distance with parallax
Once I’d downloaded all the data, I had to find a way extract just the stars with 3D positions. The data doesn’t actually contain the x,y,z coordinates for the stars, I would have to calculate those. But about 2% of the stars had measurements performed to find their parallax, which is the measurement that allows us to determine a star’s distance from us. Remember that “camera 1, camera 2” part in Wayne’s World?
That’s parallax. If you look at the same thing from two different locations, how much it moves tells you how far away it is.
As the earth (and the Gaia satellite) go around the sun, the stars move a little bit depending on how far away from us they are. The earth is about 150 million km from the sun, so if you took a picture of some stars on June 1st, and then the same exact picture on January 1st, you’d be seeing the stars from two different points, separated by about 300 million km.
By very carefully measuring the locations of stars at the two extremes of its orbit around the sun, Gaia can measure the change in the star’s position and use that change to determine the star’s distance. I wanted to calculate those positions and use that data to render these stars in a 3D simulation I could spin around and view from any angle.
In the weeds with a few hundred gigs of CSV files
My task was pretty simple: filter through all 1 billion rows of data and look at each row to see if it contains a parallax measurement. If it does, let that row through to a new datafile, otherwise skip it.
When you have to process a whole bunch of data, the best way to do it is with streams. Though they’ve been around since before the dawn of unix time, I only recently added them to my metaphorical programming toolbox. Doug McIlroy described them this way back in 1964:
We should have some ways of connecting programs like garden hose — screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also.
When designing software for streams, it’s best to think about your data in discrete chunks or packets, and design a flow that each packet goes through to become the kind of data you want at the end. In my case, on one end I had a directory full of compressed CSV files, and on the other end I wanted to have a single much smaller CSV file that contained only the rows I was interested in. Here’s how I did it:
If you have no programming background, you can still get the gist of what’s going on with a little explanation. If I hide a bit of code, here are the hoses in my data stream and what they are doing:
The initial createReadStream reads a compressed data file from the ESA and starts emitting packets of it as they are ready downstream. They are piped through a series of functions that operate on each packet one at a time, and emit new data at the other end. Here’s the command I run that takes a compressed file and emits a filtered uncompressed one.
node processStars.js GaiaData_000.csv.tar.gz > GaiaData_000.processed.csv
And this is the end of the stream. Gzipped CSV goes in, filtered uncompressed CSV with some calculations performed comes out.
This took a super long time to run. Even when I spread it across my rendering rig and a couple laptops, multithreaded across every core (using this handy bash util) it still took most of the day. And, it turned out later, as it so often does, that I had made an error in my calculation and needed to run the whole thing again. This was a successful proof of concept, but I needed a better way to run it in parallel.
How to throw a few dozen computers at the problem
(*Note: I used Amazon’s stuff for everything, but that’s not an explicit endorsement of Amazon and their technologies or to say that’s the only way to do it. You could do this same work on a pile of Raspberry Pi’s if you wanted to with minimal modification, but I happened to use Amazon’s cloud and it worked pretty well.)
I knew if I wanted to really be able to painlessly work with this much data going forward, I’d need to have it in a place lots of computers could access. Amazon S3 data transfer is free when it stays within Amazon’s data centers, so if I could get the ESA data to S3, then I could use EC2 to throw lots of servers at the problem pretty cheaply.
Downloading all the data to S3 concurrently seemed like a good starter project to get the hang of Amazon’s container service and docker clustering stuff. Using JS streams again, I made a little docker container who’s only job was to download a file from the ESA, and stream it straight back up to S3 as it came in.
This worked pretty well, and I could pretty easily start 20 copies of this process, each one getting a different file url to retrieve, across an arbitrary number of servers without really worrying about the IO part. Within a couple hours I’d downloaded the entire 180GB dataset straight to to my own private S3 bucket. Then I re-ran my original calculation script, but this time instead of running around with usb hard drives in my house, I just clicked a few buttons and suddenly had 5 octocore servers somewhere working on the problem for me. This took only 2–3 hours to complete the whole set, and the result was 5000 or so CSV files again, though this time much smaller, a total of about 2 million rows of information. I downloaded these, concatenated them into a single CSV, and imported them into a local sqlite3 database so I could easily query, sort, and manipulate the data in preparation for the next step.
Packing 2 million stars for a trip to your GPU
CSVs are nice formats because they’re easy to understand. If worse comes to worse, you can query a dataset with the find command in any reliable text editor, and copy and paste data you need into spreadsheets. But whenever you’re dealing with several million of something, every bit of data you can save counts. For example, here’s a single coordinate for a star in CSV format:
Straightforward enough. That’s three numbers, each representing a position in space in X, Y, and Z dimensions in parsecs. This star is 111 parsecs to the right, 111 parsecs up, and about half a parsec forward from the sun. How much memory and storage does that information require?
CSV files are strings, and so the smallest unit of storage is a character. If we count, that row of data has 29 characters, comprised of 2 commas, three decimal points, and a bunch of digits between 0 and 9. In a plaintext ASCII file, each character is stored as a byte, or 8 bits. For each of those 29 characters, there is 1 byte, or 8 1’s and 0’s in a certain combination representing it in a long list in the computer’s memory and on its disk. So for example, a capital A is actually the number 65 in binary, or 01000001, and the number 1 is actually stored as the number 49, or 00110001. When you write a capital A into a text editor, it’s actually flipping 8 transistors somewhere in the machine to represent that value. The letter A is just one way the computer has of representing the number 65.
This all may seem like indulgent neckbeard nonsense, but really knowing how much data your computer is moving around and operating on is a universally useful tool to diagnose bugs and performance issues. Follow the bits! If you can use fewer of them, things will generally be better.
CSV files aren’t terribly efficient from a computer’s perspective. If you’re storing numbers, then converting those numbers into letters that your computer converts back into numbers, it adds a lot of work and overhead to the process. It would be best if we could store the data already in whatever format that the computer will eventually be using.
Computers can process lots of different kinds of information, but it all ends up as bits in the end. Because the information I am dealing with is numbers that have decimal points, I’m using a 32bit floating point value for each coordinate. If we really want to know our data, it’s worth going into how those work a little bit.
32 bit floats are packed like this. The first bit determines if they’re positive or negative. The next 8 bits determine an exponent, and the last 23 bits represent the “significand”. It’s all a bit mysterious, but the short version is this: you’re storing a big integer number in the significand space, and then applying a positive or negative exponent to it to get the final value with a decimal point. It’s definitely more complicated than that and I don’t fully understand it. Suffice it to say, each x, y, and z value for our star ends up as 32 bits in the format above when it is written to memory or mathematically operated upon by the computer. Using bits to store data in Float32 format, the 3 numbers from our CSV above end up like this:
byte0 byte1 byte2 byte3
-------- -------- -------- --------
111.23603 = 01000010 11011110 01111000 11011001
111.36941 = 01000010 11011110 10111101 00100011
0.646683 = 00111111 00100101 10001101 00000100
By cleverly using the number of combinations of bits available to us in groups of 32 (2³² ~ 4.3 billion-ish), we can store data much more efficiently than as a list of individual characters. We also can store data in a pre-formatted way. Meaning that later on when we retrieve these bytes straight from a file, we can immediately use them as numbers without needing to parse through a string and run a conversion. Compare that to the amount of space required for the CSV in plaintext format:
byte0 byte1 byte2 byte...n
-------- -------- -------- --------
111.23603, = 00110001 00110001 00110001 00101110
00110010 00110011 00110110 00110000
111.369408, = 00110001 00110001 00110001 00101110
00110011 00110110 00111001 00110100
00110000 00111000 00101100
0.646683 = 00110000 00101110 00110110 00110100
00110110 00110110 00111000 00110011
With the floating point version, the numbers will always contain the exact same number of bits no matter how big they are. This means that you can just count bits as they come in, and you’ll know exactly how many stars you have received just by the count. You can also start using that data as soon as it is available because you know exactly when each value starts and ends with some simple division. With a string, it’s unclear how much information you are waiting for and the amount changes row to row.
To make a binary file, I again use streams to dump the 32 raw bits of each float32 number directly to a file with no additional formatting. At this point, I’m working with a local database that I derived from the 2 million or so rows I extracted from the CSV files in a previous section, so I can just query it for values with valid parallax. It looks like this:
The file I am using contains an x,y and z coordinate in float32 format for two million stars. So each star requires 12 bytes (96 bits) of data to determine its position. You can’t really get any smaller than that without giving up precision. We are outside the realm of media meant for human consumption, where clever algorithms can throw away data our meat can’t sense. We’re in the realm of pure information now, and you can’t really trick data into being smaller here.
All told, stars.bin is 24,313,140 bytes. So, knowing that each star is exactly 12 bytes, dividing by 12 that should contain exactly 2,026,095 stars.
Torrenting the galaxy
Because the binary file produced above is just one long list of groups of 12 bytes, I can process it however I want, including in pieces as it downloads. The simplest way would be an HTTP request, the same way you download an image or video. Using streams yet again I can process the data as it downloads in discrete chunks as they arrive.
All of this combines to be a pretty resilient method for data storage, and one that lends itself to being easily shared via a flat file over http, or even using webtorrent.js. By default when you arrive at Gaia 3D, you download the star data peer to peer over a torrent. This means that while you are using the simulation, you’re also serving data to other people who are using it. So, if this were to be a viral sensation ten minutes from now, theoretically the load increase on my servers would not include the 24 megabyte data file full of star positions. The more people are using the simulation, the more people are serving the data it requires.
Regardless of how it gets downloaded, the data is passed to the graphics card in the same way. I use Three.js to render a list of coordinates as points in three dimensions in a browser window. This code snippet runs in an update loop every time a frame is rendered, and if a new chunk of data has been downloaded, it passes its data to an array the graphics card will treat as a 3D point cloud. The chunk comes in as an unformatted list of bytes.
Because we stored our bits directly as Float32 values, there is no processing required, and data from the network buffer can be shuttled directly to the GPU, already in its final floating point form. This has huge performance benefits, and lets you scale graphical applications right to the edge of what the metal can do. It also lets you run a simulation like this, with big data files and lots of geometry even on mobile devices.
The Final Result
At the end of it all, the bits end up somewhere in your graphics card. When I copy numbers to a portion of the GPU’s memory, it takes them to represent a point in 3D space where it will render a white dot exactly 1 pixel in size.
Stars load in groups, sequentially in the order they were originally packed by the Gaia team. The mission is still going on, so black parts of the image are parts of the sky Gaia has yet to scan completely. More and more information will arrive over the next few years to fill in this picture. And, because of a meticulous data workflow, stars can download progressively while the frame rate stays usable even on older mobile devices, and I am (theoretically) prepared for a big spike in interest by distributing the data file peer to peer with webtorrents. There’s something nice about sharing the stars with other people.
I hope this was a useful journey for someone. Please do try the simulation itself if you haven’t, the screenshots don’t do it justice. Every single dot you see is a whole other sun, and we know just where it is. If you’re feeling generous, leave your browser window open and you’ll share the stars with any other strangers that happen by. The scripts I use to create files from Gaia are up on github, please feel free to use them on your own projects.
Many thanks to the Gaia team and the ESA for their beautiful and awe-inspiring work. We’re mapping the galaxy in a way that was science fiction ten years ago. I can’t wait for the next batch of data to come in.