For two years, I’ve been working on developer relations for Google’s Cloud Bigtable. It’s a database meant to handle petabytes of data and powers many core Google services, including Search, Analytics, Maps, and Gmail. However, the largest table I’ve created was around 100MB — not even close to what Bigtable can support.
The Bigtable team recently launched an improved version of their monitoring tool Key Visualizer, and given some recently acquired extra free time, now seemed like a great time to try loading in a ton of data and using the updated tool.
In the end, I wrote 10TB of data and discovered that I could reverse-engineer Key Visualizer to create works of art.
What is Bigtable?
Cloud Bigtable is Google’s NoSQL Big Data database service. It’s ideal for running large analytical workloads and building low-latency applications.
If you’re trying to build a mental model of Bigtable: there are rows and columns, and each row/column intersection is a cell. Cells can hold multiple values stored as versions, so a Bigtable table is effectively three-dimensional: row, column, and version. Bigtable is a fairly low-level database, which lets it provide great QPS and scalability, but it offers only basic querying capabilities focused on the rowkey: you can get a single row or scan a range of rows by rowkey.
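To make that concrete, here's a toy sketch of the layout as plain Python dictionaries. This is purely a mental model with made-up data, not the actual client API:

```python
# A toy model of Bigtable's three-dimensional layout: each
# (rowkey, column) cell holds multiple timestamped versions.
table = {
    "user#123": {                      # rowkey
        "cf:name": {                   # column (family:qualifier)
            1700000100: "Billy",       # version timestamp -> value
            1690000000: "William",
        },
    },
}

# Reads address the rowkey first: fetch one row, or scan a
# lexicographic range of rowkeys.
row = table["user#123"]
latest = row["cf:name"][max(row["cf:name"])]   # newest version wins
```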
There are many potential arrangements for how you organize your data, but in general you don’t want to query the same rowkey, or range of rowkeys, too frequently, since that can cause performance problems. That’s where the Key Visualizer tool comes in: it shows you which rowkeys or groups of rowkeys are being queried too frequently.
I saw that Key Visualizer could produce detailed images like the one below based on read throughput, so I wondered: could I reverse the process and come up with a set of reads that would produce a specific image?
Loading 10TB of data into Bigtable (don’t try this at home)
My first step in using the Bigtable Key Visualizer was to create and fill up a database. I wanted to load in a petabyte, but that seemed a little excessive, and I didn’t want to hog resources that could be used for critical businesses.
All of the code used and instructions on how to run it are available on Github, so I will give a high level overview of what the code does, but won’t go into too many of the details in this post.
It’s very easy to create and scale Bigtable instances through the Cloud Console. For 10TB, I can use eight nodes, which gives me more than enough storage and throughput to load my data quickly.
You may notice on the sidebar that this is fairly expensive, so don’t try this at home! For a cheaper alternative, you can create a one-node instance, which you should shut off once you’re done.
Once my instance is created, I use Dataflow to write 100MB of random bytes per row to my table. At 10TB total and 100MB per row, that’s 100,000 rows. To avoid hotspots from sequential writes, I use a rowkey that reverses the iteration number and pads it with zeroes.
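As a sketch of the keying idea (the actual pipeline is on GitHub; the function name here is illustrative), the scheme looks something like:

```python
def make_rowkey(i: int, total: int = 100_000) -> str:
    """Zero-pad the iteration number, then reverse it so that
    consecutive writes land far apart in sorted rowkey order."""
    width = len(str(total))            # e.g. 6 digits for 100,000 rows
    return str(i).zfill(width)[::-1]

# Consecutive iterations no longer hit adjacent rowkeys:
# make_rowkey(1) -> "100000", make_rowkey(2) -> "200000"
```

Reversing the padded number puts the fastest-changing digit first, so sequential writes spread across the whole keyspace instead of hammering one tablet.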
The reversed rowkey helped, but I still ended up adding a few more nodes to stay under the max CPU utilization. I reduced the node count once the data was loaded.
Creating queries that will activate pixels in KeyViz
Now that the data is in the table, I can start to use the key visualizer to monitor the usage patterns.
I discovered that if I continuously do range scans on certain ranges of rows, they light up in the Key Visualizer. I wrote a Dataflow job that performs range scans based on an input CSV. I started out with a simple drawing of a smiley face.
This showed me I’d be able to draw various images once they were written in that format, but I wondered if I could add more depth and use gradients by scanning certain areas more frequently than others. Key Visualizer provides outputs in 15-minute windows. If my job has the input .7 for a range, it has a 70% chance to scan that range on each pass. With hundreds of scans in each window, I hoped the usage would average out accordingly. I tried a scan with the CSV below and was happy to see I would be able to include depth in my images.
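The gradient trick can be sketched as a simple simulation (this is not the Dataflow job itself, just the probability idea):

```python
import random

def should_scan(weight: float, rng: random.Random) -> bool:
    """A CSV weight in [0, 1] is the per-pass probability of scanning."""
    return rng.random() < weight

# Over many passes in a window, a weight of 0.7 averages out to roughly
# 70% of the read volume of a fully-weighted range.
rng = random.Random(42)
hits = sum(should_scan(0.7, rng) for _ in range(10_000))
```

Because each 15-minute window aggregates hundreds of passes, the random per-pass choice converges to a steady fraction of the read volume, which is what shows up as a gradient.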
Once I knew the capabilities of Key Visualizer, I did a bit of math and scripting to take any image and convert it into the needed CSV. I wrote a handy CodePen to do this, so I could easily add more images. I also added an input for the number of hours, which determines the quality of the image produced.
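The conversion itself is simple. Here's a hypothetical Python equivalent of what the CodePen (which is JavaScript) does, assuming a grid of grayscale pixel values:

```python
def image_to_weights(pixels):
    """Map grayscale values (0 = black, 255 = white) to scan weights
    in [0, 1]; dark pixels get high weights so they render 'hot'."""
    return [[round(1 - p / 255, 2) for p in row] for row in pixels]

def weights_to_csv(weights):
    return "\n".join(",".join(str(w) for w in row) for row in weights)

# A 2x2 image: black, white / mid-gray, dark gray
csv = weights_to_csv(image_to_weights([[0, 255], [128, 64]]))
```

Inverting the grayscale value means black areas of the source image get the heaviest scan traffic and show up brightest in the heatmap.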
Once I have that image created and uploaded to a public bucket, I run a pipeline that does the following:
- Download the image CSV
- Divide all the rowkeys evenly based on the dimensions of the CSV
- Create a scan based on several rowkey ranges; to get different intensities/depths in the visualizer, use the pixel values in the CSV to decide whether to include each range
- Scan the table
- Repeat this every second, moving on to the next column every 15 minutes (the minimum width of each monitoring update)
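The steps above can be sketched as a loop. The names and the scan call here are placeholders, not the real Bigtable client (the actual pipeline is on GitHub):

```python
import random

def run_column(column_weights, row_ranges, scan, rng, passes=900):
    """One 15-minute image column: in the real job each pass runs
    once per second; each weighted rowkey range is scanned
    probabilistically according to its CSV pixel value."""
    for _ in range(passes):
        for weight, (start, end) in zip(column_weights, row_ranges):
            if rng.random() < weight:
                scan(start, end)   # placeholder for a Bigtable range scan

# Dry run: the fully-weighted range is scanned every pass,
# the zero-weighted range never.
calls = []
run_column([1.0, 0.0], [("a", "m"), ("m", "z")],
           lambda s, e: calls.append(s), random.Random(0), passes=10)
```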
You can also view and run the ReadData pipeline from Github.
And then… I waited! To get the highest-quality images, the jobs needed to run for a few days: since Key Visualizer gives updates in 15-minute intervals, each additional pixel of image width took 15 minutes. The image height is determined by the number of tablets, roughly one per gigabyte stored, so height isn’t a huge constraint here. If you followed along at home with a smaller table, your image quality will be lower, but should still be recognizable.
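The arithmetic is easy to check: one pixel of width per 15-minute window means image width sets the render time.

```python
def render_hours(width_px: int, minutes_per_px: int = 15) -> float:
    """Hours needed to render an image `width_px` pixels wide."""
    return width_px * minutes_per_px / 60

# e.g. a 288-pixel-wide image takes 72 hours, i.e. three full days.
```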
I set up a few tables and jobs to run at the same time, so I could get more results, and here they are:
There are several parameters you can play with. Brightness changes the scaling of the image, which is helpful if you want to take an in-depth look at a smaller area.
You can also adjust which metric is displayed. “Read bytes client” seems to produce smooth images, while “Ops” produces images with more lines, which can look really cool on some images.
And finally, if you are as big a fan of drag and RuPaul’s Drag Race as me, then you’ll understand why I had to immortalize the queen of drag in the key visualizer as well.