Weekly Programming Challenge #7
Implementing a B+ Tree
I’m a sucker for data structures. One of my favorite things is to dig into a new (to me) data structure and try to implement it. This week, we’re going to look at a data structure that I’ve played with before, but which I’ve always enjoyed and which I think (hope?) will be something fun for everyone else, too: B+ trees!
First, though, let’s recap week #6…
Week #6 Recap
For the 6th weekly programming challenge, we tackled Perlin noise. The goal was to produce an image depicting perlin noise distributed across the canvas, which is something I’d never done before. I was excited to give it a try!
Besides myself, one other person rose to the challenge: Lasse Ebert, a regular to the weekly programming challenges. His solution was once again in Elixir, and even recycled his PPM image library from week #4. For successfully generating 2D Perlin noise, he gets one point! Awesome work, Lasse. Check it out on GitHub.
For my own implementation, I revisited a language I have loved for a long time, but which I rarely get an excuse to use: C. (I get weird looks when I say I love the C programming language, but it’s absolutely true. Pointer arithmetic, memory management, obscure syntax…it’s such a blast to write C code.) I tried various approaches to making Perlin noise (some were even successful!) but then I started analyzing Ken Perlin’s reference implementation (in Java) to see how it works, and wound up implementing a C port of it. I got 3D Perlin noise working, as well as terrain coloring (and others, like fire and clouds), nailing down three solid points. Check it out on GitHub.
I think Perlin noise is something that’s going to go in my “recreational programming” toolbox. It’s very fun to play with.
Weekly Programming Challenge #7
B+ trees are a staple of file systems and databases. The “B” stands for “Balanced”, and in that respect they have a fair bit in common (conceptually, at least) with the self-balancing binary search trees you might have played with in the first weekly programming challenge. (In fact, B trees — sans the plus — are a generalization of self-balancing binary trees, supporting arbitrarily many children nodes.)
The difference, though, is that whereas with a binary tree, each node may have up to two children, these B+ trees may have quite a few more children per node. In fact, you often want these trees to have a high “fanout” (number of children per node), because it increases their efficiency when used with IO operations.
(For this reason, they are popular with file systems and databases, where disk IO is the name of the game, and you want to minimize how often it happens.)
Furthermore, whereas a binary tree (and a B tree) store the value for each key at the corresponding node, a B+ tree only stores values in the leaves. The interior nodes simply act as directories, pointing you to subsequent nodes in the hierarchy, and ultimately to a leaf node.
So! Our challenge this week is to implement a B+ tree. Once again, we’ll lean on the Wikipedia article as the canonical description of the data structure and its associated algorithms.
Normal Mode
For normal mode, and one point, you must implement the following:
- A basic B+ tree with configurable arity (children per node).
- The insertion algorithm, allowing you to add a new key/value pair to the tree.
- The search algorithm, allowing you to find the value for a named key.
- You may choose any data type you wish for the key and value. Dates => strings, so you can query people by birthdate? Integers => geolocations, so you can query cities by population? It’s up to you.
Hard Mode
Once you satisfy normal mode, you can shoot for hard mode. Here, you may earn one additional point each for implementing extra features:
- The deletion algorithm, for removing a given key (and associated value) from the tree.
- A configurable sorting function, allowing data to be sorted in a variety of ways.
- Configurable key and value data types. Permit a B+ tree to be constructed with different key and value types!
- When searching the children of a node (for any operation), use a binary search for efficiency’s sake. This makes it far more feasible to have trees with a fanout of a hundred or more.
- Efficient bulk-loading, allowing a tree to be initialized from a list of data. (Note that, as described in the Wikipedia article, simply adding items one at a time to an empty tree has pretty poor performance, so don’t do that.)
- Link the leaf nodes together in a linked list, making it blazing fast to iterate over the values in sorted order.
- Make like a real database index, and store each node in a separate “block” (configurable in size) inside a file on disk. The root node would be the first block, with children pointers indicating where in the file their corresponding blocks lie. (This arrangement may require changes to your basic implementation, as each operation suddenly has to consider the structure of the file itself!) TWO POINTS for this one.
For bragging rights (but not really any extra points), shoot for the coveted “surprise me” award! Do something creative with your implementation. It doesn’t need to be any harder (necessarily) than normal mode, but it should be notable for some kind of “wow” factor. See what you can come up with!
This challenge will run until Saturday, September 17th, at 12:00pm MDT (18:00 UTC).
The deadline for this challenge has passed, but feel free to try your hand at it anyway! Go ahead and submit a link to your solution in the comments, anytime. If you’re following along, the next week’s challenge is “Implement a Simple Parser and Interpreter”. See you there!