Sharing is Caring — even for data in distributed clusters

Chamin Hewage
Published in LiIDS MAGAZINE.am · Oct 6, 2021
Yo folks, share your food with others!

Can you recall the feeling you get when you share your food with your significant other, a friend, a partner, or even a random stranger? It gives you such a cathartic feeling, right? Of course you’ll be eating a little less, but the simple act of sharing brings so much joy, and perhaps it could even lead to a long-lasting bond. So, in a nutshell, the point I want to make is that sharing is a great thing, and whenever possible we should embrace it.

Now you may wonder what this “simple act of sharing” has to do with data in clusters. Though clusters don’t have feelings, I’ll show you, in practical terms, how a distributed cluster “feels great” when it can share data among its nodes.

So let’s get started… and here’s the story.

As some of you already know, I’m a PhD candidate at University College Dublin, Ireland, in Computer Science. More specifically, I’m researching the development of highly scalable data systems that integrate LiDAR (a form of 3-dimensional data) and remote sensing imagery (2-dimensional data).

Recently I was running some experiments on an 18-node cluster. Before diving into the large-scale experiment with a new index structure, I wanted to do a sanity check. In these experiments, I simply wanted to measure query response time. So, as the first order of business, I ingested my remote sensing data into the 18-node HBase cluster. To make sure that spatially adjacent image data stays close together at the database storage layer (i.e. on disk), I gave each image a four-dimensional Hilbert index (now I see you frowning and saying in your head… “what the F is a four-dimensional Hilbert index, dude?”).

So, my friend, in simple terms, a Hilbert curve is a function that maps multi-dimensional data, such as 2D and 3D points, onto a single dimension while preserving spatial proximity. Since my images are 2D, I took each image’s bounding box, i.e. its bottom-left coordinates (xMin, yMin) and top-right coordinates (xMax, yMax), passed that 4-dimensional point to the Hilbert function, and obtained a Hilbert code. So, when I arrange my image data according to their Hilbert values, I should obtain a degree of spatial proximity among the images when they are stored in the cluster. Therefore, my hypothesis was that organising my image data in the cluster according to their Hilbert values would improve my query response time.
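
To make that concrete, here is a minimal sketch of the key-building step, assuming the third-party hilbertcurve package from PyPI and grid-snapped integer coordinates. The function names, scale, and bit depth are illustrative choices of mine, not the exact values used in my experiments:

```python
# pip install hilbertcurve   (third-party package, assumed here for illustration)
from hilbertcurve.hilbertcurve import HilbertCurve

BITS_PER_DIM = 16   # grid resolution per dimension (assumed value)
DIMENSIONS = 4      # the 4-D point: (xMin, yMin, xMax, yMax)

curve = HilbertCurve(BITS_PER_DIM, DIMENSIONS)

def hilbert_row_key(x_min, y_min, x_max, y_max, scale=100):
    """Snap the bounding-box corners to an integer grid, map the 4-D point
    to a 1-D Hilbert distance, and zero-pad it so keys sort lexicographically."""
    point = [int(round(v * scale)) for v in (x_min, y_min, x_max, y_max)]
    # distance_from_point() in hilbertcurve >= 2.0; older releases name it
    # distance_from_coordinates()
    code = curve.distance_from_point(point)
    return str(code).zfill(20).encode()

# Two overlapping image footprints tend to get nearby Hilbert codes,
# so their rows tend to land next to each other in HBase.
key_a = hilbert_row_key(12.30, 56.70, 13.30, 57.70)
key_b = hilbert_row_key(12.35, 56.75, 13.35, 57.75)
```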

To validate my hypothesis, I assigned a random id to each image (in the first table, each Hilbert code itself serves as the image’s id). These random ids have the same length as the Hilbert codes. Then I ingested this data set into a different table. I expected it to show higher query response times, because the imagery data is not spatially organised on disk; it is scattered all over the cluster without any order. Nevertheless, to my absolute surprise, when I measured the query response time, it was much lower than with the spatially organised images. And this was a typical “oh F” moment of a PhD.
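
For reference, a fixed-length random id can be produced with something as simple as the sketch below; the length is assumed to match the zero-padded Hilbert codes above, and the function name is mine:

```python
import secrets

KEY_LENGTH = 20   # assumed: same length as the zero-padded Hilbert codes

def random_row_key() -> bytes:
    """A random hex string with no spatial meaning; rows with these keys
    end up scattered over the whole key space."""
    return secrets.token_hex(KEY_LENGTH // 2).encode()   # 20 hex characters
```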

I wanted to know what I had done wrong, so I decided to diagnose the problem and find the root cause. After a bit of reading, I learned more about how distributed clusters should be used. The most crucial thing I learned is that, even though we use a cluster, our own mistakes can stop us from using all the available resources, i.e. all 18 nodes; instead, we might use only a few of them. This is exactly what had happened. For clarity, I’ll share the most important steps of the root cause analysis.

Diagnosis step 1: Find the data distribution (check whether our image data gets stored on all the nodes or not)

To perform this diagnosis, I checked the image data distribution for both cases: (i) the case where the Hilbert code is used, and (ii) the case where the random string id is used. Here, I’d like to explain a little about how HBase distributes its data.

When data is ingested into an HBase table, it actually lands in an HBase region, and regions reside on HBase region servers. You can think of a region server as a node in the cluster (which is typically the case) and of a region as the place that holds the data. Regions have a maximum size, and as data keeps getting ingested, the regions grow. Within a region, the data is sorted by row key. So, when 4D Hilbert codes are used as the row keys, spatially adjacent images get stored in the same region (similarly, when timestamps are used as row keys, temporally adjacent records end up in the same region). Because of this concentration of data on specific nodes, we might not use all the available resources in the cluster.
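
To see why, here is a tiny standalone illustration (plain Python, not HBase code) of how lexicographic row-key ordering treats the two key schemes:

```python
import random
import string

# Zero-padded Hilbert codes of three spatially adjacent images: they sort
# next to each other, so they fall into the same key range (i.e. region).
hilbert_keys = [str(code).zfill(20) for code in (1048576, 1048579, 1048581)]

# Random ids of the same length: their lexicographic order has nothing to do
# with spatial proximity, so the rows land in regions all over the cluster.
random_keys = ["".join(random.choices(string.hexdigits.lower(), k=20))
               for _ in range(3)]

print(sorted(hilbert_keys))   # a contiguous block of the key space
print(sorted(random_keys))    # scattered across the key space
```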

With this knowledge accumulated, I decided to investigate the data distribution in my cluster for the two datasets mentioned above. The table below shows the data distribution for the two scenarios. As you can see, when using the 4D Hilbert index as the row key, my cluster used only 4 nodes. In contrast, when I employed a random string of the same length as the 4D index, my data was distributed across all of my nodes. (To obtain the data distribution across the nodes, I used the get_splits command in the HBase shell.)

As a result of this high data distribution in the second scenario, more worker nodes participate in answering my queries.

With this knowledge, I figured I could get the best of both worlds if I distributed my data across all the nodes while still preserving spatial proximity. So, I went ahead and re-ingested all my image data into a different table, this time using salted 4D Hilbert indexes as the row keys (you can read more about HBase key salting in this post). This time, I made sure that all my data was distributed across all the nodes in the cluster (I did this by explicitly pre-splitting the HBase table).
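
Here is a minimal sketch of the kind of salting I mean, assuming one salt bucket per region server and a table pre-split on the salt prefixes. The bucket count, hash choice, and key layout are illustrative assumptions, not necessarily what I used:

```python
import hashlib

NUM_BUCKETS = 18   # assumption: one salt bucket per region server

def salted_hilbert_row_key(hilbert_code: int) -> bytes:
    """Prefix the zero-padded Hilbert code with a deterministic salt so rows
    spread over all pre-split regions, while staying spatially ordered
    within each salt bucket."""
    base = str(hilbert_code).zfill(20).encode()
    salt = int(hashlib.md5(base).hexdigest(), 16) % NUM_BUCKETS
    return f"{salt:02d}-".encode() + base

# Pre-splitting is done once, when the table is created: a split point at the
# start of every salt prefix except the first yields one region per bucket.
split_points = [f"{i:02d}-".encode() for i in range(1, NUM_BUCKETS)]
```

Because the salt is a deterministic function of the Hilbert code, the same image always maps to the same bucket, and the full row key can be rebuilt at read time.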

So, in total, I had three tables: in the first, the 4D Hilbert index is used as the row key; in the second, a random string; and in the third, a salted 4D Hilbert index. Once the data was in the tables, as a sanity check, I re-checked the data distribution. The table below depicts the data distribution results for the three arrangements.

Once I was convinced that my data was distributed the way I wanted, I performed several range queries and measured the query response times. For the range queries, I arbitrarily chose 10 different ranges, varying from small regions (R1, R2, etc.) to much larger regions (R9, R10, etc.); the bigger the region, the more image data a query yields. My expected outcomes were as follows:

  • The query response time would be highest (higher is worse) when using the 4D Hilbert index, because of the poor data distribution.
  • The query response time would be lowest when using the salted 4D Hilbert index, because the data is fully distributed across all the nodes, so all the workers can participate in the querying process (see the sketch right after this list for how such a query fans out over the salt buckets).
  • The query response time when using the random string would lie somewhere in the middle.
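
One practical consequence of salting is that a single spatial range no longer maps to a single contiguous key range: the scan has to be fanned out over every salt bucket and the results merged. Below is a rough sketch of what that looks like with a Thrift-based Python client such as happybase; the host, table name, and bucket count are assumptions, and error handling is omitted:

```python
import happybase   # Thrift-based HBase client, used here purely for illustration

NUM_BUCKETS = 18   # must match the salting scheme used at ingestion time

def salted_range_scan(table, code_start: int, code_stop: int):
    """Run one scan per salt bucket over [code_start, code_stop) and merge
    the results, since salting breaks the global Hilbert-code ordering."""
    start = str(code_start).zfill(20).encode()
    stop = str(code_stop).zfill(20).encode()
    rows = []
    for bucket in range(NUM_BUCKETS):
        prefix = f"{bucket:02d}-".encode()
        rows.extend(table.scan(row_start=prefix + start,
                               row_stop=prefix + stop))
    return rows

# connection = happybase.Connection('hbase-thrift-host')   # hypothetical host
# results = salted_range_scan(connection.table('images_salted_hilbert'), lo, hi)
```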

However, when I plotted the query response times, I obtained the following.

Query response times for the three data distribution scenarios across the different query regions.

From the figure above, as per my expectations, I could see that using the 4D Hilbert index as the row key hurts the query response time because of the lack of data distribution; hence its query response time is high. Nevertheless, I could also see that the other two scenarios, i.e. the random string as the row key and the salted 4D Hilbert index as the row key, always yield similar query response times.

At this point, I don’t have a definitive explanation. Nevertheless, I suspect it’s because my data set is so small (around 7,000 images). I reckon that proving my hypothesis, that preserving spatial proximity while fully distributing the data pays off, requires a much, much bigger data set (probably several hundred million images). If you feel like going for it, kudos to you! :)

And that’s it for this blog post :). I hope you learned something new and enjoyed reading it. Let me know your thoughts in the comments.

Ciao for now!

Additional notes (if you are curious to know):

  • I used New York University’s Peel cluster.
  • Yes, I have Instagram :P (feel free to follow me on Instagram @chamin.ck).
  • If you enjoy this post, please hit the *clap* button ;) (yeah, it’s silly to ask for claps, but why not if YouTubers ask you to hit subscribe ;)).
  • Of course go ahead and read my other posts. Hope you’ll enjoy those as well.

Chamin Hewage

I am a Data Systems scientist (PhD). I work as a Senior Database Engineer at HPE. I aspire to bring the state of the art to the mainstream through innovation.