Our team builds infrastructure services for many clients across Adobe. We have services ranging from commenting and tagging to structured data storage and processing. We need to make sure that data is safe and always available; the services have to work fast regardless of the data volume.
This article is about how we got started with HBase and where we are now. More in-depth reasoning can be found in the second part of the article.
If someone had asked me a couple of days ago why or how we chose HBase, I would have answered in a blink that it was about reliability, performance, costs, etc. (a bit brainwashed after answering “correctly” and “objectively” too many times). However, as the subject has become rather popular lately1, I reflected more deeply on the “how” and “why”.
The truth is that, in the beginning, we were attracted to working with bleeding edge technology and it was fun. It was a projection of the success we were hoping to have that motivated us. We all knew stories about Google File System, Bigtable, GMail and what made them possible. I guess we wanted a piece of that, and Hadoop and HBase were a logical step to reach it.
We didn’t even have a cluster when we started. I begged and bribed for hardware from teams that had extra cycles on their testing machines. We were going to use them just as SETI@home does, well, sort of. Once we got 7 machines, we had a cluster2 running the Hadoop and HBase stack (HStack3). We even went on and refurbished some old broken machines to work as extra test agents, besides our own laptops.
Technology-driven decisions tend to fall over when assessed from a business perspective. I never thought about costs, data loss, etc. We were somehow assuming that these were all fine. If others ran it, why wouldn’t we be able to? We knew this architecture would enable scalability, but we didn’t challenge whether the implementation actually worked.
Scalability and performance “lured” us in. But in reality it’s the implementation that dictates costs, consistency, availability and performance. A good and scalable architecture is just a long term promise, unless it is backed up by the implementation. In our case the architecture choice paid off, as you’ll see.
Once we realized the potential of HBase through early experiments, we subjected it to a full analysis. It took a while to get an objective opinion, but after all the tests, we really knew we were on to something.
The 40M mark
We had already scaled MySQL, so denormalization, data partitioning and replication weren’t all that new to us. When, in mid-2008, one of our clients asked us to provide a service that could handle 40M records, real-time access, aggregation and all that, we thought we had an answer. This was our first step towards doing “big data”.
There were no benchmarking reports4 then, no “NoSQL” moniker, therefore no hype :). We had to do our own objective assessment. The list was (luckily) short: HBase, Hypertable and Cassandra were on the table. MySQL was there as a last resort.
We abstracted the implementation details, and made a stub (so that the clients could start developing), and we started testing each technology stack.
Cassandra was out first. It had just come out, was barely usable, lacked any decent resources or active mailing lists, and could keep only one table per instance. This has changed a lot over time, but we had a deadline then.
Funny enough, HBase was the second one out :).
When we started pushing 40 million records, HBase5 squeaked and cracked. After 20M inserts it failed so badly it wouldn’t respond or restart; it mangled the data completely and we had to start over. It was performing badly and seemed to lose data.
I literally dreamt logs for a week, trying to identify the issues. It was looking as if we would discard HBase, but I insisted we should be able to switch later even if we went ahead with Hypertable. The team agreed, even though HBase didn’t look like a viable choice compared with Hypertable, which handled the data rather well.
The HBase community turned out to be great: they jumped in and helped us, and upgrading to a new HBase version fixed our problems. Hypertable6, on the other hand, still seemed to perform better.
Plan for failure
When testing failover scenarios, HBase started to gain ground, handling node failures gracefully and consistently. Hypertable had problems bringing data back up after node failures and required manual intervention. We were, once more, left with a single option.
But there were more questions to be answered, concerns that we never had with MySQL:
What’s the guarantee that every bit you put in comes back in the same form no matter what?
We had to be able to detect corruption and fix it. As we had an encryption expert on the team (who had authored a watermarking attack), he designed a model that would check consistency on the fly with CRC checksums and allow versioning. The Thrift-serialized data was wrapped in another layer that contained both the inner data types and the HBase location (row, family and qualifier). (He’s a bit paranoid sometimes, but that tends to come in handy when disaster strikes :). Pretty much what Avro does.
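A minimal sketch of such a wrapper, assuming a CRC32 checksum over a simple length-prefixed layout; the field names and byte layout here are illustrative, not the actual scheme we used:

```python
import struct
import zlib

def wrap_record(payload: bytes, row: bytes, family: bytes, qualifier: bytes,
                version: int = 1) -> bytes:
    """Wrap serialized payload with its HBase location and a CRC32 checksum."""
    location = b"/".join((row, family, qualifier))
    # body = 1-byte version, 4-byte location length, location, payload
    body = struct.pack(">BI", version, len(location)) + location + payload
    checksum = zlib.crc32(body) & 0xFFFFFFFF
    return struct.pack(">I", checksum) + body

def unwrap_record(blob: bytes):
    """Verify the checksum; return (version, location, payload) or raise on corruption."""
    (expected,) = struct.unpack(">I", blob[:4])
    body = blob[4:]
    if zlib.crc32(body) & 0xFFFFFFFF != expected:
        raise ValueError("corrupt record: CRC mismatch")
    version, loc_len = struct.unpack(">BI", body[:5])
    location = body[5:5 + loc_len]
    payload = body[5 + loc_len:]
    return version, location, payload
```

Because the location is baked into the checksummed envelope, a record that somehow lands in the wrong row or column family fails verification too, not just one with flipped bits.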
Another scenario we had to cover was a total failure of the Hadoop Distributed File System (HDFS) that HBase relies on.
We adapted an outdated HBase export tool to ensure data consistency during and after backup. We kept backups in three places: locally, in HDFS, and on another distributed file system. We also kept the data in MySQL format so we could switch to a MySQL cluster in case of disaster. We had scripts that would take the latest backup and bring up an alternative storage cluster (HBase or MySQL).
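The “take the latest backup” step can be sketched as below, assuming timestamp-named backup folders that get a marker file once their consistency check passes; the CHECKSUM_OK convention is hypothetical, not our actual layout:

```python
import os

def latest_valid_backup(backup_root: str) -> str:
    """Return the newest backup directory that passed its consistency check.

    Assumes each backup is a folder named by timestamp (so lexicographic
    order equals chronological order) containing a CHECKSUM_OK marker
    written only after verification succeeded.
    """
    candidates = sorted(os.listdir(backup_root), reverse=True)
    for name in candidates:
        if os.path.exists(os.path.join(backup_root, name, "CHECKSUM_OK")):
            return os.path.join(backup_root, name)
    raise RuntimeError("no verified backup found in " + backup_root)
```

The point of the marker file is that a restore script never has to guess whether a half-written or corrupted backup is safe to load.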
Things went pretty smoothly. We created MapReduce jobs to compute recommendations out of the data that was stored, we had an automated backup, and a mechanism for disaster recovery.
In October 2008 our system went live on time.
Disasters WILL happen
On the 3rd of December 2008, around midnight7, sanity alerts started pouring in: the service was running in degraded mode. Our HBase cluster would write data but couldn’t answer reads correctly. Following the procedure, I made another backup and restored it on a MySQL cluster. Then I enabled backups for the MySQL cluster; we were paranoid enough to have a backup procedure for the MySQL “backup” cluster as well :)
We were up and running about half an hour later. MySQL had saved the day, and we congratulated ourselves. Except that our thorough, tested backup plan had a glitch: master-master replication was not set up for the new tables. We soon had two different data sets on the two servers. We had to stop the replication, switch to a single node and fix the consistency issue. Master-master replication is pretty much a hack and needs proper care; otherwise, it’s pretty easy to screw up your data.
After a thorough investigation8, a postmortem and a couple of patches that brought both HBase and HDFS to the latest released version, we switched back to HStack9 on the 5th of December. We had no data loss and our clients only experienced a short interruption. Our client team congratulated us for the fast response.
Most importantly, this was our first reality check and a new lesson learnt. The system still runs today and we have had no problems with it since. After just a few months our biggest client switched direction and never got to the 40 million records.
The system never reached its planned capacity. Even though we took new clients on board and implemented new services on top of it, we weren’t yet the “stars” we’d imagined we would be.
In reality, all we had would have been easy to handle with a MySQL cluster and just a little operational overhead.
The Billion Mark
We decided to switch focus at the beginning of 2009. We were going to provide a generic, real-time, structured data storage and processing system that could handle any data volume. By this time we had caught the attention of bigger potential clients, and the requirements changed a bit. 40 million became 1 billion, with access times under 50 ms and serious processing power. All this with no downtime and definitely no data loss.
This time, we were going to do it right:
We wrote down all the failure scenarios10 we could think of: bad disk, bad memory, packet loss, network card failure, machine and rack power failure, raid controller failure, etc. Nothing was “sacred”. At one point, in sprint demos, we would ask for two random numbers (two machines in the cluster) and unplug the power cable, unplug the network cable, or just randomly pull out hard drives while the cluster was running.
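The demo-time drills amount to random failure injection, sketched below; the node names and action list are illustrative, and actually carrying out the failure was done by hand at the rack:

```python
import random

# Illustrative set of physical failures we would inject by hand.
FAILURE_ACTIONS = ["pull power cable", "pull network cable", "remove disk"]

def pick_failure_drill(nodes, count=2, seed=None):
    """Choose `count` distinct victim nodes and a random failure for each.

    Returns a list of (node, action) pairs; the injection itself is out of
    scope here. A fixed seed makes a drill reproducible for a postmortem.
    """
    rng = random.Random(seed)
    victims = rng.sample(nodes, count)  # distinct nodes, no repeats
    return [(node, rng.choice(FAILURE_ACTIONS)) for node in victims]
```

Picking victims randomly in front of an audience keeps the exercise honest: nobody can quietly prepare the two machines that are about to fail.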
We initially failed to reach 1B records, as we filled up all the disks at under 500M records. However, the performance and resilience to failures under rough conditions helped us “bootstrap” for something bigger, so we got a new cluster11 installed that could handle the necessary capacity.
We also looked at the operational overhead; we always try to automate as much as possible. Our latest deployment system takes any number of barebone machines (using the IPMI interface), deploys the OS, partitions everything, and then deploys and configures Hadoop and HBase along with our systems, completely unattended. Everything is set up using a combination of Kickstart scripts and Puppet for service deployment. Even now, deploying a new cluster is basically a git push away.
We had a 3B record table, on which we ran all our benchmarks: cold start (no memory caches), reads, writes, combined tests in different proportions. Random reads under 15 ms, huge throughput, all the cool stuff you’d imagine.
We started contributing to HBase, HDFS and MapReduce to support all our failover and performance scenarios, making fixes where necessary.
Our system is currently in “beta” inside the organization. We’re still testing it, but it already supports distributed data import, random access and distributed data processing; we have the APIs and web interfaces, and it runs fast.
This is a historical account of how we started working with Hadoop and HBase. The second part of this article will cover more practical and objective reasons why we stick with this technology stack.
- Especially in the “NoSQL” community. Check out a Google trend and a social media trend ↩
- We’ve been watching Hadoop since 2007, but it was this article that triggered me to move from playing to actually deploying it on a cluster. On the 16th of April 2008 our first Hadoop/HBase cluster (called HStack) was operational. ↩
- Our umbrella term for Hadoop, HBase, Zookeeper and friends. ↩
- Like this. Mind you, we see noticeable differences from these results; stay tuned :). ↩
- Our first real tests were running against HBase-0.2.0. HBase later changed its versioning scheme to mirror Hadoop’s versions, so HBase-0.3 became HBase-0.18. ↩
- Hypertable-0.9.27 ↩
- 11:51PM (GMT+2), to be more exact :) ↩
- The problem had started at least a month before, when a bug silently disabled .META. compactions. So the .META. table had a lot of StoreFiles and, from a point onward, HDFS started to throttle open file handles. A full cluster restart would have temporarily fixed it, and proper log monitoring would have alerted us long before it got too late. ↩
- Our umbrella term for Hadoop, HBase, Zookeeper and friends (see note 3). ↩
- Like this. This kind of research comes in very handy, when you try to see how expensive a failure is. ↩
- 7 machines; each with dual quad-core hyper-threaded CPUs, 32GB RAM, 24 10K RPM SATA disks, battery-backed RAID controllers, and IPMI. Commodity hardware does NOT mean crappy hardware. When you want performance, you need lots of spindles and memory. ↩