The video we shared in 2015 introduced our efforts to open up our big data platform to other teams within Xandr. Data Platform as a Service (DPaaS) is our internal offering that allows other teams at Xandr to run analytics on our wealth of data. Our users want to be confident that they’re using the platform safely and appropriately. They want to see only the jobs and resources that are relevant to them, and they don’t want to impact other users or mainline production processes. As operators of the platform, we want to ensure the safety and stability of the system as a whole, with reasonable isolation between our users.
This clearly points to requiring a AAA solution — authentication, authorization and accounting.
- Authentication — identifying an entity acting upon your system
- Authorization — allowing/disallowing that entity to perform actions
- Accounting — keeping track of which entities have performed which actions
We will be tackling these one at a time in the natural order listed above. Each item has its own complexity and intricacies, and this post will discuss those around the first A — authentication.
Which systems need authentication?
The short answer is “all of them!” Actually — that’s the only answer. Security is definitely one of the things in this world where “the chain is only as strong as its weakest link.” Luckily we had already included authentication in many of the systems we’d built in DPaaS — requiring entities (both humans and systems) to log in with credentials. In order to strengthen the chain further, we wanted to have authentication for our Hadoop infrastructure as well.
Non-authenticated Hadoop means that Hadoop simply trusts the self-identification of any entity making a call into the cluster. This means that HDFS API calls, MapReduce jobs, Hive queries, etc. will simply take the username as trusted input and will run the operation as that user. There’s no check whatsoever for that user’s credentials or real identity.
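To make the risk concrete, here is a minimal, library-free Java sketch of how a client identity gets chosen when nothing is verified. It is a simplified model rather than Hadoop’s own code: the class and method names are hypothetical, and it assumes the commonly documented behavior that, under “simple” authentication, a client uses the HADOOP_USER_NAME environment variable when it is set and the operating-system user otherwise.

```java
import java.util.Map;

/** Simplified model of client identity under unauthenticated ("simple") Hadoop. */
final class SimpleAuthIdentity {

    /**
     * Returns the username a client would claim: the HADOOP_USER_NAME override
     * if present, otherwise the OS user. Nothing here proves the caller is that user.
     */
    static String claimedUser(Map<String, String> env, String osUser) {
        String override = env.get("HADOOP_USER_NAME");
        return (override != null && !override.isEmpty()) ? override : osUser;
    }

    public static void main(String[] args) {
        // Prints whichever identity this process would claim on this machine.
        System.out.println(claimedUser(System.getenv(), System.getProperty("user.name")));
    }
}
```

In this model, anyone who can set one environment variable can act as any user, which is exactly the gap that credential-based authentication is meant to close.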
To address this, we explored two ways of locking down our Hadoop infrastructure.
We could choose to skirt the issue a bit by simply not letting users have logical access to the Hadoop cluster at all. We would gate all access through controlled DPaaS-specific API endpoints and that’s it. Only operators and specific systems would be able to access Hadoop directly. Once we added the proper authentication at the API level, we would be done.
This solution was great on paper, but would have been very difficult to implement in practice, both technically and logistically. On the technical side, it would have required an infrastructural change. Our clusters share subnet space with other systems and would have had to be peeled off, a herculean effort on its own. We’d also have needed to manage a whole other network along with bastion hosts, and poke holes for whitelisted services. Ugh! Logistically, executing the project would have involved a lot of cross-team dependencies, which are always challenging to manage and schedule. Additionally, some of our existing users are accustomed to having logical access to Hadoop, and we didn’t want to shut them out.
Kerberos authentication in Hadoop
Alternatively, we could bite the bullet and turn on Hadoop’s support for Kerberos authentication. This would close our security loopholes and give us the control we need over access to Hadoop. We already use Kerberos in-house for some other systems, so the core infrastructure was already in place. Implementation and rollout would mostly be handled by one team, so the logistics were manageable. And looking at the documentation, it didn’t seem like it would be too hard…
Turning on Kerberos in Hadoop had two major pieces, each with its own challenges. The first was enabling authentication within Hadoop: ensuring HDFS, YARN, and Hive would all interact properly in a kerberized environment. The second was ensuring that all our systems and processes that interact with Hadoop could authenticate and work seamlessly with the locked-down cluster.
Locking down Hadoop looked like it was just a matter of following the recipe published on the website of Cloudera, our Hadoop distribution provider. The Cloudera docs helped a lot and gave us a jump start, but during implementation we ran into many challenges that required deep dives into the code, debugging, and trial and error.
The most complicated of these was adapting to a quirk in our network/host infrastructure: our hostnames don’t match their fully qualified domain names in DNS. This causes consternation for Hadoop. Without going into excruciating detail, Hadoop makes some assumptions about how your Kerberos principals are named and configured, and about the validation it performs when connecting to any given server in the cluster. Because Xandr’s infrastructure doesn’t conform to those assumptions, it took us a while to find a workaround (other than completely revamping how our hostnames are set up). The trick was an obscure feature that allows a glob pattern to be used for server principals (see: Hadoop Jira Issue or the code itself). Every Kerberos principal configuration value also allows a ‘.pattern’ version of it.
Configuration excerpt example
After discovering that, it took a lot of trial and error to find the myriad configuration values that needed an additional ‘.pattern’ entry.
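The effect of a ‘.pattern’ entry can be illustrated with a small, library-free Java model. This is a simplified sketch of the behavior described above, not Hadoop’s implementation: the class name, the hostnames and the EXAMPLE.COM realm are hypothetical, only ‘*’ and ‘?’ globs are handled, and ‘dfs.namenode.kerberos.principal’ is used purely as a representative key.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Simplified model of validating a server principal, with and without a ".pattern" key. */
final class PrincipalPatternCheck {

    /** Converts a basic glob ('*' and '?' only) into a regular expression. */
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    /**
     * Accepts the principal a server presents if it matches "<key>.pattern" when that
     * key is set; otherwise requires an exact match after substituting the hostname
     * the client resolved for the "_HOST" placeholder.
     */
    static boolean accepts(Map<String, String> conf, String key,
                           String resolvedHost, String presentedPrincipal) {
        String pattern = conf.get(key + ".pattern");
        if (pattern != null && !pattern.isEmpty()) {
            return globToRegex(pattern).matcher(presentedPrincipal).matches();
        }
        String configured = conf.get(key);
        if (configured == null) {
            return false;
        }
        return configured.replace("_HOST", resolvedHost).equals(presentedPrincipal);
    }
}
```

With only the base key set, a server that presents a fully qualified principal is rejected when the client resolved a short hostname. Adding a value such as ‘hdfs/*@EXAMPLE.COM’ under the ‘.pattern’ key lets the same principal through, while principals for other services still fail.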
The next big challenge was Hive. We’d been happily running our production queries using the ‘hive’ command line tool in embedded mode for years. Unfortunately, in a kerberized Hadoop environment, this was no longer going to fly for us. We had to migrate to using the ‘beeline’ client connecting to HiveServer2. We also enabled authentication on the Hive metastore server, which caused some fun difficulties on the client side, detailed below.
Xandr client systems
Fixing data loading was not terribly complex. Our data loading code is written in Java, and we were able to easily leverage the Hadoop libraries to make properly authenticated HDFS calls into our cluster. We created a Kerberos principal for a data loader user and gave it Hadoop impersonation privileges. We did this because our data loading system loads data that can potentially be owned by several different users, and we wanted to keep this flexibility going forward. The following code creates a properly authenticated connection to HDFS with impersonation:
More information on secure impersonation in Hadoop can be found at the official Hadoop documentation site.
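Impersonation privileges of this kind are granted on the cluster side through ‘hadoop.proxyuser.<user>.hosts’ and ‘hadoop.proxyuser.<user>.groups’ settings. The following library-free Java sketch models how such a rule might be evaluated. It is a deliberately simplified illustration, not Hadoop’s logic: the class name, the ‘dataloader’ user, the host and the group names are all hypothetical, and real clusters support richer host and user matching than the plain comma-separated lists assumed here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;

/** Simplified model of a proxy-user (impersonation) rule check. */
final class ProxyUserRules {

    /** True if the configured comma-separated list is "*" or contains the value. */
    static boolean listAllows(String configured, String value) {
        if (configured == null) {
            return false; // nothing configured means nothing is allowed
        }
        if (configured.trim().equals("*")) {
            return true;
        }
        return Arrays.stream(configured.split(","))
                .map(String::trim)
                .anyMatch(value::equals);
    }

    /**
     * The super user may impersonate a target user only when the call comes from an
     * allowed host AND the target user belongs to at least one allowed group.
     */
    static boolean mayImpersonate(Map<String, String> conf, String superUser,
                                  Set<String> targetUserGroups, String clientHost) {
        String prefix = "hadoop.proxyuser." + superUser;
        String groups = conf.get(prefix + ".groups");
        boolean hostOk = listAllows(conf.get(prefix + ".hosts"), clientHost);
        boolean groupOk = (groups != null && groups.trim().equals("*"))
                || targetUserGroups.stream().anyMatch(g -> listAllows(groups, g));
        return hostOk && groupOk;
    }
}
```

Keeping both lists narrow limits the damage if the data loader’s keytab is ever exposed, since the principal can then act only for users in the named groups and only from the named hosts.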
Our job execution is a mix of custom MapReduce jobs and Hive queries. Supporting Kerberos required refactoring our job launching software and, luckily for us, we’d had the foresight to make it a single framework that all our jobs use. The framework is written in Java, so we leveraged the same libraries as above. We hoped all we’d have to do was put a big doAs() block around our job launcher.
Here’s a pared-down version of what we ended up doing. Some of the full class names are specified where there might be ambiguity.
The above was good enough for our MapReduce jobs, but not quite for Hive. Our system was using the ‘hive’ command line tool in embedded mode for all production Hive queries. This was not going to work in a kerberized environment. The ‘hive’ CLI tool has been deprecated for some time, and we had to migrate to HiveServer2 (as mentioned above). We switched all our jobs to use the ‘beeline’ client instead of the ‘hive’ CLI. This was a pretty straightforward change.
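As a rough illustration of what the switch can look like from a Java job launcher, the sketch below builds a ‘beeline’ command line for a kerberized HiveServer2. The class name, host, port, database, realm and script path are hypothetical. It relies on two things that hold for beeline in general: the JDBC URL carries the HiveServer2 service principal in a ‘principal=’ parameter, and ‘-u’ and ‘-f’ supply the URL and a script file.

```java
import java.util.List;

/** Builds a beeline invocation for a kerberized HiveServer2 (illustrative only). */
final class BeelineCommand {

    /** Composes a HiveServer2 JDBC URL that names the server's Kerberos principal. */
    static String jdbcUrl(String host, int port, String database, String serverPrincipal) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database
                + ";principal=" + serverPrincipal;
    }

    /** Returns the argument list for running a query script through beeline. */
    static List<String> forScript(String jdbcUrl, String scriptPath) {
        return List.of("beeline", "-u", jdbcUrl, "-f", scriptPath);
    }

    public static void main(String[] args) {
        String url = jdbcUrl("hs2.example.com", 10000, "default",
                "hive/hs2.example.com@EXAMPLE.COM");
        // Print the command instead of running it, so the sketch has no side effects.
        System.out.println(String.join(" ", forScript(url, "/path/to/query.hql")));
    }
}
```

The resulting list can be handed to a ProcessBuilder. The launching process still needs a valid Kerberos ticket of its own, since the URL identifies only the server’s principal, not the caller’s.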
Hive metastore was another story, unfortunately. The metastore client we were using (or the way we were using it) wasn’t communicating properly with a Kerberos-enabled metastore server. We ended up switching from using the ‘org.apache.hadoop.hive.service.ThriftHive.Client’ client to using the ‘org.apache.hadoop.hive.metastore.HiveMetaStoreClient’. ‘HiveMetaStoreClient’ uses ‘ThriftHive.Client’ under the covers, but it gets configured directly from ‘hive-site.xml’ (as long as it’s on the classpath - a hard-learned lesson). We came to this conclusion by inspecting the Hive Metastore code itself. After seeing which client code it uses internally, we determined it used the same library.
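Because a missing ‘hive-site.xml’ tends to fail quietly, with the client simply falling back to defaults, a startup check can save some debugging time. The sketch below is one hypothetical way to do that using only the JDK. It assumes the file is looked up as a classpath resource, as described above; the class and method names are illustrative.

```java
import java.net.URL;

/** Startup sanity check: is a given configuration file visible on the classpath? */
final class ClasspathCheck {

    /** Returns the resource's location, or null when the classpath does not contain it. */
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    static boolean isOnClasspath(String resourceName) {
        return locate(resourceName) != null;
    }

    public static void main(String[] args) {
        URL url = locate("hive-site.xml");
        System.out.println(url == null
                ? "hive-site.xml NOT found on classpath"
                : "hive-site.xml loaded from " + url);
    }
}
```

Logging the resolved location also helps catch the case where a stale copy of the file earlier on the classpath shadows the intended one.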
This library was identical to the “Data Loading” use case above, except we didn’t need to do any impersonation. We just used the Kerberos authenticated user to directly access HDFS.
Throughout this process, we kept in mind the complexity of the production deployment. Unfortunately, this deployment falls into that undesirable “big bang” category where you have to flip it all on at once, or not at all. There’s no staged roll-out approach where we just secure part of the cluster, or just some of our client software — it was all-or-nothing.
The key piece for us was to ensure that the Kerberos authentication could be easily turned on and off across the board. The deployment would turn it on, and if something went awry, we’d turn it off. Therefore, we made sure that the authentication method (“kerberos” vs. “simple”) was a configuration parameter in all of our software. The actual code required to authenticate was deployed to production in a dormant state well ahead of flipping the switch. Both the authenticated and non-authenticated code paths were well tested and exercised. This also enabled us to easily practice the deployment in our pre-production environments.
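A minimal sketch of such a switch is shown below, assuming a hypothetical ‘auth.method’ property and illustrative class names; the source does not say what the actual parameter was called. The point is that both code paths ship together and a single configuration value selects between them, with unknown values rejected rather than silently treated as “simple”.

```java
import java.util.Locale;
import java.util.Properties;

/** Selects the authentication code path from a single configuration value. */
final class AuthToggle {

    enum AuthMethod { SIMPLE, KERBEROS }

    /**
     * Reads the (hypothetical) "auth.method" property, defaulting to "simple".
     * Throws IllegalArgumentException for anything other than simple or kerberos.
     */
    static AuthMethod fromConfig(Properties props) {
        String raw = props.getProperty("auth.method", "simple");
        return AuthMethod.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Describes what the client would do at startup for the configured method. */
    static String describeLogin(Properties props) {
        switch (fromConfig(props)) {
            case KERBEROS:
                return "log in from keytab before talking to the cluster";
            default:
                return "no login step; connect directly";
        }
    }
}
```

Failing fast on a misspelled value matters here, because a typo that quietly selected the unauthenticated path would only surface once the cluster started rejecting calls.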
There was no magic to deploying this. It required a lot of planning and coordination. We staged as many things up front as we could, such as the Kerberos principals and keytabs, and all the code and configuration mentioned above. The deployment was an orchestrated procedure to:
- Pause our data pipeline
- Safely bring down the Hadoop cluster
- Flip from “simple” authentication to “kerberos”
- Bring the cluster back up
- Resume the data pipeline
- Breathe :)
Now that we have all of our authentication pieces in place, we’re tackling the second A, authorization, next. Stay tuned to the Xandr Tech Blog as we’ll be sure to share our designs and trials & tribulations around authorization.