Taming the Three-headed Beast: Understanding Kerberos for Troubleshooting Hadoop Security

Iman Akbari
10 min read · Apr 5, 2020


Kerberos takes its name from a three-headed dog in Greek mythology.

Three years ago, I was tasked with migrating the data analysis stack of a company with millions of daily active users from its traditional SQL-based infrastructure to Apache Hadoop. Back then, I considered securing the cluster, although it wasn’t really necessary, as the cluster was only accessed internally by the data engineers and scientists. I soon found out that it is not as simple as a few configurations in some XML file. In fact, I was surprised by how much extra complexity is needed for something that seemed pretty trivial: “I just want my cluster to have auth!”. After a couple of days of struggling with Kerberos, KDC, keytab files, principals, and other things that I didn’t 100% understand back then, I just decided to move on. Why fix it if it’s not broken?

Fast-forward three years, I found myself working on a different Big Data project in a different organization on a different continent. This time, we had to have Hadoop security as we were installing a multi-tenant solution. This is where it all hit back.

What is Kerberos

Kerberos is, in fact, the go-to solution for centralized authentication for most network admins. Many people use it every day without even knowing it, partly thanks to the fact that Microsoft basically adopted Kerberos, renamed it, and made sure their version doesn’t work with the one everybody else uses (kind of like what they did with POSIX). It is the cornerstone of Active Directory and commonly works alongside LDAP and Samba.

Truth is, you don’t really need to fully understand how it works under the hood, although I strongly recommend that you take a few minutes to learn the basic concepts from this amazing tutorial (believe me, I’ve watched a few of them). The main idea is: instead of each service managing passwords and access itself, let a centralized server do it and issue “tickets” for the user. The user can then present this ticket as a token, and thanks to a very cleverly designed protocol, the password never needs to be passed around on the network. We’ll explore that with a simple analogy later.
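As a quick illustration of the day-to-day workflow (the user principal iman@IMANAKBARI.COM below is just a hypothetical example), obtaining and inspecting a ticket from the shell looks like this:

# ask the KDC for a ticket-granting ticket; you will be prompted for the password
kinit iman@IMANAKBARI.COM

# list the tickets currently sitting in your credential cache
klist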

Why does Hadoop use Kerberos and make my life harder

Rule of thumb: if you’re implementing your own auth and security utilities from scratch, chances are you’re screwing up. Small errors in the design or implementation of these protocols and routines can cause catastrophic vulnerabilities, so you should not reinvent the wheel. This is why the Hadoop project did not invent its own authentication scheme. There are many, many modules involved in a Hadoop cluster, and new projects based on Hadoop emerge fairly frequently, so using a well-known standard is a good engineering decision.

What do I need to know?

Hadoop is known to be a nightmare to configure and maintain, so it is important to understand a few key concepts here, explained in simple English:

KDC: Imagine that instead of showing your driver’s license everywhere, there was a machine that scanned your driver’s license and issued you disposable tickets with your name on them. That machine is called the KDC (Key Distribution Centre) in Kerberos. You could then use such a ticket to get into the club or something. So simply put, the KDC is the central authority that verifies who you are and hands out tickets.

Principal: Think of Kerberos principals as the name on an ID card. It is the equivalent of a user entity in the Kerberos framework.

Keytab File: a keytab file is close to an RSA key in its functionality. Basically, it is the “driver’s license” in the ticket-machine analogy. It’s what you present to the KDC for it to issue you a ticket, but it’s not the only way to do so: the machine can also take your password or PIN, but in secure Hadoop we mostly use keytab files. Of course, the keytab/password doesn’t get sent to the KDC in the clear and there are more steps involved, but for now let’s not focus on those details. Why not show the driver’s license (keytab file) itself to the club bouncer instead of the ticket? Because we don’t really trust bouncers not to copy it, and we don’t want to keep carrying our “driver’s license” around and risk losing it (that’s part of why Kerberos is so secure. Also, the ticket that you present at the club is kind of written in bouncer language, so if someone steals it, they can’t use it at the bank).
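To make this concrete: once a keytab exists for a principal, you can authenticate with it non-interactively. The path and principal below are just placeholders following the layout Ambari typically uses; yours may differ:

# get a ticket using the keytab instead of typing a password
kinit -kt /etc/security/keytabs/nn.service.keytab nn/oldtrafford.imanakbari.com@IMANAKBARI.COM

# verify that the ticket was issued
klist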

Hadoop Kerberos Configs

First of all, I’d highly suggest using Ambari if you’re going with the secure cluster for the first time. Trying to figure out Hadoop Security without using Ambari is like learning how to moonwalk wearing ice cleats.

For securing the cluster, you need to install Kerberos first. The instructions here are written for CentOS 7, but it’s a fairly similar process on Debian-based systems as well. Install the Kerberos client on all cluster machines. Here, we assume one master node called oldtrafford and two slaves called amfield and stamford:

pdsh -w oldtrafford,amfield,stamford sudo yum install -y krb5-workstation krb5-libs krb5-auth-dialog

Install the Kerberos server on the master node:

sudo yum install -y krb5-server krb5-libs krb5-auth-dialog krb5-workstation

The most important config file in Kerberos is /etc/krb5.conf, which, among other things, holds the details of where the KDC and admin server are. In /etc/krb5.conf, set:

[libdefaults]
...
default_realm = IMANAKBARI.COM
...

[realms]
IMANAKBARI.COM = {
  kdc = oldtrafford.imanakbari.com
  admin_server = oldtrafford.imanakbari.com
}

[domain_realm]
.imanakbari.com = IMANAKBARI.COM
imanakbari.com = IMANAKBARI.COM

Copy the /etc/krb5.conf file to all hosts in the cluster.
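One straightforward way to do that, assuming you have root SSH access to the slave nodes (the scp loop below is just one option; pdcp from the pdsh package works as well):

for host in amfield stamford; do
  scp /etc/krb5.conf root@${host}:/etc/krb5.conf
done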

The configs above define a realm named IMANAKBARI.COM (writing realm names in all caps is the convention) and point to the domain name of the KDC and admin server for that realm. Notice that in Kerberos we always use FQDNs, the “full” name of the host. A realm is essentially an auth domain; it can represent an institution’s or a company’s auth system. For instance, at the University of Waterloo, we use the realm CSCLUB.UWATERLOO.CA for our campus network authentication.

Then we use the kdb5_util utility to create the Kerberos database. Be careful not to lose the KDC master password.

sudo kdb5_util create -s

Now, update the /var/kerberos/krb5kdc/kdc.conf file on the server:

[realms]
IMANAKBARI.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}

Update ACL on the Kerberos server by editing the /var/kerberos/krb5kdc/kadm5.acl file:

*/admin@IMANAKBARI.COM *

Start the KDC:

sudo service krb5kdc start
sudo service kadmin start

And make sure it runs on startup:

sudo chkconfig krb5kdc on
sudo chkconfig kadmin on

Create the admin principal (the other principals will be made by Ambari itself)

sudo kadmin.local
addprinc root/admin@IMANAKBARI.COM

Download JCE policy 8 and place it in Java libs on all hosts:

sudo unzip -o -j -q jce_policy-8.zip -d <JAVA_HOME>/jre/lib/security/

And now you can run the Kerberization wizard in Ambari by going to Admin>Kerberos and clicking Enable Kerberos.

The process takes quite a while, and it has to restart all services. By default, Ambari generates all the keytabs needed by HDFS, YARN, SPNEGO, Spark, etc.

Now, you can’t expect everything to just work after the wizard. Troubleshooting Kerberos has had me pulling my hair out, and I’m sure I’m not the only one. Here are a few of the problems that I had to figure out:

Troubleshooting a Kerberos-enabled Cluster

The _HOST value

In the HDFS configs, a trick is used so that you don’t have to write a separate XML config for each host in the cluster: the _HOST macro in principal names automatically resolves to each host’s name. Now, that is a recipe for disaster, because in a lot of environments a host may have several different “names”. This is typical in “multi-homed” environments, on which a very good tutorial is available here.
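For example, the NameNode principal is usually configured with _HOST instead of a hard-coded hostname, along these lines (the realm is just the one used throughout this article):

<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@IMANAKBARI.COM</value>
</property>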

In order to avoid errors like the following when starting the cluster after Kerberization, you need to know exactly what each variable in the Hadoop configurations translates to:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 361, in <module>
    NameNode().execute()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 375, in execute
    method(env)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 978, in restart
    self.start(env, upgrade_type=upgrade_type)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/namenode.py", line 99, in start
    upgrade_suspended=params.upgrade_suspended, env=env)
  File "/usr/lib/python2.6/site-packages/ambari_commons/os_family_impl.py", line 89, in thunk
    return fn(*args, **kwargs)
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 234, in namenode
    create_hdfs_directories()
  File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/hdfs_namenode.py", line 301, in create_hdfs_directories
    mode=0777,
  File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 166, in __init__
    self.env.run()
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run
    self.run_action(resource, action)
  File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 124, in run_action
    provider_action()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 606, in action_create_on_execute
    self.action_delayed("create")
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 603, in action_delayed
    self.get_hdfs_resource_executor().action_delayed(action_name, self)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 330, in action_delayed
    self._assert_valid()
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 289, in _assert_valid
    self.target_status = self._get_file_status(target)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 432, in _get_file_status
    list_status = self.util.run_command(target, 'GETFILESTATUS', method='GET', ignore_status_codes=['404'], assertable_result=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 177, in run_command
    return self._run_command(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/providers/hdfs_resource.py", line 250, in _run_command
    raise WebHDFSCallException(err_msg, result_dict)
resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://oldtrafford.imanakbari.com:50070/webhdfs/v1/tmp?op=GETFILESTATUS'' returned status_code=403.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 403 org.apache.hadoop.security.authentication.client.AuthenticationException</title>
</head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /webhdfs/v1/tmp. Reason:
<pre>    org.apache.hadoop.security.authentication.client.AuthenticationException</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/><br/>

or this one:

resource_management.libraries.providers.hdfs_resource.WebHDFSCallException: Execution of 'curl -sS -L -w '%{http_code}' -X GET --negotiate -u : 'http://oldtrafford.imanakbari.com:50070/webhdfs/v1/tmp?op=GETFILESTATUS'' returned status_code=403.
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 403 org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Invalid argument (400) - Cannot find key of appropriate type to decrypt AP REP - AES256 CTS mode with HMAC SHA1-96)</title>
</head>
<body><h2>HTTP ERROR 403</h2>
<p>Problem accessing /webhdfs/v1/tmp. Reason:
<pre>    org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Invalid argument (400) - Cannot find key of appropriate type to decrypt AP REP - AES256 CTS mode with HMAC SHA1-96)</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/><br/><br/>

When running into a problem like this, the first thing to check is the upper/lower case of the realm name in the configurations. Kerberos is, of course, case-sensitive.

But another source of the problem is _HOST substitution. By default, _HOST is substituted with the value of Java’s InetAddress.getLocalHost().getCanonicalHostName().toLowerCase(), unless a config named hadoop.security.dns.interface is set in core-site.xml.

The hadoop.security.dns.interface setting essentially tells Hadoop to take each host’s IP address on a given interface (say enp0s4), do a reverse DNS lookup on that address, and replace _HOST with the resulting domain name. You can also tell Hadoop which DNS server to use for this lookup, via the hadoop.security.dns.nameserver configuration.
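For instance, a core-site.xml override along these lines (the interface name and nameserver IP below are purely illustrative) pins the _HOST resolution to a specific interface and DNS server:

<property>
  <name>hadoop.security.dns.interface</name>
  <value>enp0s4</value>
</property>
<property>
  <name>hadoop.security.dns.nameserver</name>
  <value>192.168.1.1</value>
</property>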

On the other hand, when Ambari generates the keytabs, it uses the hostnames you registered in Ambari, which should typically be the “full” host name (FQDN), like oldtrafford.imanakbari.com. You can check those names in the Hosts tab of Ambari.

You can check the principal names in the generated keytabs using the klist -kt command:

$ klist -kt /etc/security/keytabs/spnego.service.keytab
Keytab name: FILE:/etc/security/keytabs/spnego.service.keytab
KVNO Timestamp         Principal
---- ----------------- -----------------------
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM
   4 18/03/20 13:08:29 HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM

So in communications between Hadoop components, the identity (principal name) that SPNEGO on oldtrafford presents is HTTP/oldtrafford.imanakbari.com@IMANAKBARI.COM. This must exactly match whatever HTTP/_HOST@${realm} translates to, so if _HOST resolves to oldtrafford and not oldtrafford.imanakbari.com, everything gets messed up, because the other components receive credentials different from the ones they expect.

The best way to troubleshoot these problems and make sure these configs are correct is to inspect what the configurations actually resolve to. If hadoop.security.dns.interface is not set, a short Java snippet like the one below lets you check what _HOST will be replaced with on each machine:
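This is a minimal sketch (the class name is arbitrary) that simply prints the default value Hadoop substitutes for _HOST:

import java.net.InetAddress;

public class HostCheck {
    public static void main(String[] args) throws Exception {
        // By default, Hadoop replaces _HOST with this lower-cased canonical hostname
        System.out.println(InetAddress.getLocalHost().getCanonicalHostName().toLowerCase());
    }
}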

If there is a discrepancy, you can use the hadoop.security.dns.interface and hadoop.security.dns.nameserver configurations. To check what the DNS returns, you can use the nslookup tool:

$ nslookup 192.168.1.100
100.1.168.192.in-addr.arpa    name = amfield.imanakbari.com.

Also, keep in mind that if no DNS server is set, the /etc/hosts file will be used, and the order of domain names on each line matters for how reverse lookups resolve. For example, with the following entry, looking up 192.168.1.100 will return amfield.mycluster.imanakbari.com and not amfield.imanakbari.com, which can cause errors.

192.168.1.100         amfield.mycluster.imanakbari.com amfield.imanakbari.com

krb_error 41 Message stream modified

I’d also like to share my experience with another error that isn’t discussed much on StackOverflow or the Cloudera forums. When using the Spark client inside a Docker container, I ran into the following error when starting a Spark session on secure YARN:

Exception: krb_error 41 Message stream modified (41)

Although for many others the problem was the upper/lower case of configurations, in my case the problem was the renew_lifetime = 7d setting in the /etc/krb5.conf file. Removing it did the trick.
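In practice, that meant a [libdefaults] section roughly like this (a sketch, assuming the same realm used earlier; the offending line is shown commented out rather than deleted):

[libdefaults]
  default_realm = IMANAKBARI.COM
  # renew_lifetime = 7d    # removing this line fixed "Message stream modified (41)" inside the container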

Conclusion

There are a lot of bells and whistles in the Hadoop/Kerberos configuration. I recommend (1) taking the time to understand the core concepts, (2) using Ambari to generate the configs and keytabs, at least the first time you deal with this, and (3) paying attention to the principal names in the Hadoop XML config files matching the ones in the generated keytabs.

Acknowledgments

Credit is due to Faizul Bari for the KRB installation instructions.

