Apache Ranger Docker POC With Hadoop (HDFS, Hive, Presto)
If you are digging into an on-prem Hadoop data platform like me, you have definitely encountered authorization and authentication problems across the many services in your platform. Our data platform currently handles them with a mixture of Apache Sentry and an internal LDAP, but it needs manual work whenever we need 1) to satisfy some out-of-the-ordinary requirements and 2) to add new services that are not compatible with the current structure.
So, here I share a thorough overview (and Docker hands-on) of Apache Ranger, the result of my POC to evaluate an alternative solution to the above problem. I hope it helps others who face similar circumstances.
Here we need two ‘docker-compose up’ runs: one for the Ranger cluster and the other for the Hadoop environment. I first give a brief overview of the container structure and then show how each Hadoop component (HDFS, Hive, Presto) works with Ranger.
Requirements
- Git
- Docker with enough memory (recommendation — more than 12 GB)
- Basic Knowledge of Apache Ranger, Hadoop Stack
docker-compose up
Let’s pull the git repo and run docker-compose up as follows:
git clone https://github.com/kadensungbincho/apache-ranger-docker-poc.git
docker network create ranger-env
cd apache-ranger-docker-poc/docker-composes/ranger
docker-compose up -d
cd ../hadoop
docker-compose up -d
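To confirm that both stacks actually came up, a quick check like the following helps (the container names, e.g. ranger-admin, are the ones used by the repo’s compose files, so adjust if yours differ):
# in your terminal: list running containers and their status
docker ps --format 'table {{.Names}}\t{{.Status}}'
# follow the Ranger admin logs until it finishes bootstrapping
docker logs -f ranger-admin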
When you type the first ‘up’
You get 4 containers:
- ranger-admin: this container runs the Ranger Security Admin service, which offers 1) policy management, 2) audit log viewing, and 3) Ranger user management through a web UI. As you can see, it uses a MySQL DB as its data store (policies, users) and Elasticsearch as the audit log store.
- ranger-usersync: the Ranger user synchronization service (in this case it syncs Unix users from inside the container; you can run multiple usersync instances, like the gray one in the diagram that syncs from LDAP).
- ranger-db: a container for MySQL.
- ranger-es: a container for Elasticsearch that stores audit logs from the agents in the supported services (HDFS, Hive, Presto, etc.).
After you wait for 30 ~ 60 seconds, you can check http://localhost:6080 with the default username/password (admin/rangeradmin1):
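If you prefer the command line, the Ranger admin also exposes a public REST API on the same port; a quick sanity check with the credentials above might look like this (it should return an empty list at this point, since no service is registered yet):
# list the services registered in Ranger
curl -s -u admin:rangeradmin1 http://localhost:6080/service/public/v2/api/service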
When you type the second ‘up’
It starts up the Hadoop stack so that we can try the integrated tests using services like HDFS, Hive, and Presto:
There are three points I want to make about the above diagram, highlighted in bold:
- Plugin: every service that uses Apache Ranger embeds an agent (plugin) specialized for that service (here HDFS, Hive, and Presto).
- Audit logs to ES: audit logs are sent directly from the agent (plugin) to the audit store (Elasticsearch in this case).
- Policy checking and applying through the Ranger Security Admin: the agents validate whether a user’s request to a resource is authorized by checking the policies on the Ranger Security Admin (the plugin also caches policies; a rough way to poke at this follows this list).
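As far as I understand, each plugin periodically pulls its policies from the admin and keeps a local cache, which is why enforcement keeps working even when the admin is briefly unreachable. A rough way to poke at this is sketched below; the service name (hdfs-poc), the container name (namenode), and the cache path are assumptions based on Ranger defaults, not values checked against this repo:
# ask the admin for the current policies of a service (may require auth depending on the setup)
curl -s 'http://localhost:6080/service/plugins/policies/download/hdfs-poc'
# inspect the plugin's local policy cache inside the service container (default cache directory)
docker exec -it namenode ls /etc/ranger/hdfs-poc/policycache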
How each component works with Ranger
Now, we can test how Ranger handles each component’s request in the above environment. We check how we can set the configuration for each one and validate whether it works as we intended.
HDFS
Ranger evaluates a user’s request on a specific resource (an HDFS path) in two stages:
- It checks whether there is a policy applied to the resource and ‘enforces’ it if one exists (in that case the audit logs below show ‘ranger-acl’).
- Otherwise, it falls back to and enforces the ‘hadoop-acl’ (see the config sketch after this list).
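As a side note, this fallback is driven by the plugin configuration on the namenode. A minimal sketch, assuming the default property name used by the HDFS plugin and a typical config location (both assumptions, not checked against this repo):
# in the 'namenode' container: the setting that enables falling back to hadoop-acl
docker exec -it namenode grep -A 2 'xasecure.add-hadoop-authorization' \
  /opt/hadoop-3.2.1/etc/hadoop/ranger-hdfs-security.xml
# <name>xasecure.add-hadoop-authorization</name>
# <value>true</value>    -> when no Ranger policy matches, fall back to HDFS permissions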
Below we check the first case: what happens if we set some policies on the HDFS resources.
After you get in through ‘Access Manager’ -> ‘HDFS +’, set the HDFS service configuration like the following image and save it (sometimes the ‘Test Connection’ fails, so keep going even if it does):
- Service Name: I set it by changing it here, so the setting above should match it.
- Username and Password: the account used in the HDFS client to access the Ranger admin (the default user is admin, and I changed the password as mentioned before).
- Namenode URL: nothing surprising if you are familiar with Docker; it uses the container hostname. Changed here. A REST sketch of the same registration follows this list.
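For reference, the same registration can also be done against the admin’s REST API instead of the UI. The sketch below is illustrative only: the service name, namenode hostname, and port are assumptions and should match whatever you typed into the form above:
# create an HDFS service (repository) in Ranger; names and URL are example values
curl -s -u admin:rangeradmin1 -H 'Content-Type: application/json' \
  -X POST http://localhost:6080/service/public/v2/api/service -d '{
    "name": "hdfs-poc",
    "type": "hdfs",
    "configs": {
      "username": "admin",
      "password": "rangeradmin1",
      "fs.default.name": "hdfs://namenode:8020"
    }
  }'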
Then, you can check ‘Audit’ -> ‘Plugins’ to make sure that the HDFS plugin is correctly set.
HDFS TEST1 — Block All User Access on the root path of HDFS
So, let’s add an HDFS policy in the Ranger web UI and check. First, move to ‘Hdfs’ -> ‘Add New Policy’.
Then, set the policy like the following image:
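Since the screenshot is hard to reproduce in text, here is roughly what an equivalent policy looks like through the REST API. The service name is an example and the exact items in the image may differ; the key part is the deny item on the root path applied to the built-in ‘public’ group (i.e. all users):
# a recursive deny policy on '/' for every user
curl -s -u admin:rangeradmin1 -H 'Content-Type: application/json' \
  -X POST http://localhost:6080/service/public/v2/api/policy -d '{
    "service": "hdfs-poc",
    "name": "blockRootAccess",
    "resources": { "path": { "values": ["/"], "isRecursive": true } },
    "denyPolicyItems": [{
      "groups": ["public"],
      "accesses": [
        { "type": "read",    "isAllowed": true },
        { "type": "write",   "isAllowed": true },
        { "type": "execute", "isAllowed": true }
      ]
    }]
  }'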
Now, if you move into the ‘datanode’ container and try to access HDFS as a non-owner user (hive), you are blocked.
# in your terminal
docker exec -it datanode bash

# in the 'datanode' container as root user
useradd hive
su hive

# in the 'datanode' container as hive user
/opt/hadoop-3.2.1/bin/hdfs dfs -ls /user/hive
ls: Permission denied: user=hive, access=EXECUTE, inode="/user/hive"
And you can check the audit log like this:
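Besides the ‘Audit’ tab in the UI, the raw audit documents also land in Elasticsearch. Assuming the ranger-es container exposes port 9200 on localhost and the default index name ranger_audits is used (both assumptions; check the compose file), something like this shows the denied request:
# search recent Ranger audit events for requests made by the user 'hive'
curl -s 'http://localhost:9200/ranger_audits/_search?q=reqUser:hive&size=5&pretty'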
HDFS TEST2 — Allow access(Recursively) to the user ‘hive’
Then we can enable access to the default Hive warehouse path (/user/hive/warehouse) for the user ‘hive’. First, we create a user in the Ranger UI and add a policy for that user.
After you get to the UI through ‘Settings’ -> ‘Users/Groups/Roles’ -> ‘Add New User’, fill in the form to create the user.
Then, create a policy.
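Again for reference, an equivalent allow policy through the REST API might look like the following; the recursive flag on the path is what makes the ‘Recursively’ part of this test work (service and policy names are examples):
# allow the user 'hive' full access under /user/hive/warehouse, recursively
curl -s -u admin:rangeradmin1 -H 'Content-Type: application/json' \
  -X POST http://localhost:6080/service/public/v2/api/policy -d '{
    "service": "hdfs-poc",
    "name": "allowHiveWarehouse",
    "resources": { "path": { "values": ["/user/hive/warehouse"], "isRecursive": true } },
    "policyItems": [{
      "users": ["hive"],
      "accesses": [
        { "type": "read",    "isAllowed": true },
        { "type": "write",   "isAllowed": true },
        { "type": "execute", "isAllowed": true }
      ]
    }]
  }'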
But if you try to access by typing:
# in your terminal
docker exec -it datanode bash

# in the 'datanode' container as root user
su hive

# in the 'datanode' container as hive user
/opt/hadoop-3.2.1/bin/hdfs dfs -ls /user/hive/warehouse
ls: Permission denied: user=hive, access=EXECUTE, inode="/user/hive/warehouse"
You get the error again because the former policy (blockRootAccess) currently takes priority (Deny takes precedence over Allow).
Drop the ‘blockRootAccess’ policy and you can check that it works as intended. So here we learn that mixing ‘Deny’ and ‘Allow’ policies needs care.
Hive
The Hive agent handles resources like ‘Table’, ‘Database’, etc. The table data lives on top of the storage layer (HDFS), so Hive differs from HDFS in that a request is also affected by the HDFS policies when it physically accesses the data.
Hive also has more access levels than HDFS, so Ranger’s AccessType aggregates some of Hive’s access levels as in this image:
Then, let’s test some Hive access patterns (the plugin works on the Hive Server, not the Hive Metastore).
Hive TEST1 — Block Hive Access
First, we need to register the Hive plugin. Get to the Hadoop SQL registration page through ‘Access Manager’ -> ‘Hadoop SQL +’, then fill in the form like below (later, I changed the jdbc.url because I realized that we need a ZooKeeper for the Hive Server when we have an additional authorizer, Ranger in this case):
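For the jdbc.url field, the two usual Hive JDBC formats look like the following; hostnames and the ZooKeeper namespace are examples and depend on your compose setup:
# direct connection to a single HiveServer2
jdbc:hive2://hive-server:10000
# service discovery through ZooKeeper (roughly the direction of the later change mentioned above)
jdbc:hive2://zookeeper:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2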
And make sure that your Hive plugin is registered on the Audit page before moving on.
Then, connect to the Hive Server with beeline as the user ‘hive’ to check whether that user can access the databases or tables.
# in your terminal
docker exec -it hive-server bash

# in the 'hive-server' container as root user
/opt/hive/bin/beeline

# in the beeline console (the password is the one you set when creating the hive user)
!connect jdbc:hive2://127.0.0.1:10000 hive hivehive1
Then, sadly, you face an error like this:
20/12/17 12:56:46 [main]: WARN jdbc.HiveConnection: Failed to connect to localhost:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://localhost:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.ranger.authorization.hadoop.exceptions.RangerAccessControlException): Permission denied: user=hive, access=EXECUTE, inode="/tmp/hive"
  at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:466)
It means that the Hive Server tried to create a temporary folder for the logged-in user under the HDFS path "/tmp/hive" as the user ‘hive’ (because I turned on the doAs option; a sketch of that setting follows the output below), but it failed. We should set up the HDFS policy for the user ‘hive’ so that it can create that temporary folder. After we delete the formerly created blocking policy (‘blockRootAccess’) on HDFS, we can check that it connects properly.
!connect jdbc:hive2://127.0.0.1:10000 hive hivehive1
Connecting to jdbc:hive2://127.0.0.1:10000
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://127.0.0.1:10000> show databases;
...
INFO  : OK
INFO  : Concurrency mode is disabled, not creating a lock manager
DEBUG : Shutting down query show databases
+----------------+
| database_name  |
+----------------+
| default        |
+----------------+
1 row selected (1.072 seconds)
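As an aside, the doAs behavior mentioned above is the standard HiveServer2 impersonation setting. A minimal way to check it, assuming the default property name and that hive-site.xml sits in /opt/hive/conf in this image (an assumption):
# in the 'hive-server' container: does HiveServer2 run queries as the connecting user?
docker exec -it hive-server grep -A 1 'hive.server2.enable.doAs' /opt/hive/conf/hive-site.xml
# <name>hive.server2.enable.doAs</name>
# <value>true</value>    -> HDFS access happens as the end user, so HDFS policies apply to that user too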
Then, let’s edit the default policy on ‘all-databases, tables, columns’ to block all access for the user ‘hive’ by adding a ‘Deny’ condition like the following:
Then, your access is properly blocked:
0: jdbc:hive2://127.0.0.1:10000> show databases;
Error: Error while compiling statement: FAILED: HiveAccessControlException Permission denied: user [hive] does not have [USE] privilege on [Unknown resource!!] (state=42000,code=40000)
Conclusion
Personally, I came to understand the underlying structure of HDFS and Hive better as I debugged the problems between Ranger and the Hadoop services (they are quite tightly entangled).
I hope this short POC gives you an environment to freely explore Ranger and the Hadoop cluster.