Enabling Data Governance at Bazaar with Apache Ranger (Part-1)

Micro-managing Data Governance across Hadoop

Published in Bazaar Engineering, Oct 18, 2022

Bazaar is growing and making a name for itself in the industry. With growth comes an influx of data, some of it sensitive and confidential, such as PII (personally identifiable information). As a company we need tight control over this data without hindering the internal users who depend on it for their work, all while keeping a firm grasp on GDPR compliance so that personal data stays anonymized for those internal users.

This is a challenge, because extremely fine-grained access control is complex to manage. This is where Apache Ranger comes in!

What is Apache Ranger?

Apache Ranger is a framework to enable, monitor, and manage comprehensive data security across the Hadoop platform.
Apache Ranger has the following features:

Centralized security administration to manage all security-related tasks in a central UI or through REST APIs.
Fine-grained authorization to do a specific action or operation with a Hadoop component or tool, managed through a central administration tool.
A standardized authorization method across all Hadoop components.
Centralized auditing of user access and administrative actions (security related) within all the components of Hadoop.

Apache Ranger uses two key components for authorization:

Apache Ranger policy admin server
Apache Ranger plugin

Setting up Ranger is divided into 3 steps:

Setting up PostgreSQL with the Ranger Admin Server
Setting up Ranger Trino plugin with Trino
Setting up policies to govern our data

Pre-requisites

Kubernetes
Helm
Docker
Trino

Getting started

If you want to build Apache Ranger from source, including the Trino plugin, use this repository. This step is optional, as the build artifacts will be linked where needed. Now let’s get right into it!

Setting up Postgres

First of all, we have to set up Postgres on our local Kubernetes system using Helm and Minikube. You can use k3s or kind as well if you are not familiar with Minikube.

To begin, make a ranger directory; this will be our root. Then open a terminal there and add the Helm repository that hosts the Postgres chart.

We will be using the Helm chart provided by Bitnami. After adding the repo, we can confirm it by listing our Helm repositories (helm repo list).
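The exact commands for this step are not shown here, so the following is only a minimal sketch of what they could look like. The repository alias repo is an assumption, chosen because the install command later in this section refers to the chart as repo/postgresql; the URL is Bitnami's public chart repository.

```shell
# Create the project root for this walkthrough (safe to re-run).
mkdir -p ranger

# Register Bitnami's chart repository.
# ASSUMPTION: the alias "repo" is picked to match `repo/postgresql`
# in the helm install command used later in this section.
helm repo add repo https://charts.bitnami.com/bitnami
helm repo update

# Confirm the repository is registered and the PostgreSQL chart is visible.
helm repo list
helm search repo repo/postgresql
```

If you prefer another alias (for example bitnami), remember to change the chart reference in the install command to match.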

Now we have to make our values.yaml file to configure our own Postgres deployment. The values.yaml file sets the key values for the deployment. First we set up the auth for PostgreSQL itself, and then the auth for our Postgres user.

Most of these properties are self-explanatory. The important ones are the database name, the credentials for accessing Postgres, and the user. Ranger will use these to communicate with the DB.
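As a rough illustration of the properties described above, the sketch below writes a minimal values.yaml. It assumes the auth.* key layout used by recent versions of the Bitnami PostgreSQL chart (check yours with helm show values repo/postgresql). The superuser password postgres and the database name ranger match the psql command used later in this section; the username rangeradmin and its password are hypothetical placeholders.

```shell
# Write a minimal override file for the Bitnami PostgreSQL chart.
# ASSUMPTION: the chart version in use reads its credentials from `auth.*`.
cat > values.yaml <<'EOF'
auth:
  # Password for the built-in "postgres" superuser
  # (matches PGPASSWORD in the psql check later on).
  postgresPassword: "postgres"
  # Application user for Ranger. HYPOTHETICAL name and password.
  username: "rangeradmin"
  password: "<choose-a-password>"
  # Database that Ranger will write to.
  database: "ranger"
EOF
```

Keeping only the overridden keys in this file is a design choice: Helm merges it over the chart defaults, so everything not listed keeps its default value.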

Once the above changes are made to the values.yaml file, we can install Postgres on our Minikube environment:

$ helm install postgres -f values.yaml repo/postgresql -n postgres --create-namespace

If everything was followed correctly, Helm prints a summary of the deployed release in your console.

Now that our Helm chart has installed, it’s time to check whether Postgres started successfully. To do that, run:

$ kubectl get pods -n postgres

It will take a few seconds. When the pod shows as ready, your deployment was successful. Yay!

To access our DB and check that the database was created as we wanted, we perform the following steps. The port-forward keeps running in the foreground, so run psql from a second terminal:

$ kubectl port-forward --namespace postgres svc/postgres-postgresql 5432:5432
$ PGPASSWORD="postgres" psql --host 127.0.0.1 -U postgres -d ranger -p 5432

You should now be in the Postgres CLI. Type \l to list all databases; the ranger database should appear in the list.

Finally, it’s time to get to the good part: setting up our Ranger admin server!

Apache Ranger

This part is going to be tricky, so hold on; it will make sense in the end. The first step is to create the Docker image for our Ranger Admin server. It helps to have a Docker Hub account, but you can still continue without one.

Dockerfile

To explain the Dockerfile: we take a base image from phusion and set up the environment that Ranger needs. We also fetch the Postgres connector, so that Ranger can communicate with the database we set up earlier, and lastly we pull in our Ranger tar file. You can either use the file you got by building Ranger from source or, if you want to get right into it, use the prebuilt artifact referenced in the Dockerfile.
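The Dockerfile itself is not reproduced here, so the sketch below shows one way the steps just described might fit together. Everything specific in it is an assumption: the base image tag, the package names, the JDBC driver version, and the /root paths. It also copies the Ranger tarball from the build context (as if built from source) instead of downloading the prebuilt artifact, and it expects install.properties and docker-entrypoint.sh to exist next to it before you build.

```shell
# Write a sketch Dockerfile for the Ranger admin image.
cat > Dockerfile <<'EOF'
# ASSUMPTION: image tag; the post only says the base image is from phusion.
FROM phusion/baseimage:focal-1.0.0

# Java for Ranger itself, Python for its setup scripts, curl for downloads.
# ASSUMPTION: package names and versions suit this base image and Ranger 2.1.0.
RUN apt-get update \
 && apt-get install -y --no-install-recommends openjdk-8-jdk python2 curl \
 && rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

WORKDIR /root

# PostgreSQL JDBC connector, stored in the root folder.
# ASSUMPTION: driver version.
ARG PG_JDBC_VERSION=42.2.23
RUN curl -fsSL -o /root/postgresql.jar \
    https://jdbc.postgresql.org/download/postgresql-${PG_JDBC_VERSION}.jar

# Ranger admin tarball, taken from the build context.
COPY ranger-2.1.0-admin.tar.gz /root/

# The two extra files described in the rest of this section.
COPY install.properties docker-entrypoint.sh /root/
RUN chmod +x /root/docker-entrypoint.sh
EOF
```

The quoted heredoc delimiter keeps ${PG_JDBC_VERSION} literal, so it is expanded by Docker at build time and not by your shell.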

Now that we have our Dockerfile, we could build the image and push it to Docker Hub so that our Helm deployment can pull it later on. However, don’t do that just yet!

Ranger uses shell scripts to set up its environment and configuration, so we need to make 2 more files:

install.properties (to provide environment variables to Ranger’s scripts)
docker-entrypoint.sh (to make sure Ranger uses our configuration, not its defaults)

You can get the base install.properties file and make the necessary edits for your deployment.

install.properties

Ranger offers multiple choices for its database. In our case we choose Postgres as our DB_FLAVOR and supply the root user and password we set for Postgres in the previous section. If you remember, in the Dockerfile we downloaded our connector into the root folder, so we give Ranger that same path. Ranger also needs to know the location of the database (db_host). Since we are installing Ranger in the same environment as Postgres, we can get our DB host by running:

kubectl get svc -n postgres

The output lists the postgres-postgresql service along with its cluster IP.

We use this service IP in our db_host property with the default port, like so:

db_host=10.102.220.12:5432

Next we tell Ranger the name of the database it will write to and its credentials, so alter the db_name, db_user, and db_password properties.

We also set the passwords for the 4 users that Ranger creates automatically. The first, the admin user, is the most important: it lets us work with the Ranger admin server as an administrator. The other three belong to additional services that Ranger provides:

Ranger Tag Sync: used to synchronize the tag store with an external metadata service such as Apache Atlas.
Ranger User Sync: used to sync users from an external LDAP/UNIX store.
Key admin: this is the KMS service provided by Apache Ranger.

Unfortunately, these three are outside the scope of this article (maybe in a future one). However, you can still explore Key admin using the username keyadmin and the password you set above. With this, our properties are configured, and the rest stay at their defaults.
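To make the edits from this section concrete, here is a minimal sketch that applies them to install.properties. The helper set_prop is hypothetical (it is not part of Ranger), and the connector path, db_user, and all passwords are placeholders you should replace. The property names are the ones found in Ranger's base install.properties; db_host and the Postgres root credentials are the values used earlier in this post.

```shell
# Hypothetical helper: set KEY=VALUE in a properties file.
# Replaces the line if the key exists, appends it otherwise.
# Values must not contain "|", "&" or "\" (they are special to sed here).
set_prop() {
  key=$1 value=$2 file=$3
  if grep -q "^${key}=" "$file"; then
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" \
      && mv "${file}.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Harmless if you already downloaded the base install.properties.
touch install.properties

# Database flavor and JDBC connector (ASSUMPTION: connector saved at this path).
set_prop DB_FLAVOR POSTGRES install.properties
set_prop SQL_CONNECTOR_JAR /root/postgresql.jar install.properties

# Postgres root credentials and location from the previous section.
# A service DNS name such as postgres-postgresql.postgres.svc.cluster.local
# may be more stable than the cluster IP (assumes the default cluster domain).
set_prop db_root_user postgres install.properties
set_prop db_root_password postgres install.properties
set_prop db_host 10.102.220.12:5432 install.properties

# Database Ranger writes to, and its user (HYPOTHETICAL user and password).
set_prop db_name ranger install.properties
set_prop db_user rangeradmin install.properties
set_prop db_password '<choose-a-password>' install.properties

# Passwords for the four users Ranger creates (placeholders).
set_prop rangerAdmin_password '<choose-a-password>' install.properties
set_prop rangerTagsync_password '<choose-a-password>' install.properties
set_prop rangerUsersync_password '<choose-a-password>' install.properties
set_prop keyadmin_password '<choose-a-password>' install.properties
```

Because set_prop replaces existing keys in place, the same commands work whether you start from an empty file or from the base install.properties.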

For our last file, we create docker-entrypoint.sh:

touch docker-entrypoint.sh

Now let’s fill in our bash script.

docker-entrypoint.sh

In this script we unpack the ranger-2.1.0-admin.tar.gz that we downloaded into our Docker image and move our properties file to the correct location so Ranger can use it. Then we start the setup script so Ranger can do its magic: connect to our Postgres and generate the default users. Lastly, we start the service. With this, our setup is ready and we can get to deploying our Ranger service. Now you can build the Docker image and push it to Docker Hub with the following commands:
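The steps just described (unpack, place the properties file, run setup, start the service) could be sketched as the script below. The /root paths are assumptions and must agree with wherever your Dockerfile puts the tarball and install.properties; the final tail is one simple way to keep the container in the foreground, not necessarily what the original script does.

```shell
# Write a sketch entrypoint script and make it executable.
cat > docker-entrypoint.sh <<'EOF'
#!/bin/bash
set -e

# ASSUMPTION: tarball and install.properties were placed in /root by the image.
cd /root

# Unpack the Ranger admin tarball (skipped if already unpacked).
if [ ! -d ranger-2.1.0-admin ]; then
  tar -xzf ranger-2.1.0-admin.tar.gz
fi

# Put our install.properties where Ranger's setup script reads it.
cp /root/install.properties ranger-2.1.0-admin/install.properties

# Run Ranger's setup: connects to Postgres and creates the default users.
cd ranger-2.1.0-admin
./setup.sh

# Start the admin service, then keep the container alive.
ranger-admin start
tail -f /dev/null
EOF
chmod +x docker-entrypoint.sh
```

With set -e, a failed setup makes the script exit, so Kubernetes restarts the pod and setup is retried on the next start.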

docker build -t <your-docker-hub-username>/ranger-admin:v1.0 .
docker push <your-docker-hub-username>/ranger-admin:v1.0

With all our material ready and the image ready to be pulled, let’s make our Ranger Helm chart.

helm create ranger

This will create a template for our Ranger Helm chart. With just a few changes in our values.yaml we can get started.

Provide your Docker image repo so that Helm knows where to pull the image from.

Make sure the service port is 6080, which is where Ranger will be running.
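As a small sketch of these two changes, the block below writes them to a separate override file. The key names follow the scaffold that helm create generates; the file name ranger-overrides.yaml is hypothetical, and the image name and tag are the ones used in the docker build command above.

```shell
# Write the two settings as an override file for the scaffolded chart.
# HYPOTHETICAL file name; the keys match the `helm create` scaffold.
cat > ranger-overrides.yaml <<'EOF'
image:
  # Where Helm pulls the Ranger admin image from.
  repository: "<your-docker-hub-username>/ranger-admin"
  tag: "v1.0"
service:
  type: ClusterIP
  # Port on which Ranger admin listens.
  port: 6080
EOF
```

You can either copy these values into ranger/values.yaml directly, or leave that file untouched and pass this one to helm install with -f ranger-overrides.yaml.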

Now we make just one small change in our templates/deployment.yaml file: we set the container command.

This command runs our docker-entrypoint.sh script whenever our pod is deployed. Make sure containerPort is 6080 as well. With this, let’s deploy our Ranger Helm chart right away.
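For reference, the change to the deployment template might look like the fragment below. It is written to a scratch file (hypothetical name) only so it is easy to copy from; the fields belong in the first entry under spec.template.spec.containers of ranger/templates/deployment.yaml. The script path /root/docker-entrypoint.sh is an assumption and must match where your image stores the script.

```shell
# Save the container fields to merge into templates/deployment.yaml.
# HYPOTHETICAL file name; this file is not read by Helm.
cat > deployment-container-snippet.yaml <<'EOF'
# Merge into the first item of spec.template.spec.containers:
command: ["/bin/bash", "/root/docker-entrypoint.sh"]
ports:
  - name: http
    containerPort: 6080
    protocol: TCP
EOF
```

If the pod keeps restarting long after deployment, the liveness and readiness probes that the scaffold adds by default may be worth relaxing, since Ranger's setup can take a while before the port answers.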

helm install ranger path/to/ranger/helm -n ranger --create-namespace

Helm should print a summary of the new release.

Let’s check that our service port is correct:

kubectl get svc -n ranger

The PORT(S) column should show 6080, so we can see it’s correct.

Now, if we inspect our pods in the ranger namespace, we may see that the pod has restarted.

It’s fine for it to restart a few times: Ranger performs admin actions in Postgres on startup, and Postgres might be busy at times. Let it run for a while and it will start. And there you have it, our Ranger is now up!

To open its UI, we port-forward the pod to port 6080 and access it via localhost:6080:

kubectl --namespace ranger port-forward <your-pod-name> 6080:6080

Now go to localhost:6080 in your browser.

Log in to Ranger admin with the credentials we provided.

And there you have it: you have successfully deployed Apache Ranger on Kubernetes!

In the next part we will look into setting up the Trino plugin and attaching it to a Trino server so we can apply policies via Ranger!

Disclaimer:

Bazaar Technologies believes in sharing knowledge and freedom of expression, and it encourages its colleagues and friends to share knowledge, experiences and opinions in written form on its Medium publication, in the hope that some people across the globe might find the content helpful. However, the content shared in this post and other posts on this Medium publication mostly describes and highlights the opinions of the authors, which might or might not be the actual and official perspective of Bazaar Technologies.