Data Discovery + Access Control + Encryption

If you are planning to migrate analytical workloads from on-prem to the public cloud, this article is for you. Over the last 9 months I have seen organizations moving their data to the cloud struggle not only with ETL but also with an end-to-end data protection process, which consists of Auto Data Discovery, Access Control, and Encryption.

What does Auto Data Discovery mean?

The goal of ETL jobs is to land data at the storage layer, which can be ADLS, S3, GCS, a traditional database like Oracle or SQL Server, or, in the cloud, Aurora, RDS, BigQuery, Snowflake, Databricks, etc. …


Azure Kubernetes Service (AKS) offers serverless Kubernetes, an integrated continuous integration and continuous delivery (CI/CD) experience, and enterprise-grade security and governance. Unite your development and operations teams on a single platform to rapidly build, deliver, and scale applications with confidence. Source

Privacera provides an enterprise solution for centralized data governance and access management across all of an enterprise's data services.

This article is divided into 3 different parts.

Part 1 — Prerequisites

Part 2 — Setting up AKS, Kubernetes, and Helm

Part 3 — Privacera installation

Part 1

Prerequisites:

Azure Client

az login → configure the Azure CLI with your account

Kubectl
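Before moving on, it can help to confirm that both prerequisites are actually on the PATH. The snippet below is only a minimal sketch: the helper name missing_tools is my own for illustration, and it checks for the az and kubectl binaries without validating their versions or login state.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of the Azure CLI or kubectl.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the two prerequisites listed above.
missing="$(missing_tools az kubectl)"
if [ -n "$missing" ]; then
  echo "Missing prerequisites: $missing"
else
  echo "az and kubectl are both installed"
fi
```

If anything is reported missing, install it first and then run az login.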


Try this :)

[root@xx ~]# crontab -l
* * * * * /root/hostname.sh

[root@xx ~]# cat /root/hostname.sh
#!/bin/bash
hostname newhostname

I did try the following but no luck

cat /usr/share/dracut/modules.d/99base/parse-hostname.sh
type hostname >/dev/null 2>&1 || \
hostname() {
    if [ -n "$1" ]; then
        printf -- "%s" "$1" > /proc/sys/kernel/hostname
    else
        cat /proc/sys/kernel/hostname
    fi
}

if hname=$(getarg hostname=); then
    hostname "$hname"
fi

cat /proc/sys/kernel/hostname

cat > /proc/sys/kernel/hostname

new hostname

control+D

hostname

hostname -f
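The cron job above resets the hostname every minute whether or not it has changed. A gentler variant would only act when the name has drifted. Here is a rough sketch of that check, assuming the desired name is newhostname as in the script above; the helper name hostname_action is made up for this example.

```shell
# Print "update" when the current hostname differs from the desired one,
# otherwise "ok". Both values are passed in, so the check has no side effects.
# hostname_action is a hypothetical helper name.
hostname_action() {
  current="$1"
  desired="$2"
  if [ "$current" != "$desired" ]; then
    echo "update"
  else
    echo "ok"
  fi
}

# In /root/hostname.sh this could guard the actual change (needs root):
# if [ "$(hostname_action "$(hostname)" newhostname)" = "update" ]; then
#   hostname newhostname
# fi
```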


Azure Synapse is a scalable analytics service that brings together enterprise data warehousing and Big Data analytics capabilities. It gives users the freedom to query data on their terms, using either serverless or provisioned resources at scale. Azure Synapse brings these two operating models together with a unified experience to ingest, prepare, manage, and serve data for business intelligence (BI) and machine learning (ML) use cases. Source

This article provides an overview of Privacera’s “Policy Sync” module, which delivers fine-grained access control for Azure Synapse. …


Power BI is a collection of software services, apps, and connectors that work together to turn your unrelated sources of data into coherent, visually immersive, and interactive insights. Your data may be an Excel spreadsheet or a collection of cloud-based and on-premises hybrid data warehouses. Power BI lets you easily connect to your data sources, visualize and discover what’s important, and share that with anyone or everyone you want. Source

This article explains the integration of Power BI with Databricks and how fine-grained access control takes effect, with table, column, and row-level access controls.

Let’s connect to Spark data…


This is a basic demo of how much easier my life is when I use one platform to secure access for various data services like HDFS, Hive, AWS, EMR Hive, Databricks.

Looking forward to getting your feedback.

Neeraj Sabharwal


Qubole is a cloud-native data analytics platform that supports a number of enterprise-grade data processing engines such as Apache Spark, Presto, Hive, Quantum, Airflow, and more. It is used by companies like Expedia, Under Armour and Adobe.

As its popularity grows, more and more users from different departments with different roles across the enterprise are accessing data stored in Qubole. This increases the need for robust data access governance capabilities to comply with regulations like GDPR and CCPA.

Privacera, based on Apache Ranger, enables IT and data platform teams to automatically discover and classify sensitive data, define and enforce access control policies to that data, and monitor activity and report for compliance. …


Launch EC2

Steps to install ambari server

[root@ip-172-31-41-233 ~]# cd ~/.ssh/
[root@ip-172-31-41-233 .ssh]# ls
authorized_keys id_rsa id_rsa.pub
[root@ip-172-31-41-233 .ssh]# cat id_rsa

You would need the above entry for the following:

[root@ip-172-31-41-233 ~]# ambari-server setup --jdbc-db=mysql --jdbc-driver=/usr/share/java/mysql-connector-java.jar
Using python /usr/bin/python
Setup ambari-server
Copying /usr/share/java/mysql-connector-java.jar to /var/lib/ambari-server/resources/mysql-connector-java.jar
If you are updating existing jdbc driver jar for mysql with mysql-connector-java.jar. Please remove the old driver jar, from all hosts. Restarting services that need the driver, will automatically copy the new jar to the hosts.
JDBC driver was successfully initialized.
Ambari Server 'setup' completed successfully.
[root@ip-172-31-41-233 ~]#
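The setup command above copies the driver jar from /usr/share/java, so it is worth checking that the jar is there before running it. This is just a small sketch; jdbc_driver_status is a helper name I made up, and the path is the one used in the command above.

```shell
# Print "ready" if the driver jar exists and is readable, else "missing".
# jdbc_driver_status is a hypothetical helper, not an Ambari command.
jdbc_driver_status() {
  if [ -r "$1" ]; then
    echo "ready"
  else
    echo "missing"
  fi
}

driver=/usr/share/java/mysql-connector-java.jar
if [ "$(jdbc_driver_status "$driver")" = "ready" ]; then
  echo "Driver found; run: ambari-server setup --jdbc-db=mysql --jdbc-driver=$driver"
else
  echo "Driver not found at $driver; install the MySQL connector first"
fi
```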

Twitter @123nsab


I am using the following setup to test Apache Ranger policies with Tableau for EMR Hive. The goal is to do table, column and row level access control in Tableau.

The following setup is running on my Mac. EMR Hive is running and Kerberos is in place.

neeraj_mac:~ neerajsab$ kinit neerajsab@example.com

neerajsab@example.com's password:

neeraj_mac:~ neerajsab$ klist

Credentials cache: API:56B9D7E0-6DC7-46D4-91E1-710039407C26

Principal: neerajsab@example.com

Issued Expires Principal

Feb 6 15:12:56 2020 Feb 7 01:12:56 2020 krbtgt/example.com@example.com

neeraj_mac:~ neerajsab$

The principal/user neerajsab is part of the KDC, and I have a Kerberos ticket for the realm example.com.
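If you want to script this check rather than read the klist output by eye, something like the sketch below should work. It relies on klist -s, which stays silent and only sets an exit status; I am assuming your klist supports that flag (MIT Kerberos does, and as far as I know the Heimdal build on macOS does too). The function name ticket_status is my own.

```shell
# Print "valid" when klist reports a usable ticket, else "none".
# ticket_status is a hypothetical helper name; klist -s is assumed to be
# supported by the installed Kerberos client.
ticket_status() {
  if klist -s 2>/dev/null; then
    echo "valid"
  else
    echo "none"
  fi
}

echo "Kerberos ticket: $(ticket_status)"
```

If it prints none, run kinit again before opening Tableau.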

On my Mac, the private IP is listed in /etc/hosts and points to the public IP of the EMR master node and the KDC master.
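To double-check which address a name resolves to through the hosts file, a small lookup like the one below can help. Treat it as a sketch only: hosts_lookup is a made-up helper, and the host name in the usage comment is a placeholder, not a value from my setup.

```shell
# Print the IP address that a hosts file maps to the given name.
# Prints nothing when the name is not listed. Comment lines are skipped.
# hosts_lookup is a hypothetical helper name.
hosts_lookup() {
  hosts_file="$1"
  name="$2"
  awk -v n="$name" '
    $1 !~ /^#/ {
      for (i = 2; i <= NF; i++) {
        if ($i == n) { print $1; exit }
      }
    }
  ' "$hosts_file"
}

# Usage (placeholder host name; substitute your EMR master's private name):
# hosts_lookup /etc/hosts emr-master-private-name
```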

Reach out to me on Twitter @123nsab if you have any questions.


A Jupyter notebook is a web-based application used to create and share documents that contain both live code and rich text elements. It is popular with data scientists, who use Jupyter notebooks for a number of use cases including machine learning, statistical modeling, and data visualization. One reason for Jupyter’s popularity is that it is language agnostic. Data scientists can run jobs in Jupyter notebooks using the language of their choice, such as PySpark and Scala.

Data scientists at Netflix, for example, use Jupyter notebooks to analyze and better understand user behavior and to develop new models to improve the user experience, as well as to share the results of their analysis and collaborate with colleagues. …

About

Neeraj Sabharwal

Director of Sales Engineering @Privacera
