Provisioning Azure HDInsight Spark Environment with strict Network Controls

Inderjit Rana
Published in Microsoft Azure
4 min read · Jun 17, 2020

The purpose of this post is to share a reference architecture, along with provisioning scripts, for an entire HDInsight Spark environment. There are quite a few samples that show how to provision individual components of an HDInsight environment, but my goal was to showcase how to bring up an entire environment with a VNET, Storage Account, external metastore using Azure SQL DB, etc. Azure ARM templates are the Azure-native technology for Infrastructure as Code, but Terraform from the open-source world is very popular as well. I decided to use Terraform for the most part, resorting to Azure ARM templates only where I ran into Terraform limitations.

I do expect some level of familiarity with Infrastructure as Code using Azure ARM Templates and Terraform as well as basic knowledge of Azure Platform services like Azure Data Lake Gen2, Azure Bastion, Azure HDInsight, etc.

Background

When using a cloud platform, network-level security is extremely important, especially in regulated industries like health care. I have observed numerous instances where applying network controls to individual components of an environment breaks the communication between them, so what I wanted to highlight in this reference architecture is how to make sure all the pieces work together. There are many layers of security, and some aspects, like the HDInsight Enterprise Security Package, are not addressed at this point, but consider what I am sharing here as a foundation you can build upon.

Refer to my GitHub repo for the scripts to provision this entire environment with one simple command: https://github.com/isinghrana/my-azure-utils/tree/master/hdinsight. As time permits, I will add enhancements, and I would love to see contributions from the community as well.

Key Considerations

  • HDInsight is created inside a customer-managed VNET, as this gives full control over network security for the cluster.
  • The environment is secured using NSG rules that whitelist IP ranges (most likely corporate) for inbound connections.
  • The environment is reachable from outside Azure (for example, from a corporate network) only over port 443; the assumption is that VPN or ExpressRoute is not available.
  • External Ambari DB: this is recommended if you are going to run a very large HDInsight cluster (a similar concept can be used for external Oozie and Hive metastores as well).

Architecture

Network Controls around each component of the environment

HDInsight

  • HDInsight clusters are created in hdi-subnet and the NSG rules control the traffic
  • HDInsight Ambari Web UI: access is allowed from a specific set of IP ranges (most likely the corporate NAT IP range) over port 443.
  • HDInsight SSH access: connect to a VM in Azure using Azure Bastion over port 443, then SSH from that VM to the HDInsight head node. (It is also possible to have an HDInsight edge node; this sample has just head nodes and worker nodes, but the same concepts should apply to an edge node as well.) This ensures SSH ports are not accessible from outside the environment and that all access from outside Azure is on port 443.
  • Traffic from the HDInsight management IP ranges needs to be allowed; these can be region specific and are documented here: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-management-ip-addresses
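
To make the inbound rules above concrete, here is a minimal Python sketch that models them and checks whether a connection from outside the VNET would be let in. It is only an illustration of the intent, not the Terraform in the repo: the class, function, and rule names are hypothetical, the IP ranges are documentation placeholders rather than real corporate or HDInsight management addresses, and the NSG default rules for VNET-internal and load balancer traffic are deliberately left out.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class InboundRule:
    name: str
    priority: int             # lower number is evaluated first, as in an NSG
    source_cidr: str          # source address range the rule applies to
    dest_port: Optional[int]  # None means any destination port
    allow: bool


# Placeholder values (assumptions): replace with your corporate NAT range and
# the HDInsight management IPs documented for your region.
CORPORATE_NAT_RANGE = "203.0.113.0/24"
HDI_MANAGEMENT_IP = "198.51.100.10/32"

HDI_SUBNET_RULES = [
    InboundRule("allow-corp-https", 100, CORPORATE_NAT_RANGE, 443, True),
    InboundRule("allow-hdi-management", 110, HDI_MANAGEMENT_IP, 443, True),
]


def is_inbound_allowed(rules: List[InboundRule], source_ip: str, dest_port: int) -> bool:
    """Return True if the first matching rule (by priority) allows the traffic."""
    for rule in sorted(rules, key=lambda r: r.priority):
        in_range = ip_address(source_ip) in ip_network(rule.source_cidr)
        port_matches = rule.dest_port is None or rule.dest_port == dest_port
        if in_range and port_matches:
            return rule.allow
    # No custom rule matched: treat traffic from outside the VNET as denied,
    # which is what the NSG default rules do for inbound internet traffic.
    return False


if __name__ == "__main__":
    print(is_inbound_allowed(HDI_SUBNET_RULES, "203.0.113.25", 443))  # corporate, HTTPS
    print(is_inbound_allowed(HDI_SUBNET_RULES, "203.0.113.25", 22))   # corporate, SSH
    print(is_inbound_allowed(HDI_SUBNET_RULES, "192.0.2.7", 443))     # unknown source
```

The point of the sketch is that a corporate address gets in on port 443 but not on port 22, which is why SSH has to go through Azure Bastion.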

SSH Access to Head Node from within VNET

The usual SSH endpoint <clustername>-ssh.azurehdinsight.net is a public endpoint and cannot be used from within the VNET in the above setup. Instead, SSH from the VM inside the VNET to the head node at hn0-<abbreviated-clustername>, where abbreviated-clustername is the first 6 characters of the cluster name.
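
As a small illustration of this naming rule, the Python helper below derives the in-VNET head node name from a cluster name. The function name and the sample cluster name are hypothetical, and the sketch only encodes the rule as stated in this post (the hn0- prefix plus the first 6 characters); it does not check the result against a live cluster.

```python
def internal_headnode_host(cluster_name: str) -> str:
    """Return the in-VNET head node name: hn0-<first 6 chars of cluster name>."""
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    abbreviated = cluster_name[:6]  # names shorter than 6 characters are used whole
    return f"hn0-{abbreviated}"


if __name__ == "__main__":
    # "mysparkcluster" is a made-up cluster name used only for illustration.
    print(internal_headnode_host("mysparkcluster"))  # hn0-myspar
```

From the VM reached through Azure Bastion, you would then SSH to that name using the SSH user you configured for the cluster.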

Azure Bastion

  • Azure Bastion is in AzureBastionSubnet (name has to be spelled exactly like this and the subnet needs to be dedicated just to AzureBastion)
  • Access is allowed from a specific set of IP ranges (most likely Corporate NAT IP Range) over port 443
  • GatewayManager ServiceTag needs to be allowed inbound connections on Port 443

Virtual Machine (VM)

  • The VM does not need a public IP and is accessed through Azure Bastion, which is already in the VNET. Because an NSG by default allows communication from within the VNET, the default NSG is used and there is no need for additional rules.

Azure Data Lake Gen2

  • Allow access only from selected networks: the VNET and a specific set of IP ranges.
  • Azure HDInsight authenticates to ADLS Gen2 (the storage account) using a user-assigned managed identity. This identity needs to be given the Storage Blob Data Owner role on the storage account.

Azure SQL Database

  • Deny Public Access is set to No. Don't be mistaken, this is still secure: the public endpoint exists, but you explicitly whitelist who can connect to this database.
  • Allow Azure services and resources to access this server is set to No. This is the preferred setting for most customers.
  • Allow HDI Management IP Addresses (Region specific is fine)
  • Allow access from VNET using VNET Service Endpoint
