Customising HDInsight clusters with script actions and ARM templates

Andrew Kelleher
Published in Azure Architects · 5 min read · Nov 6, 2019

Having recently worked with Azure’s HDInsight service, I’ve come to appreciate that whilst it’s a PaaS service, there’s often a significant amount of custom configuration required.

If you’re new to HDInsight, it’s one of Azure’s original services and provides managed Hadoop in the cloud. It’s popular for open-source big data scenarios involving Spark, Hive, Kafka, etc.

Azure supports several methods for customising your Linux-based clusters, including bootstrap scripts and script actions. In this post I’ll be focusing on script actions.

What is a script action?

A script action is a regular Bash script that runs against the nodes within your HDInsight cluster. Scenarios where you might want to customise your cluster include -

  • Configuration changes via Ambari, e.g. enabling ACID transactions
  • Downloading and installing additional components, e.g. Kafka Connect
  • Manipulating configuration files, e.g. core-site.xml

In fact, as you’re running a regular Bash script you have a lot of flexibility in configuring your clusters.

How do I run script actions?

Azure provides a number of ways to run your script actions -

  • Azure Portal
  • ARM templates
  • Azure PowerShell and CLI
  • HDInsight .NET SDK

You also need to consider the following -

  • Which cluster nodes will the script run against, e.g. head or worker nodes?
  • Does the script need to be persisted i.e. should it automatically run on new nodes that are added to the cluster?

Running scripts via the Azure Portal is a good place to start, especially when you’re developing a script and want to test it quickly against a sandbox cluster.

However, you’ll eventually want to automate the running of the scripts as part of the cluster deployment.

There’s some great Microsoft reference documentation on script actions, but it can be a little daunting to get started with enabling scripts as part of an ARM deployment.

So for the rest of this post we’ll focus on just that — automating your HDInsight scripts via ARM templates!

ARM template deployment

In this scenario we’re going to run two Bash scripts as part of an HDInsight Spark cluster deployment. The two scripts are -

  • enable-acidtransactions.sh
  • install-kafkahdfsconnect.sh

The first script enables ACID transactions in Hive. The second downloads HDFS Connect, untars it, and deploys it to the cluster.

You can take a closer look at the scripts on GitHub here. The HDFS Connect installation files can be downloaded directly from Confluent here.

ARM template breakdown

Let’s break down an example HDInsight ARM template a little.

The type of HDInsight cluster being deployed, e.g. Kafka, Spark, etc., determines the number and types of server roles that are deployed.

In our example Spark deployment we’re deploying 2 x head nodes and 2 x worker nodes.
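Here’s a trimmed sketch of the relevant computeProfile section of the template (the VM sizes are illustrative, and the osProfile details are omitted for brevity) -

"computeProfile": {
    "roles": [
        {
            "name": "headnode",
            "targetInstanceCount": 2,
            "hardwareProfile": {
                "vmSize": "Standard_D12_v2"
            },
            "scriptActions": []
        },
        {
            "name": "workernode",
            "targetInstanceCount": 2,
            "hardwareProfile": {
                "vmSize": "Standard_D13_v2"
            },
            "scriptActions": []
        }
    ]
}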

You’ve probably already spotted that the section we’re most interested in is the “scriptActions” section. It’s here that we’ll reference the scripts we want to run. The scriptActions object requires the following values -

  • name — the name of the script action
  • uri — the location URI for the script
  • parameters — the parameters for the script
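As a sketch, a single entry might look like this (the URI below is a placeholder until we’ve hosted the scripts, which we’ll do next, and we’ll come back to the parameters value later) -

"scriptActions": [
    {
        "name": "enable-acid-transactions",
        "uri": "https://<storage-account>.blob.core.windows.net/scripts/enable-acidtransactions.sh",
        "parameters": "node"
    }
]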

Hosting the scripts

During the initial deployment, the scripts need to be hosted in a location that’s accessible from the HDInsight cluster. Supported locations include -

  • Azure Data Lake Storage Gen1
  • Azure blob storage that’s configured as the primary or secondary cluster storage
  • A public file-sharing service accessible through http:// paths, e.g. Azure Blob storage, GitHub, etc.

For simplicity, we’re going to use the last option and upload the scripts to public Azure blob storage.

For production scripts, consider carefully whether publicly hosting your scripts is acceptable to your organisation, especially if they contain data or references that shouldn’t be exposed externally.

I’ve created a storage account called hdinsightscripts and a blob container named scripts.

Within the blob container I’ve uploaded the two scripts and the installation file required by the Kafka HDFS script.
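With the storage account and container named as above, the scripts are now reachable at the standard public blob URLs -

https://hdinsightscripts.blob.core.windows.net/scripts/enable-acidtransactions.sh
https://hdinsightscripts.blob.core.windows.net/scripts/install-kafkahdfsconnect.sh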

Updating the ARM template

The next step is to update the ARM template to reference the scripts located on blob storage.

The relevant excerpts from the updated template are shown below. The updated sections are -

  • Parameters — added parameters for the script action URIs
  • Resources — added scriptActions for both the headnode and workernode sections

Even if your script doesn’t require any parameters, I’ve found that ARM requires a value here. In our example I’ve just used a dummy value of “node”.
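As the full template is quite long, here’s a sketch of just the updated sections (the parameter names are my own choice; adapt them to suit your template). First the new parameters -

"parameters": {
    "scriptActionUriAcid": {
        "type": "string",
        "defaultValue": "https://hdinsightscripts.blob.core.windows.net/scripts/enable-acidtransactions.sh"
    },
    "scriptActionUriKafkaHdfs": {
        "type": "string",
        "defaultValue": "https://hdinsightscripts.blob.core.windows.net/scripts/install-kafkahdfsconnect.sh"
    }
}

And then the scriptActions arrays, added to both the headnode and workernode roles -

"scriptActions": [
    {
        "name": "enable-acid-transactions",
        "uri": "[parameters('scriptActionUriAcid')]",
        "parameters": "node"
    },
    {
        "name": "install-kafka-hdfs-connect",
        "uri": "[parameters('scriptActionUriKafkaHdfs')]",
        "parameters": "node"
    }
]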

Deploying the ARM template

I’ll now deploy the updated ARM template from my local machine. Ideally, you’ll want to host your ARM templates in source control and deploy them via a CI/CD platform such as Azure DevOps.

We’ll use the New-AzResourceGroupDeployment cmdlet to deploy the template -

New-AzResourceGroupDeployment -Mode Incremental `
    -ResourceGroupName hdinsighttest-rg `
    -TemplateFile ".\azuredeploy.json" `
    -TemplateParameterFile ".\azuredeploy.parameters.json" `
    -Verbose

Whilst the cluster deploys, we can check on the status of the scripts in the Azure Portal. The scripts have been persisted to the cluster but haven’t run yet, as the cluster is still deploying.

Once the cluster deployment has finished, we can check the script actions again to verify that the scripts have run successfully.

The scripts have now been deployed and run. As they’re persisted, they’ll automatically run again when you scale out the cluster nodes.

Summary

That’s it! A few other things worth considering are -

  • As a best practice, ensure both your ARM templates and scripts are stored in source control, e.g. an Azure DevOps repo
  • Consider how you’ll copy scripts from source control to the storage accounts. Feasibly, this could be managed via a commit trigger on the code repository that runs an Azure DevOps pipeline to copy the scripts
  • If using a public storage account ensure no sensitive information is embedded in the scripts
  • Use the HDInsight helper methods when writing scripts to manage your clusters

A final side note if you prefer Terraform to ARM templates: unfortunately, the Terraform HDInsight provider doesn’t yet support script actions, so you’ll need to embed the HDInsight ARM template within the Terraform file.

I’ve been using this approach successfully with a client who uses Terraform; although it works fine, it’s a little clunky to manage.

I hope you found this post useful and feel free to reach out if you have any questions.
