Access Azure Blob Storage from Spark

Ankur
Jun 20

Over the weekend I was working on a big data platform, and during the POC I found that library versions matter a great deal on such a platform. If you mix incompatible versions, things will not work as expected.

This blog will help you integrate Apache Spark with Azure Blob Storage as a data lake.

For this activity we need the following:
- Azure account (Blob Storage)
- Linux Instance
- Spark-3.1.2

Let’s start with Azure Blob Storage. For that, you need an Azure account, in which you create a storage account as shown below.

Azure Storage Account

Now we will create a container. Blob storage keeps all blobs inside containers, so Spark needs one to read and write its data.

Azure storage container: store spark data
Access Keys

We will use the Azure access keys to access blob storage from Spark through core-site.xml.

Half of our POC is now done, so let’s move on to the next task: Apache Spark. To set up Spark we need Java installed on the server; I am assuming that everyone knows how to install Java on Linux. Here I am using CentOS 7.

Download Apache Spark from the official website.

Once the tar file is downloaded, extract it and move it to the /opt directory. Then define the required environment variables in the ~/.bashrc file.

Now we need the Azure jar libraries so that Spark can access Azure Blob Storage. For that, we need the following jar files:

  • azure-storage-2.0.0.jar
  • azure-storage-blob-12.0.0.jar
  • hadoop-azure-2.7.7.jar
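
Since a missing or mismatched jar is the most common reason this setup fails, a quick check can help before starting Spark. The snippet below is only a minimal sketch: it assumes the jars were copied into Spark's `jars` directory, and the helper name `missing_jars` and the path in the usage note are hypothetical, not part of the original setup.

```python
from pathlib import Path

# Jar names as listed in this post; adjust them if you use other versions.
REQUIRED_JARS = [
    "azure-storage-2.0.0.jar",
    "azure-storage-blob-12.0.0.jar",
    "hadoop-azure-2.7.7.jar",
]


def missing_jars(jars_dir, required=REQUIRED_JARS):
    """Return the required jar names that are not present in jars_dir."""
    # Collect the file names of every jar found directly in the directory.
    present = {p.name for p in Path(jars_dir).glob("*.jar")}
    return [name for name in required if name not in present]
```

For example, `missing_jars("/opt/spark/jars")` (an assumed install path) returns an empty list when all three jars are in place, and otherwise the names that still need to be copied.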

Now we need to add the Azure access keys to core-site.xml.
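
To give an idea of what that file can look like, here is a minimal sketch that generates it. The property name `fs.azure.account.key.<account>.blob.core.windows.net` is the one the hadoop-azure connector reads for account keys; the function name `build_core_site` and the account name and key in the usage note are placeholders I made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_core_site(account_name, access_key):
    """Return core-site.xml content holding one storage account key."""
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    # Property name read by the hadoop-azure (WASB) connector.
    ET.SubElement(prop, "name").text = (
        f"fs.azure.account.key.{account_name}.blob.core.windows.net"
    )
    ET.SubElement(prop, "value").text = access_key
    ET.indent(configuration)  # pretty-print; requires Python 3.9+
    return ET.tostring(configuration, encoding="unicode")
```

Calling `print(build_core_site("mystorageaccount", "YOUR_ACCESS_KEY"))` prints a `<configuration>` element with a single `<property>`. You would then save that output as core-site.xml in a directory that Spark reads its Hadoop configuration from, which is typically `$SPARK_HOME/conf`; treat that location as an assumption and adjust it to your installation.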

Let’s try to access Azure Blob Storage from Spark.
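
A rough PySpark sketch of that test is shown here. It assumes PySpark is available, that the jars and core-site.xml described above are in place, and that the container holds a CSV file; the container name, account name, file path, and both helper names are hypothetical.

```python
def wasbs_uri(container, account_name, path=""):
    """Build a wasbs:// URI for a blob path inside a container."""
    base = f"wasbs://{container}@{account_name}.blob.core.windows.net"
    return f"{base}/{path.lstrip('/')}"


def read_csv_from_blob(uri):
    """Read a CSV file from blob storage into a Spark DataFrame."""
    # Imported here so the URI helper works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("blob-storage-poc").getOrCreate()
    # The access key is expected to come from core-site.xml.
    return spark.read.csv(uri, header=True)
```

For instance, `read_csv_from_blob(wasbs_uri("sparkdata", "mystorageaccount", "input/data.csv")).show()` would print the first rows of the file if everything is wired up correctly. If the jars are missing or mismatched, this is usually the step where Spark fails with a class-not-found or filesystem error.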

TL;DR

Create an Azure storage account. Download Apache Spark and the required Azure library jar files. Update the access keys in core-site.xml.
Please make sure the jar versions you use are compatible with your Spark version.

Opstree

Opstree is an end-to-end DevOps consultant company

Written by

Ankur

DevOps Engineer with 10+ years of experience in the IT Industry. In-depth experience in building highly complex, scalable, secure and distributed systems.
