Access Azure Blob Storage from Spark

Over the weekend I was working on a big data platform, and during the POC I found that library versions matter a great deal: if you use mismatched versions, things will not work as expected.

This blog will help you integrate Apache Spark with Azure Blob Storage as a data lake.

For this activity we need the following:
- Azure account (Blob Storage)
- Linux instance
- Spark 3.1.2

Let’s start with Azure Blob Storage. For that, you need an Azure account; then create a storage account as shown below.

Azure Storage Account

Now we will create a container; Spark reads and writes its data through containers in Blob Storage.

Azure storage container: store Spark data

Access keys

Copy the storage account’s access key; we will use it to access Blob Storage from Spark via core-site.xml.
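If you prefer to script this step instead of clicking through the portal, here is a minimal sketch using the azure-storage-blob Python SDK (this is the Python SDK, separate from the jars Spark needs later; the connection string is a placeholder you copy from the Access keys blade):

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; copy the real one from the storage
# account's "Access keys" blade in the Azure portal.
conn_str = "DefaultEndpointsProtocol=https;AccountName=StorageAccountName;AccountKey=xxxxx....;EndpointSuffix=core.windows.net"

service = BlobServiceClient.from_connection_string(conn_str)

# Create the container that Spark will write into.
service.create_container("sparkpoc")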

Now we have done half of our POC; let’s move to the next task, Apache Spark. To set up Spark we need Java installed on the server. I am assuming everyone is aware of how to install Java on Linux; here I am using CentOS 7.

java -version
openjdk version "1.8.0_292"
OpenJDK Runtime Environment (build 1.8.0_292-b10)
OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)

Download Apache Spark from the official website using the commands below:

wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xvf spark-3.1.2-bin-hadoop3.2.tgz
mv spark-3.1.2-bin-hadoop3.2 /opt/spark

We downloaded the tar file, extracted it, and moved it to the /opt directory. Now define the environment variables below in the ~/.bashrc file and reload it with source ~/.bashrc:

vim ~/.bashrc
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk"
export SPARK_HOME="/opt/spark"
export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$PATH
export PATH=$JAVA_HOME/bin:$PATH
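As a quick sanity check (a trivial sketch, assuming you have reloaded ~/.bashrc), you can confirm the variables are visible:

import os

# Both should print the paths exported in ~/.bashrc
print(os.environ.get("SPARK_HOME"))  # expected: /opt/spark
print(os.environ.get("JAVA_HOME"))   # expected: /usr/lib/jvm/java-1.8.0-openjdk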

Now we need the Azure jar libraries so that we can access Azure Blob Storage from Spark. For that, we need the jar files below:

  • azure-storage-2.0.0.jar
  • azure-storage-blob-12.0.0.jar
  • hadoop-azure-2.7.7.jar

wget https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar -O /opt/spark/jars/azure-storage-2.0.0.jar
wget https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.0.0/azure-storage-blob-12.0.0.jar -O /opt/spark/jars/azure-storage-blob-12.0.0.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.7/hadoop-azure-2.7.7.jar -O /opt/spark/jars/hadoop-azure-2.7.7.jar
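Alternatively, instead of copying jars into /opt/spark/jars, you can let Spark resolve them from Maven Central at startup via spark.jars.packages. A sketch, assuming the same versions as above:

from pyspark.sql import SparkSession

# Spark downloads these coordinates from Maven Central on first start,
# so the driver needs outbound network access.
spark = (SparkSession.builder
         .appName("azure-blob-poc")
         .config("spark.jars.packages",
                 "org.apache.hadoop:hadoop-azure:2.7.7,"
                 "com.microsoft.azure:azure-storage:2.0.0")
         .getOrCreate())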

Now we need to set up core-site.xml with the Azure access key, as below:

vim /opt/spark/conf/core-site.xml
-------
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.AbstractFileSystem.wasb.impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>
  <property>
    <!-- Replace StorageAccountName and the value with your storage account name and access key -->
    <name>fs.azure.account.key.StorageAccountName.blob.core.windows.net</name>
    <value>xxxxx....</value>
  </property>
  <property>
    <name>fs.azure.block.blob.with.compaction.dir</name>
    <value>/hbase/WALs,/data/myblobfiles</value>
  </property>
  <property>
    <name>fs.azure</name>
    <value>org.apache.hadoop.fs.azure.NativeAzureFileSystem</value>
  </property>
  <property>
    <name>fs.azure.enable.append.support</name>
    <value>true</value>
  </property>
</configuration>
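If you would rather not edit core-site.xml, the same key can be supplied per session through the spark.hadoop.* prefix. A minimal sketch (account name and key are placeholders):

from pyspark.sql import SparkSession

# Any spark.hadoop.* setting is forwarded to the Hadoop configuration,
# so this is equivalent to the core-site.xml property above.
spark = (SparkSession.builder
         .appName("azure-blob-poc")
         .config("spark.hadoop.fs.azure.account.key."
                 "StorageAccountName.blob.core.windows.net",
                 "xxxxx....")
         .getOrCreate())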

Let’s try to access Azure Blob Storage from Spark. The example below reads a table from MySQL (the MySQL JDBC connector jar must also be in /opt/spark/jars) and writes it to Blob Storage as Parquet.

from pyspark.sql import SparkSession

# Create the SparkSession (in the pyspark shell one already exists as `spark`)
spark = SparkSession.builder.appName("azure-blob-poc").getOrCreate()

df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://xxxxxxxxxxxxxx:3306/database_name") \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "table") \
    .option("user", "anverma") \
    .option("password", "xxxxxxxxxxx") \
    .load()

# Parquet stores the schema itself, so no header option is needed
df.write.mode("overwrite").parquet("wasbs://sparkpoc@sparkpoc.blob.core.windows.net/test")
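To verify the write, you can read the Parquet files back from the same container:

# Read back from Blob Storage; if the jars and access key are set up
# correctly this prints the first rows of the table.
df2 = spark.read.parquet("wasbs://sparkpoc@sparkpoc.blob.core.windows.net/test")
df2.show(5)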

TL;DR

Create an Azure Storage Account and container. Download Apache Spark and the required Azure library jar files. Update the access key in core-site.xml.
Please make sure the jar versions you use match your Spark/Hadoop build.
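One way to check which Hadoop version your Spark build ships with, so you can pick a matching hadoop-azure jar (a sketch that goes through the py4j gateway, so treat it as a debugging aid rather than a public API):

# Prints the Hadoop version bundled with Spark,
# e.g. 3.2.0 for spark-3.1.2-bin-hadoop3.2
print(spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())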
