Reading data from different sources using Spark 2.1

Knoldus Inc.
Knoldus - Technical Insights
2 min read · Mar 6, 2017

Hi all,

In this blog, we’ll discuss how to fetch data from different sources such as CSV, JSON, text and Parquet files.

So first of all, let’s discuss what’s new in Spark 2.1. In previous versions of Spark, you had to create a SparkConf and a SparkContext to interact with Spark, whereas in Spark 2.1 the same can be achieved through a SparkSession, without explicitly creating the SparkConf, SparkContext, or SQLContext, as they are all encapsulated within the SparkSession.

SparkSession provides a single point of entry for interacting with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs.

To create an sbt project with Spark 2.1, you need to add the following dependencies to your build.sbt file:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "2.1.0"

and your Scala version should be 2.11.x. For this project I am using:

scalaVersion := "2.11.8"
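
Equivalently, you can let sbt append the Scala binary version for you with the %% operator (a minor style choice, not required):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql"  % "2.1.0"
)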

To initialise the Spark session, you can use:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()
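
Since the session wraps the older entry points, the existing SparkContext and SQLContext remain reachable as fields on it, should legacy code still need them:

// Legacy entry points, encapsulated within the session
val sc = spark.sparkContext
val sqlContext = spark.sqlContext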

Reading data from a CSV file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromCsv")
  .getOrCreate()

val df = spark.read.csv("./src/main/resources/testData.csv")

// To display the DataFrame's data
df.show()
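
By default, spark.read.csv treats every value as a string and assumes there is no header row. A minimal sketch of the two most commonly used options (assuming testData.csv starts with a header line):

// Treat the first line as column names and let Spark infer column types
val dfTyped = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("./src/main/resources/testData.csv")
dfTyped.printSchema()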

Reading data from a JSON file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromJson")
  .getOrCreate()

val df = spark.read.json("./src/main/resources/testJson.json")

// To display the DataFrame's data
df.show()
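
Note that spark.read.json in Spark 2.1 expects JSON Lines input, i.e. one complete JSON object per line, and infers the schema from the data. A quick way to inspect what was inferred (the "name" column here is purely hypothetical):

// Each line of testJson.json should be a self-contained object, e.g. {"name":"a"}
df.printSchema()
df.select("name").show() // assumes the file has a "name" field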

Reading data from a text file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromTextFile")
  .getOrCreate()

val df = spark.read.text("./src/main/resources/textFile.txt")

// To display the DataFrame's data
df.show()
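
spark.read.text always produces a DataFrame with a single string column named value, one row per line of the file. For example, to count the non-empty lines:

import org.apache.spark.sql.functions.{col, length}

df.printSchema() // root |-- value: string (nullable = true)
val nonEmptyLines = df.filter(length(col("value")) > 0).count()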

Reading data from a Parquet file

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ReadDataFromParquet")
  .getOrCreate()

val df = spark.read.parquet("./src/main/resources/testJson.parquet")

// To display the DataFrame's data
df.show()
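
Parquet files are usually produced by Spark itself, so if you don't have one handy, a simple round trip (writing a DataFrame out and reading it back, using a hypothetical output path) looks like this:

import org.apache.spark.sql.SaveMode

// Write any DataFrame out as Parquet, overwriting the path if it already exists
df.write.mode(SaveMode.Overwrite).parquet("./src/main/resources/output.parquet")
val restored = spark.read.parquet("./src/main/resources/output.parquet")
restored.show()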

You can also create a temporary table from a DataFrame and perform SQL queries on it using the following code:

df.registerTempTable("tempTable")
spark.sqlContext.sql("select * from tempTable").show
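
Note that registerTempTable has been deprecated since Spark 2.0; the equivalent, non-deprecated calls are createOrReplaceTempView and spark.sql:

df.createOrReplaceTempView("tempTable")
spark.sql("select * from tempTable").show()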

Here is the link to a demo project on Spark 2.1: ReadingDataUsingSpark2.1

Happy coding!!
