ACCESSING HADOOP FILESYSTEM API WITH PYSPARK

somanath sankaran · Published in Analytics Vidhya · 4 min read · Apr 4, 2020

This is one of my stories in the Spark deep-dive series.

Spark is a parallel processing framework that sits on top of the Hadoop filesystem, so there are a few common use cases that require handling the Hadoop filesystem directly. Let us see the following in detail:

  1. Hadoop Filesystem API
  2. Accessing Hadoop file-system API with Pyspark
  3. Common Use- Case with Hadoop FileSystem API

Hadoop Filesystem API:

It is an abstract class in Java that can be used to access a distributed filesystem.

Since it is an abstract class, it provides a static get method which takes the filesystem configuration and returns a FileSystem instance. We will use this instance to perform common operations such as ls and copyToLocalFile.

public static FileSystem get(Configuration conf)
throws IOException

Accessing Hadoop file-system API with Pyspark

Unlike Scala, where we can import the Java classes directly, in PySpark these classes are exposed through the Py4J java_gateway JVM view and are available under sc._jvm.

Step 1: Import the abstract class

We can import the abstract class from the JVM gateway, as shown below.
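A minimal sketch, assuming an active SparkContext available as sc:

# the FileSystem abstract class, reached through the Py4J gateway
FileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem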

Step 2: Create the FileSystem object

Before passing the Hadoop configuration, we have to check whether Spark's Hadoop integration points to the correct filesystem URI.

For example, in my case the default filesystem URI was not pointing to HDFS, so I set it explicitly (this is optional, as in most production systems it will already be set).
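A hedged sketch of that check; the hdfs://namenode:8020 URI is only a placeholder for your cluster's NameNode address:

# Hadoop configuration object that Spark already carries
hadoop_conf = sc._jsc.hadoopConfiguration()

# inspect the current default filesystem (e.g. file:/// if not set)
print(hadoop_conf.get("fs.defaultFS"))

# optional: point it at HDFS explicitly
hadoop_conf.set("fs.defaultFS", "hdfs://namenode:8020")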

The next step is to create the FileSystem object by passing this Hadoop configuration object to get().
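Continuing the sketch above:

# get() returns a FileSystem instance bound to the configured URI
fs = FileSystem.get(hadoop_conf)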

Common Use Cases with the Hadoop FileSystem API

Let us see the following use cases:

  1. Make directory
  2. Copy file from local
  3. List HDFS directory
  4. Copy to local

Make Directory

We can make a directory with mkdirs:

public boolean mkdirs(Path f)
throws IOException

We will also import the Path class from the same JVM and pass it to mkdirs, as shown below.
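A sketch, with /user/test/demo_dir as a purely illustrative target path:

# the Path class comes from the same JVM gateway
Path = sc._jvm.org.apache.hadoop.fs.Path

# create the directory; returns True on success
fs.mkdirs(Path("/user/test/demo_dir"))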

The result can then be verified with hadoop fs -ls.

Copy file from local

public void copyFromLocalFile(Path src,
Path dst)
throws IOException

Since we are copying from the local filesystem, the file:/// scheme is used for the source path, as in the sketch below.
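A sketch, with hypothetical local and HDFS paths:

# source is on the local filesystem, hence the file:/// scheme
fs.copyFromLocalFile(Path("file:///tmp/sample.csv"),
                     Path("/user/test/demo_dir/sample.csv"))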

We can also check that the file exists on HDFS:
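For example (same hypothetical path as above):

# exists() returns True if the path is present on HDFS
print(fs.exists(Path("/user/test/demo_dir/sample.csv")))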

List HDFS directory

We can use globStatus to match all the directories with a glob pattern, as shown below.

public FileStatus[] globStatus(Path pathPattern)
throws IOException

This returns an array of Java FileStatus objects, and we can use a list comprehension to extract the file attributes.
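A sketch, again using the illustrative demo directory:

# match everything under the demo directory
statuses = fs.globStatus(Path("/user/test/demo_dir/*"))

# pull a few attributes out of each Java FileStatus object
files = [(s.getPath().getName(), s.getLen(), s.isDirectory()) for s in statuses]
print(files)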

Copy to local

public void copyToLocalFile(Path src,
Path dst)
throws IOException

Finally, check whether the file has been copied to the local filesystem.
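A sketch; the local target /tmp/sample_copy.csv is just an example:

import os

# copy the HDFS file back to the local filesystem
fs.copyToLocalFile(Path("/user/test/demo_dir/sample.csv"),
                   Path("file:///tmp/sample_copy.csv"))

# verify on the local side
print(os.path.exists("/tmp/sample_copy.csv"))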

That’s all for the day !! :)

Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Please suggest topics in Spark that you would like me to cover, and share any suggestions for improving my writing :)

Learn and let others Learn!!
