ACCESSING HADOOP FILESYSTEM API WITH PYSPARK
This is one of my stories in the Spark deep dive series.
Spark is a parallel processing framework that sits on top of the Hadoop filesystem, so there are a few common use cases that require working with the Hadoop filesystem directly. Let us see the following in detail:
- Hadoop Filesystem API
- Accessing the Hadoop Filesystem API with PySpark
- Common Use Cases with the Hadoop FileSystem API
Hadoop Filesystem API:
FileSystem is an abstract class in Java that can be used for accessing a distributed filesystem.
Since it is abstract, it provides a static get method that takes the configuration of the filesystem and returns a concrete FileSystem instance. We will use this instance to do common operations like listing files and copyToLocalFile:
public static FileSystem get(Configuration conf)
throws IOException
Accessing the Hadoop Filesystem API with PySpark
Unlike Scala, where we can import the Java classes directly, in PySpark these classes are exposed through the Py4J java_gateway JVM view, which is available as sc._jvm.
Step 1: Import the abstract class
We can import the abstract class through the JVM view, as shown below.
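A minimal sketch of the import, assuming an active SparkContext named sc (for example from spark.sparkContext):

# Reference the Hadoop FileSystem abstract class through the Py4J JVM view
# (assumes an active SparkContext named sc)
FileSystem = sc._jvm.org.apache.hadoop.fs.FileSystem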
Step 2: Create the FileSystem object
Before passing the Hadoop conf, we have to check whether Spark's Hadoop integration points to the correct filesystem URI (fs.defaultFS).
For example, in my case it was not pointing to the Hadoop filesystem, so I set it explicitly (this is optional, as in most production systems it will already be set).
The next step is to create the FileSystem object by passing this Hadoop conf object to the static get method, as shown below.
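A hedged sketch of both steps; the hdfs:// URI below is a placeholder for your cluster's namenode address:

# Fetch the Hadoop Configuration object that Spark carries internally
hadoop_conf = sc._jsc.hadoopConfiguration()

# Check where Spark currently points; on a cluster this should be an hdfs:// URI
print(hadoop_conf.get("fs.defaultFS"))

# Optional: point it at HDFS explicitly (placeholder URI, use your namenode address)
hadoop_conf.set("fs.defaultFS", "hdfs://localhost:9000")

# Create the FileSystem instance via the static get method
fs = FileSystem.get(hadoop_conf)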
Common Use Cases with the Hadoop FileSystem API
Let us see the following use cases:
- Make directory
- Copy file from local
- List HDFS directory
- Copy to local
Make directory
We can make a directory with mkdirs:
public boolean mkdirs(Path f)
throws IOException
We will import the Path class from the same JVM view and pass it to mkdirs, as sketched below.
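A short sketch; the directory path is a placeholder:

# Import the Hadoop Path class from the same JVM view
Path = sc._jvm.org.apache.hadoop.fs.Path

# Create a directory in HDFS (placeholder path)
fs.mkdirs(Path("/user/test_dir"))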
And it can be verified from the shell with hadoop fs -ls.
Copy file from local
public void copyFromLocalFile(Path src,
Path dst)
throws IOException
Since we are copying from the local filesystem, the file:/// scheme is used for the source path, as in the sketch below.
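A minimal sketch with placeholder paths:

# Copy a local file into HDFS; file:/// marks the local source
src = Path("file:///tmp/sample.csv")      # placeholder local file
dst = Path("/user/test_dir/sample.csv")   # placeholder HDFS destination
fs.copyFromLocalFile(src, dst)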
Checking that the file exists
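We can use the exists method; for example, reusing the placeholder destination path from above:

# exists returns a boolean; True here means the copy above succeeded
fs.exists(Path("/user/test_dir/sample.csv"))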
List HDFS directory
We can use globStatus to match all the files and directories against a glob pattern, as shown below.
public FileStatus[] globStatus(Path pathPattern)
throws IOException
This returns a Java array of FileStatus objects, and we can use a list comprehension to pull out the attributes of each file.
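A sketch with a placeholder pattern; getPath and getLen are standard FileStatus accessors:

# Match everything under the directory with a glob pattern (placeholder)
statuses = fs.globStatus(Path("/user/test_dir/*"))

# Each element is a Java FileStatus; pull out the path and size of each file
files = [(str(status.getPath()), status.getLen()) for status in statuses]
print(files)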
Copy to local
public void copyToLocalFile(Path src,
Path dst)
throws IOException
We can then check that the file was copied to the local filesystem, as in the sketch below.
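A minimal sketch with placeholder paths; the local check uses plain Python:

import os

# Copy from HDFS down to the local filesystem (placeholder paths)
fs.copyToLocalFile(Path("/user/test_dir/sample.csv"),
                   Path("file:///tmp/sample_copy.csv"))

# Verify the copy landed on the local filesystem
print(os.path.exists("/tmp/sample_copy.csv"))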
That’s all for the day !! :)
Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv
Please suggest Spark topics that I should cover next, and share any suggestions for improving my writing :)
Learn and let others Learn!!