Read files on HDFS through Python

An example of reading a CSV file on HDFS through Python

Aman Ranjan Verma
Towards Data Engineering


I usually read files from HDFS with Spark. In one use case, though, I had no option but to read the file with plain Python. The snippets below do exactly that.


Method: 1

Replace these placeholders in the script below:

  • <name_node_ip>: the IP of the active name node
  • <port>
  • <user_name>
  • <hdfs_file_path_to.csv>
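import pandas as pd
from pyarrow import fs

# Connect to the active name node; the URI carries host, port and user.
hdfs = fs.HadoopFileSystem.from_uri("hdfs://<name_node_ip>:<port>/?user=<user_name>")

# Open the file as a stream and let pandas parse it.
with hdfs.open_input_file("<hdfs_file_path_to.csv>") as f:
    df = pd.read_csv(f)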
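
If the connection details come from configuration rather than being typed in by hand, the connection string can be assembled in code. The sketch below is one way to do that. `build_hdfs_uri` and `read_csv_from_hdfs` are hypothetical helper names, not part of pyarrow, and the example assumes the Hadoop client libraries (libhdfs) are installed on the machine running the script.

```python
from urllib.parse import urlencode


def build_hdfs_uri(name_node_ip, port, user_name):
    """Assemble the HDFS connection URI (hypothetical helper, not a pyarrow API)."""
    # urlencode escapes any special characters in the user name.
    query = urlencode({"user": user_name})
    return f"hdfs://{name_node_ip}:{port}/?{query}"


def read_csv_from_hdfs(uri, path):
    """Read one CSV file from HDFS into a pandas DataFrame."""
    # Imported here so the URI helper works even where pyarrow is not installed.
    import pandas as pd
    from pyarrow import fs

    hdfs = fs.HadoopFileSystem.from_uri(uri)
    with hdfs.open_input_file(path) as f:
        return pd.read_csv(f)
```

Usage would look like `read_csv_from_hdfs(build_hdfs_uri("<name_node_ip>", 8020, "<user_name>"), "<hdfs_file_path_to.csv>")`, where 8020 is only an example port.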

Method: 2

Replace these placeholders in the script below:

  • nodes: a list of the active and standby name node IPs, or just the active one
  • <user_name>
  • <hdfs_file_path_to.csv>
import pandas as pd
from pyhdfs import HdfsClient

# Active and standby name node addresses.
nodes = ["xx.yy.zz.xyz", "xx.yx.zx.zyx"]
client = HdfsClient(hosts=nodes, user_name="<user_name>")
df = pd.read_csv(client.open("<hdfs_file_path_to.csv>"))
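
One thing to watch with this method: pyhdfs talks to the name node over WebHDFS (HTTP), so each host needs the name node's HTTP port, not the RPC port used in Method 1. When no port is given, pyhdfs falls back to 50070, while Hadoop 3 clusters serve HTTP on 9870 by default. The sketch below adds an explicit port to any host that lacks one. `with_webhdfs_port` and `open_client` are hypothetical helper names, and the 9870 default is an assumption about the cluster.

```python
def with_webhdfs_port(hosts, port=9870):
    """Append the WebHDFS port to every host that has none (hypothetical helper).

    Only handles hostnames and IPv4 addresses; IPv6 literals contain ':' already.
    """
    return [h if ":" in h else f"{h}:{port}" for h in hosts]


def open_client(hosts, user_name, port=9870):
    """Build a pyhdfs client against hosts with explicit HTTP ports."""
    # Imported here so the helper above works even where pyhdfs is not installed.
    from pyhdfs import HdfsClient

    return HdfsClient(hosts=with_webhdfs_port(hosts, port), user_name=user_name)
```

With this, `open_client(nodes, "<user_name>")` can stand in for the direct `HdfsClient(...)` call.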

How to find an active name node?
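
One way to check this from Python is to ask each name node's JMX endpoint for its NameNodeStatus bean, which reports a State of "active" or "standby". This is a minimal sketch, assuming the name node web UI is reachable over plain HTTP without Kerberos. The function names are hypothetical, and every host must include the HTTP port.

```python
import json
from urllib.request import urlopen

# JMX query for the bean that carries the HA state of a name node.
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"


def parse_namenode_state(jmx_payload):
    """Return the 'State' field from a JMX reply given as a JSON string."""
    beans = json.loads(jmx_payload).get("beans", [])
    if not beans:
        raise ValueError("no NameNodeStatus bean in JMX response")
    return beans[0]["State"]


def find_active_namenode(hosts, timeout=5):
    """Return the first 'host:port' whose name node reports 'active', else None."""
    for host in hosts:
        try:
            with urlopen(f"http://{host}{JMX_QUERY}", timeout=timeout) as resp:
                payload = resp.read().decode("utf-8")
            if parse_namenode_state(payload) == "active":
                return host
        except (OSError, ValueError, KeyError):
            # Unreachable node or unexpected reply: try the next one.
            continue
    return None
```

For example, `find_active_namenode(["xx.yy.zz.xyz:9870", "xx.yx.zx.zyx:9870"])` returns the address to use in either method above, or `None` if no node answers as active.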
