Read files on HDFS through Python
An example of reading a CSV file on HDFS through Python
Dec 10, 2021
I usually read files from HDFS with Spark. In one use case, though, I had no option but to read the file with plain Python. The two snippets below do exactly that.
Method: 1
Replace these placeholders in the script below:
- name_node_ip (IP of the active name node)
- port
- user_name
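import pandas as pd
from pyarrow import fs

# from_uri reads the host, port and user from the URI string.
hdfs = fs.HadoopFileSystem.from_uri("hdfs://<name_node_ip>:<port>?user=<user_name>")

# Open the file on HDFS and load it into a DataFrame.
with hdfs.open_input_file("<hdfs_file_path_to.csv>") as f:
    df = pd.read_csv(f)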
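The connection string in Method 1 is just the three placeholders joined together. If those values come from configuration rather than being typed in by hand, a small helper along these lines could assemble the string. This is only a sketch: `build_hdfs_uri` is a name I made up for illustration (it is not part of pyarrow), and the sample IP, port and user are invented values.

```python
from urllib.parse import quote


def build_hdfs_uri(name_node_ip: str, port: int, user_name: str) -> str:
    """Assemble 'hdfs://<ip>:<port>?user=<user>' from its three parts.

    Hypothetical helper for illustration only.
    """
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    # Percent-encode the user name so special characters cannot break the query string.
    return f"hdfs://{name_node_ip}:{port}?user={quote(user_name, safe='')}"


# Invented example values; substitute the ones for your cluster.
uri = build_hdfs_uri("10.0.0.1", 8020, "hadoop")
print(uri)
```

The resulting string has the same shape as the hard-coded one in the script above and can be used in its place.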
Method: 2
Replace these values in the script below:
- nodes (a list with the IPs of the active and standby name nodes, or just the active one)
- user_name
import pandas as pd
from pyhdfs import HdfsClient

nodes = ["xx.yy.zz.xyz", "xx.yx.zx.zyx"]
client = HdfsClient(hosts=nodes, user_name="<user_name>")

# client.open returns a file-like object that pandas can read directly.
df = pd.read_csv(client.open("<hdfs_file_path_to.csv>"))
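One thing to watch with Method 2: pyhdfs talks to the name node over WebHDFS (HTTP), so any port in the nodes list is the name node's HTTP port, not the RPC port used in Method 1. As far as I know, pyhdfs falls back to 50070 when an entry has no port, while Hadoop 3 clusters commonly serve WebHDFS on 9870, so it may be safer to spell the port out. A minimal sketch of that, where `with_http_port` is a hypothetical helper of mine, the 9870 default is an assumption about the cluster, and the IPs are invented:

```python
from typing import Iterable, List


def with_http_port(nodes: Iterable[str], port: int = 9870) -> List[str]:
    """Return 'host:port' entries, adding `port` to hosts that lack one.

    Hypothetical helper; assumes IPv4 addresses or hostnames (no IPv6 literals).
    """
    return [n if ":" in n else f"{n}:{port}" for n in nodes]


# Invented example values; the second entry already carries its own port.
nodes = with_http_port(["10.0.0.1", "10.0.0.2:50070"])
print(nodes)
```

The returned list can be passed as the hosts argument of HdfsClient in place of the bare IPs.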