Feed File Generation from Data Stored in HDFS and Exposed Using Hive
Problem Statement:
As part of analytics work, we often need to send multiple file feeds (in .txt, .csv, or similar formats) to downstream systems for different analysis purposes.
These feeds may go to the marketing team to identify target customers, or to the data science team for their machine learning models.
In this article, we will explore some options for generating feeds from data stored in HDFS and sharing them with third-party downstream applications:
Pre-requisite:
Data is already stored in HDFS and exposed to end users through Hive tables.
Note: We could also use MapReduce or Spark to generate the feed files, but that involves extensive coding, and maintaining the code takes more effort than maintaining Hive queries.
Since the data is already exposed as Hive tables, we can use Hive directly to generate the feed files.
In Hive, there are multiple ways to generate a text or csv feed file from the information available:
- Using the hive -e "query" option
- Using hdfs cat option
- And my personal favorite (which I used in my use case): using the INSERT OVERWRITE LOCAL DIRECTORY option
Let's discuss them one by one, with an example, pros, and cons for each:
1. Using the hive -e "query" option
hive -e "HIVE_QUERY" > /FEED_LOCATION/sample_feed.txt
The "-e" option tells Hive to run SQL from the command line; the result of the query is then redirected to the output file, as can be seen in the below snippet.
Contents of the feed can be seen in the below snippet:
This command can be run from the command line, so there is no need to log in to the Hive CLI, and the output file is generated on the client node.
Pros :
- Simple to use, and column headers can be included in the feed file
Cons :
- Output files are generated with a tab delimiter; converting the tabs to commas can take a lot of time for huge files
- There is no provision for enclosing attribute values in quotes
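Both cons above can usually be handled with a small post-processing step on the client node. The sketch below is illustrative only: the file names and sample data are hypothetical stand-ins for the tab-delimited output of `hive -e`. It converts tabs to commas with `tr`, and wraps every field in double quotes with `awk`:

```shell
# Stand-in for the tab-delimited output of "hive -e" (hypothetical sample data):
printf 'id\tname\n1\tAlice\n2\tBob\n' > sample_feed.txt

# Tab -> comma, using tr (portable; safe only when field values
# contain no embedded tabs or commas):
tr '\t' ',' < sample_feed.txt > sample_feed.csv

# Tab -> quoted CSV: wrap each field in double quotes, join with commas:
awk -F'\t' 'BEGIN{OFS=","} {for (i = 1; i <= NF; i++) $i = "\"" $i "\""; print}' \
    sample_feed.txt > sample_feed_quoted.csv

head -1 sample_feed.csv          # id,name
head -1 sample_feed_quoted.csv   # "id","name"
```

Keep in mind that for huge files this extra pass is exactly the overhead the con describes; the INSERT OVERWRITE option discussed later avoids it by writing the delimiter you want directly.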
2. Using hdfs cat option
hadoop fs -cat hdfs://servername/user/hive/warehouse/databasename/table_csv_export_data/* > ~/output.csv
We can issue this command from the client node, and the output file will be generated locally, as can be seen in the below snippet:
Pros :
- Simple, with comma as delimiter in CSV output.
Cons :
- Column headers can’t be generated.
- Can't perform any transformations or filtering while generating the feed
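Since this option cannot produce column headers, a common workaround is to prepend a hand-written header line while concatenating the part files. A minimal sketch, using hypothetical part files and column names standing in for the real table schema:

```shell
# Stand-in for the table's HDFS part files (hypothetical sample data):
mkdir -p table_csv_export_data
printf '1,Alice\n' > table_csv_export_data/part-00000
printf '2,Bob\n'   > table_csv_export_data/part-00001

# Prepend a header line (column names must match the table schema)
# while concatenating all part files into the final feed:
{ echo 'id,name'; cat table_csv_export_data/*; } > output.csv

head -1 output.csv   # id,name
```

In real use, `cat table_csv_export_data/*` would be replaced by the `hadoop fs -cat ...` command shown above.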
3. Using INSERT OVERWRITE LOCAL DIRECTORY option
INSERT OVERWRITE LOCAL DIRECTORY 'FEED_LOCATION'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM TABLE;
OVERWRITE: this keyword replaces everything present in FEED_LOCATION with the files generated by this command.
LOCAL: this keyword is optional. Use LOCAL when the output file should be generated in a local directory; omit it when the file should be generated in an HDFS directory.
Sample query can be seen in the below snippet:
Output feed file can be seen in the below snippet
Pros:
- Any transformations or filters can be applied to the data stored in HDFS while generating the output file
- We can generate the output file in either an HDFS directory or a local directory, depending on our use case
- Not all output files require a comma delimiter; we can define a custom delimiting character while generating the output file
- There is no extra overhead of redirecting the result to an output file; we just need to define the output location
Cons:
- Files are generated with names like "000000_0"; an additional step may be required to rename them to the required feed name
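That rename step is a one-liner. A minimal sketch, assuming a hypothetical output directory and feed name; note that with multiple reducers Hive writes several such files (000000_0, 000001_0, ...), so concatenating them is safer than a plain mv:

```shell
# Stand-in for the directory written by INSERT OVERWRITE LOCAL DIRECTORY;
# the 000000_0-style names are what Hive actually produces, but the
# directory and feed names here are hypothetical.
mkdir -p feed_location
printf '1,Alice\n' > feed_location/000000_0
printf '2,Bob\n'   > feed_location/000001_0

# Concatenate every part file into the final feed name
# (covers both the single-file and multi-reducer cases):
cat feed_location/0*_0 > sample_feed.csv
```

The shell glob expands in lexicographic order, so rows from 000000_0 land before rows from 000001_0 in the final feed.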
Additional tips to keep in mind before generating the feed
- Depending on cluster settings, Hive may compress query output by default; since we need to generate an uncompressed feed, it is important to set the Hive parameter below:
SET hive.exec.compress.output=false;
- You can speed up feed generation by using the Tez engine on top of Hive:
SET hive.execution.engine=tez;
- In case the attributes of the feed are required to be enclosed in quotes, you can use the OpenCSVSerde, as can be seen in the command below:
INSERT OVERWRITE LOCAL DIRECTORY 'FEED_LOCATION'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',',
'quoteChar'='"'
)
STORED AS TEXTFILE
SELECT * FROM TABLE;
SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O: it handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.