Feed File Generation from Data Stored in HDFS and Exposed Using Hive

Sumit Goel
Published in Analytics Vidhya
4 min read · Nov 4, 2019

Problem Statement:

As part of analytics, we often need to send multiple feed files (in .txt, .csv, or similar formats) to downstream systems for different analysis purposes.
These may go to the marketing team to identify target customers, or to the data science team for their machine learning models.

In this article, we will explore some options for generating feeds from the data stored in HDFS, for sharing with third-party downstream applications:

Prerequisite:
Data is already stored in HDFS and exposed to end users through Hive tables.

Note: We can also use MapReduce/Spark to generate the feed files, but that requires extensive coding, and the resulting code takes more effort to maintain than Hive queries.

Since we already have the data exposed as Hive tables, we can use Hive directly to generate the feed files.

In Hive, there are multiple ways to generate a text or CSV feed file from the available data:

  • Using the hive -e "query" option
  • Using the hdfs cat option
  • And my personal favorite (which I used in my use case): using the INSERT OVERWRITE LOCAL DIRECTORY option

Let’s discuss them one by one, with an example, pros, and cons for each:

  1. Using the hive -e "query" option

hive -e "HIVE_QUERY" > /FEED_LOCATION/sample_feed.txt

The "-e" option with hive is used to run SQL from the command line; the result of the query is then redirected to the output file, as can be seen in the snippet below.

The contents of the feed can be seen in the snippet below:

This command can be run directly from the command line, so there is no need for the user to log in to the Hive CLI, and the output file is generated on the client node.
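As an illustration (the database, table, and output path below are made up, not from the article), the command could look like this; the SET statement is what adds the column header to the output:

```shell
hive -e "SET hive.cli.print.header=true;
SELECT customer_id, email FROM sales_db.customers;" > /home/feeds/sample_feed.txt
```

This needs a running Hive installation on the client node; the header setting applies only to the CLI output being redirected.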

Pros :

  • Simple to use, and column headers can be included in the feed file

Cons :

  • The output file is generated with a tab delimiter; converting tabs to commas can take a lot of time when generating huge files
  • There is no provision for enclosing attribute values in quotes
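When the tab delimiter is a problem, a post-processing pass with tr can convert it, at the cost of reading the whole file again (the file paths and contents here are illustrative stand-ins, not from the article):

```shell
# Stand-in for a tab-delimited file produced by hive -e redirection
printf 'id\tname\n1\tAlice\n2\tBob\n' > /tmp/sample_feed.txt

# Convert every tab to a comma to get a comma-delimited feed
tr '\t' ',' < /tmp/sample_feed.txt > /tmp/sample_feed.csv
```

Note that tr is naive: it will also convert tabs that appear inside field values, so this only works when the data itself contains no tabs.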

2. Using the hdfs cat option

hadoop fs -cat hdfs://servername/user/hive/warehouse/databasename/table_csv_export_data/* > ~/output.csv

We can issue this command from the client node, and the output file is generated locally, as can be seen in the snippet below:

Pros :

  • Simple, and the output is comma-delimited when the underlying table files are stored as CSV.

Cons :

  • Column headers can’t be generated.
  • No transformation or filter can be applied while generating the feed
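The missing-header limitation can be worked around by writing the header locally and appending the concatenated data after it (the paths and column names here are made up for illustration):

```shell
# Stand-in for data pulled down with "hadoop fs -cat ... > ~/output.csv"
printf '1,alice@example.com\n2,bob@example.com\n' > /tmp/output.csv

# Write the header line first, then append the data rows
printf 'customer_id,email\n' > /tmp/feed_with_header.csv
cat /tmp/output.csv >> /tmp/feed_with_header.csv
```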

3. Using the INSERT OVERWRITE LOCAL DIRECTORY option

INSERT OVERWRITE LOCAL DIRECTORY 'FEED_LOCATION'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM TABLE;

OVERWRITE- This keyword replaces everything present in the “FEED_LOCATION” directory with the files generated by this command

LOCAL- This keyword is optional. If the output file should be generated in an HDFS directory, omit the LOCAL keyword; if it should be generated in a local directory on the client node, include it.
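As an illustrative sketch (the table and directory names here are assumptions, not from the article), the two variants look like this; note that ROW FORMAT support for the non-LOCAL form was only added in later Hive releases:

```sql
-- With LOCAL: the feed lands in a directory on the client node
INSERT OVERWRITE LOCAL DIRECTORY '/home/feeds/customer_feed'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT customer_id, email FROM sales_db.customers;

-- Without LOCAL: the feed lands in an HDFS directory
INSERT OVERWRITE DIRECTORY '/user/feeds/customer_feed'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT customer_id, email FROM sales_db.customers;
```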

A sample query can be seen in the snippet below:

The output feed file can be seen in the snippet below:

Pros:

  • Any transformation or filter can be applied to the data stored in HDFS while generating the output file
  • We can generate the output file either in an HDFS directory or in a local directory, depending on our use case
  • Not all output files need a comma delimiter; we can define our own delimiting character while generating the output file
  • No extra overhead of redirecting the result to an output file; we just need to define the output location

Cons:

  • The file will be generated with the name “000000_0” (one such part file per final task), so an additional step may be required to rename or merge the files
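A minimal sketch of that renaming step, assuming the feed was written to /tmp/customer_feed (the directory name and rows are illustrative); concatenating the part files also covers the case where several 000000_0-style files are produced:

```shell
# Stand-in for the directory that INSERT OVERWRITE LOCAL DIRECTORY produced
mkdir -p /tmp/customer_feed
printf '1,alice@example.com\n' > /tmp/customer_feed/000000_0
printf '2,bob@example.com\n'   > /tmp/customer_feed/000001_0

# Merge the part files into a single feed file with the required name
cat /tmp/customer_feed/0* > /tmp/customer_feed.csv
```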

Additional tips to keep in mind before generating the feed

  • Hive may compress query output by default; since the feed must contain uncompressed data, it is important to set the Hive parameter below:
    SET hive.exec.compress.output=false;
  • You can speed up feed generation by running Hive on the Tez execution engine:
    SET hive.execution.engine=tez;
  • In case the attributes of the feed need to be enclosed in quotes, you can use the OpenCSVSerde, as shown in the command below:
    INSERT OVERWRITE LOCAL DIRECTORY 'FEED_LOCATION'
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
    'separatorChar'=',',
    'quoteChar'='"'
    )
    STORED AS TEXTFILE
    SELECT * FROM TABLE;

SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O: it handles both serialization and deserialization, and also interprets the results of serialization as individual fields for processing.
