Manisha Dube
3 min readJul 30, 2016

What Is Apache Hive?

Apache Hive is a knowledge factory facilities designed on top of Hadoop for offering information summarization, question, and research. While designed by Facebook or myspace, Apache Hive is now used and designed by other manufacturers such as Blockbuster online and the Economical Market Regulating Power. Amazon preserves a application package hand of Apache Hive that is a part of Amazon Flexible MapReduce on Amazon Web Services. Oracle dba certification teaches you about Apache Hive and Pig.

Hive

Hive is a element of Hortonworks Data Platform(HDP). Hive provides a SQL-like customer interface to information saved in HDP. In the first guide, Pig was used, which is a scripting terminology with a concentrate on dataflows. Hive provides a data source question customer interface to Apache Hadoop.

Hive or Pig?

People often ask why do Pig and Hive are available when they seem to do much of the same thing. Hive because of its SQL like question terminology is often used as the consumer interface to an Apache Hadoop centered information factory. Hive is regarded customer friendly and more acquainted to customers who are used to using SQL for querying information. Pig matches through its information circulation strong points where it requires on the projects of offering information into Apache Hadoop and working with it to get it into the proper execution for querying. An excellent review of how this performs is in Mike Gateways publishing on the Yahoo Developer weblog named Pig and Hive at Yahoo! From a technological point of perspective, both Pig and Hive are function finish, so you can do projects in either device. However, you will discover one device or the other will be preferred by the different categories that have to use Apache Hadoop. The best part is they have a option and both resources work together.

Our Data Handling Task

The same information processing process as it was just done with Pig in the first guide. They have several data files of baseball statistics and we are going to take them into Hive and do some simple processing with them. We are going to discover the gamer with the highest operates for each year. This data file has all the research from 1871–2011 and contains more that 90,000 series. Once we have the highest runs we will increase the program to convert a gamer id area into the first and last titles of gamers.

Apache Hive facilitates research of huge datasets saved in Hadoop’s HDFS and suitable data file techniques such as Amazon S3 filesystem. It provides an SQL-like terminology known as HiveQL with schema on study and transparently transforms concerns to MapReduce, Apache Tezand Ignite tasks. All three performance google can run in Hadoop YARN. To speed up concerns, it provides indices, such as bitmap indices. Other functions of Hive include:

Listing to give speeding, catalog type such as compaction and Bitmap catalog as of 0.10, more catalog kinds are organized.

Different storage space kinds such as simply written text, RCFile, HBase, ORC, and others.

Meta-data storage space in an RDBMS, considerably lowering the time to carry out semantic assessments during question performance.

Focusing on compacted information saved into the Hadoop environment using methods such as DEFLATE, BWT, quick, etc.

Built-in customer described functions (UDFs) to operate schedules, post, and other data-mining resources. Hive facilitates increasing the UDF set to manage use-cases not reinforced by built-in functions.

SQL-like concerns (HiveQL), which are unquestioningly turned into MapReduce or Tez, or Ignite tasks. You can take up with the Oracle Certification to make your career in this field as an Oracle dba or a database administrator.