Hive Introduction – Benefits and Limitations, Principles

In this post, we introduce Hive, its benefits and limitations, and its key principles.

Hive Introduction – Benefits and Limitations

Hive is a data warehouse tool built on top of Hadoop for processing structured data. It is essentially a layer over the MapReduce programming model that makes querying and analysis easier.

It supports ad-hoc queries, data summarization, and analysis of large data sets residing on HDFS through a SQL-like query language named HQL (Hive Query Language, also written HiveQL).
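To make the idea of a query layer over MapReduce more concrete, here is a minimal Python sketch of how a simple summarization query could be split into map, shuffle, and reduce phases. The table `page_views`, its columns, and the query shown in the comment are hypothetical, and this is only an illustration of the concept, not how Hive's engine is actually implemented.

```python
from collections import defaultdict

# Hypothetical query this sketch mimics (table and column names are illustrative):
#   SELECT country, COUNT(*) FROM page_views GROUP BY country;

def map_phase(rows):
    """Emit a (country, 1) pair for every input row."""
    for row in rows:
        yield row["country"], 1

def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values of each key to get the per-country row count."""
    return {key: sum(values) for key, values in grouped.items()}

def count_by_country(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))

page_views = [
    {"user": "a", "country": "IN"},
    {"user": "b", "country": "US"},
    {"user": "c", "country": "IN"},
]
print(count_by_country(page_views))  # {'IN': 2, 'US': 1}
```

With Hive, the user writes only the one-line query; the work of expressing it as map and reduce steps is left to the engine.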

Because of its SQL-like language, Hive is a popular choice for analytics on Hadoop. HQL also gives users several places to plug in their own functionality for custom analysis, such as user-defined functions (UDFs).

Because it runs on Hadoop, Hive provides massive scale-out and fault tolerance for data storage and processing on commodity hardware.

Hive was originally developed by Facebook in 2007 to handle massive volumes of data. The Apache Software Foundation later took it up and developed it further as an open-source project under the name Apache Hive.

Many companies use it today. For example, Amazon offers it as part of Amazon Elastic MapReduce.

It is important to note that Hive is not a relational database. It was not designed for low-level (row-level) insert, update, or delete operations; newer releases support them only on tables configured as transactional.

Hive is not intended for real-time data processing or for online transaction processing (OLTP). It is best suited to traditional data warehousing workloads.

Hive uses MapReduce for execution and HDFS for storing and retrieving data. Because it relies on MapReduce, it is batch-oriented and has high query latency.

Principles of Hive

Hive commands are similar to SQL, the query language commonly used with relational databases and data warehouses.
It is an extensible framework which supports different file and data formats.
We can easily plug in custom map and reduce code, written in the language of our choice, through user-defined functions and custom scripts.
Performance benefits from the Hive engine, which optimizes each query's execution plan to reduce execution time while maintaining high throughput.
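
As an illustration of plugging in custom code in a language of our choice, here is a minimal Python sketch of a streaming script that could be called from a query. It relies on Hive's TRANSFORM clause, which passes rows to an external script as tab-separated lines, rather than on a Java UDF. The script name `upper_country.py`, the table `page_views`, and the columns `user_id` and `country` are hypothetical.

```python
import io

# Hypothetical usage from Hive (file, table, and column names are illustrative):
#   ADD FILE upper_country.py;
#   SELECT TRANSFORM (user_id, country)
#   USING 'python upper_country.py'
#   AS (user_id, country)
#   FROM page_views;

def transform_lines(lines):
    """Upper-case the second column of each tab-separated input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # this sketch simply skips malformed rows
        user_id, country = fields[0], fields[1]
        yield f"{user_id}\t{country.upper()}"

def run(stdin, stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for out_line in transform_lines(stdin):
        stdout.write(out_line + "\n")

# Local demo with in-memory streams; as a real script this would be
# run(sys.stdin, sys.stdout) after importing sys.
demo_in = io.StringIO("u1\tin\nu2\tus\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the row logic in `transform_lines` and the stream handling in `run` lets the transformation be tested locally before it is attached to a query.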