Hive Introduction – Benefits and Limitations, Principles

In this post, we will cover an introduction to Hive and its key principles.

Hive Introduction – Benefits and Limitations

Hive is a data warehouse tool built on top of Hadoop for processing structured data. It is essentially a wrapper over the MapReduce programming layer that makes querying and analyzing data easy.

It facilitates analysis of large data sets, ad-hoc querying, and easy data summarization through a query language named HQL (Hive Query Language) for data residing on HDFS.

Because of its SQL-like language, Hive is a popular choice for Hadoop analytics. Hive’s SQL dialect also gives users multiple places to integrate their own functionality for custom analysis, such as User-Defined Functions (UDFs).

It provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive was originally developed by Facebook in 2007 to handle massive volumes of data. The Apache Software Foundation later took it up and developed it further as open-source software under the name Apache Hive.

It is nowadays used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.

It is important to note that Hive is not a relational database; it does not support row-level insert, update, or delete operations.

It is not used for real-time data processing, and it is not designed for online transaction processing. Instead, it is best suited for traditional data-warehousing workloads.

Relying on MapReduce for execution, Hive is batch-oriented and has high latency for query execution: it uses MapReduce for execution and HDFS for storage and retrieval of data.

Principles of Hive

  1. Hive commands are similar to SQL commands, so anyone familiar with relational databases can pick up HiveQL quickly.
  2. It is an extensible framework that supports different file and data formats.
  3. We can easily plug in MapReduce code written in the language of our choice using user-defined functions (see the sketch just after this list).
  4. Performance is good because the Hive engine compiles each query into an optimized execution plan, reducing execution time while keeping throughput high.
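
As a minimal sketch of principle 3: ADD JAR and CREATE TEMPORARY FUNCTION are the standard HIVE statements for registering a user-defined function; the jar path and class name here are hypothetical.

hive> ADD JAR /tmp/my-udfs.jar;                                        -- jar containing the compiled UDF class (hypothetical path)
hive> CREATE TEMPORARY FUNCTION to_upper AS 'com.example.udf.ToUpper'; -- hypothetical UDF class
hive> SELECT to_upper(ename) FROM emp;                                 -- the UDF is now usable like a built-in function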

Hive Components – Metastore, UI, Driver, Compiler and Execution Engine

Some of the key Hive components that we are going to learn in this post are UI, Driver, Compiler, Metastore, and Execution engine. Let us understand these Hive components one by one in detail below.


Hive User Interfaces (UI)

The user interface is how users submit queries and other operations to the system. Hive mainly provides three ways to communicate with the Hive driver.

  • CLI (Command Line Interface)

This is the most common way of interacting with Hive, where we use a Linux terminal to issue queries directly to the Hive driver.

  • HWI (Hive Web Interface)

It is an alternative to the CLI where we use a web browser to interact with Hive.

  • JDBC/ODBC/Thrift Server

This allows remote clients to submit requests to HIVE and retrieve results. The HIVE_PORT environment variable must be set to an available port number for the server to listen on.
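
As a sketch with the classic HIVE distribution (the HiveServer1 era this describes), the Thrift server can be started from the shell like this; port 10000 is just a conventional choice:

export HIVE_PORT=10000      # port for the Thrift server to listen on
hive --service hiveserver   # start the server so JDBC/ODBC clients can connect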

It is important to note that the CLI is a fat client: it requires a local copy of all the HIVE components as well as the Hadoop client and its configuration.

Hive Driver

This component receives queries from the user interfaces (UI) and provides execute and fetch APIs modeled on JDBC/ODBC drivers.

Hive Compiler

This component parses the query, performs semantic analysis on the different query blocks, and finally generates the execution plan.

This is done with the help of table and partition metadata looked up in the Metastore.

Hive Metastore

A Metastore is a component that stores the system catalog and metadata about tables, columns, partitions and so on.

For example, a CREATE TABLE definition statement is stored here. The Metastore uses a relational database to store its metadata.

Apache Hive uses the Derby database by default. However, Derby has limitations, such as allowing only a single active user at a time.

Any JDBC-compliant database, such as MySQL or Oracle, can instead be used for the Metastore. The key attributes to configure for the Hive Metastore are the connection settings in hive-site.xml.

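A minimal hive-site.xml sketch, assuming a MySQL-backed Metastore; the host, database name, and credentials are placeholders:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host/metastore_db</value>  <!-- placeholder host and database -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>  <!-- placeholder -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepass</value>  <!-- placeholder -->
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>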

Hive Execution Engine

This component is responsible for executing the execution plan created by the compiler.

The execution engine is the bridge between HiveQL and MapReduce: it executes the plan’s stages, typically as MapReduce jobs, and produces the same results that equivalent hand-written MapReduce jobs would.

HIVE Architecture – Hadoop, HIVE Query Flow

The diagram below represents the Hadoop HIVE architecture and the flow of a typical query through the HIVE system.

HIVE Architecture

The UI calls the execute-query interface of the driver. The driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.

The compiler needs the metadata, so it sends a getMetaData request to the Metastore and receives the corresponding sendMetaData response.

This metadata is used to typecheck the query expressions and to prune partitions based on the query predicates.

The plan generated by the compiler is a sequence of steps where each step is either a MapReduce job, a metadata operation or an operation on HDFS.
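
We can inspect such a plan ourselves with the EXPLAIN command; a sketch, assuming the EMP table defined later in this post:

hive> EXPLAIN SELECT Ename, Salary FROM emp WHERE Salary > 4000;
-- The output lists the stage dependencies, followed by the operators
-- (TableScan, Filter Operator, Select Operator, and so on) that each stage will run.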

The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2 and 6.3 in the diagram). Once the output is generated, it is written to a temporary HDFS file through the serializer.

The content of that file is then read by the execution engine directly from HDFS and returned to the UI client.

Hive Data Types – Hive File Formats, Primitive, Collection Data Types

In this post, we will discuss Hive data types and file formats.

Hive Data Types

Hive supports most of the primitive data types that we find in relational databases. It also supports three collection data types that are rarely supported by relational databases.

Hive Primitive data types

Type      | Description                                                                   | Literal Syntax Example
TINYINT   | 1-byte signed integer, from -128 to 127                                       | 30
SMALLINT  | 2-byte signed integer, from -32,768 to 32,767                                 | 30
INT       | 4-byte signed integer                                                         | 30
BIGINT    | 8-byte signed integer                                                         | 30
FLOAT     | 4-byte single-precision floating-point number                                 | 8.16
DOUBLE    | 8-byte double-precision floating-point number                                 | 8.16
BOOLEAN   | Boolean true or false                                                         | TRUE
TIMESTAMP | Timestamp; only available starting with Hive 0.8.0                            | '2016-02-02 11:41:56'
DATE      | A particular year/month/day, in the form YYYY-MM-DD                           | '2013-01-01'
STRING    | Sequence of characters, quoted with either single (') or double (") quotes    | 'Learn HIVE'
VARCHAR   | Variable-length string created with a length specifier (between 1 and 65535)  | 'Learn HIVE'
CHAR      | Similar to VARCHAR, but fixed-length                                          | 'Learn HIVE'
BINARY    | Array of bytes                                                                | 11101

Hive Collection Data Types

Type   | Description                                                                     | Literal Syntax Example
ARRAY  | An indexed collection of elements of the same type, similar to arrays in Java  | array('mak','mani'); the second element is accessed with array[1]
MAP    | A collection of key-value pairs, where values are accessed using array notation with a key | map('first','Kumar','last','Mak'); 'first' and 'last' are the keys and 'Kumar' and 'Mak' the values, so the last name can be referenced as name['last']
STRUCT | Similar to a struct in C, where fields are accessed using "dot" notation       | For a column c of type STRUCT<a:INT, b:INT>, the a field is accessed with the expression c.a

The table structure below demonstrates the Hive data types and their uses.

CREATE TABLE EMP
(
  Ename        STRING,
  EmpID        INT,
  Salary       FLOAT,
  Tax          MAP<STRING, FLOAT>,
  Subordinates ARRAY<STRING>,
  Address      STRUCT<city: STRING, state: STRING, zip: INT>
);

The Ename column is a simple string holding each employee’s name, EmpID is a number assigned to each employee to uniquely identify them within the organization, and Salary is a FLOAT holding the employee’s salary.

The Tax column is a map that holds a key-value pair for every deduction taken out of the employee’s salary.

For example, a key can be “professional tax” and its value a percentage or an absolute amount. In a traditional database, there might be a separate table holding the tax types and their corresponding values.
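
A few sketch queries against this table show how each data type is accessed (the keys and indexes are illustrative):

hive> SELECT Ename, Salary FROM emp;             -- primitive columns
hive> SELECT Subordinates[0] FROM emp;           -- first element of the ARRAY
hive> SELECT Tax['Professional Tax'] FROM emp;   -- MAP value looked up by key
hive> SELECT Address.city FROM emp;              -- STRUCT field via dot notation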

Data matching the above CREATE TABLE statement could come from a JSON (JavaScript Object Notation) file with the following content:

{
  "Ename": "Kumar Mak",
  "EmpID": 1234,
  "Salary": 5000.0,
  "Tax": {
    "Professional Tax": 0.3,
    "State Tax": 0.05,
    "Federal Tax": 0.1
  },
  "Subordinates": ["Smith", "John", "Carley"],
  "Address": {
    "city": "London",
    "state": "UK",
    "zip": 560076
  }
}


A file format is a way in which information is stored in a computer file. The table below shows HIVE’s default delimiters for fields and records:

Delimiter    | Description
\n           | Terminates records (rows)
^A ('\001')  | Separates fields (columns)
^B ('\002')  | Separates elements in an ARRAY or STRUCT, and the key-value pairs in a MAP
^C ('\003')  | Separates the key from its value within a MAP entry

Below is the same “EMP” table with the format specified:

CREATE TABLE EMP
(
  Ename        STRING,
  EmpID        INT,
  Salary       FLOAT,
  Tax          MAP<STRING, FLOAT>,
  Subordinates ARRAY<STRING>,
  Address      STRUCT<city: STRING, state: STRING, zip: INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

The “ROW FORMAT DELIMITED” keyword must be specified before any of the other delimiter clauses, with the exception of the “STORED AS” clause, which comes after them.

It specifies the format of data rows. “ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'” means that HIVE will use the “^A” character to separate fields; '\001' is the octal code for “^A”. Similarly, '\002' is the octal code for “^B”, and so on.
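
As a sketch of what this looks like on disk, here is the record from the JSON example above serialized with the default delimiters (control characters shown symbolically):

Kumar Mak^A1234^A5000.0^AProfessional Tax^C0.3^BState Tax^C0.05^BFederal Tax^C0.1^ASmith^BJohn^BCarley^ALondon^BUK^B560076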

“STORED AS” specifies the type of file in which the data is to be stored, such as TEXTFILE, SEQUENCEFILE, or RCFILE.
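
For example, here is a sketch of a definition that stores its data as a SequenceFile instead of plain text (the table name is hypothetical):

CREATE TABLE emp_seq
(
  Ename STRING,
  EmpID INT
)
STORED AS SEQUENCEFILE;   -- binary key-value container format instead of plain text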

Schema on Read

When we write data into a traditional database, whether through INSERT or UPDATE operations or by loading external data, the database has control over its storage and can enforce the schema as the data is written. This is called schema on write.

Hive has no control over the underlying storage: there are many ways to create, update, or delete data in HDFS without going through Hive at all.

Therefore, Hive can only enforce the schema when the data is read, which is why this is called schema on read. For example, if a file whose fields do not match the table definition lands in the table’s directory, Hive does not complain when the file is written; the mismatched fields simply come back as NULL when queried.

HIVE Query Language (HQL) – HIVE Create Database, Create Table

In this section, we will discuss the data definition language (DDL) parts of HIVE Query Language (HQL), which are used for creating, altering, and dropping databases, tables, views, functions, and indexes.

We will also look into the SHOW and DESCRIBE commands for listing and describing databases and tables stored in the HDFS file system.

Hive Database – HIVE Query

A database in Hive is just a namespace or catalog of tables. If we do not specify a database, the default database is used.

HIVE CREATE database syntax 

hive> CREATE DATABASE IF NOT EXISTS financial;

HIVE USE Database syntax

This command is used for using a particular database.

hive> use <database name>;

hive> use financial;

HIVE SHOW all existing databases

hive> SHOW DATABASES;

default
financial

For each database, HIVE creates a directory, and the tables in that database, say “EMP” in the “financial” database, are stored in sub-directories of it.

The database directory is created under the directory specified by the parameter “hive.metastore.warehouse.dir”.

Assuming the default value of this parameter, “/user/hive/warehouse”, when we create the “financial” database, HIVE will create the sub-directory “/user/hive/warehouse/financial.db”, where “.db” is the extension.

HIVE DESCRIBE database syntax

This will show the directory location of the “financial” database.

hive> DESCRIBE DATABASE financial;

financial
hdfs://user/hive/warehouse/financial.db

HIVE ALTER Database Syntax

We can associate key-value pairs with a database in the DBPROPERTIES using ALTER DATABASE command.

hive> ALTER DATABASE financial SET DBPROPERTIES ('edited by' = 'kmak');
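
To verify the properties we just set, DESCRIBE DATABASE EXTENDED prints them along with the database location (a sketch):

hive> DESCRIBE DATABASE EXTENDED financial;   -- also lists the DBPROPERTIES key-value pairs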

HIVE CREATE Table Syntax

The CREATE TABLE statement in Hive is similar to its SQL counterpart, but Hive provides a lot of flexibility in terms of where the table’s data files are stored, the file format used, the delimiters used, and so on.

CREATE TABLE IF NOT EXISTS financial.EMP
(
  Ename        STRING,
  EmpID        INT,
  Salary       FLOAT,
  Tax          MAP<STRING, FLOAT>,
  Subordinates ARRAY<STRING>,
  Address      STRUCT<city: STRING, state: STRING, zip: INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

In the above example, the default location of the table will be “/user/hive/warehouse/financial.db/EMP”.

If we want to override the location explicitly, we need to use the “LOCATION” keyword in the CREATE TABLE statement.
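
A sketch of such an override; the column list is abbreviated and the HDFS path is hypothetical:

CREATE TABLE IF NOT EXISTS financial.emp_archive
(
  Ename STRING,
  EmpID INT
)
LOCATION '/data/finance/emp_archive';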

HIVE SHOW TABLES Syntax

This command lists the tables in the current working database.

hive> SHOW TABLES;

EMP

Table1

Table2

HIVE DESCRIBE Syntax

The DESCRIBE statement displays metadata about a table, such as the column names and their data types.

hive> DESCRIBE financial.emp.salary;

salary    float

HIVE DROP Database Syntax

The DROP DATABASE statement deletes the database. By default, HIVE will not drop a database that still contains tables; adding the CASCADE keyword drops the contained tables first and then the database.

hive>DROP DATABASE IF EXISTS financial;

hive> DROP DATABASE IF EXISTS financial CASCADE;

HIVE DROP Table Syntax

hive> DROP TABLE IF EXISTS emp;

HIVE ALTER Table Syntax

hive> ALTER TABLE emp RENAME TO employee;  -- Rename the table

hive> ALTER TABLE emp CHANGE name firstname STRING;  -- Rename a column

hive> ALTER TABLE emp CHANGE sal sal DOUBLE;  -- Change a column's data type

hive> ALTER TABLE emp ADD COLUMNS (dno INT COMMENT 'Department number');  -- Add a column

hive> ALTER TABLE emp REPLACE COLUMNS (empid INT, name STRING);  -- Deletes all existing columns from emp and replaces them with the two listed

HIVE DESCRIBE EXTENDED Syntax

This command describes the table in detail. If we replace EXTENDED with FORMATTED, we get more verbose, more readable output.

The output lines that start with LOCATION show the full URL path in HDFS where HIVE keeps all the data for the given table.

hive> DESCRIBE EXTENDED financial.EMP;
Ename         STRING
EmpID         INT
Salary        FLOAT
Tax           MAP<STRING, FLOAT>
Subordinates  ARRAY<STRING>
Address       STRUCT<city: STRING, state: STRING, zip: INT>

Detailed Table Information   Table(tableName:Emp, dbName:financial, owner:kmak…
Location: hdfs://user/hive/warehouse/financial.db/Emp
Parameters: {creator=kmak, created_at='2016-02-02 11:01:43', last_modified_user=kmak, last_modified_time=1443553211, …}

HIVE Managed and External Tables

HIVE Managed Tables

By default, HIVE tables are managed tables: HIVE controls both the metadata and the lifecycle of the data.

HIVE stores all the data for a given table in a sub-directory under the directory defined by the parameter “hive.metastore.warehouse.dir”, which is “/user/hive/warehouse” by default.

Dropping a managed table deletes the table’s data by deleting the sub-directory that was created for it. The “DESCRIBE EXTENDED” command output tells whether a table is managed or external.
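
For example, DESCRIBE FORMATTED prints a “Table Type” line (output abridged; a sketch):

hive> DESCRIBE FORMATTED financial.emp;
...
Table Type:    MANAGED_TABLE
...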

The drawback of managed tables is that they are less convenient to use with other tools.

For example, suppose some data has been written by Pig or another tool, and we want to run HIVE queries against it without giving HIVE ownership of the data. In such a case, we create an external table, which has access to the data but not ownership of it.

HIVE External Tables

Let us suppose we want to analyze marketing data being ingested from different sources.

We ingest south-zone and north-zone data using several tools, such as Pig and HIVE. Let us assume the data files for the SALES table reside in the directory /data/marketing.

The following table declaration creates an external table that can read all the data files in the /data/marketing directory, with the data stored as comma-delimited text.

CREATE EXTERNAL TABLE IF NOT EXISTS SALES (
  SaleId    INT,
  ProductId INT,
  Quantity  INT,
  ProdName  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/marketing';

The keyword “EXTERNAL” tells HIVE that this table is external and that the data is stored in the directory mentioned in the “LOCATION” clause.

Since the table is external, HIVE does not assume it owns the data. Therefore, dropping the table deletes only the metadata in the HIVE Metastore; the actual data remains intact.

Therefore, if the data is shared between tools, then it is always advisable to create an external table to make ownership explicit.

Copy the schema (not the data) of an existing table

CREATE EXTERNAL TABLE IF NOT EXISTS financial.emp1
LIKE financial.emp
LOCATION '/data/marketing';