Hive Data Types – Hive File Formats, Primitive, Collection Data Types

Hive Data Types – Hive File Formats, Primitive, Collection Data Types

In this post, we will discuss Hive data types and file formats.

Hive Data Types

Hive supports most of the primitive data types that we find in relational databases. It also supports three collection data types that are rarely supported by relational databases.

Hive Primitive data types

TypeSizeLiteral Syntax Example
TINYINT1-byte signed integer, from -128 to 12730
SMALLINT2-byte signed integer, from -32,768 to 32,76730
INT4-byte signed integer30
BIGINT8-byte signed integer30
FLOAT4-byte single precision floating point number8.16
DOUBLE8-byte double precision floating point number8.16
BOOLEANBoolean Ture or FalseTRUE
TIMESTAMP Only available starting with Hive 0.8.02016-02-02 11:41:56′
DATEDATE values describe a particular year/month/day, in the form YYYY-­MM-­DD‘2013-­01-­01’
STRINGSequence of characters with either single quotes (‘) or double quotes (“)‘Learn HIVE’
VARCHARVarchar types are created with a length specifier (between 1 and 65355)‘Learn HIVE’
CHARChar types are similar to Varchar but they are fixed-length‘Learn HIVE’
BINARYArray of Bytes11101

Hive Collection Data Types

TypeExpressionLiteral Syntax Example
ARRAYIt is the collection of similar type of elements that are indexed. It is similar to arrays in Java.array(‘mak’,’mani’);The second element is accessed with array[1]
MAPCollection of key, value pair where fields are accessed by array notation of keys.MAP(‘first’,’Kumar’,’last’,’Mak’).’first’ and ‘last’ is the column name in the table and ‘Kumar’ and ‘Mak’ are the values. The last name can be referenced using name[‘last’]
STRUCTIt is similar to STRUCT in C language where fields can be accessed using “dot” notation.For a column c of type STRUCT {a INT; b INT} the a field is accessed by the expression c.a

Below table structure demonstrates the Hive data types and their uses.

CREATE TABLE EMP

(
Ename STRING,

EmpID INT,

Salary FLOAT,

Tax MAP<STRING, FLOAT>,

Subordinates ARRAY<STRING>,

Address STRUCT<city: STRING, state: STRING, zip: INT>

)

The ename column is a simple string which is the name of each employee, EmpID is the number assigned to each employee to uniquely identify the employee working for an organization and float is for employee’s salary.

Tax column is a map that holds the key-value pair for every deduction that will be taken out from employee’s salary.

For example, Tax can be (“professional tax”) where value can be a percentage value or an absolute number. In traditional databases, there might be a separate table which will be holding the Tax type and the corresponding values.

Data examples of the above create table statement can be a JSON (JavaScript Object Notation) file with below data:

{

“Ename”: “Kumar Mak”,

“EmpID”: 1234,

“Salary”: 5000.0,

“Tax”: {

“Professional Tax”: .3,

“State Tax”:05

“Fedral Tax”:.1

},

“Subordinates”: [“Smith”, ”John”, ”Carley”],

“Address”: {

“city”: “London”,

“State”: “UK”

“Zip”:560076

}
}

 

A file format is a way in which information is stored in a computer file. Below is the default delimiters table for the fields:

hive data types
Default Delimiter Table

Below is the same “EMP” table with the format specified:

CREATE TABLE EMP

(

Ename STRING,

EmpID INT,

Salary FLOAT,

Tax MAP<STRING, FLOAT>,

Subordinates ARRAY<STRING>,

Address STRUCT<city: STRING, state: STRING, zip: INT>

)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\001'

COLLECTION ITEMS TERMINATED BY '\002'

MAP KEYS TERMINATED BY '\003'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

The “ROW FORMAT DELIMITED” keyword must be specified before any other clause, with an exception of the “STORED AS” clause.

It specifies the format of data rows. ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\001’ meaning that HIVE will use “^A” character to separate fields. ‘\001’ is the octal code for “^A”. Similarly, ‘\002’ is the octal code for “^B” and so on.

“STORED AS” specifies the type of file in which data is to be stored. The file can be a TEXTFILE, SEQUENCEFILE, RCFILE, or BINARY SEQUENCEFILE.

Schema on Read

When we write data into traditional databases either by INSERT, UPDATE operation or through loading external data etc. The database has control on its storage. It can enforce schema as the data is written. This is called schema on write.

Hive has no control over the underlying storage. There are many ways to create, update or delete the data from the HDFS storage through the Hive Query. Therefore,

Therefore, Hive can only enforce queries on read. Hence, this is called as schema on read.