HIVE Partition Concepts – Managed Table and External Table Partition

In the below post we are going to understand about HIVE Partition and Concepts of HIVE partitioning in detail.

Partitions are essentially horizontal slices of data that divides the larger set of data into small manageable chunks.

In HIVE, partitioning is supported for both managed and external table. Partitioning can be done based on one or more than one columns to impose multi-dimensional structure on directory storage.

Partition column is a virtual column that does not exist on the file as a column. HIVE queries can take advantage of the partitioned data for better performance.

HIVE Partition – Managed Table Partitioning

Let us discuss the HIVE partition concept for the managed table first.

Our human resource (HR) team might be interested in looking the data country-wise and then state-wise. Therefore, let us first partition the data first county-wise and then state-wise.

CREATE TABLE IF NOT EXISTS financial.EMP
(
Ename STRING,
EmpID INT,
Salary FLOAT,
Tax MAP<STRING, FLOAT>,
Subordinates ARRAY<STRING>,
Address STRUCT<city: STRING, state: STRING, zip: INT>
)
PARTITIONED BY (country STRING, state STRING);

Partitioning tables changes how HIVE structures the data storage. HIVE will create sub-directories to reflect the partition structure. For Example:

...

…/Emp/country=IN/state=KA

…/Emp/country=IN/state=BH

…

…/Emp/country=US/state=NC

…/Emp/country=US/state=AL

…

If the table data and the number of partitions are large then a HIVE query across all partitions it can trigger enormous MapReduce jobs.

In such case, we put the HIVE into “strict” mode by setting up the below parameter.

hive> set hive.mapred.mode=strict;

The above parameter prohibits the HIVE queries on partitioned tables to run without a WHERE clause.

SHOW PARTITION Syntax

hive> SHOW PARTITIONS EMP;

HIVE Partition – External Table Partitioning

Partitioning external table has the added advantage of sharing the data with other tools, while still optimizing the query performance.

CREATE EXTERNAL TABLE IF NOT EXISTS log_tracking

(
ID STRING,
SERVER STRING,
MESSAGE STRING,
)

PARTITIONED BY (YEAR INT,MONTH INT,DAY INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n';

When we created a non-partitioned external table, we need the “LOCATION” as the mandatory clause.

But for the partitioned external table, the clause is not mandatory. We use ALTER table statement to add each partition separately.

ALTER TABLE log_tracking ADD PARTITION(YEAR=2016, MONTH=12, DAY=3)

LOCATION ‘hdfs://server_name/data/log_tracking/2016/12/03’;

SHOW PARTITIONS log_tracking;
…
Year=2016/month=12/day=3
…
…

Similarly, DESCRIBE EXTENDED log_tracking will show the detailed information of the partition keys.