Hive Query – HiveQL
SELECT … FROM Clause
SELECT … FROM clause is almost same as SQL. Please see few examples below.
Hive SELECT all query
Hive> SELECT * FROM aliens LIMIT X;
From above Hive query, it will select X records from aliens. Array elements in a record will be within brackets
Hive SELECT query with array indexing
Hive> SELECT name, reportee FROM employees;
From the above Hive query, the first element of the array is selected.
Hive SELECT query with STRUCT column.
Hive> SELECT address.city, name FROM employees;
From above Hive query output will show the struct column in JSON format and the first element of the array is selected.
Hive Manipulating Column values
We can manipulate column values using function calls and arithmetic expressions.
hive> SELECT upper(name), sales_cost, round((sales_cost-cogs)/sales_cost) FROM products;
Hive Column Alias and Nested Queries
Select query with column alias
Hive> SELECT name, address.city as city FROM employees;
Hive Column aliases for nested queries
Hive> FROM( SELECT cust_name, round(ytd_sales * .05) as rewards FROM customer; ) C SELECT C.name, C.rewards WHERE C.rewards > 20;
Hive Selecting data from partitions
Hive scans only partitions relevant to the query, thus improving performance.
Hive> SELECT name, age FROM employees Where city = 'Delhi';
Assuming partitioned on cities and there are 4 partitions with equal volume of data, query will partition only 1/4th of the data
Hive JOIN Statements
Hive supports ANSI joins
hive> SELECT * FROM customers JOIN sales ON customers.id = sales.cust_id
Hive query also supports Inner joins, left outer joins, Right outer, Full outer, left semi-joins
Hive Order By
- Order By clause similar to SQL
- Performs ordering using single reducer
- Will take a long time for large datasets
Hive Sort By
- SORT BY Clause
- Data is ordered in each reducer – each reducer’s output is sorted
- If multiple reducers – final results may not be as expected
Hive views are similar to SQL views which reduce the Hive query complexity by encapsulating the complex queries from end users.
Hive> CREATE VIEW emp_dept AS SELECT * FROM emp JOIN dept ON (emp.dno=dept.deptno) WHERE ename =’Andrew’;
A user can drop the view in the same way as the table.
Hive> DROP VIEW IF EXISTS emp_dept;
Hive can build indexes on columns to speed up querying operations. The index data is stored in another table.
This is a very useful concept and a good alternative to partitioning when the logical partitions are small and numerous.
Let us consider the below table:
CREATE TABLE IF NOT EXIXTS financial.EMP ( Ename STRING, EmpID INT, Salary FLOAT, Tax MAP<STRING, FLOAT>, Subordinates ARRAY<STRING>, Address STRUCT<city: STRING, state: STRING, zip: INT> ) PARTITIONED BY (country STRING, state STRING);
If we need to create an index on country partition only, then below is the syntax:
CREATE INDEX emp_index ON TABLE EMP (country) AS ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ WITH DEFERRED REBUILD IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’) IN TABLE EMP_INDEX PARTITIONED BY (country, ename) COMMENT ‘Employee indexed by country and ename’;
The AS clause specifies the index handler, a java inbuilt class that implements indexing.
Hive BITMAP Indexes
CREATE INDEX emp_index ON TABLE EMP (country) AS ‘BITMAP’ WITH DEFERRED REBUILD IDXPROPERTIES (‘creator’=’kmak’, ‘created_at’=13092016 13:24:56’) IN TABLE EMP_INDEX_TABLE PARTITIONED BY (country, ename) COMMENT ‘Employee indexed by country and ename’;
Showing an Index
SHOW FORMATTED INDEX ON emp;
Dropping an Index
DROP INDEX IF EXISTS emp_index ON TABLE emp;