Hive is among the most popular and widely deployed members of the Hadoop ecosystem today.
With the increasing popularity of big data applications, MapReduce has become the standard for batch processing on commodity hardware. However, MapReduce code can be quite challenging for developers to write, let alone for data scientists and administrators.
Hive is a data warehousing framework that runs on top of Hadoop and provides an SQL abstraction over MapReduce. Data analysts and business intelligence officers need not learn another complex programming language to write MapReduce applications; Hive automatically translates an SQL query into a series of MapReduce jobs.
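For example, a simple aggregation like the one below (the page_views table is hypothetical) is compiled into a map phase that emits (country, 1) pairs and a reduce phase that sums them per key:

-- Hypothetical table: count page views per country.
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;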
[Diagram: Hive architecture overview. Source: Apache Software Foundation]
The diagram above shows the important components of the Hive architecture. An SQL query is converted into a MapReduce application through the following process: the client submits a HiveQL statement to the driver, the compiler parses it and builds an execution plan using table metadata from the metastore, the optimizer refines that plan, and the execution engine submits the resulting MapReduce jobs to the Hadoop cluster.
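You can inspect this translation yourself with Hive's EXPLAIN statement, which prints the plan and stages a query compiles into. A minimal sketch, reusing the hypothetical page_views table from above:

EXPLAIN
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country;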
Running a Hive query requires only some SQL experience, as HiveQL is almost indistinguishable from standard SQL.
To connect to a Hive session, run the following command:
beeline -u jdbc:hive2://<ip>:10000/ -n <username>
The default Hive CLI client is called beeline. It connects to a JDBC URL as a specific user, and permissions are managed against the users defined in HDFS and in the operating system.
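Once connected, you can issue HiveQL statements interactively; for example, a few basic statements to check what the connected user can see:

SHOW DATABASES;
USE default;
SHOW TABLES;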
From the beeline session, you can then run SQL scripts against HDFS locations, or any other location supported by Hive. The following code block is an example DDL statement, taken from the Amazon Customer Reviews Open Dataset, that connects to Amazon S3.
CREATE EXTERNAL TABLE amazon_reviews_parquet(
  marketplace string,
  customer_id string,
  review_id string,
  product_id string,
  product_parent string,
  product_title string,
  star_rating int,
  helpful_votes int,
  total_votes int,
  vine string,
  verified_purchase string,
  review_headline string,
  review_body string,
  review_date bigint,
  year int)
PARTITIONED BY (product_category string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://amazon-reviews-pds/parquet/';
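Because the table is partitioned by product_category and the data already lives in S3, the partitions still have to be registered in the metastore before queries return any rows. A minimal sketch of how this is typically done; the exact partition path shown is illustrative:

-- Scan the table location and register any partition directories found.
MSCK REPAIR TABLE amazon_reviews_parquet;

-- Or register a single partition explicitly (path shown is illustrative).
ALTER TABLE amazon_reviews_parquet ADD IF NOT EXISTS
  PARTITION (product_category='Books')
  LOCATION 's3://amazon-reviews-pds/parquet/product_category=Books/';

-- Once partitions are registered, the table can be queried like any other.
SELECT star_rating, COUNT(*) AS reviews
FROM amazon_reviews_parquet
WHERE product_category = 'Books'
GROUP BY star_rating;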
The EXTERNAL keyword specifies that the table's data will not be managed by the built-in HDFS hive user. An external table is necessary when the data is stored outside of the cluster Hive is running on, or even in a directory not owned by the hive user.
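For contrast, a managed (internal) table keeps its data under Hive's warehouse directory, and dropping it deletes the underlying files, whereas dropping an external table removes only the metadata. A minimal sketch; the managed table below is hypothetical:

CREATE TABLE amazon_reviews_managed(
  review_id string,
  star_rating int)
STORED AS PARQUET;

-- DROP TABLE amazon_reviews_managed;   -- deletes metadata and the data files
-- DROP TABLE amazon_reviews_parquet;   -- deletes metadata only; the S3 files remain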
Although a schema is defined, writes to the location are not checked by Hive. This is because Hive follows the schema-on-read principle: the schema is applied as Hive reads the data from the source, not while data is being inserted. This allows Hive to read more complex file types, such as the Parquet format used in this example. Hive can also read CSV files through the CSV SerDe, complex text files through the RegEx SerDe, and even binary files through custom SerDes and InputFormats.
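As an illustration of the same schema-on-read idea with a different format, the built-in OpenCSVSerde can expose plain CSV files as a table. The table name and location below are placeholders; note that this SerDe reads every column as a string:

CREATE EXTERNAL TABLE reviews_csv(
  review_id string,
  star_rating string,
  review_body string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE
LOCATION '/data/reviews_csv/';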
Simple SQL aggregations, joins, and windowing functions can yield invaluable insights when run at this scale. Hive is one of many members of the Hadoop ecosystem, yet it is a simple and powerful tool for running analytics on petabytes of data using a familiar language. Since Hive is one of the most commonly deployed members of the Hadoop ecosystem, it is essential for data engineers and data analysts to understand these basic concepts.
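As a closing sketch, a window function over the table defined earlier ranks products by review volume within each category; this is the kind of query Hive quietly turns into a series of MapReduce jobs:

SELECT product_category,
       product_title,
       review_count,
       RANK() OVER (PARTITION BY product_category
                    ORDER BY review_count DESC) AS category_rank
FROM (
  SELECT product_category, product_title, COUNT(*) AS review_count
  FROM amazon_reviews_parquet
  GROUP BY product_category, product_title) t;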