Writing Complex Analytical Queries with Hive

Hive is a data warehouse that runs on top of the Hadoop distributed computing framework. Because it works on huge datasets, this course explains the features you need to understand in order to write efficient, fast queries.
Course info
Rating: (38)
Level: Intermediate
Updated: Apr 25, 2017
Duration: 3h 2m
Table of contents
Course Overview
Using Hive for Analytical Queries
Partitioning Tables for Faster Queries
Bucketing Columns for Faster Joins
Optimizing Hive Joins
Windowing Functions
Description

The Hive data warehouse supports analytical processing: it generally runs long-running jobs that crunch huge amounts of data. By understanding what goes on behind the scenes in Hive, you can structure your queries to perform well, which makes your data analysis more efficient. In this course, Writing Complex Analytical Queries with Hive, you'll discover how to make design decisions and how to lay out data in your Hive tables. First, you'll dive into partitioning and bucketing, which are ways to reduce the data a query has to process. You'll cover how and when to use partitioning, bucketing, or both when you set up your tables. Next, you'll be introduced to join operations, including how to deal with large tables and how to run and optimize map-only joins. Lastly, you'll learn windowing functions, which allow you to write complex queries simply and easily with no intermediate tables, an important optimization with large datasets. By the end of this course, you'll have an understanding of the little details that make writing complex queries easier and faster.
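
To make the idea of "reducing the data a query has to process" concrete, here is a minimal sketch in plain Python rather than Hive. It only mimics the layout idea: rows grouped by a partition column so a filtered query reads one group, and rows spread across a fixed number of buckets by key. The sample rows, the column names (order_id, country, amount), the bucket count, and the helper functions are all hypothetical, and the modulo is a stand-in for whatever hash Hive applies.

```python
from collections import defaultdict

# Hypothetical sample rows: (order_id, country, amount). Not from the course.
ROWS = [
    (1, "US", 30.0),
    (2, "IN", 12.5),
    (3, "US", 7.5),
    (4, "DE", 20.0),
    (5, "IN", 40.0),
]

NUM_BUCKETS = 2  # assumed bucket count, chosen only for illustration


def partition_by_country(rows):
    """Group rows by country, like one storage location per partition value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)
    return partitions


def bucket_by_order_id(rows, num_buckets=NUM_BUCKETS):
    """Assign each row to a bucket using order_id modulo the bucket count."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets


def total_for_country(partitions, country):
    """Sum amounts for one country while reading only that partition."""
    scanned = partitions.get(country, [])
    return sum(amount for _, _, amount in scanned), len(scanned)


partitions = partition_by_country(ROWS)
total, rows_scanned = total_for_country(partitions, "US")
print(f"US total={total}, scanned {rows_scanned} of {len(ROWS)} rows")

buckets = bucket_by_order_id(ROWS)
print("rows per bucket:", [len(b) for b in buckets])
```

The filtered total touches only the rows in the matching partition instead of the whole dataset, and every row with the same key always lands in the same bucket, which is the property that lets two tables bucketed on the same key be joined bucket by bucket.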

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Mining Data from Text (Intermediate, 2h 21m, Jun 28, 2019)
Building Regression Models with scikit-learn (Intermediate, 2h 42m, Jun 28, 2019)
Section Introduction Transcripts

Course Overview
Hi. My name is Janani Ravi, and welcome to this course on Writing Complex Analytical Queries with Hive. I'll introduce myself first. I have a master's degree in Electrical Engineering from Stanford, and I have worked with companies such as Microsoft, Google, and Flipkart. At Google I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loony Corn, a studio for high-quality video content.

Hive is a data warehouse that supports analytical processing. Analytical processing involves summarizing huge datasets, extracting insights from them, and calculating trends. This means that Hive scripts tend to be very long-running jobs that require a lot of resources, but there is a lot you can do to make your queries run faster on Hive, and that's where this course comes in. This course helps you make design decisions on how to lay out data in your Hive tables. Partitioning and bucketing are ways to reduce the data your query has to process, and you'll understand how and when you would use partitioning, bucketing, or both. Joining information from two or more tables is a very common operation, but it's often slow and inefficient in Hive. You'll learn how Hive deals with joins under the hood and how you can tweak your queries to have joins run faster. Lastly, we cover windowing functions. They allow you to write complex queries simply and easily with no intermediate tables. This course helps you understand the little details of Hive that make writing complex queries easier and faster.
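
As a rough illustration of what "no intermediate tables" means for windowing functions, the sketch below computes a per-country running total in a single query. It is run through Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made only so the example is self-contained (it needs SQLite 3.25 or later). The orders table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "US", 30.0), (2, "IN", 12.5), (3, "US", 7.5), (4, "IN", 40.0)],
)

# One query: the window restarts for each country and accumulates by order_id,
# so no per-country intermediate table has to be built and joined back.
QUERY = """
SELECT order_id,
       country,
       amount,
       SUM(amount) OVER (PARTITION BY country ORDER BY order_id) AS running_total
FROM orders
ORDER BY country, order_id
"""

running_totals = conn.execute(QUERY).fetchall()
for row in running_totals:
    print(row)
```

Without a window function, the same result would typically need a self-join or a separately aggregated table joined back to the detail rows, which is the extra pass over a large dataset that the window avoids.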

Using Hive for Analytical Queries
Hi, and welcome to this course on Writing Complex Analytical Queries with Hive. Now, Hive is a data warehouse that works on huge datasets, which means any query that you run on Hive is likely to be slow and long running. But there are tons of little tips and tricks that you can follow in the design of your tables and the way you structure your queries to make Hive more performant. We'll look at some of those in this course. In this module we'll introduce Hive as a data warehouse for analytical processing. We'll assume, however, that you have some familiarity with Hive and with writing queries for it. This module will also offer a brief introduction to the various ways in which Hive deals with huge datasets.

This is an advanced course in Hive, and there are some prerequisites that you ought to be familiar with before you get started. First, you should be aware of the building blocks of Hive and the basic setup of Hive as an analytical warehouse. You should be comfortable writing queries to read and load data in Hive, and you should have a basic understanding of Hadoop, HDFS, and MapReduce, the distributed framework on top of which Hive is built. This course assumes that you have Hive up and running on your local machine. This implies that you have Hadoop installed, preferably v2 in pseudo-distributed mode, that Hive v2 is running on top of Hadoop, and that you have the Beeline command-line interface to connect to Hive locally. With this setup you'll be all ready to run the demos in this course.