Writing Complex Analytical Queries with Hive

Hive is a data warehouse that runs on top of the Hadoop distributed computing framework. Because it works on huge datasets, understanding its features helps you write fast, efficient queries.

What you'll learn

The Hive data warehouse supports analytical processing and generally runs long-running jobs that crunch huge amounts of data. By understanding what goes on behind the scenes in Hive, you can structure your queries to perform well, which makes your data analysis more efficient. In this course, Writing Complex Analytical Queries with Hive, you'll discover how to make design decisions about how to lay out data in your Hive tables. First, you'll dive into partitioning and bucketing, which are ways to reduce the data a query has to process, and you'll cover how and when to use partitioning, bucketing, or both when you set up your tables. Next, you'll be introduced to join operations, including how to deal with large tables and how to run and optimize map-only joins. Lastly, you'll learn windowing functions, which let you write complex queries simply, with no intermediate tables, an important optimization with large datasets. By the end of this course, you'll have an understanding of the little details that make writing complex queries easier and faster.

Course Overview

1min

Course Overview 2m

Using Hive for Analytical Queries

21mins

Partitioning Tables for Faster Queries

42mins

Partitioning: The Logical Equivalent of Indexes 4m
Data Organization with Partitions 5m
Working with a Managed Partitioned Table 6m
When Would You Use Partitions? 2m
Loading from Files into a Partitioned Table 3m
Partitioning an External Table 7m
Partitioning Trade-offs 3m
Introduction to Dynamic Partitioning 4m
Implementing Dynamic Partitioning 5m
Multi-column Partitioning 3m
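
As a rough illustration of why partitioning reduces the data a query has to process, the sketch below models a partitioned table in plain Python as a mapping from partition value to rows, so a filter on the partition column reads only one group. This is not Hive code and not Hive's implementation; the sample rows, the `country` partition column, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (order_id, country, amount).
ROWS = [
    (1, "US", 120.0),
    (2, "IN", 80.0),
    (3, "US", 45.5),
    (4, "SG", 60.0),
    (5, "IN", 15.0),
]


def build_partitions(rows):
    """Group rows by the partition column (country), one group per value."""
    partitions = defaultdict(list)
    for order_id, country, amount in rows:
        partitions[country].append((order_id, amount))
    return dict(partitions)


def total_full_scan(rows, country):
    """Unpartitioned layout: every row is read to answer the query."""
    rows_read = len(rows)
    total = sum(amount for _, c, amount in rows if c == country)
    return total, rows_read


def total_with_pruning(partitions, country):
    """Partitioned layout: only the matching partition is read."""
    part = partitions.get(country, [])
    return sum(amount for _, amount in part), len(part)


partitions = build_partitions(ROWS)
print(total_full_scan(ROWS, "IN"))           # (95.0, 5)
print(total_with_pruning(partitions, "IN"))  # (95.0, 2)
```

Both calls return the same total, but the partitioned version touches two rows instead of five. The gap grows with table size, at the cost of maintaining one group per partition value.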

Bucketing Columns for Faster Joins

38mins

Bucketing: The Logical Equivalent of Hash Tables 5m
The Modulo Operator as a Hashing Function 5m
Working with Bucketed Tables 3m
Bucketing vs. Partitioning 3m
Implementing a Partitioned, Bucketed Table 3m
Advantages of Bucketing 7m
Sorting Records Within a Bucket 3m
Sampling Data from a Hive Table 5m
Bucket Sampling on Hive Tables 5m
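
The idea of using the modulo operator as a hashing function can be sketched in a few lines of plain Python. This is only a model of the concept: the bucket count, the integer `customer_id` key, and the function names are hypothetical, and the hash Hive actually applies depends on the column type.

```python
NUM_BUCKETS = 4  # hypothetical bucket count, fixed when the table is set up


def bucket_for(customer_id, num_buckets=NUM_BUCKETS):
    """Assign an integer key to a bucket with the modulo operator."""
    return customer_id % num_buckets


def build_buckets(customer_ids, num_buckets=NUM_BUCKETS):
    """Distribute keys across a fixed number of buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for cid in customer_ids:
        buckets[bucket_for(cid, num_buckets)].append(cid)
    return buckets


def sample_bucket(buckets, index):
    """Read a single bucket as a sample instead of scanning every bucket."""
    return buckets[index]


ids = list(range(1, 13))
buckets = build_buckets(ids)
print(buckets)                    # [[4, 8, 12], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(sample_bucket(buckets, 1))  # [1, 5, 9]
```

Because the same key always lands in the same bucket, two tables bucketed on the join column with the same bucket count can be matched bucket by bucket, and any one bucket serves as a sample of the whole table.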

Optimizing Hive Joins

47mins

Behind the Scenes: An Introduction to MapReduce 4m
Optimizing Joins: Join Columns and MapReduce Jobs 2m
Implementing a Join Operation 4m
Optimizing Joins: Streaming the Largest Table 3m
Optimizing Joins: Bucketing and Partitioning on the Join Columns 2m
The Left Semi-join Operator 6m
Behind the Scenes: The MapReduce Data Flow 4m
Behind the Scenes: MapReduce for Join Operations 4m
Map-only Joins: The Inner Join 5m
Map-only Joins: The Left Outer Join 3m
Map-only Joins: The Right Outer Join 2m
Map-only Joins: The Full Outer Join 3m
The Bucket Map Join 5m
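
A map-only join can be pictured as holding the small table in memory and streaming the large table past it, so no shuffle or reduce phase is needed. The sketch below is a plain-Python model under that assumption, not Hive's implementation; the `orders` and `customers` data and the `map_side_join` function are hypothetical.

```python
def map_side_join(large_rows, small_rows, keep_unmatched=False):
    """Join two lists of (key, value) rows on key.

    small_rows is loaded fully into an in-memory lookup table, and
    large_rows is streamed through once. With keep_unmatched=True the
    result behaves like a left outer join with the large table on the left.
    """
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    for key, payload in large_rows:
        matches = lookup.get(key)
        if matches:
            for value in matches:
                yield (key, payload, value)
        elif keep_unmatched:
            yield (key, payload, None)


# Hypothetical data: the streamed (large) table and the in-memory (small) table.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
customers = [(1, "Alice"), (2, "Bob")]

inner = list(map_side_join(orders, customers))
left_outer = list(map_side_join(orders, customers, keep_unmatched=True))

print(inner)
# [(1, 'order-a', 'Alice'), (2, 'order-b', 'Bob'), (1, 'order-c', 'Alice')]
print(left_outer[-1])
# (3, 'order-d', None)
```

The approach only works when the smaller table fits in memory. It also suggests why an outer join that must preserve unmatched rows from the in-memory side is harder to run as a map-only job: no single pass over the streamed table can tell which in-memory rows were never matched.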

Windowing Functions

31mins

Introduction to Window Functions 4m
The Running Total and Running Average Implementations 6m
Window Functions with Partitions 6m
Calculating Moving Averages 2m
Calculating Percentage Contributions 3m
The Row Number and Rank Window Functions 4m
Calculating Quantiles 5m
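
To give a feel for what a window function computes, the sketch below reproduces a per-partition running total and row number in plain Python, in a single pass over sorted rows and with no intermediate tables. It is loosely analogous to `SUM(revenue) OVER (PARTITION BY store ORDER BY day)` and `ROW_NUMBER()` over the same window; the `SALES` rows and the column names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (store, day, revenue).
SALES = [
    ("north", 1, 100),
    ("north", 2, 50),
    ("south", 1, 70),
    ("north", 3, 25),
    ("south", 2, 30),
]


def running_totals(rows):
    """Return (store, day, revenue, running_total, row_number) per row.

    Rows are partitioned by store and ordered by day within each store.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for store, group in groupby(ordered, key=itemgetter(0)):
        total = 0  # the running total resets at each partition boundary
        for row_number, (_, day, revenue) in enumerate(group, start=1):
            total += revenue
            result.append((store, day, revenue, total, row_number))
    return result


for row in running_totals(SALES):
    print(row)
# ('north', 1, 100, 100, 1)
# ('north', 2, 50, 150, 2)
# ('north', 3, 25, 175, 3)
# ('south', 1, 70, 70, 1)
# ('south', 2, 30, 100, 2)
```

Every input row appears in the output alongside its aggregate, which is the main difference from a plain GROUP BY that collapses each group to one row.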

Course FAQ

What is Hive used for?

Hive is a data warehouse that works on huge datasets, so any query you run on it is likely to be slow and long-running unless it is structured using the techniques covered in this course.

What will I learn in this course?

This course helps you make design decisions about how to lay out data in your Hive tables. Partitioning and bucketing are ways to reduce the data your query has to process, and you'll learn how and when to use partitioning, bucketing, or both.

What prerequisites do I need?

This course assumes that you have some familiarity with Hive and writing queries for it.

What software is required for this course?

You need Hive v2 running on top of Hadoop 2, and the Beeline command-line interface to connect to Hive locally.

About the author

Janani Ravi

Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework. After spending years working in tech in the Bay Area, New York, and Singapore at companies such as Microsoft, Google, and Flipkart, Janani decided to combine her love for technology with her passion for teaching. She is now the co-founder of Loonycorn, a content studio focused on providing ...
