SQL on Hadoop - Analyzing Big Data with Hive

This course will teach you the Hive query language and how to apply it to solve common Big Data problems. It includes an introduction to distributed computing, Hadoop, and MapReduce fundamentals, as well as the latest features released with Hive 0.11.
Course info
Rating
(553)
Level
Intermediate
Updated
Oct 8, 2013
Duration
4h 16m
Table of contents
Introduction to Hadoop
Introduction to Hive
Hive Query Language
Advanced HiveQL
Storage and The Eco-System
Description

From developer to analyst, this course tackles a few big questions about big data: Why does this technology exist and why do I need it? How can I get the best out of it using something familiar like SQL, and how does this all fit together in an ever-evolving ecosystem? This course introduces the concepts of distributed computing, Hadoop, and MapReduce, and then goes into great detail on Apache Hive, a SQL-like query language that can be used with Hadoop and NoSQL databases like HBase and Cassandra. The course presents some challenges you might experience solving real production problems and shows how Hive makes that task easier to accomplish.

About the author

Ahmad is a Data Architect specializing in the implementation of high-performance data warehouses and BI systems and enjoys speaking at various user groups and conferences.

Section Introduction Transcripts

Introduction to Hadoop
Hi. This is Ahmad Alkilani from Pluralsight. In this course, we're going to discuss perhaps one of the more interesting projects in the Hadoop ecosystem, namely Apache Hive, which, by providing a SQL-like interface, expands the reach of big data systems to traditional database professionals. We'll cover the design and architecture of Hive, usage patterns, and business drivers, and we'll look at how to work with various queries in examples. We'll take an in-depth look at the Hive query language, performance optimizations, and how you can extend the language with your own functions. We'll also cover some of the most recent analytical and performance-specific additions in Hive's 0.11 release, and then we'll cover a few other ecosystem projects, like Sqoop and HCatalog. In the first module of this course, however, we'll take a step back and look at Hadoop, why it came into existence and became so popular, and walk through a few demos. This should get you up and running with the information you'll need to get the most out of this course. Before we get into the details of Hive and the Hive query language, let's first take a look at some Hadoop concepts to give you a better understanding and a foundation to build upon as we discuss more advanced topics. In this module, we'll cover the motivation behind Hadoop as a predominant player in the Big Data arena. Then we'll talk a bit about Hadoop's architecture and distributed computing, just enough to cover the fundamentals and get you going. Then we'll walk through Hadoop's core components, namely HDFS and MapReduce, so you're familiar with them and can follow along with the examples and demonstrations throughout the course. Finally, we'll end with a demonstration to get you up and running for the remainder of the course.

Introduction to Hive
Hi. This is Ahmad Alkilani from Pluralsight. In the previous module, we looked at some Hadoop basics, from understanding why the technology became relevant to the details of how the framework works. We introduced MapReduce and HDFS, and then we looked at how to use Hadoop and set up our environment. In this module, we're going to build on that and introduce Hive, which is a SQL-like query language that works with Big Data systems like Hadoop. So, let's get started. In this module, we'll look at the motivation behind creating Hive so we have a better understanding of how Hive fits into the ecosystem, and how and where it can be utilized. And then we'll look at Hive's architecture. After that, we'll discuss some of Hive's principles. We'll demonstrate how Hive gives structure to unstructured data by providing a schema, and what schema on read means. And then we'll look into the Hive warehouse, what it really means from a Hadoop perspective, or in terms of HDFS, to have something called a Hive warehouse, and what Hive calls Hive managed tables versus external tables. And then we'll look at some HiveQL, which is the Hive query language. We'll look at how you can write SELECT statements and subqueries, how you can CREATE a DATABASE, and then CREATE TABLES on top of your data. And then we'll go through an extensive demo of working with Hive to showcase these concepts and bring everything together.

Hive Query Language
Hi. This is Ahmad Alkilani from Pluralsight. In this module, we'll start off by looking at Hive's data types, and then take a deeper dive into HiveQL and the Hive warehouse by looking at how to load and organize data into partitions, both statically defined and dynamically created. We'll also see how to efficiently run multiple queries for better performance. Then we'll discuss some Hive functions, how to use aggregates, how to group results, and how to use cube and rollup. Then we'll look into sorting data, and how to control the Shuffle and Sort phase using custom distribution and clustering. And we'll finally tie things together by looking at how the Hive CLI can be utilized to execute files in batch mode and substitute variables at runtime.

Advanced HiveQL
Hi. This is Ahmad Alkilani from Pluralsight. In this module, we're going to take a look at quite a few concepts, specifically around extending Hive, so let's take a look at our outline. We're first going to talk about bucketing, and how that relates to organizing your data. And then we'll take a look at how you can sample your data using both bucket and block sampling. And then we're going to look at joins, and we'll first discuss the different types of joins, and then we'll look at joins in depth, and how joins are implemented in MapReduce. And then we'll discuss various join optimization techniques. And then we're going to introduce the distributed cache, and how you can use it. And then once we have all those concepts wrapped up, we'll look at some advanced Hive functions, like table-valued functions, or UDTFs, and then we'll look at how you can use the lateral view to further manipulate the results of a UDTF. And then we'll look into extending Hive with our own user-defined function that we'll create. And then we'll also create a transformation script using streaming. And we'll also discuss windowing and analytical functions. So without any further ado, let's get started.