HDInsight is Microsoft's managed Big Data stack in the cloud. With Azure you can provision clusters running Storm, HBase, and Hive that can process thousands of events per second, store petabytes of data, and give you a SQL-like interface to query it all. In this course, we'll build out a full solution using the stack and take a deep dive into each of the technologies.
Storm is a distributed compute platform which you can plug into Azure Event Hubs and use to power event stream processing. You can scale Storm to read tens of thousands of events per second and build a reliable workflow so that every event is guaranteed to be processed. HBase is a NoSQL database which is easy to get started with and can store tables with billions of rows and millions of columns. It's built for real-time data access, and it has a REST interface so you can read and write HBase data from a .NET Storm app. Hive is a data warehouse that provides a SQL-like interface over Big Data - HBase tables and other sources. With Hive you can join across multiple sources and run queries from PowerShell and .NET. In this course, we use all three technologies running on Microsoft Azure to build a race timing solution and dive into performance tuning, reliability, and administration.
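To make the REST access concrete: the HBase REST API expects row keys, column names, and cell values to be base64-encoded inside a JSON envelope, whatever client language you use. Here's a minimal Python sketch of building that payload - the table layout (a row per racer with a timing column family) is a hypothetical example, not the course's actual schema.

```python
import base64
import json

def b64(s: str) -> str:
    """The HBase REST API requires keys, columns, and values base64-encoded."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")

def make_put_body(row_key: str, family: str, qualifier: str, value: str) -> str:
    """Build the JSON body for a PUT to /<table>/<row-key> on the HBase REST API."""
    return json.dumps({
        "Row": [{
            "key": b64(row_key),
            "Cell": [{
                "column": b64(f"{family}:{qualifier}"),  # column is "family:qualifier"
                "$": b64(value),                          # "$" holds the cell value
            }],
        }]
    })

# Hypothetical race-timing cell: one row per racer, "t" column family for timings.
body = make_put_body("racer-0042", "t", "sector1", "00:01:23.456")
```

A client would send this body with an HTTP PUT to the cluster's REST endpoint (on HDInsight, a URL under `https://<cluster>.azurehdinsight.net/` with basic authentication) - the exact endpoint path depends on your cluster configuration.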
HBase Deep Dive Hey, how are you doing? I'm Elton and this is HBase Deep Dive, the next module in HDInsight Deep Dive: Storm, HBase and Hive. In the previous module we saw how to use HBase logically - how to structure your tables and how to access data using different clients. In this module we're going to look a lot more closely at the physical workings of HBase. The aim is to give you an understanding of how HBase works under the hood so you can make the right decisions when you design and implement your HBase tables. And we'll also look at performance and administration so you feel comfortable running an HBase cluster on HDInsight, knowing you can diagnose any problems that you have and can scale your cluster to suit your needs. HBase is an open source project from Apache. It uses core functionality from Hadoop, but it's a complex project - the code base runs to over 4 million lines of code. We won't learn enough here to make you a core committer to HBase, but we will learn enough to make you confident about running HBase for a production solution. Let's start by looking at what the different nodes of the HBase cluster actually do.
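One physical detail worth fixing in your head before the module: HBase stores rows in lexicographic order of row key and splits each table into regions, where each region covers a contiguous key range and is served by one region server. This sketch illustrates that lookup with hypothetical split boundaries (the boundary keys here are invented for illustration):

```python
import bisect

# Each region starts at a boundary key; the first region starts at the empty key.
# These boundaries are hypothetical - HBase chooses real splits as regions grow.
region_start_keys = ["", "racer-3000", "racer-6000"]

def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key,
    using the same lexicographic ordering HBase applies to row keys."""
    return bisect.bisect_right(region_start_keys, row_key) - 1

region_for("racer-0042")  # lands in the first region
region_for("racer-4500")  # lands in the second region
```

This is why row key design matters so much: keys with a common prefix (like a timestamp) all sort into the same region, concentrating writes on one region server instead of spreading the load.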
Querying Race Data with Hive Hey, how are you doing? I'm Elton and this is Querying Race Data with Hive, the next module in HDInsight Deep Dive: Storm, HBase, and Hive. In this module we're going to finish the functional part of our race timing solution, building queries to get meaningful race results from the HBase data. We'll do that using Apache Hive, which is a data warehousing component that can sit on top of multiple data sources, and we'll see how adding Hive on top of HBase makes your big data easy to query from any angle. We'll look at how to make the semi-structured data in HBase into a more rigid structure for querying, and we'll use the Hive query language, HiveQL, to get rich results from our raw data. You can query data from different sources in Hive, and to build our complete results output for a race, we'll join the race result data in HBase with CSV files containing individual racer details and get a joined-up result at the end. Hive isn't just for reading data, and to complete our solution we'll write the full result set for a race to a file stored in a public container in Azure Storage. That file will be in a blob which lives outside of HBase, so it will be accessible even without an HBase cluster running. Let's start with a closer look at the Hive queries I ran in the last module as part of the Storm performance tests.
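To give a flavor of the HiveQL involved: Hive maps an HBase table into a relational shape with the HBase storage handler, maps a CSV file as an external table over blob storage, and then both can appear in one join. The statements below are a sketch with hypothetical table, column family, and storage account names - not the course's actual schema - held as Python strings so the pieces stay together.

```python
# Map an HBase table into Hive: ":key" binds the row key, "t:sector1" binds a
# cell in the "t" column family. Table and column names here are hypothetical.
hbase_table_ddl = """
CREATE EXTERNAL TABLE race_times (racer_id STRING, sector1 STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,t:sector1')
TBLPROPERTIES ('hbase.table.name' = 'RaceTimes')
"""

# Map a CSV of racer details stored in an Azure blob container (wasb:// is the
# Azure blob scheme HDInsight uses; account and path are hypothetical).
csv_table_ddl = """
CREATE EXTERNAL TABLE racers (racer_id STRING, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://data@myaccount.blob.core.windows.net/racers/'
"""

# Join the HBase data with the CSV data in one HiveQL query.
join_query = """
SELECT r.name, t.sector1
FROM race_times t
JOIN racers r ON t.racer_id = r.racer_id
"""
```

The point of the sketch is the shape: once both sources are declared as Hive tables, the join itself is plain SQL, regardless of where the bytes actually live.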
Hive Deep Dive Hey, how are you doing? I'm Elton and this is Hive Deep Dive, the final module in HDInsight Deep Dive: Storm, HBase, and Hive. In this module we're going to look more closely at what happens when you submit a Hive query - how it actually gets processed and how you can monitor long-running jobs through another web UI. I've filled my HBase tables with a lot more data, and we'll see if that has a big impact on the performance of the race results calculation. We'll look at Hive query plans to see where the time gets spent, and we'll look at alternative approaches so we can run the same calculation more quickly. We'll submit Hive queries using PowerShell, which lets us break the calculation up into more efficient steps, and we'll look at encapsulating the whole calculation logic in a .NET application that connects directly to Hive using an ODBC provider. We'll also look at tidying up our queries, centralizing some of that ugly logic with a custom user-defined function, which we can write in .NET and use in Hive queries. We'll see how to add secondary Azure storage accounts to the cluster so that we can keep our HBase storage container private and generate results data in a public container from Hive. And we'll finish off our cluster provisioning script by automating the Hive data definitions, so when the startup script has run, the whole cluster is ready to go. Lastly, we'll recap what we've built using HDInsight with our race timing solution, which uses big data technologies to provide a lot of high performance compute from very little custom code. Let's start by looking at how Hive actually processes a query when you submit it.
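On the ODBC route: whatever the client (a .NET OdbcConnection or Python's pyodbc), the work is mostly in the connection string. Here's a minimal sketch of building one for Hive on HDInsight - the driver name, keyword set, and authentication value are assumptions that depend on the Hive ODBC driver you have installed, so check them against your driver's documentation.

```python
def hive_odbc_connection_string(cluster: str, user: str, password: str) -> str:
    """Assemble a key=value;... ODBC connection string for Hive on HDInsight.
    Driver name and keywords are assumptions - verify against your driver."""
    parts = {
        "Driver": "{Microsoft Hive ODBC Driver}",
        "Host": f"{cluster}.azurehdinsight.net",  # HDInsight's public endpoint
        "Port": "443",                            # HDInsight exposes Hive over HTTPS
        "HiveServerType": "2",                    # HiveServer2
        "AuthMech": "6",                          # auth mechanism code varies by driver version
        "UID": user,
        "PWD": password,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = hive_odbc_connection_string("racecluster", "admin", "secret")
```

From .NET you'd pass the equivalent string to an `OdbcConnection` and run HiveQL through ordinary `OdbcCommand` objects, which is what makes it possible to wrap the whole results calculation in one small application.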