Course info
Jun 13, 2018
1h 39m

OK, so you are using Azure Data Lakes, and you think it's great. You just wish you could improve the performance of your U-SQL queries. Why does that query always read your entire data set? Why does this query take forever to complete? Like anything else in the Big Data world, your Azure Data Lake has to be structured around your data. This course, Improving Azure Data Lake Performance, will show you how to put the right structure in place. Then watch the magic start to happen! First, you'll see how an Azure Data Lake works behind the scenes – how it handles different types of data and how the storage of that data can be optimized. Next, you'll see how it's possible to optimize non-structured data. Finally, you'll be shown how structuring your data opens up a world of possibilities, including horizontal and vertical partitioning. This is where the real power of the Azure Data Lake comes to light! Horizontal partitioning allows you to defer a lot of control to the Data Lake, whereas vertical partitioning allows you – the developer – to take total control of how your data is partitioned and distributed within the Data Lake. When you're finished with this course, you'll understand how you can better optimize your jobs and save some cash. Software required: Visual Studio Community Edition 2017 with the Azure Data Lake and Stream Analytics Tools installed.

About the author

Mike loves to mess around with data and programming problems, the bigger the better. He’s worked with a variety of companies, helping to build and improve systems of all shapes and sizes.

Section Introduction Transcripts

Course Overview
Hello there! My name is Mike McQuillan, and this course is all about Improving Azure Data Lake Performance. I'm a data specialist consulting with organizations large and small. And I'm here to help you improve your big data queries in Azure. You might be using an Azure Data Lake right now, but are you using it efficiently? Maybe your queries are costing you too much money. Maybe they are taking too long to execute. They might even be timing out right now. If any of this sounds familiar to you, you're in the right place. The course goes into detail about how an Azure Data Lake works and the various ways in which data can be structured within the Azure Data Lake. Correctly structuring your data can lead to big performance gains. We'll cover some major topics including how data is stored and processed within the Azure Data Lake. We'll see how to improve the performance of your U-SQL jobs and how to organize data within your Data Lake from files to databases to indexes. There's in-depth coverage of the two supported partition schemes: horizontal partitioning and vertical partitioning. By the time you reach the end of this course, you'll know the common pitfalls of an Azure Data Lake and the techniques you need to use to avoid them. You'll have learned that with a bit of careful thought, file-based queries can be optimized to greatly reduce the amount of data your queries need to read. You'll also be well versed in horizontal partitioning, including the distribution schemes it supports, and vertical partitioning, which allows developers to take total control of the Azure Data Lake. Before viewing this course, you should already have some experience of general database concepts and, more importantly, Data Lakes. Not a big problem if you don't though. Just watch the Introduction to the Azure Data Lake and U-SQL course on Pluralsight first. I look forward to guiding you to better Data Lake performance.
Come and learn with me on the Improving Azure Data Lake Performance course at Pluralsight.

Why Bother Organizing an Azure Data Lake?
Hello there! I'm Mike McQuillan. Welcome to this course, Improving Azure Data Lake Performance. In this course, we'll investigate the various indexing and partitioning techniques available in the Azure Data Lake. These mechanisms can help us create more efficient queries and reduce Azure costs. We'll begin by looking at why we should organize our Data Lakes and the benefits organization provides. We'll also investigate some of the problems we may encounter if we don't organize or prepare our data before processing, such as data skew and excessive reads. With this foundation in place, we'll move on to look at the indexing options available to us in Azure Data Lakes. Finally, we'll take an in-depth look at horizontal partitioning, in which Azure helps us divide the Data Lake into easy-to-manage slices, and its sibling, vertical partitioning, which hands control to the developer. After we've worked through these subjects, you should be able to make an informed choice about how to structure your data for the best results. To fully benefit from this course, you should have a basic knowledge of the Azure Data Lake. If you don't have this, check out my first course on the subject, Introduction to the Azure Data Lake and U-SQL. Feel free to follow along even if you don't watch the first course, but be aware that the odd bit of terminology may catch you out. Does everything sound good to you? Brilliant! Then let's begin improving Azure Data Lake performance.

Dividing and Conquering an Azure Data Lake
Hello! Mike McQuillan here talking about improving Azure Data Lake performance. This module will focus on why partitioning is worth looking at, specifically discussing the partitioning and distribution of data. We'll then investigate the data distribution schemes available to us in the Azure Data Lake before moving on to a detailed look at how horizontal partitioning works. The module will demonstrate how a seemingly insignificant change can have a massive effect on your U-SQL queries. I'm ready. Are you? Cool! Let's find out how to divide and conquer an Azure Data Lake.
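To give a feel for why a distribution scheme matters, here is a minimal Python sketch of hash distribution. It is a conceptual model only, not how the Azure Data Lake is implemented: the function names, the `vendor` field, and the choice of CRC32 as the hash are all assumptions made for illustration. The point is that once rows are placed into buckets by a hash of a key, a lookup on that key only has to read one bucket instead of the whole data set.

```python
import zlib
from collections import defaultdict


def bucket_for(key: str, bucket_count: int) -> int:
    """Map a key to a bucket number using a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % bucket_count


def distribute(rows, key_field, bucket_count):
    """Place every row into the bucket chosen by hashing its key field."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key_field], bucket_count)].append(row)
    return buckets


def lookup(buckets, key_field, key, bucket_count):
    """Read only the bucket that can contain the key; return the matches and rows read."""
    candidates = buckets.get(bucket_for(key, bucket_count), [])
    matches = [row for row in candidates if row[key_field] == key]
    return matches, len(candidates)


# Hypothetical data: 1,000 rows spread evenly over 10 vendors.
rows = [{"vendor": f"vendor-{i % 10}", "fare": float(i)} for i in range(1000)]
buckets = distribute(rows, "vendor", bucket_count=4)

matches, rows_read = lookup(buckets, "vendor", "vendor-3", bucket_count=4)
print(f"matched {len(matches)} rows after reading {rows_read} of {len(rows)}")
```

The same model also hints at data skew: if one key value dominates the data, its bucket grows far larger than the others, and whichever worker handles that bucket becomes the bottleneck.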

Kicking the Bucket – Manually Dividing an Azure Data Lake
Hello again. I'm Mike McQuillan, and you are watching the Improving Azure Data Lake Performance course on Pluralsight. In this final module of the course, we'll see how to kick the bucket and manually divide an Azure Data Lake. We'll see the benefits gained from using vertical partitioning, such as being able to insert and delete sections of data instead of the entire table. The INSERT statement used for vertically partitioned tables is a little different to the one you may be used to, so we'll take a look at that too. Vertical partitioning can give precise control over partitions, which can help improve performance of U-SQL queries. We'll try out a few just to be sure. Are you ready to kick the bucket? I'm not, but I am ready to learn about vertical partitioning. Let's kick on.
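The idea of developer-managed partitions can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Data Lake's storage engine or U-SQL syntax: the `PartitionedTable` class, its method names, and the month-style partition keys are all hypothetical. It shows the two properties described above: a partition must exist before rows can be inserted into it, and one section of data can be cleared and reloaded without touching the rest of the table.

```python
class PartitionedTable:
    """Toy table whose partitions are created and managed explicitly by the caller."""

    def __init__(self):
        self._partitions = {}

    def add_partition(self, key):
        """Create an empty partition; the developer decides which partitions exist."""
        if key in self._partitions:
            raise ValueError(f"partition {key!r} already exists")
        self._partitions[key] = []

    def insert(self, key, rows):
        """Insert rows into a named partition, which must already exist."""
        if key not in self._partitions:
            raise KeyError(f"partition {key!r} does not exist")
        self._partitions[key].extend(rows)

    def truncate_partition(self, key):
        """Remove the rows of one partition, leaving every other partition untouched."""
        self._partitions[key].clear()

    def scan(self, key):
        """Read a single partition instead of the whole table."""
        return list(self._partitions.get(key, []))

    def row_count(self):
        """Total rows across all partitions."""
        return sum(len(rows) for rows in self._partitions.values())


table = PartitionedTable()
table.add_partition("2018-01")
table.add_partition("2018-02")
table.insert("2018-01", [{"trip": 1}, {"trip": 2}])
table.insert("2018-02", [{"trip": 3}])

# Reload January only; February's rows are never read or rewritten.
table.truncate_partition("2018-01")
table.insert("2018-01", [{"trip": 10}])

january = table.scan("2018-01")
february = table.scan("2018-02")
print(january, february, table.row_count())
```

Requiring the partition to exist before an insert is a deliberate choice in this sketch: it mirrors the point that the developer, not the system, decides how the data is divided.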