By 2022, the global market for Hadoop is predicted to exceed $87 billion. That’s a huge market and one of the reasons data engineers are in such high demand. In this course, Getting Started with Hortonworks Data Platform, you will learn how to build a big data cluster using the Hortonworks Data Platform. First, you will explore how to navigate your HDP cluster from the command line. Next, you will discover how to use Ambari to automate your Hadoop cluster. Finally, you will learn how to set up rack awareness for multi-server clusters. By the end of this course, you will be ready to build your own data cluster using the Hortonworks Data Platform.
Course Overview
Hi everyone. My name is Thomas Henson, and welcome to my course, Hortonworks Getting Started. I'm a course author at Pluralsight and a data engineering advocate in the big data community. By 2020, the global market for Hadoop is predicted to be over 87 billion. Think about that. $87 billion. That's a huge market and one of the reasons data engineers are in such high demand. This course is devoted to training future data engineers to build out their first big data cluster using the Hortonworks Data Platform. Hortonworks is one of the largest contributors to the big data analytics open source community and one of the core contributors to the start of Hadoop. Some of the topics that we will cover include learning how to use Ambari to automate our Hadoop cluster, from adding nodes to installing other data analytics components like Spark, Pig, and Hive; navigating our HDP cluster from the command line; setting up rack awareness for multi-server clusters; and understanding the changes that are coming in Hadoop 3.0. Those are going to be some huge changes. By the end of this course, you'll know how to build and manage a Hortonworks Data Platform cluster. But before beginning this course, you should be familiar with installing Linux in a physical or virtual environment, and you should have some basic Linux command line skills. I hope you'll join me on this journey to learn about building Hadoop clusters with the Hortonworks Getting Started course, at Pluralsight.
Why Hortonworks Data Platform?
Hi. I'm Thomas Henson from Pluralsight, and this course is Getting Started with the Hortonworks Data Platform. Using Hadoop to analyze unstructured data is a huge industry trend. Gartner predicts that data is doubling every 2 years, and 80% of that data is unstructured data. The surge in data and the need to analyze that data are the reasons for the popularity of the Hadoop platform and the ecosystem partners around that platform. In this course, I'm going to show you how to set up a Hadoop cluster using the Hortonworks Data Platform, and after setting up an HDP cluster, I'll walk you through the basic tasks a Hadoop administrator needs to know to get started running their first HDP cluster, from installing Ambari to running Hadoop administration commands from the command line. Let's jump in and see what we're going to cover in this first module. Before building a cluster, I want to take a step back and make sure that you understand what Hadoop is and how Hadoop can solve your problems with data analytics. With data doubling every two years, it's essential to know where Hadoop comes from and where it's headed. Next, I'll talk about Hortonworks' approach to Hadoop and why to use a Hadoop package instead of rolling your own, and I'll also give a quick overview of the other Hadoop packages and platforms, and then the ecosystem partners for Hortonworks as well. Then we'll touch on the Open Data Platform initiative to explain how it's helping to accelerate Hadoop adoption worldwide. And then lastly, I'll cover how we're going to set up our Hortonworks environment and where all the downloads and supporting documentation are. Don't worry, everything that we're going to use is 100% open source. Now let's shift our focus to understanding what Hadoop is.
Installing HDP with Ambari
Hi, folks. Welcome back to the Hortonworks Getting Started course. In this module, we're going to talk about how to install HDP using Ambari. This is where we get really hands-on and start building our cluster from the ground up. So let's see what we're going to cover in this module. First, we're going to start out by defining what Ambari is, and see where it fits in the Hadoop ecosystem and why it's such a major component for HDP. Next, we'll set up our virtual machines, and then on each one of our nodes we're going to set up password-less SSH so that we can get up and running with Ambari. After we have our nodes set up with password-less SSH, we can configure the Ambari Agent so that it will do the install, patching, and updating for all of the nodes in our cluster. And finally, we're going to discuss how node management fits into Ambari and how we can add and remove nodes, because as our cluster gets up and running and we start to add more use cases, we're going to have to expand our cluster and add nodes for new use cases, but also be able to decommission and take away nodes for maintenance. Now let's jump in and see what our lab overview is going to look like, and then the use case that we're going to follow so we can better understand how to install HDP to build our first Hadoop cluster.
Administering Hadoop with HDP
Welcome back to Getting Started with the Hortonworks Data Platform. In this module, we're going to talk about administering HDP. By now, Matt from Mailbox Movies has the Hadoop environment set up using HDP. The SQL team is starting to see how they can gain insights by using Hive and Pig, and some of the other tools for data analytics. Now it's time for Matt to turn his focus to administering HDP, because he wants to know what he can do as the system starts to scale. So in this module we're going to cover all the topics that Matt needs to know to start administering HDP. Let's see what we're going to cover in this module. The first thing we're going to talk about is how to protect data as this data scales, and in particular as these racks scale. So we'll talk about what rack awareness is. Next, as Matt is taking in terabytes and petabytes of data, we need to talk about some of the administrative tasks that he's going to need to do to make sure that those data blocks are replicated, and also to make sure that his data is protected, all in HDP. Next, we're going to learn how we can tweak and tune some of our configuration files in HDP. And lastly, we want to make sure Matt stays updated with all the things that are going on in his cluster, so we'll show him how to set up custom alerts. Now let's jump in and look at what rack awareness is.
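To make rack awareness a little more concrete, here is a minimal sketch of a rack topology script in Python. Hadoop can call a script like this (pointed to by the `net.topology.script.file.name` property) with one or more host names or IP addresses as arguments, and it reads one rack path per argument from standard output. The hosts, rack names, and the `resolve_racks` helper below are all assumptions made up for illustration, not values from the course, so treat this as a starting point rather than a finished script.

```python
#!/usr/bin/env python3
"""Sketch of a rack topology script for Hadoop rack awareness."""
import sys

# Hypothetical host-to-rack mapping; replace with your own nodes and racks.
RACK_BY_HOST = {
    "10.0.0.11": "/rack1",
    "10.0.0.12": "/rack1",
    "10.0.0.21": "/rack2",
    "node4.example.com": "/rack2",
}

# Rack reported for any host that is not in the mapping above.
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command line arguments and expects
    # whitespace-separated rack paths on standard output.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

A lookup table keeps the sketch easy to read; on a larger cluster you would more likely derive the rack from a subnet or load the mapping from a file.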
HDP from the Command Line
Hi folks. Welcome back to the Hortonworks Getting Started course. In this module, we're going to cover HDP from the command line. I know we've already set up our cluster and we've seen some of the administrative tasks that we can do, but there's going to come a time when you have to get behind the scenes and go to the command line to solve some of your problems as an administrator or developer, or just as a good data engineer. So I want to get you familiar with what you can do from the command line and help you understand the structure of HDFS and some other components. Let's see what we'll talk about in this module. The first thing we want to cover is how you can navigate the Hadoop Distributed File System, or HDFS. This is where all of our data goes and where the components live, so whenever we upload a file or any other sort of data, it's going to go into HDFS, and we want to be familiar with how that file system is structured and how we can move data into it. Next, we're going to talk about using some of the commands, so we'll go through the hadoop fs and hdfs dfs commands just to have a base-level understanding of what these commands do and how you can move data using them. And then last, we're going to cover some of the Hadoop administrative commands, and those are some special commands that you'll want to know. They're probably not something that you're always going to remember, but I want you to be familiar with them so that when you need them, it's not the first time you've used them, and you'll know how to use them and where to reference them for future administrative tasks. So now let's jump in and find out how we can navigate around in HDFS.
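As a rough preview of the kind of commands this module is talking about, here is a small Python sketch that builds a few common `hdfs dfs` commands and prints them. The `-ls`, `-mkdir -p`, `-put`, and `-du -h` options are standard HDFS shell options, but the paths, the `ratings.csv` file, and the `hdfs_dfs` and `run` helpers are assumptions invented for this example. By default the sketch only prints the commands, so it is safe to run on a machine without Hadoop installed.

```python
import shlex
import subprocess


def hdfs_dfs(*args):
    """Build the argument list for an `hdfs dfs` command."""
    return ["hdfs", "dfs", *args]


def run(cmd, dry_run=True):
    """Print the command, or execute it when dry_run is False."""
    if dry_run:
        print(shlex.join(cmd))
        return None
    # Requires the hdfs client on the PATH and a reachable cluster.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Hypothetical paths and file names, used only for illustration.
commands = [
    hdfs_dfs("-ls", "/"),                                    # list the HDFS root
    hdfs_dfs("-mkdir", "-p", "/user/matt/ratings"),          # create a directory
    hdfs_dfs("-put", "ratings.csv", "/user/matt/ratings/"),  # upload a local file
    hdfs_dfs("-du", "-h", "/user/matt"),                     # show space used
]

if __name__ == "__main__":
    for cmd in commands:
        run(cmd)
```

Passing `dry_run=False` to `run` would execute the command against a real cluster, which is only worth doing once the paths match your own environment.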
What’s Next in HDP?
Well, that's great! Congratulations! You set up your first Hadoop environment using HDP, and you know how to manage that environment, how to add nodes, and how to do some of the basic functionality. You're ready to become a more seasoned data engineer. So, what do you need to learn next, and what are some of the new challenges that we want to start tackling as a beginner data engineer? Or maybe you're a data engineer who's been doing this for a while and you're already seasoned, but you want to know what some of the next new features are going to be with Hadoop or HDP, or just some of the technology trends. That's what we're going to cover in this module. The first thing that we'll tackle is some of the challenges with Hadoop. Hadoop's been out there for a long time and it's been in the field, but now that it's become an enterprise application there are some challenges with that. We're not just a bleeding-edge technology anymore, and there are some things that we need to do to really have that enterprise focus. So we're going to talk about some of the Hadoop 3.0 updates. This is a huge change for the Hadoop community, and I want to make sure that you understand what Hadoop 3.0 is going to do that's different from Hadoop 2.0, and even 1.0. Now that you're a data engineer, I also want to break down what your role as a data engineer is versus what a data scientist's role is, because there's a lot of confusion in the community. I want you to understand which things you need to focus on as a data engineer, but there may be some things from the data scientist's perspective that you want to look at as well. Let's go ahead and talk about some of the differences between those two roles.