Deploying Hadoop with Cloudera CDH to AWS

Learn how to deploy, size, and scale Hadoop in the cloud (namely AWS). You'll understand key concepts to deploy a CDH cluster, perform a manual installation, and finally learn how to automate deployments for multiple clusters with Cloudera Director.
Course info
Level: Intermediate
Updated: Oct 16, 2017
Duration: 3h 30m
Table of contents
Course Overview
Why Hadoop in the Cloud? The Case for CDH on AWS
Understanding the Cloud: An AWS Mini Crash Course
More "AWS Mini Crash Course"
Planning Your Hadoop Cluster on AWS
Deploying, Sizing, and Scaling Your CDH Cluster on AWS
Automating Deployments & Managing Clusters with Cloudera Director
Cloudera Altus and Final Takeaway
Description

Many years ago, hardware costs were steep. It was not unusual for a project with large amounts of data to require seven figures' worth of hardware just to get started. But times have changed. With cloud services, you can now store data cheaply, spin up as many servers as you need with the specs you want, process the data, and get the answers you need. In this course, Deploying Hadoop with Cloudera CDH to AWS, you will learn how to deploy Hadoop in the cloud. First, you'll learn about some key concepts. Then, you'll learn how to perform a deployment manually. Finally, you'll learn about a specialized tool called Cloudera Director that helps automate deployments for both transient and long-running clusters. You will also learn about some differences between AWS and Azure/GCE. These differences can matter if you are working on a different platform, but they are by no means blockers for someone already familiar with their own platform. By the end of this course, you will be able to better manage your cloud needs.

About the author

Xavier is very passionate about teaching and helping others understand search and Big Data. He is also an entrepreneur, project manager, technical author, and trainer, and holds a few certifications with Cloudera, Microsoft, and the Scrum Alliance, along with being a Microsoft MVP.

Section Introduction Transcripts

Course Overview
Hi everyone, my name is Xavier Morera, and welcome to my course, Deploying Hadoop with Cloudera CDH to Amazon Web Services. I am very passionate about teaching, primarily helping developers understand search and Big Data. Here's a fun fact: did you know that the amount of data in the world right now is estimated at around 5 ZB and expected to grow up to 44 ZB by 2020? That's 44 trillion gigabytes. And at the moment, less than 0.5% of the data is ever analyzed. Imagine the possibilities of what you can discover with the help of Big Data. In this course we're going to learn how to deploy Hadoop in the cloud using Cloudera's distribution, known as CDH, on AWS to be precise. Some of the major topics that we will cover include preparing the prerequisites in AWS to deploy Hadoop. The cloud has many features, but there is only a small subset that we need to know. Next comes the planning required before deploying, which includes security, capacity planning, and understanding best practices for the different workload types. Then, we will deploy CDH manually, a process similar to deploying on-prem, but I will highlight the steps that differ. And finally, we'll learn how to automate cluster deployment and management with Cloudera Director. By the end of this course, you will be prepared to take your Big Data to the cloud, taking advantage of the flexibility and power that AWS has to offer. Before beginning the course, it is desirable that you know Linux and have an overall idea of CDH and AWS, but if you don't, that is okay, as I will present detailed steps that are easy to follow. I hope you'll join me on this journey to learn about Cloudera in AWS with the Deploying Hadoop with Cloudera CDH to Amazon Web Services course at Pluralsight.
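
As a quick check of the unit conversion mentioned in the overview, here is a minimal sketch. It assumes decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes; the function name is illustrative only and not part of the course material.

```python
# Assumption: decimal (SI) prefixes, not binary units such as ZiB or GiB.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21
TRILLION = 10**12


def zettabytes_to_gigabytes(zb: int) -> int:
    """Convert a whole number of zettabytes to gigabytes (SI units)."""
    return zb * BYTES_PER_ZB // BYTES_PER_GB


# 44 ZB expressed in trillions of gigabytes.
print(zettabytes_to_gigabytes(44) // TRILLION)  # prints 44
```

Under that assumption, one zettabyte is one trillion gigabytes, so 44 ZB works out to 44 trillion GB.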

Understanding the Cloud: An AWS Mini Crash Course
Understanding the Cloud: An AWS Mini Crash Course. At this point, we should already be convinced of why moving to the cloud is a great idea, especially if you're working on something with high resource requirements like big data, and it is very important that you understand the basics of the cloud well. In the next two modules, we'll cover the key concepts required to take your Hadoop cluster to the AWS cloud with Cloudera. Disclaimer: if you're a total AWS expert, you might feel tempted to skip ahead, but I recommend you stay because we're going to get your cloud environment ready. Of all the thousands of things available in AWS, there are about 10 things that you really need to know, and we will cover them in this module.

More "AWS Mini Crash Course"
So far we have taken the first steps to get started with Amazon Web Services, also known as AWS. It is time to continue our journey with understanding the cloud, an AWS mini crash course, part two of two.

Planning Your Hadoop Cluster on AWS
Planning Your Hadoop Cluster on AWS. Here are the topics that I would like to cover in this module: First, security. Security first. Always remember to take security very seriously unless you want to be in the news in a very bad way, or even in a court of law. Capacity Planning, or how to size your cluster properly. Then, we will talk about Architectural Best Practices. Basically, how to get it done right, and finally, Preparing Cluster Deployment, namely, the prerequisites to start setting up your cluster. So let's go with Security First.

Deploying, Sizing, and Scaling Your CDH Cluster on AWS
Deploying, Sizing, and Scaling Your CDH Cluster on AWS. So far, we've talked about several concepts that are required to set up your cluster in AWS, but we don't yet have a cluster, so creating one is what we're going to do now, and here's how we're going to do it. First, we decide how to install the cluster, more on this soon. Then, we deploy Cloudera Manager and deploy the agents. At this point, we do a quick overview of Cloudera Manager, and we will learn how to add nodes and remove nodes, which is called scaling horizontally, and then we'll learn how to scale vertically, that is, change the size of the nodes in the cluster as well as the disk space in a particular node. Let's get started.
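
To make the difference between the two kinds of scaling concrete, here is a toy sketch in plain Python. It is only an in-memory model: it does not call AWS or Cloudera Manager, and the class names, node names, and sizes are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one cluster node."""
    name: str
    vcpus: int
    memory_gb: int
    disk_gb: int


@dataclass
class Cluster:
    """Toy in-memory cluster; not a real Cloudera or AWS API."""
    nodes: List[Node] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Horizontal scaling: grow the cluster by adding a node.
        self.nodes.append(node)

    def remove_node(self, name: str) -> None:
        # Horizontal scaling: shrink the cluster by removing a node.
        self.nodes = [n for n in self.nodes if n.name != name]

    def resize_node(self, name: str, **changes: int) -> None:
        # Vertical scaling: same node count, but one node gets
        # a different size (CPU, memory, or disk space).
        self.nodes = [
            replace(n, **changes) if n.name == name else n
            for n in self.nodes
        ]

    def total_disk_gb(self) -> int:
        return sum(n.disk_gb for n in self.nodes)


cluster = Cluster()
cluster.add_node(Node("worker-1", vcpus=4, memory_gb=16, disk_gb=500))
cluster.add_node(Node("worker-2", vcpus=4, memory_gb=16, disk_gb=500))

# Scale vertically: a bigger worker-1 and more disk on worker-2.
cluster.resize_node("worker-1", vcpus=8, memory_gb=32)
cluster.resize_node("worker-2", disk_gb=1000)
```

In this model, horizontal scaling changes how many entries are in `nodes`, while vertical scaling changes the attributes of an existing entry and leaves the node count alone.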

Automating Deployments & Managing Clusters with Cloudera Director
Automating Deployments and Managing Clusters with Cloudera Director. In previous modules and trainings, we first learned how to set up the prerequisites to deploy a Hadoop cluster with CDH on AWS, and then we took the necessary steps to get our cluster up and running. And it doesn't matter if you took path A or path B, at the end of the day, it was a manual process. And with so many steps, it is easy to make a mistake. And we want to avoid making mistakes. We want our clusters to be deployed with the precision of an expensive handmade watch. Automation is the way to go, and we do it using Cloudera Director. Cloudera Director makes it simple, like an easy button for provisioning, managing, and de-provisioning one or many clusters in a predictable and efficient way. Director starts by deploying Cloudera Manager. It takes care of the prerequisites too, and then deploys one or many clusters, taking advantage of parallelism, making the whole process fast and efficient. It's not just an easy button, it's Cloudera Director. At a high level, let's compare the options that you have when you're deploying a cluster. With the installation paths, we have a set of instructions that you can follow. You complete them and manually deploy a cluster, and manually perform any steps necessary to modify the cluster. Since you are doing a lot of steps, it is easier to make a mistake. On the other hand, with Cloudera Director, you get a tool to manage your cloud infrastructure. Cloudera Director automates the cluster deployment process and lets you provision clusters in a consistent fashion. Best of all, it works with multiple cloud providers.