Architecting Big Data Solutions Using Google Dataproc

Dataproc is Google’s managed Hadoop offering on the cloud. This course teaches you how separating storage from compute lets you run clusters purely for processing data rather than storing it, and so use them far more efficiently.
Course info
Level
Beginner
Updated
Nov 1, 2018
Duration
2h 17m
Table of contents
Course Overview
Introducing Google Dataproc for Big Data on the Cloud
Running Hadoop MapReduce Jobs on Google Dataproc
Working with Apache Spark on Google Dataproc
Working with Pig and Hive on Google Dataproc
Description

When organizations plan their move to the Google Cloud Platform, Dataproc offers the same features as their on-premises Hadoop deployments, plus powerful additional paradigms such as the separation of compute and storage. Dataproc allows you to lift and shift your Hadoop processing jobs to the cloud and store your data separately in Cloud Storage buckets, effectively eliminating the need to keep your clusters running at all times. In this course, Architecting Big Data Solutions Using Google Dataproc, you’ll learn to work with managed Hadoop on the Google Cloud and the best practices to follow when migrating your on-premises jobs to Dataproc clusters. First, you'll delve into creating a Dataproc cluster and configuring firewall rules so you can access the cluster manager UI from your local machine. Next, you'll discover how to use the Spark distributed analytics engine on your Dataproc cluster. Then, you'll explore how to write code to integrate your Spark jobs with BigQuery and Cloud Storage buckets using connectors. Finally, you'll learn how to use your Dataproc cluster to perform extract, transform, and load operations using Pig as a scripting language and work with Hive tables. By the end of this course, you'll have the knowledge you need to work with Google’s managed Hadoop offering and a sound idea of how to migrate the jobs and data on your on-premises Hadoop cluster to the Google Cloud.
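To make the first step concrete, here is a minimal sketch of creating a cluster with the google-cloud-dataproc Python client library. The project ID, region, cluster name, and machine types below are placeholder assumptions for illustration, not values from the course, which works through the web console and the command line.

```python
# A minimal sketch of creating a Dataproc cluster programmatically with the
# google-cloud-dataproc client library. All names below are hypothetical.
from google.cloud import dataproc_v1 as dataproc

project_id = "my-project"            # placeholder project ID
region = "us-central1"               # placeholder region
cluster_name = "analytics-cluster"   # placeholder cluster name

# The client must point at the regional Dataproc endpoint.
cluster_client = dataproc.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        # Keep the cluster small: persistent data lives in Cloud Storage,
        # not on cluster disks, so the cluster exists only to run jobs.
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
result = operation.result()  # blocks until provisioning completes
print(f"Cluster created: {result.cluster_name}")
```

Because the data stays in Cloud Storage, a cluster like this can be deleted as soon as its jobs finish and recreated later without any data loss.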

About the author

A problem solver at heart, Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds four patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi. My name is Janani Ravi, and welcome to this course on Architecting Big Data Solutions Using Google Dataproc. A little about myself: I have a Master's degree in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you'll learn to work with managed Hadoop on the Google Cloud and the best practices to follow for migrating your on-premises jobs to Dataproc clusters. We'll study in some depth how the separation of storage and compute allows you to use clusters more efficiently, purely for processing data and not for storage. We start off by creating a Dataproc cluster and configuring firewall rules to enable us to access the cluster manager UI from our local machine. We'll execute MapReduce jobs in the cloud using the web console as well as the command line. We'll add compute capacity to our cluster using preemptible VMs and monitor our cluster using Stackdriver. We'll then study how we can use the Spark distributed analytics engine on our Dataproc cluster. We'll work with the PySpark shell on our cluster, as well as submit Spark jobs using the web console. We'll also see how we can write code to integrate our Spark jobs with BigQuery and Cloud Storage buckets using connectors. We'll then use our Dataproc cluster to perform extract, transform, and load operations using Pig as a scripting language and work with Hive tables. At the end of this course, you should be comfortable working with Google's managed Hadoop offering and have a sound idea of how to migrate jobs and data on your on-premises Hadoop cluster to the Google Cloud.
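As a hedged illustration of the Spark integration described above, the sketch below shows a PySpark job that reads a table from BigQuery through the spark-bigquery-connector and writes its results to a Cloud Storage bucket. The public Shakespeare sample table and the gs://my-bucket path are stand-ins for illustration, not examples from the course, and the connector is assumed to be available on the cluster.

```python
# A sketch of a PySpark job integrating with BigQuery and Cloud Storage.
# Assumes the spark-bigquery-connector is on the cluster's classpath
# (for example, supplied via the --jars flag when submitting the job).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-to-gcs").getOrCreate()

# Read a BigQuery table through the connector; this public sample
# table is a stand-in for your own data.
words = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)

# A simple transformation: total occurrences of each word.
word_counts = words.groupBy("word").sum("word_count")

# Write the results to a Cloud Storage bucket (a placeholder path) rather
# than cluster-local HDFS, so the output outlives the cluster itself.
word_counts.write.csv("gs://my-bucket/shakespeare-counts", mode="overwrite")
```

Similarly, because Dataproc clusters ship with Hive, a Spark session created with Hive support can query Hive tables directly; the table name below is hypothetical:

```python
from pyspark.sql import SparkSession

# Hive support lets Spark query tables registered in the
# cluster's Hive metastore.
spark = (
    SparkSession.builder
    .appName("hive-tables")
    .enableHiveSupport()
    .getOrCreate()
)

# "orders" is a placeholder table name, not one from the course.
spark.sql("SELECT COUNT(*) AS row_count FROM orders").show()
```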