Building Machine Learning Models in Spark 2

Training ML models is a compute-intensive operation and is best done in a distributed environment. This course will teach you how Spark can efficiently perform data exploration, cleaning, and aggregation, and train ML models, all on one platform.
Course info
Level
Intermediate
Updated
Jun 19, 2018
Duration
3h 27m
Table of contents
Machine Learning Packages: spark.mllib vs. spark.ml
Building Classification and Regression Models in Spark ML
Implementing Clustering and Dimensionality Reduction in Spark ML
Building Recommendation Systems in Spark ML
Course Overview
Description

Spark is possibly the most popular engine for big data processing these days. In this course, Building Machine Learning Models in Spark 2, you will learn to build and train Machine Learning (ML) models such as regression, classification, clustering, and recommendation systems on Spark 2.x's distributed processing environment. This course starts off with an introduction to the two ML libraries available in Spark 2: the older spark.mllib library built on top of RDDs and the newer spark.ml library built on top of DataFrames. You will see the two compared, to help you know when to pick one over the other. You will see a classification model built using Decision Trees the old way, and then see how you can implement the same model with the newer spark.ml library. The course covers many features of Spark 2, including ML pipelines, a brand-new feature used to chain your data transformations and ML operations. At the end of this course, you will be comfortable using the advanced features that Spark 2 offers for machine learning. You'll learn to use components such as Transformers, Estimators, and Parameters within your ML pipelines to work with distributed training at scale.
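To make the pipeline concepts concrete, here is a minimal sketch of a spark.ml classification pipeline that chains Transformers (StringIndexer, VectorAssembler) and an Estimator (DecisionTreeClassifier). The file path, column names, and parameter values are placeholders for illustration, not taken from the course exercises.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DecisionTreePipeline").getOrCreate()

// Hypothetical labeled dataset with a string label column and numeric feature columns.
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/labeled_records.csv")   // placeholder path

// Transformers: index the string label, then assemble feature columns into one vector.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))  // placeholder columns
  .setOutputCol("features")

// Estimator: a decision tree classifier; maxDepth is one of its Parameters.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")
  .setMaxDepth(5)

// Chain the stages into a Pipeline, fit on the training split, predict on the test split.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val pipeline = new Pipeline().setStages(Array(labelIndexer, assembler, dt))
val model = pipeline.fit(train)
model.transform(test).select("prediction", "indexedLabel").show(5)
```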

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Building Features from Image Data
Advanced
2h 10m
Aug 13, 2019
Designing a Machine Learning Model
Intermediate
3h 25m
Aug 13, 2019
Building Features from Nominal Data
Intermediate
2h 40m
Aug 12, 2019
More courses by Janani Ravi
Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi and welcome to this course on Building Machine Learning Models in Spark 2. A little about myself, I have a Master's in Electrical Engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google I was one of the first engineers working on real-time collaborative editing in Google Docs and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you'll learn to build and train ML models, such as regression, classification, clustering, and recommendation systems on Spark 2's distributed processing environment. This course starts off with an introduction of the two ML libraries available in Spark 2: the older spark.mllib library built on top of RDDs, and the newer spark.ml library built on top of DataFrames. We'll compare and contrast the two and talk about when we would choose one library over the other. This course covers both supervised and unsupervised machine learning models, starting off with classification and regression models. We'll cover decision trees and random forests for classification and Lasso and Ridge models for regression. We'll also see how we can use the confusion matrix and measures such as precision and recall to see how good our classification models are. We'll also cover a brand-new feature in Spark 2, ML pipelines, used to chain our data transformations and ML operations. We'll cover K-means clustering and dimensionality reduction using PCA in the unsupervised learning techniques before we move on to recommendation systems using the alternating least squares method. We'll implement a recommendations engine using both explicit as well as implicit ratings. At the end of this course, you'll be very comfortable using the advanced features that Spark 2 offers for machine learning. You'll learn to use components such as transformers, estimators, and parameters within your ML pipelines to work with distributed training at scale.
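As a rough illustration of the recommendation systems portion, the sketch below trains an alternating least squares (ALS) model in spark.ml on explicit ratings and evaluates it with RMSE. The ratings file, column names, and parameter values are assumptions for the example; setting setImplicitPrefs to true would train on implicit feedback instead.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ALSRecommender").getOrCreate()

// Hypothetical explicit-ratings data: (userId, movieId, rating) rows.
val ratings = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/ratings.csv")   // placeholder path

val Array(train, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42)

// Alternating least squares on explicit ratings; flip setImplicitPrefs for implicit feedback.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setRank(10)
  .setMaxIter(10)
  .setRegParam(0.1)
  .setImplicitPrefs(false)
  .setColdStartStrategy("drop")   // drop NaN predictions for users/items unseen in training

val model = als.fit(train)
val predictions = model.transform(test)

// Evaluate predicted ratings against held-out ratings with RMSE.
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
println(s"RMSE = ${evaluator.evaluate(predictions)}")
```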