Building Machine Learning Models in Spark 2

Training ML models is a compute-intensive operation and is best done in a distributed environment. This course will teach you how Spark can efficiently perform data exploration, cleaning, and aggregation, and train ML models, all on one platform.
Course info
Rating: (15)
Level: Intermediate
Updated: Jun 19, 2018
Duration: 3h 27m
Table of contents
Course Overview
Machine Learning Packages: spark.mllib vs. spark.ml
Building Classification and Regression Models in Spark ML
Implementing Clustering and Dimensionality Reduction in Spark ML
Building Recommendation Systems in Spark ML
Description

Spark is possibly the most popular engine for big data processing these days. In this course, Building Machine Learning Models in Spark 2, you will learn to build and train machine learning (ML) models such as regression, classification, clustering, and recommendation systems on Spark 2.x's distributed processing environment. The course starts off with an introduction to the two ML libraries available in Spark 2: the older spark.mllib library built on top of RDDs, and the newer spark.ml library built on top of DataFrames. You will see the two compared, to help you know when to pick one over the other. You will see a classification model built using decision trees the old way, and then see how you can implement the same model on the newer spark.ml library. The course covers many features of Spark 2, including a brand-new one: ML pipelines, used to chain your data transformations and ML operations. At the end of this course you will be comfortable using the advanced features that Spark 2 offers for machine learning, and you'll know how to use components such as Transformers, Estimators, and Parameters within your ML pipelines to work with distributed training at scale.
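
To make the pipeline idea concrete, here is a minimal PySpark sketch of chaining a Transformer and an Estimator; it assumes Spark 2.x with pyspark installed, and the data and column names (f1, f2, label) are illustrative, not taken from the course:

```python
# A minimal spark.ml pipeline: a Transformer (VectorAssembler) chained with
# an Estimator (DecisionTreeClassifier). Data and column names are made up.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 0.5, 0), (0.2, 3.1, 1), (4.5, 0.1, 0), (0.3, 2.8, 1)],
    ["f1", "f2", "label"])

# Transformer: assembles raw columns into a single feature vector
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
# Estimator: fits a decision tree on the assembled features
tree = DecisionTreeClassifier(featuresCol="features", labelCol="label")

# fit() runs each stage in order and returns a reusable PipelineModel
model = Pipeline(stages=[assembler, tree]).fit(df)
model.transform(df).select("features", "prediction").show()
```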

About the author

A problem solver at heart, Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

More from the author
Using PyTorch in the Cloud: PyTorch Playbook
Intermediate
2h 21m
Apr 25, 2019
Building Clustering Models with scikit-learn
Intermediate
2h 33m
Apr 24, 2019
Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Building Machine Learning Models in Spark 2. A little about myself: I have a master's in electrical engineering from Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you'll learn to build and train ML models such as regression, classification, clustering, and recommendation systems on Spark 2's distributed processing environment. This course starts off with an introduction to the two ML libraries available in Spark 2: the older spark.mllib library built on top of RDDs, and the newer spark.ml library built on top of DataFrames. We'll compare and contrast the two and talk about when we would choose one library over the other. This course covers both supervised and unsupervised machine learning models, starting off with classification and regression models. We'll cover decision trees and random forests for classification, and Lasso and Ridge models for regression. We'll also see how we can use the confusion matrix and measures such as precision and recall to see how good our classification models are. We'll also cover a brand-new feature in Spark 2, ML pipelines, used to chain our data transformations and ML operations. We'll cover k-means clustering and dimensionality reduction using PCA among the unsupervised learning techniques, before we move on to recommendation systems using the Alternating Least Squares method. We'll implement a recommendations engine using both explicit as well as implicit ratings. At the end of this course, you'll be very comfortable using the advanced features that Spark 2 offers for machine learning. You'll learn to use components such as Transformers, Estimators, and Parameters within your ML pipelines to work with distributed training at scale.
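
As a taste of the evaluation workflow mentioned above, here is a hedged sketch of computing weighted precision, weighted recall, and a confusion matrix in PySpark; the predictions DataFrame is built by hand here for brevity, standing in for real model output:

```python
# A sketch of evaluating classifier output, assuming a DataFrame with
# "label" and "prediction" columns (made-up values, not course data).
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.evaluation import MulticlassMetrics

spark = SparkSession.builder.appName("eval-sketch").getOrCreate()
predictions = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)],
    ["label", "prediction"])

# Weighted precision and recall via the spark.ml evaluator
for metric in ("weightedPrecision", "weightedRecall"):
    value = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction",
        metricName=metric).evaluate(predictions)
    print(metric, value)

# The confusion matrix still comes from the older spark.mllib API:
# rows are true labels, columns are predicted labels
pairs = predictions.select("prediction", "label").rdd.map(tuple)
print(MulticlassMetrics(pairs).confusionMatrix())
```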

Machine Learning Packages: spark.mllib vs. spark.ml
Hi, and welcome to this course on Building Machine Learning Models in Spark 2. If you want to build and train your models in a distributed environment, Spark is a great option. Spark can be thought of as a distributed computing engine that is very useful for extracting insights from very large amounts of data, which is why machine learning libraries fit right in with Spark. In fact, machine learning libraries have been a part of Spark since the very beginning. Early versions of Spark in the 1.x line provided powerful ML support using the spark.mllib library. Spark 2 is the current version available, and if you're going to adopt Spark today, this is the version that you'll use. Spark 2 provides many enhancements and improvements over Spark 1, specifically in terms of performance. Spark 2 also offers an entirely new set of APIs for developers to work with: you'll work with DataFrames and not directly with RDDs. The machine learning library has also been enhanced and improved. There are now many more higher-level abstractions, which make it much easier to work with. If you have worked with Spark, you might have heard of Project Tungsten, which completely revamped Spark's distributed computing engine to make it much faster; the execution speedup for certain operations is between 10x and 100x. Spark 2 offers built-in libraries for hyperparameter tuning, which allow you to choose the best model for your use case. The machine learning libraries in Spark 1 and Spark 2 are both powerful; however, for faster execution and more abstraction through higher-level libraries, Spark 2, with its spark.ml library, is the clear winner.
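
To illustrate the API difference described above (a rough sketch with tiny made-up data, not the course's own examples): spark.mllib consumes an RDD of LabeledPoint, while spark.ml consumes a DataFrame with a vector-valued features column.

```python
# Contrasting the RDD-based spark.mllib API with the DataFrame-based
# spark.ml API on the same toy classification data.
from pyspark.sql import SparkSession
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("mllib-vs-ml").getOrCreate()

# Older spark.mllib: RDD of LabeledPoint
rdd = spark.sparkContext.parallelize([
    LabeledPoint(0.0, [1.0, 0.5]), LabeledPoint(0.0, [1.2, 0.4]),
    LabeledPoint(1.0, [0.2, 3.1]), LabeledPoint(1.0, [0.1, 2.8])])
mllib_model = DecisionTree.trainClassifier(
    rdd, numClasses=2, categoricalFeaturesInfo={})

# Newer spark.ml: DataFrame with "label" and vector "features" columns
df = spark.createDataFrame(
    [(0.0, Vectors.dense(1.0, 0.5)), (0.0, Vectors.dense(1.2, 0.4)),
     (1.0, Vectors.dense(0.2, 3.1)), (1.0, Vectors.dense(0.1, 2.8))],
    ["label", "features"])
ml_model = DecisionTreeClassifier().fit(df)
print(ml_model.toDebugString)
```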

Building Classification and Regression Models in Spark ML
Hi, and welcome to this module where we'll build classification and regression models in Spark ML. This is the new library, which runs on top of DataFrames in Spark 2. In addition to taking advantage of the faster execution speeds offered by Spark 2, Spark ML has high-level abstractions for machine learning, such as Estimators and Transformers. Estimators and Transformers can be chained together in a pipeline, which forms a machine learning workflow. Spark ML also has special libraries which help us with feature engineering: extracting, transforming, and selecting only those features that we're interested in are all made easy. In the previous module, we saw an example of classification using the decision tree machine learning model. In this module, we'll dive deeper into classification. We'll learn how we can evaluate classifiers using the confusion matrix. We'll revisit the decision tree model once again, but this time we'll build it using the Spark 2 APIs in spark.ml. We'll also implement a classification problem using random forests. Then we'll move on to regression and see how we can build specialized regression models, such as Lasso and Ridge regression.
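
As an illustration of how Lasso and Ridge look in spark.ml (a sketch with made-up data, not the course's dataset): LinearRegression exposes regParam for the regularization strength and elasticNetParam to select the penalty, where 1.0 gives L1 (Lasso) and 0.0 gives L2 (Ridge).

```python
# Lasso vs. Ridge in spark.ml via LinearRegression's elastic-net knob.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("lasso-ridge-sketch").getOrCreate()
# Tiny illustrative dataset; the second feature is collinear with the first
train = spark.createDataFrame(
    [(1.0, Vectors.dense(0.5, 1.0)),
     (2.0, Vectors.dense(1.0, 2.0)),
     (3.0, Vectors.dense(1.5, 3.0))],
    ["label", "features"])

# elasticNetParam=1.0 -> L1 (Lasso); elasticNetParam=0.0 -> L2 (Ridge)
lasso = LinearRegression(regParam=0.1, elasticNetParam=1.0).fit(train)
ridge = LinearRegression(regParam=0.1, elasticNetParam=0.0).fit(train)
print(lasso.coefficients)  # L1 tends to drive weak coefficients to zero
print(ridge.coefficients)  # L2 shrinks coefficients without zeroing them
```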

Implementing Clustering and Dimensionality Reduction in Spark ML
Hi, and welcome to this module where we'll focus on unsupervised learning techniques in Spark ML. We'll implement clustering and dimensionality reduction. Unsupervised learning techniques are typically used with data where we don't have a large number of labeled instances; unsupervised learning finds patterns within the data itself rather than relying on training labels. In this module, we'll study k-means clustering, a widely used unsupervised learning technique for finding logical groupings in data: documents or records that are similar to one another belong to the same group. The most important hyperparameter in k-means clustering is k, the number of logical groups into which we want to divide our data. The elbow and silhouette methods are used to find the best value for this hyperparameter. Machine learning models perform well when they're trained on huge datasets with many disparate instances. However, it might be that a lot of the input features don't really have much significance, which is why it's common to apply dimensionality reduction to the input dataset in order to discover latent factors in the underlying data. In this module, we'll look at Spark ML libraries to perform principal component analysis, or PCA, a very commonly used method for dimensionality reduction.
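
A hedged sketch of both techniques in PySpark, with illustrative data; note that ClusteringEvaluator, used here for the silhouette score, requires Spark 2.3 or later:

```python
# K-means with silhouette-based selection of k, followed by PCA.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.feature import PCA

spark = SparkSession.builder.appName("clustering-sketch").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(0.0, 0.1),), (Vectors.dense(0.2, 0.0),),
     (Vectors.dense(9.0, 8.8),), (Vectors.dense(9.2, 9.1),)],
    ["features"])

# Try several values of k and keep the one with the best silhouette score
evaluator = ClusteringEvaluator()  # silhouette is the default metric
for k in (2, 3):
    model = KMeans(k=k, seed=42).fit(df)
    print(k, evaluator.evaluate(model.transform(df)))

# PCA: project the features onto the top principal component
pca = PCA(k=1, inputCol="features", outputCol="pca_features").fit(df)
pca.transform(df).select("pca_features").show()
```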

Building Recommendation Systems in Spark ML
Hi, and welcome to this module where we'll learn how to build recommendation systems using Spark ML. There are a number of different types of algorithms that can be used to make recommendations to users. One of the most popular techniques used to build recommendation systems is collaborative filtering. Collaborative filtering algorithms use information from other users to generate rules like "people who buy X will also buy Y." We'll first understand the intuition behind collaborative filtering using the Alternating Least Squares, or ALS, method. When you implement it in Spark, though, you don't need to know the math or the logic involved: Spark offers Estimators which abstract away all the details of ALS. In this module, we'll study two different kinds of recommendation systems, one which uses explicit ratings and another which uses implicit ratings.
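
Here is a minimal ALS sketch with made-up ratings, assuming Spark 2.2+ for coldStartStrategy and recommendForAllUsers; the implicitPrefs flag is what switches the estimator from explicit ratings to implicit feedback:

```python
# ALS for explicit vs. implicit feedback in spark.ml (illustrative data).
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-sketch").getOrCreate()
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 1.0), (2, 1, 3.0)],
    ["userId", "itemId", "rating"])

# Explicit ratings: values are treated as actual scores
explicit = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
               rank=4, coldStartStrategy="drop")
# Implicit feedback (clicks, plays): values are treated as confidence
implicit = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
               rank=4, implicitPrefs=True, coldStartStrategy="drop")

model = explicit.fit(ratings)
model.recommendForAllUsers(2).show(truncate=False)  # top 2 items per user
```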