Getting Started with Spark 2

The 2.x releases of Spark introduce significantly different and upgraded features. This course covers all of these changes, in both theory and practice.
Course info
Rating: (46)
Level: Beginner
Updated: May 16, 2018
Duration: 2h 16m
Table of contents
Course Overview
Understanding Differences Between Spark 2.x and Spark 1.x
Exploring and Analyzing Data with DataFrames
Querying Data Using Spark SQL
Description

Spark is possibly the most popular engine for big data processing these days, and the 2.x release has several new features which make Spark more powerful and easier to work with. In this course, Getting Started with Spark 2, you will get up and running with Spark 2 and understand the similarities and differences between version 2.x and older versions. First, you will see the basic Spark architecture and the details of Project Tungsten, which brought great performance improvements to Spark 2. You will go over the new developer APIs using DataFrames and see how they interoperate with RDDs from Spark 1.x. Next, you will move on to big data processing, where you will load and clean datasets, remove invalid rows, execute transformations to extract insights, and perform grouping, sorting, and aggregations using the new DataFrame APIs. You will also study how and where to use broadcast variables and accumulators. Finally, you will work with Spark SQL, which allows you to use SQL commands for big data processing. The course also covers advanced SQL support in the form of windowing operations. At the end of this course, you should be very comfortable working with Spark DataFrames and Spark SQL. You will be better equipped to make technical choices based on the performance trade-offs of older versions of Spark vs. Spark 2. Software required: Apache Spark 2.2, Python 2.7.
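As a taste of the DataFrame workflow described above, here is a minimal PySpark sketch of loading, cleaning, and aggregating a dataset. The file name and column names are placeholders for illustration, not the course's actual datasets.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# SparkSession is the single entry point introduced in Spark 2.x.
spark = SparkSession.builder.appName("GettingStartedWithSpark2").getOrCreate()

# Load a CSV file into a DataFrame, letting Spark infer the schema.
# "crime_data.csv" and its columns are hypothetical examples.
df = spark.read.csv("crime_data.csv", header=True, inferSchema=True)

# Clean the data: drop rows with missing values in key columns.
clean_df = df.dropna(subset=["borough", "value"])

# Group, aggregate, and sort using the DataFrame API.
(clean_df.groupBy("borough")
         .agg(F.sum("value").alias("total_incidents"))
         .orderBy(F.desc("total_incidents"))
         .show(10))
```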

About the author

A problem solver at heart, Janani has a master's degree from Stanford and worked for 7+ years at Google. She was one of the original engineers on Google Docs and holds 4 patents for its real-time collaborative editing framework.

Section Introduction Transcripts

Course Overview
Hi, my name is Janani Ravi, and welcome to this course on Getting Started with Spark 2. A little about myself: I have a master's degree in electrical engineering from Stanford, and have worked at companies such as Microsoft, Google, and Flipkart. At Google, I was one of the first engineers working on real-time collaborative editing in Google Docs, and I hold four patents for its underlying technologies. I currently work on my own startup, Loonycorn, a studio for high-quality video content. In this course, you'll get up and running with Spark 2, and understand the similarities and differences between version 2.x and older versions. You'll understand the basic Spark architecture and the details of Project Tungsten, which brought great performance improvements to Spark 2. The course will cover the new developer APIs using DataFrames, and you'll see how they interoperate with RDDs from Spark 1.x. We'll then move on to big data processing, where we'll load and clean datasets, remove invalid rows, execute transformations to extract insights, and perform grouping, sorting, and aggregations using the new DataFrame APIs. We'll also study how and where to use broadcast variables and accumulators. We'll then work with Spark SQL, which allows us to use SQL commands for big data processing. Datasets loaded into Spark can be used to retrieve information using familiar SQL constructs. The course also covers advanced SQL support in the form of windowing operations. At the end of this course, you should be very comfortable working with Spark DataFrames and Spark SQL. You should be able to make technical choices based on the performance trade-offs of older versions of Spark versus Spark 2.
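The DataFrame/RDD interoperability mentioned above can be shown with a short sketch; the data and names below are illustrative, not taken from the course.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("DataFrameRddInterop").getOrCreate()

# A DataFrame built from an in-memory list of Row objects.
df = spark.createDataFrame([Row(name="alice", age=34), Row(name="bob", age=29)])

# Every DataFrame exposes its underlying RDD of Rows, so RDD-style
# transformations from Spark 1.x remain available.
rdd = df.rdd
print(rdd.map(lambda row: row.age).sum())

# An RDD of Rows can be promoted back into a DataFrame.
df_again = spark.createDataFrame(rdd)
df_again.show()
```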

Exploring and Analyzing Data with DataFrames
Hi, and welcome to this module on Exploring and Analyzing Data in Spark 2 using DataFrames. Developers in Spark 2 use RDDs only if their use case specifically demands it. If you want low-level transformations, actions, and control over your dataset, you'll use RDDs. If your data is unstructured, such as media streams or strings of text, those also call for RDDs. In all other situations, you'll use DataFrames. In this module, we'll work on real-world datasets, one for London crime and the other for soccer player statistics. We'll see how we can use built-in aggregate functions. We'll also get hands-on practice using DataFrames for sampling, grouping, and ordering data. And finally, we'll cover broadcast variables and accumulators. Broadcast variables allow processes to access shared data in an optimized fashion. Accumulators allow multiple processes to update shared variables.
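A rough sketch of how broadcast variables and accumulators can be combined with DataFrame grouping and ordering; the lookup table, player records, and column names are invented for illustration and are not the module's actual data.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("BroadcastAndAccumulators").getOrCreate()
sc = spark.sparkContext

# Broadcast variable: a read-only lookup shipped to every executor once.
country_names = sc.broadcast({"ENG": "England", "ESP": "Spain"})

# Accumulator: a counter that tasks running on executors can add to.
unknown_codes = sc.accumulator(0)

players = sc.parallelize([("Kane", "ENG", 28), ("Ramos", "ESP", 6), ("Neuer", "GER", 1)])

def resolve(record):
    name, code, goals = record
    country = country_names.value.get(code)
    if country is None:
        unknown_codes.add(1)   # aggregated back on the driver after an action runs
        country = "Unknown"
    return (name, country, goals)

# Supplying the column names avoids an extra schema-inference pass.
df = spark.createDataFrame(players.map(resolve), ["player", "country", "goals"])

df.groupBy("country").agg(F.sum("goals").alias("total_goals")).orderBy("country").show()
print("Codes missing from the broadcast lookup: %d" % unknown_codes.value)
```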

Querying Data Using Spark SQL
Hi, and welcome to this module on Querying Data Using Spark SQL. Spark tries to make it very easy for you to work with data. One way it achieves this is by allowing you to query your DataFrames as though they were tables in a relational database. Before you start querying your DataFrames, you need to register them as SQL tables. You can register them as a temporary table, which is available on a per-session basis, or as a global table, which is available across all SparkSessions. Spark 2 has a special optimizer for SQL queries called the Catalyst optimizer, which makes executing SQL queries very, very fast. Spark is capable of inferring the schema of your DataFrame objects, and thus of your SQL tables. It's also possible for you to explicitly specify the schema. Spark also allows you to specify windowing operations on your DataFrames. If you've used window functions in SQL, you know exactly what they are. They are a way to group and order your data, and then apply a ranking or analytic function within each group.
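A brief sketch of registering a DataFrame for SQL queries and applying a window function; the data here is made up for illustration rather than drawn from the course's datasets.

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("SparkSqlAndWindows").getOrCreate()

# Illustrative data only.
df = spark.createDataFrame([
    Row(player="Kane",  team="ENG", goals=28),
    Row(player="Vardy", team="ENG", goals=20),
    Row(player="Ramos", team="ESP", goals=6),
])

# Register the DataFrame so it can be queried with SQL.
df.createOrReplaceTempView("players")        # per-SparkSession
df.createGlobalTempView("players_global")    # shared: query as global_temp.players_global

spark.sql("SELECT team, SUM(goals) AS total FROM players GROUP BY team").show()

# Windowing: rank players within each team without collapsing the rows.
team_window = Window.partitionBy("team").orderBy(F.desc("goals"))
df.withColumn("rank_in_team", F.rank().over(team_window)).show()
```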