Apache Spark Fundamentals

This course will teach you how to use Apache Spark to analyze your big data at lightning-fast speeds, leaving Hadoop in the dust! For a deep dive on SQL and Streaming, check out the sequel, Handling Fast Data with Apache Spark SQL and Streaming.
Course info
Rating
(232)
Level
Intermediate
Updated
Oct 27, 2015
Duration
4h 27m
Description

Our ever-connected world is creating data faster than Moore's law can keep up, forcing us to be smarter about how we analyze it. Previously, we had Hadoop's MapReduce framework for batch processing, but modern big data processing demands have outgrown it. That's where Apache Spark steps in, boasting speeds 10-100x faster than Hadoop and setting the world record in large-scale sorting. Spark's general abstraction means it can expand beyond simple batch processing, making it capable of such things as blazing-fast iterative algorithms and exactly-once streaming semantics. In this course, you'll learn Spark from the ground up, starting with its history before creating a Wikipedia analysis application as a means of learning a wide scope of its core API. That core knowledge will make it easier to look into Spark's other libraries, such as the streaming and SQL APIs. Finally, you'll learn how to avoid a few commonly encountered rough edges of Spark. You will leave this course with a tool belt capable of creating your own performance-maximized Spark application.

About the author

Justin is a software journeyman, continuously learning and honing his skills.

More from the author
Patterns for Pragmatic Unit Testing
Beginner
2h 1m
26 Dec 2014
Scala: Getting Started
Intermediate
2h 1m
6 Jun 2014
Section Introduction Transcripts

Spark Core: Part 2
Hi, this is Justin Pihony. In this module, we'll finish up the basics of Spark's Core API. In the last module, we covered the root of the API: how to load, transform, and act on our data. In this module, we'll continue through the rest of the core. We'll see some specialized functions for a common use case: working with data that's in a key-value format. Then we'll speed our processing up by learning about Spark's ability to persist intermediate data so that it can be iterated over even more quickly. After that, we'll cover how to accumulate values across this distributed set, ending with some additional information on how to use the Java API, especially if you're not using Java 8.
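
For a concrete feel of what this module covers, here is a minimal Scala sketch touching the pair-RDD functions, persistence, and an accumulator. The file name, record layout, and object name are illustrative (not from the course), and it is written against a recent Spark release, which postdates the course:

    import org.apache.spark.{SparkConf, SparkContext}

    object PairRddSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("wiki-page-counts").setMaster("local[*]"))

        // Hypothetical input: one "pageTitle<TAB>viewCount" record per line.
        val lines = sc.textFile("pagecounts.tsv")

        // Key-value work with the pair-RDD API: build (title, views) pairs
        // and combine the counts per key.
        val counts = lines
          .map { line =>
            val Array(title, views) = line.split("\t")
            (title, views.toLong)
          }
          .reduceByKey(_ + _)

        // Persist the intermediate result so repeated actions reuse it
        // instead of recomputing the whole lineage.
        counts.cache()

        // Accumulator: a value that tasks add to across the distributed set.
        val records = sc.longAccumulator("records seen")
        counts.foreach(_ => records.add(1))

        println(s"distinct pages: ${counts.count()}, records: ${records.value}")
        sc.stop()
      }
    }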

Distribution and Instrumentation
Hi, this is Justin Pihony. In this module, we're going to deviate from reviewing the API itself, instead focusing on the environment and some tools to help us use Spark to its fullest. As the primary focus of the course is to introduce you to the Spark API, this will only be a taste of a much larger ecosystem, just enough so that you can get an end-to-end picture of Spark's abilities as a whole, which, of course, includes distribution and monitoring. We'll go into a little more detail on how to submit our Spark applications for execution, getting an introduction to what cluster managers are and some of the tools that Spark provides to make it easier to spin them up. Then we'll take a look at this in action through Amazon's cloud computing services, and finish by seeing how to monitor and measure our applications.
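
As a rough illustration of the submission step described above, a packaged application is typically handed to a cluster with spark-submit. The class name, master URL, and jar path below are placeholders, not the course's own example:

    # Submit the packaged application to a standalone cluster manager.
    # (local[*], yarn, or a mesos:// URL can stand in for the master.)
    spark-submit \
      --class com.example.WikiPageCounts \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      target/wiki-analysis-assembly-0.1.jar

While the application runs, its driver exposes a web UI (port 4040 by default), which is the usual starting point for the monitoring discussed at the end of the module.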

Spark Libraries
Hi, this is Justin Pihony. In this module, we're going to review Spark's built-in libraries. The previous modules have provided enough information to deploy a basic Spark application, covering the basics up through the core API, then exhibiting how to deploy and maintain it. Using that base knowledge, we'll now see how to utilize Spark's specialized libraries, going one abstraction level deeper into the big data stack. We'll start with Spark SQL, a means of processing semi-structured data, using the structure to optimize each query to its fullest potential. Then we'll go over Spark Streaming, which is officially described as enabling the processing of live streams of data in a scalable, high-throughput, fault-tolerant manner. The ability to switch from batch analysis right into streaming seems to have enticed a wide audience, and we'll see a little of why this is. Then there's the machine learning library, MLlib, which has the ambitious goal of making machine learning both scalable and easy. And we'll close the module with the graph processing library, GraphX, which, as I'm sure you can guess, focuses on merging data-parallel and graph-parallel computations. Now, while each library comes with the added benefit that you already know the basics, since each is built on top of the core and its RDD abstraction, there's still a lot within each specialization. So this module's goal is merely to introduce you to the basics of each of them, not going into quite as much depth as we did with the core. We'll leave that for a future course.
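
To make the Spark SQL portion a little more tangible, here is a minimal sketch that reads semi-structured JSON and queries it. It uses the SparkSession entry point from later Spark releases (the course predates it), and the file name and fields are invented for illustration:

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("wiki-sql")
          .master("local[*]")
          .getOrCreate()

        // Semi-structured input: the schema is inferred from the JSON records,
        // e.g. {"title": "Apache_Spark", "views": 1234}.
        val pages = spark.read.json("pages.json")
        pages.createOrReplaceTempView("pages")

        // Because the structure is known, the optimizer can plan the query.
        spark.sql("SELECT title, views FROM pages WHERE views > 1000 ORDER BY views DESC")
          .show(10)

        spark.stop()
      }
    }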

Optimizations and the Future
Hi, this is Justin Pihony. In this, our final module, we'll go over the last two items from the course list, covering some of the more common troubleshooting and optimization issues: what closures are and how to safely work with them, how broadcast variables can be used to reduce network bandwidth, and how to keep data partitioning from harming your performance. We'll finish with a review of some of the areas of advancement surrounding Spark's bright future.
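
As a small taste of the broadcast-variable point, here is a sketch (the lookup table and data are invented for illustration) of shipping a read-only map to executors once rather than capturing it in every task's closure:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

        // A read-only lookup table. Referenced directly inside a closure it is
        // serialized with every task; broadcasting ships it once per executor.
        val countryNames  = Map("US" -> "United States", "DE" -> "Germany", "FR" -> "France")
        val countryLookup = sc.broadcast(countryNames)

        val visits = sc.parallelize(Seq("US", "DE", "US", "FR"))
        val named  = visits.map(code => countryLookup.value.getOrElse(code, "unknown"))

        named.countByValue().foreach(println)
        sc.stop()
      }
    }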