Apache Spark is a leader in enabling quick and efficient data processing. This course will teach you how to use Spark's SQL, Streaming, and even the newer Structured Streaming APIs to create applications able to handle data as it arrives.
Analyzing data used to be something you did once a night. Now you need to be able to process data on the fly so you can provide up-to-the-minute insights. But how do you accomplish in real time what used to take hours, without a complicated code base? In this course, Handling Fast Data with Apache Spark SQL and Streaming, you'll learn to use the Apache Spark Streaming and SQL libraries to handle this new world of real-time, fast data processing. First, you'll dive into Spark SQL. Next, you'll explore how to catch potential fraud by analyzing streams with Spark Streaming. Finally, you'll discover the newer Structured Streaming API. By the end of this course, you'll have a deeper understanding of these APIs, along with a number of the streaming concepts that have driven their design.
Course Overview Hi, my name is Justin Pihony, and welcome to my course, Handling Fast Data with Apache Spark SQL and Streaming. Being a top contributor of Apache Spark answers on Stack Overflow, as well as the developer support manager at Lightbend, has given me a lot of insight into how to maximize Spark's power while sidestepping possible pitfalls. Fast data is the next big thing in the world of data. Nowadays, we want valuable business insights now, not after waiting for batch jobs to complete, and we're at a point where we can build systems able to reactively handle our needs at scale. In this course, we're going to see how to use Spark's SQL and streaming capabilities to build these fast data applications without breaking a sweat. Some of the major topics that we'll cover include a deep dive into Spark's SQL library, learning both the untyped side via DataFrames and the type-safe side via Datasets, as well as Spark's take on streaming via both the older, more stable Spark Streaming library and its modernized, up-and-coming Structured Streaming library. By the end of this course, you'll have extensive knowledge of Spark's SQL and streaming APIs, knowing how to utilize them to create a fast data application capable of pulling out business insights in no time at all. Before beginning the course, you should have a basic understanding of Apache Spark, which you can get from my other course, Apache Spark Fundamentals. I hope you'll join me on this journey to learn about Spark's SQL and streaming libraries, and how they can be used in this new architecture overtaking the big data world, with Handling Fast Data with Apache Spark SQL and Streaming, here at Pluralsight.
Introduction Hello, my name is Justin Pihony, and welcome to this course on Handling Fast Data with Apache Spark SQL and Streaming. It serves two purposes. One is as a successor to my first Spark course, Apache Spark Fundamentals, with this one taking a more focused, deeper dive into the popular SQL and streaming libraries. The other purpose is to familiarize you with the next evolution in data processing, dubbed fast data. We'll dig deeper into the terminology later, but the idea is that big data processing is too focused on batch, and too slow for today's fast pace, even with the advent of Apache Spark. We want our analysis to evolve as the data evolves, continuously and quickly. In fact, the International Data Corporation published a paper predicting that by the year 2020, 1.7 MB of new information will be created every second for every human on the planet. So learning to handle data at speed is going to be just as important as, if not more important than, handling it at scale. Now, in this short introductory module, I'll prepare you for the rest of the course, starting off by getting you that deeper understanding of what fast data is all about. Then I'll make sure you understand the data project we'll be working on throughout the course, and any other important tidbits of prerequisite knowledge that'll prepare you for the fun to come. And I'll finish with a brief overview of Spark's most recent accomplishments, coming about through its newer 2.x branch, addressing how this affects the rest of the course as well.
Querying Data with DataFrames (Part 1) Hi, this is Justin Pihony. In this module, we're going to start off by digging deeper into the DataFrames API. In the Fundamentals course, we only touched on some of the most prevalent methods; however, as Spark SQL is one of the bigger growth areas of Spark, there's a lot of ground left to cover, so we'll need two DataFrames-focused modules to cover it all. In this first one, we're going to start by covering a vast swath of the methods available to call directly against a DataFrame. Then we'll move on to the equally numerous SQL functions. These are methods that work against the different parts of your data, like computing aggregations, checking on columnar data, and inspecting other pieces of your data rows, finishing by seeing how we can even use specialized functions to flatten internal arrays out into the main row set.
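To make that last idea concrete, here is a plain-Python sketch of the semantics behind flattening an array column into the main row set (what Spark SQL's `explode` function does). This is not Spark's API, just the concept, and the sample rows are invented for illustration:

```python
# Conceptual sketch of Spark SQL's explode(): each element of an
# array column becomes its own row, with the other columns repeated.
def explode(rows, array_col):
    for row in rows:
        for element in row[array_col]:
            flat = dict(row)          # copy the row
            flat[array_col] = element  # replace the array with one element
            yield flat

transactions = [
    {"account": "A-1", "amounts": [10, 25]},
    {"account": "B-2", "amounts": [99]},
]

flattened = list(explode(transactions, "amounts"))
# Two input rows become three output rows: A-1/10, A-1/25, B-2/99
```

In Spark the same flattening happens in a distributed fashion across the cluster, but the row-level behavior is exactly this one-element-per-row expansion.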
Querying Data with DataFrames (Part 2) Hi, this is Justin Pihony. In this second DataFrames-focused module, we're going to finish our deep dive into this vast API. We'll continue down the path of SQL functions, first learning how to create windows of data to inspect more than one data row at a time, thinking a bit more vertically, and then learning the rest of the many useful ways to apply functions in Spark SQL. After that, we'll go over the different ways to join your DataFrames. This is a topic I've found is good to make clear no matter the flavor of SQL you're using, as it always seems to create some sort of confusion. Then we'll finish off the module by reviewing the newer SQL UI, and how it can help you debug and get a better view into your Spark SQL code.
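Since join semantics are the part that tends to confuse, here is a plain-Python sketch of the two most commonly mixed-up flavors, inner versus left outer, using invented account and order rows (again the concept, not Spark's API):

```python
# Conceptual sketch of SQL join semantics: an inner join keeps only
# rows whose key matches on both sides; a left outer join keeps every
# left-side row, padding the right-side columns with None when there
# is no match.
def join(left, right, key, how="inner"):
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            for r in matches:
                out.append({**l, **r})
        elif how == "left":
            padding = {k: None for k in right[0] if k != key}
            out.append({**l, **padding})
    return out

accounts = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Ben"}]
orders = [{"id": 1, "total": 42}]

inner = join(accounts, orders, "id")             # only Ada has an order
left = join(accounts, orders, "id", how="left")  # Ben kept, total is None
```

The same mental model carries straight over to DataFrame joins, where the join type is passed as an argument, and it is the reason an inner join can silently drop rows you expected to see.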
Improving Type Safety with Datasets Hi, this is Justin Pihony. In this module, we're going to learn about the evolution from weakly typed DataFrames into strongly typed, compiler-verified Datasets. In the last module, we learned a lot about the usefulness and power of DataFrames. However, DataFrames were built with a SQL mindset, working with generic row objects, which cannot be checked for correctness at compilation time. Datasets emerged in Spark 1.6 as a way to keep the benefits of the optimizer while bringing back some functional style coupled with type safety. In this module, we'll dig a bit more into the reason for this library expansion and see how nice it can be to have the benefits from the last module merged with the benefits of the compiler. Then, we'll learn about encoders, the true power behind Datasets, and we'll finish by expanding our horizons beyond the already numerous native data sources available for loading and saving our data, distributing our storage into Cassandra.
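The untyped-versus-typed distinction can be illustrated in plain Python (Spark's Datasets themselves are a Scala/Java feature): a generic row is just a bag of fields, so a misspelled column name only fails while the job is running; a declared record type gives tools something to verify before then. The row contents here are invented:

```python
# Conceptual sketch of the DataFrame-vs-Dataset distinction.
from dataclasses import dataclass

# "DataFrame style": a generic row is a dict of column name -> value.
untyped_row = {"account": "A-1", "amount": 42}
# untyped_row["ammount"] would raise KeyError, but only once this
# particular line actually executes, deep inside a running job.

# "Dataset style": the schema is declared as a type up front.
@dataclass
class Transaction:
    account: str
    amount: int

typed_row = Transaction(account="A-1", amount=42)
# typed_row.ammount would be flagged by a static type checker before
# the job ever runs; in Scala, the compiler itself rejects it.
```

Encoders are what let Spark keep this typed view while still storing the data in its optimized internal binary format, rather than as ordinary objects.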
Processing Data with the Streaming API Hi, this is Justin Pihony. In this module, we're going to speed things up, the speed at which we run our data processing, that is. With a much stronger understanding of the querying side, we're now going to start digging into the Spark Streaming library. We'll see how easy Spark's focus on unification has made it to switch between batch and streaming, when we build out our application to start detecting transaction anomalies in near real time. But first we'll review the existing streaming landscape, and how Spark fits into it. Then we'll gain a deeper understanding of its mechanics, so that we can apply that understanding towards learning the ins and outs of the API. And of course we'll want to carry some state along our stream analysis to more efficiently catch any oddities in the transaction data flow. Finally, we'll see how we can use the Spark UI to help monitor our active system, which can be used to figure out areas of improvement for our code. Now it should be noted that this module is mostly in addition to what was already covered in my Fundamentals course, so you'll want to review that to get the Spark Streaming basics. There might be a little duplication, but mostly it's going to focus on strengthening your understanding of the concepts, and broadening your knowledge of the library's capabilities.
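The idea of carrying state along the stream to catch oddities can be sketched in plain Python. This is the concept behind Spark Streaming's stateful operations (`updateStateByKey` and friends), not the API itself; the per-account running average, the 3x threshold, and the sample batches are all invented for illustration:

```python
# Conceptual sketch of stateful stream processing: state persists
# across micro-batches, so each new batch is judged against history.
def update_state(state, batch):
    """Process one micro-batch of (account, amount) records,
    flagging any amount more than 3x that account's running average."""
    anomalies = []
    for account, amount in batch:
        total, count = state.get(account, (0.0, 0))
        if count > 0 and amount > 3 * (total / count):
            anomalies.append((account, amount))
        state[account] = (total + amount, count + 1)
    return anomalies

state = {}
# Each list is one micro-batch arriving on the stream.
first = update_state(state, [("A-1", 10), ("A-1", 12)])   # builds history
flagged = update_state(state, [("A-1", 500)])             # 500 >> avg of 11
```

In Spark the state store is partitioned across the cluster and checkpointed for fault tolerance, but the batch-by-batch flow is the same: new records arrive, are compared against accumulated state, and the state is updated for the next batch.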
Optimizing, Structured Streaming, and Spark 2.x Hi, this is Justin Pihony. In this final module, we'll cover a number of optimization aspects, as well as dive a bit further into some of the recent and upcoming Spark advancements. It all goes back to the core of the course: learning the best ways to optimize our applications toward processing data faster and more efficiently. First, we'll see how we can improve our streaming code from the last module to better handle different failure scenarios. Then we'll discuss some optimizations we can make to handle our data most efficiently, after which we'll learn how Spark is further embracing fast data through Structured Streaming, its next-generation streaming library built on top of Spark SQL. It takes advantage of bleeding-edge streaming concepts, all while hiding much of that complexity so that you can avoid having to reason about the streaming abstraction itself as much as possible. Then we'll close out with a review of the biggest aspects of the Spark 2.x series, and what's yet to come.
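The failure-handling idea at the heart of this module, checkpointing, can be sketched in plain Python: periodically persist both the position in the stream and the accumulated state, so a restarted job resumes where it left off instead of reprocessing everything. This is the concept behind Spark's checkpointing, not its implementation; the file layout, names, and the per-record checkpoint interval are invented (real systems checkpoint far less often):

```python
# Conceptual sketch of checkpoint-based failure recovery.
import json
import os
import tempfile

# A fresh directory per run keeps the demo deterministic.
CHECKPOINT = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

def save_checkpoint(offset, state):
    with open(CHECKPOINT, "w") as f:
        json.dump({"offset": offset, "state": state}, f)

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            cp = json.load(f)
        return cp["offset"], cp["state"]
    return 0, {"count": 0}  # nothing saved yet: start from the beginning

def process(stream):
    offset, state = load_checkpoint()
    for i in range(offset, len(stream)):
        state["count"] += stream[i]
        save_checkpoint(i + 1, state)  # persist progress after each record
    return state

events = [1, 2, 3, 4]
process(events)            # processes all four events, checkpointing as it goes
resumed = process(events)  # simulated restart: recovers state, nothing to redo
```

The trade-off the module explores is exactly the one visible here: checkpointing more often shrinks the amount of work redone after a failure, but adds write overhead to every batch.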