Apache Spark is one of the fastest and most efficient general-purpose engines for large-scale data processing. In this course, you'll learn how to develop Spark applications for your Big Data using Scala and a stable Hadoop distribution, Cloudera CDH.
At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications Using Scala & Cloudera, you’ll learn how to process data at scales you previously thought were out of your reach. First, you’ll learn all the technical details of how Spark works. Next, you’ll explore the RDD API, the original core abstraction of Spark. Then, you’ll discover how to become more proficient using Spark SQL and DataFrames. Finally, you'll learn to work with Spark's typed API: Datasets. When you’re finished with this course, you’ll have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you as you move forward to develop large-scale data applications that enable you to work with Big Data in an efficient and performant way.
Xavier is passionate about teaching and about helping others understand search and Big Data. He is also an entrepreneur, project manager, technical author, and trainer, and holds certifications from Cloudera, Microsoft, and the Scrum Alliance, along with being a Microsoft MVP.
Course Overview Hello, and welcome to this Pluralsight course, Developing Spark Applications Using Scala and Cloudera. I am Xavier Morera, and I help developers understand enterprise search and big data. Did you know that Spark, as a big data processing engine, can be 10 to 100 times faster than Hadoop MapReduce? And on top of that, it is easier to learn, widely adopted, and used for a diverse range of applications. In this course, we're going to learn how to create Spark applications in the very language in which Spark was created: Scala, the language in which most Spark samples are available. Also, because infrastructure is important, we will leverage the first and one of the most widely used Hadoop distributions, CDH, which stands for Cloudera's Distribution Including Apache Hadoop. Some of the major topics that we will cover include getting an environment set up with Spark and some interesting data, namely CDH plus StackOverflow; understanding Spark with an overview; and getting technical with Spark. We will also learn how to work with RDDs, or resilient distributed datasets, the original core abstraction of Spark; then we will cover DataFrames and Spark SQL, which help us become proficient with Spark more quickly; and finally, we will learn about Datasets, the higher-level typed API in Spark. By the end of this course, you will be able to create Spark applications using Scala and Cloudera. Before beginning the course, you should be familiar with programming, preferably with Scala, but I include a refresher module to jumpstart your journey. Additionally, you will need a cluster, but I will explain how to get your infrastructure set up in multiple different ways. I hope you will join me on this journey to learn about Spark with the Developing Spark Applications Using Scala and Cloudera course at Pluralsight.
Why Spark with Scala and Cloudera? Apache Spark is one of the most active projects in open source, and with good reason: it was developed in response to the limitations of MapReduce, and it delivered. Spark can be 10 to 100 times faster than MapReduce, which, combined with the power of the language in which Spark was created, Scala, and a solid Hadoop distribution, namely Cloudera, lets you create big data applications that are more performant as well as easier to build. Also, by covering Spark, Scala, and Cloudera together, you will get a broader picture of developing Spark applications. Additionally, there have been multiple improvements with the release of Spark 2, among the most important being the unification of the Dataset and DataFrame APIs, usability improvements, structured streaming, performance improvements, and SQL 2003 support. Let's quickly cover a few details around Spark, starting with getting an environment with data, then a quick Scala refresher, and then the most important part: learning how to work with the Spark APIs.
Refreshing Your Knowledge: Scala Fundamentals for This Course Scala is a great programming language, but more than that, it is a great programming language for you to learn and become proficient in. If you're curious, Scala stands for scalable language, and it holds true to its name, as the same concepts can be used to describe small and large parts alike, something achieved in part by unifying and generalizing concepts from both object-oriented and functional programming. It is used by small and large companies alike for their production systems, including Twitter, Foursquare, Amazon, Siemens, and more. There are many libraries and frameworks that are written in Scala or that provide a Scala API, the one we're most interested in right now being Apache Spark. Spark is written in Scala, which means there is, of course, a Scala API for Spark, and thus it is the language of choice for many developers creating Spark applications. Additionally, you will find a lot of examples and tutorials in Scala. In this module, I will do a very quick Scala refresher. If you're an experienced Scala developer who is aching to learn Spark, you can definitely jump ahead; however, if you have some doubts about which Scala basics you need to know to work with Spark, then please join me in this module.
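To set expectations for the refresher, here is a small sketch of the Scala features used throughout the course: immutable values, case classes, higher-order functions, and pattern matching. The `Question` class and its data are made up for illustration.

```scala
// A case class: immutable data with structural equality
// (the same pattern reappears later when we work with typed Datasets)
case class Question(id: Int, title: String, score: Int)

val questions = List(
  Question(1, "How do I filter an RDD?", 12),
  Question(2, "DataFrame or Dataset?", 45),
  Question(3, "What is a partition?", 7)
)

// Higher-order functions: filter and map take other functions as arguments
val popularTitles = questions.filter(_.score > 10).map(_.title)

// Pattern matching on the shape of a list
val summary = popularTitles match {
  case Nil          => "no popular questions"
  case title :: Nil => s"one popular question: $title"
  case titles       => s"${titles.size} popular questions"
}

println(summary) // 2 popular questions
```

If these constructs already feel familiar, you are well prepared for the Spark code ahead; if not, the refresher module covers each of them.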
Going Deeper into Spark Core So far, we have covered a lot of ground with Spark Core. If you are just getting started with big data, I can safely say that you are no longer a stranger to Spark. At a high level, in the previous module we focused primarily on the many ways of loading and saving data with RDDs, and now it is time to go deeper into Spark Core. In this module, we will focus on transformations, actions, and partitions, and learn more about sampling, combining, aggregating, set operations, caching, shared variables, and more. Let's begin.
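The distinction between lazy transformations and eager actions is central to this module. The RDD method names mirror Scala's collection API, so the idea can be sketched on a plain local collection using a lazy view; the equivalent RDD calls are noted in comments (on a cluster the chain would start from something like `sc.parallelize(...)` or `sc.textFile(...)`).

```scala
val numbers = (1 to 10).toList

// Transformations are lazy: .view defers evaluation,
// much like an RDD only records its lineage
val transformed = numbers.view
  .map(_ * 2)         // rdd.map(_ * 2)
  .filter(_ % 3 == 0) // rdd.filter(_ % 3 == 0)

// Actions force evaluation, like collect(), count(), and reduce(...) on an RDD
val collected = transformed.toList // rdd.collect()
val count     = transformed.size   // rdd.count()
val total     = transformed.sum    // rdd.reduce(_ + _)

println(collected)                    // List(6, 12, 18)
println(s"count=$count total=$total") // count=3 total=36
```

Nothing is computed until an action runs, which is exactly why Spark can optimize, pipeline, and recover a chain of transformations before doing any work.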
Increasing Proficiency with Spark: DataFrames and Spark SQL Now that you know the lower-level API, RDDs, quite well, it is time to learn about the higher-level API: DataFrames and Spark SQL. Maybe you're wondering, why do I need to learn yet another API if I just learned RDDs? The answer is simple: even though we could most likely do anything we wanted with our data using RDDs, the higher-level API allows us to become proficient with Spark more quickly, which is especially true if you have a relational background. So, in this module, I will start by explaining the reasons why it's better to work with DataFrames and Spark SQL, and then we will learn how to work with the higher-level API. Let's begin.
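To preview what that proficiency boost looks like, here is a sketch of the same query written both with the DataFrame API and with Spark SQL. It assumes a `SparkSession` named `spark` (as provided by the `spark-shell`), and the CSV path is a hypothetical stand-in for the StackOverflow data used in the course; it is not runnable outside a Spark environment.

```scala
import spark.implicits._

val questions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/stackoverflow/questions.csv")

// DataFrame API: untyped, column-based operations
questions
  .select("title", "score")
  .filter($"score" > 10)
  .orderBy($"score".desc)
  .show(5)

// The same query expressed in Spark SQL
questions.createOrReplaceTempView("questions")
spark.sql(
  "SELECT title, score FROM questions WHERE score > 10 ORDER BY score DESC LIMIT 5"
).show()
```

Both forms compile down to the same optimized plan, so you can choose whichever reads more naturally, which is a large part of why this API makes newcomers productive quickly.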
Working with a Typed API: Datasets On many occasions throughout this training I have said DataFrame API, but then immediately clarified that it is really the Dataset API, and that DataFrames are one special case. As covered earlier, in the days of 1.x, Spark had three APIs: first RDDs, then DataFrames, and more recently, Datasets. But starting with Spark 2, RDDs became the lower-level API, with DataFrames and Datasets unified into a single API, referred to as the higher-level API. And so, in earlier modules we covered RDDs and then DataFrames. Now it's time to expand our knowledge with Datasets. Let's begin.
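The "DataFrames are one special case" point can be made concrete: in Spark 2, `DataFrame` is simply an alias for `Dataset[Row]`, and calling `.as[T]` with a case class moves you to the typed API. The sketch below again assumes a `SparkSession` named `spark` and an illustrative CSV path, so it requires a Spark environment to run.

```scala
import spark.implicits._

// The case class gives the Dataset its compile-time schema
case class Question(id: Int, title: String, score: Int)

// A DataFrame is a Dataset[Row]; .as[Question] converts it to a typed Dataset
val questionsDS = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/stackoverflow/questions.csv")
  .as[Question]

// Typed transformations: q is a Question, so field access is
// checked by the compiler rather than failing at runtime
val popularTitles = questionsDS
  .filter(q => q.score > 10)
  .map(q => q.title)

popularTitles.show(5)
```

A misspelled field such as `q.socre` would fail at compile time here, whereas the equivalent DataFrame column reference would only fail when the job runs; that safety is the main argument for the typed API.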