Apache Spark is one of the fastest and most efficient general engines for large-scale data processing. In this course, you will learn how to develop Spark applications for your Big Data using Python and a stable Hadoop distribution, Cloudera CDH.
At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications with Python & Cloudera, you’ll learn how to process data at scales you previously thought were out of your reach. First, you’ll learn all the technical details of how Spark works. Next, you’ll explore the RDD API, the original core abstraction of Spark. Finally, you’ll discover how to become more proficient using Spark SQL and DataFrames. When you’re finished with this course, you’ll have a foundational knowledge of Apache Spark with Python and Cloudera that will help you as you move forward to develop large-scale data applications that enable you to work with Big Data in an efficient and performant way.
Xavier is very passionate about teaching and helping others understand search and Big Data. He is also an entrepreneur, project manager, technical author, and trainer, and he holds several certifications from Cloudera, Microsoft, and the Scrum Alliance, along with being a Microsoft MVP.
Course Overview Hello, and welcome to this Pluralsight course, Developing Spark Applications with Python and Cloudera. I am Xavier Morera, and I help developers understand enterprise search and Big Data. Did you know that Spark, as a big data processing engine, can be 10 to 100 times faster than Hadoop MapReduce, and that on top of that it is easier to learn, widely adopted, and used for a diverse range of applications? In this course, we're going to learn how to create Spark applications in a very popular and easy-to-use language, Python, and because infrastructure is important, we will leverage the first and one of the most widely used Hadoop distributions, CDH, which stands for Cloudera's Distribution Including Apache Hadoop. Some of the major topics we will cover include getting an environment set up with Spark and some interesting data, namely CDH plus StackOverflow; understanding Spark, both as an overview and in technical detail; working with the original core abstraction of Spark, the RDD, or resilient distributed dataset; and then DataFrames and Spark SQL, which help us become proficient with Spark more quickly. Finally, we will talk at a high level about Datasets, which cannot be used from Python because Python is dynamically typed, and we'll also cover a few related topics. By the end of this course you will be able to create Spark applications with Python and Cloudera. Before beginning the course you should be familiar with programming, preferably with Python, but I also include a small refresher module in case you need a jump start. Additionally, you will need a cluster, and I will explain how to get your infrastructure set up in several different ways. I hope you will join me on this journey to learn about Spark with the Developing Spark Applications with Python and Cloudera course, at Pluralsight.
Why Spark with Python and Cloudera? Hello and welcome to Developing Spark Applications with Python and Cloudera. I am Xavier Morera, and I help developers understand search and big data, and in this course we're going to talk about Apache Spark. Apache Spark is one of the most active projects in open source, and with good reason: it was developed in response to the limitations of MapReduce, and it delivered. Spark can be 10 to 100 times faster than MapReduce, which, combined with the power of Python and a solid Hadoop distribution, namely Cloudera's, lets you create big data applications that are more performant as well as easier to code. Also, by covering Spark, Python, and Cloudera together, you will get a broader picture of developing Spark applications. Additionally, there have been multiple improvements with the release of Spark 2, the most important being the unification of the Dataset and DataFrame APIs, usability improvements, structured streaming, performance improvements, and SQL:2003 support. Let's quickly cover a few details around Spark, starting with getting an environment with data and a quick Python refresher, and then we'll get to the most important part: learning how to work with the Spark APIs. Let's begin.
Refreshing Your Knowledge: Python Fundamentals for This Course. Python is a great programming language: easy to learn, yet really powerful. It is available on many platforms and has a very active developer community. It is also one of the top choices when it comes to big data and data science, as there are many specialized libraries available to you. It is used for production systems by small and large companies alike, including Google, Dropbox, Facebook, Netflix, and Spotify, just to name a few. Also, there are many APIs that you can work with using Python, including the Spark API, which we will use in this course to create big data applications. In this module I will do a very quick Python refresher. If you are an experienced Python developer who is aching to learn Spark, you can jump ahead; however, if you have doubts about what you need to know of Python to work with Spark, please join me in this module.
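As a taste of the refresher, these are the plain-Python constructs you will lean on constantly when writing Spark code: lambdas, list comprehensions, and key/value tuples. The snippet below is a small illustrative sketch, not tied to any Spark API yet.

```python
# Lambdas: anonymous functions, the bread and butter of RDD transformations.
double = lambda x: x * 2

# List comprehensions: concise, readable transformations of collections.
squares = [n * n for n in range(5)]     # [0, 1, 4, 9, 16]

# Tuples as key/value pairs: the shape that pair RDDs expect.
counts = [("spark", 3), ("python", 2)]
total = sum(v for _, v in counts)       # 5

print(double(21), squares, total)
```

If these three idioms feel comfortable, you already have most of the Python you need to follow the Spark examples in this course.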
Going Deeper into Spark Core So far, we have covered a lot of ground with Spark Core. If you are just getting started with big data, I can safely say that you are no longer a stranger to Spark. In the previous module, we focused primarily on the many ways of loading and saving data with RDDs. Now it is time to go deeper into Spark Core. In this module, we will focus on transformations, actions, and partitions, and learn more about sampling, combining, aggregating, set operations, caching, shared variables, and more. Let's begin.
Increasing Proficiency with Spark: DataFrames & Spark SQL. Now that you know the lower-level API, namely RDDs, quite well, it is time to learn about the higher-level API: DataFrames and Spark SQL. Maybe you're wondering: why do I need to learn yet another API if I just learned RDDs? The answer is simple. Even though we could most likely do anything we wanted with our data using RDDs, the higher-level API allows us to become proficient with Spark more quickly, which is especially true if you have a relational background. So in this module, I will start by explaining why it is better to work with DataFrames and Spark SQL, and then we will learn how to work with the higher-level API. Let's begin.
Understanding a Typed API: Datasets Works with Scala, Not Python. On many occasions throughout this training, I have said "DataFrame API," but then immediately clarified that it is really the Dataset API, and that DataFrames are one special case, namely Dataset of Row. You see, back in the 1.x days, Spark had three APIs: first RDDs, then DataFrames, and later Datasets. But starting with Spark 2, RDDs became the lower-level API, with DataFrames and Datasets unified into the higher-level API. This leaves Datasets as the typed API and DataFrames, again, Dataset of Row, as the untyped higher-level API. So far, however, we have not used Datasets, and the reason is simple: we can't use Datasets with Python. If we wanted to use them, we would need another language, like Scala. So let me ask you: got Scala?