Apache Spark is one of the fastest and most efficient general engines for large-scale data processing. In this course, you will learn how to develop Spark applications for your Big Data using Python and a stable Hadoop distribution, Cloudera CDH.
At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications with Python & Cloudera, you’ll learn how to process data at scales you previously thought were out of your reach. First, you’ll learn all the technical details of how Spark works. Next, you’ll explore the RDD API, the original core abstraction of Spark. Finally, you’ll discover how to become more proficient using Spark SQL and DataFrames. When you’re finished with this course, you’ll have a foundational knowledge of Apache Spark with Python and Cloudera that will help you as you move forward to develop large-scale data applications that enable you to work with Big Data in an efficient and performant way.
Xavier is very passionate about teaching and helping others understand search and Big Data. He is also an entrepreneur, project manager, technical author, and trainer, and he holds several certifications from Cloudera, Microsoft, and the Scrum Alliance, along with being a Microsoft MVP.
Course Overview Hello, and welcome to this Pluralsight course, Developing Spark Applications with Python and Cloudera. I am Xavier Morera, and I help developers understand enterprise search and Big Data. Did you know that Spark, as a big data processing engine, can be 10 to 100 times faster than Hadoop MapReduce, and that on top of that it is easier to learn, widely adopted, and used for a diverse range of applications? In this course, we're going to learn how to create Spark applications in a very popular and easy-to-use language, Python, and because infrastructure is important, we will leverage the first and one of the most widely used Hadoop distributions, CDH, which stands for Cloudera's Distribution Including Apache Hadoop. Some of the major topics we will cover include getting an environment set up with Spark and some interesting data, namely CDH plus StackOverflow; understanding Spark, both as an overview and in technical detail; working with the original core abstraction of Spark, the RDD, or resilient distributed dataset; and then DataFrames and Spark SQL, which help us become proficient with Spark more quickly. Finally, we will talk at a high level about Datasets, which cannot be used from Python because Python is dynamically typed, and we'll also cover a few related topics. By the end of this course you will be able to create Spark applications with Python and Cloudera. Before beginning the course you should be familiar with programming, preferably with Python, but I also include a small refresher module in case you need a jump start. Additionally, you will need a cluster, and I will explain how to get your infrastructure set up in several different ways. I hope you will join me on this journey to learn about Spark with the Developing Spark Applications with Python and Cloudera course, at Pluralsight.
Why Spark with Python and Cloudera? Hello and welcome to Developing Spark Applications with Python and Cloudera. I am Xavier Morera, and I help developers understand search and big data, and in this course we're going to talk about Apache Spark. Apache Spark is one of the most active projects in open source, and with good reason: it was developed in response to the limitations of MapReduce, and it delivered. Spark can be 10 to 100 times faster than MapReduce, which, combined with the power of Python and a solid Hadoop distribution, namely Cloudera's, lets you create big data applications that are more performant as well as easier to code. Also, by covering Spark, Python, and Cloudera together, you will get a broader picture of developing Spark applications. Additionally, there have been multiple improvements with the release of Spark 2, the most important being the unification of the Dataset and DataFrame APIs, usability improvements, structured streaming, performance improvements, and SQL:2003 support. Let's quickly cover a few details around Spark, starting with getting an environment with data and a quick Python refresher, and then we'll get to the most important part: learning how to work with the Spark APIs. Let's begin.
Refreshing Your Knowledge: Python Fundamentals for This Course. Python is a great programming language: easy to learn, yet really powerful. It is available on many platforms and has a very active developer community. It is also one of the top choices when it comes to big data and data science, as there are many specialized libraries available to you. It is used for production systems by small and large companies alike, including Google, Dropbox, Facebook, Netflix, and Spotify, just to name a few. Also, there are many APIs that you can work with using Python, including the Spark API, which we will use in this course to create big data applications. In this module I will do a very quick Python refresher. If you are an experienced Python developer who is aching to learn Spark, you can jump ahead; however, if you have doubts about what you need to know of Python to work with Spark, please join me in this module.
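As a taste of the refresher, these are the plain-Python constructs you will lean on constantly when writing Spark code: lambdas, list comprehensions, and key/value tuples. The snippet below is a small illustrative sketch, not tied to any Spark API yet.

```python
# Lambdas: anonymous functions, the bread and butter of RDD transformations.
double = lambda x: x * 2

# List comprehensions: concise, readable transformations of collections.
squares = [n * n for n in range(5)]     # [0, 1, 4, 9, 16]

# Tuples as key/value pairs: the shape that pair RDDs expect.
counts = [("spark", 3), ("python", 2)]
total = sum(v for _, v in counts)       # 5

print(double(21), squares, total)
```

If these three idioms feel comfortable, you already have most of the Python you need to follow the Spark examples in this course.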
Going Deeper into Spark Core So far, we have covered a lot of ground with Spark Core. If you are just getting started with big data, I can safely say that you are no longer a stranger to Spark. In the previous module, we focused primarily on the many ways of loading and saving data with RDDs. Now it is time to go deeper into Spark Core. In this module, we will focus on transformations, actions, and partitions, and learn more about sampling, combining, aggregating, set operations, caching, shared variables, and more. Let's begin.
Increasing Proficiency with Spark: DataFrames & Spark SQL. Now that you know the lower-level API, namely RDDs, quite well, it is time to learn about the higher-level API: DataFrames and Spark SQL. Maybe you're wondering: why do I need to learn yet another API if I just learned RDDs? The answer is simple. Even though we could most likely do anything we wanted with our data using RDDs, the higher-level API allows us to become proficient with Spark more quickly, which is especially true if you have a relational background. So in this module, I will start by explaining why it is better to work with DataFrames and Spark SQL, and then we will learn how to work with the higher-level API. Let's begin.
Understanding a Typed API: Datasets Works with Scala, Not Python. On many occasions throughout this training, I have said "DataFrame API," but then immediately clarified that it is really the Dataset API, and that DataFrames are one special case, namely Dataset of Row. You see, back in the 1.x days, Spark had three APIs: first RDDs, then DataFrames, and later Datasets. But starting with Spark 2, RDDs became the lower-level API, with DataFrames and Datasets unified into the higher-level API. This leaves Datasets as the typed API and DataFrames, again, Dataset of Row, as the untyped higher-level API. So far, however, we have not used Datasets, and the reason is simple: we can't use Datasets with Python. If we wanted to use them, we would need another language, like Scala. So let me ask you: got Scala?