Developing Spark Applications Using Scala & Cloudera

by Xavier Morera

Apache Spark is one of the fastest and most efficient general engines for large-scale data processing. In this course, you'll learn how to develop Spark applications for your Big Data using Scala and a stable Hadoop distribution, Cloudera CDH.

Course info

Rating

(48)

Level

Beginner

Updated

May 3, 2018

Duration

5h 43m

What you'll learn

At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications Using Scala & Cloudera, you’ll learn how to process data at scales you previously thought were out of your reach. First, you’ll learn all the technical details of how Spark works. Next, you’ll explore the RDD API, the original core abstraction of Spark. Then, you’ll discover how to become more proficient using Spark SQL and DataFrames. Finally, you'll learn to work with Spark's typed API: Datasets. When you’re finished with this course, you’ll have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you as you move forward to develop large-scale data applications that enable you to work with Big Data in an efficient and performant way.

Course Overview

2mins

Course Overview 2m

Why Spark with Scala and Cloudera?

13mins

Getting an Environment and Data: CDH + StackOverflow

34mins

Getting an Environment & Data: CDH + StackOverflow 2m
Prerequisites & Known Issues 2m
Upgrading Cloudera Manager and CDH 6m
Installing or Upgrading to Java 8 (JDK 1.8) 4m
Getting Spark - There Are Several Options: 1.6 3m
Getting Spark 2 Standalone 3m
Installing Spark 2 on Cloudera 6m
Data: StackOverflow & StackExchange Dumps + Demo Files 3m
Preparing Your Big Data 4m
Takeaway 2m

Refreshing Your Knowledge: Scala Fundamentals for This Course

24mins

Refreshing Your Knowledge: Scala Fundamentals for This Course 1m
Scala's History and Overview 2m
Building and Running Scala Applications 1m
Creating Self-contained Applications, Including scalac & sbt 5m
The Scala Shell: REPL (Read Evaluate Print Loop) 1m
Scala, the Language 4m
More on Types, Functions, and Operations 2m
Expressions, Functions, and Methods 1m
Classes, Case Classes, and Traits 1m
Flow Control 1m
Functional Programming 1m
Enter spark2-shell: Spark in the Scala Shell 1m
Takeaway 2m

Understanding Spark: An Overview

27mins

Understanding Spark: An Overview 3m
Spark, Word Count, Operations, and Transformations 2m
A Few Words on Fine Grained Transformations and Scalability 2m
Word Count in "Not Big Data" 2m
How Word Count Works, Featuring Coarse Grained Transformations 4m
Parallelism by Partitioning Data 3m
Pipelining: One of the Secrets of Spark's Performance 2m
Narrow and Wide Transformations 4m
Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance 4m
Time for the Big Picture: Spark Libraries 2m
Takeaway 1m
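
The word count example named in this module can be sketched without Spark at all. The following is a minimal illustration on an in-memory Scala collection, not code taken from the course; the object name `WordCountSketch`, the function `wordCount`, and the sample sentences are all hypothetical. In Spark the same shape is commonly written against an RDD with `flatMap`, `map`, and `reduceByKey`, but this sketch deliberately stays with plain Scala collections.

```scala
// Word count on an in-memory collection ("not Big Data"); no Spark required.
// All names and sample data here are hypothetical, chosen for illustration.
object WordCountSketch {

  // Split each line into lowercase words, then count how often each word occurs.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\W+")) // one element per word
      .filter(_.nonEmpty)                   // drop empty tokens left by the split
      .groupBy(identity)                    // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val sample = Seq("spark makes big data simple", "big data with spark")
    // Print the counts in alphabetical order of the words.
    wordCount(sample).toSeq.sortBy(_._1).foreach {
      case (word, count) => println(s"$word\t$count")
    }
  }
}
```

Each step works on one element at a time except `groupBy`, which has to bring identical words together. That grouping step is the part that would need data to move between partitions in a distributed setting.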

Getting Technical with Spark

45mins

Getting Technical: Spark Architecture 3m
Storage in Spark and Supported Data Formats 3m
Let's Talk APIs: Low Level and High Level Spark APIs 5m
Performance Optimizations: Tungsten and Catalyst 3m
SparkContext and SparkSession: Entry Points to Spark Apps 4m
Spark Configuration + Client and Cluster Deployment Modes 6m
Spark on Yarn: The Cluster Manager 3m
Spark with Cloudera Manager and YARN UI 4m
Visualizing Your Spark App: Web UI and History Server 8m
Logging in with Spark and Cloudera 2m
Navigating the Spark and Cloudera Documentation 4m
Takeaway 1m

Learning the Core of Spark: RDDs

42mins

Learning the Core of Spark: RDDs 2m
SparkContext: The Entry Point to a Spark Application 4m
RDD and PairRDD - Resilient Distributed Datasets 4m
Creating RDDs with Parallelize 4m
Returning Data to the Driver, i.e. collect(), take(), first()... 4m
Partitions, Repartition, Coalesce, Saving as Text, and HUE 3m
Creating RDDs from External Datasets 10m
Saving Data as ObjectFile, NewAPIHadoopFile, SequenceFile, ... 6m
Creating RDDs with Transformations 3m
A Little Bit More on Lineage and Dependencies 1m
Takeaway 2m

Going Deeper into Spark Core

47mins

Going Deeper into Spark Core 1m
Functional Programming: Anonymous Functions (Lambda) in Spark 2m
A Quick Look at Map, FlatMap, Filter, and Sort 5m
How Can I Tell It Is a Transformation 1m
Why Do We Need Actions? 1m
Partition Operations: MapPartitions and PartitionBy 6m
Sampling Your Data 2m
Set Operations: Join, Union, Full Right, Left Outer, and Cartesian 5m
Combining, Aggregating, Reducing, and Grouping on PairRDDs 9m
ReduceByKey vs. GroupByKey: Which One Is Better? 1m
Grouping Data into Buckets with Histogram 3m
Caching and Data Persistence 2m
Shared Variables: Accumulators and Broadcast 5m
What's Needed for Developing Self-contained Spark Applications 2m
Disadvantages of RDDs - So What's Better? 1m
Takeaway 2m

Increasing Proficiency with Spark: DataFrames and Spark SQL

37mins

Increasing Proficiency with Spark: DataFrames & Spark SQL 1m
"Everyone" Uses SQL and How It All Began 3m
Hello DataFrames and Spark SQL 3m
SparkSession: The Entry Point to the Spark SQL / DataFrame API 2m
Creating DataFrames 2m
DataFrames to RDDs and Vice Versa 3m
Loading DataFrames: Text and CSV 2m
Schemas: Inferred and Programmatically Specified + Option 5m
More Data Loading: Parquet and JSON 4m
Rows, Columns, Expressions, and Operators 2m
Working with Columns 2m
More Columns, Expressions, Cloning, Renaming, Casting, & Dropping 4m
User Defined Functions (UDFs) on Spark SQL 3m
Takeaway 2m

Continuing the Journey on DataFrames and Spark SQL

35mins

Querying, Sorting, and Filtering DataFrames: The DSL 5m
What to Do with Missing or Corrupt Data 4m
Saving DataFrames 5m
Spark SQL: Querying Using Temporary Views 4m
Loading Files and Views into DataFrames Using Spark SQL 2m
Saving to Persistent Tables + Spark 2 Known Issue 2m
Hive Support and External Databases 5m
Aggregating, Grouping, and Joining 5m
The Catalog API 1m
Takeaway 2m

Working with a Typed API: Datasets

19mins

Understanding a Typed API: Datasets 1m
The Motivation Behind Datasets 5m
What's a Dataset? 3m
What Do You Need for Datasets? 1m
Creating Datasets 3m
Dataset Operations 3m
RDDs vs. DataFrames vs. Datasets: A Few Final Thoughts 1m
Takeaway 2m

Final Takeaway and Continuing the Journey with Spark

11mins

Final Takeaway 6m
Continuing the Journey with Spark, Scala, and Cloudera 5m

About the author

Xavier Morera

Xavier Morera is driven by one passion: taking on the challenge of understanding complex topics and sharing that knowledge with others. He’s currently focused on the transformative fields of AI, machine learning, generative AI, search, and big data. As an entrepreneur, project manager, technical author, and trainer, Xavier brings a diverse set of skills and deep expertise to every project he takes on. He holds multiple certifications with Cloudera, Microsoft, and the Scrum Alliance and has been...
