
Transformations and Actions with RDDs in Apache Spark
In this hands-on lab, Transformations and Actions with RDDs in Apache Spark, you'll learn how to apply transformations and actions to efficiently process large-scale data. You'll work with map, filter, and reduceByKey transformations and execute actions like count and collect to retrieve results. Using a New York Stock Exchange dataset, you'll create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information. By the end of this lab, you'll have the skills to use higher-order functions in Spark to process and analyze large datasets efficiently, equipping you with the confidence to tackle big data projects in real-world applications.

Challenge
Introduction to Spark RDDs
Step 1: Introduction to Spark RDDs
Before diving into Transformations and Actions with RDDs in Apache Spark, let's first understand what RDDs are.
RDDs
RDDs are the core abstraction in Spark, offering fault tolerance and distributed processing capabilities. Understanding RDDs and transformations/actions is essential for working with big data in Spark.
Why It Matters
Working with RDDs is fundamental for data engineers and analysts dealing with large-scale data. By mastering RDD operations, you'll be able to:
- Create RDDs from diverse data sources and perform data transformations.
- Utilize high-performance operations like map, filter, and reduceByKey to manipulate and aggregate data.
- Handle large datasets with distributed processing, ensuring scalability and fault tolerance.
Key Concepts
- RDDs (Resilient Distributed Datasets): The primary data structure in Spark, representing a distributed collection of objects that can be processed in parallel across a cluster.
- Transformations and Actions: Transformations like map, filter, and flatMap allow you to modify datasets. Actions like count, collect, and reduce trigger execution and return results (see the short sketch after this list).
- Pair RDDs: Special RDDs where each element is a key-value pair. They enable operations like reduceByKey to aggregate data based on keys.
- Higher-Order Functions: Functions like reduceByKey let you perform complex data aggregations efficiently using Spark's distributed computing power.
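To make the transformation/action distinction concrete, here is a minimal sketch on a tiny local collection. It assumes a SparkContext named sc, which the lab creates in Step 2; the lab itself works with the NYSE dataset instead:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)           // transformation: lazily describes a new RDD, nothing runs yet
println(evens.count())                        // action: triggers execution and prints 2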
Important: Mastering these core concepts of RDDs and transformations/actions will equip you to work with big data in Spark, making you capable of building scalable data pipelines and performing large-scale data analysis.
What You'll Do in This Lab
In this lab, you'll work with a New York Stock Exchange dataset to create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information.
By the end of the lab, you'll confidently use higher-order functions in Spark to efficiently process and analyze large datasets, equipping yourself with essential skills for tackling big data projects.
Let's get started!
Challenge
Create an RDD from the nyse_data CSV File
Step 2: Load a CSV File into a Spark RDD
For this lab, we will use the file located at src/main/scala/StockAnalysis.scala.
To start, we first need to launch Spark. Launching Spark means getting everything ready so you can start processing data: the SparkSession provides one central place where you can access and control all of Spark's features. The complete syntax looks as follows:
val spark: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("SparkScalaRDDLab")
  .getOrCreate()
Let's break this code down into smaller pieces and see what each part is doing:
- SparkSession.builder(): The entry point for creating a Spark application.
- .master("local[2]"): Tells Spark to run locally, using two worker threads.
- .appName("SparkScalaRDDLab"): Sets the name of the Spark application, which is useful for tracking in logs or the Spark UI.
- .getOrCreate(): Either creates a new SparkSession if none exists or retrieves an existing one.

Now let's get the Spark context:
val sc = spark.sparkContext
This code does the following: the spark object is your SparkSession, which is a higher-level entry point to Spark. By writing spark.sparkContext, you're reaching into this session to get the underlying SparkContext.

What is SparkContext?
The SparkContext is like the engine of Spark. It manages the connection to the cluster (or local machine), handles resource allocation, and distributes tasks to the workers (executors).

Why Assign It to sc?
By assigning it to sc, you create an easy reference to use if you need to perform lower-level operations or interact directly with the cluster's resources. Essentially, sc becomes your shortcut to the Spark engine that does all the heavy lifting.
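With the context in hand, the dataset can be loaded into an RDD. A minimal sketch, assuming the NYSE data sits in a file such as nyse_data.csv (the path is an assumption; use wherever the file lives in your lab environment):

val rdd = sc.textFile("nyse_data.csv")  // each element of the RDD is one raw line of the CSV
println(s"Loaded ${rdd.count()} lines") // count is an action, so this forces the file to be read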
Important: When you run your program, you may see the warning below. You can ignore it: it is related to internal Java reflection mechanisms and does not affect the execution of the program or the correctness of your output.
[error] WARNING: An illegal reflective access operation has occurred
Challenge
Perform Spark Operations on RDD
Step 3: Perform Spark Operations on RDD
Goal of Step: The goal of this task is to load market data into an RDD, perform data transformations using map and filter operations, and remove the header row to process the data. Specifically, we will filter the dataset to retain only AAPL stock data and count the number of records for AAPL.
Tasks:
- Perform a transformation (map): Split each line by comma to create structured data.
- Perform a transformation (filter): Keep only AAPL stock data.
- Perform an action (count): Count the number of AAPL records.
What Does map Do?
- The map transformation in Spark applies a given function to each element of an RDD, producing a new RDD with transformed values.
- It is a one-to-one transformation, meaning each input element maps to exactly one output element.
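The breakdown below refers to a map call along these lines (a sketch, assuming rdd is the raw RDD loaded from the CSV and dataRdd is the name used in the rest of the lab):

val dataRdd = rdd.map(line => line.split(",")) // split each raw CSV line into an Array[String]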
Code Breakdown:
- rdd.map(...): Applies a transformation to each element (line) in the original RDD.
- line.split(","): Splits each line (a string) into an array of values using the comma (,) as the delimiter.
- dataRdd: The resulting RDD contains arrays of structured data instead of raw text lines.
Example Input and Output:
Raw RDD (rdd) example:
"NYSE,2023-01-01,AAPL,150.0,20000"
"NYSE,2023-01-01,GOOG,2800.5,15000"
Each line is an unstructured string representing stock trading data.
After applying map, dataRdd.collect() would return:
Array(
  Array("NYSE", "2023-01-01", "AAPL", "150.0", "20000"),
  Array("NYSE", "2023-01-01", "GOOG", "2800.5", "15000")
)
Now, each row is structured as an array, making it easier to process in later steps.
Why Use map Here?
- It converts unstructured text data into a structured format.
- This makes it easier to perform further transformations, such as filtering, aggregations, or converting data types.
- It prepares the data for meaningful analysis and insights.
This is the first step in processing and analyzing stock trading data in Apache Spark!
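The step goal also mentions removing the header row before filtering. One common way to produce the dataRddNoHeader RDD used below (a sketch; the exact approach in StockAnalysis.scala may differ):

val header = dataRdd.first()                                           // the first row is the CSV header
val dataRddNoHeader = dataRdd.filter(row => !row.sameElements(header)) // keep every row that is not the header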
Code Breakdown:
val aaplRdd = dataRddNoHeader.filter(row => row(0) == "AAPL")
- .filter(...): Keeps only the elements that satisfy the given condition.
- row(0) == "AAPL": Checks whether the first column (stock symbol) is "AAPL", keeping only those rows.
Example Input (dataRddNoHeader):
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000"),
  Array("GOOG", "2023-01-01", "2800.5", "15000"),
  Array("AAPL", "2023-01-02", "152.3", "22000")
)
After filtering, aaplRdd.collect() returns:
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000"),
  Array("AAPL", "2023-01-02", "152.3", "22000")
)
Only AAPL stock data remains.
Why Use filter?
- Helps focus on specific stock data for targeted analysis.
- Reduces the dataset size, making processing more efficient.
This step refines the data before further analysis.
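The remaining task from the list above is the count action, which could look like this (a minimal sketch):

val aaplCount = aaplRdd.count()                // action: triggers execution and returns the number of AAPL rows
println(s"Number of AAPL records: $aaplCount")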
Challenge
Use Pair RDDs for Key-Value Transformations
Step 4: Use Pair RDDs for Key-Value Transformations
Goal of Step:
The goal of this step is to utilize Pair RDDs to perform key-value transformations, specifically calculating the total closing price per stock symbol.
Tasks:
- Create key-value pairs
- Reduce by key: Calculate the total closing price per stock symbol
- Collect and print the results
Code Explanation:
val pairRdd = dataRddNoHeader.map(row => (row(0), row(5).toFloat))
- map(row => (row(0), row(5).toFloat)):
  - row(0): Extracts the stock symbol (e.g., "AAPL").
  - row(5).toFloat: Extracts the closing price and converts it to a float.
- Creates a Pair RDD where:
  - Key = stock symbol (e.g., "AAPL", "GOOG").
  - Value = closing price (e.g., 155.0, 2820.0).
Example Input (dataRddNoHeader):
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000", "160.0", "155.0"),
  Array("GOOG", "2023-01-01", "2800.5", "15000", "2850.0", "2820.0"),
  Array("AAPL", "2023-01-02", "152.3", "22000", "158.0", "154.0")
)
After applying map, pairRdd.collect() returns:
Array(
  ("AAPL", 155.0),
  ("GOOG", 2820.0),
  ("AAPL", 154.0)
)
Why Use a Pair RDD?
- Enables key-based operations like grouping, aggregating, and sorting.
- Essential for transformations like reduceByKey and groupByKey.
This step structures the data for further stock price analysis!
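The total closing price per symbol used in the next snippet comes from reduceByKey. A minimal sketch, using the symbolCloseSum name that appears below:

val symbolCloseSum = pairRdd.reduceByKey(_ + _) // add up the closing prices for each stock symbol (key)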
Code Explanation:
println("\nTotal closing price per stock symbol:") symbolCloseSum.collect().foreach { case (symbol, totalClose) => println(s"$symbol: $totalClose") }
- symbolCloseSum.collect(): Gathers the results from the distributed computation onto the driver.
- .foreach { case (symbol, totalClose) => println(...) }: Iterates through each (symbol, total closing price) pair and prints it in the format "AAPL: 309.0".
Example Output:
Total closing price per stock symbol:
AAPL: 309.0
GOOG: 2820.0
Why Use collect?
- Brings results to the driver for display.
- Ideal for small final outputs (avoid on large datasets).
This step lets you view the computed totals!

Congratulations on Completing the Lab!
You have successfully completed the lab on Transformations and Actions with RDDs in Apache Spark. In this module, you learned:
- Creating an RDD from the New York Stock Exchange CSV dataset.
- Applying map and filter transformations and executing actions like count and collect.
- Using Pair RDDs for key-value transformations.
Key Takeaways
- RDD Creation: Learn how to create and manipulate RDDs for efficient data processing.
- Transformation & Actions: Apply map, filter, and reduce actions to process large datasets.
- Pair RDDs: Utilize key-value transformations for advanced data aggregation and grouping.
Thank you for completing the lab!