
Transformations and Actions with RDDs in Apache Spark
In this hands-on lab, Transformations and Actions with RDDs in Apache Spark, you'll learn how to apply transformations and actions to efficiently process large-scale data. You'll work with map, filter, and reduceByKey transformations and execute actions like count and collect to retrieve results. Using a New York Stock Exchange dataset, you'll create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information. By the end of this lab, you'll have the skills to use higher-order functions in Spark to process and analyze large datasets efficiently, equipping you with the confidence to tackle big data projects in real-world applications.

Challenge
Introduction to Spark RDDs
Step 1: Introduction to Spark RDDs
Before diving into Transformations and Actions with RDDs in Apache Spark, let's first understand what RDDs are.
RDDs
RDDs are the core abstraction in Spark, offering fault tolerance and distributed processing capabilities. Understanding RDDs and transformations/actions is essential for working with big data in Spark.
Why It Matters
Working with RDDs is fundamental for data engineers and analysts dealing with large-scale data. By mastering RDD operations, you'll be able to:
- Create RDDs from diverse data sources and perform data transformations.
- Utilize high-performance operations like map, filter, and reduceByKey to manipulate and aggregate data.
- Handle large datasets with distributed processing, ensuring scalability and fault tolerance.
Key Concepts
- RDDs (Resilient Distributed Datasets): The primary data structure in Spark, representing a distributed collection of objects that can be processed in parallel across a cluster.
- Transformations and Actions: Transformations like map, filter, and flatMap allow you to modify datasets. Actions like count, collect, and reduce trigger execution and return results (see the short sketch after this list).
- Pair RDDs: Special RDDs where each element is a key-value pair. They enable operations like reduceByKey to aggregate data based on keys.
- Higher-Order Functions: Functions like reduceByKey let you perform complex data aggregations efficiently using Spark's distributed computing power.
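To make the transformation/action distinction concrete, here is a minimal sketch on a tiny local collection. It assumes a SparkContext named sc, which the lab creates in Step 2; the lab itself works with the NYSE dataset instead:

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5)) // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)           // transformation: lazily describes a new RDD, nothing runs yet
println(evens.count())                        // action: triggers execution and prints 2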
Important: Mastering these core concepts of RDDs and transformations/actions will equip you to work with big data in Spark, making you capable of building scalable data pipelines and performing large-scale data analysis.
What You'll Do in This Lab
In this lab, you'll work with a New York Stock Exchange dataset to create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information.
By the end of the lab, you'll confidently use higher-order functions in Spark to efficiently process and analyze large datasets, equipping yourself with essential skills for tackling big data projects.
Let's get started!
Challenge
Create an RDD from the nyse_data CSV File
Step 2: Load a CSV File into a Spark RDD
For this lab, we will use the file located at src/main/scala/StockAnalysis.scala.
To start, we first need to launch Spark. Launching Spark means getting everything ready so you can start processing data: the SparkSession provides one central place where you can access and control all of Spark's features. The complete syntax looks as follows:
val spark: SparkSession = SparkSession.builder()
  .master("local[2]")
  .appName("SparkScalaRDDLab")
  .getOrCreate()
Let's break this code down into smaller pieces and see what each part is doing:
- SparkSession.builder(): The entry point for creating a Spark application.
- .master("local[2]"): Tells Spark to run locally, using two worker threads.
- .appName("SparkScalaRDDLab"): Sets the name of the Spark application, which is useful for tracking in logs or the Spark UI.
- .getOrCreate(): Either creates a new SparkSession if none exists or retrieves an existing one.

Now let's get the Spark context:
val sc = spark.sparkContext
This code does the following: the spark object is your SparkSession, which is a higher-level entry point to Spark. By writing spark.sparkContext, you're reaching into this session to get the underlying SparkContext.

What is SparkContext?
The SparkContext is like the engine of Spark. It manages the connection to the cluster (or local machine), handles resource allocation, and distributes tasks to the workers (executors).

Why Assign It to sc?
By assigning it to sc, you create an easy reference to use if you need to perform lower-level operations or interact directly with the cluster's resources. Essentially, sc becomes your shortcut to the Spark engine that does all the heavy lifting.
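With the context in hand, the dataset can be loaded into an RDD. A minimal sketch, assuming the NYSE data sits in a file such as nyse_data.csv (the path is an assumption; use wherever the file lives in your lab environment):

val rdd = sc.textFile("nyse_data.csv")  // each element of the RDD is one raw line of the CSV
println(s"Loaded ${rdd.count()} lines") // count is an action, so this forces the file to be read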
Important: When you run your program, you may see the warning below. You can ignore it: it is related to internal Java reflection mechanisms and does not affect the execution of the program or the correctness of your output.
[error] WARNING: An illegal reflective access operation has occurred
Challenge
Perform Spark Operations on RDD
Step 3: Perform Spark Operations on RDD
Goal of Step: The goal of this task is to load market data into an RDD, perform data transformations using map and filter operations, and remove the header row to process the data. Specifically, we will filter the dataset to retain only AAPL stock data and count the number of records for AAPL.
Tasks:
- Perform a transformation (map): Split each line by comma to create structured data.
- Perform a transformation (filter): Keep only AAPL stock data.
- Perform an action (count): Count the number of AAPL records.
What Does map Do?
- The map transformation in Spark applies a given function to each element of an RDD, producing a new RDD with transformed values.
- It is a one-to-one transformation, meaning each input element maps to exactly one output element.
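The breakdown below refers to a map call along these lines (a sketch, assuming rdd is the raw RDD loaded from the CSV and dataRdd is the name used in the rest of the lab):

val dataRdd = rdd.map(line => line.split(",")) // split each raw CSV line into an Array[String]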
Code Breakdown:
- rdd.map(...): Applies a transformation to each element (line) in the original RDD.
- line.split(","): Splits each line (a string) into an array of values using the comma (,) as the delimiter.
- dataRdd: The resulting RDD contains arrays of structured data instead of raw text lines.
Example Input and Output:
Raw RDD (rdd) example:
"NYSE,2023-01-01,AAPL,150.0,20000"
"NYSE,2023-01-01,GOOG,2800.5,15000"
Each line is an unstructured string representing stock trading data.
After applying map, dataRdd.collect() would return:
Array(
  Array("NYSE", "2023-01-01", "AAPL", "150.0", "20000"),
  Array("NYSE", "2023-01-01", "GOOG", "2800.5", "15000")
)
Now, each row is structured as an array, making it easier to process in later steps.
Why Use map Here?
- It converts unstructured text data into a structured format.
- This makes it easier to perform further transformations, such as filtering, aggregations, or converting data types.
- It prepares the data for meaningful analysis and insights.
This is the first step in processing and analyzing stock trading data in Apache Spark!
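The step goal also mentions removing the header row before filtering. One common way to produce the dataRddNoHeader RDD used below (a sketch; the exact approach in StockAnalysis.scala may differ):

val header = dataRdd.first()                                           // the first row is the CSV header
val dataRddNoHeader = dataRdd.filter(row => !row.sameElements(header)) // keep every row that is not the header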
Code Breakdown:
val aaplRdd = dataRddNoHeader.filter(row => row(0) == "AAPL")
- .filter(...): Keeps only the elements that satisfy the given condition.
- row(0) == "AAPL": Checks whether the first column (stock symbol) is "AAPL", keeping only those rows.
Example Input (dataRddNoHeader):
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000"),
  Array("GOOG", "2023-01-01", "2800.5", "15000"),
  Array("AAPL", "2023-01-02", "152.3", "22000")
)
After filtering, aaplRdd.collect() returns:
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000"),
  Array("AAPL", "2023-01-02", "152.3", "22000")
)
Only AAPL stock data remains.
Why Use filter?
- Helps focus on specific stock data for targeted analysis.
- Reduces the dataset size, making processing more efficient.
This step refines the data before further analysis.
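The remaining task from the list above is the count action, which could look like this (a minimal sketch):

val aaplCount = aaplRdd.count()                // action: triggers execution and returns the number of AAPL rows
println(s"Number of AAPL records: $aaplCount")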
Challenge
Use Pair RDDs for Key-Value Transformations
Step 4: Use Pair RDDs for Key-Value Transformations
Goal of Step:
The goal of this step is to utilize Pair RDDs to perform key-value transformations, specifically calculating the total closing price per stock symbol.
Tasks:
- Create key-value pairs
- Reduce by key: Calculate the total closing price per stock symbol
- Collect and print the results
Code Explanation:
val pairRdd = dataRddNoHeader.map(row => (row(0), row(5).toFloat))
- map(row => (row(0), row(5).toFloat)):
  - row(0): Extracts the stock symbol (e.g., "AAPL").
  - row(5).toFloat: Extracts the closing price and converts it to a float.
- Creates a Pair RDD where:
  - Key = stock symbol (e.g., "AAPL", "GOOG").
  - Value = closing price (e.g., 155.0, 2820.0).
Example Input (dataRddNoHeader):
Array(
  Array("AAPL", "2023-01-01", "150.0", "20000", "160.0", "155.0"),
  Array("GOOG", "2023-01-01", "2800.5", "15000", "2850.0", "2820.0"),
  Array("AAPL", "2023-01-02", "152.3", "22000", "158.0", "154.0")
)
After applying map, pairRdd.collect() returns:
Array(
  ("AAPL", 155.0),
  ("GOOG", 2820.0),
  ("AAPL", 154.0)
)
Why Use a Pair RDD?
- Enables key-based operations like grouping, aggregating, and sorting.
- Essential for transformations like reduceByKey and groupByKey.
This step structures the data for further stock price analysis!
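The total closing price per symbol used in the next snippet comes from reduceByKey. A minimal sketch, using the symbolCloseSum name that appears below:

val symbolCloseSum = pairRdd.reduceByKey(_ + _) // add up the closing prices for each stock symbol (key)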
Code Explanation:
println("\nTotal closing price per stock symbol:") symbolCloseSum.collect().foreach { case (symbol, totalClose) => println(s"$symbol: $totalClose") }
- symbolCloseSum.collect(): Gathers the results from the distributed computation onto the driver.
- .foreach { case (symbol, totalClose) => println(...) }: Iterates through each (symbol, total closing price) pair and prints it in the format "AAPL: 309.0".
Example Output:
Total closing price per stock symbol:
AAPL: 309.0
GOOG: 2820.0
Why Use collect?
- Brings results to the driver for display.
- Ideal for small final outputs (avoid on large datasets).
This step lets you view the computed totals!

Congratulations on Completing the Lab!
You have successfully completed the lab on Transformations and Actions with RDDs in Apache Spark. In this module, you learned:
- Creating an RDD from the New York Stock Exchange CSV dataset.
- Applying map and filter transformations and executing actions like count and collect.
- Using Pair RDDs for key-value transformations.
Key Takeaways
- RDD Creation: Learn how to create and manipulate RDDs for efficient data processing.
- Transformation & Actions: Apply map, filter, and reduce actions to process large datasets.
- Pair RDDs: Utilize key-value transformations for advanced data aggregation and grouping.
Thank you for completing the lab!