
Transformations and Actions with RDDs in Apache Spark

In this hands-on lab, Transformations and Actions with RDDs in Apache Spark, you'll learn how to apply transformations and actions to efficiently process large-scale data. You'll work with map, filter, and reduceByKey transformations and execute actions like count and collect to retrieve results. Using a New York Stock Exchange dataset, you'll create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information. By the end of this lab, you'll have the skills to use higher-order functions in Spark to process and analyze large datasets efficiently, equipping you with the confidence to tackle big data projects in real-world applications.


Path Info

Level: Intermediate
Duration: 38m
Published: Mar 04, 2025


Table of Contents

  1. Challenge

    Introduction to Spark RDDs

    Step 1: Introduction to Spark RDDs

    Before diving into Transformations and Actions with RDDs in Apache Spark, let's first understand what RDDs are.

    📌 RDDs

    RDDs are the core abstraction in Spark, offering fault tolerance and distributed processing capabilities. Understanding RDDs and transformations/actions is essential for working with big data in Spark.

    Why It Matters

    Working with RDDs is fundamental for data engineers and analysts dealing with large-scale data. By mastering RDD operations, you'll be able to:

    • Create RDDs from diverse data sources and perform data transformations.
    • Utilize high-performance operations like map, filter, and reduceByKey to manipulate and aggregate data.
    • Handle large datasets with distributed processing, ensuring scalability and fault tolerance.

    Key Concepts

    📌 RDDs (Resilient Distributed Datasets): The primary data structure in Spark, representing a distributed collection of objects that can be processed in parallel across a cluster.

    📌 Transformations and Actions:

    Transformations like map, filter, and flatMap allow you to modify datasets. Actions like count, collect, and reduce trigger execution and return results.

    📌 Pair RDDs:

    Special RDDs where each element is a key-value pair. They enable operations like reduceByKey to aggregate data based on keys.

    📌 Higher-Order Functions:
    Functions like reduceByKey let you perform complex data aggregations efficiently using Spark's distributed computing power.
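
    As a quick, self-contained illustration of these concepts, here is a toy word-count sketch (not the lab's dataset; it assumes a SparkContext named sc, which the lab creates in the next step):

    val words = sc.parallelize(Seq("spark", "rdd", "spark")) // create an RDD from a local collection
    val pairs = words.map(word => (word, 1))                 // transformation: build a Pair RDD of (word, 1)
    val counts = pairs.reduceByKey(_ + _)                    // higher-order aggregation by key
    counts.collect().foreach(println)                        // action: prints (spark,2) and (rdd,1), order may vary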

    ✅ Important: Mastering these core concepts of RDDs and transformations/actions will equip you to work with big data in Spark, making you capable of building scalable data pipelines and performing large-scale data analysis.

    📌 What You'll Do in This Lab

    In this lab, you'll work with a New York Stock Exchange dataset to create an RDD, apply transformations to clean and structure the data, and execute actions to analyze stock trading information.

    By the end of the lab, you'll confidently use higher-order functions in Spark to efficiently process and analyze large datasets, equipping yourself with essential skills for tackling big data projects.

    🚀 Let's get started!

  2. Challenge

    Create an RDD from the nyse_data CSV File

    Step 2: Load a CSV File into a Spark RDD

    For this lab, we will work in the file located at src/main/scala/StockAnalysis.scala.

    To start, we first need to launch Spark. Launching Spark means getting everything ready so you can start processing data. It provides one central place where you can access and control all of Spark's features. The complete syntax looks as follows:

    val spark: SparkSession = SparkSession.builder()
      .master("local[2]")
      .appName("SparkScalaRDDLab")
      .getOrCreate()
    

    Let's break this code down into smaller pieces and see what each part is doing:

    • SparkSession.builder(): This is the entry point for creating a Spark application.
    • .master("local[2]"): This tells Spark to run locally, using two worker threads.
    • .appName("SparkScalaRDDLab"): This sets the name of the Spark application, which is useful for tracking in logs or the Spark UI.
    • .getOrCreate(): This either creates a new SparkSession if none exists or retrieves an existing one.

    Now let's get the Spark context:

    val sc = spark.sparkContext

    📌 This code does the following:

    The spark object is your SparkSession, which is a higher-level entry point to Spark. By writing spark.sparkContext, you're reaching into this session to get the underlying SparkContext.

    📌 What is SparkContext?

    The SparkContext is like the engine of Spark. It manages the connection to the cluster (or local machine), handles resource allocation, and distributes tasks to the workers (executors).

    📌 Why Assign It to sc?

    By assigning it to sc, you create an easy reference to use if you need to perform lower-level operations or interact directly with the cluster's resources. Essentially, sc becomes your shortcut to the Spark engine that does all the heavy lifting.
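
    With the SparkContext available, the CSV file can be loaded into an RDD of text lines. A minimal sketch, assuming the nyse_data CSV sits at a path like the one below (adjust it to your environment):

    // Read the NYSE CSV file; each element of the resulting RDD is one line of text.
    val rdd = sc.textFile("nyse_data.csv")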

    ✅ Important: When you run your program, you may see the warning below. You can ignore it, as it does not affect the execution of the program; it is related to internal Java reflection mechanisms and does not impact the correctness of your output.

    [error] WARNING: An illegal reflective access operation has occurred

  3. Challenge

    Perform Spark Operations on RDD

    Step 3: Perform Spark Operations on RDD

    Goal of Step:

    The goal of this task is to load market data into an RDD, perform data transformations using map and filter operations, and remove the header row before processing the data. Specifically, we will filter the dataset to retain only AAPL stock data and count the number of records for AAPL.

    Tasks:

    • Perform transformations: Map: Split each line by a comma to create structured data.
    • Perform transformations: Filter: Keep only AAPL stock data.
    • Perform an action: Count the number of AAPL records.

    What Does map Do?

    • The map transformation in Spark applies a given function to each element of an RDD, producing a new RDD with transformed values.
    • It is a one-to-one transformation, meaning each input element maps to exactly one output element.

    Code Breakdown:
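
    The transformation being broken down here would look roughly like this (reconstructed from the names used in the breakdown, so treat it as a sketch rather than the lab's verbatim code):

    // Split each CSV line into an array of fields.
    val dataRdd = rdd.map(line => line.split(","))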

    • rdd.map(...): Applies a transformation to each element (line) in the original RDD.
    • line.split(","): Splits each line (a string) into an array of values using the comma (,) as the delimiter.
    • dataRdd: The resulting RDD contains arrays of structured data instead of raw text lines.

    Example Input and Output:

    Raw RDD (rdd) Example:

    "NYSE,2023-01-01,AAPL,150.0,20000"
    "NYSE,2023-01-01,GOOG,2800.5,15000"
    

    Each line is an unstructured string representing stock trading data.

    After Applying map:

    dataRdd.collect()
    

    Would return:

    Array(
      Array("NYSE", "2023-01-01", "AAPL", "150.0", "20000"),
      Array("NYSE", "2023-01-01", "GOOG", "2800.5", "15000")
    )
    

    Now, each row is structured as an array, making it easier to process in later steps.

    Why Use map Here?

    • It converts unstructured text data into a structured format.
    • This makes it easier to perform further transformations, such as filtering, aggregations, or converting data types.
    • It prepares the data for meaningful analysis and insights.

    This is the first step in processing and analyzing stock trading data in Apache Spark! 🚀
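
    The filter step below operates on dataRddNoHeader, the structured RDD with the header row removed. One common way to drop the header, sketched here as an assumption rather than the lab's exact code:

    // Drop the first element of the first partition, i.e. the header row.
    val dataRddNoHeader = dataRdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter
    }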

    Code Breakdown:

    val aaplRdd = dataRddNoHeader.filter(row => row(0) == "AAPL")
    
    • .filter(...): Keeps only the elements that satisfy the given condition.
    • row(0) == "AAPL": Checks if the first column (stock symbol) is "AAPL", keeping only those rows.

    Example Input (dataRddNoHeader):

    Array(
      Array("AAPL", "2023-01-01", "150.0", "20000"),
      Array("GOOG", "2023-01-01", "2800.5", "15000"),
      Array("AAPL", "2023-01-02", "152.3", "22000")
    )
    

    After Filtering (aaplRdd.collect()):

    Array(
      Array("AAPL", "2023-01-01", "150.0", "20000"),
      Array("AAPL", "2023-01-02", "152.3", "22000")
    )
    

    Only AAPL stock data remains.

    Why Use filter?

    • Helps focus on specific stock data for targeted analysis.
    • Reduces the dataset size, making processing more efficient.

    This step refines the data before further analysis. 🚀
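
    The third task in this step is an action: counting the AAPL records. A minimal sketch of that action (the variable name aaplCount is introduced here for illustration):

    // count() is an action: it triggers execution and returns the number of elements.
    val aaplCount = aaplRdd.count()
    println(s"Number of AAPL records: $aaplCount")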

  4. Challenge

    Use Pair RDDs for Key-value Transformation

    Step 4: Use Pair RDDs for Key-value Transformations

    Goal of Step:

    The goal of this step is to utilize Pair RDDs to perform key-value transformations, specifically calculating the total closing price per stock symbol.

    Tasks:

    • Create key-value pairs
    • Reduce by key: Calculate the total closing price per stock symbol
    • Collect and print the results

    Code Explanation:

    val pairRdd = dataRddNoHeader.map(row => (row(0), row(5).toFloat))
    
    • map(row => (row(0), row(5).toFloat))
      • row(0): Extracts the stock symbol (e.g., "AAPL").
      • row(5).toFloat: Extracts the closing price and converts it to a float.
    • Creates a Pair RDD where:
      • Key = Stock symbol (e.g., "AAPL", "GOOG").
      • Value = Closing price (e.g., 155.0, 2820.0).

    Example Input (dataRddNoHeader)

    Array(
      Array("AAPL", "2023-01-01", "150.0", "20000", "160.0", "155.0"),
      Array("GOOG", "2023-01-01", "2800.5", "15000", "2850.0", "2820.0"),
      Array("AAPL", "2023-01-02", "152.3", "22000", "158.0", "154.0")
    )
    

    After Applying map

    pairRdd.collect()
    

    Returns:

    Array(
      ("AAPL", 155.0),
      ("GOOG", 2820.0),
      ("AAPL", 154.0)
    )
    

    Why Use a Pair RDD?

    • Enables key-based operations like grouping, aggregating, and sorting.
    • Essential for transformations like reduceByKey and groupByKey.

    This step structures the data for further stock price analysis! 🚀
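
    The printing code explained next refers to symbolCloseSum, the result of the reduce-by-key task. A minimal sketch of that aggregation (the lab's exact code is not shown here):

    // Sum the closing prices for each stock symbol.
    val symbolCloseSum = pairRdd.reduceByKey(_ + _)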

    Code Explanation:

    println("\nTotal closing price per stock symbol:")
    symbolCloseSum.collect().foreach {
      case (symbol, totalClose) =>
        println(s"$symbol: $totalClose")
    }
    
    • symbolCloseSum.collect():
      • Gathers the results from distributed computation into the driver.
    • .foreach { case (symbol, totalClose) => println(...) }:
      • Iterates through each (symbol, total closing price) pair.
      • Prints in the format: "AAPL: 309.0".

    Example Output:

    Total closing price per stock symbol:
    AAPL: 309.0
    GOOG: 2820.0
    

    Why Use collect?

    • Brings results to the driver for display.
    • Ideal for small final outputs (avoid on large datasets).

    This step lets you view the computed totals! 🚀

    Congratulations on Completing the Lab! 🎉

    You have successfully completed the lab on Transformations and Actions with RDDs in Apache Spark. In this module, you learned:

    • Creating an RDD from the NYSE CSV dataset.
    • Performing map and filter transformations and executing actions such as count and collect.
    • Using Pair RDDs and reduceByKey for key-value transformations.

    Key Takeaways

    • RDD Creation: Create and manipulate RDDs for efficient data processing.
    • Transformations & Actions: Apply transformations like map and filter, and actions like count and collect, to process large datasets.
    • Pair RDDs: Use key-value transformations such as reduceByKey for advanced data aggregation and grouping.

    Thank you for completing the lab! 🚀
