Running a PySpark Job on Cloud Dataproc Using Google Cloud Storage

This hands-on lab shows how to use Google Cloud Storage (GCS) as the primary input and output location for Dataproc cluster jobs. Using GCS instead of the Hadoop Distributed File System (HDFS) lets us treat clusters as ephemeral: we can delete a cluster that is no longer in use and still keep our data.

Lab platform
Lab Info
Level
Intermediate
Last updated
Jun 08, 2025
Duration
30m

Table of Contents
  1. Challenge

    Prepare Our Environment

    First, we need to enable the Dataproc API:

    gcloud services enable dataproc.googleapis.com
    

    Then create a Cloud Storage bucket:

    gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
    

    Next, enable the Cloud Resource Manager API, turn on Private Google Access for the default subnet, and then create the ephemeral Dataproc cluster:

    gcloud services enable cloudresourcemanager.googleapis.com
    
    gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
    
    gcloud dataproc clusters create wordcount --region=us-central1 --single-node --master-machine-type=n1-standard-2
    

    Finally, download the wordcount.py file that will be used for the PySpark job:

    gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .
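The lab does not show the contents of wordcount.py, so the following is only a minimal sketch of what a PySpark word-count script of this kind might look like. The `tokenize` rule and the app name are assumptions; the only thing taken from the lab is that the script receives an input URI and an output URI as its two arguments.

```python
import re
import sys


def tokenize(line):
    """Split a line of text into lowercase words (an assumed tokenization rule)."""
    return re.findall(r"[a-z']+", line.lower())


def main(argv):
    # pyspark is imported lazily so the helper above can be used without Spark installed.
    from pyspark import SparkContext

    input_uri, output_uri = argv[1], argv[2]

    sc = SparkContext(appName="wordcount-sketch")  # hypothetical app name
    counts = (
        sc.textFile(input_uri)            # gs:// URIs are readable on Dataproc
        .flatMap(tokenize)                # one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)  # sum the ones per word
    )
    counts.saveAsTextFile(output_uri)     # writes part-* files under the output prefix
    sc.stop()


# Only run the Spark job when exactly two arguments (input, output) are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Because both locations are passed in as arguments, the same script can read from and write to Cloud Storage without any cluster-local paths.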
    
  2. Challenge

    Submit the PySpark Job to the Dataproc Cluster

    In Cloud Shell, type:

    gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- \
    gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt \
    gs://$DEVSHELL_PROJECT_ID-data/output/
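To make the structure of that command easier to see, here is a small sketch that assembles the same arguments as a Python list. The helper name `build_submit_command` and the project ID `my-project` are hypothetical; the flags and URIs are the ones used in this lab. Everything after the bare `--` is handed to wordcount.py rather than interpreted by gcloud.

```python
import shlex


def build_submit_command(project_id, script="wordcount.py",
                         cluster="wordcount", region="us-central1"):
    """Return the argument list for the lab's job submission (hypothetical helper)."""
    input_uri = "gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt"
    output_uri = f"gs://{project_id}-data/output/"
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}", f"--region={region}",
        "--",  # arguments after this separator go to the script, not to gcloud
        input_uri, output_uri,
    ]


cmd = build_submit_command("my-project")  # "my-project" is a placeholder project ID
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` in an environment where gcloud is installed and authenticated.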
    
  3. Challenge

    Review the PySpark Output
    1. In Cloud Shell, download output files from the GCS output location:
    gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .
    

    Note: Alternatively, we could download them to our local machine via the web console.
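Once the output files are in the current directory, a short script can merge them and list the most frequent words. This is a sketch under an assumption: that each line of the part-* files is a Python tuple such as `('romeo', 3)`, which is what a PySpark `saveAsTextFile` call on (word, count) pairs would produce. If the lab's script writes a different format, the parsing line needs to change.

```python
import ast
import glob
import os


def load_counts(output_dir):
    """Read every part-* file in output_dir into a {word: count} dict."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                word, count = ast.literal_eval(line)  # assumes "('word', 3)" lines
                counts[word] = counts.get(word, 0) + count
    return counts


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


if __name__ == "__main__":
    for word, count in top_words(load_counts(".")):
        print(f"{count:6d}  {word}")
```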

  4. Challenge

    Delete the Dataproc Cluster
    1. We don't need our cluster any longer, so let's delete it. In the web console, go to the top-left menu and into BIGDATA > Dataproc.

    2. Select the wordcount cluster, then click DELETE, and OK to confirm.

    Our job output remains in Cloud Storage after the cluster is gone. This is what lets us delete Dataproc clusters as soon as they are idle to save costs, while preserving the input and output data.

About the author

Pluralsight Skills gives leaders confidence they have the skills needed to execute technology strategy. Technology teams can benchmark expertise across roles, speed up release cycles and build reliable, secure products. By leveraging our expert content, skill assessments and one-of-a-kind analytics, keep up with the pace of change, put the right people on the right projects and boost productivity. It's the most effective path to developing tech skills at scale.
