Preparing Data Using Amazon Athena and AWS Glue

Imagine you are a data engineer tasked with preparing data so that the machine learning engineers can build a highly predictive model. Your corporation works with AWS, and you have been encouraged to use AWS services. Your raw data has been uploaded to an input folder in an S3 bucket. You will use a Glue crawler to detect the schema, then load the data into a database that you query with SQL to detect discrepancies. Finally, you will use the Visual ETL tool from AWS Glue to check for missing or duplicate data and write the processed data to the output folder.

Lab Info
Level: Beginner
Last updated: Sep 23, 2025
Duration: 1h 0m

Table of Contents
  1. Challenge

    Create a Storage Area to Store the Input Files
    • Download the input dataset Employee_lab.csv from the lab GitHub repo.
    • Launch the AWS console and create an S3 bucket and two folders, input and output.
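The lab performs this step in the AWS console. For readers who prefer to script it, the following is a minimal sketch of the same setup, assuming a boto3 S3 client is passed in. The function name `create_lab_storage` and the bucket and region values are hypothetical, not part of the lab.

```python
import os


def create_lab_storage(s3_client, bucket, region, csv_path="Employee_lab.csv"):
    """Create the bucket, the input/ and output/ folders, and upload the CSV.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if region == "us-east-1":
        # us-east-1 does not accept an explicit LocationConstraint.
        s3_client.create_bucket(Bucket=bucket)
    else:
        s3_client.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )

    # S3 has no real folders; a zero-byte object whose key ends in "/"
    # is how a folder appears in the console.
    for prefix in ("input/", "output/"):
        s3_client.put_object(Bucket=bucket, Key=prefix)

    # Place the raw dataset under the input/ prefix.
    key = "input/" + os.path.basename(csv_path)
    s3_client.upload_file(csv_path, bucket, key)
    return key
```

With boto3 installed and credentials configured, a call might look like `create_lab_storage(boto3.client("s3", region_name="us-west-2"), "my-lab-bucket", "us-west-2")`. Bucket names must be globally unique, so the name shown here is only a placeholder.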
  2. Challenge

    Read the Raw Data to a Database
    • Create a Glue Crawler.
    • Configure the S3 input folder as the data source.
    • When prompted to create a new IAM role, add the suffix mlsc01 to the predefined role name so it's easy to find later.
    • Create and add a database.
    • Run the crawler so that it catalogs the raw data as a table in this database.
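The crawler steps above can also be sketched with boto3, assuming a Glue client is passed in and the IAM role already exists (the console creates it in the lab). The database name, crawler name, and the full role name `AWSGlueServiceRole-mlsc01` are assumptions for illustration; only the mlsc01 suffix comes from the lab text.

```python
def create_and_run_crawler(
    glue_client,
    bucket,
    database="employee_db",            # hypothetical database name
    crawler="employee-crawler",        # hypothetical crawler name
    role="AWSGlueServiceRole-mlsc01",  # assumed role name with the lab's suffix
):
    """Create a Glue database and a crawler over s3://<bucket>/input/, then start it.

    glue_client is assumed to be a boto3 Glue client, e.g. boto3.client("glue").
    """
    # The database holds the table definitions that the crawler produces.
    glue_client.create_database(DatabaseInput={"Name": database})

    # Point the crawler at the input folder and at the new database.
    glue_client.create_crawler(
        Name=crawler,
        Role=role,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": f"s3://{bucket}/input/"}]},
    )

    # The run is asynchronous; the table appears once the crawler finishes.
    glue_client.start_crawler(Name=crawler)
    return crawler
```

Note that a crawler records the schema and location of the files in the Glue Data Catalog rather than copying the rows, which is why Athena can query the table while the data stays in S3.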
  3. Challenge

    Run SQL Queries and Detect Data Discrepancies
    • Launch Amazon Athena and run SQL queries to count null values in the age feature, count the observations whose salary exceeds $250,000, and inspect the format of the first-name and last-name features.
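The three checks above might look like the queries below. This is a sketch that runs them locally against an in-memory SQLite table as a stand-in for Athena. The table name `employee` and the column names `first_name`, `last_name`, `age`, and `salary` are assumptions, since the actual names depend on what the crawler infers from Employee_lab.csv, and the sample rows are made up for illustration.

```python
import sqlite3

# One query per discrepancy check; each returns a single count.
CHECKS = {
    "null_ages": "SELECT COUNT(*) FROM employee WHERE age IS NULL",
    "salary_over_limit": "SELECT COUNT(*) FROM employee WHERE salary > 250000",
    "untidy_names": (
        "SELECT COUNT(*) FROM employee "
        "WHERE first_name <> lower(trim(first_name)) "
        "OR last_name <> lower(trim(last_name))"
    ),
}


def run_checks(conn):
    """Run every check and return a {check_name: count} dictionary."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


# Illustrative rows only; the real data lives in S3 and is queried through Athena.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (first_name TEXT, last_name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [
        ("alice", "smith", 34, 90000),
        (" Bob ", "JONES", None, 120000),
        ("carol", "lee", 51, 300000),
    ],
)
results = run_checks(conn)
print(results)
```

The queries use only `COUNT`, `IS NULL`, `lower`, and `trim`, which Athena also supports, so the same statements should carry over once the table and column names are adjusted.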
  4. Challenge

    Fix the Data Discrepancies
    • Use AWS Glue Visual ETL to configure the input S3 folder as the source of the raw data, change the schema, assign proper data types, fill missing values in the age feature, and filter out rows whose salary is greater than $250,000.
    • Run a SQL query to convert the first and last names to lowercase and remove the blank spaces from those fields.
    • Finally, write the formatted data to the output folder of the S3 bucket.
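The lab builds these transformations in the Glue Visual ETL editor. To make the intended logic concrete, here is a plain-Python sketch of the same rules applied to CSV rows. The column names and the `default_age` fill value are assumptions, as the lab text does not say which value replaces a missing age, and "remove the blank spaces" is read here as removing every space character.

```python
import csv
import io

SALARY_LIMIT = 250_000  # rows above this salary are dropped


def clean_rows(rows, default_age):
    """Apply the cleaning rules to an iterable of dict rows with string values."""
    cleaned = []
    for row in rows:
        salary = float(row["salary"])  # assign a numeric type
        if salary > SALARY_LIMIT:
            continue  # filter out rows over the limit
        age = row["age"].strip()
        cleaned.append(
            {
                # Lowercase the names and remove blank spaces.
                "first_name": row["first_name"].replace(" ", "").lower(),
                "last_name": row["last_name"].replace(" ", "").lower(),
                # Fill a missing age with the caller-supplied default.
                "age": int(age) if age else default_age,
                "salary": salary,
            }
        )
    return cleaned


# Made-up sample input standing in for the raw file.
raw = io.StringIO(
    "first_name,last_name,age,salary\n"
    " Bob ,JONES,,120000\n"
    "carol,lee,51,300000\n"
)
print(clean_rows(csv.DictReader(raw), default_age=0))
```

In the sample, the first row is kept with tidied names and a filled age, and the second is dropped because its salary exceeds the limit.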
About the author

Pluralsight Skills gives leaders confidence that they have the skills needed to execute their technology strategy. Technology teams can benchmark expertise across roles, speed up release cycles, and build reliable, secure products. With our expert content, skill assessments, and one-of-a-kind analytics, leaders can keep up with the pace of change, put the right people on the right projects, and boost productivity. It's the most effective path to developing tech skills at scale.