
Exploring Data with Quantitative Techniques Using R Hands-on Practice

In this lab, Exploring Data with Quantitative Techniques Using R, you will dive deep into data manipulation and analysis using the nycflights13 package, leveraging R functions such as mutate() and filter(). You'll generate summary statistics, create visual representations of data distributions with ggplot2, and perform correlation analysis and logistic regression to uncover relationships within the dataset. This hands-on experience will sharpen your skills in understanding and analyzing complex datasets, preparing you for advanced data exploration tasks.


Lab Info

Last updated: Jun 04, 2025

Duration: 1h 0m

Challenge

Exploring the NYC Flights Dataset
RStudio Guide

To get started, click on the 'workspace' folder in the bottom right pane of RStudio, then click on the file titled "Step 1...". You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. For each task, run its cells with the play button at the top right of each cell before moving on to the next task. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.

Exploring the NYC Flights Dataset

To review the concepts covered in this step, please refer to the Understanding Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.

Understanding the dataset and managing data is important because it lays the foundation for all subsequent analyses.

Dive into the NYC flights dataset using the nycflights13 package. Use the mutate() function to create a new column and filter() to subset the data. This step will help you get comfortable with manipulating and understanding the structure of your dataset.

Task 1.1: Load the NYC Flights Dataset

Start by loading the nycflights13 package, which comes preinstalled in this environment. Then view the first 5 rows of the flights dataset. This dataset contains information about all flights that departed from NYC in 2013.

🔍 Hint

Use the library() function to load the nycflights13 package. After that, you can access the flights dataset directly. Use head to display the first few rows.
🔑 Solution

library(nycflights13)
head(flights, n = 5)
Task 1.2: Create New Columns with mutate

Load the dplyr package. Add a new column named minute_in_day to the flights dataframe using the mutate() function. This new variable should combine hour and minute to represent the minute in the day that the departure was scheduled (out of a max of 1440 minutes in a day).

🔍 Hint

Use the mutate() function from the dplyr package to create a new column. The new variable can be calculated by multiplying hour by 60 and adding minute.
🔑 Solution

library(dplyr)
flights <- flights %>%
  mutate(minute_in_day = (hour * 60) + minute)
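
As a quick check of the arithmetic, the sketch below applies the same mutate() call to a small made-up data frame. The name toy_flights and its three rows are assumptions for illustration only; they are not part of the lab's data.

```r
library(dplyr)

# Hypothetical toy data: scheduled departures at 05:15, 13:00, and 23:59
toy_flights <- data.frame(hour = c(5, 13, 23), minute = c(15, 0, 59))

# Same formula as the solution: whole hours converted to minutes, plus minutes
toy_flights <- toy_flights %>%
  mutate(minute_in_day = (hour * 60) + minute)

toy_flights$minute_in_day
# 315 780 1439
```

The latest possible departure, 23:59, maps to 1439, so every value stays below the 1440 minutes in a day.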
Task 1.3: Subset Data for a Specific Carrier

Now, focus on flights operated by a specific carrier. Use the filter() function to subset the flights dataframe, selecting only the flights operated by the carrier 'AA' (American Airlines). Save this subsetted data frame to a new variable called flights_aa.

🔍 Hint

Use the filter() function from the dplyr package to select rows where the carrier column equals 'AA'.
🔑 Solution

# Subset the dataframe for flights operated by 'AA'
flights_aa <- flights %>%
  filter(carrier == 'AA')
Challenge

Sampling Techniques in R

To review the concepts covered in this step, please refer to the Sampling a Dataset for Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.

Sampling is important because it allows for the analysis of large datasets without the need for processing the entire dataset. Learning to use random sampling approaches and making code reproducible are essential skills for data scientists.

Implement sampling techniques using both base R and the dplyr package. Start by taking a simple random sample of the flights dataset using the sample() function. Then, explore stratified sampling with dplyr's sample_frac() function, drawing the same fraction of rows from every category so that each one is represented in proportion to its size. This exercise will enhance your ability to work with large datasets efficiently and reproducibly.

Task 2.1: Loading the Flights Dataset

Before we can sample the dataset, we first need to load it into our R environment. Load the nycflights13 package so that you can access the flights dataset.

🔍 Hint

Use the library() function to load the nycflights13 package. This will enable you to access the flights dataset.
🔑 Solution

library(nycflights13)
Task 2.2: Simple Random Sampling

Now that we have the flights dataset available, let's perform a simple random sample. Use the sample() function to select 1000 random row indices from the flights dataset. Then subset the data and save it to a new data frame called sampled_flights. Make sure to set a seed so the sample is reproducible.

🔍 Hint

First, use set.seed(number) to ensure reproducibility. Then, use the sample() function with the first argument being 1:nrow(flights), and the second argument being the desired sample size of 1000. This creates a vector of sampled row indices that you can use inside brackets [] to subset the data frame.
🔑 Solution

set.seed(123)
myrows <- sample(1:nrow(flights), 1000)
sampled_flights <- flights[myrows, ]
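
To illustrate why the seed matters, here is a minimal base R sketch on a made-up data frame. The names toy, first_rows, second_rows, and toy_sample are hypothetical; the idea is only that resetting the same seed replays the same draw.

```r
# Hypothetical toy data frame standing in for a larger dataset
toy <- data.frame(id = 1:100, value = (1:100) * 2)

set.seed(123)
first_rows <- sample(1:nrow(toy), 10)

set.seed(123)  # resetting the seed replays the same random draw
second_rows <- sample(1:nrow(toy), 10)

identical(first_rows, second_rows)  # TRUE: same seed, same indices

# sample() draws without replacement by default, so no row is picked twice
toy_sample <- toy[first_rows, ]
anyDuplicated(toy_sample$id)  # 0 means no duplicated rows
```

Without the second set.seed() call, the next sample() would continue the random number stream and would most likely return different indices.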
Task 2.3: Stratified Sampling with dplyr

To ensure our sample represents all the airlines, let's perform stratified sampling. Use dplyr to sample 10% of rows from each airline carrier in the flights dataset. Save the result to a new variable called stratified_sample.

🔍 Hint

Load the dplyr package with library(dplyr). Then, use group_by followed by sample_frac to perform the stratified sampling.
🔑 Solution

library(dplyr)
stratified_sample <- flights %>%
  group_by(carrier) %>%
  sample_frac(0.10)
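
The following sketch checks what stratified sampling does to group sizes, using a made-up data frame (toy_flights, with hypothetical carriers of 100, 50, and 20 rows). The set.seed() call is an addition for reproducibility and is not part of the lab's solution. Note also that current dplyr documentation marks sample_frac() as superseded by slice_sample(prop = ...); the lab's function is kept here.

```r
library(dplyr)

# Hypothetical toy data: three carriers with 100, 50, and 20 flights
toy_flights <- data.frame(
  carrier = rep(c('AA', 'B6', 'UA'), times = c(100, 50, 20)),
  flight_id = 1:170
)

set.seed(123)  # assumption: added so the stratified draw is reproducible

# Draw 10% of the rows within each carrier
toy_sample <- toy_flights %>%
  group_by(carrier) %>%
  sample_frac(0.10) %>%
  ungroup()

# Each carrier keeps 10% of its rows: 10, 5, and 2
toy_counts <- count(toy_sample, carrier)
toy_counts
```

Because the fraction is applied per group, small carriers still appear in the sample instead of being left out by chance.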
Challenge

Summarizing and Visualizing Data

To review the concepts covered in this step, please refer to the Summarizing Data to Get an Understanding of New Data module of the Exploring Data with Quantitative Techniques Using R course.

Summarizing data and visualizing distributions are crucial for uncovering the underlying patterns and anomalies in the dataset. These techniques are foundational for understanding new datasets.

Use R to generate summary statistics and visualize data distributions from the flights dataset in the nycflights13 package. Start by creating group-based counts and displaying them in a bar chart. Then, generate histograms for multiple numeric variables to understand their distributions. Finally, create a box plot to identify outliers in the data. Utilize the ggplot2 package for these visualization tasks. This step will help you comprehend the overall structure and characteristics of the dataset.

Task 3.1: Loading the Required Libraries

Load the dplyr library for data frame manipulations, the ggplot2 library for creating visualizations, and the nycflights13 library to access the flights dataset.

🔍 Hint

Use the library() function and pass the name of the package as a string.
🔑 Solution

library('dplyr')
library('ggplot2')
library('nycflights13')
Task 3.2: Creating Group-Based Counts

Create a dataframe called group_counts that counts the number of flights for each carrier in the flights dataset. This dataframe will be used to visualize the data in a bar chart.

🔍 Hint
Utilize the count() function from the dplyr package, specifying the dataset (flights) and the column to group by.
🔑 Solution

group_counts <- count(flights, carrier)
group_counts
Task 3.3: Visualizing Group-Based Counts with a Bar Chart

Using the group_counts variable you created in the previous task, visualize the group-based counts with a bar chart using ggplot2.

🔍 Hint

Start with ggplot() and specify the dataframe and aesthetics (aes) with the grouping variable as x and the count n as y. Then, add geom_bar() with stat = 'identity' to create the bar chart.
🔑 Solution

ggplot(group_counts, aes(x = carrier, y = n)) + geom_bar(stat = 'identity')
Task 3.4: Generating Histograms for Numeric Variables

Using ggplot2, plot histograms for the numeric variables dep_delay and air_time in your dataset to understand their distributions. This will help you visualize the frequency of data points across different value ranges.

🔍 Hint

For each variable, use ggplot() specifying the dataset and aesthetics with the variable as x. Then, add geom_histogram().
🔑 Solution

ggplot(flights, aes(x = dep_delay)) +
  geom_histogram()
ggplot(flights, aes(x = air_time)) +
  geom_histogram()
Task 3.5: Creating a Box Plot to Identify Outliers

With ggplot2, identify potential outliers in air_time of the flights dataset using a box plot, which is effective for visualizing the distribution of data points and spotting outliers.

🔍 Hint
Begin with ggplot() and set the dataset and aesthetics with the numeric variable as y. Then, add geom_boxplot() to generate the box plot.
🔑 Solution

ggplot(flights, aes(y = air_time)) + geom_boxplot()
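
By default, geom_boxplot() draws as separate points the values that lie more than 1.5 times the interquartile range beyond the edges of the box. The base R sketch below applies that rule by hand to a made-up vector (air_time_toy is hypothetical) to show which values would be flagged.

```r
# Hypothetical air times in minutes, with one unusually long flight
air_time_toy <- c(10, 11, 11, 12, 12, 13, 50)

# First and third quartiles mark the edges of the box
q1 <- quantile(air_time_toy, 0.25, names = FALSE)
q3 <- quantile(air_time_toy, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences sit 1.5 * IQR beyond the box edges
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the potential outliers
outliers <- air_time_toy[air_time_toy < lower_fence | air_time_toy > upper_fence]
outliers
# 50
```

Here the fences are 8.75 and 14.75, so only the 50-minute value is flagged. A flagged point is a candidate for closer inspection, not automatically an error.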
Challenge

Correlation Analysis and Logistic Regression

To review the concepts covered in this step, please refer to the Using Correlation Analysis module of the Exploring Data with Quantitative Techniques Using R course.

Understanding relationships between variables is important because it helps in identifying patterns and making predictions. Correlation analysis and logistic regression are powerful techniques for exploring these relationships.

Perform correlation analysis between numeric variables using the cor() function and visualize the relationship with a scatter plot including a linear regression line using ggplot2. Then, set up a logistic regression model for a binary outcome variable with the glm() function, interpreting the results to understand the impact of different variables on the outcome. This step will enhance your ability to uncover and understand relationships within your data.

Task 4.1: Performing Correlation Analysis

Load the nycflights13 library to access the flights dataset. Calculate the Spearman correlation coefficient between distance and arr_delay to assess their relationship. There are missing values in the dataset, so make sure to only analyze pairwise complete observations.

🔍 Hint

Use the cor() function with method = 'spearman' to calculate the correlation. The flights dataset is your dataframe, and you are correlating distance and arr_delay. Set use = 'pairwise.complete.obs' to ensure your correlation does not include missing values.
🔑 Solution

# Load the library to access data
library('nycflights13')
# Calculate the Spearman correlation coefficient
cor(flights$distance, flights$arr_delay, method = 'spearman', use = 'pairwise.complete.obs')
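
To see what the use argument changes, consider this small base R sketch with made-up vectors (x and y are hypothetical, each with one missing value). With the default handling, a single NA makes the whole result NA; with pairwise complete observations, only rows where both values are present are used.

```r
# Hypothetical vectors: y always increases with x, and each has one NA
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2, 4, 8, 16, 32, NA)

# Default behavior: any missing value yields NA
rho_default <- cor(x, y, method = 'spearman')
rho_default
# NA

# Only the four rows with both values present are used
rho_complete <- cor(x, y, method = 'spearman', use = 'pairwise.complete.obs')
rho_complete
# 1
```

The result is 1 even though y grows much faster than x: Spearman correlation works on ranks, so it measures whether the relationship is monotonic, not whether it is linear.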
Task 4.2: Visualizing the Relationship with Scatter Plot and Linear Regression Line

Load the ggplot2 library. Create a scatter plot to visualize the relationship between distance and arr_delay in the flights dataset. Add a linear regression line to the plot.

🔍 Hint

Use the ggplot() function with aes() to specify the x and y variables. Add geom_point() for the scatter plot and geom_smooth(method = 'lm') for the linear regression line.
🔑 Solution

# Load the necessary library
library(ggplot2)
# Create a scatter plot with a linear regression line
ggplot(flights, aes(x = distance, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = 'lm')
Task 4.3: Setting Up a Logistic Regression Model

Create a new column, bin_arr_delay (binary arrival delay), that has value 1 if arr_delay is greater than 5 and 0 otherwise. Use the glm() function to set up a logistic regression model predicting bin_arr_delay using air_time and distance as predictors. The model may take a moment to fit. Use summary to view the model output.

🔍 Hint

Use the ifelse function to create the bin_arr_delay variable. Use the glm() function with family = 'binomial' for logistic regression. The syntax for a regression formula is predicted ~ predictor1 + predictor2.
🔑 Solution

# Create a new column
flights$bin_arr_delay <- ifelse(flights$arr_delay > 5, 1, 0)
# Set up the logistic regression model
mymodel <- glm(bin_arr_delay ~ air_time + distance, family = 'binomial', data = flights)
# Print the summary of the model
summary(mymodel)
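
Two details are worth checking when reading the model output: ifelse() leaves a missing arr_delay as NA rather than turning it into 0, and glm() drops rows with missing values under R's default na.action setting. The sketch below shows both on simulated data; the data frame toy, its values, and the names toy_model and odds_ratios are assumptions for illustration and say nothing about the real flights results.

```r
set.seed(42)

# Hypothetical simulated data loosely shaped like the flights columns
toy <- data.frame(air_time = runif(200, min = 30, max = 300))
toy$distance <- 8 * toy$air_time + rnorm(200, sd = 100)
toy$arr_delay <- rnorm(200, mean = 5, sd = 30)
toy$arr_delay[1:5] <- NA  # simulate flights with no recorded arrival delay

# NA > 5 is NA, so missing delays stay missing in the binary column
toy$bin_arr_delay <- ifelse(toy$arr_delay > 5, 1, 0)
sum(is.na(toy$bin_arr_delay))
# 5

toy_model <- glm(bin_arr_delay ~ air_time + distance, family = 'binomial', data = toy)

# Rows with NA were dropped before fitting: 200 - 5 = 195
nobs(toy_model)
# 195

# Exponentiated coefficients are odds ratios: the multiplicative change
# in the odds of a delay for a one-unit increase in each predictor
odds_ratios <- exp(coef(toy_model))
odds_ratios
```

An odds ratio above 1 means the odds of a delay rise as the predictor increases, and a value below 1 means they fall. In this simulated data the delay is unrelated to the predictors, so the ratios should sit close to 1.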

