
Exploring Data with Quantitative Techniques Using R Hands-on Practice
In this lab, Exploring Data with Quantitative Techniques Using R, you will dive deep into data manipulation and analysis using the nycflights13 package, leveraging R functions such as mutate() and filter(). You'll generate summary statistics, create visual representations of data distributions with ggplot2, and perform correlation analysis and logistic regression to uncover relationships within the dataset. This hands-on experience will sharpen your skills in understanding and analyzing complex datasets, preparing you for advanced data exploration tasks.

Challenge
Exploring the NYC Flights Dataset
RStudio Guide
To get started, click on the 'workspace' folder in the bottom right pane of RStudio. Click on the file entitled "Step 1...". You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. Remember, you must run the cells for a task, using the play button at the top right of each cell, before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step. When you are ready to move on to the next step, come back and click on the file for that step, and repeat until you have completed all tasks in all steps of the lab.
Exploring the NYC Flights Dataset
To review the concepts covered in this step, please refer to the Understanding Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.
Understanding the dataset and managing data is important because it lays the foundation for all subsequent analyses.
Dive into the NYC flights dataset using the nycflights13 package. Use the mutate() function to create a new column and filter() to subset the data. This step will help you get comfortable with manipulating and understanding the structure of your dataset.
Task 1.1: Load the NYC Flights Dataset
Start by loading the nycflights13 package, which comes preinstalled in this environment. Then view the first 5 rows of the flights dataset. This dataset contains information about all flights that departed from NYC in 2013.
Hint
Use the library() function to load the nycflights13 package. After that, you can access the flights dataset directly. Use head() to display the first few rows.
Solution
library(nycflights13)
head(flights, n = 5)
Task 1.2: Create New Columns with mutate
Load the dplyr package. Add a new column named minute_in_day to the flights dataframe using the mutate() function. This new variable should combine hour and minute to represent the minute in the day that the departure was scheduled (out of a max of 1440 minutes in a day).
Hint
Use the mutate() function from the dplyr package to create a new column. The new variable can be calculated by multiplying hour by 60 and adding minute.
Solution
library(dplyr)
flights <- flights %>%
  mutate(minute_in_day = (hour * 60) + minute)
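The arithmetic behind minute_in_day can be checked on a tiny hand-made data frame before trusting it on the full dataset. The sketch below is illustrative only: toy_flights and its three rows are made up for this example and are not part of the lab.

```r
library(dplyr)

# Hypothetical toy data: three scheduled departure times (05:15, 00:00, 23:59).
toy_flights <- data.frame(hour = c(5, 0, 23), minute = c(15, 0, 59))

# Same formula as the lab solution: hours converted to minutes, plus minutes.
toy_flights <- toy_flights %>%
  mutate(minute_in_day = (hour * 60) + minute)

toy_flights$minute_in_day  # 315 0 1439

# For valid clock times, every value should fall in the range 0 to 1439.
all(toy_flights$minute_in_day >= 0 & toy_flights$minute_in_day < 1440)
```

A range check like the last line is a cheap way to catch a swapped column or a unit mistake after creating a derived variable.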
Task 1.3: Subset Data for a Specific Carrier
Now, focus on flights operated by a specific carrier. Use the filter() function to subset the flights dataframe, selecting only the flights operated by the carrier 'AA' (American Airlines). Save this subsetted data frame to a new variable called flights_aa.
Hint
Use the filter() function from the dplyr package to select rows where the carrier column equals 'AA'.
Solution
# Subset the dataframe for flights operated by 'AA'
flights_aa <- flights %>%
  filter(carrier == 'AA')
Challenge
Sampling Techniques in R
To review the concepts covered in this step, please refer to the Sampling a Dataset for Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.
Sampling is important because it allows for the analysis of large datasets without the need for processing the entire dataset. Learning to use random sampling approaches and making code reproducible are essential skills for data scientists.
Implement various sampling techniques using both base R and the dplyr package. Start by taking a simple random sample of the flights dataset using the sample() function. Then, explore stratified sampling with dplyr's sample_frac() function, ensuring your samples meet a specified representation by category. This exercise will enhance your ability to work with large datasets efficiently and reproducibly.
Task 2.1: Loading the Flights Dataset
Before we can sample the dataset, we first need to load it into our R environment. Load the nycflights13 package so that you can access the flights dataset.
Hint
Use the library() function to load the nycflights13 package. This will enable you to access the flights dataset.
Solution
library(nycflights13)
Task 2.2: Simple Random Sampling
Now that we have the flights dataset available, let's perform a simple random sample. Use the sample() function to select 1000 random row indices from the flights dataset. Then subset the data and save it to a new data frame called sampled_flights. Make sure to set a seed so the sample is reproducible.
Hint
First, use set.seed(number) to ensure reproducibility. Then, use the sample() function with the first argument being 1:nrow(flights) and the second argument being the desired sample size of 1000. This creates a vector of sampled row indices that you can use to subset the data frame within brackets [].
Solution
set.seed(123)
myrows <- sample(1:nrow(flights), 1000)
sampled_flights <- flights[myrows, ]
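To see why the seed matters, it can help to draw the same sample twice. The following is a minimal base R sketch on a made-up index range of 1000 rows rather than the flights data; the variable names are illustrative.

```r
# Resetting the seed before each draw reproduces the same "random" indices.
set.seed(123)
first_draw <- sample(seq_len(1000), 5)

set.seed(123)
second_draw <- sample(seq_len(1000), 5)

identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator continues from where it left off,
# so a further draw will generally differ from the first.
third_draw <- sample(seq_len(1000), 5)

# sample() draws without replacement by default, so no index repeats.
anyDuplicated(first_draw)  # 0
```

seq_len(n) is used here in place of 1:n; the two are equivalent for positive n, but seq_len() also behaves sensibly when n is 0.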
Task 2.3: Stratified Sampling with dplyr
To ensure our sample represents all the airlines, let's perform stratified sampling. Use dplyr to sample 10% of rows from each airline carrier in the flights dataset. Save the result to a new variable called stratified_sample.
Hint
Load the dplyr package with library(dplyr). Then, use group_by() followed by sample_frac() to perform the stratified sampling.
Solution
library(dplyr)
stratified_sample <- flights %>%
  group_by(carrier) %>%
  sample_frac(0.10)
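One way to confirm that a stratified sample kept each group's share is to count rows per group before and after sampling. The sketch below uses a hypothetical toy data frame (toy, with made-up carrier sizes of 100, 50, and 20 rows) rather than the flights data.

```r
library(dplyr)

set.seed(42)  # make the draw reproducible

# Hypothetical toy data with unequal group sizes.
toy <- data.frame(
  carrier   = rep(c("AA", "UA", "DL"), times = c(100, 50, 20)),
  flight_id = seq_len(170)
)

# Sample 10% of the rows within each carrier, then drop the grouping.
toy_sample <- toy %>%
  group_by(carrier) %>%
  sample_frac(0.10) %>%
  ungroup()

# Each carrier contributes 10% of its own rows: AA 10, DL 2, UA 5.
count(toy_sample, carrier)
```

Note that the dplyr documentation marks sample_frac() as superseded by slice_sample(prop = ...); the lab's sample_frac() call still works, and the newer function is an alternative if you prefer the current API.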
Challenge
Summarizing and Visualizing Data
To review the concepts covered in this step, please refer to the Summarizing Data to Get an Understanding of New Data module of the Exploring Data with Quantitative Techniques Using R course.
Summarizing data and visualizing distributions are crucial for uncovering the underlying patterns and anomalies in the dataset. These techniques are foundational for understanding new datasets.
Use R to generate summary statistics and visualize data distributions from the flights dataset in the nycflights13 package. Start by creating group-based counts and displaying them in a bar chart. Then, generate histograms for multiple numeric variables to understand their distributions. Finally, create a box plot to identify outliers in the data. Utilize the ggplot2 package for these visualization tasks. This step will help you comprehend the overall structure and characteristics of the dataset.
Task 3.1: Loading the Required Libraries
Load the dplyr library for data frame manipulations, the ggplot2 library for creating visualizations, and the nycflights13 library to access the flights dataset.
Hint
Use the library() function and pass the name of the package as a string.
Solution
library('dplyr')
library('ggplot2')
library('nycflights13')
Task 3.2: Creating Group-Based Counts
Using the carrier column of the flights dataset, create a dataframe called group_counts that counts the occurrences of each carrier. This dataframe will be used to visualize the data in a bar chart.
Hint
Utilize the count() function from the dplyr package, specifying the dataset (flights) and the column to group by.
Solution
group_counts <- count(flights, carrier)
group_counts
Task 3.3: Visualizing Group-Based Counts with a Bar Chart
Using the group_counts variable you created in the previous task, visualize the group-based counts with a bar chart using ggplot2.
Hint
Start with ggplot() and specify the dataframe and aesthetics (aes) with the grouping variable as x and the count n as y. Then, add geom_bar() with stat = 'identity' to create the bar chart.
Solution
ggplot(group_counts, aes(x = carrier, y = n)) +
  geom_bar(stat = 'identity')
Task 3.4: Generating Histograms for Numeric Variables
Using ggplot2, plot histograms for the numeric variables dep_delay and air_time in your dataset to understand their distributions. This will help you visualize the frequency of data points across different value ranges.
Hint
For each variable, use ggplot() specifying the dataset and aesthetics with the variable as x. Then, add geom_histogram().
Solution
ggplot(flights, aes(x = dep_delay)) +
  geom_histogram()
ggplot(flights, aes(x = air_time)) +
  geom_histogram()
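Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you choose a bin width, and rows with missing values trigger a warning. The sketch below shows one way to set both explicitly. It uses simulated toy delays, not the flights data, and the bin width of 15 is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(1)

# Hypothetical toy delays in minutes: 200 skewed values plus two missing ones.
toy_delays <- data.frame(
  dep_delay = c(rexp(200, rate = 1 / 20) - 5, NA, NA)
)

# binwidth fixes the width of each bar in data units (15 minutes here);
# na.rm = TRUE drops the missing values without a warning.
p <- ggplot(toy_delays, aes(x = dep_delay)) +
  geom_histogram(binwidth = 15, na.rm = TRUE)

# Typing p (or print(p)) in RStudio renders the plot.
```

Trying a few bin widths is usually worthwhile, since the apparent shape of a skewed variable such as a delay can change noticeably with the choice.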
Task 3.5: Creating a Box Plot to Identify Outliers
With ggplot2, identify potential outliers in air_time of the flights dataset using a box plot, which is effective for visualizing the distribution of data points and spotting outliers.
Hint
Begin with ggplot() and set the dataset and aesthetics with the numeric variable as y. Then, add geom_boxplot() to generate the box plot.
Solution
ggplot(flights, aes(y = air_time)) +
  geom_boxplot()
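The points a box plot draws individually are, by the usual convention, values lying more than 1.5 times the interquartile range beyond the quartiles. A minimal base R sketch of that rule follows, on a made-up vector (toy_air_time) rather than the real air_time column; treat it as an approximation of what the plot shows, not as ggplot2's exact internals.

```r
# Hypothetical toy air times in minutes, with one clearly unusual value.
toy_air_time <- c(10, 12, 13, 14, 15, 16, 18, 95)

# First and third quartiles, ignoring missing values.
q <- quantile(toy_air_time, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR below the first and above the third quartile.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers <- toy_air_time[toy_air_time < lower_fence | toy_air_time > upper_fence]
outliers  # 95
```

Computing the fences directly is handy when you want to list or filter the flagged rows rather than only see them on a plot.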
Challenge
Correlation Analysis and Logistic Regression
To review the concepts covered in this step, please refer to the Using Correlation Analysis module of the Exploring Data with Quantitative Techniques Using R course.
Understanding relationships between variables is important because it helps in identifying patterns and making predictions. Correlation analysis and logistic regression are powerful techniques for exploring these relationships.
Perform correlation analysis between numeric variables using the cor() function and visualize the relationship with a scatter plot including a linear regression line using ggplot2. Then, set up a logistic regression model for a binary outcome variable with the glm() function, interpreting the results to understand the impact of different variables on the outcome. This step will enhance your ability to uncover and understand relationships within your data.
Task 4.1: Performing Correlation Analysis
Load the nycflights13 library to access the flights dataset. Calculate the Spearman correlation coefficient between distance and arr_delay to assess their relationship. There are missing values in the dataset, so make sure to only analyze pairwise complete observations.
Hint
Use the cor() function with method = 'spearman' to calculate the correlation. The flights dataset is your dataframe, and you are correlating distance and arr_delay. Set use = 'pairwise.complete.obs' to ensure your correlation does not include missing values.
Solution
# Load the library to access data
library('nycflights13')
# Calculate the Spearman correlation coefficient
cor(flights$distance, flights$arr_delay, method = 'spearman', use = 'pairwise.complete.obs')
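The effect of the use and method arguments is easy to see on a handful of numbers. This base R sketch uses two short made-up vectors (toy_distance and toy_delay), not the flights columns.

```r
# Hypothetical toy data: delay rises with distance, with one missing value
# and one extreme value at the end.
toy_distance <- c(1, 2, 3, 4, 5)
toy_delay    <- c(2, NA, 6, 8, 50)

# With the default use = "everything", a single NA makes the result NA.
cor(toy_distance, toy_delay)  # NA

# Dropping the incomplete pair gives a usable coefficient.
spearman <- cor(toy_distance, toy_delay,
                method = "spearman", use = "pairwise.complete.obs")
pearson  <- cor(toy_distance, toy_delay,
                method = "pearson", use = "pairwise.complete.obs")

spearman  # 1, because the ranks increase together perfectly
pearson   # below 1, because the relationship is not a straight line
```

Spearman works on ranks, so it measures whether the relationship is monotonic and is less sensitive to extreme values than Pearson, which is one reason it suits skewed variables such as delays.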
Task 4.2: Visualizing the Relationship with Scatter Plot and Linear Regression Line
Load the ggplot2 library. Create a scatter plot to visualize the relationship between distance and arr_delay in the flights dataset. Add a linear regression line to the plot.
Hint
Use the ggplot() function with aes() to specify the x and y variables. Add geom_point() for the scatter plot and geom_smooth(method = 'lm') for the linear regression line.
Solution
# Load the necessary library
library(ggplot2)
# Create a scatter plot with a linear regression line
ggplot(flights, aes(x = distance, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = 'lm')
Task 4.3: Setting Up a Logistic Regression Model
Create a new column, bin_arr_delay (binary arrival delay), that has value 1 if arr_delay is greater than 5 and 0 otherwise. Use the glm() function to set up a logistic regression model predicting bin_arr_delay using air_time and distance as predictors. The model may take a moment to fit. Use summary() to view the model output.
Hint
Use the ifelse() function to create the bin_arr_delay variable. Use the glm() function with family = 'binomial' for logistic regression. The syntax for a regression formula is predicted ~ predictor1 + predictor2.
Solution
# Create a new column
flights$bin_arr_delay <- ifelse(flights$arr_delay > 5, 1, 0)
# Set up the logistic regression model
mymodel <- glm(bin_arr_delay ~ air_time + distance, family = 'binomial', data = flights)
# Print the summary of the model
summary(mymodel)
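Two details are worth checking when reading a model like this: how missing outcomes are handled, and how to put the coefficients on an interpretable scale. The sketch below illustrates both on simulated toy data. The names toy, x1, x2, and late are hypothetical stand-ins, and the simulated effect sizes are arbitrary.

```r
# Simulated toy data: x1 and x2 stand in for the lab's two predictors.
set.seed(2013)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$late <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * toy$x1))
toy$late[1:10] <- NA  # mimic rows with a missing arrival delay

# A comparison against NA yields NA, so ifelse() passes the gap through.
ifelse(c(10, NA, 0) > 5, 1, 0)  # 1 NA 0

# With R's default na.action, glm() drops rows that contain missing values.
fit <- glm(late ~ x1 + x2, family = "binomial", data = toy)
nobs(fit)  # 490: the 10 incomplete rows were not used

# Coefficients are on the log-odds scale; exponentiating gives odds ratios.
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio above 1 means the odds of the outcome rise as that predictor increases by one unit, holding the other predictor fixed; a value below 1 means they fall.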