Exploring Data with Quantitative Techniques Using R Hands-on Practice

In this lab, Exploring Data with Quantitative Techniques Using R, you will dive deep into data manipulation and analysis using the nycflights13 package, leveraging R functions such as mutate() and filter(). You'll generate summary statistics, create visual representations of data distributions with ggplot2, and perform correlation analysis and logistic regression to uncover relationships within the dataset. This hands-on experience will sharpen your skills in understanding and analyzing complex datasets, preparing you for advanced data exploration tasks.

Path Info

Duration: 1h 0m
Published: Apr 05, 2024

Table of Contents

  1. Challenge

    Exploring the NYC Flights Dataset

    RStudio Guide

    To get started, click the 'workspace' folder in the bottom-right pane of RStudio, then click the file titled "Step 1...". You may want to shrink the console pane so that you have more room to work. Complete each task for Step 1 in that R Markdown file. For each task, run its cells with the play button at the top right of each cell before moving on to the next task. When you have completed all tasks in this step, come back and open the file for the next step, and continue until you have completed all tasks in all steps of the lab.


    Exploring the NYC Flights Dataset

    To review the concepts covered in this step, please refer to the Understanding Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.

    Understanding the dataset and managing data is important because it lays the foundation for all subsequent analyses.

    Dive into the NYC flights dataset using the nycflights13 package. Use the mutate() function to create a new column and filter() to subset the data. This step will help you get comfortable with manipulating and understanding the structure of your dataset.


    Task 1.1: Load the NYC Flights Dataset

    Start by loading the nycflights13 package, which comes preinstalled in this environment. Then view the first 5 rows of the flights dataset. This dataset contains information about all flights that departed from NYC in 2013.

    πŸ” Hint

    Use the library() function to load the nycflights13 package. After that, you can access the flights dataset directly. Use head to display the first few rows.

    πŸ”‘ Solution
    library(nycflights13)
    head(flights, n=5)
    

    Task 1.2: Create New Columns with mutate

    Load the dplyr package. Add a new column named minute_in_day to the flights dataframe using the mutate() function. This new variable should combine hour and minute to represent the minute in the day that the departure was scheduled (out of a max of 1440 minutes in a day).

    πŸ” Hint

    Use the mutate() function from the dplyr package to create a new column. The new variable can be calculated by multiplying hour by 60 and adding to minute.

    πŸ”‘ Solution
    library(dplyr)
    flights <- flights %>% 
      mutate(minute_in_day = (hour * 60) + minute)
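
The arithmetic behind minute_in_day is easier to check on a few rows than on the full dataset. The following is a minimal sketch that assumes dplyr is installed; toy_flights and its values are hypothetical and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: three scheduled departure times split into hour and minute
toy_flights <- data.frame(
  hour   = c(5, 13, 23),
  minute = c(15, 0, 59)
)

# Same expression as the task solution, applied to the toy rows
toy_flights <- toy_flights %>%
  mutate(minute_in_day = (hour * 60) + minute)

toy_flights$minute_in_day
# 5:15 -> 315, 13:00 -> 780, 23:59 -> 1439
```

The largest value here, 1439, corresponds to 23:59, the last minute before the day rolls over.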
    

    Task 1.3: Subset Data for a Specific Carrier

    Now, focus on flights operated by a specific carrier. Use the filter() function to subset the flights dataframe, selecting only the flights operated by the carrier 'AA' (American Airlines). Save this subsetted data frame to a new variable called flights_aa.

    πŸ” Hint

    Use the filter() function from the dplyr package to select rows where the carrier column equals 'AA'.

    πŸ”‘ Solution
    # Subset the dataframe for flights operated by 'AA'
    flights_aa <- flights %>% 
      filter(carrier == 'AA')
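
As a small illustration of how filter() behaves, the sketch below uses a hypothetical toy data frame (not the real flights data) and assumes dplyr is installed. It shows the single-condition filter from the task, a multi-value filter with %in%, and the fact that filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical toy data with one missing carrier code
toy_flights <- data.frame(
  carrier   = c("AA", "UA", "AA", NA, "DL"),
  dep_delay = c(3, -2, 45, 10, 0)
)

# Single condition, as in the task; the NA row is dropped because NA == "AA" is NA
toy_aa <- toy_flights %>%
  filter(carrier == "AA")

# Several carriers at once with %in%
toy_aa_dl <- toy_flights %>%
  filter(carrier %in% c("AA", "DL"))

nrow(toy_aa)     # 2
nrow(toy_aa_dl)  # 3
```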
    
  2. Challenge

    Sampling Techniques in R

    To review the concepts covered in this step, please refer to the Sampling a Dataset for Data Exploration module of the Exploring Data with Quantitative Techniques Using R course.

    Sampling is important because it allows for the analysis of large datasets without the need for processing the entire dataset. Learning to use random sampling approaches and making code reproducible are essential skills for data scientists.

    Implement various sampling techniques using both base R and the dplyr package. Start by taking a simple random sample of the flights dataset using the sample() function. Then, explore stratified sampling with dplyr's sample_frac() function, ensuring your samples meet a specified representation by category. This exercise will enhance your ability to work with large datasets efficiently and reproducibly.


    Task 2.1: Loading the Flights Dataset

    Before we can sample the dataset, we first need to load it into our R environment. Load the nycflights13 package so that you can access the flights dataset.

    πŸ” Hint

    Use the library() function to load the nycflights13 package. This will enable you to access the flights dataset.

    πŸ”‘ Solution
    library(nycflights13)
    

    Task 2.2: Simple Random Sampling

    Now that we have the flights dataset available, let's perform a simple random sample. Use the sample() function to select 1000 random row indices from the flights dataset. Then subset the data and save it to a new data frame called sampled_flights. Make sure to set a seed so the sample is reproducible.

    πŸ” Hint

    First, use set.seed(number) to ensure reproducibility. Then, use the sample() function with the first argument being 1:nrow(flights), and the second argument being the desired sample size of 1000. This will create an sampled index that you can use to subset within brackets [].

    πŸ”‘ Solution
    set.seed(123)
    myrows <- sample(1:nrow(flights), 1000)
    sampled_flights <- flights[myrows, ]
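
To see what set.seed() actually buys you, the following base R sketch draws the same sample twice. The value n_rows is a hypothetical stand-in for nrow(flights), and the seed 123 is arbitrary.

```r
# Hypothetical stand-in for the number of rows in a larger data frame
n_rows <- 5000

set.seed(123)
first_draw <- sample(seq_len(n_rows), 10)

set.seed(123)  # resetting the seed replays the same random stream
second_draw <- sample(seq_len(n_rows), 10)

# Without resetting the seed, the next call continues the stream instead
third_draw <- sample(seq_len(n_rows), 10)

identical(first_draw, second_draw)  # TRUE
```

Because sample() draws without replacement by default, each draw contains 10 distinct row indices.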
    

    Task 2.3: Stratified Sampling with dplyr

    To ensure our sample represents all the airlines, let's perform stratified sampling. Use dplyr to sample 10% of rows from each airline carrier in the flights dataset. Save the result to a new variable called stratified_sample.

    πŸ” Hint

    Load the dplyr package with library(dplyr). Then, use group_by followed by sample_frac to perform the stratified sampling.

    πŸ”‘ Solution
    library(dplyr)
    stratified_sample <- flights %>% 
      group_by(carrier) %>% 
      sample_frac(0.10)
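
In dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(), which covers the same use case. The sketch below is one possible variant rather than the lab's solution: it assumes a recent dplyr, uses a hypothetical toy data frame, and adds set.seed() and ungroup(), neither of which the task requires.

```r
library(dplyr)

# Hypothetical toy data: three carriers with 100, 50, and 20 flights
toy_flights <- data.frame(
  carrier   = rep(c("AA", "UA", "DL"), times = c(100, 50, 20)),
  flight_id = seq_len(170)
)

set.seed(123)  # fix the seed so the stratified sample is reproducible
toy_stratified <- toy_flights %>%
  group_by(carrier) %>%
  slice_sample(prop = 0.10) %>%  # 10% of the rows within each carrier
  ungroup()                      # return a plain, ungrouped data frame

count(toy_stratified, carrier)
# AA: 10 rows, DL: 2 rows, UA: 5 rows
```

One difference worth knowing: slice_sample() rounds the per-group size down, so very small groups can contribute zero rows.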
    
  3. Challenge

    Summarizing and Visualizing Data

    To review the concepts covered in this step, please refer to the Summarizing Data to Get an Understanding of New Data module of the Exploring Data with Quantitative Techniques Using R course.

    Summarizing data and visualizing distributions are crucial for uncovering the underlying patterns and anomalies in the dataset. These techniques are foundational for understanding new datasets.

    Use R to generate summary statistics and visualize data distributions from the flights dataset in the nycflights13 package. Start by creating group-based counts and displaying them in a bar chart. Then, generate histograms for multiple numeric variables to understand their distributions. Finally, create a box plot to identify outliers in the data. Utilize the ggplot2 package for these visualization tasks. This step will help you comprehend the overall structure and characteristics of the dataset.


    Task 3.1: Loading the Required Libraries

    Load the dplyr library for data frame manipulations, the ggplot2 library for creating visualizations, and the nycflights13 library to access the flights dataset.

    πŸ” Hint

    Use the library() function and pass the name of the package as a string.

    πŸ”‘ Solution
    library('dplyr')
    library('ggplot2')
    library('nycflights13')
    

    Task 3.2: Creating Group-Based Counts

    Create a dataframe called group_counts that counts the number of flights for each carrier in the flights dataset. You will use this dataframe to draw a bar chart in the next task.

    πŸ” Hint Utilize the count() function from the dplyr package, specifying the dataset (flights) and the column to group by.
    πŸ”‘ Solution
    group_counts <- count(flights, carrier)
    group_counts
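
Group counts are often easier to read when sorted and expressed as shares. The following sketch assumes dplyr is installed and uses a hypothetical six-row data frame; the share column is an illustrative addition and not part of the task.

```r
library(dplyr)

# Hypothetical toy data: six flights across three carriers
toy_flights <- data.frame(
  carrier = c("AA", "UA", "AA", "DL", "UA", "AA")
)

toy_counts <- toy_flights %>%
  count(carrier, sort = TRUE) %>%  # largest group first
  mutate(share = n / sum(n))       # fraction of all rows per carrier

toy_counts
# AA has 3 rows, UA has 2, DL has 1
```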
    

    Task 3.3: Visualizing Group-Based Counts with a Bar Chart

    Using the group_counts variable you created in the previous task, visualize the group-based counts with a bar chart using ggplot2.

    πŸ” Hint

    Start with ggplot() and specify the dataframe and aesthetics (aes) with the grouping variable as x and the count n as y. Then, add geom_bar() with stat = 'identity' to create the bar chart.

    πŸ”‘ Solution
    ggplot(group_counts, aes(x = carrier, y = n)) +
      geom_bar(stat = 'identity')
    

    Task 3.4: Generating Histograms for Numeric Variables

    Using ggplot2, plot histograms for the numeric variables dep_delay and air_time in your dataset to understand their distributions. This will help you visualize the frequency of data points across different value ranges.

    πŸ” Hint

    For each variable, use ggplot() specifying the dataset and aesthetics with the variable as x. Then, add geom_histogram().

    πŸ”‘ Solution
    ggplot(flights, aes(x = dep_delay)) +
      geom_histogram()
    ggplot(flights, aes(x = air_time)) +
      geom_histogram()
    

    Task 3.5: Creating a Box Plot to Identify Outliers

    With ggplot2, identify potential outliers in air_time of the flights dataset using a box plot, which is effective for visualizing the distribution of data points and spotting outliers.

    πŸ” Hint Begin with ggplot() and set the dataset and aesthetics with the numeric variable as y. Then, add geom_boxplot() to generate the box plot.
    πŸ”‘ Solution
    ggplot(flights, aes(y = air_time)) +
      geom_boxplot()
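
The points a box plot draws beyond its whiskers come from a simple rule: anything more than 1.5 times the interquartile range outside the box is flagged. The base R sketch below applies that rule to hypothetical air times. It is an approximation of what the plot shows, since geom_boxplot() uses the 25th and 75th percentiles as hinges, while boxplot.stats() computes its hinges slightly differently.

```r
# Hypothetical air times in minutes, with two unusually long flights and one NA
air_time <- c(95, 100, 105, 110, 115, 120, 125, 130, 400, 650, NA)

# Quartile-based fences: 1.5 * IQR below Q1 and above Q3
q <- quantile(air_time, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(air_time) &
  (air_time < q[1] - fence | air_time > q[2] + fence)
outliers <- air_time[is_outlier]

# Base R helper that applies the same idea using Tukey's hinges
tukey_outliers <- boxplot.stats(air_time)$out

outliers        # 400 650
tukey_outliers  # 400 650
```

On small samples the two approaches can disagree about borderline points; here they agree.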
    
  4. Challenge

    Correlation Analysis and Logistic Regression

    To review the concepts covered in this step, please refer to the Using Correlation Analysis module of the Exploring Data with Quantitative Techniques Using R course.

    Understanding relationships between variables is important because it helps in identifying patterns and making predictions. Correlation analysis and logistic regression are powerful techniques for exploring these relationships.

    Perform correlation analysis between numeric variables using the cor() function and visualize the relationship with a scatter plot including a linear regression line using ggplot2. Then, set up a logistic regression model for a binary outcome variable with the glm() function, interpreting the results to understand the impact of different variables on the outcome. This step will enhance your ability to uncover and understand relationships within your data.


    Task 4.1: Performing Correlation Analysis

    Load the nycflights13 library to access the flights dataset. Calculate the Spearman correlation coefficient between distance and arr_delay to assess their relationship. There are missing values in the dataset, so make sure to only analyze pairwise complete observations.

    πŸ” Hint

    Use the cor() function with method = 'spearman' to calculate the correlation. The flights dataset is your dataframe, and you are correlating distance and arr_delay. Set use = 'pairwise.complete.obs' to ensure your correlation does not include missing values.

    πŸ”‘ Solution
    # Load the library to access data
    library('nycflights13')
    
    # Calculate the Spearman correlation coefficient
    cor(flights$distance, flights$arr_delay, method = 'spearman', use = 'pairwise.complete.obs')
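
Spearman correlation is Pearson correlation computed on ranks, and use = 'pairwise.complete.obs' drops any pair with a missing value before ranking. The base R sketch below checks both points on hypothetical toy vectors, which are not taken from the flights data.

```r
# Hypothetical toy data with one missing arrival delay
distance  <- c(200, 500, 760, 1100, 1400, 2500)
arr_delay <- c(12, NA, 30, -5, 8, -15)

# Spearman correlation using only pairs where both values are present
rho <- cor(distance, arr_delay,
           method = "spearman", use = "pairwise.complete.obs")

# Same result by hand: drop incomplete pairs, rank each variable, apply Pearson
ok <- complete.cases(distance, arr_delay)
rho_by_hand <- cor(rank(distance[ok]), rank(arr_delay[ok]), method = "pearson")

rho                          # -0.8
all.equal(rho, rho_by_hand)  # TRUE
```

Without the use argument, cor() would return NA here because of the missing delay.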
    

    Task 4.2: Visualizing the Relationship with Scatter Plot and Linear Regression Line

    Load the ggplot2 library. Create a scatter plot to visualize the relationship between distance and arr_delay in the flights dataset. Add a linear regression line to the plot.

    πŸ” Hint

    Use the ggplot() function with aes() to specify the x and y variables. Add geom_point() for the scatter plot and geom_smooth(method = 'lm') for the linear regression line.

    πŸ”‘ Solution
    # Load the necessary library
    library(ggplot2)
    
    # Create a scatter plot with a linear regression line
    ggplot(flights, aes(x = distance, y = arr_delay)) +
      geom_point() +
      geom_smooth(method = 'lm')
    

    Task 4.3: Setting Up a Logistic Regression Model

    Create a new column, bin_arr_delay (binary arrival delay), that has the value 1 if arr_delay is greater than 5 minutes and 0 otherwise. Flights with a missing arr_delay get NA in the new column, and glm() drops those rows by default when fitting. Use the glm() function to set up a logistic regression model predicting bin_arr_delay using air_time and distance as predictors. The model may take a moment to fit. Use summary() to view the model output.

    πŸ” Hint

    Use the ifelse function to create the bin_arr_delay variable. Use the glm() function with family = 'binomial' for logistic regression. The syntax for a regression formula is predicted ~ predictor1 + predictor2.

    πŸ”‘ Solution
    # Create a new column
    flights$bin_arr_delay <- ifelse(flights$arr_delay > 5, 1, 0)
    
    # Set up the logistic regression model
    mymodel <- glm(bin_arr_delay ~ air_time + distance, family = 'binomial', data = flights)
    
    # Print the summary of the model
    summary(mymodel)
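
The coefficients that summary() reports are on the log-odds scale, which can be hard to interpret directly. The base R sketch below shows one common way to read them, using simulated data rather than the flights dataset. The data-generating values (-2 and 0.01), the seed, and the names toy and toy_model are all assumptions made for illustration.

```r
set.seed(123)

# Hypothetical simulated data: the chance of a delay rises with air_time
n <- 500
air_time <- runif(n, min = 30, max = 400)
p_delay <- plogis(-2 + 0.01 * air_time)  # assumed true delay probability
bin_arr_delay <- rbinom(n, size = 1, prob = p_delay)
toy <- data.frame(air_time, bin_arr_delay)

toy_model <- glm(bin_arr_delay ~ air_time, family = "binomial", data = toy)

# Exponentiate the log-odds coefficients to get odds ratios
odds_ratios <- exp(coef(toy_model))

# Predicted probability of a delay for a 60-minute and a 300-minute flight
new_flights <- data.frame(air_time = c(60, 300))
predicted <- predict(toy_model, newdata = new_flights, type = "response")

odds_ratios
predicted
```

An odds ratio above 1 for a predictor means the odds of a delay increase as that predictor increases. In the lab's own model, air_time and distance tend to move together closely, so their individual coefficients should be interpreted with care.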
    
