
# Statistical Modeling in R

This Code Lab guides learners through essential regression modeling techniques in R using `lm()` and `glm()`. By completing the lab, participants will gain hands-on experience in loading datasets, fitting and evaluating linear, logistic, and Poisson regression models, and making predictions. Learners will explore key model diagnostics, interpret statistical outputs, and visualize regression results. This foundational lab prepares participants for real-world data analysis and predictive modeling using R.

### Step 0: Getting Started
In this lab you will fit 3 regression models in R:
- Linear regression
- Logistic regression (using the GLM family)
- Poisson regression (using the GLM family)
Each of these regression models is a separate Step in this lab.
Data for each model
Each regression model uses its own dataset; the data files are available in the `workspace/` folder, which will be your working directory.

RStudio Guide
To get started, click on the `workspace/` folder in the bottom right pane of RStudio. Click on the file entitled `Step 1 - LinearRegression.Rmd`.
You may want to drag the console pane to be smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file.
Remember, you must run the cells with the play button at the top right of each cell for a task before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step.
Then, when you are ready to move on to the next step, come back and click on the file for the next step, i.e. `Step 2 - LogisticRegression.Rmd` and then `Step 3 - PoissonRegression.Rmd`, until you have completed all tasks in all steps of the lab.
### Step 1: Linear Regression
Exploring Linear Regression with R
To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.
To get started, click on the `workspace/` folder in the bottom right pane of RStudio. Click on the file entitled `Step 1 - LinearRegression.Rmd`.
Linear regression is a fundamental statistical technique that helps us understand the relationship between variables. Before fitting a regression model, it is essential to explore and clean the dataset.
You'll start by loading the dataset, checking for missing values, inspecting its structure, and visualizing key features.
Task 1.1: Load the Dataset
First, load the data from the `insurance.csv` file into an R data frame. Make sure any text columns are treated as categorical variables. Then, display the first few rows of the data frame so you can see what it looks like.

💡 Hint
Use `read.csv()` to load the dataset and set `stringsAsFactors = TRUE` to ensure categorical variables are correctly handled while reading in the data.

📝 Solution
```r
# Load the insurance data
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)

# View the first few records of the data
head(insurance)
```
Task 1.2: Check for Missing Values
Count the missing values in the insurance data; if any are present, you will need to identify and deal with them.
Identifying missing values is critical to ensure data quality.
💡 Hint
Use `is.na()` combined with `sum()` to count missing values in the dataset.

📝 Solution
```r
# Check for missing values in `insurance`
sum(is.na(insurance))
```
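If the count comes back greater than zero, here is a minimal sketch of one way to locate and remove missing rows (assuming complete-case removal is acceptable for your analysis):

```r
# Count missing values per column to see where they occur
colSums(is.na(insurance))

# Keep only complete rows (drops any row with at least one NA)
insurance <- na.omit(insurance)
```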
Task 1.3: Inspect the Structure of the Dataset and Get Summary Statistics
View the structure of the data and the summary statistics for the data.
💡 Hint
Use `str()` to inspect the structure and `summary()` to get an overview of the data.

📝 Solution
```r
# Inspect the structure
str(insurance)

# Get summary statistics
summary(insurance)
```
Task 1.4: Explore Data Using Visualizations
Create a histogram of charges to visualize how they are distributed. Then use a scatterplot to visualize the relationship between an individual's age and their insurance charges.
💡 Hint
Use `hist()` for distributions and `plot()` to explore relationships between variables.

📝 Solution
```r
# Use a histogram to view distribution of insurance charges
hist(insurance$charges, main = "Distribution of Insurance Charges",
     xlab = "Charges", col = "lightblue", border = "white")

# Use a scatterplot to view charges vs. age
plot(insurance$age, insurance$charges, xlab = "Age", ylab = "Charges",
     main = "Charges vs. Age", pch = 19, col = "steelblue")
```
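If you want one more exploratory view, a boxplot is a natural fit for comparing a numeric variable across the levels of a factor. A minimal sketch using the `smoker` column (one of the predictors used later in this step):

```r
# Compare the distribution of charges for smokers vs. non-smokers
boxplot(charges ~ smoker, data = insurance,
        main = "Charges by Smoking Status",
        xlab = "Smoker", ylab = "Charges", col = "lightgreen")
```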
Task 1.5: Load the `rsample` Library to Split Data into Train and Test
The `rsample` package is already part of the environment. Include it in your program.

💡 Hint
Use `library()` to include the `rsample` library for splitting data into training and testing data.

📝 Solution
```r
# Load the rsample library
library(rsample)
```
Task 1.6: Split the Data (70% Training, 30% Test)
Split the data into training and test sets in variables called `train_data` and `test_data`. `train_data` should have 70% of the data, and `test_data` the remaining 30%. Use seed `123` for reproducibility.

💡 Hint
Use `initial_split()` to divide the dataset into training and test sets, and the `training()` and `testing()` functions to access the splits.

📝 Solution
```r
# Set a reproducible seed
set.seed(123)

# Split the data
split <- initial_split(insurance, prop = 0.7)

# Assign the training data
train_data <- training(split)

# Assign the test data
test_data <- testing(split)
```
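As an optional sanity check, you can confirm that the split proportions came out as expected:

```r
# Roughly 70% of the rows should land in the training set
nrow(train_data)
nrow(test_data)
nrow(train_data) / nrow(insurance)
```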
Task 1.7: Simple Regression with a Single Predictor on Training Data
Write code to fit a simple linear regression model on the training data where `charges` is predicted solely by `age`. After fitting the model, inspect the model summary to evaluate how well age explains the variability in charges.

Note that the regression model built using this single predictor has a very low R²; `age` alone does not have much predictive power.

💡 Hint
Use `lm()` to fit a linear model with `charges ~ age`.

📝 Solution
```r
# Fit a simple regression model with `age` as predictor
lm_simple <- lm(charges ~ age, data = train_data)

# View summary of model results
summary(lm_simple)
```
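If you want the R² as a plain number rather than reading it off the printed summary, it is stored on the summary object; a minimal sketch:

```r
# Extract R-squared directly from the model summary
summary(lm_simple)$r.squared
```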
Task 1.8: Multiple Regression with All Predictors on Training Data
Write code to fit a regression model using all available predictors (such as `age`, `sex`, `bmi`, `children`, `smoker`, and `region`) to predict `charges`. Then, review the model summary to compare the coefficients and performance metrics with the simple regression model. This comparison will show you the benefits of including additional variables.

Note that the R² shows a significant improvement and is now > 0.7.

💡 Hint
Use `lm()` to fit a model using all available predictors (`charges ~ .`).

📝 Solution
```r
# Fit a multiple regression model with all features as predictors
lm_multiple <- lm(charges ~ ., data = train_data)

# View summary of model results
summary(lm_multiple)
```
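The lab overview mentions model diagnostics; base R can produce the four standard diagnostic plots (residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage) for any `lm` fit. A minimal sketch:

```r
# Show the four standard diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(lm_multiple)

# Reset the plotting layout
par(mfrow = c(1, 1))
```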
Task 1.9: Make Predictions on the Test Set
Now, use the `lm_multiple` model you created to predict insurance charges on the `test_data`. Use the `predict()` function for this. Then, create a new data frame called `results`. This data frame should have two columns: `Actual`, containing the real charges from `test_data`, and `Predicted`, containing the values you just predicted.

💡 Hint
Use `predict()` to generate predictions using the multiple regression model.

📝 Solution
```r
# Make predictions on the test data
predictions <- predict(lm_multiple, newdata = test_data)

# Get the actual data and the predictions in a single data frame
results <- data.frame(Actual = test_data$charges, Predicted = predictions)
```
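A quick way to see how well the predictions track the actual charges is a scatterplot of the `results` data frame with a 45-degree reference line; a minimal sketch:

```r
# Points close to the red line indicate accurate predictions
plot(results$Actual, results$Predicted,
     xlab = "Actual Charges", ylab = "Predicted Charges",
     main = "Actual vs. Predicted Charges", pch = 19, col = "steelblue")

# Perfect-prediction reference line (intercept 0, slope 1)
abline(0, 1, col = "red", lwd = 2)
```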
Task 1.10: Compute the R² on the Test Set Using `caret`
First, load the `caret` package. This package has many useful functions for machine learning, including one for calculating R-squared.

Once you've loaded `caret`, use its `R2()` function to calculate the R-squared value for your predictions. You'll need to give it two things: the predictions you made (`predictions`) and the actual values from your test data (`test_data$charges`). This will tell you how well your model's predictions match the real values.

💡 Hint
Load the `caret` library and invoke the `R2(predictions, test_data$charges)` function.

📝 Solution
```r
# Load the `caret` package
library(caret)

# Compute the R2
R2(predictions, test_data$charges)
```
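caret also provides `RMSE()` and `MAE()`, which take the same `(predicted, observed)` arguments as `R2()`; a minimal sketch if you also want error metrics in the original units (dollars):

```r
# Root mean squared error and mean absolute error on the test set
RMSE(predictions, test_data$charges)
MAE(predictions, test_data$charges)
```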
### Step 2: Logistic Regression
Exploring Logistic Regression with R
To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.
To get started, click on the `workspace/` folder in the bottom right pane of RStudio. Click on the file entitled `Step 2 - LogisticRegression.Rmd`.
Logistic regression is a fundamental statistical technique used for modeling binary or categorical outcomes. It helps us understand the relationship between predictor variables and a categorical response variable by estimating probabilities using the logistic function.
Task 2.1: Load Data for Logistic Regression
Before starting with logistic regression, you need to load the dataset into R. The dataset in the `churn.csv` file contains customer information for a telecom service, along with whether or not each customer churned. Make sure you use `stringsAsFactors` so that categorical variables are read into the program as factors.

You will fit a logistic regression model to predict whether a customer churned.

💡 Hint
Use `read.csv()` to load the dataset.

📝 Solution
```r
# Load the churn data
churn <- read.csv("churn.csv", stringsAsFactors = TRUE)

# View the first few records of the data
head(churn)
```
Task 2.2: View the Number of Records in Each Category
Before training a model, it is useful to check how many customers belong to each class (churn vs. no churn). This will help you understand if the dataset is imbalanced, which can affect model performance.
💡 Hint
Use `table()` to count occurrences of each category.

📝 Solution
```r
# View the count in each category
table(churn$Churn)
```
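Counts are often easier to compare as proportions; a minimal sketch using `prop.table()`:

```r
# Share of customers in each churn class
prop.table(table(churn$Churn))
```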
Task 2.3: Visualize the Number of Records in Each Category
A bar plot is a great way to visualize class distribution in the dataset. This helps you quickly see if one category dominates the other, which might impact model predictions.
💡 Hint
Use `barplot()` to visualize the category distribution.

📝 Solution
```r
# Store churn counts in a variable
churn_counts <- table(churn$Churn)

# Visualize using a bar plot
barplot(churn_counts, main = "Churn Distribution",
        xlab = "Churn", ylab = "Count", col = "steelblue")
```
Task 2.4: Load the `rsample` Library to Split Data into Train and Test
To build a predictive model, you must split the dataset into training and testing sets. The `rsample` package helps you efficiently partition your data.

💡 Hint
Use `library()` to include the `rsample` library in your program.

📝 Solution
```r
# Load the rsample library
library(rsample)
```
Task 2.5: Split the Data (70% Training, 30% Test)
A standard practice in machine learning is to allocate a portion of the data for training and another for testing. Here, 70% of the data will be used for training, while 30% will be reserved for evaluation.
💡 Hint
Use `initial_split()` to divide the dataset. Use `training()` and `testing()` to apportion the training and test data. Use `123` as a reproducible seed.

📝 Solution
```r
# Set a reproducible seed
set.seed(123)

# Split the data
split <- initial_split(churn, prop = 0.7)

# Assign the training data
train_data <- training(split)

# Assign the test data
test_data <- testing(split)
```
Task 2.6: Logistic Regression with All Predictors on Training Data
Now that the data is split, fit a logistic regression model on the training data using all available features (`Churn ~ .`). This will allow you to predict whether a customer will churn based on various factors.

💡 Hint
Use `glm()` with `family = binomial`.

📝 Solution
```r
# Fit a logistic regression model with all features as predictors (GLM model)
logistic_model <- glm(Churn ~ ., family = binomial, data = train_data)

# View summary of model results
summary(logistic_model)
```
Task 2.7: Compute the odds ratio
The odds ratio in a logistic model quantifies how a one-unit change in a predictor variable affects the odds of the outcome occurring, holding other variables constant.
💡 Hint
Use `exp(coef())` to compute the odds ratios.

📝 Solution
```r
# Compute the odds ratios
odds_ratio <- exp(coef(logistic_model))
odds_ratio
```
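If you also want uncertainty around these estimates, a common follow-up is to exponentiate confidence intervals on the coefficients. A minimal sketch (the profile-likelihood intervals computed by `confint()` can take a moment on larger models):

```r
# 95% confidence intervals on the odds-ratio scale
exp(confint(logistic_model))
```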
Task 2.8: Make Predictions on the Test Set and Construct a Confusion Matrix
Once the model is trained, test it by making predictions on the test dataset. A confusion matrix will help evaluate classification performance. Use a threshold of `0.5` for the confusion matrix.

💡 Hint
Use `predict()` to make predictions and `table()` to construct a confusion matrix.

📝 Solution
```r
# Make predictions on the test data
predict_test <- predict(logistic_model, type = "response", newdata = test_data)

# Construct a confusion matrix with threshold = 0.5
test_table <- table(test_data$Churn, predict_test > 0.5)
test_table
```
Task 2.9: Extract Confusion Matrix Components
A confusion matrix breaks down the model's predictions into true positives, false positives, true negatives, and false negatives. Extracting these values allows you to compute accuracy, precision, and recall for the model.
💡 Hint
Access matrix values using indexing, e.g. `test_table[1, 1]` gives us the true negatives.

📝 Solution
```r
# True negatives
true_negatives <- test_table[1, 1]

# False positives
false_positives <- test_table[1, 2]

# False negatives
false_negatives <- test_table[2, 1]

# True positives
true_positives <- test_table[2, 2]
```
Task 2.10: Calculate and Print Performance Metrics: Accuracy, Precision, and Recall
Accuracy, precision, and recall are key metrics for evaluating a classification model. These metrics provide insights into the model's effectiveness in predicting churn.
💡 Hint
Use the standard formulas to compute accuracy, precision, and recall, e.g. `accuracy <- (true_positives + true_negatives) / sum(test_table)`.

📝 Solution
```r
# Calculate and print accuracy
accuracy <- (true_positives + true_negatives) / sum(test_table)
cat("Accuracy:", accuracy, "\n")

# Calculate and print precision
precision <- true_positives / (true_positives + false_positives)
cat("Precision:", precision, "\n")

# Calculate and print recall
recall <- true_positives / (true_positives + false_negatives)
cat("Recall:", recall, "\n")
```
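Precision and recall are often combined into a single F1 score, their harmonic mean; a minimal sketch building on the values computed above:

```r
# F1 score: harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
cat("F1 Score:", f1, "\n")
```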
### Step 3: Poisson Regression
Exploring Poisson Regression with R
To review the concepts covered in this step, please refer to the Build and Interpret Statistical Models module of the Statistical Modeling and Hypothesis Testing in R course.
To get started, click on the `workspace/` folder in the bottom right pane of RStudio. Click on the file entitled `Step 3 - PoissonRegression.Rmd`.
Poisson regression is a fundamental statistical technique used for modeling count data, where the response variable represents the number of occurrences of an event in a fixed interval of time or space. It assumes that the count data follows a Poisson distribution and uses the log link function to model the relationship between predictor variables and the expected count.
Task 3.1: Load Dataset and View Records
Before performing Poisson regression, load the dataset and inspect its structure to ensure it is correctly formatted.
💡 Hint
Use `read.csv()` to load the dataset and `head()` to preview the first few rows.

📝 Solution
```r
# Load the dataset
accident_data <- read.csv("accidents_data.csv", stringsAsFactors = TRUE)

# View first few rows
head(accident_data)
```
Task 3.2: Quick Summary of the Data
Summarizing the dataset helps identify missing values, outliers, and key statistics for each variable.
💡 Hint
Use `summary()` to generate summary statistics of the dataset.

📝 Solution
```r
# Get summary statistics for accidents
summary(accident_data)
```
Task 3.3: Check Poisson Distribution Assumptions
For Poisson regression, the mean and variance of the dependent variable should be approximately equal. Compute these values to check the assumption.
💡 Hint
Use `mean()` and `var()` to calculate these statistics.

📝 Solution
```r
# Calculate mean and variance of accident occurrences
mean_accidents <- mean(accident_data$Accidents)
var_accidents <- var(accident_data$Accidents)

# Mean ~ Variance
print(paste("Mean:", mean_accidents))
print(paste("Variance:", var_accidents))
```
Task 3.4: Compute Dispersion Ratio
The dispersion ratio, which is the variance divided by the mean, should be close to 1 for a Poisson distribution.
💡 Hint
Use `var_accidents / mean_accidents` to compute the ratio.

📝 Solution
```r
# Compute dispersion ratio (should be close to 1)
dispersion_ratio <- var_accidents / mean_accidents
print(paste("Dispersion Ratio:", dispersion_ratio))
```
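If the ratio came out well above 1 (overdispersion), a plain Poisson fit would understate the standard errors. One common fallback is a quasi-Poisson model, which estimates the dispersion from the data instead of fixing it at 1; a minimal sketch using the same predictors you will fit in Task 3.6:

```r
# Quasi-Poisson model: same mean structure, dispersion estimated from the data
quasi_model <- glm(Accidents ~ Weekend + TrafficVolume,
                   family = quasipoisson, data = accident_data)
summary(quasi_model)
```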
Task 3.5: Visualize Average Accidents and Traffic Volume per Weekday/Weekend
To understand patterns in the data, visualize how accidents and traffic volume vary across different days.
💡 Hint
Use `aggregate()` to compute averages and `barplot()` to visualize the results.

📝 Solution
```r
# Calculate average accidents on weekday/weekend
avg_accidents <- aggregate(Accidents ~ Weekend, data = accident_data, FUN = mean)

# Bar plot for average accidents on weekday/weekend
barplot(avg_accidents$Accidents, names.arg = avg_accidents$Weekend,
        main = "Average Accidents on Weekday/Weekend", col = "steelblue",
        xlab = "Weekend", ylab = "Average Accidents")

# Calculate average traffic volume on weekday/weekend
avg_traffic <- aggregate(TrafficVolume ~ Weekend, data = accident_data, FUN = mean)

# Bar plot for average traffic volume on weekday/weekend
barplot(avg_traffic$TrafficVolume, names.arg = avg_traffic$Weekend,
        main = "Average Traffic Volume Weekday/Weekend", col = "darkred",
        xlab = "Weekend", ylab = "Average Traffic Volume")
```
Task 3.6: Fit a Poisson Regression Model
Now, fit a Poisson regression model to examine how `Weekend` and `TrafficVolume` influence accident occurrences.

💡 Hint
Use `glm()` with `family = poisson` to specify a Poisson regression model.

📝 Solution
```r
# Fit Poisson model
poisson_model <- glm(Accidents ~ Weekend + TrafficVolume,
                     family = poisson, data = accident_data)

# Display model summary
summary(poisson_model)
```
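Because Poisson regression uses a log link, exponentiated coefficients can be read as rate ratios: the multiplicative change in the expected accident count for a one-unit increase in a predictor. A minimal sketch:

```r
# Rate ratios for each predictor
exp(coef(poisson_model))
```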