
Performing Dimension Analysis with R Hands-on Practice
In this lab, you will explore the complexities of high-dimensional data with R, beginning with visualization techniques to understand and explore the intricacies of datasets like the Boston Housing Dataset. Progressing to feature selection and dimensionality reduction, you will learn to streamline data for efficiency in machine learning models using methods such as stepwise regression and PCA. The lab culminates in applying advanced techniques such as LDA, QDA, and manifold learning for both categorical and non-linear data, equipping you with the skills to enhance model performance across diverse datasets.

Challenge
Exploring High-Dimensionality Data
RStudio Guide
To get started, click on the 'workspace' folder in the bottom right pane of RStudio, then click on the file entitled "Step 1...". You may want to shrink the console pane so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. For each task, run its cells with the play button at the top right of the cell before moving on to the next task. Continue until you have completed all tasks in this step. When you are ready for the next step, come back and open the file for that step, and repeat until you have completed all tasks in all steps of the lab.
To review the concepts covered in this step, please refer to the Understanding the Importance of Reducing Complexity in Data module of the Performing Dimension Analysis with R course.
Understanding the importance of reducing complexity in data is crucial because it sets the foundation for effective dimensionality reduction and feature selection. This step will help learners grasp the problems associated with high-dimensionality data and the difference between feature selection and dimensionality reduction.
Dive into the world of data by exploring the HRIS.csv dataset. Though this dataset is relatively small, the techniques you learn will apply to datasets of any size. Using R, you'll perform exploratory data analysis (EDA) to identify key aspects of the data. Tools such as ggplot2 for visualization and dplyr for data manipulation will be essential in this step.
Objective: Gain a comprehensive understanding of high-dimensional data characteristics and the necessity for dimensionality reduction.
Task 1.1: Load Required Libraries
Before diving into the data, you need to load the necessary libraries for data manipulation and visualization. For this task, load the dplyr and ggplot2 libraries.
Hint
Use the library() function to load each library. For example, to load dplyr, you would use library('dplyr').
Solution
library(dplyr)
library(ggplot2)
Task 1.2: Read and Inspect the Dataset
Now that the libraries are loaded, the next step is to read the HRIS.csv file into R as hr_data and inspect the first few rows of the dataset to understand its structure.
Hint
Use the read.csv() function to read the file, and the head() function to display the first few rows. Remember to specify the correct path to the file.
Solution
hr_data <- read.csv('HRIS.csv')
head(hr_data)
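The read-and-inspect pattern can be tried without the lab file. The sketch below is only an illustration: it writes a tiny, made-up CSV (a hypothetical stand-in for HRIS.csv, with invented values) to a temporary file, reads it back with read.csv(), and checks the column types that R inferred.

```r
# Hypothetical stand-in for HRIS.csv: a tiny CSV written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "employee_id,salary,department",
  "1,52000,HR",
  "2,61000,IT",
  "3,58000,IT"
), csv_path)

# Read the file and look at the first rows and the overall structure.
demo_data <- read.csv(csv_path, stringsAsFactors = FALSE)
head(demo_data, 2)
str(demo_data)

# Record the class that read.csv() inferred for each column.
column_types <- vapply(demo_data, function(col) class(col)[1], character(1))
column_types
```

Checking the inferred types early is useful because text columns arrive as character vectors and numeric identifiers arrive as integers, which matters for the later PCA and LDA steps.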
Task 1.3: Explore Data Dimensions
Understanding the dimensions of your dataset is crucial for recognizing high-dimensionality. Use R to find out the number of rows and columns in the hr_data dataset.
Hint
Use the dim() function to get the dimensions of the dataset.
Solution
dim(hr_data)
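As a small illustration of what dim() returns, the sketch below uses R's built-in mtcars data frame rather than the lab data. The rows-per-column ratio is an informal rule-of-thumb indicator introduced here for illustration, not a formal test of high dimensionality.

```r
# dim() returns c(number of rows, number of columns).
dims <- dim(mtcars)
dims

# nrow() and ncol() give the same two numbers individually.
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# Informal indicator: how many observations are available per variable.
rows_per_column <- n_rows / n_cols
rows_per_column
```

A low ratio suggests that there are few observations relative to the number of variables, which is one situation where dimensionality reduction may be worth considering.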
Task 1.4: Summarize the Dataset
Get a quick summary of the dataset to understand the distribution of data across different columns. This will help in identifying any potential issues with the data.
Hint
Use the summary() function to get a summary of the dataset.
Solution
summary(hr_data)
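Two of the issues that summary() can reveal are missing values and columns measured on very different scales. The sketch below shows one way to check both explicitly, using the built-in airquality data frame as a stand-in for the lab data.

```r
# Count missing values in each column.
na_counts <- colSums(is.na(airquality))
na_counts[na_counts > 0]

# Compare the spread of the numeric columns; large differences in scale
# are one reason to standardize variables before PCA.
numeric_cols <- vapply(airquality, is.numeric, logical(1))
col_sd <- vapply(airquality[numeric_cols], sd, numeric(1), na.rm = TRUE)
round(col_sd, 2)
```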
Task 1.5: Visualize Salary Distribution
Visualizing data can provide insights that are not immediately obvious from the raw data. Create a histogram to visualize the distribution of salaries in the dataset.
Hint
Use the ggplot() function from ggplot2, along with geom_histogram(), to create the histogram. Make sure to specify salary as the data to plot.
Solution
ggplot(hr_data, aes(x = salary)) +
  geom_histogram(binwidth = 5000)
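To see what a bin width of 5000 means numerically, the sketch below bins simulated salaries with base R's hist() instead of ggplot2. The salary values are randomly generated for illustration and are not taken from HRIS.csv, and the bin edges chosen here need not match the ones geom_histogram() would pick.

```r
# Simulated salaries (illustrative only, not the lab data).
set.seed(42)
salary <- round(rnorm(500, mean = 60000, sd = 12000))

# Build bin edges that are multiples of the bin width and cover the data.
binwidth <- 5000
breaks <- seq(floor(min(salary) / binwidth) * binwidth,
              ceiling(max(salary) / binwidth) * binwidth,
              by = binwidth)

# Compute the histogram without drawing it, then tabulate the counts.
bins <- hist(salary, breaks = breaks, plot = FALSE)
data.frame(bin_start = head(bins$breaks, -1), count = bins$counts)
```

A smaller bin width shows more detail but produces noisier counts, while a larger one smooths the shape, so it is worth trying a few values.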
Challenge
Implementing Principal Components Analysis (PCA)
To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Continuous Data module of the Performing Dimension Analysis with R course.
Principal Components Analysis (PCA) is important because it's a fundamental technique for reducing dimensionality in linear data, helping to mitigate the curse of dimensionality. This step will provide hands-on experience in applying PCA to continuous data.
In this step, you'll apply Principal Components Analysis (PCA) to the HRIS.csv dataset to reduce its dimensionality. The goal is to extract the most significant components that explain the majority of the variance in the data. You'll use the prcomp function in R.
Objective: Learn how to perform PCA on a dataset and interpret the results to understand the underlying structure of the data.
Task 2.1: Loading and Inspecting the HRIS Dataset
Load the HRIS.csv dataset into R as hr_data and inspect the first few rows to understand its structure.
Hint
Use the read.csv() function to load the dataset and the head() function to display the first few rows.
Solution
# Load the HRIS dataset
hr_data <- read.csv('HRIS.csv')
# Display the first few rows of the dataset
head(hr_data)
Task 2.2: Preprocessing Data for PCA
Before performing PCA, it's important to preprocess the data. Load the dplyr package to help with preprocessing. For PCA, only meaningful numeric variables are appropriate. Subset the data to columns that fit this description.
Hint
Use the select() function from the dplyr package to select variables. Note that the employee_id column is numeric, but is not meaningful for PCA.
Solution
# Load the dplyr package
library(dplyr)
# Retain only meaningful numeric variables
hr_data_numeric <- hr_data %>%
  select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
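The same subsetting idea can be sketched in base R. The example below uses a small made-up data frame (hypothetical values, with column names borrowed from the lab for readability): it finds the numeric columns, drops the identifier, and checks that no remaining column is constant, since prcomp() cannot rescale a zero-variance column to unit variance.

```r
# Made-up data for illustration; values are not from HRIS.csv.
toy <- data.frame(
  employee_id = 1:5,
  salary = c(52000, 61000, 58000, 75000, 49000),
  tenure_in_days = c(400, 1200, 900, 2500, 150),
  department = c("HR", "IT", "IT", "Sales", "HR")
)

# Identify numeric columns, then drop the identifier by name.
is_num <- vapply(toy, is.numeric, logical(1))
numeric_cols <- setdiff(names(toy)[is_num], "employee_id")
toy_numeric <- toy[, numeric_cols, drop = FALSE]

# Constant columns would break scaling, so confirm every variance is positive.
variances <- vapply(toy_numeric, var, numeric(1))
variances
```

Selecting by type alone would have kept employee_id, which is why the identifier is excluded explicitly.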
Task 2.3: Performing PCA
Perform a principal components analysis (PCA) on the dataset. Specify that the variables should be scaled to unit variance (i.e., standardized). Summarize the result. How much cumulative variance is explained by 3 principal components?
Hint
Use the prcomp function to perform PCA. Make sure to set the scale. argument to TRUE to standardize the variables. Use summary to summarize the results.
Solution
# Perform PCA
pca <- prcomp(hr_data_numeric, scale. = TRUE)
summary(pca)
# ___% of variance is explained by 3 components
# ~82%
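The cumulative proportion reported by summary() can also be computed directly from the component standard deviations. The sketch below does this on the built-in USArrests dataset, used here only as a stand-in for the lab data. The 80% threshold is an arbitrary choice for illustration.

```r
# PCA on standardized variables.
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Each component's variance is its squared standard deviation.
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_var <- cumsum(var_explained)
round(cum_var, 3)

# Smallest number of components whose cumulative variance reaches 80%.
n_keep <- which(cum_var >= 0.80)[1]
n_keep
```

Reading the answer for "3 components" from the lab's summary(pca) output corresponds to looking at the third element of a vector like cum_var.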
Task 2.4: Visualize with a Biplot
Use the biplot function to plot the PCA results. Which variables are most related to the first principal component (the x-axis)?
Hint
Use the biplot function with your PCA results as the input argument.
Solution
# Make a biplot
biplot(pca)
# The variables ____ are most related to the first principal component
# salary, employee_level, benefits
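A biplot can be hard to read when arrows overlap, so it may help to inspect the loadings numerically as well. The sketch below again uses the built-in USArrests dataset as a stand-in and ranks the variables by the absolute size of their loading on the first component.

```r
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings of each variable on the first principal component.
pc1 <- pca_demo$rotation[, "PC1"]

# Variables with the largest absolute loadings are most related to PC1.
sort(abs(pc1), decreasing = TRUE)
```

The sign of a loading is arbitrary in PCA, which is why the ranking uses absolute values.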
Challenge
Applying Linear Discriminant Analysis (LDA)
To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Categorical Data module of the Performing Dimension Analysis with R course.
Linear Discriminant Analysis (LDA) is important because it provides a method for dimensionality reduction that is particularly useful when the data include a categorical outcome: it looks for combinations of the predictors that best separate the categories. This step will offer practical experience in using LDA for dimensionality reduction.
This step involves applying Linear Discriminant Analysis (LDA) to the HRIS.csv dataset to reduce its dimensionality while considering the categorical nature of some of its variables. You'll convert appropriate columns to factors and use the MASS package's lda function to perform LDA.
Objective: Understand how to apply LDA to a dataset with categorical variables and interpret the results.
Task 3.1: Loading and Inspecting the Dataset
Begin by loading the HRIS.csv dataset into R as hr_data. Use appropriate functions to inspect the dataset, focusing on its structure and the first few rows to understand the data you'll be working with.
Hint
Use the read.csv() function to load the dataset. Then, apply the str() function to inspect the structure and head() to display the first few rows.
Solution
# Load the HRIS.csv dataset
hr_data <- read.csv('HRIS.csv')
# Inspect the structure of the dataset
str(hr_data)
# Display the first few rows of the dataset
head(hr_data)
Task 3.2: Converting Categorical Variables to Factors
Some columns in the dataset are categorical and should be treated as factors for the analysis. Convert the gender, position, and department columns to factors.
Hint
Use the factor() function to convert the gender, position, and department columns to factors. Assign the converted columns back to the dataset.
Solution
# Convert the specified columns to factors
hr_data$gender <- factor(hr_data$gender)
hr_data$position <- factor(hr_data$position)
hr_data$department <- factor(hr_data$department)
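When several columns need converting, the same result can be obtained in one step with lapply(). The sketch below uses a small made-up data frame (hypothetical values) rather than the lab data.

```r
# Made-up data for illustration.
toy <- data.frame(
  gender = c("F", "M", "F", "M"),
  department = c("IT", "HR", "IT", "Sales"),
  salary = c(61000, 52000, 58000, 75000),
  stringsAsFactors = FALSE
)

# Convert all listed columns to factors at once.
cat_cols <- c("gender", "department")
toy[cat_cols] <- lapply(toy[cat_cols], factor)

# Confirm the conversion and look at the categories and their counts.
str(toy)
levels(toy$department)
table(toy$department)
```

By default, factor() orders the levels alphabetically, which is worth knowing when reading model output that lists groups by level.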
Task 3.3: Performing LDA
Load the MASS package and perform linear discriminant analysis (LDA). Predict an employee's position using their salary, benefits, and employee_level. Print the results.
Hint
Use the lda() function to perform LDA. The formula should follow the pattern predicted ~ predictor1 + predictor2...
Solution
library(MASS)
lda_model <- lda(formula = position ~ salary + benefits + employee_level, data = hr_data)
lda_model
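Printing the model shows group means and discriminant coefficients, but the reduced-dimension representation comes from predict(). The sketch below illustrates this on the built-in iris dataset, used as a stand-in for the lab data. The accuracy is computed on the same rows used for fitting, so it is an optimistic estimate.

```r
library(MASS)

# Fit LDA: a categorical outcome predicted from four numeric variables.
lda_demo <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                data = iris)

# predict() returns the predicted class and the discriminant scores.
pred <- predict(lda_demo, iris)

# Discriminant scores: at most (number of classes - 1) columns.
dim(pred$x)

# Compare predicted and actual classes on the training rows.
confusion <- table(actual = iris$Species, predicted = pred$class)
confusion
accuracy <- mean(pred$class == iris$Species)
accuracy
```

The columns of pred$x are the low-dimensional coordinates, so four predictors are reduced to two discriminant axes in this example.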
Challenge
Exploring Manifold Learning Techniques
To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Non-linear Data module of the Performing Dimension Analysis with R course.
Manifold Learning is crucial for dealing with non-linear data, as it can uncover the underlying structure of data that is not apparent in high-dimensional space. This step will introduce learners to t-SNE, a manifold learning technique.
Embark on a journey through non-linear data by applying t-SNE to the HRIS.csv dataset. The goal is to discover low-dimensional manifolds that represent the data's structure. You'll use the Rtsne and dplyr packages.
Objective: Gain practical experience in applying manifold learning techniques to non-linear data and understand how to interpret the results.
Task 4.1: Loading the HRIS Dataset
Begin by using the read.csv function to load the HRIS.csv dataset into a variable named hr_data. This dataset contains various employee details, which you will use to explore manifold learning techniques.
Hint
Use the read.csv function with the file path as its argument.
Solution
hr_data <- read.csv('HRIS.csv')
Task 4.2: Applying t-SNE
Load the dplyr and Rtsne R packages, which are already installed in this environment. Subset the data to meaningful numeric columns. Then apply t-SNE to reduce the dimensionality of the data to 2, and store the result in a variable named tsne_results. Finally, plot the results.
Hint
Use the select function from dplyr to subset to meaningful numeric columns (if you loaded the MASS package in a previous step, you may need to specify dplyr::select). Then, use the Rtsne function on the numeric data with the dimensions set to 2. Plot the results using the plot function and tsne_results$Y.
Solution
# Load the dplyr and Rtsne packages
library(dplyr)
library(Rtsne)
# Retain only meaningful numeric variables
hr_data_numeric <- hr_data %>%
  dplyr::select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
# Apply t-SNE to the HRIS dataset
tsne_results <- Rtsne(hr_data_numeric, dims = 2)
# Plot the t-SNE results
plot(tsne_results$Y)
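With its default settings, Rtsne() stops if the input contains duplicate rows or if the perplexity is too large for the number of rows, and its output changes from run to run unless a seed is set. The sketch below shows one way to guard against these issues, using the numeric columns of the built-in iris dataset as a stand-in for the lab data. It assumes Rtsne may not be installed and skips the t-SNE call in that case. Standardizing the columns first is a choice made for this illustration, not a step taken from the lab.

```r
# Fix the random seed so the embedding is reproducible.
set.seed(123)

# Drop duplicate rows, then standardize the columns.
toy <- unique(iris[, 1:4])
toy_scaled <- scale(toy)

# Rtsne requires 3 * perplexity to be at most nrow - 1; cap the default of 30.
perplexity <- min(30, floor((nrow(toy_scaled) - 1) / 3))

# Run t-SNE only if the package is available in this environment.
if (requireNamespace("Rtsne", quietly = TRUE)) {
  tsne_demo <- Rtsne::Rtsne(toy_scaled, dims = 2, perplexity = perplexity)
  dim(tsne_demo$Y)  # one row per input row, two embedding columns
}
```

Distances between well-separated clusters in a t-SNE plot are not directly interpretable, so the embedding is best read for local grouping rather than global geometry.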