
Performing Dimension Analysis with R: Hands-on Practice

In this lab, you will explore the complexities of high-dimensional data with R, beginning with visualization techniques to understand the structure of the HRIS dataset used throughout the lab. Progressing to feature selection and dimensionality reduction, you will learn how to streamline data for more efficient machine learning models using methods such as stepwise regression and PCA. The lab culminates in applying advanced techniques such as LDA, QDA, and manifold learning to both categorical and non-linear data, equipping you with the skills to enhance model performance across diverse datasets.


Duration
1h 0m
Published
Apr 05, 2024


Table of Contents

  1. Challenge

    Exploring High-Dimensionality Data

    RStudio Guide

    To get started, click on the 'workspace' folder in the bottom right pane of RStudio, then click on the file entitled "Step 1...". You may want to drag the console pane smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. Remember to run each task's cells with the play button at the top right of each cell before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step. When you are ready to move on, come back and click on the file for the next step, and repeat until you have completed all tasks in all steps of the lab.


    Exploring High-Dimensionality Data

    To review the concepts covered in this step, please refer to the Understanding the Importance of Reducing Complexity in Data module of the Performing Dimension Analysis with R course.

    Understanding the importance of reducing complexity in data is crucial because it sets the foundation for effective dimensionality reduction and feature selection. This step will help learners grasp the problems associated with high-dimensionality data and the difference between feature selection and dimensionality reduction.

    Dive into the world of data by exploring the HRIS.csv dataset. Though this dataset is relatively small, the techniques you learn will apply to datasets of any size. Using R, you'll perform exploratory data analysis (EDA) to identify key aspects of the data. Tools such as ggplot2 for visualization and dplyr for data manipulation will be essential in this step. Objective: Gain a comprehensive understanding of high-dimensional data characteristics and the necessity for dimensionality reduction.
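
    The lab overview also mentions stepwise regression, a feature selection approach that keeps or drops whole columns rather than combining them the way dimensionality reduction does. The following optional sketch illustrates the idea; it assumes hr_data has been loaded as in Task 1.2 below and uses numeric columns that appear later in this lab.

    # Optional sketch: backward stepwise selection with base R's step()
    # (assumes hr_data is loaded and contains these numeric columns)
    full_model <- lm(salary ~ benefits + employee_level + tenure_in_days, data = hr_data)
    reduced_model <- step(full_model, direction = "backward")
    summary(reduced_model)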


    Task 1.1: Load Required Libraries

    Before diving into the data, you need to load the necessary libraries for data manipulation and visualization. For this task, load the dplyr and ggplot2 libraries.

    🔍 Hint

    Use the library() function to load each library. For example, to load dplyr, you would use library('dplyr').

    🔑 Solution
    # Load the data manipulation and visualization libraries
    library(dplyr)
    library(ggplot2)
    

    Task 1.2: Read and Inspect the Dataset

    Now that the libraries are loaded, the next step is to read the HRIS.csv file into R as hr_data and inspect the first few rows of the dataset to understand its structure.

    🔍 Hint

    Use the read.csv() function to read the file, and the head() function to display the first few rows. Remember to specify the correct path to the file.

    🔑 Solution
    # Read the HRIS dataset and display the first few rows
    hr_data <- read.csv('HRIS.csv')
    head(hr_data)
    

    Task 1.3: Explore Data Dimensions

    Understanding the dimensions of your dataset is crucial for recognizing high-dimensionality. Use R to find out the number of rows and columns in the hr_data dataset.

    🔍 Hint

    Use the dim() function to get the dimensions of the dataset.

    🔑 Solution
    # Number of rows and columns
    dim(hr_data)
    

    Task 1.4: Summarize the Dataset

    Get a quick summary of the dataset to understand the distribution of data across different columns. This will help in identifying any potential issues with the data.

    🔍 Hint

    Use the summary() function to get a summary of the dataset.

    🔑 Solution
    # Summary statistics for each column
    summary(hr_data)
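
    If the summary reveals missing values, an optional follow-up is to count the NAs in each column:

    # Optional: number of missing values per column
    colSums(is.na(hr_data))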
    

    Task 1.5: Visualize Salary Distribution

    Visualizing data can provide insights that are not immediately obvious from the raw data. Create a histogram to visualize the distribution of salaries in the dataset.

    🔍 Hint

    Use the ggplot() function from ggplot2, along with geom_histogram(), to create the histogram. Make sure to specify salary as the data to plot.

    🔑 Solution
    # Histogram of the salary column (bin width of 5000)
    ggplot(hr_data, aes(x = salary)) +
      geom_histogram(binwidth = 5000)
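
    To look at several dimensions at once rather than one variable at a time, an optional scatterplot matrix works well; the columns named here are the numeric ones used later in the lab.

    # Optional: pairwise scatterplots of a few numeric columns
    pairs(hr_data[, c("salary", "benefits", "employee_level", "tenure_in_days")])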
    
  2. Challenge

    Implementing Principal Components Analysis (PCA)


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Continuous Data module of the Performing Dimension Analysis with R course.

    Principal Components Analysis (PCA) is important because it's a fundamental technique for reducing dimensionality in linear data, helping to mitigate the curse of dimensionality. This step will provide hands-on experience in applying PCA to continuous data.

    In this step, you'll apply Principal Components Analysis (PCA) to the HRIS.csv dataset to reduce its dimensionality. The goal is to extract the most significant components that explain the majority of the variance in the data. You'll use the prcomp function in R. Objective: Learn how to perform PCA on a dataset and interpret the results to understand the underlying structure of the data.


    Task 2.1: Loading and Inspecting the HRIS Dataset

    Load the HRIS.csv dataset into R as hr_data and inspect the first few rows to understand its structure.

    🔍 Hint

    Use the read.csv() function to load the dataset and the head() function to display the first few rows.

    🔑 Solution
    # Load the HRIS dataset
    hr_data <- read.csv('HRIS.csv')
    
    # Display the first few rows of the dataset
    head(hr_data)
    

    Task 2.2: Preprocessing Data for PCA

    Before performing PCA, it's important to preprocess the data. Load the dplyr package to help with preprocessing.

    For PCA, only meaningful numeric variables are appropriate. Subset the data to columns that fit this description.

    🔍 Hint

    Use the select() function from the dplyr package to select variables. Note that the employee_id column is numeric, but is not meaningful for PCA.

    🔑 Solution
    # Load the dplyr package
    library(dplyr)
    
    # Retain only meaningful numeric variables
    hr_data_numeric <- hr_data %>%
      select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
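
    As an optional sanity check before PCA, confirm the retained columns are numeric and note how different their scales are; those scale differences are why the variables are standardized in the next task.

    # Optional: confirm column types and compare scales
    sapply(hr_data_numeric, is.numeric)
    sapply(hr_data_numeric, sd)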
    

    Task 2.3: Performing PCA

    Perform a principal components analysis (PCA) on the dataset. Specify that the variables should be scaled to unit variance (i.e., standardized). Summarize the result. How much cumulative variance is explained by 3 principal components?

    🔍 Hint

    Use the prcomp() function to perform PCA. Make sure to set the scale. argument to TRUE to standardize the variables. Use summary() to summarize the results.

    🔑 Solution
    # Perform PCA
    pca <- prcomp(hr_data_numeric, scale. = TRUE)
    summary(pca)
    
    # ___% of variance is explained by 3 components
    # ~82%
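
    If you prefer a plot over the summary table, the variance explained can also be visualized from the PCA object (an optional extra):

    # Optional: proportion of variance explained by each component
    prop_var <- pca$sdev^2 / sum(pca$sdev^2)
    cumsum(prop_var)

    # Simple scree-style plot
    plot(prop_var, type = "b", xlab = "Principal component", ylab = "Proportion of variance")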
    

    Task 2.4: Visualize with a Biplot

    Use the biplot function to plot the PCA results. Which variables are most related to the first principal component (the x-axis)?

    🔍 Hint

    Use the biplot function with your PCA results as the input argument.

    🔑 Solution
    # Make a biplot
    biplot(pca)
    
    # The variables ____ are most related to the first principal component
    # salary, employee_level, benefits
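
    A common optional follow-up is to keep only the first few component scores as a lower-dimensional version of the data; the scores are stored in pca$x.

    # Optional: retain the first three principal component scores
    hr_data_reduced <- as.data.frame(pca$x[, 1:3])
    head(hr_data_reduced)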
    
  3. Challenge

    Applying Linear Discriminant Analysis (LDA)


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Categorical Data module of the Performing Dimension Analysis with R course.

    Linear Discriminant Analysis (LDA) is important because it provides a method for dimensionality reduction that is particularly useful for categorical data. This step will offer practical experience in using LDA for dimensionality reduction.

    This step involves applying Linear Discriminant Analysis (LDA) to the HRIS.csv dataset to reduce its dimensionality while considering the categorical nature of some of its variables. You'll convert appropriate columns to factors and use the MASS package's lda function to perform LDA. Objective: Understand how to apply LDA to a dataset with categorical variables and interpret the results.


    Task 3.1: Loading and Inspecting the Dataset

    Begin by loading the HRIS.csv dataset into R as hr_data. Use appropriate functions to inspect the dataset, focusing on its structure and the first few rows to understand the data you'll be working with.

    🔍 Hint

    Use the read.csv() function to load the dataset. Then, apply the str() function to inspect the structure and head() to display the first few rows.

    🔑 Solution
    # Load the HRIS.csv dataset
    hr_data <- read.csv('HRIS.csv')
    
    # Inspect the structure of the dataset
    str(hr_data)
    
    # Display the first few rows of the dataset
    head(hr_data)
    

    Task 3.2: Converting Categorical Variables to Factors

    Some columns in the dataset are categorical and should be treated as factors for the analysis. Convert the gender, position, and department columns to factors.

    🔍 Hint

    Use the factor() function to convert the gender, position, and department columns to factors. Assign the converted columns back to the dataset.

    🔑 Solution
    # Convert the specified columns to factors
    hr_data$gender <- factor(hr_data$gender)
    hr_data$position <- factor(hr_data$position)
    hr_data$department <- factor(hr_data$department)
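
    To confirm the conversion worked, you can optionally inspect the levels of one of the new factors and the structure of the converted columns:

    # Optional: verify the factor conversion
    levels(hr_data$position)
    str(hr_data[, c("gender", "position", "department")])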
    

    Task 3.3: Performing LDA

    Load the MASS package and perform linear discriminant analysis (LDA). Predict an employee's position using their salary, benefits, and employee_level. Print the results.

    🔍 Hint

    Use the lda() function to perform LDA. The formula should follow the pattern predicted ~ predictor1 + predictor2....

    🔑 Solution
    # Load the MASS package, which provides lda()
    library(MASS)
    
    # Fit an LDA model predicting position, then print it
    lda_model <- lda(formula = position ~ salary + benefits + employee_level, data = hr_data)
    lda_model
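
    As an optional extension, the fitted model can generate predictions for the same data, which you can compare to the actual positions; predict() on an lda object returns the predicted classes in its class element. The lab overview also mentions QDA, and MASS's qda() accepts the same formula interface if you want to experiment with it.

    # Optional: compare LDA predictions to the actual positions
    lda_pred <- predict(lda_model)
    table(Actual = hr_data$position, Predicted = lda_pred$class)

    # Optional: QDA uses the same formula interface (it may fail if any position group is very small)
    # qda_model <- qda(position ~ salary + benefits + employee_level, data = hr_data)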
    
  4. Challenge

    Exploring Manifold Learning Techniques


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Non-linear Data module of the Performing Dimension Analysis with R course.

    Manifold Learning is crucial for dealing with non-linear data, as it can uncover the underlying structure of data that is not apparent in high-dimensional space. This step will introduce learners to t-SNE, a manifold learning technique.

    Embark on a journey through non-linear data by applying t-SNE to the HRIS.csv dataset. The goal is to discover low-dimensional manifolds that represent the data's structure. You'll use the Rtsne and dplyr packages. Objective: Gain practical experience in applying manifold learning techniques to non-linear data and understand how to interpret the results.


    Task 4.1: Loading the HRIS Dataset

    Begin by loading the HRIS.csv dataset into R as hr_data. This dataset contains various employee details, which you will use to explore manifold learning techniques. Use the read.csv function to load the dataset into a variable named hr_data.

    🔍 Hint

    Use the read.csv function with the file path as its argument.

    🔑 Solution
    hr_data <- read.csv('HRIS.csv')
    

    Task 4.2: Applying t-SNE

    Load the dplyr and Rtsne R packages, which are already installed in this environment. Subset the data to meaningful numeric columns, then apply t-SNE to reduce the dimensionality of the data to 2, storing the result in a variable named tsne_results. Finally, plot the results.

    🔍 Hint

    Use the select function from dplyr to subset to meaningful numeric columns (if you loaded the MASS package in a previous step, you may need to specify dplyr::select). Then, use the Rtsne function on the numeric data with the dimensions set to 2. Plot the results using the plot function and tsne_results$Y.

    🔑 Solution
    # Load the dplyr and Rtsne packages
    library(dplyr)
    library(Rtsne)
    
    # Retain only meaningful numeric variables
    hr_data_numeric <- hr_data %>%
      dplyr::select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
    
    # Apply t-SNE to the HRIS dataset
    tsne_results <- Rtsne(hr_data_numeric, dims = 2)
    
    # Plot the t-SNE results
    plot(tsne_results$Y)
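
    An optional way to interpret the embedding is to color each point by a categorical variable such as department and look for visible clusters; this assumes department exists in hr_data as in the earlier steps.

    # Optional: color the embedding by department
    plot(tsne_results$Y, col = factor(hr_data$department),
         xlab = "t-SNE 1", ylab = "t-SNE 2")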
    
