
Performing Dimension Analysis with R: Hands-on Practice

In this lab, you will explore the complexities of high-dimensional data with R, beginning with visualization techniques to understand the structure of the HRIS dataset used throughout the lab. Progressing to feature selection and dimensionality reduction, you will learn how to streamline data for more efficient machine learning models using methods such as stepwise regression and PCA. The lab culminates in applying advanced techniques such as LDA, QDA, and manifold learning to both categorical and non-linear data, equipping you with the skills to enhance model performance across diverse datasets.


Duration
1h 0m
Published
Apr 05, 2024


Table of Contents

  1. Challenge

    Exploring High-Dimensionality Data

    RStudio Guide

    To get started, click on the 'workspace' folder in the bottom right pane of RStudio, then click on the file entitled "Step 1...". You may want to drag the console pane smaller so that you have more room to work. You'll complete each task for Step 1 in that R Markdown file. Remember to run each task's cells with the play button at the top right of each cell before moving on to the next task in the R Markdown file. Continue until you have completed all tasks in this step. When you are ready to move on, come back and click on the file for the next step, and repeat until you have completed all tasks in all steps of the lab.


    Exploring High-Dimensionality Data

    To review the concepts covered in this step, please refer to the Understanding the Importance of Reducing Complexity in Data module of the Performing Dimension Analysis with R course.

    Understanding the importance of reducing complexity in data is crucial because it sets the foundation for effective dimensionality reduction and feature selection. This step will help learners grasp the problems associated with high-dimensionality data and the difference between feature selection and dimensionality reduction.

    Dive into the world of data by exploring the HRIS.csv dataset. Though this dataset is relatively small, the techniques you learn will apply to datasets of any size. Using R, you'll perform exploratory data analysis (EDA) to identify key aspects of the data. Tools such as ggplot2 for visualization and dplyr for data manipulation will be essential in this step. Objective: Gain a comprehensive understanding of high-dimensional data characteristics and the necessity for dimensionality reduction.
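
    The lab overview also mentions stepwise regression, a feature selection approach that keeps or drops whole columns rather than combining them the way dimensionality reduction does. The following optional sketch illustrates the idea; it assumes hr_data has been loaded as in Task 1.2 below and uses numeric columns that appear later in this lab.

    # Optional sketch: backward stepwise selection with base R's step()
    # (assumes hr_data is loaded and contains these numeric columns)
    full_model <- lm(salary ~ benefits + employee_level + tenure_in_days, data = hr_data)
    reduced_model <- step(full_model, direction = "backward")
    summary(reduced_model)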


    Task 1.1: Load Required Libraries

    Before diving into the data, you need to load the necessary libraries for data manipulation and visualization. For this task, load the dplyr and ggplot2 libraries.

    🔍 Hint

    Use the library() function to load each library. For example, to load dplyr, you would use library('dplyr').

    🔑 Solution
    # Load the data manipulation and visualization libraries
    library(dplyr)
    library(ggplot2)
    

    Task 1.2: Read and Inspect the Dataset

    Now that the libraries are loaded, the next step is to read the HRIS.csv file into R as hr_data and inspect the first few rows of the dataset to understand its structure.

    🔍 Hint

    Use the read.csv() function to read the file, and the head() function to display the first few rows. Remember to specify the correct path to the file.

    🔑 Solution
    # Read the HRIS dataset and display the first few rows
    hr_data <- read.csv('HRIS.csv')
    head(hr_data)
    

    Task 1.3: Explore Data Dimensions

    Understanding the dimensions of your dataset is crucial for recognizing high-dimensionality. Use R to find out the number of rows and columns in the hr_data dataset.

    🔍 Hint

    Use the dim() function to get the dimensions of the dataset.

    🔑 Solution
    # Number of rows and columns
    dim(hr_data)
    

    Task 1.4: Summarize the Dataset

    Get a quick summary of the dataset to understand the distribution of data across different columns. This will help in identifying any potential issues with the data.

    🔍 Hint

    Use the summary() function to get a summary of the dataset.

    🔑 Solution
    # Summary statistics for each column
    summary(hr_data)
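
    If the summary reveals missing values, an optional follow-up is to count the NAs in each column:

    # Optional: number of missing values per column
    colSums(is.na(hr_data))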
    

    Task 1.5: Visualize Salary Distribution

    Visualizing data can provide insights that are not immediately obvious from the raw data. Create a histogram to visualize the distribution of salaries in the dataset.

    🔍 Hint

    Use the ggplot() function from ggplot2, along with geom_histogram(), to create the histogram. Make sure to specify salary as the data to plot.

    🔑 Solution
    # Histogram of the salary column (bin width of 5000)
    ggplot(hr_data, aes(x = salary)) +
      geom_histogram(binwidth = 5000)
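
    To look at several dimensions at once rather than one variable at a time, an optional scatterplot matrix works well; the columns named here are the numeric ones used later in the lab.

    # Optional: pairwise scatterplots of a few numeric columns
    pairs(hr_data[, c("salary", "benefits", "employee_level", "tenure_in_days")])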
    
  2. Challenge

    Implementing Principal Components Analysis (PCA)


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Continuous Data module of the Performing Dimension Analysis with R course.

    Principal Components Analysis (PCA) is important because it's a fundamental technique for reducing dimensionality in linear data, helping to mitigate the curse of dimensionality. This step will provide hands-on experience in applying PCA to continuous data.

    In this step, you'll apply Principal Components Analysis (PCA) to the HRIS.csv dataset to reduce its dimensionality. The goal is to extract the most significant components that explain the majority of the variance in the data. You'll use the prcomp function in R. Objective: Learn how to perform PCA on a dataset and interpret the results to understand the underlying structure of the data.


    Task 2.1: Loading and Inspecting the HRIS Dataset

    Load the HRIS.csv dataset into R as hr_data and inspect the first few rows to understand its structure.

    🔍 Hint

    Use the read.csv() function to load the dataset and the head() function to display the first few rows.

    🔑 Solution
    # Load the HRIS dataset
    hr_data <- read.csv('HRIS.csv')
    
    # Display the first few rows of the dataset
    head(hr_data)
    

    Task 2.2: Preprocessing Data for PCA

    Before performing PCA, it's important to preprocess the data. Load the dplyr package to help with preprocessing.

    For PCA, only meaningful numeric variables are appropriate. Subset the data to columns that fit this description.

    🔍 Hint

    Use the select() function from the dplyr package to select variables. Note that the employee_id column is numeric, but is not meaningful for PCA.

    🔑 Solution
    # Load the dplyr package
    library(dplyr)
    
    # Retain only meaningful numeric variables
    hr_data_numeric <- hr_data %>%
      select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
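
    As an optional sanity check before PCA, confirm the retained columns are numeric and note how different their scales are; those scale differences are why the variables are standardized in the next task.

    # Optional: confirm column types and compare scales
    sapply(hr_data_numeric, is.numeric)
    sapply(hr_data_numeric, sd)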
    

    Task 2.3: Performing PCA

    Perform a principal components analysis (PCA) on the dataset. Specify that the variables should be scaled to unit variance (i.e., standardized). Summarize the result. How much cumulative variance is explained by 3 principal components?

    🔍 Hint

    Use the prcomp() function to perform PCA. Make sure to set the scale. argument to TRUE to standardize the variables. Use summary() to summarize the results.

    🔑 Solution
    # Perform PCA
    pca <- prcomp(hr_data_numeric, scale. = TRUE)
    summary(pca)
    
    # ___% of variance is explained by 3 components
    # ~82%
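
    If you prefer a plot over the summary table, the variance explained can also be visualized from the PCA object (an optional extra):

    # Optional: proportion of variance explained by each component
    prop_var <- pca$sdev^2 / sum(pca$sdev^2)
    cumsum(prop_var)

    # Simple scree-style plot
    plot(prop_var, type = "b", xlab = "Principal component", ylab = "Proportion of variance")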
    

    Task 2.4: Visualize with a Biplot

    Use the biplot function to plot the PCA results. Which variables are most related to the first principal component (the x-axis)?

    🔍 Hint

    Use the biplot function with your PCA results as the input argument.

    🔑 Solution
    # Make a biplot
    biplot(pca)
    
    # The variables ____ are most related to the first principal component
    # salary, employee_level, benefits
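
    A common optional follow-up is to keep only the first few component scores as a lower-dimensional version of the data; the scores are stored in pca$x.

    # Optional: retain the first three principal component scores
    hr_data_reduced <- as.data.frame(pca$x[, 1:3])
    head(hr_data_reduced)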
    
  3. Challenge

    Applying Linear Discriminant Analysis (LDA)


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Categorical Data module of the Performing Dimension Analysis with R course.

    Linear Discriminant Analysis (LDA) is important because it provides a method for dimensionality reduction that is particularly useful for categorical data. This step will offer practical experience in using LDA for dimensionality reduction.

    This step involves applying Linear Discriminant Analysis (LDA) to the HRIS.csv dataset to reduce its dimensionality while considering the categorical nature of some of its variables. You'll convert appropriate columns to factors and use the MASS package's lda function to perform LDA. Objective: Understand how to apply LDA to a dataset with categorical variables and interpret the results.


    Task 3.1: Loading and Inspecting the Dataset

    Begin by loading the HRIS.csv dataset into R as hr_data. Use appropriate functions to inspect the dataset, focusing on its structure and the first few rows to understand the data you'll be working with.

    🔍 Hint

    Use the read.csv() function to load the dataset. Then, apply the str() function to inspect the structure and head() to display the first few rows.

    🔑 Solution
    # Load the HRIS.csv dataset
    hr_data <- read.csv('HRIS.csv')
    
    # Inspect the structure of the dataset
    str(hr_data)
    
    # Display the first few rows of the dataset
    head(hr_data)
    

    Task 3.2: Converting Categorical Variables to Factors

    Some columns in the dataset are categorical and should be treated as factors for the analysis. Convert the gender, position, and department columns to factors.

    🔍 Hint

    Use the factor() function to convert the gender, position, and department columns to factors. Assign the converted columns back to the dataset.

    🔑 Solution
    # Convert the specified columns to factors
    hr_data$gender <- factor(hr_data$gender)
    hr_data$position <- factor(hr_data$position)
    hr_data$department <- factor(hr_data$department)
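
    To confirm the conversion worked, you can optionally inspect the levels of one of the new factors and the structure of the converted columns:

    # Optional: verify the factor conversion
    levels(hr_data$position)
    str(hr_data[, c("gender", "position", "department")])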
    

    Task 3.3: Performing LDA

    Load the MASS package and perform linear discriminant analysis (LDA). Predict an employee's position using their salary, benefits, and employee_level. Print the results.

    🔍 Hint

    Use the lda() function to perform LDA. The formula should follow the pattern predicted ~ predictor1 + predictor2....

    🔑 Solution
    # Load the MASS package, which provides lda()
    library(MASS)
    
    # Fit an LDA model predicting position, then print it
    lda_model <- lda(formula = position ~ salary + benefits + employee_level, data = hr_data)
    lda_model
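
    As an optional extension, the fitted model can generate predictions for the same data, which you can compare to the actual positions; predict() on an lda object returns the predicted classes in its class element. The lab overview also mentions QDA, and MASS's qda() accepts the same formula interface if you want to experiment with it.

    # Optional: compare LDA predictions to the actual positions
    lda_pred <- predict(lda_model)
    table(Actual = hr_data$position, Predicted = lda_pred$class)

    # Optional: QDA uses the same formula interface (it may fail if any position group is very small)
    # qda_model <- qda(position ~ salary + benefits + employee_level, data = hr_data)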
    
  4. Challenge

    Exploring Manifold Learning Techniques


    To review the concepts covered in this step, please refer to the Performing Dimensional Analysis for Non-linear Data module of the Performing Dimension Analysis with R course.

    Manifold Learning is crucial for dealing with non-linear data, as it can uncover the underlying structure of data that is not apparent in high-dimensional space. This step will introduce learners to t-SNE, a manifold learning technique.

    Embark on a journey through non-linear data by applying t-SNE to the HRIS.csv dataset. The goal is to discover low-dimensional manifolds that represent the data's structure. You'll use the Rtsne and dplyr packages. Objective: Gain practical experience in applying manifold learning techniques to non-linear data and understand how to interpret the results.


    Task 4.1: Loading the HRIS Dataset

    Begin by loading the HRIS.csv dataset into R as hr_data. This dataset contains various employee details, which you will use to explore manifold learning techniques. Use the read.csv function to load the dataset into a variable named hr_data.

    🔍 Hint

    Use the read.csv function with the file path as its argument.

    🔑 Solution
    hr_data <- read.csv('HRIS.csv')
    

    Task 4.2: Applying t-SNE

    Load the dplyr and Rtsne R packages, which are already installed in this environment. Subset the data to meaningful numeric columns, then apply t-SNE to reduce the dimensionality of the data to 2, storing the result in a variable named tsne_results. Finally, plot the results.

    🔍 Hint

    Use the select function from dplyr to subset to meaningful numeric columns (if you loaded the MASS package in a previous step, you may need to specify dplyr::select). Then, use the Rtsne function on the numeric data with the dimensions set to 2. Plot the results using the plot function and tsne_results$Y.

    🔑 Solution
    # Load the dplyr and Rtsne packages
    library(dplyr)
    library(Rtsne)
    
    # Retain only meaningful numeric variables
    hr_data_numeric <- hr_data %>%
      dplyr::select(salary, benefits, latitude, longitude, employee_level, tenure_in_days)
    
    # Apply t-SNE to the HRIS dataset
    tsne_results <- Rtsne(hr_data_numeric, dims = 2)
    
    # Plot the t-SNE results
    plot(tsne_results$Y)
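
    An optional way to interpret the embedding is to color each point by a categorical variable such as department and look for visible clusters; this assumes department exists in hr_data as in the earlier steps.

    # Optional: color the embedding by department
    plot(tsne_results$Y, col = factor(hr_data$department),
         xlab = "t-SNE 1", ylab = "t-SNE 2")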
    
