
Designing an Exploratory Data Analysis Research Plan Hands-on Practice

In this lab, Data Exploration, Visualization, and Predictive Analysis Techniques, you'll dive into understanding a dataset through R programming, starting with data structure analysis and advancing to creating insightful visualizations with ggplot2. The lab transitions into predictive modeling, where you'll tackle data wrangling and build models like regression and random forests, assessing their performance for actionable insights. This comprehensive experience equips you with the skills to formulate research questions, explore data relationships, and apply predictive analysis effectively in real-world scenarios.


Path Info

Duration
30m
Published
Mar 27, 2024


Table of Contents

  1. Challenge

    Data Exploration and Visualization

    To review the concepts covered in this step, please refer to the Beginning Our Data Exploration module of the Designing an Exploratory Data Analysis Research Plan course.

    Data exploration and visualization are essential steps in understanding a dataset and developing meaningful research questions. By examining the dataset's structure, distributions, and relationships between variables, we can uncover patterns, trends, and anomalies that inform our research direction.

    Goal: Utilize data exploration and visualization techniques to understand the dataset's characteristics and develop insightful research questions.

    1. Begin by examining the dataset's structure, missing values, and data types using the summary and glimpse functions in R. This preliminary analysis will provide a foundational understanding of the data.
    2. Generate visualizations with ggplot2 to compare various aspects of the data, such as the relationships between variables. Use these visualizations to identify patterns or inconsistencies that could lead to research questions.
    3. Use visualizations to develop further insights and questions about the data. These insights will help refine your research questions and direct further analysis.

    Task 1.1: Loading the Dataset

    Start by loading your dataset into the R environment and assigning it to a variable named df. The dataset is stored in a file named Wall Street Market Data - Fictional.csv in the current working directory.

    πŸ” Hint

    Use the read.csv function with the file name as a string argument.

    πŸ”‘ Solution
    df <- read.csv('Wall Street Market Data - Fictional.csv')
    

    Task 1.2: Exploring Dataset Structure

    Load the dplyr library. Use the summary and glimpse functions to explore the structure and distributions of values in your dataset. This will give you a good overview of the data you're working with.

    πŸ” Hint

    First, load the dplyr package using library(dplyr). Then, apply the summary function to df to get a summary of the data. Finally, use the glimpse function from the dplyr package on df to get a detailed view of the dataset structure.

    πŸ”‘ Solution
    library(dplyr)
    
    # Explore the dataset using summary
    summary(df)
    
    # Explore the dataset using glimpse
    glimpse(df)
    

    Task 1.3: Visualizing Data Correlations

    Use ggplot2 to create a scatter plot that shows the relationship between the variables Open and Close.

    πŸ” Hint

    Load the ggplot2 package using library(ggplot2). Then, use the ggplot function to create a scatter plot. You will need to specify df as the data argument and use aes to define the x and y variables you want to compare.

    πŸ”‘ Solution
    library(ggplot2)
    
    # Create a scatter plot to explore correlations between two variables
    ggplot(df, aes(x = Open, y = Close)) +
      geom_point()
    

    Task 1.4: Formulating Questions from Visualizations

    Based on your observations from the data visualizations, formulate questions that could guide further analysis. Write down at least two questions. This task is theoretical and does not require code, but think about how the visualizations you've created could lead to meaningful questions.

    πŸ” Hint

    Consider the patterns, trends, or anomalies you observed in the scatter plot. Is there anything odd or unexpected about the data? What other visualizations or tests could you use to understand more about these variables?

    πŸ”‘ Solution
    # Example questions based on observations:
    # Why are there so few points at middling values? Is data missing from the dataset?
    # How would this scatterplot look if we isolated specific `Symbol` values?
    # Is there a significant correlation? If so, is the correlation value meaningful in this case?
    # Are there any outliers that could indicate special cases or errors in the data?
    
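    A few of these questions can be probed directly in code. The sketch below uses a small synthetic data frame as a stand-in, since the lab's CSV is only available in the lab environment; its `Symbol`, `Open`, and `Close` columns mirror the dataset described above, and the values are invented for illustration.

    ```r
    # Synthetic stand-in for the lab dataset (the real CSV lives only in the lab environment)
    df <- data.frame(
      Symbol = rep(c("AAA", "BBB"), each = 5),
      Open   = c(10, 11, 12, 13, 14, 50, 52, 54, 56, 58),
      Close  = c(10.5, 11.2, 12.1, 13.3, 14.2, 51.0, 53.1, 55.2, 56.9, 59.0)
    )

    # Question: is the Open/Close correlation statistically significant?
    ct <- cor.test(df$Open, df$Close)
    ct$estimate   # correlation coefficient
    ct$p.value    # significance of the correlation

    # Question: does the relationship hold within each Symbol?
    by(df, df$Symbol, function(g) cor(g$Open, g$Close))
    ```

    With the real dataset, you could also revisit the scatter plot from the previous task and split it by symbol, for example by adding facet_wrap(~ Symbol) to the ggplot2 call.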
  2. Challenge

    Data Modeling and Predictive Analysis

    To review the concepts covered in this step, please refer to the Data Modeling module of the Designing an Exploratory Data Analysis Research Plan course.

    Building predictive models is essential for answering research questions and providing insights. This step will focus on data wrangling, understanding the data, building predictive models, and assessing their performance.

    Goal: Build predictive models using the dataset and assess their performance.

    1. Begin with data wrangling: handle missing values, detect and address anomalies, and prepare the dataset for modeling. Use the dplyr package to perform simple mean imputation.
    2. Build predictive models, including a regression model and a random forest model. Examine the performance of your models on the dataset.

    Task 2.1: Impute Missing Values

    Load the dplyr R package. Then use the data function to load the airquality dataset available in R, which contains some missing values in the Ozone and Solar.R columns. Impute these missing values using the mean of their respective columns.

    πŸ” Hint

    Use the mutate function along with the ifelse() function to replace NA values with the column mean. Use the is.na() function to identify missing values to replace and the mean() function, excluding NA values, to calculate the mean.

    πŸ”‘ Solution
    # Load dplyr
    library('dplyr')
    
    # Load the airquality dataset
    data(airquality)
    
    # Perform mean imputation
    airquality <- airquality %>%
      mutate(
        Ozone = ifelse(
          is.na(Ozone),
          mean(Ozone, na.rm = TRUE),
          Ozone
        ),
        Solar.R = ifelse(
          is.na(Solar.R),
          mean(Solar.R, na.rm = TRUE),
          Solar.R
        )
      )
    
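    To confirm the imputation worked, you can check that no NA values remain and that the column means are unchanged, since mean imputation preserves the mean. Below is a base-R sketch of the same imputation plus that check:

    ```r
    data(airquality)

    # Record the pre-imputation means (computed over non-missing values)
    ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
    solar_mean <- mean(airquality$Solar.R, na.rm = TRUE)

    # Base-R equivalent of the dplyr pipeline above
    airquality$Ozone[is.na(airquality$Ozone)] <- ozone_mean
    airquality$Solar.R[is.na(airquality$Solar.R)] <- solar_mean

    # Both counts should now be zero, and the means should be unchanged
    colSums(is.na(airquality[, c("Ozone", "Solar.R")]))
    isTRUE(all.equal(mean(airquality$Ozone), ozone_mean))
    ```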

    Task 2.2: Simple Linear Regression Model

    Explore the basics of regression by creating a simple linear regression model using the airquality dataset. Predict Ozone levels using Solar.R as the predictor. Summarize the model and plot the regression line on a scatter plot of the data.

    πŸ” Hint

    Use the lm() function to create a linear model with the formula Ozone ~ Solar.R. Use summary() to summarize the model. For plotting, use ggplot() with geom_point() for the scatter plot and geom_smooth() with method='lm' to add the regression line.

    πŸ”‘ Solution
    # Load the necessary libraries
    library('ggplot2')
    
    # Create a linear regression model
    model <- lm(Ozone ~ Solar.R, data=airquality)
    
    # Summarize the model
    summary(model)
    
    # Plot the regression line on a scatter plot
    ggplot(airquality, aes(x=Solar.R, y=Ozone)) +
      geom_point() +
      geom_smooth(method='lm', col='blue')
    
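    Beyond the printed summary, the model's key quantities can be pulled out programmatically, which is useful when assessing performance later. A sketch, where the Solar.R value of 200 is just an illustrative input:

    ```r
    data(airquality)

    # lm() drops rows with missing values by default
    model <- lm(Ozone ~ Solar.R, data = airquality)

    # Slope and intercept of the fitted line
    coef(model)

    # Proportion of variance in Ozone explained by Solar.R
    r_squared <- summary(model)$r.squared
    r_squared

    # Predicted Ozone at an illustrative Solar.R value
    predict(model, newdata = data.frame(Solar.R = 200))
    ```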

    Task 2.3: Random Forest Model

    Use the ranger library to create a random forest model. The model should predict Ozone levels using Solar.R, Wind, and Temp as predictors in the airquality dataset.

    πŸ” Hint

    Use the ranger() function with the formula Ozone ~ Solar.R + Wind + Temp to create the model. Use print() to display a summary of the model.

    πŸ”‘ Solution
    # Load the necessary libraries
    library('ranger')
    
    # Create a random forest model using the ranger package
    model_ranger <- ranger(Ozone ~ Solar.R + Wind + Temp, data=airquality)
    
    # Summarize the model
    print(model_ranger)
    
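    One simple way to examine model performance, as step 2 of the goal asks, is an error metric such as RMSE. Below is a base-R sketch that computes in-sample RMSE for the linear regression model above; the same pattern applies to predictions from the ranger model. Note that in-sample error flatters any model, so a train/test split would give a fairer comparison.

    ```r
    data(airquality)

    # Keep only rows where both variables are observed
    complete <- airquality[complete.cases(airquality[, c("Ozone", "Solar.R")]), ]

    model <- lm(Ozone ~ Solar.R, data = complete)
    preds <- predict(model, newdata = complete)

    # Root mean squared error of the in-sample predictions
    rmse <- sqrt(mean((complete$Ozone - preds)^2))
    rmse

    # The regression should beat a mean-only baseline
    rmse < sd(complete$Ozone)
    ```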

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments let you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and learn from your mistakes. Hands-on Labs let you practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author’s guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.