
Designing an Exploratory Data Analysis Research Plan Hands-on Practice

In this lab, Data Exploration, Visualization, and Predictive Analysis Techniques, you'll dive into understanding a dataset through R programming, starting with data structure analysis and advancing to creating insightful visualizations with ggplot2. The lab transitions into predictive modeling, where you'll tackle data wrangling and build models like regression and random forests, assessing their performance for actionable insights. This comprehensive experience equips you with the skills to formulate research questions, explore data relationships, and apply predictive analysis effectively in real-world scenarios.


Path Info

Duration
30m
Published
Mar 27, 2024


Table of Contents

  1. Challenge

    Data Exploration and Visualization

    To review the concepts covered in this step, please refer to the Beginning Our Data Exploration module of the Designing an Exploratory Data Analysis Research Plan course.

    Data exploration and visualization are essential steps in understanding a dataset and developing meaningful research questions. By examining the dataset's structure, distributions, and relationships between variables, we can uncover patterns, trends, and anomalies that inform our research direction.

    Goal: Utilize data exploration and visualization techniques to understand the dataset's characteristics and develop insightful research questions.

    1. Begin by examining the dataset's structure, missing values, and data types using the summary and glimpse functions in R. This preliminary analysis will provide a foundational understanding of the data.
    2. Generate visualizations with ggplot2 to compare various aspects of the data, such as the relationships between variables. Use these visualizations to identify patterns or inconsistencies that could lead to research questions.
    3. Use visualizations to develop further insights and questions about the data. These insights will help refine your research questions and direct further analysis.

    Task 1.1: Loading the Dataset

    Start by loading your dataset into the R environment and assigning it to a variable named df. The dataset is stored in a file named Wall Street Market Data - Fictional.csv in the current working directory.

    πŸ” Hint

    Use the read.csv function with the file name as a string argument.

    πŸ”‘ Solution
    df <- read.csv('Wall Street Market Data - Fictional.csv')
    

    Task 1.2: Exploring Dataset Structure

    Load the dplyr library. Use the summary and glimpse functions to explore the structure and distributions of values in your dataset. This will give you a good overview of the data you're working with.

    πŸ” Hint

    First, load the dplyr package using library(dplyr). Then, apply the summary function to df to get a summary of the data. Finally, use the glimpse function from the dplyr package on df to get a detailed view of the dataset structure.

    πŸ”‘ Solution
    library(dplyr)
    
    # Explore the dataset using summary
    summary(df)
    
    # Explore the dataset using glimpse
    glimpse(df)
    

    Task 1.3: Visualizing Data Correlations

    Use ggplot2 to create a scatter plot that shows the relationship between the variables Open and Close.

    πŸ” Hint

    Load the ggplot2 package using library(ggplot2). Then, use the ggplot function to create a scatter plot. You will need to specify df as the data argument and use aes to define the x and y variables you want to compare.

    πŸ”‘ Solution
    library(ggplot2)
    
    # Create a scatter plot to explore correlations between two variables
    ggplot(df, aes(x = Open, y = Close)) +
      geom_point()
    

    Task 1.4: Formulating Questions from Visualizations

    Based on your observations from the data visualizations, formulate questions that could guide further analysis. Write down at least two questions. This task is theoretical and does not require code, but think about how the visualizations you've created could lead to meaningful questions.

    πŸ” Hint

    Consider the patterns, trends, or anomalies you observed in the scatter plot. Is there anything odd or unexpected about the data? What other visualizations or tests could you use to understand more about these variables?

    πŸ”‘ Solution
    # Example questions based on observations:
    # Why are there so few points at middling values? Is data missing from the dataset?
    # How would this scatterplot look if we isolated specific `Symbol` values?
    # Is there a significant correlation? If so, is the correlation value meaningful in this case?
    # Are there any outliers that could indicate special cases or errors in the data?
    
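    A few of these questions can be probed directly in code. The sketch below uses a small synthetic data frame as a stand-in, since the lab's CSV is only available in the lab environment; its `Symbol`, `Open`, and `Close` columns mirror the dataset described above, and the values are invented for illustration.

    ```r
    # Synthetic stand-in for the lab dataset (the real CSV lives only in the lab environment)
    df <- data.frame(
      Symbol = rep(c("AAA", "BBB"), each = 5),
      Open   = c(10, 11, 12, 13, 14, 50, 52, 54, 56, 58),
      Close  = c(10.5, 11.2, 12.1, 13.3, 14.2, 51.0, 53.1, 55.2, 56.9, 59.0)
    )

    # Question: is the Open/Close correlation statistically significant?
    ct <- cor.test(df$Open, df$Close)
    ct$estimate   # correlation coefficient
    ct$p.value    # significance of the correlation

    # Question: does the relationship hold within each Symbol?
    by(df, df$Symbol, function(g) cor(g$Open, g$Close))
    ```

    With the real dataset, you could also revisit the scatter plot from the previous task and split it by symbol, for example by adding facet_wrap(~ Symbol) to the ggplot2 call.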
  2. Challenge

    Data Modeling and Predictive Analysis

    To review the concepts covered in this step, please refer to the Data Modeling module of the Designing an Exploratory Data Analysis Research Plan course.

    Building predictive models is essential for answering research questions and providing insights. This step will focus on data wrangling, understanding the data, building predictive models, and assessing their performance.

    Goal: Build predictive models using the dataset and assess their performance.

    1. Begin with data wrangling: handle missing values, detect and address anomalies, and prepare the dataset for modeling. Use the dplyr package to perform simple mean imputation.
    2. Build predictive models, including a regression model and a random forest model. Examine the performance of your models on the dataset.

    Task 2.1: Impute Missing Values

    Load the dplyr R package. Then use the data function to load the airquality dataset available in R, which contains some missing values in the Ozone and Solar.R columns. Impute these missing values using the mean of their respective columns.

    πŸ” Hint

    Use the mutate function along with the ifelse() function to replace NA values with the column mean. Use the is.na() function to identify missing values to replace and the mean() function, excluding NA values, to calculate the mean.

    πŸ”‘ Solution
    # Load dplyr
    library('dplyr')
    
    # Load the airquality dataset
    data(airquality)
    
    # Perform mean imputation
    airquality <- airquality %>%
      mutate(
        Ozone = ifelse(
          is.na(Ozone),
          mean(Ozone, na.rm = TRUE),
          Ozone
        ),
        Solar.R = ifelse(
          is.na(Solar.R),
          mean(Solar.R, na.rm = TRUE),
          Solar.R
        )
      )
    
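    To confirm the imputation worked, you can check that no NA values remain and that the column means are unchanged, since mean imputation preserves the mean. Below is a base-R sketch of the same imputation plus that check:

    ```r
    data(airquality)

    # Record the pre-imputation means (computed over non-missing values)
    ozone_mean <- mean(airquality$Ozone, na.rm = TRUE)
    solar_mean <- mean(airquality$Solar.R, na.rm = TRUE)

    # Base-R equivalent of the dplyr pipeline above
    airquality$Ozone[is.na(airquality$Ozone)] <- ozone_mean
    airquality$Solar.R[is.na(airquality$Solar.R)] <- solar_mean

    # Both counts should now be zero, and the means should be unchanged
    colSums(is.na(airquality[, c("Ozone", "Solar.R")]))
    isTRUE(all.equal(mean(airquality$Ozone), ozone_mean))
    ```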

    Task 2.2: Simple Linear Regression Model

    Explore the basics of regression by creating a simple linear regression model using the airquality dataset. Predict Ozone levels using Solar.R as the predictor. Summarize the model and plot the regression line on a scatter plot of the data.

    πŸ” Hint

    Use the lm() function to create a linear model with the formula Ozone ~ Solar.R. Use summary() to summarize the model. For plotting, use ggplot() with geom_point() for the scatter plot and geom_smooth() with method='lm' to add the regression line.

    πŸ”‘ Solution
    # Load the necessary libraries
    library('ggplot2')
    
    # Create a linear regression model
    model <- lm(Ozone ~ Solar.R, data=airquality)
    
    # Summarize the model
    summary(model)
    
    # Plot the regression line on a scatter plot
    ggplot(airquality, aes(x=Solar.R, y=Ozone)) +
      geom_point() +
      geom_smooth(method='lm', col='blue')
    
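    Beyond the printed summary, the model's key quantities can be pulled out programmatically, which is useful when assessing performance later. A sketch, where the Solar.R value of 200 is just an illustrative input:

    ```r
    data(airquality)

    # lm() drops rows with missing values by default
    model <- lm(Ozone ~ Solar.R, data = airquality)

    # Slope and intercept of the fitted line
    coef(model)

    # Proportion of variance in Ozone explained by Solar.R
    r_squared <- summary(model)$r.squared
    r_squared

    # Predicted Ozone at an illustrative Solar.R value
    predict(model, newdata = data.frame(Solar.R = 200))
    ```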

    Task 2.3: Random Forest Model

    Use the ranger library to create a random forest model. The model should predict Ozone levels using Solar.R, Wind, and Temp as predictors in the airquality dataset.

    πŸ” Hint

    Use the ranger() function with the formula Ozone ~ Solar.R + Wind + Temp to create the model. Use print() to display a summary of the model.

    πŸ”‘ Solution
    # Load the necessary libraries
    library('ranger')
    
    # Create a random forest model using the ranger package
    model_ranger <- ranger(Ozone ~ Solar.R + Wind + Temp, data=airquality)
    
    # Summarize the model
    print(model_ranger)
    
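    One simple way to examine model performance, as step 2 of the goal asks, is an error metric such as RMSE. Below is a base-R sketch that computes in-sample RMSE for the linear regression model above; the same pattern applies to predictions from the ranger model. Note that in-sample error flatters any model, so a train/test split would give a fairer comparison.

    ```r
    data(airquality)

    # Keep only rows where both variables are observed
    complete <- airquality[complete.cases(airquality[, c("Ozone", "Solar.R")]), ]

    model <- lm(Ozone ~ Solar.R, data = complete)
    preds <- predict(model, newdata = complete)

    # Root mean squared error of the in-sample predictions
    rmse <- sqrt(mean((complete$Ozone - preds)^2))
    rmse

    # The regression should beat a mean-only baseline
    rmse < sd(complete$Ozone)
    ```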

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments let you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and learn from your mistakes. Hands-on Labs let you practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author’s guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.