
Designing an Exploratory Data Analysis Research Plan Hands-on Practice
In this lab, Data Exploration, Visualization, and Predictive Analysis Techniques, you'll dive into understanding a dataset through R programming, starting with data structure analysis and advancing to creating insightful visualizations with ggplot2. The lab transitions into predictive modeling, where you'll tackle data wrangling and build models like regression and random forests, assessing their performance for actionable insights. This comprehensive experience equips you with the skills to formulate research questions, explore data relationships, and apply predictive analysis effectively in real-world scenarios.

Challenge
Data Exploration and Visualization
To review the concepts covered in this step, please refer to the Beginning Our Data Exploration module of the Designing an Exploratory Data Analysis Research Plan course.
Data exploration and visualization are essential steps in understanding a dataset and developing meaningful research questions. By examining the dataset's structure, distributions, and relationships between variables, we can uncover patterns, trends, and anomalies that inform our research direction.
Goal: Utilize data exploration and visualization techniques to understand the dataset's characteristics and develop insightful research questions.
- Begin by examining the dataset's structure, missing values, and data types using the `summary` and `glimpse` functions in R. This preliminary analysis will provide a foundational understanding of the data.
- Generate visualizations with `ggplot2` to compare various aspects of the data, such as the relationships between variables. Use these visualizations to identify patterns or inconsistencies that could lead to research questions.
- Use visualizations to develop further insights and questions about the data. These insights will help refine your research questions and direct further analysis.
Task 1.1: Loading the Dataset
Start by loading your dataset into the R environment and assigning it to a variable named `df`. The dataset is stored in a file named `Wall Street Market Data - Fictional.csv` in the current working directory.

Hint
Use the `read.csv` function with the file name as a string argument.

Solution
```r
df <- read.csv('Wall Street Market Data - Fictional.csv')
```
Task 1.2: Exploring Dataset Structure
Load the `dplyr` library. Use the `summary` and `glimpse` functions to explore the structure and distributions of values in your dataset. This will give you a good overview of the data you're working with.

Hint
First, load the `dplyr` package using `library(dplyr)`. Then, apply the `summary` function to `df` to get a summary of the data. Finally, use the `glimpse` function from the `dplyr` package on `df` to get a detailed view of the dataset structure.

Solution
```r
library(dplyr)

# Explore the dataset using summary
summary(df)

# Explore the dataset using glimpse
glimpse(df)
```
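`summary()` flags NA counts per column, but you can also count missing values directly with base R. A short sketch using the built-in `airquality` dataset as a stand-in for `df` (the same dataset is revisited in the modeling step):

```r
# Count missing values per column; airquality stands in for df here
data(airquality)
colSums(is.na(airquality))  # Ozone and Solar.R contain NAs

# str() is the base-R analogue of dplyr's glimpse()
str(airquality)
```

Running this before any cleaning tells you exactly which columns will need imputation or filtering later.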
Task 1.3: Visualizing Data Correlations
Use `ggplot2` to create a scatter plot that shows the relationship between the variables `Open` and `Close`.

Hint
Load the `ggplot2` package using `library(ggplot2)`. Then, use the `ggplot` function to create a scatter plot. You will need to specify `df` as the data argument and use `aes` to define the x and y variables you want to compare.

Solution
```r
library(ggplot2)

# Create a scatter plot to explore correlations between two variables
ggplot(df, aes(x = Open, y = Close)) +
  geom_point()
```
Task 1.4: Formulating Questions from Visualizations
Based on your observations from the data visualizations, formulate questions that could guide further analysis. Write down at least two questions. This task is theoretical and does not require code, but think about how the visualizations you've created could lead to meaningful questions.
Hint
Consider the patterns, trends, or anomalies you observed in the scatter plot. Is there anything odd or unexpected about the data? What other visualizations or tests could you use to understand more about these variables?

Solution
```r
# Example questions based on observations:
# Why are there so few points at middling values? Is data missing from the dataset?
# How would this scatter plot look if we isolated specific `Symbol` values?
# Is there a significant correlation? If so, is the correlation value meaningful in this case?
# Are there any outliers that could indicate special cases or errors in the data?
```
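One way to pursue the correlation question is `cor.test()`, which reports both the correlation coefficient and its p-value. A minimal sketch on synthetic data (with the lab dataset you would call `cor.test(df$Open, df$Close)`; the vectors below are stand-ins):

```r
# Synthetic stand-in for the Open/Close relationship
set.seed(42)
open_prices  <- runif(50, min = 10, max = 100)
close_prices <- open_prices + rnorm(50, sd = 5)  # closely tracks open, plus noise

result <- cor.test(open_prices, close_prices)
result$estimate  # Pearson correlation coefficient
result$p.value   # significance of the correlation
```

A high coefficient with a tiny p-value supports a linear relationship, but as the solution notes, you should still ask whether that correlation is meaningful for your research question.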
Challenge
Data Modeling and Predictive Analysis
To review the concepts covered in this step, please refer to the Data Modeling module of the Designing an Exploratory Data Analysis Research Plan course.
Building predictive models is essential for answering research questions and providing insights. This step will focus on data wrangling, understanding the data, building predictive models, and assessing their performance.
Goal: Build predictive models using the dataset and assess their performance.
- Begin with data wrangling: handle missing values, detect and address anomalies, and prepare the dataset for modeling. Use the `dplyr` package to perform simple mean imputation.
- Build predictive models, including a regression model and a random forest model. Examine the performance of your models on the dataset.
Task 3.1: Impute Missing Values
Load the `dplyr` R package. Then use the `data` function to load the `airquality` dataset available in R, which contains some missing values in the `Ozone` and `Solar.R` columns. Impute these missing values using the mean of their respective columns.

Hint
Use the `mutate` function along with the `ifelse()` function to replace NA values with the column mean. Use the `is.na()` function to identify missing values to replace and the `mean()` function, excluding NA values, to calculate the mean.

Solution
```r
# Load dplyr
library('dplyr')

# Load the airquality dataset
data(airquality)

# Perform mean imputation
airquality <- airquality %>%
  mutate(
    Ozone = ifelse(
      is.na(Ozone),
      mean(Ozone, na.rm = TRUE),
      Ozone
    ),
    Solar.R = ifelse(
      is.na(Solar.R),
      mean(Solar.R, na.rm = TRUE),
      Solar.R
    )
  )
```
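A quick sanity check that mean imputation leaves no NAs behind. This base-R sketch is an addition to the lab (it performs the same imputation without `dplyr`), so it runs standalone:

```r
data(airquality)                # fresh copy with NAs in Ozone and Solar.R
colSums(is.na(airquality))      # NA counts before imputation

# Base-R equivalent of the dplyr mean imputation
for (col in c("Ozone", "Solar.R")) {
  col_mean <- mean(airquality[[col]], na.rm = TRUE)
  airquality[[col]][is.na(airquality[[col]])] <- col_mean
}

colSums(is.na(airquality))      # both columns should now report 0
```

Verifying the NA counts after any imputation step is cheap insurance: a column you forgot to impute will surface immediately rather than as a cryptic modeling error later.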
Task 3.2: Simple Linear Regression Model
Explore the basics of regression by creating a simple linear regression model using the `airquality` dataset. Predict `Ozone` levels using `Solar.R` as the predictor. Summarize the model and plot the regression line on a scatter plot of the data.

Hint
Use the `lm()` function to create a linear model with the formula `Ozone ~ Solar.R`. Use `summary()` to summarize the model. For plotting, use `ggplot()` with `geom_point()` for the scatter plot and `geom_smooth()` with `method='lm'` to add the regression line.

Solution
```r
# Load the necessary libraries
library('ggplot2')

# Create a linear regression model
model <- lm(Ozone ~ Solar.R, data = airquality)

# Summarize the model
summary(model)

# Plot the regression line on a scatter plot
ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
  geom_point() +
  geom_smooth(method = 'lm', col = 'blue')
```
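Beyond reading the printed `summary()`, you can extract individual performance figures from the fitted model and generate predictions for new data. A sketch (the `Solar.R` values passed to `predict()` are arbitrary examples):

```r
data(airquality)
model <- lm(Ozone ~ Solar.R, data = airquality)  # rows with NAs are dropped by default

model_summary <- summary(model)
model_summary$r.squared   # proportion of Ozone variance explained by Solar.R

# Predict Ozone for a few hypothetical Solar.R readings
predict(model, newdata = data.frame(Solar.R = c(100, 200, 300)))
```

A low R-squared here would suggest `Solar.R` alone is a weak predictor, motivating the multi-predictor random forest in the next task.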
Task 3.3: Random Forest Model
Use the `ranger` library to create a random forest model. The model should predict `Ozone` levels using `Solar.R`, `Wind`, and `Temp` as predictors in the `airquality` dataset.

Hint
Use the `ranger()` function with the formula `Ozone ~ Solar.R + Wind + Temp` to create the model. Use `print()` to display a summary of the model.

Solution
```r
# Load the necessary libraries
library('ranger')

# Create a random forest model using the ranger package
model_ranger <- ranger(Ozone ~ Solar.R + Wind + Temp, data = airquality)

# Summarize the model
print(model_ranger)
```
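The step's goal asks you to examine model performance; a common metric is root mean squared error (RMSE). The sketch below computes it for a base-R linear model so it runs without extra packages; for the random forest you would compare `predict(model_ranger, data = airquality)$predictions` against the observed `Ozone` values the same way:

```r
data(airquality)
complete_rows <- na.omit(airquality)   # use complete cases for a fair comparison

# Same predictors as the random forest task, fit with lm for a package-free sketch
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = complete_rows)

preds <- predict(model, newdata = complete_rows)
rmse <- sqrt(mean((complete_rows$Ozone - preds)^2))
rmse  # average prediction error, in the same units as Ozone
```

Note this is in-sample RMSE, which flatters the model; a held-out test set or the out-of-bag error that `ranger` reports gives a more honest estimate.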