Hypothesis Testing - Interpreting Data with Statistical Models
Aug 16, 2019 • 15 Minute Read
Introduction
Building predictive models, or carrying out data science research, depends on formulating a hypothesis and drawing conclusions using statistical tests. In this guide, you will learn how to perform these tests using the statistical programming language 'R'.
The most widely used inferential statistical techniques are covered in this guide, as listed below:

One-sample T-test

Independent T-test

Chi-square Test

Correlation Test

Analysis of Variance (ANOVA)
We will begin by loading the data.
Data
In this guide, we will be using the fictitious data of loan applicants containing 200 observations and ten variables, as described below:

Marital_status: Whether the applicant is married ("Yes") or not ("No").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score was good ("Good") or not ("Bad").

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

gender: Whether the applicant is "Female" or "Male".

age: The applicant’s age in years.

work_exp: work experience in years.
Let us start by loading the required libraries and the data.
library(readr)
library(dplyr)
library(mlbench)
# Load the data
df <- read_csv("data_test.csv")
glimpse(df)
Output:
Observations: 200
Variables: 10
Marital_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Is_graduate <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
Credit_score <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
gender <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
work_exp <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
Key Terms
Before moving ahead to the statistical tests, it is useful to understand a few important terms.
 Null and Alternative Hypotheses
The statistical tests in this guide rely on testing a null hypothesis, which is specific for each case.
The null hypothesis assumes the absence of a relationship between two or more variables. For example, when comparing two groups, the null hypothesis assumes that there is no difference between the group means; for two variables, it assumes that there is no correlation or association between them.
The alternative hypothesis is simply the contrary of the null hypothesis.
 P-value
For any statistical test, the p-value is a statistic used to decide whether we reject or fail to reject the null hypothesis. It is defined as the probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than what was observed in the data.
 Decision Rule
The p-value, determined by conducting the statistical test, is then compared to a predetermined value 'alpha', which is often taken as 0.05.
The decision rule is: if the p-value for the test is less than 0.05, we reject the null hypothesis, but if it is greater than or equal to 0.05, we fail to reject the null hypothesis.
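The decision rule can be written as a small helper. This is only a sketch: the function name `decide` is hypothetical and not part of the original guide, and the p-values passed to it are the ones reported by the tests later in this guide.

```r
# Hypothetical helper (not from the guide): applies the decision rule
# "reject the null hypothesis if the p-value is less than alpha".
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value < alpha, "reject H0", "fail to reject H0")
}

decide(0.7707)     # p-value from the independent t-test in this guide
decide(5.859e-06)  # p-value from the chi-square test in this guide
decide(0.05)       # boundary case: equal to alpha, so not rejected
```

Note that the comparison is strict, so a p-value exactly equal to alpha falls on the "fail to reject" side, as stated in the rule above.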
One-sample T-test
The idea behind the one-sample t-test is to compare the mean of a vector against a theoretical mean. In our data, we will be taking the 'Income' variable and evaluating it against the theoretical mean.
As per the United States Census Bureau's annual mid-year population estimates, the average per capita personal income in the United States, in the year 2018, was USD 53,820. We will be testing a claim that the mean income of the applicants is USD 53,820.
An important assumption of the one-sample t-test is that the variable 'Income' should be approximately normally distributed. The line of code below creates a histogram, which suggests that the distribution is approximately normal.
hist(df$Income, main='Annual Income of Loan Applicants in USD',xlab='Income(USD)')
Output:
![image name](https://i.imgur.com/AEpv8FF.png)
Since the normality assumption is satisfied, we will go ahead with the t-test. In 'R', the t.test function is used to perform this task, which is done in the line of code below. The first argument is the vector of numbers, 'Income', while the second argument is the theoretical mean, denoted by the notation 'mu'.
t.test(df$Income, mu=53820)
Output:
One Sample t-test
data: df$Income
t = 11.871, df = 199, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 53820
95 percent confidence interval: 61266.55, 64233.45
sample estimates: mean of x: 62750
Interpretation of the Output
The output above prints the t-statistic (t = 11.871) and the degrees of freedom, which is 199 (n - 1). The p-value here is close to 0, and less than 0.05, which means that we would reject the null hypothesis that the population mean is equal to USD 53,820.
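To see where the t-statistic and p-value come from, they can be computed by hand and compared with t.test. The sketch below uses synthetic data, not the guide's 'data_test.csv'; the seed, the vector name `income`, and the standard deviation of 10600 are assumptions made purely for illustration.

```r
set.seed(42)  # synthetic data, NOT the loan applicant data from the guide
income <- rnorm(200, mean = 62750, sd = 10600)  # assumed spread, illustration only
mu0 <- 53820                                    # theoretical mean under the null

n <- length(income)
# t = (sample mean - theoretical mean) / standard error of the mean
t_manual <- (mean(income) - mu0) / (sd(income) / sqrt(n))
# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 1)

res <- t.test(income, mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))
```

The two values should agree, because t.test applies the same formula internally.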
Another point to notice is the line "alternative hypothesis: true mean is not equal to 53820." This corresponds to a two-sided alternative hypothesis. If we wanted to make it a one-sided t-test, we would add the argument alternative = "less" or alternative = "greater", which defines the direction of our alternative hypothesis.
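A minimal sketch of the one-sided variants is shown below. It again uses synthetic data (the seed and the standard deviation are assumptions), so the printed p-values will not match the guide's output.

```r
set.seed(42)  # synthetic data, NOT the loan applicant data from the guide
income <- rnorm(200, mean = 62750, sd = 10600)

# H1: true mean is greater than 53820
greater <- t.test(income, mu = 53820, alternative = "greater")
# H1: true mean is less than 53820
less <- t.test(income, mu = 53820, alternative = "less")

c(greater = greater$p.value, less = less$p.value)
```

The two one-sided p-values are complementary (they add up to 1), so the direction should be chosen before looking at the data.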
Independent T-test
In this test, we are going to compare two independent groups and see if their means are equal. The variable under study is the 'work_exp' variable, and we will test whether the work experience is the same across the male and the female applicants.
The first and second lines of code below create two vectors containing the work experience of the female and male applicants, respectively. We must also check the assumption that both groups are normally distributed. This is done in the remaining lines of code, which create two histograms side by side. The histograms suggest that both variables are approximately normally distributed.
f_workexp = df$work_exp[df$gender=='Female']
m_workexp = df$work_exp[df$gender=='Male']
#histogram
par(mfrow=c(1,2))
hist(f_workexp)
hist(m_workexp)
Output:
![image name](https://i.imgur.com/5dnhPCi.png)
Since the normality assumption is satisfied, we will perform the t-test using the line of code below.
t.test(f_workexp, m_workexp)
Output:
Welch Two Sample t-test
data: f_workexp and m_workexp
t = -0.29465, df = 25.088, p-value = 0.7707
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.7904954, 0.5925894
sample estimates: mean of x mean of y: 7.832865, 7.931818
Interpretation of the Output
Since the p-value of 0.7707 is greater than 0.05, we fail to reject the null hypothesis that the means of these two groups are equal. In other words, there is no significant difference in the work experience of the male and female applicants.
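The output is labelled "Welch Two Sample t-test" because t.test does not assume equal variances by default (var.equal = FALSE), which is also why the degrees of freedom are fractional. The sketch below reproduces the Welch statistic and degrees of freedom by hand on synthetic data; the group sizes, means, and standard deviations are assumptions chosen only for illustration.

```r
set.seed(1)  # synthetic data, NOT the loan applicant data from the guide
f <- rnorm(178, mean = 7.8, sd = 1.5)  # assumed group size and spread
m <- rnorm(22, mean = 7.9, sd = 1.5)   # assumed group size and spread

# squared standard errors of each group mean
v_f <- var(f) / length(f)
v_m <- var(m) / length(m)

# Welch t-statistic and Welch-Satterthwaite degrees of freedom
t_manual <- (mean(f) - mean(m)) / sqrt(v_f + v_m)
df_manual <- (v_f + v_m)^2 /
  (v_f^2 / (length(f) - 1) + v_m^2 / (length(m) - 1))

welch <- t.test(f, m)                     # default: var.equal = FALSE
pooled <- t.test(f, m, var.equal = TRUE)  # classic pooled-variance t-test

c(t = t_manual, df = df_manual)
```

With var.equal = TRUE, the degrees of freedom become the whole number n1 + n2 - 2 instead.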
Chi-square Test of Independence
The Chi-square Test of Independence is used to determine if there is an association between two categorical variables. In our case, we would like to test if the marital status of the applicants has any association with the approval status.
The first step is to create a two-way table between the variables under study, which is done in the lines of code below.
mar_approval <- table(df$Marital_status, df$approval_status)
mar_approval
Output:
No Yes
Divorced 31 29
No 66 10
Yes 52 12
The next step is to generate the expected counts using the line of code below.
chisq.test(mar_approval, correct=FALSE)$expected
Output:
No Yes
Divorced 44.70 15.30
No 56.62 19.38
Yes 47.68 16.32
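The expected counts are what we would see if marital status and approval status were independent: each cell is (row total x column total) / grand total. As a sketch, the two-way table printed earlier can be re-entered by hand (so the CSV file is not needed) and the expected counts recomputed without chisq.test.

```r
# Observed counts copied from the two-way table in this guide
# (rows: Marital_status, columns: approval_status)
mar_approval <- matrix(
  c(31, 66, 52,   # approval_status = "No"
    29, 10, 12),  # approval_status = "Yes"
  nrow = 3,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# expected count = row total * column total / grand total
expected <- outer(rowSums(mar_approval), colSums(mar_approval)) / sum(mar_approval)
round(expected, 2)
```

The result should match the expected counts shown in the output above.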
We are now ready to run the test of independence using the chisq.test function, as shown in the line of code below.
chisq.test(mar_approval, correct=FALSE)
Output:
Pearson's Chi-squared test
data: mar_approval
X-squared = 24.095, df = 2, p-value = 5.859e-06
Interpretation of the Output
Since the p-value is less than 0.05, we reject the null hypothesis that the marital status of the applicants is not associated with the approval status.
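The test statistic itself is the sum of (observed - expected)^2 / expected over all cells, compared against a chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom. The sketch below recomputes it from the two-way table printed in this guide, entered by hand.

```r
# Observed counts copied from the two-way table in this guide
observed <- matrix(
  c(31, 66, 52,   # approval_status = "No"
    29, 10, 12),  # approval_status = "Yes"
  nrow = 3,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic, degrees of freedom, and upper-tail p-value
x_squared <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

c(x_squared = x_squared, df = dof, p_value = p_value)
```

These values should agree with the X-squared, df, and p-value reported by chisq.test above.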
Correlation Test
Correlation Tests are used to determine the presence and extent of a linear relationship between two quantitative variables. In our case, we would like to statistically test if there is a correlation between the applicant’s investment and the work experience.
The first step is to visualize the relationship with the scatter plot, which is done in the line of code below.
plot(df$Investment, df$work_exp, main="Correlation between Investment Levels and Work Experience", xlab="Investment in USD", ylab="Work experience in years")
Output:
![image name](https://i.imgur.com/pip40R6.png)
The above plot suggests the absence of a clear linear relationship between the two variables. We can quantify this by calculating the correlation coefficient, which is done below.
cor(df$Investment, df$work_exp)
Output:
[1] 0.06168653
The value of 0.06 shows a positive but very weak linear relationship between the two variables. Let us further confirm this with the correlation test, which is done in 'R' with the cor.test() function.
The basic syntax is ***cor.test(var1, var2, method = "method")***, with the default method being "pearson". This is done with the line of code below.
cor.test(df$Investment, df$work_exp)
Output:
Pearson's product-moment correlation
data: df$Investment and df$work_exp
t = 0.86966, df = 198, p-value = 0.3855
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.07771964, 0.19872675
sample estimates: cor 0.06168653
Interpretation of the Output
Since the p-value of 0.3855 is greater than 0.05, we fail to reject the null hypothesis that the true correlation between the applicant's investment and their work experience is zero. In other words, the observed correlation is not statistically significant.
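The numbers in this output follow directly from the correlation coefficient and the sample size. As a sketch, the t-statistic, p-value, and confidence interval can be recomputed from r = 0.06168653 and n = 200 (both taken from the output above), using the t transformation for the test and Fisher's z transformation for the interval.

```r
r <- 0.06168653  # correlation coefficient reported in this guide
n <- 200         # number of observations in the data

# t-statistic with n - 2 degrees of freedom, and two-sided p-value
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(lower = z - half_width, upper = z + half_width))

c(t = t_stat, p_value = p_value, ci)
```

Because the interval contains zero, it leads to the same conclusion as the p-value.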
Analysis of Variance (ANOVA)
The Analysis of Variance (ANOVA) test is used to determine whether the mean of a numerical variable ('Income') differs across the groups of a categorical variable ('Marital_status'). In our case, the null hypothesis to test is that the applicant's marital status has no impact on their income level.
The first step is to calculate the average income of the applicants in each category of the variable 'Marital_status'. The line of code below performs this task.
aggregate(Income~Marital_status,df,mean)
Output:
Marital_status Income
Divorced 62166.67
No 63052.63
Yes 62937.50
The next step is to calculate the standard deviation of income levels within each group, which is done in the line of code below.
aggregate(Income~Marital_status,df,sd)
Output:
Marital_status Income
Divorced 11213.14
No 10345.88
Yes 10576.82
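A common rule of thumb for ANOVA compares the largest group standard deviation with the smallest one: a ratio of no more than about 2 is usually taken as acceptable. A minimal sketch of that check, with the three standard deviations copied from the output above:

```r
# Group standard deviations copied from the output in this guide
group_sd <- c(Divorced = 11213.14, No = 10345.88, Yes = 10576.82)

# ratio of the largest to the smallest standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# rule of thumb: the ratio should not exceed 2
sd_ratio < 2
```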
The standard deviations are calculated to check whether the equal-variance assumption of ANOVA is satisfied. Since the largest standard deviation (11213, for the 'Divorced' group) is not more than twice the smallest standard deviation (10345, for the 'No' group), we can conclude that the assumption is satisfied and go ahead with the test. The final step is to run the ANOVA and print the summary result, which is done in the lines of code below.
anova_1 = aov(df$Income~df$Marital_status)
summary(anova_1)
Output:
Df Sum Sq Mean Sq F value Pr(>F)
df$Marital_status 2 2.963e+07 14813596 0.13 0.878
Residuals 197 2.249e+10 114182095
Interpretation of the Output
Since the p-value of 0.878 is greater than 0.05, we fail to reject the null hypothesis that marital status has no impact on the income levels of the applicants.
Since the ANOVA results are not significant, there is no need to conduct Tukey's HSD post-hoc test to understand the differences between the group means of 'Income'. However, if the results were significant, we could have run the post-hoc test, as done in the line of code below.
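The F value in the summary table is the ratio of the between-group mean square to the residual mean square, and the p-value is the upper tail of the F distribution with (2, 197) degrees of freedom. A short sketch that recomputes both from the numbers printed in the table above:

```r
# Values copied from the ANOVA summary table in this guide
ms_between <- 14813596   # Mean Sq for df$Marital_status
ms_within <- 114182095   # Mean Sq for Residuals
df_between <- 2
df_within <- 197

# F statistic and its upper-tail p-value
f_value <- ms_between / ms_within
p_value <- pf(f_value, df1 = df_between, df2 = df_within, lower.tail = FALSE)

round(c(F = f_value, p_value = p_value), 3)
```

An F value well below 1 means the group means vary less than would be expected from the within-group variation alone, which is consistent with the large p-value.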
TukeyHSD(anova_1)
Output:
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = df$Income ~ df$Marital_status)
$`df$Marital_status`
diff lwr upr p adj
No-Divorced 885.9649 -3472.022 5243.952 0.8807915
Yes-Divorced 770.8333 -3763.821 5305.487 0.9150500
Yes-No -115.1316 -4396.338 4166.075 0.9977789
All the adjusted p-values are greater than 0.05, which suggests that the differences in the income levels of the applicants, based on their marital status, are not significant.
Conclusion
In this guide, you have learned several techniques for performing hypothesis testing for data interpretation. You also learned how to interpret the results of the statistical tests in the context of the null hypothesis. To learn more about data science using 'R', please refer to the following guides: