Deepika Singh

Hypothesis Testing - Interpreting Data with Statistical Models

  • Aug 16, 2019
  • 15 Min read
  • 81 Views

Introduction

Building predictive models, or carrying out data science research, depends on formulating a hypothesis and drawing conclusions using statistical tests. In this guide, you will learn how to perform these tests using the statistical programming language 'R'.

The most widely used inferential statistical techniques are covered in this guide, as listed below:

  1. One sample T-test

  2. Independent T-test

  3. Chi-square Test

  4. Correlation Test

  5. Analysis of Variance (ANOVA)

We will begin by loading the data.

Data

In this guide, we will be using fictitious data on loan applicants, containing 200 observations and ten variables, as described below:

  1. Marital_status: Whether the applicant is married ("Yes") or not ("No").

  2. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

  3. Income: Annual income of the applicant (in USD).

  4. Loan_amount: Loan amount (in USD) for which the application was submitted.

  5. Credit_score: Whether the applicant's credit score was good ("Good") or not ("Bad").

  6. approval_status: Whether the loan application was approved ("Yes") or not ("No").

  7. Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

  8. gender: Whether the applicant is "Female" or "Male".

  9. age: The applicant's age in years.

  10. work_exp: Work experience in years.

Let us start by loading the required libraries and the data.

```{r}
library(readr)
library(dplyr)
library(mlbench)

# loading the data
df <- read_csv("data_test.csv")
glimpse(df)
```

Output:

```
Observations: 200
Variables: 10
Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
work_exp        <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
```

Key Terms

Before moving ahead to the statistical tests, it is good to understand a few important terms.

  1. Null and Alternative Hypotheses

The statistical tests in this guide rely on testing a null hypothesis, which is specific for each case.

The null hypothesis assumes the absence of an effect or relationship, for example, that there is no difference between two group means, or no correlation or association between two variables.

The alternative hypothesis is simply the contrary of the null hypothesis.

  2. P-value

For any statistical test, the p-value is used to decide whether to reject or fail to reject the null hypothesis. It is defined as the probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed in the data.

  3. Decision Rule

The p-value obtained from the statistical test is compared to a predetermined significance level, 'alpha', which is often taken as 0.05.

The decision rule is: if the p-value for the test is less than 0.05, we reject the null hypothesis, but if it is greater than or equal to 0.05, we fail to reject the null hypothesis.
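This decision rule can be written directly in R. The sketch below uses simulated data (an assumption, since any test result works the same way):

```r
# Decision rule sketch on a hypothetical test result (simulated data)
set.seed(1)
res <- t.test(rnorm(30, mean = 1), mu = 0)  # any htest object has a p.value
alpha <- 0.05

if (res$p.value < alpha) {
  decision <- "Reject the null hypothesis"
} else {
  decision <- "Fail to reject the null hypothesis"
}
decision
```

The same comparison against alpha applies to every test in this guide.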

One Sample T-test

The idea behind a one-sample t-test is to compare the mean of a vector against a theoretical mean. In our data, we will take the 'Income' variable and evaluate it against a theoretical mean.

As per the United States Census Bureau's annual mid year population estimates, the average per capita personal income in the United States, in the year 2018, was USD 53,820. We will be testing a claim that the mean income of the applicants is USD 53,820.

An important assumption of the one-sample t-test is that the variable 'Income' should be approximately normally distributed. The line of code below creates a histogram, which suggests that the distribution is roughly normal.

```{r}
hist(df$Income, main = 'Annual Income of Loan Applicants in USD', xlab = 'Income (USD)')
```

Output:

![image name](https://i.imgur.com/AEpv8FF.png)
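The visual check can be complemented with a formal normality test. The Shapiro-Wilk test (shapiro.test in base R) tests the null hypothesis that a sample comes from a normal distribution; the sketch below uses simulated income values as a stand-in for df$Income, which is an assumption:

```r
# Shapiro-Wilk normality test on simulated incomes
# (substitute df$Income once the real data is loaded)
set.seed(100)
income <- rnorm(200, mean = 62750, sd = 10500)
shapiro.test(income)  # a large p-value is consistent with normality
```

Here a p-value below 0.05 would suggest a departure from normality.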

Since the normality assumption is satisfied, we will go ahead with the t-test. In 'R', this is done with the t.test function, as shown in the line of code below. The first argument is the vector of numbers, 'Income', and the second argument is the theoretical mean, denoted by 'mu'.

```{r}
t.test(df$Income, mu = 53820)
```

Output:

```
        One Sample t-test

data:  df$Income
t = 11.871, df = 199, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 53820
95 percent confidence interval: 61266.55, 64233.45
sample estimates: mean of x = 62750
```

Interpretation of the Output

The output above prints the t-statistic (t = 11.871) and the degrees of freedom, 199 (n - 1). The p-value is close to zero and less than 0.05, so we reject the null hypothesis that the population mean is equal to USD 53,820.

Another point to notice is the line "alternative hypothesis: true mean is not equal to 53820", which corresponds to a two-sided alternative hypothesis. To make it a one-sided t-test, we would pass the argument alternative = "less" or alternative = "greater", which defines the direction of the alternative hypothesis.
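For instance, to test the one-sided claim that the mean income is greater than USD 53,820, we would pass alternative = "greater". The sketch below uses simulated values as a stand-in for df$Income (an assumption), so the exact numbers are illustrative:

```r
# One-sided t-test: alternative = "greater" tests whether the true mean > 53820
set.seed(7)
income <- rnorm(200, mean = 62750, sd = 10500)  # simulated stand-in for df$Income
t.test(income, mu = 53820, alternative = "greater")
```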

Independent T-test

In this test, we compare two independent groups to see if their means are equal. The variable under study is 'work_exp', and we will test whether the average work experience is the same for the male and the female applicants.

The first and second lines of code below create two vectors containing the work experience of the female and male applicants, respectively. We must also check the assumption that both groups are normally distributed, which is done in the remaining lines of code, creating two histograms. The histograms suggest that both variables are approximately normally distributed.

```{r}
f_workexp = df$work_exp[df$gender == 'Female']
m_workexp = df$work_exp[df$gender == 'Male']

# histograms
par(mfrow = c(1, 2))
hist(f_workexp)
hist(m_workexp)
```

Output:

![image name](https://i.imgur.com/5dnhPCi.png)

Since the normality assumption is satisfied, we will perform the t-test using the line of code below.

```{r}
t.test(f_workexp, m_workexp)
```

Output:

```
        Welch Two Sample t-test

data:  f_workexp and m_workexp
t = -0.29465, df = 25.088, p-value = 0.7707
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.7904954, 0.5925894
sample estimates: mean of x = 7.832865, mean of y = 7.931818
```

Interpretation of the Output

Since the p-value of 0.7707 is greater than 0.05, we fail to reject the null hypothesis that the means of these two groups are equal. In other words, there is no significant difference in the work experience of the male and female applicants.
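Note that the output is labeled "Welch Two Sample t-test": by default, R's t.test does not assume equal variances in the two groups. If equal variances can be assumed, the pooled (Student's) version is requested with var.equal = TRUE. A sketch with simulated groups standing in for the real vectors (an assumption, including the group sizes):

```r
# Pooled two-sample t-test (equal variances assumed), vs. the Welch default
set.seed(3)
f_workexp <- rnorm(176, mean = 7.8, sd = 2)  # simulated stand-in
m_workexp <- rnorm(24, mean = 7.9, sd = 2)   # simulated stand-in
t.test(f_workexp, m_workexp, var.equal = TRUE)
```

The Welch default is generally the safer choice when the group variances may differ.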

Chi-square Test of Independence

The chi-square test of independence is used to determine whether there is an association between two categorical variables. In our case, we would like to test whether the marital status of the applicants is associated with the approval status.

The first step is to create a two-way table between the variables under study, which is done in the lines of code below.

```{r}
mar_approval <- table(df$Marital_status, df$approval_status)
mar_approval
```

Output:

```
             No   Yes
  Divorced   31    29
  No         66    10
  Yes        52    12
```

The next step is to generate the expected counts, using the line of code below. This also lets us verify an assumption of the chi-square test: that every expected cell count is at least five.

```{r}
chisq.test(mar_approval, correct = FALSE)$expected
```

Output:

```
             No     Yes
  Divorced   44.70  15.30
  No         56.62  19.38
  Yes        47.68  16.32
```

We are now ready to run the test of independence using the chisq.test function, as shown in the line of code below.

```{r}
chisq.test(mar_approval, correct = FALSE)
```

Output:

```
        Pearson's Chi-squared test

data:  mar_approval
X-squared = 24.095, df = 2, p-value = 5.859e-06
```

Interpretation of the Output

Since the p-value is less than 0.05, we reject the null hypothesis of no association between marital status and approval status. In other words, the two variables appear to be associated.
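To see which cells drive the association, the Pearson residuals of the test can be inspected. The sketch below rebuilds the observed table from the counts shown above, so it runs without the original data frame:

```r
# Pearson residuals: (observed - expected) / sqrt(expected), per cell
mar_approval <- as.table(rbind(Divorced = c(31, 29),
                               No       = c(66, 10),
                               Yes      = c(52, 12)))
colnames(mar_approval) <- c("No", "Yes")
res <- chisq.test(mar_approval, correct = FALSE)
res$residuals  # cells with large absolute residuals contribute most
```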

Correlation Test

Correlation tests are used to determine the presence and strength of a linear relationship between two quantitative variables. In our case, we would like to statistically test whether there is a correlation between the applicants' investment and their work experience.

The first step is to visualize the relationship with a scatter plot, which is done in the line of code below.

```{r}
plot(df$work_exp, df$Investment, main = "Correlation between Investment Levels and Work Experience", xlab = "Work experience in years", ylab = "Investment in USD")
```

Output:

![image name](https://i.imgur.com/pip40R6.png)

The above plot suggests the absence of a linear relationship between the two variables. We can quantify this impression by calculating the correlation coefficient, which is done below.

```{r}
cor(df$Investment, df$work_exp)
```

Output:

```
[1] 0.06168653
```

The value of 0.06 indicates a weak positive linear relationship between the two variables. Let us confirm this with a correlation test, which is done in 'R' with the cor.test() function.

The basic syntax is cor.test(var1, var2, method = "method"), with the default method being "pearson". This is done with the line of code below.

```{r}
cor.test(df$Investment, df$work_exp)
```

Output:

```
        Pearson's product-moment correlation

data:  df$Investment and df$work_exp
t = 0.86966, df = 198, p-value = 0.3855
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.07771964, 0.19872675
sample estimates: cor = 0.06168653
```

Interpretation of the Output

Since the p-value of 0.3855 is greater than 0.05, we fail to reject the null hypothesis that the true correlation between the applicants' investment and their work experience is zero. In other words, the weak positive correlation observed is not statistically significant.
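If the linearity or normality assumptions behind Pearson's test are in doubt, the same function offers a rank-based alternative via method = "spearman". The sketch below uses simulated columns standing in for the real ones (an assumption):

```r
# Spearman rank correlation: a nonparametric alternative to Pearson
set.seed(5)
investment <- runif(200, 40000, 180000)   # simulated stand-in for df$Investment
work_exp <- rnorm(200, mean = 8, sd = 2)  # simulated stand-in for df$work_exp
cor.test(investment, work_exp, method = "spearman")
```

The interpretation of the resulting p-value follows the same decision rule as before.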

Analysis of Variance (ANOVA)

The Analysis of Variance (ANOVA) test is used to determine whether a categorical variable has an effect on a numerical variable. In our case, the null hypothesis is that the applicants' marital status ('Marital_status') has no impact on their income ('Income').

The first step is to calculate the average income of the applicants in each category of the variable 'Marital_status'. The line of code below performs this task.

```{r}
aggregate(Income ~ Marital_status, df, mean)
```

Output:

```
  Marital_status    Income
  Divorced        62166.67
  No              63052.63
  Yes             62937.50
```

The next step is to calculate the standard deviation of income levels within each group, which is done in the line of code below.

```{r}
aggregate(Income ~ Marital_status, df, sd)
```

Output:

```
  Marital_status    Income
  Divorced        11213.14
  No              10345.88
  Yes             10576.82
```
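ANOVA assumes roughly equal variances across groups; a common rule of thumb is that the largest group standard deviation should be less than twice the smallest. The quick check below plugs in the standard deviations just computed:

```r
# Rule of thumb: max group SD should be less than 2x the min group SD
sds <- c(Divorced = 11213.14, No = 10345.88, Yes = 10576.82)
max(sds) / min(sds)      # ratio is about 1.08
max(sds) < 2 * min(sds)  # TRUE: the equal-variance assumption looks reasonable
```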

The standard deviation is calculated to check whether the equal-variance assumption of ANOVA is satisfied. Since the largest standard deviation, 11213 (the 'Divorced' group), is not more than twice the smallest, 10346, we can conclude that the assumption is satisfied and go ahead with the test. The final step is to run the ANOVA and print the summary result, which is done in the lines of code below.

```{r}
anova_1 = aov(df$Income ~ df$Marital_status)
summary(anova_1)
```

Output:

```
                   Df    Sum Sq   Mean Sq F value Pr(>F)
df$Marital_status   2 2.963e+07  14813596    0.13  0.878
Residuals         197 2.249e+10 114182095
```

Interpretation of the Output

Since the p-value of 0.878 is greater than 0.05, we fail to reject the null hypothesis that marital status has no impact on the income levels of the applicants.

Since the ANOVA result is not significant, there is no need to conduct Tukey's HSD post-hoc test to examine the differences between the group (Income) means. However, if the result were significant, we could run the post-hoc test, as shown in the line of code below.

```{r}
TukeyHSD(anova_1)
```

Output:

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = df$Income ~ df$Marital_status)

$`df$Marital_status`
                   diff       lwr      upr     p adj
No-Divorced    885.9649 -3472.022 5243.952 0.8807915
Yes-Divorced   770.8333 -3763.821 5305.487 0.9150500
Yes-No        -115.1316 -4396.338 4166.075 0.9977789
```

All the p-values are greater than 0.05, which suggests that the differences in income levels across the marital-status groups are not significant.

Conclusion

In this guide, you have learned several techniques for performing hypothesis testing for data interpretation. You also learned how to interpret the results of the statistical tests in the context of the null hypothesis. To learn more about data science using 'R', you can explore the other guides on this topic.
