Introduction


Building predictive models, or carrying out data science research, depends on formulating a hypothesis and drawing conclusions using statistical tests. In this guide, you will learn how to perform these tests using the statistical programming language 'R'.

The most widely used inferential statistical techniques are covered in this guide, as listed below:

One sample T-test

Independent T-test

Chi-square Test

Correlation Test

Analysis of Variance (ANOVA)

We will begin by loading the data.

Data

In this guide, we will use fictitious data on loan applicants, containing 200 observations and ten variables, as described below:

Marital_status: Whether the applicant is married ("Yes"), not married ("No"), or divorced ("Divorced").

Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").

Income: Annual Income of the applicant (in USD).

Loan_amount: Loan amount (in USD) for which the application was submitted.

Credit_score: Whether the applicant's credit score was good ("Good") or not ("Bad").

approval_status: Whether the loan application was approved ("Yes") or not ("No").

Investment: Investments in stocks and mutual funds (in USD), as declared by the applicant.

gender: Whether the applicant is "Female" or "Male".

age: The applicant’s age in years.

work_exp: work experience in years.

Let us start by loading the required libraries and the data.

```r
library(readr)
library(dplyr)
library(mlbench)

# Load the data
df <- read_csv("data_test.csv")
glimpse(df)
```

Output:

```
Observations: 200
Variables: 10
Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
work_exp        <dbl> 9.0, 8.0, 10.0, 9.5, 9.0, 7.0, 6.0, 9.0, 9.0, 11.0, 9....
```

Before moving ahead to the statistical tests, it helps to understand a few important terms.

Null and Alternative Hypotheses

The statistical tests in this guide rely on testing a null hypothesis, which is specific to each case.

The null hypothesis assumes the absence of a relationship between two or more variables. For example, when comparing two variables, the null hypothesis assumes that there is no correlation or association between them.

The alternative hypothesis is simply the contrary of the null hypothesis.

P-value

For any statistical test, the p-value is a statistic used to decide whether to reject or fail to reject the null hypothesis. It is defined as the probability of obtaining a result equal to or more extreme than what was observed in the data, assuming that the null hypothesis is true.
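As a rough illustration of this definition, the sketch below recovers a two-sided p-value from a t-statistic and its degrees of freedom using base R's `pt()` function. The inputs are the rounded figures from the Welch t-test output shown later in this guide; the variable names `t_stat`, `dof`, and `p_value` are placeholders chosen for this example.

```r
# Two-sided p-value for a t-statistic: the probability, under the null
# hypothesis, of a t value at least as far from zero as the one observed.
t_stat <- -0.29465   # rounded t-statistic (illustrative input)
dof    <- 25.088     # rounded degrees of freedom (illustrative input)

# pt() is the cumulative distribution function of Student's t;
# doubling the lower tail gives the two-sided probability.
p_value <- 2 * pt(-abs(t_stat), df = dof)
print(round(p_value, 4))
```

Because the inputs are rounded, the result should agree with the reported p-value only to about three or four decimal places.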

Decision Rule

The p-value, determined by conducting the statistical test, is then compared to a predetermined value ‘alpha’, which is often taken as 0.05.

The decision rule is: if the p-value for the test is less than 0.05, we reject the null hypothesis, but if it is greater than or equal to 0.05, we fail to reject the null hypothesis.
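The decision rule can be written as a small helper function. This is only a sketch: `decide()` is a hypothetical name introduced here, not a base R function, and the alpha of 0.05 is the conventional default described above.

```r
# Hypothetical helper (not part of base R) that applies the decision rule:
# reject the null hypothesis only when the p-value is strictly below alpha.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

print(decide(0.7707))     # p-value above alpha
print(decide(5.859e-06))  # p-value below alpha
```

Keeping alpha as an argument makes it explicit that the threshold is chosen before the test is run, rather than adjusted after seeing the p-value.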

One Sample T-test

The idea behind the one-sample t-test is to compare the mean of a vector against a theoretical mean. In our data, we will take the 'Income' variable and evaluate it against a theoretical mean.

As per the United States Census Bureau's annual mid year population estimates, the average per capita personal income in the United States, in the year 2018, was USD 53,820. We will be testing a claim that the mean income of the applicants is USD 53,820.

An important assumption of the one-sample t-test is that the variable, here 'Income', is approximately normally distributed. The line of code below creates a histogram, which suggests that 'Income' is approximately normally distributed.

```r
hist(df$Income,
     main = 'Annual Income of Loan Applicants in USD',
     xlab = 'Income(USD)')
```

Output:

![Histogram of annual income of loan applicants](https://i.imgur.com/AEpv8FF.png)

Since the normality assumption appears to be satisfied, we will go ahead with the t-test. In 'R', the **t.test** function is used to perform this task, as shown in the line of code below. The first argument is the vector of numbers, 'Income', while the second argument, 'mu', is the theoretical mean.

```r
t.test(df$Income, mu = 53820)
```

Output:

```
One Sample t-test

data: df$Income
t = 11.871, df = 199, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 53820
95 percent confidence interval:
 61266.55, 64233.45
sample estimates:
mean of x: 62750
```

The output above prints the t-statistic (t = 11.871) and the degrees of freedom, which is 199 (n - 1). The p-value here is close to 0, and less than 0.05, which means that we would reject the null hypothesis that the population mean is equal to USD 53,820.
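To make the t-statistic and degrees of freedom less of a black box, the sketch below computes them by hand and compares the result with `t.test`. It uses simulated incomes rather than the applicants' data, so the numbers will not match the output above; `income`, `mu0`, and `t_manual` are names made up for this illustration.

```r
set.seed(42)

# Simulated incomes; these are NOT the applicants' data, only a stand-in.
income <- rnorm(200, mean = 62000, sd = 10000)
mu0 <- 53820  # theoretical mean under the null hypothesis

# t = (sample mean - theoretical mean) / standard error of the mean
n <- length(income)
t_manual <- (mean(income) - mu0) / (sd(income) / sqrt(n))

result <- t.test(income, mu = mu0)

print(c(manual = t_manual, t_test = unname(result$statistic)))
print(unname(result$parameter))  # degrees of freedom, n - 1
```

The two t values should agree, and the degrees of freedom equal the sample size minus one.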

Another point to notice is the line "alternative hypothesis: true mean is not equal to 53820". This corresponds to a two-sided alternative hypothesis. To run a one-sided t-test instead, we would add the argument alternative = "less" or alternative = "greater", which defines the direction of the alternative hypothesis.
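The direction of the test is set through the `alternative` argument of `t.test`. The sketch below runs the two-sided and both one-sided versions on simulated data (not the applicants' data) to show how the p-values relate; the vector `x` and its parameters are assumptions made for this example.

```r
set.seed(1)

# Simulated sample whose mean lies above the theoretical mean (a stand-in).
x <- rnorm(200, mean = 56000, sd = 10000)
mu0 <- 53820

two_sided <- t.test(x, mu = mu0)                           # default: "two.sided"
greater   <- t.test(x, mu = mu0, alternative = "greater")  # H1: true mean > mu0
less      <- t.test(x, mu = mu0, alternative = "less")     # H1: true mean < mu0

print(c(two_sided = two_sided$p.value,
        greater   = greater$p.value,
        less      = less$p.value))
```

The smaller of the two one-sided p-values is half the two-sided p-value, and the two one-sided p-values sum to one, which is why the direction should be fixed before looking at the data.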

Independent T-test

In this test, we compare two independent groups to see if their means are equal. The variable under study is the 'work_exp' variable, and we will test whether the work experience is the same across the male and the female applicants.

The *first two lines of code* below create two vectors containing the work experience of the female and male applicants, respectively. We must also check the assumption that both groups are normally distributed. This is done in the *remaining lines of code*, which create two histograms side by side. The histograms suggest that both groups are approximately normally distributed.

```r
f_workexp <- df$work_exp[df$gender == 'Female']
m_workexp <- df$work_exp[df$gender == 'Male']

# Histograms, side by side
par(mfrow = c(1, 2))
hist(f_workexp)
hist(m_workexp)
```

Output:

![Histograms of work experience for female and male applicants](https://i.imgur.com/5dnhPCi.png)

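Since the normality assumption appears to be satisfied, we will perform the t-test using the line of code below. By default, **t.test** performs Welch's t-test, which does not assume that the two groups have equal variances.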

```r
t.test(f_workexp, m_workexp)
```

Output:

```
Welch Two Sample t-test

data: f_workexp and m_workexp
t = -0.29465, df = 25.088, p-value = 0.7707
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.7904954, 0.5925894
sample estimates:
mean of x mean of y: 7.832865, 7.931818
```

Since the p-value of 0.7707 is greater than 0.05, we fail to reject the null hypothesis that the means of these two groups are equal. In other words, there is no significant difference in the work experience of the male and female applicants.
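The fractional degrees of freedom in the output come from the Welch-Satterthwaite approximation. The sketch below reproduces the Welch t-statistic and degrees of freedom by hand on simulated groups and checks them against `t.test`. The group sizes, means, and standard deviations are arbitrary assumptions, not properties of the applicants' data.

```r
set.seed(7)

# Simulated work experience in years for two groups of unequal size.
# All parameters here are arbitrary stand-ins.
grp_a <- rnorm(150, mean = 7.8, sd = 1.5)
grp_b <- rnorm(30,  mean = 7.9, sd = 1.6)

# Squared standard error contributed by each group
va <- var(grp_a) / length(grp_a)
vb <- var(grp_b) / length(grp_b)

# Welch t-statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(grp_a) - mean(grp_b)) / sqrt(va + vb)
df_manual <- (va + vb)^2 /
  (va^2 / (length(grp_a) - 1) + vb^2 / (length(grp_b) - 1))

welch <- t.test(grp_a, grp_b)

print(c(t_manual = t_manual, df_manual = df_manual))
print(c(t = unname(welch$statistic), df = unname(welch$parameter)))
```

If equal variances can be justified, passing `var.equal = TRUE` to `t.test` switches to the classic pooled-variance test with integer degrees of freedom.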

Chi-square Test

The Chi-square Test of Independence is used to determine if there is an association between two categorical variables. In our case, we would like to test if the marital status of the applicants has any association with the approval status.

The first step is to create a two-way table between the variables under study, which is done in the lines of code below.

```r
mar_approval <- table(df$Marital_status, df$approval_status)
mar_approval
```

Output:

```
           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12
```

The next step is to generate the expected counts using the line of code below.

```r
chisq.test(mar_approval, correct = FALSE)$expected
```

Output:

```
              No   Yes
  Divorced 44.70 15.30
  No       56.62 19.38
  Yes      47.68 16.32
```
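The expected counts are what each cell would contain if marital status and approval status were independent: the row total times the column total, divided by the grand total. The sketch below rebuilds them from the observed counts printed above; the object names `observed` and `expected` are chosen for this example.

```r
# Observed counts copied from the two-way table above
observed <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(expected)
```

A common rule of thumb is that the chi-square approximation is reasonable when the expected counts are all at least 5, which is the case here.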

We are now ready to run the test of independence using the **chisq.test** function, as shown in the line of code below.

```r
chisq.test(mar_approval, correct = FALSE)
```

Output:

```
Pearson's Chi-squared test

data: mar_approval
X-squared = 24.095, df = 2, p-value = 5.859e-06
```

Since the p-value is less than 0.05, we reject the null hypothesis that the marital status of the applicants is not associated with the approval status. In other words, the data suggest an association between marital status and loan approval.
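The reported statistic can be reproduced from the observed counts alone. The sketch below computes Pearson's X-squared, the degrees of freedom, and the p-value by hand using base R's `pchisq()`; the variable names are placeholders for this example.

```r
# Observed counts copied from the two-way table
# (rows: Divorced, No, Yes; columns: approval No, Yes)
observed <- matrix(c(31, 29, 66, 10, 52, 12), nrow = 3, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's statistic: squared deviations scaled by the expected counts
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

print(c(x_squared = x_squared, df = dof, p_value = p_value))
```

Most of the statistic comes from the 'Divorced' row, where the observed approvals are far from the expected counts.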

Correlation Test

Correlation tests are used to determine the presence and extent of a linear relationship between two quantitative variables. In our case, we would like to statistically test if there is a correlation between the applicant’s investment and the work experience.

The first step is to visualize the relationship with the scatter plot, which is done in the line of code below.

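```r
# Investment is on the x-axis and work experience on the y-axis,
# so the axis labels follow the same order.
plot(df$Investment, df$work_exp,
     main = "Correlation between Investment Levels and Work Experience",
     xlab = "Investment in USD",
     ylab = "Work experience in years")
```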

Output:

![Scatter plot of investment and work experience](https://i.imgur.com/pip40R6.png)

The above plot suggests the absence of a linear relationship between the two variables. We can quantify this by calculating the correlation coefficient, which is done below.

```r
cor(df$Investment, df$work_exp)
```

Output:

```
[1] 0.06168653
```

The value of 0.06 shows a positive but weak linear relationship between the two variables. Let us further confirm this with the correlation test, which is done in 'R' with the **cor.test()** function.

The basic syntax is **cor.test(var1, var2, method = "method")**, with the default method being "pearson". This is done with the line of code below.

```r
cor.test(df$Investment, df$work_exp)
```

Output:

```
Pearson's product-moment correlation

data: df$Investment and df$work_exp
t = 0.86966, df = 198, p-value = 0.3855
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.07771964, 0.19872675
sample estimates:
cor 0.06168653
```

Since the p-value of 0.3855 is greater than 0.05, we fail to reject the null hypothesis that the true correlation between the applicant’s investment and their work experience is zero. In other words, the correlation is not statistically significant.
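The t-statistic in the correlation test follows directly from the correlation coefficient and the sample size. The sketch below reproduces it from the values reported above, using the standard relationship for Pearson's correlation; the variable names are chosen for this illustration.

```r
r <- 0.06168653  # sample correlation reported by cor()
n <- 200         # number of observations in the data

# Test statistic for H0: true correlation is zero, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from Student's t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, df = n - 2, p_value = p_value))
```

With 200 observations, a correlation this close to zero produces a small t value, which is why the test does not reject the null hypothesis.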

Analysis of Variance (ANOVA)

The Analysis of Variance (ANOVA) test is used to determine whether the mean of a numerical variable ('Income') differs across the groups of a categorical variable ('Marital_status'). In our case, the null hypothesis to test is that the applicant’s marital status has no impact on their income level.

The first step is to calculate the average income of the applicants in each category of the variable 'Marital_status'. The line of code below performs this task.

```r
aggregate(Income ~ Marital_status, df, mean)
```

Output:

```
Marital_status   Income
      Divorced 62166.67
            No 63052.63
           Yes 62937.50
```

The next step is to calculate the standard deviation of income levels within each group, which is done in the line of code below.

```r
aggregate(Income ~ Marital_status, df, sd)
```

Output:

```
Marital_status   Income
      Divorced 11213.14
            No 10345.88
           Yes 10576.82
```

The standard deviations are calculated to check whether the equal-variance assumption of ANOVA is satisfied. Since the largest standard deviation, 11213 for the 'Divorced' group, is not more than twice the smallest standard deviation, 10345, we can conclude that the assumption is satisfied and go ahead with the test. The final step is to run the ANOVA test and print the summary result, which is done in the lines of code below.

```r
anova_1 <- aov(df$Income ~ df$Marital_status)
summary(anova_1)
```

Output:

```
                   Df    Sum Sq   Mean Sq F value Pr(>F)
df$Marital_status   2 2.963e+07  14813596    0.13  0.878
Residuals         197 2.249e+10 114182095
```

Since the p-value of 0.878 is greater than 0.05, we fail to reject the null hypothesis that marital status has no impact on the income levels of the applicants.
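The F value and p-value in the ANOVA table can be traced back to its mean squares. The sketch below recomputes them from the figures printed in the summary using base R's `pf()`; since those figures are rounded, the match is approximate, and the variable names are chosen for this example.

```r
# Mean squares and degrees of freedom copied from the ANOVA summary.
# Each mean square is the sum of squares divided by its degrees of freedom.
ms_between <- 14813596   # df$Marital_status row
ms_within  <- 114182095  # Residuals row
df_between <- 2
df_within  <- 197

# F is the ratio of between-group to within-group variability
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(f_value = f_value, p_value = p_value))
```

An F value well below 1 means the group means differ less than the spread within the groups would lead us to expect, which is consistent with the large p-value.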

Since the ANOVA results are not significant, there is no need to conduct **Tukey’s HSD post-hoc tests** to understand the differences in the group (Income) means. However, if the results were significant, we could have run the post-hoc test, as done in the line of code below.

```r
TukeyHSD(anova_1)
```

Output:

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = df$Income ~ df$Marital_status)

$`df$Marital_status`
                  diff       lwr      upr     p adj
No-Divorced   885.9649 -3472.022 5243.952 0.8807915
Yes-Divorced  770.8333 -3763.821 5305.487 0.9150500
Yes-No       -115.1316 -4396.338 4166.075 0.9977789
```

All the p-values are greater than 0.05, which suggests that the variation in the income level of the applicants, based on their marital status, is not significant.

Conclusion

In this guide, you have learned several techniques for performing hypothesis testing for data interpretation. You also learned how to interpret the results of the statistical tests in the context of the null hypothesis. To learn more about data science using 'R', please refer to the following guides:
