Testing for Relationships Between Categorical Variables Using the Chi-Square Test
Jan 21, 2020 • 9 Minute Read
Introduction
Understanding and quantifying the relationship between categorical variables is one of the most important tasks in data science. It is useful not only for building predictive models but also for research work. One statistical test for this purpose is the Chi-Square Test of Independence, which is used to determine whether there is an association between two categorical variables. In this guide, you will learn how to perform the chi-square test using R.
Data
In this guide, we will be using fictitious data of loan applicants containing 200 observations and ten variables, as described below:

1. Marital_status - Whether the applicant is married ("Yes"), not married ("No"), or divorced ("Divorced")
2. Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No")
3. Income - Annual income of the applicant (in USD)
4. Loan_amount - Loan amount (in USD) for which the application was submitted
5. Credit_score - Whether the applicant's credit score was good ("Good") or not ("Bad")
6. approval_status - Whether the loan application was approved ("Yes") or not ("No")
7. Investment - Investments in stocks and mutual funds (in USD), as declared by the applicant
8. gender - Whether the applicant is "Female" or "Male"
9. age - The applicant's age in years
10. work_exp - The applicant's work experience in years
Let's start by loading the required libraries and the data.
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)
dat <- read_csv("data_test.csv")
glimpse(dat)
Output:
Observations: 200
Variables: 10
$ Marital_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Is_graduate <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
The output shows that the data has five numerical variables (labeled 'int' or 'dbl') and five character variables (labeled 'chr'). We will convert the character variables into factors using the lines of code below.
# Column positions of the five character variables
names <- c(1,2,5,6,8)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
Output:
Observations: 200
Variables: 10
$ Marital_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender <fct> Female, Female, Female, Female, Female, Female, Female...
$ age <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
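The conversion above relies on hard-coded column positions, which can break silently if the column order changes. As a minimal sketch of an alternative, the character columns can be selected by type instead. The `toy` data frame below is a hypothetical stand-in, not the guide's data, so that the snippet runs without the CSV file.

```r
# Hypothetical toy data standing in for data_test.csv (not the guide's data)
toy <- data.frame(
  Marital_status = c("Yes", "No", "Divorced", "Yes"),
  Is_graduate    = c("No", "Yes", "Yes", "No"),
  Income         = c(72000L, 64000L, 80000L, 76000L),
  stringsAsFactors = FALSE
)

# Flag the character columns by type rather than by position
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only the flagged columns to factors; numeric columns are untouched
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

The same two lines should work on `dat` after it is loaded, assuming every character column is meant to be treated as categorical.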
Frequency Table
Before diving into the chi-square test, it's important to understand the frequency table or matrix that is used as an input for the chi-square function in R. Frequency tables are an effective way of finding dependence, or the lack of it, between two categorical variables. They also give a first-level view of the relationship between the variables.
The table() function can be used to create the two-way table between the variables. In the first line of code below, we create a two-way table between the variables Marital_status and approval_status. The second line prints the frequency table, while the third line prints the proportion table. The fourth line prints the row proportion table, while the fifth line prints the column proportion table.
# 2-way table
two_way <- table(dat$Marital_status, dat$approval_status)
two_way
prop.table(two_way) # cell proportions
prop.table(two_way, 1) # row proportions
prop.table(two_way, 2) # column proportions
Output:
           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12

              No   Yes
  Divorced 0.155 0.145
  No       0.330 0.050
  Yes      0.260 0.060

                  No       Yes
  Divorced 0.5166667 0.4833333
  No       0.8684211 0.1315789
  Yes      0.8125000 0.1875000

                  No       Yes
  Divorced 0.2080537 0.5686275
  No       0.4429530 0.1960784
  Yes      0.3489933 0.2352941
The column proportions show that divorced applicants account for 56.9 percent of all approved loans, compared with 23.5 percent for married applicants. The row proportions point the same way: 48.3 percent of divorced applicants were approved, versus 18.8 percent of married applicants. To test whether this difference is statistically significant, we conduct the chi-square test of independence.
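Row and column proportions answer different questions, so it can help to look at the approval rate within each marital status group directly. The sketch below rebuilds the table from the counts printed above, so it runs without the CSV file. The `dimnames` labels and the `approval_rate` name are illustrative choices, not part of the original code.

```r
# Counts copied from the frequency table printed above
two_way <- as.table(matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Marital_status  = c("Divorced", "No", "Yes"),
                  approval_status = c("No", "Yes"))
))

# Counts with row and column totals appended
addmargins(two_way)

# Share of each marital status group that was approved (row proportions)
approval_rate <- prop.table(two_way, 1)[, "Yes"]
round(approval_rate, 3)
```

Reading along the rows gives the approval rate for each group, which is usually the quantity of interest when approval is the outcome.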
Steps
We'll be using the chi-square test to determine the association between the two categorical variables, Marital_status and approval_status. As with any statistical test, we begin by specifying the null and alternative hypotheses.
Null Hypothesis H0: The two variables Marital_status and approval_status are independent of each other.
Alternative Hypothesis H1: The two variables are related to each other.
The first step is to create a two-way table between the variables under study, which is done in the lines of code below.
mar_approval <- table(dat$Marital_status, dat$approval_status)
mar_approval
Output:
           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12
The next step is to perform the chi-square test using the chisq.test() function. The table generated above is passed as an argument to the function, which then generates the test result.
chisq.test(mar_approval)
Output:
Pearson's Chi-squared test
data: mar_approval
X-squared = 24.095, df = 2, p-value = 0.000005859
Interpretation: Since the p-value is less than 0.05, we reject, at the 5 percent significance level, the null hypothesis that the marital status of the applicants is not associated with the approval status.
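To see where the test statistic comes from, it can be reproduced by hand from the frequency table. The sketch below compares the observed counts with the counts expected under independence, using the counts printed earlier. The variable names (`observed`, `expected`, `chi_sq`, `dof`, `p_value`) are illustrative.

```r
# Observed counts copied from the two-way table above
observed <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

round(chi_sq, 3)
dof
p_value
```

The statistic and degrees of freedom agree with the chisq.test() output above. The match is exact here because the table is larger than 2x2, so no continuity correction is applied.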
Another way of using the function is directly passing in the variables under study as arguments into the chisq.test() function, as shown below.
chisq.test(dat$Marital_status, dat$approval_status)
Output:
Pearson's Chi-squared test
data: dat$Marital_status and dat$approval_status
X-squared = 24.095, df = 2, p-value = 0.000005859
This produces the same test results, as expected. Similarly, we can test the relationship between other categorical features.
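The object returned by chisq.test() can also be stored and inspected, which is useful for checking the test's assumptions and for seeing which cells drive the result. The following is a minimal sketch using the counts from the table above. The name `res` is an illustrative choice, and the "expected counts of at least 5" check is a common rule of thumb rather than something stated in this guide.

```r
# Rebuild the two-way table from the counts shown earlier
mar_approval <- as.table(matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Marital_status  = c("Divorced", "No", "Yes"),
                  approval_status = c("No", "Yes"))
))

# Store the result instead of only printing it
res <- chisq.test(mar_approval)

res$statistic   # the X-squared value
res$parameter   # degrees of freedom
res$p.value     # p-value as a plain number

# Expected counts under independence; rule of thumb: all should be >= 5
round(res$expected, 2)
all(res$expected >= 5)

# Standardized residuals: large absolute values mark cells that deviate most
round(res$stdres, 2)
```

A positive standardized residual means a cell has more observations than independence would predict, and a negative one means fewer.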
Conclusion
In this guide, you have learned about the techniques of finding relationships in data for categorical variables. You also learned about the simple but effective chisq.test() function in R and how it can be used to determine the association between two categorical features.
To learn more about data science using 'R', please refer to the following guides: