Deepika Singh

# Testing for Relationships Between Categorical Variables Using the Chi-Square Test

• Jan 21, 2020
• 9 Min read
• 44,483 Views

## Introduction

Understanding and quantifying the relationship between categorical variables is one of the most important tasks in data science. This is useful not just in building predictive models, but also in data science research work. One statistical test that does this is the chi-square test of independence, which is used to determine whether there is an association between two categorical variables. In this guide, you will learn how to perform the chi-square test using R.

## Data

In this guide, we will be using fictitious data of loan applicants containing 200 observations and ten variables, as described below:

1. `Marital_status` - Whether the applicant is married ("Yes"), not married ("No"), or divorced ("Divorced")

2. `Is_graduate` - Whether the applicant is a graduate ("Yes") or not ("No")

3. `Income` - Annual Income of the applicant (in USD)

4. `Loan_amount` - Loan amount (in USD) for which the application was submitted

5. `Credit_score` - Whether the applicant's credit score was good ("Good") or not ("Bad").

6. `approval_status` - Whether the loan application was approved ("Yes") or not ("No").

7. `Investment` - Investments in stocks and mutual funds (in USD), as declared by the applicant

8. `gender` - Whether the applicant is "Female" or "Male"

9. `age` - The applicant's age in years

10. `work_exp` - The applicant's work experience in years

Let's start by loading the required libraries and the data.

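```r
# Only readr (read_csv) and dplyr (glimpse) are used directly in this guide
library(plyr)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)

# Load the loan applicant data and inspect its structure
dat <- read_csv("data_test.csv")

glimpse(dat)
```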

Output:

```
Observations: 200
Variables: 10
$ Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <chr> "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad", "Bad"...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
```

The output shows that the data has five numerical variables (labeled 'int' or 'dbl') and five character variables (labeled 'chr'). We will convert the five character variables into factors using the code below.

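```r
# Column positions of the five character variables
factor_cols <- c(1, 2, 5, 6, 8)

# Convert each of those columns to a factor
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
```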

Output:

```
Observations: 200
Variables: 10
$ Marital_status  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ Credit_score    <fct> Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad, Bad,...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <fct> Female, Female, Female, Female, Female, Female, Female...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
```
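Selecting columns by position can break if the column order of the file changes. As a minimal sketch of an alternative, the character columns can be picked out by type instead. The `toy` data frame below is a hypothetical stand-in with made-up values, not the guide's data.

```r
# Hypothetical miniature data frame standing in for `dat`; values are made up.
toy <- data.frame(
  Marital_status = c("Yes", "No", "Divorced", "Yes"),
  Income = c(72000L, 64000L, 80000L, 76000L),
  approval_status = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Flag character columns by type rather than by position.
is_chr <- vapply(toy, is.character, logical(1))

# Convert only the flagged columns to factors; other columns are untouched.
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

The same two lines should apply to the real data by replacing `toy` with `dat`, assuming every character column is meant to be treated as categorical.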

### Frequency Table

Before diving into the chi-square test, it's important to understand the frequency table (or matrix) that is used as input to the chi-square function in R. Frequency tables give a first-level view of the relationship between two categorical variables and hint at whether they are dependent.

The `table()` function can be used to create the two-way table between the variables. In the first line of code below, we create a two-way table between the variables, `Marital_status` and `approval_status`. The second line prints the frequency table, while the third line prints the proportion table. The fourth line prints the row proportion table, while the fifth line prints the column proportion table.

```r
# Two-way frequency table
two_way <- table(dat$Marital_status, dat$approval_status)
two_way

prop.table(two_way)    # cell proportions
prop.table(two_way, 1) # row proportions
prop.table(two_way, 2) # column proportions
```

Output:

```
           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12

              No   Yes
  Divorced 0.155 0.145
  No       0.330 0.050
  Yes      0.260 0.060

                  No       Yes
  Divorced 0.5166667 0.4833333
  No       0.8684211 0.1315789
  Yes      0.8125000 0.1875000

                  No       Yes
  Divorced 0.2080537 0.5686275
  No       0.4429530 0.1960784
  Yes      0.3489933 0.2352941
```

The row proportions show that divorced applicants have a much higher approval rate (48.3 percent) than married applicants (18.8 percent) or unmarried applicants (13.2 percent). The column proportions tell the complementary story: divorced applicants account for 56.9 percent of all approved loans. To test whether this pattern is statistically significant, we conduct the chi-square test of independence.
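Because the data file may not be at hand, the following sketch rebuilds the frequency table from the counts printed above and shows how the two margins differ. The dimension names `Marital_status` and `approval_status` are added here for readability and are an assumption; the original table is unnamed.

```r
# Counts copied from the frequency table printed above
# (rows: marital status, columns: approval status).
two_way <- as.table(matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Marital_status = c("Divorced", "No", "Yes"),
    approval_status = c("No", "Yes")
  )
))

# margin = 1 normalizes within each row, so every row sums to 1.
row_prop <- prop.table(two_way, margin = 1)
rowSums(row_prop)

# margin = 2 normalizes within each column, so every column sums to 1.
col_prop <- prop.table(two_way, margin = 2)
colSums(col_prop)

# Approval rate within each marital status is the "Yes" column of row_prop.
row_prop[, "Yes"]
```

The row proportions answer "what share of each marital group was approved?", while the column proportions answer "what share of approved (or rejected) applications came from each marital group?".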

## Steps

We'll be using the chi-square test to determine the association between the two categorical variables, `Marital_status` and `approval_status`. As with any statistical test, we begin by specifying the null and alternative hypotheses.

Null Hypothesis H0: The two variables `Marital_status` and `approval_status` are independent of each other.

Alternative Hypothesis H1: The two variables are related to each other.

The first step is to create a two-way table between the variables under study, which is done in the lines of code below.

```r
mar_approval <- table(dat$Marital_status, dat$approval_status)
mar_approval
```

Output:

```
           No Yes
  Divorced 31  29
  No       66  10
  Yes      52  12
```

The next step is to perform the chi-square test using the `chisq.test()` function. It is easy to use this function as shown below, where the table generated above is passed as an argument to the function, which then generates the test result.

```r
chisq.test(mar_approval)
```

Output:

```
	Pearson's Chi-squared test

data:  mar_approval
X-squared = 24.095, df = 2, p-value = 0.000005859
```

Interpretation: Since the p-value is less than 0.05, we reject the null hypothesis of independence and conclude that the marital status of the applicants is associated with the approval status.
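To see where these numbers come from, the statistic can be reproduced by hand from the counts in the two-way table. The sketch below is illustrative rather than a replacement for `chisq.test()`: it computes the expected counts under independence (row total times column total, divided by the grand total), sums the squared deviations scaled by the expected counts, and looks up the p-value with `pchisq()`. The variable names are chosen here for illustration.

```r
# Observed counts copied from the two-way table above.
observed <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

n <- sum(observed)

# Expected count for each cell if the two variables were independent.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

x_squared
df
p_value
```

The three values should agree with the `X-squared`, `df`, and `p-value` fields reported by `chisq.test()` above.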

Another way of using the function is directly passing in the variables under study as arguments into the `chisq.test()` function, as shown below.

```r
chisq.test(dat$Marital_status, dat$approval_status)
```

Output:

```
	Pearson's Chi-squared test

data:  dat$Marital_status and dat$approval_status
X-squared = 24.095, df = 2, p-value = 0.000005859
```

This produces identical test results, as expected. In the same way, we can test the relationship between other pairs of categorical features.
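The object returned by `chisq.test()` can also be stored and inspected rather than only printed. The sketch below rebuilds the table from the counts shown earlier (so it runs without the data file) and pulls out a few components. The check that all expected counts are at least 5 is a common rule of thumb for the chi-square approximation, not something the guide's output states.

```r
# Two-way table rebuilt from the counts printed earlier in this guide.
mar_approval <- as.table(matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
))

res <- chisq.test(mar_approval)

res$statistic  # the X-squared value
res$parameter  # degrees of freedom
res$p.value    # p-value as a plain number

# Expected counts under independence; small values weaken the approximation.
res$expected
all(res$expected >= 5)

# Pearson residuals show which cells deviate most from independence.
round(res$residuals, 2)
```

Cells with large positive residuals have more observations than independence would predict. Here the largest positive residual falls in the Divorced/Yes cell.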

## Conclusion

In this guide, you have learned about the techniques of finding relationships in data for categorical variables. You also learned about the simple but effective `chisq.test()` function in R and how it can be used to determine the association between two categorical features.
