 Deepika Singh

# Finding Relationships in Data with R

• Nov 12, 2019
• 1,952 Views
Data
R

## Introduction

Building high-performing machine learning models depends on identifying the relationships between the variables. This helps in feature engineering as well as in choosing the machine learning algorithm. In this guide, you will learn techniques for finding relationships in data with R.

## Data

In this guide, we will use a fictitious dataset of loan applicants containing 200 observations and ten variables, as described below:

1. `Marital_status` Whether the applicant is married ("Yes"), not married ("No"), or divorced ("Divorced")

2. `Is_graduate` Whether the applicant is a graduate ("Yes") or not ("No")

3. `Income` Annual income of the applicant (in USD)

4. `Loan_amount` Loan amount (in USD) for which the application was submitted

5. `Credit_score` Whether the applicant's credit score was good ("Good") or not ("Bad")

6. `approval_status` Whether the loan application was approved ("Yes") or not ("No")

7. `Investment` Investments in stocks and mutual funds (in USD), as declared by the applicant

8. `gender` Whether the applicant is "Female" or "Male"

9. `age` The applicant's age in years

10. `work_exp` The applicant's work experience in years

```r
library(plyr)
library(ggplot2)
library(GGally)
library(dplyr)
library(mlbench)

# `dat` is assumed to already hold the loan dataset as a data frame
glimpse(dat)
```

Output:

```
Observations: 200
Variables: 10
$ Marital_status  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Is_graduate     <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", ...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ approval_status <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <chr> "Female", "Female", "Female", "Female", "Female", "Fem...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
```

The output shows that the dataset has five numerical variables (labeled as `int` or `dbl`) and five character variables (labeled as `chr`). We will convert the character variables into `factor` variables using the lines of code below.

```r
# Column positions of the five character variables
factor_cols <- c(1, 2, 5, 6, 8)
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
```

Output:

```
Observations: 200
Variables: 10
$ Marital_status  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Is_graduate     <fct> No, No, No, No, No, No, No, No, No, No, No, No, Yes, Y...
$ Income          <int> 72000, 64000, 80000, 76000, 72000, 56000, 48000, 72000...
$ Loan_amount     <int> 70500, 70000, 275000, 100500, 51500, 69000, 147000, 61...
$ approval_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
$ Investment      <int> 117340, 85340, 147100, 65440, 48000, 136640, 160000, 9...
$ gender          <fct> Female, Female, Female, Female, Female, Female, Female...
$ age             <int> 34, 34, 33, 34, 33, 34, 33, 33, 33, 33, 34, 33, 33, 33...
$ work_exp        <dbl> 8.10, 7.20, 9.00, 8.55, 8.10, 6.30, 5.40, 8.10, 8.10, ...
```
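Selecting columns by position can break if the column order ever changes. As a hedged alternative, the sketch below converts every character column to a factor by checking its type. The `toy` data frame is a made-up stand-in for `dat`, used only so the snippet runs on its own.

```r
# Toy stand-in for `dat` (hypothetical values, not the guide's dataset)
toy <- data.frame(
  Marital_status = c("Yes", "No", "Divorced"),
  Income         = c(72000L, 64000L, 80000L),
  gender         = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Find the character columns by type instead of by position
chr_cols <- vapply(toy, is.character, logical(1))

# Convert only those columns to factors; numeric columns are untouched
toy[chr_cols] <- lapply(toy[chr_cols], factor)

str(toy)
```

The same two lines should work on `dat` itself, assuming its text columns are stored as character vectors.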

## Relationship Between Numerical Variables

Many machine learning algorithms assume that continuous predictor variables are not strongly correlated with each other. Strong correlation among predictors is called "multicollinearity." Establishing relationships between the numerical variables is a common step to detect and treat multicollinearity.

### Correlation Matrix

Creating a correlation matrix is a technique to identify multicollinearity among numerical variables. The lines of code below create the matrix.

```r
# Keep only the five numerical columns
cordata <- dat[, c(3, 4, 7, 9, 10)]
corr <- round(cor(cordata), 1)
corr
```

Output:

```
            Income Loan_amount Investment  age work_exp
Income         1.0         0.0        0.1 -0.2      0.9
Loan_amount    0.0         1.0        0.8  0.0      0.0
Investment     0.1         0.8        1.0  0.0      0.1
age           -0.2         0.0        0.0  1.0     -0.1
work_exp       0.9         0.0        0.1 -0.1      1.0
```

The output above shows a strong positive linear correlation between the variables `Income` and `work_exp` (0.9) and between `Investment` and `Loan_amount` (0.8). All other pairs have correlations close to zero.
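With more variables, scanning the matrix by eye becomes tedious. The sketch below rebuilds the rounded matrix from the output above and lists the pairs whose absolute correlation meets a cutoff. The cutoff of 0.7 is an illustrative assumption, not a rule from this guide.

```r
# Correlation matrix copied from the rounded output above
vars <- c("Income", "Loan_amount", "Investment", "age", "work_exp")
corr <- matrix(
  c( 1.0, 0.0, 0.1, -0.2,  0.9,
     0.0, 1.0, 0.8,  0.0,  0.0,
     0.1, 0.8, 1.0,  0.0,  0.1,
    -0.2, 0.0, 0.0,  1.0, -0.1,
     0.9, 0.0, 0.1, -0.1,  1.0),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars)
)

# Illustrative cutoff (an assumption, adjust to your use case)
threshold <- 0.7

# Use only the upper triangle so each pair is reported once
idx <- which(abs(corr) >= threshold & upper.tri(corr), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
high_pairs
```

In practice you would pass the matrix returned by `cor()` directly instead of typing the values in.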

### Correlation Plot

The correlation can also be visualized using a correlation plot, which is implemented using the `ggcorrplot` package. This library is loaded with the first line of code below.

The second line creates the correlogram plot, where arguments like `colors`, `outline.color`, and `show.legend` are used to control the display of the chart.

```r
library(ggcorrplot)

ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE, lab_size = 3,
           method = "circle", colors = c("blue", "white", "red"),
           outline.color = "gray", show.legend = TRUE, show.diag = FALSE,
           title = "Correlogram of loan variables")
```

Output:

(Figure: correlogram of loan variables)

### Correlation Test

A correlation test is another method to determine the presence and extent of a linear relationship between two quantitative variables. In our case, we would like to statistically test whether there is a correlation between the applicants' investment and work experience.

The first step is to visualize the relationship with a scatter plot, which is done in the line of code below.

```r
# x-axis is Investment and y-axis is work_exp, so label them accordingly
plot(dat$Investment, dat$work_exp,
     main = "Correlation between Investment Levels & Work Exp",
     xlab = "Investment in USD", ylab = "Work experience in years")
```

Output:

(Figure: scatter plot of investment against work experience)

The above plot suggests the absence of a linear relationship between the two variables. We can quantify this inference by calculating the correlation coefficient using the line of code below.

```r
cor(dat$Investment, dat$work_exp)
```

Output:

```
[1] 0.07653245
```

The value of about 0.08 shows a positive but very weak linear relationship between the two variables. Let's check whether it is statistically significant with a correlation test, which is done in R with the `cor.test()` function.

The basic syntax is `cor.test(var1, var2, method = "method")`, with the default method being `"pearson"`. This is done using the line of code below.

```r
cor.test(dat$Investment, dat$work_exp)
```

Output:

```
	Pearson's product-moment correlation

data:  dat$Investment and dat$work_exp
t = 1.0801, df = 198, p-value = 0.2814
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.0628762  0.2130117
sample estimates:
       cor 
0.07653245 
```

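Since the p-value of 0.2814 is greater than 0.05, we fail to reject the null hypothesis that the true correlation between the applicant's investment and their work experience is zero. In other words, the data show no statistically significant linear relationship between the two variables, which is consistent with the confidence interval containing zero.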
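The object returned by `cor.test()` is a list, so the estimate and p-value can be used in code rather than read off the printout. The sketch below illustrates this, along with the `method` argument, on simulated vectors. The variables `x` and `y` are hypothetical stand-ins, not columns of the loan dataset.

```r
set.seed(42)  # simulated data, seeded only so the example is reproducible

# Hypothetical stand-ins for two numeric columns
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Pearson is the default; Spearman is a rank-based alternative
pearson  <- cor.test(x, y)
spearman <- cor.test(x, y, method = "spearman")

# Individual pieces of the test result
pearson$estimate   # sample correlation
pearson$p.value    # p-value of the test
pearson$conf.int   # 95 percent confidence interval

# A simple decision rule at the 5 percent level
if (pearson$p.value < 0.05) {
  message("Reject the null hypothesis of zero correlation")
} else {
  message("Fail to reject the null hypothesis of zero correlation")
}
```

A rank-based method such as Spearman can be a reasonable choice when the relationship is monotonic but not linear.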

Let's consider another example of correlation between `Income` and `work_exp` using the line of code below.

```r
cor.test(dat$Income, dat$work_exp)
```

Output:

```
	Pearson's product-moment correlation

data:  dat$Income and dat$work_exp
t = 25.869, df = 198, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8423810 0.9066903
sample estimates:
      cor 
0.8784546 
```

In this case, the p-value is smaller than 0.05, so we reject the null hypothesis that the true correlation between the applicant's income and their work experience is zero. The data show a statistically significant, strong positive linear relationship between the two variables.
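The t statistics reported in the two tests above follow directly from the sample correlation and the sample size. As a worked check, the sketch below applies the standard formula for the Pearson test to the two estimates printed above, using n = 200 observations. The helper name `t_from_r` is made up for this illustration.

```r
# t statistic for a Pearson correlation r computed from n pairs:
#   t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

n <- 200  # number of observations in the dataset

t_inv <- t_from_r(0.07653245, n)  # Investment vs. work_exp
t_inc <- t_from_r(0.8784546, n)   # Income vs. work_exp

# Two-sided p-value from the t distribution
p_inv <- 2 * pt(-abs(t_inv), df = n - 2)

round(c(t_inv = t_inv, t_inc = t_inc), 3)
round(p_inv, 4)
```

Up to rounding, the results should agree with the `t` and `p-value` entries in the `cor.test()` output.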

## Relationship Between Categorical Variables

In the previous sections, we covered techniques for finding relationships between numerical variables. It is equally important to understand and estimate the relationship between categorical variables.

### Frequency Table

Creating a frequency table is a simple but effective way of examining the joint distribution of two categorical variables. The `table()` function can be used to create a two-way table between two variables.

In the first line of code below, we create a two-way table between the variables `Marital_status` and `approval_status`. The second line prints the frequency table, while the third line prints the proportion table. The fourth line prints the row proportion table, while the fifth line prints the column proportion table.

```r
# 2-way table
two_way <- table(dat$Marital_status, dat$approval_status)
two_way

prop.table(two_way)    # cell proportions
prop.table(two_way, 1) # row proportions
prop.table(two_way, 2) # column proportions
```

Output:

```
# Output - two_way table

          No Yes
Divorced  31  29
No        66  10
Yes       52  12

# Output - cell percentages table

            No   Yes
Divorced 0.155 0.145
No       0.330 0.050
Yes      0.260 0.060

# Output - row percentages table

                No       Yes
Divorced 0.5166667 0.4833333
No       0.8684211 0.1315789
Yes      0.8125000 0.1875000

# Output - column percentages table

                No       Yes
Divorced 0.2080537 0.5686275
No       0.4429530 0.1960784
Yes      0.3489933 0.2352941
```

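The row percentages table shows that divorced applicants (48.3 percent approved) have a higher probability of getting loan approvals than married applicants (18.8 percent) or unmarried applicants (13.2 percent). The column percentages table tells a consistent story: divorced applicants account for 56.9 percent of all approved loans, compared to 23.5 percent for married applicants and 19.6 percent for unmarried applicants. To test whether this insight is statistically significant, we use the chi-square test of independence.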
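The approval rate within each marital status group comes from the row proportions. As a minimal sketch, the code below rebuilds the table from the counts printed above, so it runs without the original data, and extracts the share of approved applications per group.

```r
# Counts copied from the two-way table above
two_way <- as.table(matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(Marital_status  = c("Divorced", "No", "Yes"),
                  approval_status = c("No", "Yes"))
))

# Row proportions: share of each marital status group that was approved
approval_rate <- prop.table(two_way, margin = 1)[, "Yes"]
round(approval_rate, 3)

# Raw counts with row and column totals added
addmargins(two_way)
```

Naming the dimensions in `dimnames` is optional, but it makes the printed table easier to read.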

### Chi-Square Test of Independence

The chi-square test of independence is used to determine whether there is an association between two categorical variables. In our case, we would like to test whether the marital status of the applicants has any association with the approval status.

The first step is to create a two-way table between the variables under study, which is done in the lines of code below.

```r
mar_approval <- table(dat$Marital_status, dat$approval_status)
mar_approval
```

Output:

```
          No Yes
Divorced  31  29
No        66  10
Yes       52  12
```

The next step is to generate the expected counts using the line of code below.

```r
chisq.test(mar_approval, correct = FALSE)$expected
```

Output:

```
            No   Yes
Divorced 44.70 15.30
No       56.62 19.38
Yes      47.68 16.32
```
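Each expected count is the count we would see if the two variables were independent: the row total times the column total, divided by the grand total. As a hedged illustration, the sketch below reproduces the expected counts from the observed counts printed earlier.

```r
# Observed counts copied from the two-way table above
observed <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# Expected count = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected
```

For example, the expected count for divorced applicants who were not approved is 60 * 149 / 200 = 44.7.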

We are now ready to run the test of independence using the `chisq.test` function, as in the line of code below.

```r
chisq.test(mar_approval, correct = FALSE)
```

Output:

```
	Pearson's Chi-squared test

data:  mar_approval
X-squared = 24.095, df = 2, p-value = 5.859e-06
```

Since the p-value is less than 0.05, we reject the null hypothesis that the marital status of the applicants is not associated with the approval status. The data suggest that marital status and approval status are associated in this dataset.
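To see where the test statistic comes from, the sketch below computes Pearson's chi-squared statistic and its p-value by hand from the observed counts. It is meant as a worked check of the `chisq.test()` output, not a replacement for it.

```r
# Observed counts copied from the two-way table above
observed <- matrix(
  c(31, 29,
    66, 10,
    52, 12),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Divorced", "No", "Yes"), c("No", "Yes"))
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p_value = p_value)
```

Up to rounding, these values should match the `X-squared`, `df`, and `p-value` entries reported by `chisq.test()`.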

## Conclusion

In this guide, you have learned techniques for finding relationships in data for both numerical and categorical variables. You also learned how to interpret the results of the tests by statistically validating the relationship between the variables. To learn more about data science using R, please refer to the following guides:

1. Interpreting Data Using Descriptive Statistics with R

2