Coping with Missing, Invalid and Duplicate Data in R

Learn how to deal with missing, invalid and duplicate data in the statistical programming language R, to get your data ready for predictive modeling.

Oct 24, 2019 • 15 Minute Read

Subscribe to the newsletter

Introduction

A vital component of data science is cleaning the data and getting it ready for predictive modeling. The most common problem related to data cleaning is coping with missing data, invalid records and duplicate values.

In this guide, you will learn about techniques for dealing with missing, invalid and duplicate data in the statistical programming language R.

Data

In this guide, we will be using a fictitious dataset of loan applications containing 600 observations and 12 variables:

1. UID - Unique identifier for an applicant

2. Marital_status - Whether the applicant is married ("Yes") or not ("No")

3. Dependents - Number of dependents of the applicant

4. Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No")

5. Income - Annual income of the applicant (in USD)

6. Loan_amount - Loan amount (in USD) for which the application was submitted

7. Term_months - Tenure of the loan

8. Credit_score - Whether the applicants credit score is good ("Satisfactory") or not ("Not Satisfactory")

9. Approval_status - Whether the loan application was approved ("1") or not ("0")

10. Age - The applicant's age in years

11. Sex - Whether the applicant is a male ("M") or a female ("F")

12. Purpose - Purpose of applying for the loan

Let’s start by loading the required libraries and the data.

      library(readr)

dat <- read_csv("data_cleaning.csv")
glimpse(dat)
    

Output:

      Observations: 600
Variables: 12
$ UID             <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status  <chr> "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Y...
$ Dependents      <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", ...
$ Income          <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount     <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months     <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score    <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age             <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex             <chr> "M", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M",...
$ Purpose         <chr> "Education", "Travel", "Personal", "Business", "Educat...
    

The output shows that the dataset has six numerical variables (labeled as 'int'), while the remaining six are categorical variables (labelled as 'chr'). We will convert these into 'factor' variables, except for the 'UID' variable, using the line of code below.

      names <- c(2,4,8,9,11,12)
dat[,names] <- lapply(dat[,names] , factor)
glimpse(dat)
    

Output:

      Observations: 600
Variables: 12
$ UID             <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status  <fct> No, Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes...
$ Dependents      <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Y...
$ Income          <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount     <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months     <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age             <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex             <fct> M, M, F, F, F, M, M, M, M, M, M, M, M, F, F, M, M, M, ...
$ Purpose         <fct> Education, Travel, Personal, Business, Education, Educ...
    

We are now ready to carry out the data cleaning steps in the following sections.

Duplicate Values

The first step is to check for duplicate records, one of the most common errors in real world data. Duplicate records increase computation time and decrease model accuracy, and hence must be removed. In our dataset, 'UID' is the unique identifier variable and will be used to drop the duplicate records. The first line of code below uses the duplicated() function to find duplicates, while the second line prints the number of duplicates.

      dup_records <- duplicated(dat$UID)
sum(dup_records)
    

Output:

      1] 3

The output shows that there are three duplicate records. We will drop these records using the first line of code below. The second line prints the dimension of the resulting dataset — 597 observations and 12 variables.

      dat <- dat[!duplicated(dat$UID), ]

dim(dat)

Output:

      1] 597  12

Invalid Values

When we looked at the data using the glimpse() function in the previous section, we realized that the age variable has incorrect entries. Let’s look at the summary of the age variable.

      summary(dat$Age)

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -12.00   36.50   51.00   49.03   61.00   76.00       2
    

The output shows that the minimum value of the variable 'Age' is -12. This is impossible and brings us to the next common problem in real world datasets: the presence of inaccurate records. It is safe to assume that for loan applications, the minimum age should be 18 years. This means that we will remove records of applicants below 18 years of age.

The first two of lines of code below give us the number of records in the dataset for which the age is below 18 years. The number of such records is seven, and they are removed with the third line of code. The fourth line prints the dimensions of the new data — 590 observations and 12 variables.

Finally, we relook at the summary of the age variable. This shows that the range of age is now 23 to 76 years, indicating that the correction has been made.

      age_18 <- dat[which(dat$Age<18),]
dim(age_18)

dat <- dat[-which(dat$Age<18),]
dim(dat)

summary(dat$Age)
    

Output:

      1] 590  12
 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  23.00   37.00   51.00   49.72   61.00   76.00       2
    

It was easy to detect incorrect entries in the age variable. In other cases, invalid values are in the form of outliers. Let’s look at an example of the 'Income' variable. The summary of the variable shows that the minimum and maximum income levels are USD 136700, and 3321001, respectively. This is a highly skewed range, indicating some extreme values. To better understand the distribution, we use the quantile function, which gives us the first to hundredth percentile values of the variable in the sequence of unit percentile.

      summary(dat$Income)

quantile(dat$Income,seq(0,1,0.01))

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 136700  386700  512800  687874  775300 3321001


    0%      1%      2%      3%      4%      5%      6%      7%      8%      9% 
 136700  136700  210676  240000  244288  254540  263684  274664  286108  295100 
    10%     11%     12%     13%     14%     15%     16%     17%     18%     19% 
 301560  311100  317700  320000  329100  332220  333300  335288  344400  346700 
    20%     21%     22%     23%     24%     25%     26%     27%     28%     29% 
 352860  358800  363084  371396  383332  386700  391172  397980  401508  405556 
    30%     31%     32%     33%     34%     35%     36%     37%     38%     39% 
 410220  412724  421052  422244  424804  431960  437016  444400  448620  454972 
    40%     41%     42%     43%     44%     45%     46%     47%     48%     49% 
 458920  465068  468448  476468  479696  486180  491380  495548  500000  506956 
    50%     51%     52%     53%     54%     55%     56%     57%     58%     59% 
 512800  515552  523184  532012  536480  551820  555504  563080  572852  577700 
    60%     61%     62%     63%     64%     65%     66%     67%     68%     69% 
 585380  607584  611276  620300  625904  633300  648308  656708  666700  683156 
    70%     71%     72%     73%     74%     75%     76%     77%     78%     79% 
 700000  721040  733300  753968  761484  775300  788132  800000  807740  821696 
    80%     81%     82%     83%     84%     85%     86%     87%     88%     89% 
 834660  853300  880008  914712  963752 1010680 1058180 1111100 1149276 1219460 
    90%     91%     92%     93%     94%     95%     96%     97%     98%     99% 
1262060 1333300 1392412 1502676 1664032 1944400 2064768 2223884 2608396 3197268 
   100% 
3321001
    

We can remove the outliers using the method described in the previous section. We can also address them through a different method of flooring and capping the extreme values. The first line of code below does the flooring of the lower outliers at the first percentile value, i.e., USD 136700. Similarly, the second line performs the capping of the higher outliers at the 99th percentile value, i.e., USD 3321001.

The third line of code prints the new summary of the variable, indicating that the correction has been done.

      dat$Income[which(dat$Income<136700)]<- 136700

dat$Income[which(dat$Income > 3321001)]<- 3321001

summary(dat$Income)

Output:

      Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 136700  386175  508650  685301  772650 3321001
    

Missing Values

Missing value treatment is the most common data cleaning step performed in a data science project. The line of code below prints the number of missing values in each of the variables.

      sapply(dat, function(x) sum(is.na(x)))

Output:

      UID  Marital_status      Dependents     Is_graduate          Income 
              0               0               3               6               0 
    Loan_amount     Term_months    Credit_score approval_status             Age 
              7               6               0               0               2 
            Sex         Purpose 
              0               0
    

The output above shows the presence of missing values across the variables, most of which are numerical variables, except 'Is_graduate', which is a categorical variable.

Missing Value Imputation for Numerical Variables

The most widely used technique for imputing values for a numerical variable is to replace the missing values with the mean or the median value. In the lines of code below, we replace missing values in 'Loan_amount' with the median value, while the missing values in 'Term_months' are replaced by the mean value. The output shows that the missing values have been imputed.

      dat$Loan_amount[is.na(dat$Loan_amount)] <- median(dat$Loan_amount, na.rm = TRUE)
table(is.na(dat$Loan_amount))

dat$Term_months[is.na(dat$Term_months)] <- mean(dat$Term_months, na.rm = TRUE)
table(is.na(dat$Term_months))
    

Output:

      FALSE 
  590 

FALSE 
  590
    

Missing Value Imputation for Categorical Variables

For categorical variables, it is important to understand the frequency distribution, which can be printed with the line of code below.

      table(dat$Is_graduate)

Output:

      No Yes 
127 457
    

The output shows that most applicants were graduates, identified with the label 'Yes'. The lines of code below replace the missing values with the highest frequency label, 'Yes'.

      dat$Is_graduate[is.na(dat$Is_graduate)] <- "Yes"

table(dat$Is_graduate)

Output:

      No Yes 
127 463
    

Removing Rows with Missing Values

We have imputed missing values using measures of central tendency: mean, median and mode. Another technique is to delete rows where any variable has missing values. This is performed using the na.omit() function, which removes all the rows containing missing values.

      dat <- na.omit(dat)

dim(dat)

Output:

      1] 585  12

The resulting data has 585 observations of 12 variables. We can do a final check to see if all the missing values have been removed using the command below.

      sapply(dat, function(x) sum(is.na(x)))

Output:

      UID  Marital_status      Dependents     Is_graduate          Income 
              0               0               0               0               0 
    Loan_amount     Term_months    Credit_score approval_status             Age 
              0               0               0               0               0 
            Sex         Purpose 
              0               0
    

We can look at the data and post all these transformations using the glimpse command below.

      glimpse(dat)

Output:

      Observations: 585
Variables: 12
$ UID             <chr> "UIDA209", "UIDA238", "UIDA256", "UIDA274", "UIDA430",...
$ Marital_status  <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, No, No, Yes...
$ Dependents      <int> 0, 2, 0, 2, 0, 0, 0, 1, 3, 0, 0, 1, 0, 0, 1, 0, 0, 0, ...
$ Is_graduate     <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, ...
$ Income          <dbl> 1984000, 1066700, 834100, 775900, 421100, 402300, 1111...
$ Loan_amount     <int> 1070000, 111000, 89000, 1330000, 49500, 56500, 104000,...
$ Term_months     <dbl> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score    <fct> Satisfactory, Satisfactory, Not _satisfactory, Satisfa...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, ...
$ Age             <int> 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25...
$ Sex             <fct> M, M, M, M, M, F, M, M, M, M, F, M, M, M, M, F, M, M, ...
$ Purpose         <fct> Personal, Personal, Personal, Travel, Personal, Person...
    

Conclusion

In this guide, you have learned methods of dealing with missing, invalid and duplicate data in R. These techniques will help you in cleaning data and making it ready for machine learning. To learn more about data science using 'R', please refer to the following guides: