Author avatar

Vivek Kumar

Data Cleaning

Vivek Kumar

  • May 9, 2019
  • 8 Min read
  • 177 Views
  • May 9, 2019
  • 8 Min read
  • 177 Views
Data
Pandas

Introduction

This guide introduces the concept of data cleaning which involves the identification of incomplete data, deletion of coarse data, handling invalid data, etc. Most of the data that appears in the real word scenario is unstructured, requires reshaping, and further data preprocessing. Let us learn a few steps of data cleaning using the Pandas library available in Python.

Consider a CSV file named student_grade.csv which holds data for nearly 1500 students based on four attributes. A good way to get a quick feel of the data is to take a look at the first 10 rows of the dataset as shown below:

Student_IDSt_GRADELeft_SchoolSt_AGE
12K1317.5NO12
22K1527NO12
32K163n/aYES13
42K1749NO13
52K1857NO13.67
62K1749NO13
72K1068NO14.79
82K1856.3YES13
92K1526NO13
102K119n/aNO12

The current data set is already reshaped and therefore we define the following four objectives to perform the task of data cleaning:

  1. Identify the missing grades.

  2. Remove the student records who have left the school.

  3. Identify and drop the duplicate records (repeating entries).

  4. Identify and correct the invalid age.

The Baseline

To initiate the process, the first step is to import the Pandas library and read the dataset.

1
2
3
4
5
import pandas as pd
df = pd.read_csv('student_grade.csv')

# Storing the topmost 10 rows in a new variable df_sub
df_sub = df.head(10)
python

The .head(10) function is used after the df variable to extract the first 10 rows of the DataFrame.

Identify the Missing Values

Before we dive into the code, it is important to understand the possible sources of missing data. Here are a few of the reasons by which missing data can originate:

  • User might have forgotten to fill the data.
  • User might not have full information about the data that is to be filled in.
  • Chances of system error.

In the column St_GRADE we can see the null values as n/a, but we have only a few records in the preview, the complete data might contain any other form of the missing values too. Let us identify the rows with n/a:

1
2
missing_values = 'n/a'
df_sub = df_sub[df_sub.St_GRADE==missing_values]
python

Output:

Student_IDSt_GRADELeft_SchoolSt_AGE
32K163n/aYES13
102K119n/aNO12

Most of the times, missing values are not hard-coded by the user. If the user doesn't mention any specific term (here, n/a) then Pandas will assign a NaN value to each of the missing value cells. Such cells can be identified and handled as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import numpy as np
dummy = pd.Series([45, np.NaN, 12])

# Checking which instances have missing values
dummy.isnull()

# Output:
# 0 False
# 1 True
# 2 False
# dtype: bool

# To fill the missing values, use fillna()
dummy = dummy.fillna(5)
dummy

# Output:
# 0 45.0
# 1 5.0
# 2 12.0
# dtype: float64
python

Notice from the above code that after filling an integer 5 to the dummy Series, we still receive an output with the float datatype. This is due to the presence of NaN initially in the Series whose datatype was float64.

Remove the Students’ Records Who Have Left the School

Next, let us remove the records of the students who have left the school. This is because, moving forward, such data might not be useful or this might lead to increased data size. To achieve this, we subset the DataFrame that includes all the rows where the value of a cell in the St_GRADE column does not equal to YES.

1
2
# Subsetting the DataFrame df based on the St_GRADE column
df_sub = df_sub[df_sub.St_GRADE != 'YES']
python

You can observe the output with the deleted records as shown below:

Student_IDSt_GRADELeft_SchoolSt_AGE
12K1317.5NO12
22K1527NO12
32K1749NO13
42K1857NO13.67
52K1749NO13
62K1068NO14.79
72K1526NO13

Identify and Drop the Duplicate Records

To identify the duplicate records of the DataFrame, we can use the duplicated() function present inside the Pandas library.

1
2
# Checking the duplicate rows
df_sub["is_duplicate"]= df_sub.duplicated()
python

The above code finds the duplicate instances and tags them with a boolean logic True if the rows are duplicate or else False. Later, the result is stored in the column named is_duplicate of the DataFrame df_sub as shown below:

Student_IDSt_GRADELeft_SchoolSt_AGEis_duplicate
12K1317.5NO12False
22K1527NO12False
32K1749NO13False
42K1857NO13.67False
52K1749NO13True
62K1068NO14.79False
72K1526NO13True

Now, let’s drop these duplicate records using the following code:

1
2
# Dropping the duplicate records
df_sub.drop_duplicates()
python

As per the above code, the first occurrence of the duplicate record is kept and the second occurrence of the duplicate record is deleted. However, we can also reverse this process by passing last inside the argument keep as shown:

1
df_sub.drop_duplicates(keep='last')
python

Using either of the above codeblocks, we can drop the duplicate records. However, this will be temporary and there will be no change in the original DataFrame df_sub. To reflect permanent changes to the original DataFrame, we can follow either of these steps:

1
2
3
4
5
# Dropping the duplicate records permanently
df_sub.drop_duplicates(inplace=True)

# Or,
df_sub = df_sub.drop_duplicates().reset_index(drop=True) # Resetting index is optional
python

We can even drop all the instances i.e., the original record as well as the duplicates. This can be achieved using the keep argument and this time passing False.

1
df_sub.drop_duplicates(keep=False, inplace=True)
python

The output of above code is shown below:

Student_IDSt_GRADELeft_SchoolSt_AGEis_duplicate
12K1317.5NO12False
22K1527NO12False
32K1857NO13.67False
42K1068NO14.79False

Identify and Correct Invalid Age of the Student

We consider an age to be invalid if its value is less than or equal to zero. This can be checked using a direct vectorized operation as shown:

1
2
# Counting number of invalid student age
(df_sub.St_AGE <= 0).sum()
python

The result of the above code is zero.

Now, we know there are no invalid ages in the dataset. So, let us try to round up all the ages because few of the given student ages are present in the decimal format. This can be achieved using astype(int) function and storing the results back to the St_AGE column as shown:

1
df_sub.St_AGE = df_sub.St_AGE.astype(int) 
python

The above code changes all the values in the column St_AGE to integer dataype as shown:

Student_IDSt_GRADELeft_SchoolSt_AGEis_duplicate
12K1317.5NO12False
22K1527NO12False
32K1857NO13False
42K1068NO14False

Conclusion

This guide taught you the basic operations used for performing data cleaning by handling duplicate data, missing data, and invalid data.

2