This guide introduces the concept of data cleaning, which involves identifying incomplete data, deleting coarse data, handling invalid data, and so on. Most data in real-world scenarios is unstructured and requires reshaping and further preprocessing.
Let us learn a few steps of data cleaning using the Pandas library in Python.
Consider a CSV file named student_grade.csv which holds data for nearly 1,500 students across four attributes. A good way to get a quick feel for the data is to take a look at the first 10 rows of the dataset:
The current dataset is already reshaped, so we define the following four objectives for the data cleaning task:
Identify the missing grades.
Remove the student records who have left the school.
Identify and drop the duplicate records (repeating entries).
Identify and correct the invalid age.
To initiate the process, the first step is to import the Pandas library and read the dataset.
import pandas as pd

df = pd.read_csv('student_grade.csv')

# Storing the topmost 10 rows in a new variable df_sub
df_sub = df.head(10)
The .head(10) function is called on the df DataFrame to extract its first 10 rows.
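Alongside .head(), the .shape attribute and the .info() method give a quick structural summary of a DataFrame. A minimal sketch, using a tiny hypothetical frame in place of the real student_grade.csv:

```python
import pandas as pd

# A tiny stand-in frame (hypothetical data, not the real student_grade.csv)
df = pd.DataFrame({'St_NAME': ['Alice', 'Bob'],
                   'St_AGE': [15.5, 16.0]})

# .shape gives (rows, columns); .info() prints column dtypes and non-null counts
print(df.shape)  # (2, 2)
df.info()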
Before we dive into the code, it is important to understand that missing data can originate from a number of sources, for example, values that were never recorded or errors made during data entry.
In the St_GRADE column we can see the null values recorded as n/a. Since we have only a few records in the preview, the complete data might contain other forms of missing values too. Let us identify the rows with this hard-coded marker:
missing_values = 'n/a'
df_sub = df_sub[df_sub.St_GRADE == missing_values]
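Instead of filtering for such markers after loading, read_csv can convert them to NaN at load time via its na_values parameter. A small sketch, using an inline CSV and a hypothetical 'missing' marker in place of the real file:

```python
import io
import pandas as pd

# Inline CSV standing in for student_grade.csv; 'missing' is a hypothetical marker
csv_data = io.StringIO("St_NAME,St_GRADE\nAlice,A\nBob,missing\nCara,B\n")

# Declaring custom markers via na_values converts them to NaN while reading
df = pd.read_csv(csv_data, na_values=['missing'])
print(df.St_GRADE.isnull().sum())  # 1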
Most of the time, missing values are not hard-coded by the user. If no specific marker (here, n/a) is used, Pandas assigns a NaN value to each missing cell. Such cells can be identified and handled as follows:
import numpy as np

dummy = pd.Series([45, np.nan, 12])

# Checking which instances have missing values
dummy.isnull()
# Output:
# 0    False
# 1     True
# 2    False
# dtype: bool

# To fill the missing values, use fillna()
dummy = dummy.fillna(5)
dummy
# Output:
# 0    45.0
# 1     5.0
# 2    12.0
# dtype: float64
Notice from the above code that even after filling the integer 5 into the dummy Series, we still receive output with the float datatype. This is because NaN is itself a float, so the presence of NaN forced the whole Series to the float64 datatype from the start.
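If keeping an integer datatype matters, Pandas also offers the nullable Int64 dtype, which represents missing values as pd.NA instead of forcing a float conversion. A brief sketch:

```python
import numpy as np
import pandas as pd

# With the default dtypes, NaN forces the whole Series to float
s_float = pd.Series([45, np.nan, 12])
print(s_float.dtype)  # float64

# The nullable integer dtype keeps integers and marks missing values as <NA>
s_int = pd.Series([45, None, 12], dtype='Int64')
print(s_int.dtype)               # Int64
print(s_int.fillna(5).tolist())  # [45, 5, 12]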
Next, let us remove the records of the students who have left the school; going forward, such data is not useful and only increases the data size. To achieve this, we subset the DataFrame to keep all rows where the value in the St_GRADE column does not equal 'YES':
# Subsetting the DataFrame df based on the St_GRADE column
df_sub = df_sub[df_sub.St_GRADE != 'YES']
You can verify that the matching records have been removed from df_sub.
To identify the duplicate records of the DataFrame, we can use Pandas' duplicated() function.
# Checking the duplicate rows
df_sub["is_duplicate"] = df_sub.duplicated()
The above code flags each row with the boolean True if it duplicates an earlier row, and False otherwise. The result is stored in a new column named is_duplicate of the DataFrame df_sub.
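Summing the boolean flags gives a quick count of the duplicate rows. A small sketch on a hypothetical frame with one repeated record:

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df_sub = pd.DataFrame({'St_NAME': ['Alice', 'Bob', 'Alice'],
                       'St_GRADE': ['A', 'B', 'A']})

df_sub["is_duplicate"] = df_sub.duplicated()
print(df_sub["is_duplicate"].tolist())  # [False, False, True]
print(df_sub["is_duplicate"].sum())     # 1 duplicate row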
Now, let’s drop these duplicate records using the following code:
# Dropping the duplicate records
df_sub.drop_duplicates()
As per the above code, the first occurrence of a duplicated record is kept and the later occurrences are deleted. However, we can reverse this behaviour by passing 'last' to the keep argument.
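The reversed behaviour can be sketched as follows, using a small hypothetical frame where the third row repeats the first:

```python
import pandas as pd

# Hypothetical frame with one duplicated row (indices 0 and 2 match)
df_sub = pd.DataFrame({'St_NAME': ['Alice', 'Bob', 'Alice'],
                       'St_GRADE': ['A', 'B', 'A']})

# keep='last' retains the final occurrence and drops the earlier ones
deduped = df_sub.drop_duplicates(keep='last')
print(deduped.index.tolist())  # [1, 2]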
Using either of the above code blocks, we can drop the duplicate records. However, the change is only temporary: drop_duplicates() returns a new DataFrame and leaves the original df_sub unmodified. To make the change permanent, we can follow either of these steps:
# Dropping the duplicate records permanently
df_sub.drop_duplicates(inplace=True)

# Or,
df_sub = df_sub.drop_duplicates().reset_index(drop=True)  # Resetting index is optional
We can even drop all the instances of a duplicated record, i.e., the original occurrence as well as its copies. This can be achieved by this time passing False to the keep argument.
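A brief sketch of keep=False, again on a hypothetical frame with one repeated record:

```python
import pandas as pd

# Hypothetical frame: 'Alice' appears twice
df_sub = pd.DataFrame({'St_NAME': ['Alice', 'Bob', 'Alice'],
                       'St_GRADE': ['A', 'B', 'A']})

# keep=False drops every row that has a duplicate, the original included
unique_only = df_sub.drop_duplicates(keep=False)
print(unique_only.St_NAME.tolist())  # ['Bob']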
We consider an age to be invalid if its value is less than or equal to zero. This can be checked using a direct vectorized operation as shown:
# Counting the number of invalid student ages
(df_sub.St_AGE <= 0).sum()
The result of the above code is zero.
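Had the count been non-zero, one option is to mark the invalid ages as missing so they can be imputed later with the techniques shown earlier. A sketch on a hypothetical Series containing one invalid age:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; -1.0 is invalid by our rule (age <= 0)
ages = pd.Series([15.5, -1.0, 16.0])

# Count the invalid (non-positive) ages
n_invalid = (ages <= 0).sum()
print(n_invalid)  # 1

# Replace invalid ages with NaN so standard missing-value handling applies
ages = ages.mask(ages <= 0, np.nan)
print(ages.isnull().sum())  # 1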
Now we know there are no invalid ages in the dataset. However, a few of the student ages are stored in decimal format, so let us convert them to whole numbers. Note that the astype(int) function truncates the decimal part rather than rounding; we apply it and store the results back in the St_AGE column as shown:
df_sub.St_AGE = df_sub.St_AGE.astype(int)
The above code changes all the values in the St_AGE column to the integer datatype.
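The difference between truncating and rounding can be sketched on a small hypothetical Series; if rounding to the nearest integer is the intent, apply round() before astype(int):

```python
import pandas as pd

# Hypothetical decimal ages
ages = pd.Series([15.6, 16.2, 17.9])

# astype(int) truncates toward zero...
print(ages.astype(int).tolist())          # [15, 16, 17]

# ...while round() rounds to the nearest integer first
print(ages.round().astype(int).tolist())  # [16, 16, 18]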
This guide taught you the basic operations used for performing data cleaning by handling duplicate data, missing data, and invalid data.