Reading and Writing Files

Vivek Kumar

  • Mar 7, 2019
  • 8 Min read
  • 55 Views

Data
Pandas
Introduction

In this guide, you'll learn about the Pandas I/O tools using CSV reader and writer.

Consider a CSV file named pandas_read.csv which holds data for nearly 1000 students based on 11 attributes. A preview of the first few rows of the file is shown below:

  | ID1  | Id2  | Id2.1 | ID3  | Age | kg | kg.1 | cm  | DOB        | Study | Grade
0 | ?    | ?    | ?     | ?    | ?   | ?  | ?    | ?   | ?          | ?     | ?
1 | ?    | ?    | ?     | ?    | ?   | ?  | ?    | ?   | ?          | ?     | ?
2 | ?    | ?    | ?     | ?    | ?   | ?  | ?    | ?   | ?          | ?     | ?
3 | 2748 | 1847 | 1847  | 2898 | 48  | 80 | 80   | 151 | 11/13/1990 | Ph.D. | A-

The above data has been read directly into a Pandas DataFrame. So, let us learn to implement a custom read and write procedure, keeping the following objectives in mind:

  1. There are two columns, Id2.1 and kg.1, which are duplicates of the Id2 and kg columns. Drop these two columns and read only the unique columns.

  2. Remove the first three rows which consist of garbage data.

  3. The last three rows hold a footer and, hence, need to be removed.

  4. The Date of Birth, DOB, column needs to be parsed to a DateTime datatype.

  5. Write all the changes to a new CSV file named pandas_write.csv, changing all column names to uppercase and replacing two column names (kg and cm) with (WEIGHT and HEIGHT).

The Baseline

To initiate the I/O process, the first step is to import the pandas library followed by reading the CSV file without specifying any explicit arguments.

```python
import pandas as pd
pd.read_csv('pandas_read.csv')
```

Reading Only Unique Columns

The duplicate columns (Id2.1 and kg.1) sit at indexes 2 and 6, respectively. So, we need to use the usecols argument and specify the column indexes we want to retain.

```python
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10]).head(4)
```

We have appended the head(4) method to get a preview of the first four records. As you can observe in the table below, the duplicate columns have been dropped.

  | ID1  | Id2  | ID3  | Age | kg | cm  | DOB        | Study | Grade
0 | ?    | ?    | ?    | ?   | ?  | ?   | ?          | ?     | ?
1 | ?    | ?    | ?    | ?   | ?  | ?   | ?          | ?     | ?
2 | ?    | ?    | ?    | ?   | ?  | ?   | ?          | ?     | ?
3 | 2748 | 1847 | 2898 | 48  | 80 | 151 | 11/13/1990 | Ph.D. | A-
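Instead of hard-coding positions, usecols also accepts a callable that is evaluated against each column name, keeping the column when it returns True. A minimal sketch, using an inline one-row stand-in for pandas_read.csv (the stand-in headers and values here are illustrative, not the real file):

```python
import io
import pandas as pd

# Inline stand-in for a slice of pandas_read.csv; the duplicate columns
# appear with a '.1' suffix in their names.
sample = io.StringIO(
    "ID1,Id2,Id2.1,ID3,Age,kg,kg.1,cm,DOB,Study,Grade\n"
    "2748,1847,1847,2898,48,80,80,151,11/13/1990,Ph.D.,A-\n"
)

# Keep every column whose name does not carry the '.1' duplicate suffix.
df = pd.read_csv(sample, usecols=lambda name: not name.endswith('.1'))
print(list(df.columns))
# ['ID1', 'Id2', 'ID3', 'Age', 'kg', 'cm', 'DOB', 'Study', 'Grade']
```

This version keeps working even if the duplicate columns move to different positions in the file.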

Removing the Garbage Data

This can be achieved by passing the row numbers to the skiprows argument. Do remember that skiprows counts physical lines of the file starting at 0, and line 0 is the header row; so passing [1, 2, 3] skips the first three data rows.

```python
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3]).head(4)
```

The table below shows that the first three records containing ? have been removed, and the index again starts at 0.

  | ID1  | Id2  | ID3  | Age | kg  | cm  | DOB        | Study     | Grade
0 | 2748 | 1847 | 2898 | 48  | 80  | 151 | 11/13/1990 | Ph.D.     | A-
1 | 1171 | 2077 | 4124 | 51  | 57  | 165 | 8/4/1988   | Doctorate | F-
2 | 1732 | 3784 | 1402 | 50  | 101 | 194 | 11/11/1989 | Doctorate | D-
3 | 4394 | 4799 | 1141 | 45  | 90  | 171 | 12/26/1995 | Ph.D.     | A+
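Besides a list, skiprows also accepts a single integer (skip that many lines from the top) or a callable evaluated against each 0-based line index. A sketch of the callable form, on an inline stand-in that mimics the header/garbage layout described above (the stand-in data is illustrative):

```python
import io
import pandas as pd

# Stand-in file: a header line, three garbage rows, and one good record.
sample = io.StringIO(
    "ID1,Age\n"
    "?,?\n"
    "?,?\n"
    "?,?\n"
    "2748,48\n"
)

# Line 0 is the header, so skipping lines 1-3 drops the garbage rows.
df = pd.read_csv(sample, skiprows=lambda i: 1 <= i <= 3)
print(df['ID1'].tolist())
# [2748]
```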

Removing the Footer

Let us take a look at the bottom of the file. We have used the .tail(4) method after reading it to get the last four rows of the DataFrame.

```python
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3]).tail(4)
```
    | ID1                                              | Id2  | ID3  | Age | kg  | cm  | DOB       | Study  | Grade
996 | 4719                                             | 6464 | 4144 | 3   | 103 | 150 | 12/3/1985 | Master | F+
997 | The data in the CSV file lists three IDs of 10... | NaN  | NaN  | NaN | NaN | NaN | NaN       | NaN    | NaN
998 | NaN                                              | NaN  | NaN  | NaN | NaN | NaN | NaN       | NaN    | NaN
999 | NaN                                              | NaN  | NaN  | NaN | NaN | NaN | NaN       | NaN    | NaN

As you may observe, only the ID1 cell of row 997 holds the footer text; the rest of the cells in rows 997, 998, and 999 are empty, which Pandas automatically represents as NaN (Not a Number). So, let us drop the last three rows to remove the footer while reading the file. This is done with the skipfooter argument, which takes the number of rows to skip from the bottom of the file; since our footer spans three rows, we assign it 3. Note that skipfooter is only supported by the slower python parsing engine, not the default (and fastest) c engine, so we also have to pass engine='python' explicitly.

```python
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3], skipfooter=3, engine='python').tail(4)
```

As you can observe from the table, rows 997-999 have been removed, leaving row 996 as the last row.

    | ID1  | Id2  | ID3  | Age | kg  | cm  | DOB        | Study     | Grade
993 | 2613 | 6533 | 6333 | 0   | 58  | 176 | 11/9/1995  | Bachelor  | A+
994 | 4839 | 2027 | 3023 | 41  | 85  | 191 | 2/5/1991   | Doctorate | A+
995 | 3233 | 1947 | 4584 | 41  | 50  | 151 | 9/23/1996  | Doctorate | B+
996 | 4719 | 6464 | 4144 | 3   | 103 | 150 | 12/3/1985  | Master    | F+
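The skipfooter behavior can be seen in isolation on a small inline stand-in (the file contents below are illustrative, not the real pandas_read.csv):

```python
import io
import pandas as pd

# Stand-in: a header, one good record, and a three-line footer.
sample = io.StringIO(
    "ID1,Age\n"
    "2748,48\n"
    "footer text,\n"
    ",\n"
    ",\n"
)

# skipfooter drops lines from the bottom before parsing; it requires
# engine='python', since the default c engine does not support it.
df = pd.read_csv(sample, skipfooter=3, engine='python')
print(len(df))
# 1
```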

Parsing the DOB Column as a DateTime Datatype

Let us find the datatype of the DOB column using the Pandas dtypes attribute.

```python
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3], skipfooter=3, engine='python').DOB.dtypes
```

This results in an object datatype.

```python
dtype('O')
```

To convert this column to a datetime column, we need to supply a date parser along with the format of the date (MM/DD/YYYY). This is accomplished by passing the column name to the parse_dates argument and a parser function to the date_parser argument. The function can be implemented as a lambda around Python's datetime.strptime method.

```python
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')
pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3], skipfooter=3, engine='python', parse_dates=['DOB'], date_parser=dateparse).DOB.dtypes
```

This results in the Pandas datetime format.

```python
dtype('<M8[ns]')
```
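The same conversion can also be done after the read with pandas.to_datetime and an explicit format string, which on recent Pandas versions is the recommended replacement for the date_parser argument. A sketch on an inline one-row stand-in (the data is illustrative):

```python
import io
import pandas as pd

sample = io.StringIO("ID1,DOB\n2748,11/13/1990\n")

df = pd.read_csv(sample)
# Convert the DOB column in place; the format matches MM/DD/YYYY.
df['DOB'] = pd.to_datetime(df['DOB'], format='%m/%d/%Y')
print(df.DOB.dtypes)
# datetime64[ns]
```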

Writing to a New CSV File

Let us transform all the column names to uppercase and replace the two column names (kg and cm) with (WEIGHT and HEIGHT). There are many ways to perform these changes, but we can also do it during the writing phase by assigning the new column names to the header argument as a list. Also, make sure to set the index argument to False so that the row numbers are not written as an extra column.

```python
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')

# Store the cleaned DataFrame in a variable named file and
# write it out to pandas_write.csv
file = pd.read_csv('pandas_read.csv', usecols=[0, 1, 3, 4, 5, 7, 8, 9, 10], skiprows=[1, 2, 3], skipfooter=3, engine='python', parse_dates=['DOB'], date_parser=dateparse)

file.to_csv('pandas_write.csv', header=['ID1', 'ID2', 'ID3', 'AGE', 'WEIGHT', 'HEIGHT', 'DOB', 'STUDY', 'GRADE'], index=False)
```
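Equivalently, the columns can be renamed on the DataFrame itself before writing, which avoids keeping a hand-written header list in sync with usecols. A sketch on an inline stand-in (the sample data is illustrative):

```python
import io
import pandas as pd

sample = io.StringIO("ID1,Id2,kg,cm\n2748,1847,80,151\n")
df = pd.read_csv(sample)

# Uppercase every column name, then rename the two measurement columns.
df.columns = df.columns.str.upper()
df = df.rename(columns={'KG': 'WEIGHT', 'CM': 'HEIGHT'})

# to_csv with no path returns the CSV text instead of writing a file.
print(df.to_csv(index=False).strip())
# ID1,ID2,WEIGHT,HEIGHT
# 2748,1847,80,151
```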

Conclusion

In this guide, you have covered custom read and write operations on a CSV file. Similar techniques apply to the other readers and writers that Pandas provides.
