Data Wrangling with Pandas

Learn how to use the Pandas open source library for data sorting, filtration, reduction, processing, and more.

Mar 19, 2019 • 13 Minute Read

Introduction

Data wrangling is one of the crucial tasks in data science and analysis which includes operations like:

Data Sorting: To rearrange values in ascending or descending order.
Data Filtration: To create a subset of available data.
Data Reduction: To eliminate or replace unwanted values.
Data Access: To read or write data files.
Data Processing: To perform aggregation, statistical, and similar operations on specific values.

Pandas is an open source library, specifically developed for data science and analysis. It is built upon the Numpy (to handle numeric data in tabular form) package and has inbuilt data structures to ease-up the process of data manipulation, aka data munging/wrangling.

Prerequisites

Execute the below command to install pandas:

      $ pip install pandas

Import pandas in python script as:

      import pandas

To install the latest version of pandas, make sure you have the latest version of pip. Use the below commands to upgrade the pip package on a terminal (Mac/Linux):

      $ pip install --upgrade pip

On Windows

      python -m pip install -U pip

Data Structures in Pandas

During data analysis, often the requirement is to store series or tabular data. This can be done using lists but python lists store the data using pointers and python objects, which is quite inefficient in terms of memory and performance. Pandas is bundled with custom data structures to store and process the data effectively.

There are two data structures:

Series: A series is just like a one dimensional indexed/labeled array that can be created using Series() function by passing an object of list/dictionary/array as a parameter to Series() function. Every series object has an associated data type.

object data type is used for heterogeneous data like a list of string and int.

      import pandas as pd # creating alias as pd

nums = pd.Series([1, 2, 3]) # list
nums.head() # to print series
print(nums[2]) # print value at index 2
    

Output:

      0    1
1    2
2    3
dtype: int64
3
    

Index can be customized by passing a list to index property:

      custom_index = pd.Series(['1', float('NaN'), 3], index=['a', 11, 12]) # NaN means not a number
custom_index.head()
# custom_index = pd.Series({'a':'1', 11:float('NaN'), 12:3}) # using dictionary
    

Output:

      a       1
11    NaN
12      3
dtype: object
    

DataFrame: DataFrame stores the data in tabular form. DataFrame is a two-dimensional data structure that organizes the data into rows and columns. A dataFrame can be created using lists, dictionaries, or numpy arrays as:

      # using list
product_data=[['e-book', 2000], ['plants', 6000], ['Pencil', 3000]] # 2d list, similar to 2d array
indexes=[1,2,3]
columns_name=['product', 'unit_sold']
product_df = pd.DataFrame(data=product_data, index=indexes, columns=columns_name)
print(product_df)

# using dictionaries

product_data={'product': ['e-book', 'plants', 'Pencil'], 'unit_sold': [2000, 5000, 3000]}
product_df = pd.DataFrame(data=product_data)
print(product_df)

# using numpy array
products = np.array([['','product','unit_sold'], [1, 'E-book', 2000],[2, 'Plants', 6000], [3, 'Pencil', 3000]])
product_pf = pd.DataFrame(data=products[1:,1:], # [1:,1:] from first row till end, from first column till end
                          index=products[1:,0], # [1:,0] from first row till end, only first column
                          columns=products[0,1:]) # [1:,0] only first row, form first column till end
print(product_pf) # output is same as of first case
    

Output:

      product  unit_sold
E-book       2000
Plants       6000
Pencil       3000

  product  unit_sold
E-book       2000
Plants       5000
Pencil       3000
    

With numpy, the values of index and columns properties are being supplied using array slicing syntax.

Use describe() on Dataframe/Series object to get the summary of numeric values along with analytical data like min, max, standard deviation, count, and percentage slabs.

Note: describe() will skip analysis information if column's datatype is string, for e.g. dataframe with numpy has an empty string at first index.

Index and Labels

Row Index: By default, the first column is for row indexes, starting from zero. Indexes can also be customized by passing a list of indexes to index property. Pandas has iloc[int_index_value] function which can only take int values to fetch the rows as:

      row = product_df.iloc[2] # fetch third row
rows = product_df.iloc[0:2] # fetch rows from first till third but not third
    

If the index contains string values then use loc[label] or ix[label], where loc search for the matched value and ix first use loc and in case of failure, it fall-backs to iloc.

A column can also be defined as an index column using data_frame.set_index('index_based_column_name') and can be reset using dataframe.reset_index.

Column Name: Column names can be used as labels and can be modified by passing a list of names to columns property. Column names are quite useful while performing selection, arithmetic, or searching operations as:

      product_df['product'] # return column as series object

gt_products = product_df['unit_sold'] > 2500 # return a series object of bool, for values greater than 2500
products

# calculate next month target by increasing 10% and store the resultant series object in product_df under next_target column
product_df['next_target'] = product_df['unit_sold'] + ( product_df['unit_sold'] * 10)/100
    

Both column and row values can be used to get a particular value using at or iat function as:

      product_df.at['2','product']

If the index contains any non-integer value then the type of index will be an object and should be used as a string like '2'.

Import or Export Data

Data is the integral part of analysis and often stored in files (CSV, Excel, JSON, XML, SQL etc). So pandas has inbuilt support to load data from files as a dataframe. CSV is the most commonly used format to create datasets and there are many free datasets available on the web. Let's import a Daily show guests dataset using pandas as:

Download the Daily show guests file and save it in the same folder where python script is stored or you can use a relative path of the CSV file.

      guest_list_df = pd.read_csv('daily_show_guests.csv', sep=',') # separator can also be changed
print(guest_list_df.head(5))
    

To read data from other formats, use read_sql, read_json, and read_excel functions. To read specific or multiple sheets, use pd.read_excel('file_name.xlsx', sheet_name='Sheet1').

Pandas requires xlrd module to read excel files so make sure to install it as:

      $ pip install xlrd

SQL data access requires you to setup the connection using pyodbc.connect and then use pd.read_sql(sql_query, connection_object) to retrieve data.

Kaggle is another popular site to download datasets or use this data repository for more datasets.

Data Manipulation

Data manipulation includes sorting, merging, grouping, and altering the data. Pandas offers multiple functions to manipulate data as:

Sorting

To sort the dataframe in ascending (default) or descending order, use sort_values function. It uses quicksort by default for sorting, though it can be replaced with mergesort or heapsort using kind property. sort_values as:

      sorted_guest_df = guest_list_df.sort_values('GoogleKnowlege_Occupation', # sort by column
                                 ascending=False, # enable descending order
                                 kind='heapsort', #sorting algorithm
                                 na_position='last') # keep NaN value at last
    

Merge and Concatenation

Pandas has merge function which can be used to combine two dataframes, just like two SQL tables using joins as:

      # Merge
sorted_guest_df = pd.merge(guest_list_df.head(3),
                           guest_list_df.tail(3),
                           how='outer',
                           indicator = True)
    

head and tail will get the three rows from the top and bottom as dataframes.
outer is required to combine two dissimilar (no common rows) dataframes (tables).
Enabling indicator will provide information about the dataframe source of each row (left or right).

Pandas also has the join function which uses merge internally, though it is more concise and visually specific.

To combine both dataframes into a new one, use concat as:

      top_df = guest_list_df.head(3)
bottom_df = guest_list_df.tail(3)
combined_guest_df = pd.concat( [top_df, bottom_df] )
    

Merge combines dataframes based on common columns or indexes and can also support left and right join. concat only works with Iterable objects.

Grouping

Grouping is used to aggregate the data into different categories. A common use case would be to create different types of users (paid/free) or to find the number of guests, having Acting as Group.

      guest_group = guest_list_df.groupby('Group')
print(guest_group.get_group('Acting'))
    

Renaming

Use 'rename' to change the column name Group to Occupation as:

      guest_list_df = guest_list_df.rename(columns={'Group':'Occupation'}, index={1:0001})
    

Data Cleaning

Datasets are not always perfect. Data duplication or missing values can affect the analysis process. Often missing values are represented with "NA" or "-" or similar indicators and need to be replaced with NaN before processing numeric data. Missing data can be verified as:

      guest_list_df.isnull().values.any() # verify if any value is null
guest_list_df.isnull().sum() # display summary of null values per column
    

While reading data from files, missing values can be handled using the na_values property which can take a list of values to replace with NaN as:

      guest_list_df = pd.read_csv("daily_show_guests.csv", na_values = ['NA', 'N/A','-']])
    

fillna: To replace NA and NaN values with desired value.

      product_data={'product': ['E-book', 'Plants'], 'unit_sold': [12, np.NaN]} # need to import numpy as np
product_df = pd.DataFrame(data=product_data)
product_df.fillna(0, inplace=True)
    

dropna() function can also be used to remove rows or columns (via axis).

Dataframe also has functions like replace to replace specific values or rename to change the column names.

drop_duplicates: Row duplication can be removed using drop_duplicates function which will remove duplicate values based on column as:

      df = pd.DataFrame({'Name':['Pavneet','Pavneet']}, index=[1,2])
df.drop_duplicates(subset='Name',keep='last')
    

Output:

      Name
2  Pavneet
    

Data Analysis

Pandas makes calculations fairly easy by providing inbuilt support for mathematical and statistics operations to calculate various measures like mean, median, standard deviation, min, max, etc. These functions can also be performed using describe() or can be performed on a single row or a column using the axis property as:

      product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                                'unit_sold': [2000, 5000],
                                'types': [800, 200]})
product_df.describe()
product_df['unit_sold'].mean()
product_df.mean(axis=1)
    

axis = 1 will perform the operation on the values of rows and, by default, it is 0 which means column-wide calculations.

Data Storage

Pandas objects (dataframe, series) can easily be exported to external files using to_csv as:

      product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                        'unit_sold': [2000, 5000],
                        'types': [800, 200]})
product_df.to_csv('aa.csv',
                  index=False, # otherwise will add extra comma at start
                  sep=',',
                  encoding='utf-8)
    

And pandas has similar functions for other formats like to_parquet, to_sql, to_hdf, to_excel, to_json, to_html, etc.

Tips

Every pandas operation returns a modified copy of the source object. So, to do the changes in source object, use inplace but using inplace=True doesn't guarantee any faster operations so often it's better to keep the original data intact.
Pandas also supports plotting graphs via a function like boxplot or hist, but it requires matplotlib library for graphs.
Use pandas.MultiIndex with dataframes instead of Panel for 3D structures because panel has been deprecated.

The code is available on pandas-demo repo for demonstration and practice.