Data Wrangling with Pandas

Pavneet Singh

  • Mar 19, 2019
  • 13 Min read


Data wrangling is one of the crucial tasks in data science and analysis; it includes operations like:

  • Data Sorting: To rearrange values in ascending or descending order.
  • Data Filtration: To create a subset of available data.
  • Data Reduction: To eliminate or replace unwanted values.
  • Data Access: To read or write data files.
  • Data Processing: To perform aggregation, statistical, and similar operations on specific values.

Pandas is an open source library, specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease the process of data manipulation, aka data munging/wrangling.


Execute the below command to install pandas:

$ pip install pandas

Import pandas in python script as:

import pandas

To install the latest version of pandas, make sure you have the latest version of pip. Use the below commands to upgrade the pip package on a terminal (Mac/Linux):

$ pip install --upgrade pip

On Windows:

python -m pip install -U pip

Data Structures in Pandas

During data analysis, often the requirement is to store series or tabular data. This can be done using lists but python lists store the data using pointers and python objects, which is quite inefficient in terms of memory and performance. Pandas is bundled with custom data structures to store and process the data effectively.

There are two data structures:

  • Series: A series is a one-dimensional indexed/labeled array that can be created by passing a list, dictionary, or array to the Series() function. Every series object has an associated data type.

The object data type is used for heterogeneous data, like a list of strings and ints.

import pandas as pd # creating alias as pd
nums = pd.Series([1, 2, 3]) # create a series from a list
nums.head() # print the first rows of the series
print(nums[2]) # print the value at index 2


0    1
1    2
2    3
dtype: int64
  • Index can be customized by passing a list to the index property:

custom_index = pd.Series(['1', float('NaN'), 3], index=['a', 11, 12]) # NaN means not a number
# custom_index = pd.Series({'a': '1', 11: float('NaN'), 12: 3}) # using a dictionary


a       1
11    NaN
12      3
dtype: object
  • DataFrame: A DataFrame stores data in tabular form. It is a two-dimensional data structure that organizes the data into rows and columns. A DataFrame can be created from lists, dictionaries, or numpy arrays as:

import numpy as np # needed for the numpy array example

# using a list
product_data = [['E-book', 2000], ['Plants', 6000], ['Pencil', 3000]] # 2d list, similar to a 2d array
indexes = [1, 2, 3] # custom row labels
columns_name = ['product', 'unit_sold']
product_df = pd.DataFrame(data=product_data, index=indexes, columns=columns_name)

# using a dictionary
product_data = {'product': ['E-book', 'Plants', 'Pencil'], 'unit_sold': [2000, 5000, 3000]}
product_df = pd.DataFrame(data=product_data)

# using a numpy array
products = np.array([['', 'product', 'unit_sold'], [1, 'E-book', 2000], [2, 'Plants', 6000], [3, 'Pencil', 3000]])
product_pf = pd.DataFrame(data=products[1:, 1:], # [1:,1:] from the second row till the end, from the second column till the end
                          index=products[1:, 0], # [1:,0] from the second row till the end, only the first column
                          columns=products[0, 1:]) # [0,1:] only the first row, from the second column till the end
print(product_pf) # output is the same as in the first case


  product  unit_sold
1  E-book       2000
2  Plants       6000
3  Pencil       3000

  product  unit_sold
0  E-book       2000
1  Plants       5000
2  Pencil       3000

With numpy, the values of the index and columns properties are supplied using array slicing syntax.

Use describe() on a DataFrame/Series object to get a summary of its numeric values along with analytical data like min, max, standard deviation, count, and percentiles.

Note: describe() will skip the numeric analysis for columns whose datatype is object (string). For example, in the numpy-based dataframe above, every value, including unit_sold, is stored as a string because the array mixes text and numbers, so no numeric summary is produced for it.
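As a quick sketch (with values assumed for illustration), describe() summarizes only the numeric columns of a mixed dataframe:

```python
import pandas as pd

# hypothetical product data for illustration
product_df = pd.DataFrame({'product': ['E-book', 'Plants', 'Pencil'],
                           'unit_sold': [2000, 5000, 3000]})
summary = product_df.describe()  # count, mean, std, min, percentiles, max
print(summary)  # only 'unit_sold' appears; the string column 'product' is skipped
```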

Index and Labels

  • Row Index: By default, the first column holds the row indexes, starting from zero. Indexes can also be customized by passing a list of indexes to the index property. Pandas has an iloc[int_index_value] accessor which only takes int values to fetch rows as:

row = product_df.iloc[2] # fetch the third row
rows = product_df.iloc[0:2] # fetch the first two rows (the row at index 2 is excluded)

If the index contains string values, use loc[label] or ix[label]: loc searches for the matched label, while ix first tries loc and, on failure, falls back to iloc. Note that ix is deprecated in recent pandas versions, so prefer loc and iloc.

A column can also be defined as the index column using data_frame.set_index('column_name') and can be reset using data_frame.reset_index().
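A short sketch of set_index/reset_index, assuming the product dataframe from earlier (reconstructed here so the snippet is self-contained):

```python
import pandas as pd

product_df = pd.DataFrame({'product': ['E-book', 'Plants', 'Pencil'],
                           'unit_sold': [2000, 5000, 3000]})
indexed_df = product_df.set_index('product')  # 'product' becomes the row index
row = indexed_df.loc['Plants']                # label-based lookup now works
restored_df = indexed_df.reset_index()        # move the index back into a column
```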

  • Column Name: Column names can be used as labels and can be modified by passing a list of names to the columns property. Column names are quite useful while performing selection, arithmetic, or searching operations as:

product_df['product'] # return the column as a Series object

gt_products = product_df['unit_sold'] > 2500 # return a Series of bool, True where the value is greater than 2500

# calculate next month's target by increasing 10% and store the resultant series in product_df under a next_target column
product_df['next_target'] = product_df['unit_sold'] + (product_df['unit_sold'] * 10) / 100

Both row and column labels can be used to fetch a particular value using the at or iat accessors, e.g. product_df.at['2', 'product'].

If the index contains any non-integer value, the type of the index will be object and its values should be referenced as strings, like '2'.
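For example (a sketch with a string index; the dataframe and its values are assumed):

```python
import pandas as pd

product_df = pd.DataFrame({'product': ['E-book', 'Plants', 'Pencil'],
                           'unit_sold': [2000, 5000, 3000]},
                          index=['1', '2', '3'])  # string labels -> object index
value = product_df.at['2', 'product']  # label-based scalar access
fast = product_df.iat[1, 0]            # positional scalar access, same cell
```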

Import or Export Data

Data is an integral part of analysis and is often stored in files (CSV, Excel, JSON, XML, SQL, etc.), so pandas has inbuilt support to load data from files as a dataframe. CSV is the most commonly used format for datasets and there are many free datasets available on the web. Let's import a Daily Show guests dataset using pandas:

Download the Daily Show guests file and save it in the same folder where the python script is stored, or use a relative path to the CSV file.

guest_list_df = pd.read_csv('daily_show_guests.csv', sep=',') # separator can also be changed

[Output: the top five rows of the guest list dataframe]

To read data from other formats, use read_sql, read_json, and read_excel functions. To read specific or multiple sheets, use pd.read_excel('file_name.xlsx', sheet_name='Sheet1').

Pandas requires xlrd module to read excel files so make sure to install it as:

$ pip install xlrd

SQL data access requires you to set up a connection using pyodbc.connect and then use pd.read_sql(sql_query, connection_object) to retrieve data.
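As a runnable sketch of the same pattern, Python's built-in sqlite3 module stands in for pyodbc here so no database server is needed; the products table and its rows are made up for illustration:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
conn.execute('CREATE TABLE products (product TEXT, unit_sold INTEGER)')
conn.executemany('INSERT INTO products VALUES (?, ?)',
                 [('E-book', 2000), ('Plants', 5000)])
product_df = pd.read_sql('SELECT * FROM products', conn)  # query result -> dataframe
conn.close()
```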

Kaggle is another popular site to download datasets or use this data repository for more datasets.

Data Manipulation

Data manipulation includes sorting, merging, grouping, and altering the data. Pandas offers multiple functions to manipulate data as:


To sort the dataframe in ascending (default) or descending order, use the sort_values function. It uses quicksort by default, though that can be replaced with mergesort or heapsort via the kind property:

sorted_guest_df = guest_list_df.sort_values('GoogleKnowlege_Occupation', # sort by column
                                            ascending=False, # enable descending order
                                            kind='heapsort', # sorting algorithm
                                            na_position='last') # keep NaN values at the end


Merge and Concatenation

Pandas has a merge function which can combine two dataframes, much like joining two SQL tables:

# Merge
sorted_guest_df = pd.merge(guest_list_df.head(3),
                           guest_list_df.tail(3),
                           how='outer',
                           indicator=True)
  • head(3) and tail(3) return the first and last three rows as dataframes.
  • outer is required to combine two dissimilar (no common rows) dataframes (tables).
  • Enabling indicator will provide information about the dataframe source of each row (left or right).


Pandas also has the join function which uses merge internally, though it is more concise and visually specific.
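A minimal join sketch, assuming two hypothetical dataframes (sales_df, price_df) that share an index:

```python
import pandas as pd

sales_df = pd.DataFrame({'unit_sold': [2000, 5000]}, index=['E-book', 'Plants'])
price_df = pd.DataFrame({'price': [10, 25]}, index=['E-book', 'Plants'])
joined_df = sales_df.join(price_df)  # aligns rows on the index, left join by default
```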

To combine both dataframes into a new one, use concat as:

top_df = guest_list_df.head(3)
bottom_df = guest_list_df.tail(3)
combined_guest_df = pd.concat([top_df, bottom_df])

Merge combines dataframes based on common columns or indexes and also supports left and right joins; concat simply stacks an iterable of objects.


Grouping is used to aggregate the data into different categories. A common use case would be to create different types of users (paid/free), or to find the number of guests having Acting as their Group.

guest_group = guest_list_df.groupby('Group')
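To make the grouping concrete, here is a sketch on a small made-up dataframe mirroring the dataset's Group column, counting guests per group:

```python
import pandas as pd

# hypothetical guests, mirroring the Daily Show dataset's 'Group' column
guests_df = pd.DataFrame({'Group': ['Acting', 'Acting', 'Politician'],
                          'Raw_Guest_List': ['A', 'B', 'C']})
counts = guests_df.groupby('Group').size()  # number of guests per group
```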


Use the rename function to change the column name Group to Occupation as:

guest_list_df = guest_list_df.rename(columns={'Group': 'Occupation'}) # row labels can be renamed similarly via the index parameter


Data Cleaning

Datasets are not always perfect. Data duplication or missing values can affect the analysis process. Often missing values are represented with "NA" or "-" or similar indicators and need to be replaced with NaN before processing numeric data. Missing data can be verified as:

guest_list_df.isnull().values.any() # verify if any value is null
guest_list_df.isnull().sum() # display a summary of null values per column

While reading data from files, missing values can be handled using the na_values property which can take a list of values to replace with NaN as:

guest_list_df = pd.read_csv("daily_show_guests.csv", na_values=['NA', 'N/A', '-'])
  • fillna: To replace NA and NaN values with desired value.
product_data = {'product': ['E-book', 'Plants'], 'unit_sold': [12, np.nan]} # requires numpy imported as np
product_df = pd.DataFrame(data=product_data)
product_df.fillna(0, inplace=True)

The dropna() function can also be used to remove rows or columns (via the axis parameter).
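A minimal dropna() sketch (values assumed):

```python
import numpy as np
import pandas as pd

product_df = pd.DataFrame({'product': ['E-book', 'Plants'],
                           'unit_sold': [12, np.nan]})
cleaned = product_df.dropna()            # drop rows that contain NaN
no_nan_cols = product_df.dropna(axis=1)  # drop columns that contain NaN instead
```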

Dataframe also has functions like replace to replace specific values or rename to change the column names.

  • drop_duplicates: Row duplication can be removed using the drop_duplicates function, which drops rows with duplicate values in the given column(s) as:

df = pd.DataFrame({'Name': ['Pavneet', 'Pavneet']}, index=[1, 2])
df = df.drop_duplicates(keep='last') # keep the last of the duplicate rows


      Name
2  Pavneet

Data Analysis

Pandas makes calculations fairly easy by providing inbuilt support for mathematical and statistics operations to calculate various measures like mean, median, standard deviation, min, max, etc. These functions can also be performed using describe() or can be performed on a single row or a column using the axis property as:

product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                                'unit_sold': [2000, 5000],
                                'types': [800, 200]})


axis=1 performs the operation across the values of a row; the default is axis=0, which means column-wise calculation.
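For example, computing a mean column-wise and row-wise (values assumed):

```python
import pandas as pd

product_df = pd.DataFrame({'unit_sold': [2000, 5000],
                           'types': [800, 200]})
col_means = product_df.mean()        # axis=0 (default): one mean per column
row_means = product_df.mean(axis=1)  # one mean per row
```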

Data Storage

Pandas objects (DataFrame, Series) can easily be exported to external files using to_csv as:

product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                                'unit_sold': [2000, 5000],
                                'types': [800, 200]})
product_df.to_csv('products.csv',
                  index=False, # otherwise an extra index column is added
                  sep=',',
                  encoding='utf-8')

And pandas has similar functions for other formats like to_parquet, to_sql, to_hdf, to_excel, to_json, to_html, etc.


  • Most pandas operations return a modified copy of the source object. To apply changes to the source object itself, pass inplace=True, but note that inplace doesn't guarantee faster operations, so it is often better to keep the original data intact.
  • Pandas also supports plotting graphs via functions like boxplot or hist, though these require the matplotlib library.
  • Use pandas.MultiIndex with dataframes instead of Panel for 3D structures, because Panel has been deprecated.
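The copy-versus-inplace behavior above can be seen on a tiny dataframe (values assumed):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 1]})
sorted_copy = df.sort_values('a')  # returns a new, sorted copy; df is unchanged
df.sort_values('a', inplace=True)  # modifies df itself and returns None
```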

The code is available on pandas-demo repo for demonstration and practice.