Data wrangling is one of the crucial tasks in data science and analysis, and it includes operations like sorting, merging, grouping, and cleaning data before it can be analyzed.
Pandas is an open-source library developed specifically for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has inbuilt data structures to ease up the process of data manipulation, aka data munging/wrangling.
Execute the command below to install pandas:

$ pip install pandas
Import pandas in a Python script as:

import pandas
To install the latest version of pandas, make sure you have the latest version of pip. Use the command below to upgrade the pip package on a terminal (Mac/Linux):

$ pip install --upgrade pip
On Windows:

python -m pip install -U pip
During data analysis, the requirement is often to store series or tabular data. This can be done using lists, but Python lists store data using pointers and Python objects, which is quite inefficient in terms of memory and performance. Pandas is bundled with custom data structures to store and process data effectively.
There are two data structures: Series (one-dimensional) and DataFrame (two-dimensional, tabular).

A Series object can be created by passing a list, dictionary, or array as a parameter to the Series() function. Every Series object has an associated data type; the object data type is used for heterogeneous data, like a list of string and int values.
import pandas as pd # creating an alias as pd

nums = pd.Series([1, 2, 3]) # from a list
print(nums.head()) # print the series
print(nums[2]) # print value at index 2
Output:

0    1
1    2
2    3
dtype: int64
3
Indexes can be customized by passing a list of values via the index property:

custom_index = pd.Series(['1', float('NaN'), 3], index=['a', 11, 12]) # NaN means not a number
custom_index.head()
# custom_index = pd.Series({'a': '1', 11: float('NaN'), 12: 3}) # the same series, built from a dictionary
Output:

a       1
11    NaN
12      3
dtype: object
A DataFrame stores data in tabular form, in rows and columns. It can be created using lists, dictionaries, or NumPy arrays:

# using a list
product_data = [['E-book', 2000], ['Plants', 6000], ['Pencil', 3000]] # 2D list, similar to a 2D array
indexes = [1, 2, 3]
columns_name = ['product', 'unit_sold']
product_df = pd.DataFrame(data=product_data, index=indexes, columns=columns_name)
print(product_df)

# using a dictionary
product_data = {'product': ['E-book', 'Plants', 'Pencil'], 'unit_sold': [2000, 5000, 3000]}
product_df = pd.DataFrame(data=product_data)
print(product_df)

# using a numpy array
import numpy as np
products = np.array([['', 'product', 'unit_sold'], [1, 'E-book', 2000], [2, 'Plants', 6000], [3, 'Pencil', 3000]])
product_df = pd.DataFrame(data=products[1:, 1:], # [1:,1:] rows 1 onward, columns 1 onward (skip header row and index column)
                          index=products[1:, 0], # [1:,0] rows 1 onward, only column 0 (the index values)
                          columns=products[0, 1:]) # [0,1:] only row 0, columns 1 onward (the column names)
print(product_df) # output is the same as the first case
Output:

  product  unit_sold
1  E-book       2000
2  Plants       6000
3  Pencil       3000

  product  unit_sold
0  E-book       2000
1  Plants       5000
2  Pencil       3000
With NumPy, the values of the index and columns properties are supplied using array slicing syntax. Use describe() on a DataFrame/Series object to get a summary of its numeric values, along with analytical measures like min, max, standard deviation, count, and percentiles.

Note: describe() will skip the analysis information for a column whose datatype is string. For example, the dataframe created from the NumPy array above stores every value as a string (note the empty string at the first index of the header row), so no numeric analysis is reported for it.
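Since NumPy arrays are homogeneous, that frame keeps unit_sold as text. A minimal sketch using pd.to_numeric restores a numeric dtype so that describe() (and the arithmetic examples later in this guide) can work with the column:

# the NumPy-built frame stores unit_sold as strings; convert it to a numeric dtype
product_df['unit_sold'] = pd.to_numeric(product_df['unit_sold'])
print(product_df.describe()) # now reports count, mean, std, min, percentiles, and max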
Rows can be fetched using the index property. Pandas has the iloc[int_index_value] indexer, which can only take int values, to fetch rows:

row = product_df.iloc[2] # fetch the third row
rows = product_df.iloc[0:2] # fetch the first and second rows; the end of the slice (2) is excluded
If the index contains string values, then use loc[label] or ix[label], where loc searches for the matched label and ix first uses loc and, in case of failure, falls back to iloc (note that ix is deprecated in recent pandas versions). A column can also be defined as an index column using data_frame.set_index('index_based_column_name') and can be reset using dataframe.reset_index().
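For instance, a quick sketch of label-based access and index switching, reusing the product_df built above (whose index labels are strings):

print(product_df.loc['2']) # label-based lookup of the row labeled '2'

indexed_df = product_df.set_index('product') # use the product column as the index
print(indexed_df.loc['Plants']) # fetch a row by product name
indexed_df = indexed_df.reset_index() # restore a default integer index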
Columns can be fetched using the columns property, and column names are quite useful when performing selection, arithmetic, or searching operations:

product_df['product'] # return a column as a Series object

gt_products = product_df['unit_sold'] > 2500 # return a Series of bool, True where the value is greater than 2500
print(gt_products)

# calculate next month's target by increasing 10%, and store the resultant series in product_df under the next_target column
product_df['next_target'] = product_df['unit_sold'] + (product_df['unit_sold'] * 10) / 100
Both row and column labels can be used to get a particular value, using the at (label-based) or iat (position-based) accessor:

product_df.at['2', 'product']

If the index contains any non-integer value, then the dtype of the index will be object, and labels should be passed as strings, like '2'.
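For comparison, a short sketch of both accessors on the same frame:

print(product_df.at['2', 'product']) # label-based: row labeled '2', column 'product' -> 'Plants'
print(product_df.iat[1, 0]) # position-based: second row, first column -> 'Plants'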
Data is an integral part of analysis and is often stored in files (CSV, Excel, JSON, XML, SQL, etc.), so pandas has inbuilt support to load data from files as a dataframe. CSV is the most commonly used format for datasets, and there are many free datasets available on the web. Let's import a Daily Show guests dataset using pandas.

Download the Daily Show guests file and save it in the same folder where the Python script is stored, or use a relative path to the CSV file.
guest_list_df = pd.read_csv('daily_show_guests.csv', sep=',') # the separator can also be changed
print(guest_list_df.head(5))
To read data from other formats, use the read_sql, read_json, and read_excel functions. To read a specific sheet or multiple sheets, use pd.read_excel('file_name.xlsx', sheet_name='Sheet1').
Pandas requires the xlrd module to read Excel files, so make sure to install it:

$ pip install xlrd
SQL data access requires you to set up a connection using pyodbc.connect and then use pd.read_sql(sql_query, connection_object) to retrieve the data. Kaggle is another popular site to download datasets, or use this data repository for more datasets.
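A minimal sketch of the SQL route; the connection string, database, and table name below are placeholders, and the pyodbc package must be installed separately:

import pyodbc
import pandas as pd

# hypothetical connection string; replace driver, server, and database with real values
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=sales;Trusted_Connection=yes;')
products_df = pd.read_sql('SELECT product, unit_sold FROM products', conn) # hypothetical table
conn.close()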
Data manipulation includes sorting, merging, grouping, and altering the data. Pandas offers multiple functions for these operations.
To sort the dataframe in ascending (default) or descending order, use the sort_values function. It uses quicksort by default, though that can be replaced with mergesort or heapsort via the kind property. Use sort_values as:
sorted_guest_df = guest_list_df.sort_values('GoogleKnowlege_Occupation', # sort by column
                                            ascending=False, # enable descending order
                                            kind='heapsort', # sorting algorithm
                                            na_position='last') # keep NaN values at the end
Pandas has a merge function which can be used to combine two dataframes, just like two SQL tables, using joins:
# Merge
sorted_guest_df = pd.merge(guest_list_df.head(3),
                           guest_list_df.tail(3),
                           how='outer',
                           indicator=True)
- head and tail will get the three rows from the top and the bottom, as dataframes.
- outer is required to combine two dissimilar (no common rows) dataframes (tables).
- indicator will add a column with information about the source dataframe of each row (left or right).

Pandas also has a join function, which uses merge internally, though it is more concise and visually specific.
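As an illustration, a small join sketch on two made-up frames; join combines on the index by default, and the lsuffix/rsuffix parameters disambiguate overlapping column names:

left = pd.DataFrame({'product': ['E-book', 'Plants']}, index=[1, 2])
right = pd.DataFrame({'unit_sold': [2000, 5000]}, index=[1, 2])
combined = left.join(right) # index-based join; use lsuffix/rsuffix if column names overlap
print(combined)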
To combine two dataframes into a new one, use concat as:
top_df = guest_list_df.head(3)
bottom_df = guest_list_df.tail(3)
combined_guest_df = pd.concat([top_df, bottom_df])
Merge combines dataframes based on common columns or indexes and also supports left and right joins, while concat only works with Iterable objects.
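For example, a left join on a shared key column (both small frames here are made up for the sketch):

sales = pd.DataFrame({'product': ['E-book', 'Plants', 'Pencil'], 'unit_sold': [2000, 5000, 3000]})
prices = pd.DataFrame({'product': ['E-book', 'Plants'], 'price': [10, 5]})
merged = pd.merge(sales, prices, on='product', how='left') # Pencil has no price, so it gets NaN
print(merged)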
Grouping is used to aggregate data into different categories. A common use case would be to segment different types of users (paid/free), or to find the guests having Acting as their Group:
guest_group = guest_list_df.groupby('Group')
print(guest_group.get_group('Acting'))
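To count the guests in each group rather than listing them, an aggregation such as size() can be applied to the grouped object:

group_counts = guest_list_df.groupby('Group').size() # number of guests per Group value
print(group_counts.sort_values(ascending=False)) # largest groups first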
Use rename to change the column name Group to Occupation:

guest_list_df = guest_list_df.rename(columns={'Group': 'Occupation'}) # index labels can be remapped too, via the index parameter
Datasets are not always perfect. Data duplication or missing values can affect the analysis process. Missing values are often represented with "NA", "-", or similar indicators, and these need to be replaced with NaN before processing numeric data. The presence of missing data can be verified as:
guest_list_df.isnull().values.any() # verify whether any value is null
guest_list_df.isnull().sum() # display a summary of null values per column
While reading data from files, missing values can be handled using the na_values property, which can take a list of values to replace with NaN:

guest_list_df = pd.read_csv('daily_show_guests.csv', na_values=['NA', 'N/A', '-'])
Use fillna to replace NA and NaN values with a desired value:

product_data = {'product': ['E-book', 'Plants'], 'unit_sold': [12, np.NaN]} # requires numpy imported as np
product_df = pd.DataFrame(data=product_data)
product_df.fillna(0, inplace=True)
The dropna() function can also be used to remove rows or columns (via axis). A dataframe also has functions like replace to replace specific values and rename to change column names.
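A short sketch of both, rebuilding the small frame with a NaN:

product_df = pd.DataFrame({'product': ['E-book', 'Plants'], 'unit_sold': [12, np.NaN]})
print(product_df.dropna()) # drops the Plants row, which contains NaN; axis=1 would drop columns instead
print(product_df.replace('E-book', 'Ebook')) # swap a specific value; returns a new frame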
Pandas has a drop_duplicates function, which removes duplicate rows based on a column:

df = pd.DataFrame({'Name': ['Pavneet', 'Pavneet']}, index=[1, 2])
df.drop_duplicates(subset='Name', keep='last')
Output:

      Name
2  Pavneet
Pandas makes calculations fairly easy by providing inbuilt support for mathematical and statistical operations to calculate measures like mean, median, standard deviation, min, max, etc. These can be obtained through describe(), or computed for a single row or column using the axis property:
product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                                'unit_sold': [2000, 5000],
                                'types': [800, 200]})
product_df.describe()
product_df['unit_sold'].mean()
product_df.mean(axis=1, numeric_only=True) # numeric_only skips the string column
axis=1 performs the operation across the values of each row; by default it is 0, which means column-wise calculations.
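To make the difference concrete, a quick sketch with sum on the numeric columns only:

numeric_df = product_df[['unit_sold', 'types']] # keep only the numeric columns
print(numeric_df.sum()) # axis=0 (default): one total per column
print(numeric_df.sum(axis=1)) # axis=1: one total per row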
Pandas objects (dataframe, series) can easily be exported to external files using to_csv:
product_df = pd.DataFrame(data={'product': ['E-book', 'Plants'],
                                'unit_sold': [2000, 5000],
                                'types': [800, 200]})
product_df.to_csv('aa.csv',
                  index=False, # otherwise an extra index column is added at the start
                  sep=',',
                  encoding='utf-8')
And pandas has similar functions for other formats, like to_parquet, to_sql, to_hdf, to_excel, to_json, to_html, etc.
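For instance, the same frame can be written to JSON with a one-liner; the orient parameter controls the layout, and 'records' is one choice among several:

product_df.to_json('products.json', orient='records') # [{"product": "E-book", ...}, ...]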
Finally, a few tips. The code from this guide is available in the pandas-demo repo for demonstration and practice.

- Many pandas functions accept an inplace parameter, but using inplace=True doesn't guarantee faster operations, so it's often better to keep the original data intact.
- Pandas can plot data directly via functions like boxplot or hist, though this requires the matplotlib library for graphs; see the sketch below.
- Use pandas.MultiIndex with dataframes instead of Panel for 3D structures, because Panel has been deprecated.
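A minimal plotting sketch, assuming matplotlib is installed (pip install matplotlib) and reusing the product_df built above:

import matplotlib.pyplot as plt

product_df.boxplot(column='unit_sold') # box plot of the unit_sold column
plt.show()

product_df['unit_sold'].hist() # histogram of the same column
plt.show()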