Author avatar

Vivek Kumar

Indexing and Selecting Data

Vivek Kumar

  • Mar 22, 2019
  • 9 Min read
  • 163 Views
  • Mar 22, 2019
  • 9 Min read
  • 163 Views
Data
Pandas

Introduction

The objective of this guide is to introduce the process of slicing and getting the subsets of Pandas objects.

We will learn these operations using dummy data stored in a CSV file and breakdown the learning process into four steps:

  1. Extracting values from Pandas Series and DataFrame based on their labels.
  2. Extracting values from Pandas Series and DataFrame based on their indexes.
  3. Selecting columns by excluding a column from the beginning, center, and the end of a DataFrame.
  4. Filtering out DataFrame columns based on a certain string.

Understanding the Dataset

We have been given a CSV file named file.csv. Let us import the necessary library, (here Pandas) in the python environment and observe a few of its first and last observations.

1
2
3
4
5
6
7
8
# Importing required library
import pandas as pd

# Reading the CSV file to a Pandas DataFrame object variable, named df 
df = pd.read_csv('file.csv')

# Displaying top 5 observations of the file
df.head()
python
Serial numberStatePopulationCompany IDEstablished dateEmployees countGrowthCurrent status
0Alabama4777326C8691/1/1950256511104713Open
1Alaska711139C444911/23/195274192707190Open
2Arizona6410979C350910/16/195519619584596Closed
3Arkansas2916372C11179/7/19582170351009Open
4California37325068C42147/31/1961394273587460Open
1
2
# Displaying last 5 observations of the file
df.tail()
python
Serial numberStatePopulationCompany IDEstablished dateEmployees countGrowthCurrent status
15Iowa3047646C10876/3/1993360532336225Open
16Kansas2851183C21414/26/1996183071619008Open
17Kentucky4340167C32393/19/1999468534566179Closed
18Louisiana4529605C48102/8/2002260641532999Open
19Maine1329084C38501/1/2005479561930855Closed

Extracting Values from Pandas Series and DataFrame Based on Their Labels

To extract values from either a Series or a DataFrame based on their labels, Pandas provides a method loc. It can be used to access a value, or a group of values, based on their name(s).

Series

Consider the top five values of the dataset column, State. The chosen column State will be treated as a Pandas Series as shown:

1
2
3
4
5
# Selecting top five values of State
s = df.State.head()
type(s)

# pandas.core.series.Series
python

Let us reindex the s and observe how loc is helpful in accessing the values as shown in given codeblocks:

1
2
# Reindexing the Series s
s.index = ['b', 'd', 'e', 'a', 'f']
python
Original SeriesSeries after reindexing
0 Alabamab Alabama
1 Alaskad Alaska
2 Arizonae Arizona
3 Arkansasa Arkansas
4 Californiaf California

To access Arizona from the Series after reindexing, use the given code:

1
s.loc['e']
python

Notice, that e is written inside apostrophes and loc has square brackets rather than round brackets.

DataFrame

Let us learn to access a group of values using loc in a DataFrame. Using the top five rows of the dataset, let us fetch only the Established date and Growth of all the companies.

1
df.head().loc[:, ['Established date', 'Growth']]
python
Established dateGrowth
1/1/19501104713
11/23/19522707190
10/16/1955584596
9/7/195851009
7/31/19613587460

From the above code, you can observe that, in general, for a DataFrame loc consists of two parameters as illustrated, loc[parameter1, parameter2]. Here, the parameter1 corresponds to the row names/labels, whereas the parameter2 corresponds to column names/labels.

Extracting Values from Pandas Series and DataFrame Based on Their Indexes

It is not always necessary that you have to access the values of a Pandas object using label names. Pandas provides another great method, iloc, which we can use to access the values just by referring to the index.

Series

Let us use the same reindexed Series with alphabetical indexes. This time, let us access the centermost three values which are Alaska, Arizona, and Arkansas. This can be achieved using the following code:

1
2
3
4
5
6
s.iloc[1:4]
s
# d      Alaska
# e     Arizona
# a    Arkansas
# Name: State, dtype: object
python

As you can observe, iloc doesn't consider the real alphabetical indexes.

DataFrame

Using iloc, let us fetch the centermost three rows with all the columns out of the topmost five rows of the DataFrame df. This can be achieved using the given code:

1
df.head().iloc[1:4, :]
python
Serial numberStatePopulationCompany IDEstablished dateEmployees countGrowthCurrent status
1Alaska711139C444911/23/195274192707190Open
2Arizona6410979C350910/16/195519619584596Closed
3Arkansas2916372C11179/7/19582170351009Open

The : inside the iloc represents selecting all the columns.

Selecting Columns by Excluding a Column from the Beginning, Center, and the End of a DataFrame

The task of selecting a few columns and excluding rest is quite common in day-to-day analysis. The given three tasks can be completed either by iloc or loc. Both of these methods are illustrated.

1
2
3
4
# 1. Excluding the first column
df.head().iloc[:, 1:]  # Using iloc

df.head().loc[:, ['State', 'Population', 'Company ID', 'Established date', 'Employees count', 'Growth', 'Current status']] # Using loc
python
StatePopulationCompany IDEstablished dateEmployees countGrowthCurrent status
Alabama4777326C8691/1/1950256511104713Open
Alaska711139C444911/23/195274192707190Open
Arizona6410979C350910/16/195519619584596Closed
Arkansas2916372C11179/7/19582170351009Open
California37325068C42147/31/1961394273587460Open
1
2
3
4
5
# 2. Excluding the center column, say Established date
import numpy as np
df.head().iloc[:, np.r_[0:4, 5:8]] # Using iloc and NumPy method np.r_

df.head().loc[:, ['Serial number', 'State', 'Population', 'Company ID', 'Employees count', 'Growth', 'Current status']] # Using loc
python
Serial numberStatePopulationCompany IDEmployees countGrowthCurrent status
0Alabama4777326C869256511104713Open
1Alaska711139C444974192707190Open
2Arizona6410979C350919619584596Closed
3Arkansas2916372C11172170351009Open
4California37325068C4214394273587460Open
1
2
3
4
# 3. Excluding the last column
df.head().iloc[:, :7] # Using iloc

df.head().loc[:, ['Serial number', 'State', 'Population', 'Company ID', 'Established date' ,'Employees count', 'Growth']] # Using loc
python
Serial numberStatePopulationCompany IDEstablished dateEmployees countGrowth
0Alabama4777326C8691/1/1950256511104713
1Alaska711139C444911/23/195274192707190
2Arizona6410979C350910/16/195519619584596
3Arkansas2916372C11179/7/19582170351009
4California37325068C42147/31/1961394273587460

Filtering Out DataFrame Columns Based on a Certain String

Sometimes, there's a need to select columns based on their alphabets. The filter method provided by the Pandas is quite helpful to select columns automatically. Let us try to fetch columns which either begin with S or ends with s by using the given code:

1
2
df.head().filter(regex='^S|s$')
# The resultant columns are: Serial number, State, and Current status
python
Serial numberStateCurrent status
0AlabamaOpen
1AlaskaOpen
2ArizonaClosed
3ArkansasOpen
4CaliforniaOpen

Conclusion

In this guide, we have learned the various techniques of slicing and extracting values out of a Pandas objects using methods like iloc, loc, and filter.

0