Author avatar

Vivek Kumar

Data Transformation

Vivek Kumar

  • Mar 25, 2019
  • 9 Min read
  • 45 Views
  • Mar 25, 2019
  • 9 Min read
  • 45 Views
Data
Pandas

Introduction

Data Transformation is the process of manipulating data to the desired formatted information required by a group or an individual.

Transformed data (Formatted Information) = Data (Unformatted/Formatted Data) + Transformation (Convert Format)

It includes various activities like aggregating, grouping, filtering, removing empty fields or duplicate data, stacking, unstacking, pivoting, etc.

In this guide, you will learn about the data transformation fundamentals using Pandas library. This guide contains the dataset of a company in country_sales.csv file which includes six attributes as: COUNTRY, CUST_NAME, QTY, LIST_PRICE_AMT, SALES, ITEM_ID.

Sample rows of country_sales.csv file are:

COUNTRYCUST_NAMEQTYLIST_PRICE_AMTSALESITEM_ID
INDIAMB DUAMAL MNC16001.35135000YZASE390
INDIAUDRO UNTARNATUONAL100000.1111000SO08536
INDIAOMAU490.0464600POV6064
INDIAJAMUM AVARULDA VALANZUALA250000.1875918759ED33479
CHINAUNTARPLAX ANGUNAARAD PROD.24000.0969600ED36039
CHINANULLNULLNULLNULLYZAMG305

The above data has been read directly in a Pandas DataFrame and has few rows with empty sales value.

So, let us start the learning with the following objectives:

  1. Find the number of customers (CUST_NAME) associated with each ITEM (ITEM_ID).

  2. Find the median sales (MEDIAN_SALES) of each item (ITEM_ID) per country (COUNTRY).

  3. Convert the dataset to a stacked format by dropping empty sales (SALES is NaN).

  4. Convert the dataset to an unstacked format by filling the empty sales (SALES) through the country’s median sales.

The Baseline

To initiate the I/O process, the first step is to import the Pandas library followed by reading the CSV file.

1
2
3
4
5
6
import pandas as pd

dataframe = pd.read_csv('Country_Sales.csv')

#To check the DataFrame top rows 
dataframe.head(6) 
python

We will get the output as :

COUNTRYCUST_NAMEQTYLIST_PRICE_AMTSALESITEM_ID
0CHINACONDUPAR MUNAM COND.ELE.LTDA250.00.2100021000.01224141
1UKFUAT GRAMME16000.06.58000658000.0AE206808
2USAAmorUm Auto EletrUca ltda me480.00.00621621.0CD224702
3UKUGARAMHU MOTORM MALEM UMA LLC9316.60.014711471.0CD225980
4CHINADRAW TUTE, UNC.144.04.40200440200.0ED223023
5INDIADRAW TUTE, UNC.25000.00.1875918759.0ED33479

Finding the Number of Customers Associated with Each Item

As per the first objective, we have to find the number of customers associated with each item.

This can be done in various ways, we have a set of functions available inside the pandas which can be used to achieve the target.

Given below are the list of functions we will be using for this objective:

  1. groupby(): Used to group DataFrame or Series by using a mapper or by a Series of columns.
  2. count(): Used to count non-NAN (non-empty) cells for each column or row.
  3. sort_values: Used to sort the result in descending order
  4. rename(): Used to alter axes labels (CUST_COUNT).
  5. reset_index(): Used to reset the index or a level of it.
1
2
3
4
5
# Fetching the number of customers per item
cust_count = dataframe.groupby('ITEM_ID')['CUST_NAME']\
						.count().sort_values(ascending=False)\
						.rename('CUST_COUNT').reset_index()\
						.head()
python

Output

ITEM_IDCUST_COUNT
0VDV884713
1HU25924911
2pro280491
3SO085361
4AE2068081

As we can observe, an item with ID VDV8847 has the highest customer count followed by HU259249 and the rest of the items have just one customer buying them.

Finding the Median Sales of Each Item Per Country

To get the median sales of a column, we will group by columns to apply the aggregate function.

As per the second objective, we have to find the MEDIAN_SALES of each ITEM_ID per COUNTRY.

To achieve this, we will require the aggregate function. It is used to aggregate data using one or more operations (median, mean) over the specified axis (['SALES']). Alternatively, we can also use the median function to achieve the same results.

We will group by columns ('COUNTRY','ITEM_ID') and take the median of sales. To showcase results from each of the countries we will select the largest ten median values and arrange the results by sorting by country names.

1
2
3
4
med_sales = dataframe.groupby(['COUNTRY', 'ITEM_ID'])['SALES']\
						.aggregate('median').rename('MEDIAN_SALES')\
						.nlargest(10).reset_index()\
						.sort_values('COUNTRY').reset_index(drop=True)
python

Output

COUNTRYITEM_IDMEDIAN_SALES
0CHINAED223023440200.0
1CHINASO02032350000.0
2CHINASO207740291500.0
3INDIAYZAMG4834819000.0
4INDIAVDV8847342100.0
5INDIAYZASE390135000.0
6UKAE206808658000.0
7UKHU259249368000.0
8USASO87540345300.0
9USASO202556187700.0

This way we can find the median of our columns using groupby and aggregate functions.

Converting the Dataset to a Stacked Format by Dropping Empty Sales

We can divide this objective into two steps:

1. Dropping Empty Columns

To drop the empty columns we can use Pandas dropna() function.

1
2
# Dropping empty columns
dataframe = dataframe.dropna(how='any').reset_index()
python

Output

COUNTRYCUST_NAMEQTYLIST_PRICE_AMTSALESITEM_ID
00CHINACONDUPAR MUNAM COND.ELE.LTDA250.00.2100021000.01224141
11UKFUAT GRAMME16000.06.58000658000.0AE206808
22USAAmorUm Auto EletrUca ltda me480.00.00621621.0CD224702
33UKUGARAMHU MOTORM MALEM UMA LLC9316.60.014711471.0CD225980
44CHINADRAW TUTE, UNC.144.04.40200440200.0ED223023
55INDIADRAW TUTE, UNC.25000.00.1875918759.0ED33479

We can observe that no rows have been dropped from the first few rows, suggesting that none of them has a missing value.

2. Stacking DataFrames

To stack the DataFrame, we can use Pandas stack function.

1
2
# Stacking the DataFrame
df_stack = dataframe.stack()
python

The stack function doesn't bring any changes in the values of the original DataFrame; rather, it changes the shape of the DataFrame to a narrow one. Here is a brief glimpse of the stacked DataFrame.

0COUNTRYCHINA
CUST_NAMECONDUPAR MUNAM COND.ELE.LTDA
QTY250
LIST_PRICE_AMT0.21
SALES21000
ITEM_ID1224141
1COUNTRYUK
CUST_NAMEFUAT GRAMME
QTY16000
LIST_PRICE_AMT6.58
SALES658000
ITEM_IDAE206808

This way we can remove all the rows with empty sales and later stack the DataFrame.

Converting the Dataset to an Unstacked Format by Filling the Empty Sales Through the Country’s Median Sales

To convert this dataset to an unstacked format and replace the empty sales (SALES) by the median of each country's median sales, we will use the unstack function. This function pivots a level of the index labels, returning a DataFrame that has a new level of column labels whose inner-most level consists of the pivoted index labels.

1
2
3
4
5
# Total median sales for all countries
to_fill = med_sales.MEDIAN_SALE.median() 

# Unstacking the DataFrame
df_stack.unstack(level=-1, fill_value=to_fill) 
python

Output

COUNTRYCUST_NAMEQTYLIST_PRICE_AMTSALESITEM_ID
0CHINACONDUPAR MUNAM COND.ELE.LTDA2500.21210001224141
1UKFUAT GRAMME160006.58658000AE206808
2USAAmorUm Auto EletrUca ltda me4800.00621621CD224702
3UKUGARAMHU MOTORM MALEM UMA LLC9316.60.014711471CD225980
4CHINADRAW TUTE, UNC.1444.402440200ED223023
5INDIADRAW TUTE, UNC.250000.1875918759ED33479

This way, we can unstack data and replace the empty values with some default value. Here, we’re using the median of each country's sales median.

Conclusion

By going through this guide, you have covered the fundamentals of data transformation using Pandas library. Other operations such as pivot, pivot_table, multiindexing, etc can also be performed to transform the data.

0