Exploratory Data Analysis and Pre-processing in Python

By Gaurav Singhal

Mar 5, 2020 • 10 Minute Read

Introduction

The world is running on data. Data can be anything—numbers, documents, images, facts, etc. It can be in digital or in any physical form. The word "data" is the plural of "datum," which means "something given" and usually refers to a single piece of information.

Raw data is only useful after we analyze and interpret it to get the information we desire. This kind of information can help organizations design strategies based on facts and trends.

Thanks to recent advances in its packages and their ability to handle higher-end analytical tasks, Python has become a go-to language for data analysts.

By the end of Part 1, you will have hands-on experience with:

Important data analysis libraries
Data pre-processing
Exploratory data analysis

Part 2 will cover data visualization and building a predictive model.

Data scientists and analysts spend most of their time on data pre-processing and visualization. Model building is much easier. In these guides, we will use New York City Airbnb Open Data. We will predict the price of a rental and see how close our prediction is to the actual price. Download the data here.

Important Data Analysis Libraries

What makes Python useful for data analysis? Its ecosystem offers open-source packages and libraries that are widely used to crunch data. Let's learn more about them.

Fundamental Scientific Computing

NumPy: The name stands for Numerical Python. This library provides fast n-dimensional arrays along with random number generation, linear algebra, and Fourier transforms.
SciPy: The name stands for Scientific Python. This library contains high-level modules for science and engineering. You can perform linear algebra, optimization, and fast Fourier transforms. SciPy is built on NumPy.
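
As a rough illustration of the three NumPy capabilities listed above, the following sketch draws random numbers, solves a small linear system, and finds the dominant frequency of a sine wave. The matrix, the seed, and the 5 Hz signal are made-up values chosen only for this example.

```python
import numpy as np

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

# Linear algebra: solve the system A @ x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)

# Fourier transform: 64 samples over one second of a 5 Hz sine wave,
# so frequency bin k of the real FFT corresponds to k Hz.
t = np.arange(64) / 64.0
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))
dominant_bin = int(np.argmax(spectrum))

print(samples.shape)
print(x)
print(dominant_bin)
```

SciPy offers similar routines (for example in scipy.linalg and scipy.fft) on top of the same NumPy arrays.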

Data Manipulation and Visualization

pandas: In data analysis and machine learning, pandas is used to hold tabular data in DataFrames. The package can read data from many sources, such as CSV, Excel, plain text, JSON, and SQL databases.
Matplotlib: This library is used for plotting and visualizing data. You can plot histograms, graphs, line plots, heatmaps, and a lot more. It can be embedded in GUI toolkits.
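
To make the pandas description concrete, here is a minimal sketch that reads CSV text into a DataFrame. The rows are hypothetical and are held in memory so the example runs without a file; in practice a file path would be passed to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,room_type,price
Cozy loft,Entire home/apt,150
Sunny room,Private room,70
Shared bunk,Shared room,35
"""

# read_csv accepts a path or any file-like object.
listings = pd.read_csv(io.StringIO(csv_text))

print(listings.shape)            # (rows, columns)
print(listings.dtypes)           # column types inferred by pandas
print(listings["price"].mean())  # average of the numeric column
```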

Machine Learning

scikit-learn: This is a free machine learning library built on NumPy, SciPy, and Matplotlib. It contains efficient tools for statistical model building and can run various classification, regression, and clustering algorithms. It integrates well with pandas when working on DataFrames.
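
The typical scikit-learn workflow (split the data, fit a model, score it) can be sketched as follows. The data here is synthetic and follows y = 2x + 1 exactly, which is an assumption made purely so the result is easy to check; it is not the Airbnb model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data following y = 2x + 1 exactly (illustrative only).
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

slope = model.coef_[0]
intercept = model.intercept_
r2 = model.score(X_test, y_test)  # R^2 on the held-out rows

print(slope, intercept, r2)
```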

Importing Libraries and Loading the Data

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# List the files in the data directory
import os
for dirname, _, filenames in os.walk('nyc_airbnb'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter/IPython magic that renders plots inline; remove it in a plain script
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')  # hide library warnings in the notebook output
import geopandas as gpd  # pip install geopandas

from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

sns.set_style('darkgrid')
    

Exploratory Data Analysis (EDA)

In data analysis, EDA is used to get a better understanding of the data. Looking at the data, questions arise, such as: How many rows and columns are there? Is the data numeric? What are the names of the features (columns)? Are there missing values, or text and symbols that do not belong in the data?

The shape attribute and the info() method answer most of these questions. The head() method displays the first five rows of the dataframe, and the tail() method displays the last five. The describe() method gives a statistical summary of the numeric columns. To split the data into groups by specific criteria, we will use the groupby() method.
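
The calls described above can be tried on a small dataframe before touching the real data. This is a minimal sketch: the rows are hypothetical, and only the column names are borrowed from the Airbnb data set.

```python
import pandas as pd

# Hypothetical toy listings; the column names mirror the Airbnb data set.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Queens", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Private room", "Entire home/apt"],
    "price": [200, 80, 120, 60, 150],
})

n_rows, n_cols = toy.shape   # shape is an attribute, not a method
toy.info()                   # column names, dtypes, non-null counts
print(toy.head(2))           # first two rows
print(toy.tail(2))           # last two rows
print(toy.describe())        # summary statistics of the numeric columns

# groupby splits the rows by a key and aggregates each group.
mean_price = toy.groupby("neighbourhood_group")["price"].mean()
print(mean_price)
```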

First, let's read our data.

# A forward slash in the path works on Windows, macOS, and Linux
data = pd.read_csv('nyc_airbnb/AB_NYC_2019.csv')

print('Number of features: %s' %data.shape[1])
print('Number of examples: %s' %data.shape[0])
    

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames instead
pd.concat([data.head(), data.tail()])

data.info()

data.describe()

Evaluation of Data

Let's start by finding the hosts and neighbourhood groups with the most listings.

          # Evaluation_1-top_3_hosts

top_3_hosts = (pd.DataFrame(data.host_id.value_counts())).head(3)
top_3_hosts.columns=['Listings']
top_3_hosts['host_id'] = top_3_hosts.index
top_3_hosts.reset_index(drop=True, inplace=True)
top_3_hosts
    

# Evaluation_2-top_3_neighbourhood_groups

top_3_neigh = pd.DataFrame(data['neighbourhood_group'].value_counts().head(3))
top_3_neigh.columns=['Listings']
top_3_neigh['Neighbourhood Group'] = top_3_neigh.index
top_3_neigh.reset_index(drop=True, inplace=True)
top_3_neigh
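
Both evaluations above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; the function name top_n_counts and the toy host IDs are hypothetical and not part of the original guide.

```python
import pandas as pd

def top_n_counts(frame, column, n=3):
    """Return the n most frequent values of `column` with their counts."""
    counts = frame[column].value_counts().head(n)
    # Name the index after the column, then turn it into a regular column.
    return counts.rename_axis(column).reset_index(name="Listings")

# Hypothetical host IDs: host 4 has four listings, host 1 three, host 2 two.
toy = pd.DataFrame({"host_id": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]})
top_hosts = top_n_counts(toy, "host_id")
print(top_hosts)
```

With the real data, the same helper could be called as top_n_counts(data, 'host_id') or top_n_counts(data, 'neighbourhood_group').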
    

A word cloud shows the most frequent words in a piece of text. The larger the word, the more frequently it occurs. Here we build one from the neighbourhood column. Start by installing the wordcloud library (pip install wordcloud).

from wordcloud import WordCloud

# Join all neighbourhood names into one string and build the cloud from it
wordcloud = WordCloud(background_color='white').generate(" ".join(data.neighbourhood))
plt.figure(figsize=(15,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('neighbourhood.png')
plt.show()
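
The sizes in a word cloud come from word frequencies. The following sketch counts words with the standard library to show the idea; it uses hypothetical neighbourhood values and plain whitespace splitting, which is a simplification and not the tokenization the wordcloud library applies internally.

```python
from collections import Counter

# Hypothetical values standing in for data.neighbourhood.
neighbourhoods = [
    "Harlem", "Williamsburg", "Harlem",
    "Bedford-Stuyvesant", "Williamsburg", "Harlem",
]

# Build one long string, as in the word cloud code, then count the words.
text = " ".join(neighbourhoods)
word_counts = Counter(text.split())

# The most frequent words would be drawn largest.
print(word_counts.most_common(2))
```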
    

Data Cleaning

The code below cleans our raw data. We have to prepare the data before visualizing it and making predictions, and this is a significant step in the data analysis workflow. Here we will use the pandas library, specifically the drop, isnull, fillna, and transform methods.

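# Drop identifier and date columns that are not needed for the analysis
data.drop(['id', 'host_id', 'host_name', 'last_review'], axis=1, inplace=True)

# Count the missing values in each remaining column
data.isnull().sum()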

There are different ways of filling missing values. The most common practice is to fill with either the mean or the median of the variable. We will compare the mean and the median and inspect the distribution to decide which fits better.

A skewed data distribution has a long tail to either the right (positively skewed) or the left (negatively skewed). For example, say we want to summarize the income of a state, which is not distributed evenly. A handful of people earning significantly more than the average will produce outliers (values that lie outside the typical range) in the dataset. Outliers can seriously distort an analysis. In such cases, the median income will be closer than the mean to the middle-class (majority) income.

The mean is a good summary when the data is roughly symmetric and free of extreme outliers.
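
The income example can be checked with a few lines of pandas. The figures below are made up for illustration: seven ordinary incomes and one extreme earner.

```python
import pandas as pd

# Hypothetical yearly incomes (in thousands); the last value is an outlier.
incomes = pd.Series([30, 32, 35, 38, 40, 42, 45, 1000])

mean_income = incomes.mean()      # pulled upward by the outlier
median_income = incomes.median()  # stays near the typical income
skewness = incomes.skew()         # positive for a long right tail

print(mean_income, median_income, skewness)
```

The mean lands far above every ordinary income, while the median stays in the middle of them, which is why the median is the safer fill value for skewed data.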

# Keep only the rows where reviews_per_month is present
data_check_distrib = data.dropna(subset=['reviews_per_month'])

{"Mean": np.nanmean(data.reviews_per_month),
 "Median": np.nanmedian(data.reviews_per_month),
 "Standard Dev": np.nanstd(data.reviews_per_month)}
    

The mean is greater than the median, which hints at a right-skewed distribution. Let's plot the distribution to check.

          # plot a histogram 
plt.hist(data_check_distrib.reviews_per_month,  bins=50)
plt.title("Distribution of reviews_per_month")
plt.xlim((min(data_check_distrib.reviews_per_month), max(data_check_distrib.reviews_per_month)))
    

The distribution is right-skewed, so the median is the better fill value. Let's fill the missing values.

def impute_median(series):
    """Fill missing values in a Series with that Series' median."""
    return series.fillna(series.median())
    

data['reviews_per_month'] = data['reviews_per_month'].transform(impute_median)
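
The imputation step can be tried on a small example. This sketch uses hypothetical values and also shows a group-wise variant, where each room type is filled with its own median; the group-wise version is an illustration of what transform is useful for, not something the guide itself does.

```python
import numpy as np
import pandas as pd

def impute_median(series):
    """Fill missing values in a Series with that Series' median."""
    return series.fillna(series.median())

# Hypothetical data with one missing value per room type.
toy = pd.DataFrame({
    "room_type": ["Private", "Private", "Private", "Entire", "Entire", "Entire"],
    "reviews_per_month": [1.0, np.nan, 3.0, 0.5, 1.5, np.nan],
})

# Overall median of the known values (0.5, 1.0, 1.5, 3.0) is 1.25.
filled_overall = toy["reviews_per_month"].transform(impute_median)

# Group-wise: each room type is filled with the median of its own rows.
filled_by_group = (
    toy.groupby("room_type")["reviews_per_month"].transform(impute_median)
)

print(filled_overall.tolist())
print(filled_by_group.tolist())
```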

Correlation Matrix Plot

For a given set of features, the correlation matrix shows the correlation coefficient between every pair of features, that is, how strongly the two move together. Each variable is compared with every other variable. The diagonal elements are always 1 because a variable is perfectly correlated with itself. An excellent way to check correlations among features is to visualize the correlation matrix as a heatmap.
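
A tiny example makes the structure of the matrix easier to see. The sketch below uses made-up columns: y rises with x, z falls as x rises, and a text column is excluded because correlation is only defined for numeric data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with x
    "z": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as x rises
    "label": list("abcde"),           # text column, excluded below
})

# Keep the numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()
print(corr)
```

The diagonal is 1, the x and y entry is close to 1, and the x and z entry is close to -1.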

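# No missing values should remain after the median imputation above; this is a safeguard
data['reviews_per_month'] = data['reviews_per_month'].fillna(0)

# Correlation is only defined for numeric columns, so select those first
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(data.select_dtypes(include='number').corr(), annot=True, linewidths=5, fmt='.1f', ax=ax, cmap='Reds')
plt.show()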
    

Notice the shades. The darker the shade, the stronger the positive correlation. Accordingly, number_of_reviews is highly correlated with reviews_per_month, which is quite logical. We also find some correlation between availability and each of price, number_of_reviews, and longitude.

Conclusion

In this guide, we've looked at exploratory data analysis and data pre-processing. In Part 2, we will move on to visualizing and building a machine learning model to predict the price of Airbnb rentals.

Feel free to contact me with any questions at Codealphabet.

Gaurav S.

Gaurav is a data scientist with a strong background in computer science and mathematics. He has extensive research experience in data structures, statistical data analysis, and mathematical modeling. With a solid background in web development, he works with Python, Java, Django, HTML, Struts, Hibernate, Vaadin, web scraping, Angular, and React. His data science skills include Python, Matplotlib, TensorFlow, pandas, NumPy, Keras, CNN, ANN, NLP, recommenders, and predictive analysis. He has built systems that use both basic machine learning algorithms and complex deep neural networks. He has worked on many data science projects, including product recommendation, user sentiment analysis, Twitter bots, information retrieval, predictive analysis, data mining, image segmentation, SVMs, and random forests.
