Gaurav Singhal

Building Your First Python Analytics Solution: Part 2

  • Mar 5, 2020
  • 11 Min read
  • 1,167 Views

Introduction

In Part 1 of this series on data analysis in Python, we discussed data preparation. In this guide, we will focus on data visualization and building a machine learning model. Both guides use the New York City Airbnb Open Data. If you haven't read Part 1, check it out to see how we pre-processed the data.

By the end of this guide, you will have hands-on experience with:

  • Data Visualization
  • Building a Predictive Model

Let's get started!

Visualizations

It has been said that we understand data faster when we visualize it. In the following code, we will work on six types of plots: histogram, pie chart, bar plot, box plot, line plot, and scatter plot.

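import matplotlib.pyplot as plt

# `data` is the DataFrame prepared in Part 1
# Histograms of every numeric column, drawn on one large figure
fig = plt.figure(figsize=(15, 10))
ax = fig.gca()
data.hist(ax=ax)
plt.show()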
python

# Share of listings in each neighbourhood group
group_counts = data.neighbourhood_group.value_counts()
labels = group_counts.index
sizes = group_counts.values
colors = ['lightblue', 'beige', 'lightgreen', 'orange', 'cyan']
explode = [0.1, 0.0, 0.3, 0.5, 0.0]  # offset of each wedge from the centre

plt.figure(0, figsize=(7, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
plt.title('Neighbourhood Group', color='black', fontsize=15)
plt.show()
python

import seaborn as sns

# neighbourhood_group vs. price
# Median price per group is used to order the bars;
# the bar heights show the mean price (seaborn's default estimator)
result = data.groupby('neighbourhood_group')['price'].median().reset_index().sort_values('price')
sns.barplot(x='neighbourhood_group', y='price', data=data, palette=colors, order=result['neighbourhood_group'])
plt.xticks(rotation=45)
plt.show()
python

# neighbourhood_group vs. availability_365
# Order the boxes by median availability
result = data.groupby('neighbourhood_group')['availability_365'].median().reset_index().sort_values('availability_365')
sns.boxplot(x='neighbourhood_group', y='availability_365', data=data, order=result['neighbourhood_group'])
plt.show()
python

# Share of listings of each room type
room_counts = data.room_type.value_counts()
labels = room_counts.index
sizes = room_counts.values
colors = ['lightblue', 'pink', 'beige']
explode = [0, 0.05, 0.5]  # offset of each wedge from the centre

plt.figure(0, figsize=(7, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
plt.title('Room-Type', color='brown', fontsize=15)
plt.show()
python

# room_type vs. price
# Order the bars by median price
result = data.groupby('room_type')['price'].median().reset_index().sort_values('price')
sns.barplot(x='room_type', y='price', data=data, order=result['room_type'])
plt.show()
python

# room_type vs. availability_365
# Order the boxes by median availability
result = data.groupby('room_type')['availability_365'].median().reset_index().sort_values('availability_365')
sns.boxplot(x='room_type', y='availability_365', data=data, order=result['room_type'])
plt.show()
python

# Price as a function of yearly availability
sns.lineplot(x='availability_365', y='price', data=data)
plt.show()
python

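# Map of listings coloured by neighbourhood group
# x and y are passed by keyword; newer seaborn releases accept them only that way
plt.figure(figsize=(10, 6))
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=data)
plt.show()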
python

# Map of listings coloured by yearly availability
plt.figure(figsize=(10, 6))
sns.scatterplot(x='longitude', y='latitude', hue='availability_365', data=data)
plt.show()
python

Predictive Models

Linear regression is widely used for forecasting. Both linear regression (LR) and random forest regression (RFR) are supervised learning models that predict a continuous target. Compared to RFR, LR is simple and easy to implement. But that simplicity comes at a cost: LR can only capture linear relationships, and with many correlated features it can overfit unless regularization is applied. RFR averages many trees, which acts as built-in protection against overfitting, so we can focus more on the model. Learn more about LR and its syntax here.

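Mathematically, simple (single-variable) LR is written as:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

where $y$ is the target, $x$ is the feature, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term.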
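As a rough illustration of simple LR, the sketch below fits a straight line y = b0 + b1 * x with the closed-form least-squares solution, using only the standard library. The helper name `fit_simple_lr` and the toy numbers are assumptions made for this example; in the rest of the guide, scikit-learn does this work for us.

```python
# Minimal sketch of simple (one-feature) linear regression, y = b0 + b1 * x.
# The function name and the toy data are illustrative, not part of the guide's dataset.

def fit_simple_lr(xs, ys):
    """Return (intercept, slope) that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data generated from y = 1 + 2x, so the fit should recover those values
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_lr(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With several features, the same idea generalizes to one coefficient per column, which is what `LinearRegression` estimates later in this guide.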

Random forest is a more advanced regressor that uses the ensemble technique for prediction. A model composed of many models is called an ensemble model. There are two main types of ensemble methods: bagging and boosting. RFR is a bagging technique: each tree is trained independently on a bootstrap sample of the data, so the trees can be built in parallel and do not interact with each other while they are being built. Once all the trees are built, their predictions are combined by voting (for classification) or averaging (for regression).

This is what the RFR tree looks like:

[Figure: diagram of an RFR tree]
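To make the bagging idea more concrete, here is a small from-scratch sketch: each "tree" is a one-split regression stump trained on its own bootstrap sample, and the ensemble prediction is the average of the stumps. All function names and the toy data are assumptions for illustration. A real random forest grows deeper trees and also samples features, so treat this as a simplified picture rather than how `RandomForestRegressor` is implemented.

```python
import random


def fit_stump(points):
    """Fit a one-split regression tree (a "stump") to (x, y) pairs."""
    points = sorted(points)
    mean_y = sum(y for _, y in points) / len(points)
    # Fallback when every x is identical: always predict the mean
    best_sse, best_stump = float("inf"), (float("inf"), mean_y, mean_y)
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if sse < best_sse:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best_sse, best_stump = sse, (threshold, left_mean, right_mean)
    return best_stump


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(points, n_models=25, seed=0):
    """Train each stump independently on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_models)]


def predict_bagged(models, x):
    """Regression ensembles average the individual predictions."""
    return sum(predict_stump(m, x) for m in models) / len(models)


# Toy step-shaped data: low target for x < 5, high target otherwise
points = [(x, 10.0 if x < 5 else 50.0) for x in range(10)]
models = fit_bagged_stumps(points)
low = predict_bagged(models, 1)
high = predict_bagged(models, 8)
print(low, high)
```

Because no stump depends on any other, the loop in `fit_bagged_stumps` could run in parallel, which is the property the text describes.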

To avoid data-type errors during training, we will reload the dataset and prepare it for prediction.

data_pred = pd.read_csv(r'nyc_airbnb\AB_NYC_2019.csv')
python
# Prepare the data: drop columns the model will not use and treat missing review rates as 0
data_pred.drop(['name', 'host_name', 'last_review'], inplace=True, axis=1)
data_pred['reviews_per_month'] = data_pred['reviews_per_month'].fillna(0)
python

As we saw earlier, neighbourhood_group, neighbourhood, and room_type are stored as text. Our model cannot carry out its prediction with text data, so to make the data understandable to the model, we use the LabelEncoder class from sklearn.

from sklearn import preprocessing

# Encode each text column as integer labels
for column in ['neighbourhood_group', 'neighbourhood', 'room_type']:
    le = preprocessing.LabelEncoder()
    data_pred[column] = le.fit_transform(data_pred[column])
python
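For intuition about what label encoding does, the sketch below reproduces the core idea in plain Python: collect the distinct labels, sort them, and number them from zero. The helper `fit_label_encoder` and the sample values are assumptions for illustration, not the scikit-learn implementation.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# A handful of made-up room_type values
room_types = ['Private room', 'Entire home/apt', 'Private room', 'Shared room']
mapping = fit_label_encoder(room_types)
encoded = [mapping[value] for value in room_types]
print(mapping)
print(encoded)
```

Note that the integer codes imply an order that the categories do not really have. Tree-based models usually cope with this, while linear models can be misled by it, so one-hot encoding is a common alternative there.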

Hold-out

In the hold-out method, the data is split into a training set and a test set. The model learns on the training set, which contains known outputs, and its predictions are evaluated on the test set. The data is usually split in a 70:30 or 80:20 ratio. We have set the split to 80:20 for this model.
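A hold-out split boils down to shuffling the row indices and cutting the list in two. The following sketch shows that idea with the standard library only; `hold_out_split` is a hypothetical helper written for this example, and it is not the exact algorithm that scikit-learn's `train_test_split` uses.

```python
import random


def hold_out_split(rows, test_size=0.2, seed=101):
    """Shuffle the rows reproducibly and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Ten dummy rows split 80:20
rows = list(range(10))
train, test = hold_out_split(rows)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state` in scikit-learn: the split is random but repeatable, so results can be compared between runs.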

We will test its reliability using an R2 score.

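from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lm = LinearRegression()

# Features: every column except the target (price) and longitude
X = data_pred.drop(columns=['price', 'longitude'])
y = data_pred['price']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

lm.fit(X_train, y_train)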
python

# Predict prices for the held-out test rows
predictions = lm.predict(X_test)
python
import numpy as np
from sklearn import metrics

# Evaluation metrics for the linear model on the test set
mae = metrics.mean_absolute_error(y_test, predictions)
mse = metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, predictions)

print('MAE (Mean-Absolute-Error): %s' % mae)
print('MSE (Mean-Squared-Error): %s' % mse)
print('RMSE (Root-MSE): %s' % rmse)
print('R2 score: %s' % r2)
python

The linear model did not perform well with the hold-out method. An R2 score of 0.09 means the model explains only about 9% of the variation in price, which points to underfitting. But this problem is not permanent. We can get a more reliable estimate by repeating the evaluation multiple times on different subsets of the data, so next we will check out the cross-validation method.
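To see why an R2 score close to zero is a poor result, it helps to compute the metrics by hand. The sketch below implements MAE, MSE, RMSE, and R2 from their definitions; the function name and the price lists are made up for illustration. R2 compares the model's squared error with that of a baseline that always predicts the mean, so a score of 0 means the model is no better than that baseline.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and R2 from their definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                  # error of the model
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # error of "always predict the mean"
    r2 = 1 - ss_res / ss_tot
    return {'mae': mae, 'mse': mse, 'rmse': rmse, 'r2': r2}


# Made-up prices: one baseline that always predicts the mean, one closer guess
actual = [100.0, 150.0, 200.0, 250.0]
always_mean = [175.0] * 4
close = [110.0, 140.0, 210.0, 240.0]

baseline = regression_metrics(actual, always_mean)
better = regression_metrics(actual, close)
print(baseline['r2'], better['r2'])
```

Seen this way, 0.09 is only slightly better than predicting the average price for every listing.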

Cross-validation (k-fold)

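Cross-validation is a procedure for estimating how well a machine learning model performs on unseen data. K-fold cross-validation (CV) has a parameter k, the number of sections, or folds, that the data is split into. The model is trained k times; each time, a different fold is used as the testing set and the remaining folds form the training set. Once the process is complete, we can summarize the scores across the folds and evaluate the metrics.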
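The fold rotation can be sketched in a few lines of plain Python. The helper `k_fold_indices` below is an assumption for illustration and only mimics the idea behind scikit-learn's `KFold`: shuffle the indices once, cut them into k chunks, and let each chunk take a turn as the test set.

```python
import random


def k_fold_indices(n_samples, n_splits=5, seed=27):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


# Ten samples, five folds: every sample lands in exactly one test fold
folds = k_fold_indices(10)
for train_idx, test_idx in folds:
    print(sorted(test_idx))
```

Each sample is tested exactly once and trained on k - 1 times, which is why the averaged score is less sensitive to one lucky or unlucky split.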

from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
python

We will use 5-fold CV with a RandomForestRegressor model. cross_val_score fits the model on each fold and returns the CV scores.

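kf = KFold(n_splits=5, shuffle=True, random_state=27)
# criterion='squared_error' is the scikit-learn 1.0+ name for what older releases called 'mse'
random_forest = RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=20, min_samples_split=2)
# R2 score of the model on each of the 5 folds
cv_score = cross_val_score(random_forest, X_train, y_train, scoring='r2', cv=kf)
cv_score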
python

# cross_val_score works on copies of the model, so we still fit it on the full training set
random_forest.fit(X_train, y_train)
python

# Predict prices for the held-out test rows
predi = random_forest.predict(X_test)
python
# Evaluation metrics for the random forest on the test set
mae = metrics.mean_absolute_error(y_test, predi)
mse = metrics.mean_squared_error(y_test, predi)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, predi)

print('MAE (Mean-Absolute-Error): %s' % mae)
print('MSE (Mean-Squared-Error): %s' % mse)
print('RMSE (Root-MSE): %s' % rmse)
print('R2 score: %s' % r2)
python

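The R2 score is much more stable, and the MSE is lower than what we got for the hold-out method. Keep in mind that this run also swapped the linear model for a random forest, so the improvement comes from both changes. CV takes longer than a single hold-out split, so it is most worthwhile when a single split underperforms or looks unreliable. Now let's check the predicted values against the actual values.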

# Compare actual and predicted prices side by side
error2 = pd.DataFrame({'Actual-Values': np.array(y_test).flatten(), 'Predicted-Values': predi.flatten()})
error2.head(10)  # pass `predictions` instead of `predi` to inspect the linear model
python

# score() returns the R2 score; X and y here include the rows the model was trained on
print(f'Model_Accuracy: {random_forest.score(X, y)}')
python
Model_Accuracy: 0.7992861793429817

Conclusion

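An R2 score of around 0.80 is not bad, but note that it was computed on the full dataset, including the rows the model was trained on, so it is an optimistic estimate; the test-set score is the fairer measure. The model can be improved by further tuning. Change the number of folds and other parameters to see how the results change. Cleaning the data should be your first step in solving any analysis problem. It is also essential to know the features and the relationships between them.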

"Practice makes perfect!"

I suggest working on more such datasets and seeing what insights you can find. Try to fit the data in the models mentioned above. I also suggest reading the documentation links provided in the guide. Explore data evaluation and plots in addition to those mentioned above, and in no time data analysis will be at your fingertips!

Feel free to contact me with any questions at Codealphabet.
