Introduction


In Part 1 of this series on data analysis in Python, we discussed data preparation. In this guide, we will focus on data visualization and on building a machine learning model. Both guides use the New York City Airbnb Open Data. If you didn't read Part 1, check it out to see how we pre-processed the data.

By the end of this guide, you will have hands-on experience with:

- Data Visualization
- Building a Predictive Model

Let's get started!

It has been said that we understand data faster when we visualize it. In the following code, we will work with six types of plots: histograms, pie charts, bar plots, box plots, a line plot, and scatter plots.

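```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of every numeric column; `data` is the pre-processed DataFrame from Part 1
fig = plt.figure(figsize=(15, 10))
ax = fig.gca()
data.hist(ax=ax)
plt.show()
```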

```python
# Share of listings in each neighbourhood group
counts = data.neighbourhood_group.value_counts()
labels = counts.index
sizes = counts.values
colors = ['lightblue', 'beige', 'lightgreen', 'orange', 'cyan']
explode = [0.1, 0.0, 0.3, 0.5, 0.0]  # offset of each wedge from the centre

plt.figure(0, figsize=(7, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
plt.title('Neighbourhood Group', color='black', fontsize=15)
plt.show()
```

```python
# neighbourhood_group vs. price; bars are ordered by median price
# (sns.barplot itself draws the mean of each group by default)
result = data.groupby('neighbourhood_group')['price'].median().reset_index().sort_values('price')
sns.barplot(x='neighbourhood_group', y='price', data=data, palette=colors, order=result['neighbourhood_group'])
plt.xticks(rotation=45)
plt.show()
```

```python
# neighbourhood_group vs. availability_365; boxes are ordered by median availability
result = data.groupby('neighbourhood_group')['availability_365'].median().reset_index().sort_values('availability_365')
sns.boxplot(x='neighbourhood_group', y='availability_365', data=data, order=result['neighbourhood_group'])
plt.show()
```

```python
# Share of listings for each room type
counts = data.room_type.value_counts()
labels = counts.index
sizes = counts.values
colors = ['lightblue', 'pink', 'beige']
explode = [0, 0.05, 0.5]  # offset of each wedge from the centre

plt.figure(0, figsize=(7, 7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
plt.title('Room-Type', color='brown', fontsize=15)
plt.show()
```

```python
# room_type vs. price; bars are ordered by median price
result = data.groupby('room_type')['price'].median().reset_index().sort_values('price')
sns.barplot(x='room_type', y='price', data=data, order=result['room_type'])
plt.show()
```

```python
# room_type vs. availability_365; boxes are ordered by median availability
result = data.groupby('room_type')['availability_365'].median().reset_index().sort_values('availability_365')
sns.boxplot(x='room_type', y='availability_365', data=data, order=result['room_type'])
plt.show()
```

```python
sns.lineplot(x='availability_365', y='price', data=data)
plt.show()
```

```python
plt.figure(figsize=(10, 6))
# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.scatterplot(x='longitude', y='latitude', hue='neighbourhood_group', data=data)
plt.show()
```

```python
plt.figure(figsize=(10, 6))
# Same map of listings, coloured by how many days a year each one is available
sns.scatterplot(x='longitude', y='latitude', hue='availability_365', data=data)
plt.show()
```

**Linear regression** is widely used for forecasting. Both linear regression (LR) and Random Forest Regression (RFR) are supervised learning models. LR is used for *regression*, while random forests come in both *classification* and *regression* variants. Compared to RFR, LR is simple and easy to implement. That simplicity has a cost: LR can only capture linear relationships, and with many correlated features it can still overfit, in which case *regularization* helps. RFR averages many trees, which acts as a form of *inbuilt regularization*, so we can focus more on the model. Learn more about LR and its syntax here.

Mathematically, simple LR (a single input variable) is written as:
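For reference, a common textbook form of simple linear regression is given below. The symbols are a conventional choice and may differ from the notation this guide originally used:

$$
y = \beta_0 + \beta_1 x + \varepsilon
$$

Here $y$ is the value being predicted (for this dataset, the price), $x$ is the input feature, $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\varepsilon$ is the error term. With several features $x_1, \dots, x_p$, the same idea extends to $y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon$.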

*Random Forest* is an *advanced regressor*. It uses the *ensemble* technique for prediction. A model composed of many models is called an *ensemble model*. There are two main types of ensemble technique: *bagging* and *boosting*. *RFR is a bagging technique*. The trees in a random forest are built in parallel, and they don't interact with each other while they are being built. Once all the trees are built, their outputs are combined: by voting for classification, or by averaging for regression.
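To make the averaging step concrete, here is a minimal sketch that checks a forest's regression output against the mean of its individual trees. It uses synthetic data rather than the Airbnb dataset, and the names `X_demo`, `y_demo`, and `forest` are made up for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not the Airbnb dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_ and can predict on its own
per_tree = np.array([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# For regression, the forest's prediction is the mean over its trees
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_prediction))
```

Because each tree sees a different bootstrap sample of the rows, the individual predictions differ, and averaging them smooths out the errors of any single tree.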

This is what the RFR tree looks like:
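If you want to inspect a tree yourself, one option is a plain-text rendering. The sketch below fits a deliberately tiny forest on synthetic data and prints its first tree with scikit-learn's `export_text`. The data and the feature names `feature_a` and `feature_b` are hypothetical and stand in for real columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_text

# Synthetic data (an assumption for illustration, not the Airbnb dataset)
rng = np.random.default_rng(1)
X_small = rng.uniform(0, 10, size=(100, 2))
y_small = 2 * X_small[:, 0] - X_small[:, 1]

# A very small, shallow forest so that the printed tree stays readable
small_forest = RandomForestRegressor(n_estimators=3, max_depth=2, random_state=1)
small_forest.fit(X_small, y_small)

# Render the first tree of the forest as indented text
first_tree = small_forest.estimators_[0]
tree_text = export_text(first_tree, feature_names=["feature_a", "feature_b"])
print(tree_text)
```

Each indented line is a split on one feature, and the leaves hold the value that the tree predicts for rows that reach them.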

To avoid type errors when fitting (the models expect numeric input with no missing values), we will prepare the dataset for prediction.

```python
import pandas as pd
data_pred = pd.read_csv(r'nyc_airbnb\AB_NYC_2019.csv')
```

```python
# Drop text columns we won't model and treat missing review rates as zero
data_pred.drop(['name', 'host_name', 'last_review'], inplace=True, axis=1)
data_pred['reviews_per_month'] = data_pred['reviews_per_month'].fillna(0)
```

As we saw earlier, `neighbourhood_group`, `neighbourhood`, and `room_type` are in text form. Our model cannot carry out its prediction with text data, so to make the data understandable to the model we use the `LabelEncoder` class from sklearn.

```python
from sklearn import preprocessing

# Replace each text category with an integer code, one column at a time
le = preprocessing.LabelEncoder()
for column in ['neighbourhood_group', 'neighbourhood', 'room_type']:
    data_pred[column] = le.fit_transform(data_pred[column])
```
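To see what label encoding does to a text column, here is a small self-contained sketch. The list `boroughs` is a made-up sample, not rows from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# A made-up sample of category values (hypothetical, for illustration only)
boroughs = ["Manhattan", "Brooklyn", "Bronx", "Brooklyn", "Manhattan"]

encoder = LabelEncoder()
codes = encoder.fit_transform(boroughs)

# Classes are sorted alphabetically and numbered from 0
print(list(encoder.classes_))  # ['Bronx', 'Brooklyn', 'Manhattan']
print(list(codes))             # [2, 1, 0, 1, 2]

# The mapping can be reversed to recover the original labels
print(list(encoder.inverse_transform(codes)))
```

The integer codes follow alphabetical order, so they imply an ordering that has no real meaning. Tree-based models such as random forests tend to cope with this better than linear models do.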


The model learns from the *train* dataset, which contains the known output. Its predictions are then checked on the *test* dataset, which it has not seen during training. The data is commonly *split* in a 70:30 or 80:20 ratio. We use 80:20 for this model.
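As a quick illustration of an 80:20 split, the sketch below runs `train_test_split` on a toy array of 100 rows. The arrays `X_toy` and `y_toy` are invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy rows with 2 features each (hypothetical data)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# test_size=0.2 keeps 20% of the rows aside for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=101)

print(X_tr.shape, X_te.shape)  # (80, 2) (20, 2)

# No row ends up in both sets
print(set(y_tr) & set(y_te))   # set()
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate the model on the same test rows.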

We will test its reliability using an R2 score.

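```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lm = LinearRegression()

# Features: every column except the target (price) and longitude
X = data_pred.drop(['price', 'longitude'], axis=1)
y = data_pred['price']

# 80:20 hold-out split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

lm.fit(X_train, y_train)
```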

```python
predictions = lm.predict(X_test)
```

```python
import numpy as np
from sklearn import metrics

# Evaluation metrics for the linear model on the hold-out test set
mae = metrics.mean_absolute_error(y_test, predictions)
mse = metrics.mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, predictions)

print('MAE (Mean-Absolute-Error): %s' % mae)
print('MSE (Mean-Squared-Error): %s' % mse)
print('RMSE (Root-MSE): %s' % rmse)
print('R2 score: %s' % r2)
```
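To show what the R2 score measures, the sketch below computes it by hand on four made-up prices and compares the result with `r2_score`. The arrays `y_true` and `y_hat` are invented for the example.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted prices (hypothetical values)
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 260.0])

# Residual sum of squares: the error the model leaves unexplained
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: the error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean, and negative values mean it does worse than that.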

The R2 score from the hold-out method is poor for this dataset. A score of about 0.09 means the linear model explains only around 9% of the variance in price, which points to *underfitting*. An estimate based on a single train/test split can also be unstable. This problem is not permanent, though. We can get a more reliable picture by repeating the fit and evaluation on different subsets of the data, which is what the cross-validation method does.

Cross-validation (CV) is a procedure to estimate the skill of a machine learning model. It has a parameter K denoting the number of sections, or folds, that the data is divided into. Each fold is used as the testing set exactly once, while the remaining folds are used for training. Once the process is complete, we can summarize and evaluate the metrics across the folds.
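The fold rotation can be sketched on a toy array of ten samples. The names `samples` and `kfold` are made up for this example. The settings mirror a 5-fold, shuffled split.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their index (hypothetical data)
samples = np.arange(10)
kfold = KFold(n_splits=5, shuffle=True, random_state=27)

test_counts = np.zeros(len(samples), dtype=int)
fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # Count how often each sample lands in a test fold
    test_counts[test_idx] += 1
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train={train_idx}, test={test_idx}")

# Every sample is tested exactly once across the five folds
print(test_counts)
```

With 10 samples and 5 folds, each round trains on 8 samples and tests on the remaining 2.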

```python
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
```

We will use 5-fold CV with a `RandomForestRegressor` model, stored in the variable `random_forest`. `cross_val_score` fits the model on each fold and returns the CV scores.

```python
kf = KFold(n_splits=5, shuffle=True, random_state=27)
# criterion is left at its default (mean squared error); recent scikit-learn releases reject the 'mse' spelling
random_forest = RandomForestRegressor(n_estimators=100, max_depth=20, min_samples_split=2)
cv_score = cross_val_score(random_forest, X_train, y_train, scoring='r2', cv=kf)
cv_score
```

```python
random_forest.fit(X_train, y_train)
```

```python
predi = random_forest.predict(X_test)
```

```python
# Evaluation metrics for the random forest on the same hold-out test set
mae = metrics.mean_absolute_error(y_test, predi)
mse = metrics.mean_squared_error(y_test, predi)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, predi)

print('MAE (Mean-Absolute-Error): %s' % mae)
print('MSE (Mean-Squared-Error): %s' % mse)
print('RMSE (Root-MSE): %s' % rmse)
print('R2 score: %s' % r2)
```

The R2 score is much more stable, and the MSE is also lower than what we got for the hold-out method. *Remember that CV costs more compute than a single hold-out split, so reach for it when your hold-out result underperforms or looks unstable.* Now let's check the predicted values against the actual values.

```python
error2 = pd.DataFrame({'Actual-Values': np.array(y_test).flatten(), 'Predicted-Values': predi.flatten()})
error2.head(10)  # try the same comparison with the linear model's predictions
```

```python
print(f'Model_Accuracy: {random_forest.score(X, y)}')
```

```
Model_Accuracy: 0.7992861793429817
```

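A score of around 0.80 is not bad, and it can be improved by further tuning the model. Note that for a regressor, `score` returns the R2 value rather than a classification accuracy. Here it is computed on the full dataset, including the rows the model was trained on, so it is an optimistic figure. Change the number of folds and other parameters to see how the results change. Cleaning the data should be your first step in solving any analysis problem. It is also essential to know the features and the relations between them.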
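One way to tune the model systematically is a grid search over a few parameter values. This is a minimal sketch and not a tuned recipe for the Airbnb data. It runs `GridSearchCV` on synthetic data, and the grid values and the names `X_syn`, `y_syn`, and `search` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data (an assumption for illustration, not the Airbnb dataset)
rng = np.random.default_rng(27)
X_syn = rng.uniform(0, 10, size=(150, 4))
y_syn = 5 * X_syn[:, 0] + X_syn[:, 1] ** 2 + rng.normal(0, 1, size=150)

# Candidate settings to compare; these values are illustrative only
param_grid = {"n_estimators": [20, 50], "max_depth": [5, 20]}

search = GridSearchCV(
    RandomForestRegressor(random_state=27),
    param_grid,
    scoring="r2",
    cv=KFold(n_splits=3, shuffle=True, random_state=27),
)
search.fit(X_syn, y_syn)

# Best parameter combination and its mean cross-validated R2
print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data you would pass `X_train` and `y_train` instead of the synthetic arrays and keep the test set aside for a final check.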

"Practice makes perfect!"

I suggest working on more such datasets and seeing what insights you can find. Try to fit the data in the models mentioned above. I also suggest reading the documentation links provided in the guide. Explore data evaluation and plots in addition to those mentioned above, and in no time data analysis will be at your fingertips!

Feel free to contact me with any questions at Codealphabet.
