Visualizing and Building a Machine Learning Model in Python

Mar 5, 2020 • 11 Minute Read


In Part 1 of this series on data analysis in Python, we discussed data preparation. In this guide, we will focus on data visualization and on building a machine learning model. Both guides use the New York City Airbnb Open Data. If you didn't read Part 1, check it out to see how we pre-processed the data.

By the end of this guide, you will have hands-on experience with:

  • Data Visualization
  • Building a Predictive Model

Let's get started!


It has been said that we understand faster when we visualize data. In the following code, we will draw six plots: a pie chart, a bar plot, and a box plot for each of neighbourhood group and room type.

      import matplotlib.pyplot as plt
      import seaborn as sns

      # `data` is the pre-processed DataFrame from Part 1.

      # Plot 1: share of listings per neighbourhood group (pie chart)
      labels = data.neighbourhood_group.value_counts().index
      sizes = data.neighbourhood_group.value_counts().values
      colors = ['lightblue', 'beige', 'lightgreen', 'orange', 'cyan']
      explode = [0.1, 0.0, 0.3, 0.5, 0.0]
      plt.figure(figsize=(7, 7))
      plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
      plt.title('Neighbourhood Group', color='black', fontsize=15)
      plt.show()

      # Plot 2: price per neighbourhood group; bars show the mean (seaborn's default), ordered by median price
      result = data.groupby('neighbourhood_group')['price'].median().reset_index().sort_values('price')
      sns.barplot(x='neighbourhood_group', y='price', data=data, palette=colors, order=result['neighbourhood_group'])
      plt.show()

      # Plot 3: availability per neighbourhood group (box plot)
      sns.boxplot(x='neighbourhood_group', y='availability_365', data=data)
      plt.show()

      # Plot 4: share of listings per room type (pie chart)
      labels = data.room_type.value_counts().index
      sizes = data.room_type.value_counts().values
      colors = ['lightblue', 'pink', 'beige']
      explode = [0, 0.05, 0.5]
      plt.figure(figsize=(7, 7))
      plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True)
      plt.title('Room-Type', color='Brown', fontsize=15)
      plt.show()

      # Plot 5: price per room type, ordered by median price
      result = data.groupby('room_type')['price'].median().reset_index().sort_values('price')
      sns.barplot(x='room_type', y='price', data=data, order=result['room_type'])
      plt.show()

      # Plot 6: availability per room type, ordered by median availability
      result = data.groupby('room_type')['availability_365'].median().reset_index().sort_values('availability_365')
      sns.boxplot(x='room_type', y='availability_365', data=data, order=result['room_type'])
      plt.show()

Predictive Models

Linear regression is widely used for forecasting. Both linear regression (LR) and Random Forest Regression (RFR) are supervised learning models for regression; random forests also have a classification counterpart. Compared to RFR, LR is simple and easy to implement. But that simplicity has a cost: a plain linear model can underfit non-linear relationships, and with many correlated features it can overfit unless regularization is applied. RFR averages many trees, which acts as a built-in form of regularization, so we can focus more on the model. Learn more about LR and its syntax here.

Mathematically, simple LR is written as:
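A common textbook form of simple linear regression, with the symbols chosen here as an assumption rather than taken from the guide, is:

```latex
y = \beta_0 + \beta_1 x + \varepsilon
```

Here y is the target (price in this guide), x is a single feature, \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the error term that the fit tries to keep small.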

Random Forest is a more advanced regressor. It uses the ensemble technique for prediction. A model composed of many models is called an ensemble model. There are two main types of ensemble techniques: bagging and boosting. RFR is a bagging technique. The trees in a random forest are built in parallel, and while they are being built they don't interact with each other. Once all the trees are built, their predictions are combined: by voting for classification, or by averaging for regression.
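The bagging idea can be sketched in a few lines of plain Python. This is a toy illustration, not how scikit-learn implements random forests: the base learner is a one-split "stump" on a single feature, and the helper names (fit_stump, fit_bagged_stumps, predict_bagged) and the tiny dataset are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D data.

    Returns (threshold, left_mean, right_mean) with the lowest squared error.
    """
    best = None
    for t in sorted(set(xs))[1:]:  # candidate thresholds at distinct x values
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to the overall mean
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump independently on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """For regression, the ensemble prediction is the average over models."""
    return mean(predict_stump(m, x) for m in models)


# Toy data: cheap listings for small x, expensive ones for large x
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 13, 30, 32, 31, 33]
models = fit_bagged_stumps(xs, ys)
low, high = predict_bagged(models, 2), predict_bagged(models, 7)
print(low, high)
```

Because each stump sees a different bootstrap sample, no single model's quirks dominate the averaged prediction, which is the variance-reducing effect bagging is used for.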

This is what the RFR tree looks like:

To avoid any float-type error, we will prepare the dataset for prediction.

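      import pandas as pd
      data_pred = pd.read_csv(r'nyc_airbnb\AB_NYC_2019.csv')
      # Drop free-text and date columns that the model cannot use directly
      data_pred.drop(['name', 'host_name', 'last_review'], inplace=True, axis=1)
      # Listings without reviews have no monthly rate; treat it as 0
      data_pred['reviews_per_month'] = data_pred['reviews_per_month'].fillna(value=0)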

As we saw earlier, neighbourhood_group, neighbourhood, and room_type are in text form. Our model cannot carry out its prediction with text data, so to make the data understandable to the model we use the LabelEncoder class from sklearn.

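      from sklearn import preprocessing
      for col in ['neighbourhood_group', 'neighbourhood', 'room_type']:
          # Replace each text label with an integer code
          data_pred[col] = preprocessing.LabelEncoder().fit_transform(data_pred[col])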
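What label encoding does to a text column can be sketched in plain Python. The functions below are hypothetical helpers written for illustration, not part of sklearn; they mirror the sorted class ordering that LabelEncoder exposes through its classes_ attribute.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


def transform(mapping, values):
    """Replace every label with its integer code."""
    return [mapping[v] for v in values]


groups = ["Manhattan", "Brooklyn", "Queens", "Brooklyn", "Bronx"]
mapping = fit_label_encoder(groups)
codes = transform(mapping, groups)
print(mapping)  # {'Bronx': 0, 'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3}
print(codes)    # [2, 1, 3, 1, 0]
```

The codes carry an ordering (Bronx < Brooklyn < ...) that has no real meaning. Tree-based models usually cope with this better than linear models, which read the codes as magnitudes.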


The model learns on the train dataset, which contains the known output. The model's predictions are then checked on the test dataset. The data is commonly split in a 70:30 or 80:20 ratio. We use 80:20 for this model.
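As a rough sketch of what an 80:20 hold-out split does, the plain-Python function below shuffles the row indices with a fixed seed and sets aside the last 20% for testing. The name holdout_split is hypothetical, and sklearn's train_test_split may round and shuffle differently, so treat this as an illustration of the idea only.

```python
import random


def holdout_split(rows, test_size=0.2, seed=101):
    """Shuffle row indices, then reserve a `test_size` share for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = round(len(rows) * test_size)
    n_train = len(rows) - n_test
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train, test = holdout_split(rows)
print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during training.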

We will test its reliability using an R2 score.

      import numpy as np
      from sklearn import metrics
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split

      lm = LinearRegression()

      X = data_pred.drop(['price', 'longitude'], inplace=False, axis=1)
      y = data_pred['price']

      # Hold out 20% of the rows for testing
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
      lm.fit(X_train, y_train)
      predictions = lm.predict(X_test)

      # Evaluation metrics
      mae = metrics.mean_absolute_error(y_test, predictions)
      mse = metrics.mean_squared_error(y_test, predictions)
      rmse = np.sqrt(metrics.mean_squared_error(y_test, predictions))
      r2 = metrics.r2_score(y_test, predictions)

      print('MAE (Mean-Absolute-Error): %s' %mae)
      print('MSE (Mean-Squared-Error): %s' %mse)
      print('RMSE (Root-MSE): %s' %rmse)
      print('R2 score: %s' %r2)

The linear model did not perform well on this dataset with the hold-out method. An R2 score of 0.09 means the model explains only about 9% of the variance in price, which points to underfitting, and a single hold-out split can also give an unstable estimate. This problem is not permanent. We will check out the cross-validation method, which repeats the calculation multiple times on different subsets of the data and so gives a more reliable score.
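To make the four metrics concrete, here is a small plain-Python sketch of their standard definitions. The function name regression_metrics and the toy prices are made up for illustration; the sklearn metrics functions used in this guide compute the same quantities.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": sqrt(mse), "r2": r2}


actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]
scores = regression_metrics(actual, predicted)
print(scores)

# Always predicting the mean price gives R2 = 0, the baseline to beat
baseline = regression_metrics(actual, [175] * 4)
print(baseline["r2"])  # 0.0
```

Read this way, an R2 near 0.09 says the model is only slightly better than predicting the average price for every listing.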

Cross-validation (k-fold)

This is a procedure to estimate the skill of a machine learning model. Cross-validation (CV) has a parameter k denoting the number of sections, or folds, that the data is divided into. Each fold is used as the testing set exactly once while the remaining folds are used for training. Once the process is complete, we can summarize and evaluate the metrics across the folds.
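The fold mechanics can be sketched in plain Python. The generator below is a hypothetical helper, not sklearn's KFold, though it follows the same idea: shuffle once, cut the indices into k nearly equal folds, and let each fold take one turn as the test set.

```python
import random


def k_fold_indices(n_samples, k=5, seed=27):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(12, k=5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))  # test folds have sizes 3, 3, 2, 2, 2
```

Training and scoring a model once per fold yields k scores; their mean and spread are what make the CV estimate more stable than a single split.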

      from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
      from sklearn.ensemble import RandomForestRegressor

We will use 5-fold CV with a RandomForestRegressor model. cross_val_score fits the model on each fold and generates the CV scores.

      kf = KFold(n_splits=5, shuffle=True, random_state=27)
      # 'squared_error' is the current name for the criterion older scikit-learn releases called 'mse'
      random_forest = RandomForestRegressor(n_estimators=100, criterion='squared_error', max_depth=20, min_samples_split=2)
      cv_score = cross_val_score(random_forest, X_train, y_train, scoring='r2', cv=kf)

      # cross_val_score fits clones of the model, so fit the model itself before predicting
      random_forest.fit(X_train, y_train)
      predi = random_forest.predict(X_test)

      # Evaluation metrics
      mae = metrics.mean_absolute_error(y_test, predi)
      mse = metrics.mean_squared_error(y_test, predi)
      rmse = np.sqrt(metrics.mean_squared_error(y_test, predi))
      r2 = metrics.r2_score(y_test, predi)

      print('MAE (Mean-Absolute-Error): %s' %mae)
      print('MSE (Mean-Squared-Error): %s' %mse)
      print('RMSE (Root-MSE): %s' %rmse)
      print('R2 score: %s' %r2)

The R2 score is much more stable and the MSE is also lower than what we got for the hold-out method. CV costs more compute than a single split, so reach for it when your hold-out method underperforms or looks unstable. Now let's check the predicted value against the actual value.

      error2 = pd.DataFrame({'Actual-Values': np.array(y_test).flatten(), 'Predicted-Values': predi.flatten()})
      error2.head(10)  # try the same with the linear model's predictions
      # score() returns the R2 score; X here still includes the training rows
      print(f'Model_Accuracy: {random_forest.score(X, y)}')

      Model_Accuracy: 0.7992861793429817


A score of around 0.80 is not bad, though note that score() reports R2 rather than a classification accuracy, and that it was computed here on the full dataset, training rows included. It can be improved by further tuning the model. Change the k-folds and other variables to see the changes. Cleaning the data should be your first step to solving any analysis problem. It is also essential to know the features and relations between them.
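One way to tune the model is a small grid search with GridSearchCV, which the guide imports but does not use. The sketch below runs on synthetic data from make_regression so that it is self-contained; the data, the parameter grid, and the 3-fold setting are all assumptions for illustration, and in practice you would pass X_train and y_train from this guide instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (an assumption; substitute X_train, y_train from the guide)
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=27)

# Hypothetical grid: every combination is scored with cross-validated R2
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6]}
kf = KFold(n_splits=3, shuffle=True, random_state=27)

search = GridSearchCV(
    RandomForestRegressor(random_state=27),
    param_grid,
    scoring="r2",
    cv=kf,
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the combination with the highest mean CV score
print(search.best_score_)   # that mean cross-validated R2
```

The grid is kept tiny so the search finishes quickly; wider grids cost one model fit per combination per fold.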

"Practice makes perfect!"

I suggest working on more such datasets and seeing what insights you can find. Try to fit the data in the models mentioned above. I also suggest reading the documentation links provided in the guide. Explore data evaluation and plots in addition to those mentioned above, and in no time data analysis will be at your fingertips!

Feel free to contact me with any questions at Codealphabet.