Introduction

Until a few decades ago, data was used primarily for two purposes:

- To know what happened.

- To identify the root cause of why it happened.

Both of these uses remain important, but over the last decade or so data has increasingly been used to predict what could happen in the future. This is where machine learning plays a very important role. A key step in predicting a future outcome is developing a machine learning model. A machine learning model is what you get when you combine data with a machine learning algorithm. Many machine learning algorithms have been developed to solve various business problems.

Data by itself offers little to no value until an analyst works on it to derive meaningful insights. Let's look at a few challenges that need to be addressed before data can be used to build a machine learning model.

- Missing data

- Improper scaling

- Too many features

- Hidden features

- Imbalanced data

- Insufficient data

These are only a few of the challenges. Each is a topic in itself, and covering them in detail is beyond the scope of this guide.

Data on a broad scale can be classified into two categories:

- Labeled data

- Unlabeled data

Depending on the type of data, machine learning algorithms fall under two major categories.

- Supervised learning

- Unsupervised learning

Supervised learning is primarily used to address two kinds of problems.

- Continuous data

In machine learning language, this type of problem is called a regression problem. An example is predicting the price of a house in a specific area based on prices over the last 30 years. Some of the common algorithms used to solve a regression problem are linear regression and polynomial regression.

- Categorical data

Cases with categorical data are addressed as a classification problem. A typical example is classifying whether a product has reached its expiration date or not. Some of the commonly used algorithms for classification problems are logistic regression and k-nearest neighbors.

Unsupervised learning is primarily used for clustering, that is, grouping similar items when the data has no label. An example is grouping products in a retail store according to their department. K-means is one of the most widely used clustering algorithms.
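As a rough illustration of clustering unlabeled data, here is a minimal sketch using scikit-learn's `KMeans`. The product values are made up for this example, and the choice of two clusters and the `random_state` seed are assumptions, not part of the guide's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical, made-up data: [price, weight_kg] for six products.
# There is no label column; the algorithm has to find the groups itself.
products = np.array([
    [1.0, 0.20], [1.5, 0.30], [1.2, 0.25],   # small, cheap items
    [40.0, 5.0], [42.0, 5.5], [39.0, 4.8],   # large, expensive items
])

# Ask for two clusters; n_init and random_state are fixed for repeatability.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(products)

# One cluster index per product; which group gets index 0 or 1 is arbitrary.
print(labels)
```

The first three products should end up in one cluster and the last three in the other, even though no department label was ever supplied.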

There are three stages in building a supervised machine learning model:

- Training Phase

- Testing Phase

- Prediction Phase

The available data is split into two sets. In the training phase, we use 75% of the data to train the model. The remaining 25% is used in the testing phase to validate the accuracy of the model. In the prediction phase, the model is deployed in production and predicts outcomes from actual live data.
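The three phases can be sketched with scikit-learn's `train_test_split`. This is only an illustration under stated assumptions: it uses the ten rows from the demo later in this guide, an arbitrary `random_state` seed, and a made-up "live" input of 11 hours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The ten rows from this guide's demo data.
hours = np.arange(1, 11).reshape(-1, 1)
weight = np.array([145, 157, 166, 173, 176, 179, 181, 190, 196, 200])

# Hold out 25% of the rows for testing. With 10 rows, scikit-learn rounds
# the test set up to 3 rows, leaving 7 for training.
# random_state=0 is an arbitrary seed so the split is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    hours, weight, test_size=0.25, random_state=0
)

model = LinearRegression().fit(x_train, y_train)  # training phase
test_r2 = model.score(x_test, y_test)             # testing phase (R-squared)
live = model.predict([[11]])                      # prediction phase, unseen input

print(len(x_train), len(x_test))
```

With so few rows the test score is not very meaningful; the point is only the order of the steps.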

In this guide, we are going to use an open source Python library named **scikit-learn** and sample data to develop our model using the **linear regression algorithm**.

Let's start the demo by analyzing data that shows how the weight of a teenager varies with the number of hours spent playing video games.

I am going to import the pandas library, read the CSV data, and print it. To keep things simple, we assume the data is already normalized and has no missing elements. In other words, the data can be used as-is in our model development process.

```
import pandas as pd

# Load the sample data (columns: Play_Hours, Weight) into a DataFrame
df = pd.read_csv('grades.csv')

print(df)
```

This should print the data that is the source of our model.

```
Play_Hours  Weight
         1     145
         2     157
         3     166
         4     173
         5     176
         6     179
         7     181
         8     190
         9     196
        10     200
```

Weight is a continuous value, so this problem is best modeled with a regression algorithm. We are going to use a linear regression algorithm in this demo.

Let's import the linear model module from scikit-learn and create the linear regression object:

```
import sklearn.linear_model as sk

lr = sk.LinearRegression()
```

We are going to import the NumPy package and use its newaxis object to convert the x values to a 2D array, because scikit-learn expects the features as a 2D array with one row per sample. We can then pass both the x and y values to linear regression and perform a fit operation. The fit operation applies the data to the linear regression algorithm and produces a machine learning model that best fits the data.

```
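import numpy as np

# Take the underlying NumPy array, then add a second axis to get shape (10, 1)
x = df.Play_Hours.values[:, np.newaxis]

y = df.Weight.values

# Fit the linear regression model to the data
lr.fit(x, y)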
```
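To make the shape conversion concrete, the following sketch shows what `np.newaxis` does to a 1D array and that `reshape(-1, 1)` is an equivalent alternative. The `hours` array here is a stand-in for the Play_Hours column, typed in by hand for illustration.

```python
import numpy as np

# Stand-in for the Play_Hours column as a plain 1D array, shape (10,)
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Two equivalent ways to turn it into a single-column 2D array, shape (10, 1)
x_newaxis = hours[:, np.newaxis]
x_reshape = hours.reshape(-1, 1)

print(hours.shape, x_newaxis.shape, x_reshape.shape)
```

Either form gives one row per sample and one column per feature, which is the layout the fit operation expects.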

Let's print the `intercept_` and `coef_` values. A linear equation is represented in the form Y = mX + c, where m is the slope of the line and c is the intercept. The slope and the intercept define the relationship that exists between the two variables. Once we have these two values, we can use them to predict future values. Let's say we want to know the weight of a teenager who plays video games for 15 hours a day; we can use this model to predict that weight. In this case, the number of hours played (X) is the independent variable and the weight of the teenager (Y) is the dependent variable.

```
print("Intercept - c:", lr.intercept_)

print("Coeff - m:", lr.coef_)
```

You should see output similar to the following, listing the intercept and coefficient values. The intercept is the value of Y when the independent variable is 0, that is, the point where the line crosses the graph at X = 0. The coefficient is the slope of the line.

```
Intercept - c: 145.8
Coeff - m: [5.54545455]
```
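The equation Y = mX + c can be checked by hand against the model. The sketch below refits the same ten rows (typed in directly so the block runs on its own) and compares a manual calculation for 15 hours with the model's own prediction. Note that 15 hours lies outside the range of the data, so this is an extrapolation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The ten rows from this guide's demo data.
hours = np.arange(1, 11).reshape(-1, 1)
weight = np.array([145, 157, 166, 173, 176, 179, 181, 190, 196, 200])

lr = LinearRegression().fit(hours, weight)
m = lr.coef_[0]      # slope
c = lr.intercept_    # intercept

# Apply Y = mX + c by hand for X = 15 and compare with lr.predict
manual = m * 15 + c
from_model = lr.predict([[15]])[0]

print(manual, from_model)  # both roughly 228.98
```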

Now that we have a basic model in place, let's see how we can use it to predict a value, and then plot the results to see them visually. We use the model's predict method to predict the weight for a given number of hours, and import the matplotlib package to draw the data points and the fitted line, as shown below.

```
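import matplotlib.pyplot as plt

# Predict the weight for 6 hours of play (predict expects a 2D array)
predict = lr.predict([[6]])

# Actual data points
plt.scatter(x, y, color='black')

# Fitted regression line
plt.plot(x, lr.predict(x), color='blue', linewidth=3)

plt.show()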
```

Now that we have predicted the value, let's see how good this prediction is. We will use three different metrics to evaluate the performance of this prediction.

- Mean Absolute Error

- Root Mean Squared Error (RMSE)

- R-squared (R²)

For a given value of X, if Y is the predicted value and Y' is the actual value, the Mean Absolute Error (MAE) is the sum of the absolute differences between predicted and actual values divided by the number of observations.

Taking the absolute value of each error is very important here. Otherwise, positive errors would cancel out negative errors and the result would be misleading. Let's calculate the MAE using sklearn.

```
from sklearn.metrics import mean_absolute_error

# Store the model's predictions next to the actual values
df['predict'] = lr.predict(x)

print(mean_absolute_error(df.Weight, df['predict']))

# Output: 2.290909090909088
```
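To see why the absolute value matters, the sketch below computes the errors by hand with NumPy. It refits the same ten rows (typed in directly so the block is self-contained) and compares the plain average of the signed errors with the MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# The ten rows from this guide's demo data.
hours = np.arange(1, 11).reshape(-1, 1)
weight = np.array([145, 157, 166, 173, 176, 179, 181, 190, 196, 200])

predicted = LinearRegression().fit(hours, weight).predict(hours)
errors = weight - predicted

# Signed errors cancel out: for a least-squares line with an intercept,
# their mean is effectively zero, which says nothing about the fit.
signed_mean = errors.mean()

# MAE: average of the absolute errors
mae_by_hand = np.abs(errors).mean()

print(signed_mean, mae_by_hand)
```

The hand-computed MAE should agree with `mean_absolute_error`, while the signed mean comes out at essentially zero.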

RMSE is the square root of the mean of the squared errors. It shows how close the predicted and actual values are, so a lower RMSE means better model performance.

```
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the mean squared error
print(np.sqrt(mean_squared_error(df.Weight, df['predict'])))

# Output: 3.147293209323613
```

R-squared signifies how well the model fits the training data. Its value typically ranges between 0 and 1, and a value closer to 1 indicates a better fit.

```
from sklearn.metrics import r2_score

print(r2_score(df.Weight, df['predict']))

# Output: 0.9624238285897556
```
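RMSE and R-squared can also be computed directly from their definitions, which may make the numbers above easier to interpret. This sketch uses the standard formula R² = 1 - SS_res / SS_tot and refits the same ten rows so that it runs on its own; the variable names are chosen for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# The ten rows from this guide's demo data.
hours = np.arange(1, 11).reshape(-1, 1)
weight = np.array([145, 157, 166, 173, 176, 179, 181, 190, 196, 200])

predicted = LinearRegression().fit(hours, weight).predict(hours)

# Residual sum of squares: squared distance between actual and predicted
ss_res = np.sum((weight - predicted) ** 2)
# Total sum of squares: squared distance between actual values and their mean
ss_tot = np.sum((weight - weight.mean()) ** 2)

rmse_by_hand = np.sqrt(ss_res / len(weight))
r2_by_hand = 1 - ss_res / ss_tot

print(rmse_by_hand, r2_by_hand)
```

Read this way, R-squared is the share of the variation in weight that the fitted line accounts for.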

The purpose of this guide is to show how to develop a simple machine learning model. We assumed a very simple data structure that is very well defined. In practical cases, the amount of data is huge and we may need to spend a lot of time preparing the data to be used with a machine learning algorithm.