Time series algorithms are used extensively for analyzing and forecasting time-based data. However, when factors beyond time alone drive the outcome, machine learning has emerged as a powerful method for uncovering hidden patterns in time series data and generating good forecasts.
In this guide, you'll learn the concepts of feature engineering and machine learning from a time series perspective, along with the techniques to implement them in Python.
To begin, get familiar with the data. In this guide, you'll be using a fictitious dataset of daily sales data at a supermarket that contains 3,533 observations and four variables, as described below:
Date
: daily sales date
Sales
: sales at the supermarket for that day, in thousands of dollars
Inventory
: total units of inventory at the supermarket
Class
: training and test data class for modeling
Start by loading the required libraries and the data.
import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("ml_python.csv")
print(df.shape)
print(df.info())
df.head(5)
Output:
(3533, 4)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 4 columns):
Date         3533 non-null object
Sales        3533 non-null int64
Inventory    3533 non-null int64
Class        3533 non-null object
dtypes: int64(2), object(2)
memory usage: 110.5+ KB
None

         Date  Sales  Inventory  Class
0  29-04-2010     51         40  Train
1  30-04-2010     56         44  Train
2  01-05-2010     93         74  Train
3  02-05-2010     86         68  Train
4  03-05-2010     57         45  Train
Sometimes classical time series algorithms won't suffice for making powerful predictions. In such cases, it's sensible to convert the time series data into a format that machine learning algorithms can use by creating features from the time variable. The code below uses the pd.DatetimeIndex() function to create time features such as year, day of the year, quarter, month, day, weekday, etc.
# Parse the date column (day-month-year) and derive calendar features from it
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['year'] = pd.DatetimeIndex(df['Date']).year
df['month'] = pd.DatetimeIndex(df['Date']).month
df['day'] = pd.DatetimeIndex(df['Date']).day
df['dayofyear'] = pd.DatetimeIndex(df['Date']).dayofyear
df['weekofyear'] = pd.DatetimeIndex(df['Date']).weekofyear  # deprecated in newer pandas; use .isocalendar().week
df['weekday'] = pd.DatetimeIndex(df['Date']).weekday
df['quarter'] = pd.DatetimeIndex(df['Date']).quarter
df['is_month_start'] = pd.DatetimeIndex(df['Date']).is_month_start
df['is_month_end'] = pd.DatetimeIndex(df['Date']).is_month_end
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 13 columns):
Date              3533 non-null datetime64[ns]
Sales             3533 non-null int64
Inventory         3533 non-null int64
Class             3533 non-null object
year              3533 non-null int64
month             3533 non-null int64
day               3533 non-null int64
dayofyear         3533 non-null int64
weekofyear        3533 non-null int64
weekday           3533 non-null int64
quarter           3533 non-null int64
is_month_start    3533 non-null bool
is_month_end      3533 non-null bool
You don’t need the Date variable now, so you can drop it.
df = df.drop(['Date'], axis=1)
Some of the variables in the dataset, such as year or quarter, need to be treated as categorical variables. So, you will convert these variables to numeric variables that can be used as factors using a technique called dummy encoding. In this technique, the features are encoded so that there is no duplication of the information. This is achieved by passing the argument drop_first=True to the .get_dummies() function, as done in the code below. The last line prints the information about the data, which shows that the data now has 37 variables.
df = pd.get_dummies(df, columns=['year'], drop_first=True, prefix='year')
df = pd.get_dummies(df, columns=['month'], drop_first=True, prefix='month')
df = pd.get_dummies(df, columns=['weekday'], drop_first=True, prefix='wday')
df = pd.get_dummies(df, columns=['quarter'], drop_first=True, prefix='qrtr')
df = pd.get_dummies(df, columns=['is_month_start'], drop_first=True, prefix='m_start')
df = pd.get_dummies(df, columns=['is_month_end'], drop_first=True, prefix='m_end')

df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 37 columns):
Sales             3533 non-null int64
Inventory         3533 non-null int64
Class             3533 non-null object
day               3533 non-null int64
dayofyear         3533 non-null int64
weekofyear        3533 non-null int64
year_2011         3533 non-null uint8
year_2012         3533 non-null uint8
year_2013         3533 non-null uint8
year_2014         3533 non-null uint8
year_2015         3533 non-null uint8
year_2016         3533 non-null uint8
year_2017         3533 non-null uint8
year_2018         3533 non-null uint8
year_2019         3533 non-null uint8
month_2           3533 non-null uint8
month_3           3533 non-null uint8
month_4           3533 non-null uint8
month_5           3533 non-null uint8
month_6           3533 non-null uint8
month_7           3533 non-null uint8
month_8           3533 non-null uint8
month_9           3533 non-null uint8
month_10          3533 non-null uint8
month_11          3533 non-null uint8
month_12          3533 non-null uint8
wday_1            3533 non-null uint8
wday_2            3533 non-null uint8
wday_3            3533 non-null uint8
wday_4            3533 non-null uint8
wday_5            3533 non-null uint8
wday_6            3533 non-null uint8
qrtr_2            3533 non-null uint8
qrtr_3            3533 non-null uint8
qrtr_4            3533 non-null uint8
m_start_True      3533 non-null uint8
m_end_True        3533 non-null uint8
dtypes: int64(5), object(1), uint8(31)
With the data prepared, you are ready to move to machine learning in the subsequent sections. However, before applying predictive modeling techniques, it's important to divide the data into training and test sets.
train = df[df["Class"] == "Train"]
test = df[df["Class"] == "Test"]

print(train.shape)
print(test.shape)
Output:
(3442, 37)
(91, 37)
You don’t need the Class variable now, so that can be dropped using the code below.
train = train.drop(['Class'], axis=1)
test = test.drop(['Class'], axis=1)
With the data partitioned, the next step is to create arrays for the features and response variables. The first line of code below creates a list holding the name of the target variable, target_column_train. The second line gives the list of all the features, excluding the target variable Sales. The next two lines create the arrays for the training data, and the last two lines print their shapes.
target_column_train = ['Sales']
predictors_train = list(set(list(train.columns)) - set(target_column_train))

X_train = train[predictors_train].values
y_train = train[target_column_train].values

print(X_train.shape)
print(y_train.shape)
Output:
(3442, 35)
(3442, 1)
Repeat the same process for the test data with the code below.
target_column_test = ['Sales']
predictors_test = list(set(list(test.columns)) - set(target_column_test))

X_test = test[predictors_test].values
y_test = test[target_column_test].values

print(X_test.shape)
print(y_test.shape)
Output:
(91, 35)
(91, 1)
You are now ready to build machine learning models. Start by loading the libraries and the modules.
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt
Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They split the data into two or more homogeneous sets based on the most significant splitter among the independent variables, where the best splitter is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the Gini index, whereas for a regression tree the default metric is the mean squared error.
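To make the cost metric concrete, here is a small illustrative sketch, using made-up sales values rather than the guide's dataset, of how a regression tree scores a candidate split: each child node predicts the mean of its samples, and the split that minimizes the weighted mean squared error of the children is preferred.

import numpy as np

# Hypothetical sales values, for illustration only
y = np.array([51, 56, 93, 86, 57, 60], dtype=float)

def node_mse(values):
    # MSE of a node that predicts the mean of its samples
    return np.mean((values - values.mean()) ** 2)

# Candidate split: the first three observations go to the left child, the rest to the right
left, right = y[:3], y[3:]
weighted_mse = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)

print(node_mse(y))    # cost of the node before splitting
print(weighted_mse)   # cost after the candidate split; the tree greedily picks the split with the lowest value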
Create a CART regression model using the DecisionTreeRegressor class. The first step is to instantiate the algorithm, which is done in the first line of code below. The second line fits the model on the training set. The arguments used are max_depth, which indicates the maximum depth of the tree, and min_samples_leaf, which indicates the minimum number of samples required at a leaf node (here given as a fraction of the training samples).
dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
dtree.fit(X_train, y_train)
Output:
DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=0.13,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=3, splitter='best')
Once the model is built on the training set, you can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.
# Code lines 1 to 3
pred_train_tree = dtree.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
print(r2_score(y_train, pred_train_tree))

# Code lines 4 to 6
pred_test_tree = dtree.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
print(r2_score(y_test, pred_test_tree))
Output:
7.42649982627121
0.8952723676224715
13.797384350596744
0.4567663254694022
The above output shows that the RMSE is 7.4 for the training data and 13.8 for the test data. On the other hand, the R-squared value is 89% for the training data and 46% for the test data. There is a gap between the training and test set results, and performance can be improved by parameter tuning. Change the value of the max_depth parameter to see how that affects model performance.
The first four lines of code below instantiate and fit regression trees with a max_depth parameter of two and five, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code give predictions on the testing data.
# Code Lines 1 to 4: Fit the regression trees 'dtree1' and 'dtree2'
dtree1 = DecisionTreeRegressor(max_depth=2)
dtree2 = DecisionTreeRegressor(max_depth=5)
dtree1.fit(X_train, y_train)
dtree2.fit(X_train, y_train)

# Code Lines 5 to 6: Predict on training data
tr1 = dtree1.predict(X_train)
tr2 = dtree2.predict(X_train)

# Code Lines 7 to 8: Predict on testing data
y1 = dtree1.predict(X_test)
y2 = dtree2.predict(X_test)
The code below generates the evaluation metrics—RMSE and R-squared—for the first regression tree, 'dtree1'.
# Print RMSE and R-squared value for regression tree 'dtree1' on training data
print(np.sqrt(mean_squared_error(y_train, tr1)))
print(r2_score(y_train, tr1))

# Print RMSE and R-squared value for regression tree 'dtree1' on testing data
print(np.sqrt(mean_squared_error(y_test, y1)))
print(r2_score(y_test, y1))
Output:
7.146794965406164
0.9030125411762373
11.751081527241734
0.6059522633855321
The above output for the 'dtree1' model shows that the RMSE is 7.15 for the training data and 11.75 for the test data. The R-squared value is 90% for the training data and 61% for the test data. This model is better than the previous model on both evaluation metrics, and the gap between the training and test set results has also narrowed.
Now examine the performance of the second decision tree model, 'dtree2', by running the following lines of code.
# Print RMSE and R-squared value for regression tree 'dtree2' on training data
print(np.sqrt(mean_squared_error(y_train, tr2)))
print(r2_score(y_train, tr2))

# Print RMSE and R-squared value for regression tree 'dtree2' on testing data
print(np.sqrt(mean_squared_error(y_test, y2)))
print(r2_score(y_test, y2))
Output:
2.13305836695393
0.9913603049571774
11.236614430353763
0.6397001209411287
The above output shows that the test set metrics improved again: the test RMSE fell to about 11.2 and the test R-squared rose to 64%. The training R-squared, however, jumped to 99%, so the gap between the training and test results is now wider, a hint that the deeper tree is beginning to overfit. Even so, the regression tree with a max_depth parameter of five gives the best test set performance among the trees tried, demonstrating how parameter tuning can improve model performance.
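Rather than trying max_depth values one by one, you can also search over candidate settings systematically. The sketch below only illustrates that idea, using scikit-learn's GridSearchCV with a TimeSeriesSplit cross-validator; the parameter grid is hypothetical and not tuned for this dataset, and it assumes the training rows are in date order.

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Hypothetical grid of candidate settings
param_grid = {'max_depth': [2, 3, 5, 8], 'min_samples_leaf': [1, 5, 10]}

# TimeSeriesSplit keeps the validation folds after the training folds in time,
# which is a better fit for time series data than a shuffled split
grid = GridSearchCV(DecisionTreeRegressor(random_state=3),
                    param_grid=param_grid,
                    cv=TimeSeriesSplit(n_splits=5),
                    scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

print(grid.best_params_)   # the combination with the lowest cross-validated error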
Decision Trees are useful, but they often tend to overfit the training data, leading to high variances in the test data. Random Forest algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a Forest because they are the collection, or ensemble, of several decision trees. One major difference between a Decision Tree and a Random Forest model is how the splits happen. In a Random Forest, instead of trying splits on all the features, a sample of features is selected for each split, thereby reducing the variance of the model.
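As an aside before fitting the guide's model below, the per-split feature sampling described above is controlled by the max_features argument of RandomForestRegressor. The settings in this sketch are illustrative only, not recommendations for this dataset.

from sklearn.ensemble import RandomForestRegressor

rf_sketch = RandomForestRegressor(
    n_estimators=500,      # number of trees in the ensemble
    max_features='sqrt',   # consider only sqrt(n_features) candidate features at each split
    oob_score=True,        # estimate generalization error on out-of-bag samples
    random_state=100
)
# rf_sketch.fit(X_train, y_train.ravel())
# print(rf_sketch.oob_score_)   # out-of-bag R-squared estimate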
In scikit-learn, the RandomForestRegressor class is used for building random forest regression models. The first line of code below instantiates the Random Forest Regression model with an n_estimators value of 5000, where the argument n_estimators indicates the number of trees in the forest. The second line fits the model to the training data.
The third line of code predicts on the training set, while the fourth and fifth lines print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
# RF model
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train)
pred_train_rf = model_rf.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
print(r2_score(y_train, pred_train_rf))

pred_test_rf = model_rf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print(r2_score(y_test, pred_test_rf))
Output:
0.5859145906944213
0.9993481291496544
8.717826785121657
0.7831249311061259
The above output shows that the RMSE and R-squared values on the training data are 0.58 and 99.9%, respectively. For the test data, the results for these metrics are 8.7 and 78%, respectively. The performance of the Random Forest model is far superior to the Decision Tree models built earlier.
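Optionally, you can check which of the engineered time features the forest relied on most. This is a brief illustrative sketch using the feature_importances_ attribute of the fitted model; it assumes the model_rf and predictors_train objects created above.

# Rank the engineered features by the forest's impurity-based importance
importances = model_rf.feature_importances_
order = np.argsort(importances)[::-1]
for idx in order[:10]:
    print(predictors_train[idx], round(importances[idx], 4))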
In this guide, you learned how to perform machine learning on time series data. You learned how to create features from the Date variable and use them as independent features for model building. You were also introduced to powerful non-linear regression tree algorithms like Decision Trees and Random Forest, which you used to build and evaluate machine learning models.
To learn more about data science using Python, please refer to the following guides.