Time series algorithms are used extensively for analyzing and forecasting time-based data. However, when factors other than time also drive the series, machine learning models can capture those relationships and often improve the forecasts.
In this guide, you'll learn the concepts of feature engineering and machine learning from a time series perspective, along with the techniques to implement them in Python.
To begin, get familiar with the data. In this guide, you'll be using a fictitious dataset of daily sales data at a supermarket that contains 3,533 observations and four variables, as described below:
Date: daily sales date
Sales: sales at the supermarket for that day, in thousands of dollars
Inventory: total units of inventory at the supermarket
Class: training and test data class for modeling
Start by loading the required libraries and the data.
    import pandas as pd
    import numpy as np

    # Reading the data
    df = pd.read_csv("ml_python.csv")
    print(df.shape)
    print(df.info())
    df.head(5)
    (3533, 4)

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3533 entries, 0 to 3532
    Data columns (total 4 columns):
    Date         3533 non-null object
    Sales        3533 non-null int64
    Inventory    3533 non-null int64
    Class        3533 non-null object
    dtypes: int64(2), object(2)
    memory usage: 110.5+ KB
    None

             Date  Sales  Inventory  Class
    0  29-04-2010     51         40  Train
    1  30-04-2010     56         44  Train
    2  01-05-2010     93         74  Train
    3  02-05-2010     86         68  Train
    4  03-05-2010     57         45  Train
Sometimes classical time series algorithms won't suffice for making accurate predictions. In such cases, it's sensible to recast the forecasting task as a supervised machine learning problem by creating features from the time variable. The code below uses the pd.DatetimeIndex() function to create time features such as year, day of the year, quarter, month, day, and weekday.
    # Parse the day-first date strings (e.g. 29-04-2010) with an explicit format
    dates = pd.DatetimeIndex(pd.to_datetime(df['Date'], format='%d-%m-%Y'))

    df['year'] = dates.year
    df['month'] = dates.month
    df['day'] = dates.day
    df['dayofyear'] = dates.dayofyear
    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = dates.isocalendar().week.to_numpy(dtype='int64')
    df['weekday'] = dates.weekday
    df['quarter'] = dates.quarter
    df['is_month_start'] = dates.is_month_start
    df['is_month_end'] = dates.is_month_end
    print(df.info())
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3533 entries, 0 to 3532
    Data columns (total 13 columns):
    Date              3533 non-null object
    Sales             3533 non-null int64
    Inventory         3533 non-null int64
    Class             3533 non-null object
    year              3533 non-null int64
    month             3533 non-null int64
    day               3533 non-null int64
    dayofyear         3533 non-null int64
    weekofyear        3533 non-null int64
    weekday           3533 non-null int64
    quarter           3533 non-null int64
    is_month_start    3533 non-null bool
    is_month_end      3533 non-null bool
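Day-first strings such as 01-05-2010 are ambiguous: read month-first, that value becomes 5 January instead of 1 May. The following is a minimal sketch, using three date values from the first rows shown earlier, of how an explicit format keeps the derived features correct. The names `sample`, `parsed`, and `features` are illustrative and not part of the guide's code.

```python
import pandas as pd

# Three day-first date strings in the same layout as the dataset
sample = pd.Series(["29-04-2010", "30-04-2010", "01-05-2010"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(sample, format="%d-%m-%Y")

features = pd.DataFrame({
    "month": parsed.dt.month,
    "day": parsed.dt.day,
    "weekday": parsed.dt.weekday,  # Monday=0 ... Sunday=6
    "weekofyear": parsed.dt.isocalendar().week.astype("int64"),
    "is_month_start": parsed.dt.is_month_start,
})
print(features)
```

If the third row reports month 5 and day 1, the dates were read as intended.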
You don’t need the Date variable now, so you can drop it.
    df = df.drop(['Date'], axis=1)
Some of the variables in the dataset, such as quarter, need to be treated as categorical variables. You will convert them to numeric indicator columns using a technique called dummy encoding. For a variable with k categories, dummy encoding creates k - 1 binary columns; the omitted category serves as the baseline, so no column duplicates information already carried by the others. This is achieved by passing the argument drop_first=True to the pd.get_dummies() function, as done in the code below. The last line prints the information about the data, which indicates that the data now has 37 variables.
    df = pd.get_dummies(df, columns=['year'], drop_first=True, prefix='year')

    df = pd.get_dummies(df, columns=['month'], drop_first=True, prefix='month')

    df = pd.get_dummies(df, columns=['weekday'], drop_first=True, prefix='wday')
    df = pd.get_dummies(df, columns=['quarter'], drop_first=True, prefix='qrtr')

    df = pd.get_dummies(df, columns=['is_month_start'], drop_first=True, prefix='m_start')

    df = pd.get_dummies(df, columns=['is_month_end'], drop_first=True, prefix='m_end')

    df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3533 entries, 0 to 3532
    Data columns (total 37 columns):
    Sales           3533 non-null int64
    Inventory       3533 non-null int64
    Class           3533 non-null object
    day             3533 non-null int64
    dayofyear       3533 non-null int64
    weekofyear      3533 non-null int64
    year_2011       3533 non-null uint8
    year_2012       3533 non-null uint8
    year_2013       3533 non-null uint8
    year_2014       3533 non-null uint8
    year_2015       3533 non-null uint8
    year_2016       3533 non-null uint8
    year_2017       3533 non-null uint8
    year_2018       3533 non-null uint8
    year_2019       3533 non-null uint8
    month_2         3533 non-null uint8
    month_3         3533 non-null uint8
    month_4         3533 non-null uint8
    month_5         3533 non-null uint8
    month_6         3533 non-null uint8
    month_7         3533 non-null uint8
    month_8         3533 non-null uint8
    month_9         3533 non-null uint8
    month_10        3533 non-null uint8
    month_11        3533 non-null uint8
    month_12        3533 non-null uint8
    wday_1          3533 non-null uint8
    wday_2          3533 non-null uint8
    wday_3          3533 non-null uint8
    wday_4          3533 non-null uint8
    wday_5          3533 non-null uint8
    wday_6          3533 non-null uint8
    qrtr_2          3533 non-null uint8
    qrtr_3          3533 non-null uint8
    qrtr_4          3533 non-null uint8
    m_start_True    3533 non-null uint8
    m_end_True      3533 non-null uint8
    dtypes: int64(5), object(1), uint8(31)
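To see what drop_first=True changes, here is a small sketch on a made-up five-row frame. The `toy`, `full`, and `reduced` names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# A toy categorical column with four levels
toy = pd.DataFrame({"quarter": [1, 2, 3, 4, 1]})

# Without drop_first, every level gets its own indicator column
full = pd.get_dummies(toy, columns=["quarter"], prefix="qrtr")

# With drop_first, the first level (quarter 1) becomes the implicit baseline
reduced = pd.get_dummies(toy, columns=["quarter"], prefix="qrtr", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Rows belonging to quarter 1 show up in `reduced` as all-zero indicators, which is why the dropped column carries no extra information.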
With the data prepared, you are ready to move to machine learning in the subsequent sections. However, before moving to predictive modeling techniques, it's important to divide the data into training and test sets.
    train = df[df["Class"] == "Train"]
    test = df[df["Class"] == "Test"]

    print(train.shape)
    print(test.shape)
    (3442, 37)

    (91, 37)
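The dataset in this guide already labels each row through the Class column. For a dataset without such a column, one common alternative for time series is to hold out the most recent observations so that the model is always evaluated on dates later than those it was trained on. The sketch below illustrates this on synthetic data; `toy`, `n_test`, `train_toy`, and `test_toy` are hypothetical names, and the guide does not state how its own split was made.

```python
import numpy as np
import pandas as pd

# Ten synthetic daily observations
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Sales": np.arange(10),
})

n_test = 3  # number of most recent rows to hold out (an arbitrary choice)

# Sort chronologically, then cut once: no shuffling, so no future data leaks into training
toy = toy.sort_values("Date")
train_toy = toy.iloc[:-n_test]
test_toy = toy.iloc[-n_test:]

print(train_toy["Date"].max(), test_toy["Date"].min())
```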
You don’t need the Class variable now, so that can be dropped using the code below.
    train = train.drop(['Class'], axis=1)
    test = test.drop(['Class'], axis=1)
With the data partitioned, the next step is to create arrays for the features and response variables. The first line of code creates a list holding the name of the target variable, called target_column_train. The second line gives us the list of all the features, excluding the target variable Sales. The next two lines create the arrays for the training data, and the last two lines print their shapes.
    target_column_train = ['Sales']
    # Keep the original column order so the feature layout is the same on every run
    predictors_train = [col for col in train.columns if col not in target_column_train]

    X_train = train[predictors_train].values
    y_train = train[target_column_train].values

    print(X_train.shape)
    print(y_train.shape)
    (3442, 35)
    (3442, 1)
Repeat the same process for the test data with the code below.
    target_column_test = ['Sales']
    # Keep the original column order so the test features line up with the training features
    predictors_test = [col for col in test.columns if col not in target_column_test]

    X_test = test[predictors_test].values
    y_test = test[target_column_test].values

    print(X_test.shape)
    print(y_test.shape)
    (91, 35)
    (91, 1)
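A model matches features by position, not by name, so the training and test arrays must list their columns in the same order. One way to guarantee that is to derive the predictor list once and reuse it. The helper below, `split_features_target`, is a hypothetical function written for this illustration and run on made-up rows; note that it returns a 1-D target, unlike the (n, 1) arrays above.

```python
import pandas as pd

def split_features_target(frame, target="Sales", predictors=None):
    """Return (X, y, predictors), reusing a given predictor order when supplied."""
    if predictors is None:
        predictors = [col for col in frame.columns if col != target]
    X = frame[predictors].to_numpy()
    y = frame[target].to_numpy()
    return X, y, predictors

toy_train = pd.DataFrame({"Sales": [51, 56, 93], "Inventory": [40, 44, 74], "day": [29, 30, 1]})
# Columns deliberately in a different order to show that the reuse still works
toy_test = pd.DataFrame({"day": [2], "Inventory": [68], "Sales": [86]})

X_tr, y_tr, cols = split_features_target(toy_train)
X_te, y_te, _ = split_features_target(toy_test, predictors=cols)

print(cols)
print(X_te)
```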
You are now ready to build machine learning models. Start by loading the libraries and the modules.
    # Only the modules used in the rest of this guide are imported
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.metrics import mean_squared_error
Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by recursively splitting the data into increasingly homogeneous subsets, choosing at each step the feature and threshold that best separate the observations. The best split is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the Gini index, whereas for a regression tree, the default metric is the mean squared error.
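To make the idea of a cost-minimizing split concrete, the sketch below searches one numeric feature for the threshold that minimizes the weighted mean squared error of the two resulting groups. It is a simplified illustration on made-up numbers, not scikit-learn's implementation; `best_split` is a hypothetical helper.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, cost) minimizing the weighted MSE of the two child groups."""
    best_threshold, best_cost = None, np.inf
    values = np.unique(x)  # sorted distinct feature values
    # Candidate thresholds are the midpoints between neighbouring values
    for low, high in zip(values[:-1], values[1:]):
        threshold = (low + high) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        # The variance of a group equals its MSE when the group mean is the prediction
        cost = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.0, 20.0, 21.0, 19.0])

threshold, cost = best_split(x, y)
print(threshold, cost)
```

Here the search settles on the midpoint between 3 and 10, because that cut separates the low-sales group from the high-sales group. A full tree repeats the same search over every feature and then recurses into each child.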
Create a CART regression model using the DecisionTreeRegressor class. The first line of code below instantiates the algorithm, and the second line fits the model on the training set. The arguments used are max_depth, which indicates the maximum depth of the tree, and min_samples_leaf, which indicates the minimum number of samples required to be at a leaf node. Because min_samples_leaf is given here as a float, scikit-learn interprets it as a fraction of the training samples rather than an absolute count.
    dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
    dtree.fit(X_train, y_train)
    DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
                          max_leaf_nodes=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=0.13,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          presort=False, random_state=3, splitter='best')
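When min_samples_leaf is a float, scikit-learn treats it as a fraction and requires ceil(min_samples_leaf * n_samples) rows in every leaf. The arithmetic below works out what 0.13 means for the 3,442 training rows; the variable names are illustrative.

```python
import math

n_train = 3442           # rows in the training set, from the shape printed earlier
min_samples_leaf = 0.13  # a float, so it is read as a fraction of n_train

# Minimum number of rows each leaf must contain
min_leaf_rows = math.ceil(min_samples_leaf * n_train)

# Upper bound on the number of leaves that many rows can support
max_leaves = n_train // min_leaf_rows

print(min_leaf_rows, max_leaves)  # 448 7
```

With at most seven leaves possible, the leaf-size constraint rather than max_depth=8 is likely what limits how far this tree can grow.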
Once the model is built on the training set, you can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.
    # Code lines 1 to 3
    pred_train_tree = dtree.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
    print(r2_score(y_train, pred_train_tree))

    # Code lines 4 to 6
    pred_test_tree = dtree.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
    print(r2_score(y_test, pred_test_tree))
    7.42649982627121
    0.8952723676224715
    13.797384350596744
    0.4567663254694022
The above output shows that the RMSE is 7.4 for the training data and 13.8 for the test data. The R-squared value is 89.5% for the training data and 45.7% for the test data. The gap between the training and test set results is large, which leaves room for improvement through parameter tuning. Change the value of the max_depth parameter to see how that affects the model performance.
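For reference, both metrics can be computed by hand. The sketch below uses four made-up values to show the formulas; `rmse` and `r_squared` are hypothetical helpers, not scikit-learn functions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Both arrays are assumed to be 1-D; flatten an (n, 1) column with .ravel() first
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 9.0])

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```

RMSE is expressed in the units of the target (thousands of dollars for Sales), while R-squared is unitless, which is why the guide reports both.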
The first four lines of code below instantiate and fit two regression trees with max_depth values of two and five, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code give predictions on the testing data.
    # Code Lines 1 to 4: Fit the regression tree 'dtree1' and 'dtree2'
    dtree1 = DecisionTreeRegressor(max_depth=2)
    dtree2 = DecisionTreeRegressor(max_depth=5)
    dtree1.fit(X_train, y_train)
    dtree2.fit(X_train, y_train)

    # Code Lines 5 to 6: Predict on training data
    tr1 = dtree1.predict(X_train)
    tr2 = dtree2.predict(X_train)

    # Code Lines 7 to 8: Predict on testing data
    y1 = dtree1.predict(X_test)
    y2 = dtree2.predict(X_test)
The code below generates the evaluation metrics—RMSE and R-squared—for the first regression tree, 'dtree1'.
    # Print RMSE and R-squared value for regression tree 'dtree1' on training data
    print(np.sqrt(mean_squared_error(y_train, tr1)))
    print(r2_score(y_train, tr1))

    # Print RMSE and R-squared value for regression tree 'dtree1' on testing data
    print(np.sqrt(mean_squared_error(y_test, y1)))
    print(r2_score(y_test, y1))
    7.146794965406164
    0.9030125411762373
    11.751081527241734
    0.6059522633855321
The above output for the 'dtree1' model shows that the RMSE is 7.15 for the training data and 11.75 for the test data. The R-squared value is 90% for the training data and 61% for the test data. This model beats the previous one on both evaluation metrics, and the gap between the training and test set results has also narrowed.
Now examine the performance of the second decision tree model, 'dtree2', by running the following lines of code.
    # Print RMSE and R-squared value for regression tree 'dtree2' on training data
    print(np.sqrt(mean_squared_error(y_train, tr2)))
    print(r2_score(y_train, tr2))

    # Print RMSE and R-squared value for regression tree 'dtree2' on testing data
    print(np.sqrt(mean_squared_error(y_test, y2)))
    print(r2_score(y_test, y2))
    2.13305836695393
    0.9913603049571774
    11.236614430353763
    0.6397001209411287
The above output shows a further improvement on the test set: the test RMSE drops to 11.2 and the test R-squared rises to 64%. The training R-squared, however, jumps to 99%, so the gap between the training and test results is wider than it was for 'dtree1', which suggests the deeper tree is starting to overfit. Even so, the regression tree with a max_depth parameter of five gives the best test results so far, demonstrating how parameter tuning can improve model performance.
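Rather than fitting one tree per value by hand, the comparison can be written as a loop. The sketch below runs on synthetic data, not the supermarket dataset, so its numbers say nothing about the results above; the data-generating formula and the names `scores`, `X_tr`, and `X_te` are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear effect, a non-linear effect, and noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 1.0, size=400)

# Keep the row order: the first 300 rows train the model, the last 100 test it
X_tr, X_te = X[:300], X[300:]
y_tr, y_te = y[:300], y[300:]

scores = {}
for depth in (2, 5, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=3)
    tree.fit(X_tr, y_tr)
    scores[depth] = (
        r2_score(y_tr, tree.predict(X_tr)),  # training R-squared
        r2_score(y_te, tree.predict(X_te)),  # test R-squared
    )

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

Reading the two columns side by side makes it easier to spot the depth at which the training score keeps climbing while the test score stalls.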
Decision Trees are useful, but they tend to overfit the training data, which leads to high variance in their predictions on unseen data. Random Forest algorithms address this shortcoming by averaging many decision trees, each trained on a bootstrap sample of the data, which reduces the variance. They are called a forest because they are a collection, or ensemble, of several decision trees. One major difference between a Decision Tree and a Random Forest model is how the splits happen. In a Random Forest, instead of trying splits on all the features, a random sample of features can be considered for each split, which makes the trees less alike and further reduces the variance of the model.
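The ensemble idea can be checked directly: for regression, a fitted scikit-learn forest predicts the average of its individual trees. The sketch below confirms this on synthetic data; the data and the choice of max_features=2 are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with four features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=200)

# max_features=2 means each split considers a random pair of the four features
forest = RandomForestRegressor(n_estimators=25, max_features=2, random_state=100)
forest.fit(X, y)

# Collect each tree's predictions and average them by hand
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))
print(per_tree.std(axis=0).mean())  # how much the individual trees disagree
```

The individual trees disagree with one another, and averaging them is what smooths out their individual errors.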
In scikit-learn, the RandomForestRegressor class is used for building Random Forest regression models. The first line of code below instantiates the Random Forest Regression model with an n_estimators value of 5000. The argument n_estimators indicates the number of trees in the forest. The second line fits the model to the training data.
The third line of code generates predictions on the training set, while the fourth and fifth lines print the evaluation metrics, RMSE and R-squared, for the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
    # RF model
    model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
    # ravel() flattens the (n, 1) target to 1-D, which avoids a data conversion warning
    model_rf.fit(X_train, y_train.ravel())
    pred_train_rf = model_rf.predict(X_train)
    print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
    print(r2_score(y_train, pred_train_rf))

    pred_test_rf = model_rf.predict(X_test)
    print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
    print(r2_score(y_test, pred_test_rf))
    0.5859145906944213
    0.9993481291496544
    8.717826785121657
    0.7831249311061259
The above output shows that the RMSE and R-squared values on the training data are 0.59 and 99.9%, respectively. For the test data, the results for these metrics are 8.7 and 78%, respectively. On the test data, the Random Forest model clearly outperforms the Decision Tree models built earlier, although the near-perfect training score shows that it still fits the training data far more closely than the test data.
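The model above is created with oob_score=True, but the resulting score is never printed. After fitting, scikit-learn stores it in the oob_score_ attribute: an R-squared computed for each row using only the trees whose bootstrap sample did not include that row. The sketch below shows how to read it, using synthetic data and a smaller forest so that it runs quickly; those choices are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one strong linear signal plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 3))
y = 10.0 * X[:, 0] + rng.normal(0, 1.0, size=300)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=100)
forest.fit(X, y)

# score() evaluates on rows the trees have seen; oob_score_ uses only unseen rows
print(f"Training R2: {forest.score(X, y):.3f}")
print(f"OOB R2:      {forest.oob_score_:.3f}")
```

The out-of-bag score is usually a less optimistic estimate than the training score. For time series it should be read with care, because the left-out rows are scattered across the whole period rather than lying after the training dates, so it does not replace a chronological test set.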
In this guide, you learned how to perform machine learning on time series data. You learned how to create features from the Date variable and use them as independent features for model building. You were also introduced to two non-linear, tree-based regression algorithms, Decision Trees and Random Forest, which you used to build and evaluate machine learning models.