Time series algorithms are used extensively for analyzing and forecasting time-based data. However, because factors other than time also drive the target, machine learning has emerged as a powerful method for uncovering hidden patterns in time series data and generating good forecasts.
In this guide, you'll learn the concepts of feature engineering and machine learning from a time series perspective, along with the techniques to implement them in Python.
To begin, get familiar with the data. In this guide, you'll be using a fictitious dataset of daily sales data at a supermarket that contains 3,533 observations and four variables, as described below:
Date
: daily sales date
Sales
: sales at the supermarket for that day, in thousands of dollars
Inventory
: total units of inventory at the supermarket
Class
: training and test data class for modeling
Start by loading the required libraries and the data.
```python
import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("ml_python.csv")
print(df.shape)
print(df.info())
df.head(5)
```
Output:
```
(3533, 4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 4 columns):
Date         3533 non-null object
Sales        3533 non-null int64
Inventory    3533 non-null int64
Class        3533 non-null object
dtypes: int64(2), object(2)
memory usage: 110.5+ KB
None
         Date  Sales  Inventory  Class
0  29-04-2010     51         40  Train
1  30-04-2010     56         44  Train
2  01-05-2010     93         74  Train
3  02-05-2010     86         68  Train
4  03-05-2010     57         45  Train
```
Sometimes classical time series algorithms won't suffice for making powerful predictions. In such cases, it's sensible to reframe the time series data as a supervised machine learning problem by creating features from the time variable. The code below uses the `pd.DatetimeIndex()` function to create time features such as year, day of the year, quarter, month, day, weekday, etc.
```python
import datetime

# Parse the dd-mm-yyyy dates; dayfirst=True avoids day and month being swapped
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
# Store the date back as an unambiguous ISO-formatted string
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')

# Extract calendar features from the date
df['year'] = pd.DatetimeIndex(df['Date']).year
df['month'] = pd.DatetimeIndex(df['Date']).month
df['day'] = pd.DatetimeIndex(df['Date']).day
df['dayofyear'] = pd.DatetimeIndex(df['Date']).dayofyear
df['weekofyear'] = pd.DatetimeIndex(df['Date']).weekofyear
df['weekday'] = pd.DatetimeIndex(df['Date']).weekday
df['quarter'] = pd.DatetimeIndex(df['Date']).quarter
df['is_month_start'] = pd.DatetimeIndex(df['Date']).is_month_start
df['is_month_end'] = pd.DatetimeIndex(df['Date']).is_month_end
print(df.info())
```
Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 13 columns):
Date              3533 non-null object
Sales             3533 non-null int64
Inventory         3533 non-null int64
Class             3533 non-null object
year              3533 non-null int64
month             3533 non-null int64
day               3533 non-null int64
dayofyear         3533 non-null int64
weekofyear        3533 non-null int64
weekday           3533 non-null int64
quarter           3533 non-null int64
is_month_start    3533 non-null bool
is_month_end      3533 non-null bool
```
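Note that `DatetimeIndex.weekofyear` has been deprecated and removed in recent pandas releases. If the block above errors on your pandas version, a minimal equivalent using the `.dt` accessor (assuming `Date` still holds the raw dd-mm-yyyy strings from the CSV) might look like this:

```python
# Feature extraction for newer pandas versions, keeping Date as a datetime column
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['day'] = df['Date'].dt.day
df['dayofyear'] = df['Date'].dt.dayofyear
df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)  # replaces .weekofyear
df['weekday'] = df['Date'].dt.weekday
df['quarter'] = df['Date'].dt.quarter
df['is_month_start'] = df['Date'].dt.is_month_start
df['is_month_end'] = df['Date'].dt.is_month_end
```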
You don't need the `Date` variable now, so you can drop it.
```python
df = df.drop(['Date'], axis=1)
```
Some of the variables in the dataset, such as `year` or `quarter`, need to be treated as categorical variables. So, you will convert them to numeric variables that can be used as factors, using a technique called dummy encoding. In this technique the features are encoded so that no information is duplicated, which is achieved by passing the argument `drop_first=True` to the `pd.get_dummies()` function, as done in the code below. The last line prints the information about the data, which indicates that the data now has 37 variables.
```python
df = pd.get_dummies(df, columns=['year'], drop_first=True, prefix='year')
df = pd.get_dummies(df, columns=['month'], drop_first=True, prefix='month')
df = pd.get_dummies(df, columns=['weekday'], drop_first=True, prefix='wday')
df = pd.get_dummies(df, columns=['quarter'], drop_first=True, prefix='qrtr')
df = pd.get_dummies(df, columns=['is_month_start'], drop_first=True, prefix='m_start')
df = pd.get_dummies(df, columns=['is_month_end'], drop_first=True, prefix='m_end')
df.info()
```
Output:
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 37 columns):
Sales           3533 non-null int64
Inventory       3533 non-null int64
Class           3533 non-null object
day             3533 non-null int64
dayofyear       3533 non-null int64
weekofyear      3533 non-null int64
year_2011       3533 non-null uint8
year_2012       3533 non-null uint8
year_2013       3533 non-null uint8
year_2014       3533 non-null uint8
year_2015       3533 non-null uint8
year_2016       3533 non-null uint8
year_2017       3533 non-null uint8
year_2018       3533 non-null uint8
year_2019       3533 non-null uint8
month_2         3533 non-null uint8
month_3         3533 non-null uint8
month_4         3533 non-null uint8
month_5         3533 non-null uint8
month_6         3533 non-null uint8
month_7         3533 non-null uint8
month_8         3533 non-null uint8
month_9         3533 non-null uint8
month_10        3533 non-null uint8
month_11        3533 non-null uint8
month_12        3533 non-null uint8
wday_1          3533 non-null uint8
wday_2          3533 non-null uint8
wday_3          3533 non-null uint8
wday_4          3533 non-null uint8
wday_5          3533 non-null uint8
wday_6          3533 non-null uint8
qrtr_2          3533 non-null uint8
qrtr_3          3533 non-null uint8
qrtr_4          3533 non-null uint8
m_start_True    3533 non-null uint8
m_end_True      3533 non-null uint8
dtypes: int64(5), object(1), uint8(31)
```
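Notice, for example, that `month` now contributes only eleven dummy columns (`month_2` through `month_12`): with `drop_first=True` the first level is dropped and is represented implicitly by zeros in all the remaining dummies, which is what avoids duplicated information. A tiny standalone illustration (the toy series below is hypothetical, not from the dataset):

```python
toy = pd.Series(['Q1', 'Q2', 'Q3', 'Q1'], name='quarter')
print(pd.get_dummies(toy, drop_first=True))
# Only the Q2 and Q3 columns remain; a row with zeros (or False) in both
# columns implicitly encodes the dropped first level, Q1.
```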
With the data prepared, you are ready to move to machine learning in the subsequent sections. However, before moving to predictive modeling techniques, it's important to divide the data into training and test sets.
```python
train = df[df["Class"] == "Train"]
test = df[df["Class"] == "Test"]

print(train.shape)
print(test.shape)
```
Output:
```
(3442, 37)
(91, 37)
```
You don't need the `Class` variable now, so that can be dropped using the code below.
```python
train = train.drop(['Class'], axis=1)
test = test.drop(['Class'], axis=1)
```
With the data partitioned, the next step is to create arrays for the features and response variables. The first line of code below creates an object for the target variable called `target_column_train`. The second line builds the list of all the features, excluding the target variable `Sales`. The next two lines create the arrays for the training data, and the last two lines print their shapes.
```python
target_column_train = ['Sales']
# Keep the original column order so the train and test feature arrays line up
predictors_train = [col for col in train.columns if col not in target_column_train]

X_train = train[predictors_train].values
y_train = train[target_column_train].values

print(X_train.shape)
print(y_train.shape)
```
Output:
```
(3442, 35)
(3442, 1)
```
Repeat the same process for the test data with the code below.
```python
target_column_test = ['Sales']
predictors_test = [col for col in test.columns if col not in target_column_test]

X_test = test[predictors_test].values
y_test = test[target_column_test].values

print(X_test.shape)
print(y_test.shape)
```
Output:
```
(91, 35)
(91, 1)
```
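Because the models receive plain NumPy arrays, the feature columns must appear in the same order in `X_train` and `X_test`. A quick optional sanity check:

```python
# The two predictor lists should be identical, element for element
assert predictors_train == predictors_test
```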
You are now ready to build machine learning models. Start by loading the libraries and the modules.
```python
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt
```
Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiator is the one that minimizes the cost metric. The cost metric for a classification tree is often entropy or the Gini index, whereas for a regression tree the default metric is the mean squared error.
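To make the regression cost metric concrete, the short sketch below scores a candidate split by the weighted mean squared error of its two children; the tree greedily chooses the split with the lowest such cost. The `split_cost` helper and the example values are purely illustrative, not part of scikit-learn.

```python
import numpy as np

def split_cost(y, mask):
    """Weighted MSE of a candidate split; mask marks the rows sent to the left child."""
    left, right = y[mask], y[~mask]
    cost = 0.0
    for part in (left, right):
        if len(part) > 0:
            # Each child predicts the mean of its samples; its MSE is weighted by its size
            cost += len(part) / len(y) * np.mean((part - part.mean()) ** 2)
    return cost

# Illustrative target values and one candidate split
y = np.array([51.0, 56.0, 93.0, 86.0, 57.0])
mask = np.array([True, True, False, False, True])
print(split_cost(y, mask))  # lower cost means a better (more homogeneous) split
```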
Create a CART regression model using the `DecisionTreeRegressor` class. The first step is to instantiate the algorithm, which is done in the first line of code below. The second line fits the model on the training set. The arguments used are `max_depth`, which indicates the maximum depth of the tree, and `min_samples_leaf`, which indicates the minimum number of samples required at a leaf node (a float such as 0.13 is interpreted as a fraction of the training samples).
```python
dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
dtree.fit(X_train, y_train)
```
Output:
```
DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=0.13,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=3, splitter='best')
```
Once the model is built on the training set, you can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.
```python
# Code lines 1 to 3
pred_train_tree = dtree.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
print(r2_score(y_train, pred_train_tree))

# Code lines 4 to 6
pred_test_tree = dtree.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
print(r2_score(y_test, pred_test_tree))
```
Output:
```
7.42649982627121
0.8952723676224715
13.797384350596744
0.4567663254694022
```
The above output shows that the RMSE is 7.4 for the training data and 13.8 for the test data. On the other hand, the R-squared value is 89% for the training data and 46% for the test data. There is a gap between the training and test set results, and the model can be improved further by parameter tuning. Change the value of the `max_depth` parameter to see how that affects the model performance.
The first four lines of code below instantiate and fit regression trees with a `max_depth` parameter of two and five, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code give predictions on the testing data.
```python
# Code Lines 1 to 4: Fit the regression trees 'dtree1' and 'dtree2'
dtree1 = DecisionTreeRegressor(max_depth=2)
dtree2 = DecisionTreeRegressor(max_depth=5)
dtree1.fit(X_train, y_train)
dtree2.fit(X_train, y_train)

# Code Lines 5 to 6: Predict on training data
tr1 = dtree1.predict(X_train)
tr2 = dtree2.predict(X_train)

# Code Lines 7 to 8: Predict on testing data
y1 = dtree1.predict(X_test)
y2 = dtree2.predict(X_test)
```
The code below generates the evaluation metrics—RMSE and R-squared—for the first regression tree, 'dtree1'.
```python
# Print RMSE and R-squared value for regression tree 'dtree1' on training data
print(np.sqrt(mean_squared_error(y_train, tr1)))
print(r2_score(y_train, tr1))

# Print RMSE and R-squared value for regression tree 'dtree1' on testing data
print(np.sqrt(mean_squared_error(y_test, y1)))
print(r2_score(y_test, y1))
```
Output:
```
7.146794965406164
0.9030125411762373
11.751081527241734
0.6059522633855321
```
The above output for the 'dtree1' model shows that the RMSE is 7.14 for the training data and 11.7 for the test data. The R-squared value is 90% for the training data and 61% for the test data. This model is better than the previous one on both evaluation metrics, and the gap between the training and test set results has also come down.
We will now examine the performance of the decision tree model, 'dtree2', by running the following lines of code.
```python
# Print RMSE and R-squared value for regression tree 'dtree2' on training data
print(np.sqrt(mean_squared_error(y_train, tr2)))
print(r2_score(y_train, tr2))

# Print RMSE and R-squared value for regression tree 'dtree2' on testing data
print(np.sqrt(mean_squared_error(y_test, y2)))
print(r2_score(y_test, y2))
```
Output:
```
2.13305836695393
0.9913603049571774
11.236614430353763
0.6397001209411287
```
The above output shows a further improvement on the test data: the test RMSE drops to 11.2 and the test R-squared rises to 64%. The training R-squared of 99% indicates that 'dtree2' fits the training data very closely, so some overfitting remains, but on the test set the regression tree with a `max_depth` parameter of five performs best of the three models so far, demonstrating how parameter tuning can improve model performance.
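Rather than trying `max_depth` values by hand, you could search over a grid of candidates. Below is a minimal sketch, assuming a recent scikit-learn release (the scorer name requires version 0.22 or later) and that the rows of `X_train` are in chronological order, which holds here because the data is sorted by date; the parameter grid itself is illustrative.

```python
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Candidate hyperparameter values to try (illustrative)
param_grid = {'max_depth': [2, 3, 5, 8], 'min_samples_leaf': [1, 5, 20]}

# TimeSeriesSplit always validates on folds that come after the training folds,
# which respects the temporal ordering of the observations
grid = GridSearchCV(DecisionTreeRegressor(random_state=3),
                    param_grid,
                    cv=TimeSeriesSplit(n_splits=5),
                    scoring='neg_root_mean_squared_error')
grid.fit(X_train, y_train)
print(grid.best_params_)
```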
Decision Trees are useful, but they often tend to overfit the training data, leading to high variance on the test data. Random Forest algorithms overcome this shortcoming by reducing the variance of the decision trees. They are called a forest because they are a collection, or ensemble, of several decision trees. One major difference between a Decision Tree and a Random Forest model is how the splits happen: in a Random Forest, instead of trying splits on all the features, a sample of the features is considered at each split, thereby reducing the variance of the model.
In scikit-learn, the `RandomForestRegressor` class is used for building random forest regression models. The first line of code below instantiates the Random Forest regression model with an `n_estimators` value of 5000. The argument `n_estimators` indicates the number of trees in the forest. The second line fits the model to the training data.
The third line of code predicts, while the fourth and fifth lines print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.
```python
# RF model
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
# ravel() flattens y_train to the 1-D shape the regressor expects
model_rf.fit(X_train, y_train.ravel())

pred_train_rf = model_rf.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
print(r2_score(y_train, pred_train_rf))

pred_test_rf = model_rf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print(r2_score(y_test, pred_test_rf))
```
Output:
```
0.5859145906944213
0.9993481291496544
8.717826785121657
0.7831249311061259
```
The above output shows that the RMSE and R-squared values on the training data are 0.58 and 99.9%, respectively. For the test data, the results for these metrics are 8.7 and 78%, respectively. The performance of the Random Forest model is far superior to the Decision Tree models built earlier.
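A useful by-product of the Random Forest model is its feature importances, which show which of the engineered date features drive the predictions. A minimal sketch using the fitted model and the predictor list created earlier:

```python
# Rank the engineered features by their importance in the fitted forest
importances = pd.Series(model_rf.feature_importances_, index=predictors_train)
print(importances.sort_values(ascending=False).head(10))
```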
In this guide, you learned how to perform machine learning on time series data. You learned how to create features from the `Date` variable and use them as independent features for model building. You were also introduced to powerful non-linear regression tree algorithms like Decision Trees and Random Forest, which you used to build and evaluate machine learning models.
To learn more about data science using Python, please refer to the following guides.