Machine Learning with Time Series Data in Python

May 18, 2020 • 17 Minute Read

Introduction

Time series algorithms are used extensively for analyzing and forecasting time-based data. However, a series is often driven by factors other than time alone, and machine learning has emerged as a powerful way to capture those additional relationships and improve forecasts.

In this guide, you'll learn the concepts of feature engineering and machine learning from a time series perspective, along with the techniques to implement them in Python.

Data

To begin, get familiar with the data. In this guide, you'll be using a fictitious dataset of daily sales data at a supermarket that contains 3,533 observations and four variables, as described below:

Date: daily sales date
Sales: sales at the supermarket for that day, in thousands of dollars
Inventory: total units of inventory at the supermarket
Class: training and test data class for modeling

Start by loading the required libraries and the data.

import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("ml_python.csv")
print(df.shape)
print(df.info())
df.head(5)
    

Output:

(3533, 4)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 4 columns):
Date         3533 non-null object
Sales        3533 non-null int64
Inventory    3533 non-null int64
Class        3533 non-null object
dtypes: int64(2), object(2)
memory usage: 110.5+ KB
None    

         Date  Sales  Inventory  Class
0  29-04-2010     51         40  Train
1  30-04-2010     56         44  Train
2  01-05-2010     93         74  Train
3  02-05-2010     86         68  Train
4  03-05-2010     57         45  Train
    

Date Features

Classical time series algorithms sometimes fall short of making accurate predictions. In such cases, it's sensible to recast the forecasting task as a supervised machine learning problem by creating features from the time variable. The code below uses the pd.DatetimeIndex() function to create time features such as year, day of the year, quarter, month, day, and weekday.

# Parse the day-first strings (DD-MM-YYYY) with an explicit format,
# so that 01-05-2010 is read as 1 May rather than 5 January
dates = pd.DatetimeIndex(pd.to_datetime(df['Date'], format='%d-%m-%Y'))

df['year'] = dates.year
df['month'] = dates.month
df['day'] = dates.day
df['dayofyear'] = dates.dayofyear
# .weekofyear was removed in pandas 2.0; isocalendar() gives the ISO week number
df['weekofyear'] = dates.isocalendar().week.astype('int64').to_numpy()
df['weekday'] = dates.weekday
df['quarter'] = dates.quarter
df['is_month_start'] = dates.is_month_start
df['is_month_end'] = dates.is_month_end
print(df.info())
    

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 13 columns):
Date              3533 non-null object
Sales             3533 non-null int64
Inventory         3533 non-null int64
Class             3533 non-null object
year              3533 non-null int64
month             3533 non-null int64
day               3533 non-null int64
dayofyear         3533 non-null int64
weekofyear        3533 non-null int64
weekday           3533 non-null int64
quarter           3533 non-null int64
is_month_start    3533 non-null bool
is_month_end      3533 non-null bool
    

You don’t need the Date variable now, so you can drop it.

df = df.drop(['Date'], axis=1)
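The date-feature step can be tried out on a few rows before running it on the full dataset. The sketch below is only an illustration: the three-row frame `sample` is a hypothetical stand-in that reuses the first dates from the head output, and it uses the `.dt` accessor rather than pd.DatetimeIndex(), which is a stylistic choice and not necessarily the approach the guide intends.

```python
import pandas as pd

# Hypothetical three-row sample in the same DD-MM-YYYY layout as the Date column.
sample = pd.DataFrame({"Date": ["29-04-2010", "30-04-2010", "01-05-2010"]})

# An explicit format avoids reading 01-05-2010 as January 5.
dates = pd.to_datetime(sample["Date"], format="%d-%m-%Y")

features = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "dayofyear": dates.dt.dayofyear,
    "weekofyear": dates.dt.isocalendar().week.astype("int64"),
    "weekday": dates.dt.weekday,  # Monday=0 ... Sunday=6
    "quarter": dates.dt.quarter,
    "is_month_start": dates.dt.is_month_start,
    "is_month_end": dates.dt.is_month_end,
})
print(features)
```

Checking that the third row reports month 5 and is_month_start True is a quick way to confirm the dates were parsed day-first.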

Dummy Encoding

Some of the variables in the dataset, such as year or quarter, need to be treated as categorical variables. You will convert each of them into a set of binary indicator columns using a technique called dummy encoding. For a variable with k categories, only k-1 indicator columns are kept, because the omitted category is implied when all the others are zero; this avoids duplicating information. This is achieved by passing the argument drop_first=True to the pd.get_dummies() function, as done in the code below. The last line prints the information about the data, which indicates that the data now has 37 variables.

df = pd.get_dummies(df, columns=['year'], drop_first=True, prefix='year')
df = pd.get_dummies(df, columns=['month'], drop_first=True, prefix='month')
df = pd.get_dummies(df, columns=['weekday'], drop_first=True, prefix='wday')
df = pd.get_dummies(df, columns=['quarter'], drop_first=True, prefix='qrtr')
df = pd.get_dummies(df, columns=['is_month_start'], drop_first=True, prefix='m_start')
df = pd.get_dummies(df, columns=['is_month_end'], drop_first=True, prefix='m_end')

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 37 columns):
Sales           3533 non-null int64
Inventory       3533 non-null int64
Class           3533 non-null object
day             3533 non-null int64
dayofyear       3533 non-null int64
weekofyear      3533 non-null int64
year_2011       3533 non-null uint8
year_2012       3533 non-null uint8
year_2013       3533 non-null uint8
year_2014       3533 non-null uint8
year_2015       3533 non-null uint8
year_2016       3533 non-null uint8
year_2017       3533 non-null uint8
year_2018       3533 non-null uint8
year_2019       3533 non-null uint8
month_2         3533 non-null uint8
month_3         3533 non-null uint8
month_4         3533 non-null uint8
month_5         3533 non-null uint8
month_6         3533 non-null uint8
month_7         3533 non-null uint8
month_8         3533 non-null uint8
month_9         3533 non-null uint8
month_10        3533 non-null uint8
month_11        3533 non-null uint8
month_12        3533 non-null uint8
wday_1          3533 non-null uint8
wday_2          3533 non-null uint8
wday_3          3533 non-null uint8
wday_4          3533 non-null uint8
wday_5          3533 non-null uint8
wday_6          3533 non-null uint8
qrtr_2          3533 non-null uint8
qrtr_3          3533 non-null uint8
qrtr_4          3533 non-null uint8
m_start_True    3533 non-null uint8
m_end_True      3533 non-null uint8
dtypes: int64(5), object(1), uint8(31)
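To see what drop_first=True changes, it can help to encode a tiny column both ways. This is a minimal sketch on a hypothetical frame named `toy`; only pd.get_dummies() and its drop_first and prefix arguments come from the guide.

```python
import pandas as pd

# Hypothetical toy frame; 'quarter' takes four distinct values.
toy = pd.DataFrame({"quarter": [1, 2, 3, 4, 1]})

full = pd.get_dummies(toy, columns=["quarter"], prefix="qrtr")
reduced = pd.get_dummies(toy, columns=["quarter"], prefix="qrtr", drop_first=True)

print(list(full.columns))     # four indicator columns
print(list(reduced.columns))  # three: the first level (quarter 1) becomes the baseline

# With all four indicators, every row sums to 1, so one column is redundant.
print(full.sum(axis=1).tolist())
```

A row that belongs to quarter 1 is represented in `reduced` by zeros in all three remaining columns, so no information is lost.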
    

Data Partitioning

With the data prepared, you are ready to move to machine learning in the subsequent sections. However, before moving to predictive modeling techniques, it's important to divide the data into training and test sets.

train = df[df["Class"] == "Train"]
test = df[df["Class"] == "Test"]

print(train.shape)
print(test.shape)
    

Output:

(3442, 37)
(91, 37)

You don’t need the Class variable now, so that can be dropped using the code below.

train = train.drop(['Class'], axis=1)
test = test.drop(['Class'], axis=1)
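The dataset here already carries a Class column, so the split is a simple filter. When no such column exists, one common option for time series is a chronological holdout. The sketch below assumes that approach; the frame `toy` and the three-day holdout are hypothetical, and the guide does not state how its own Train and Test labels were assigned.

```python
import pandas as pd

# Hypothetical daily frame standing in for the sales data.
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "Sales": range(10),
})

# Assumption: hold out the last 3 days; the cutoff is illustrative only.
holdout_days = 3
toy = toy.sort_values("Date")
train_toy = toy.iloc[:-holdout_days]
test_toy = toy.iloc[-holdout_days:]

# Every training date precedes every test date, so no future rows leak into training.
print(train_toy["Date"].max() < test_toy["Date"].min())
print(train_toy.shape, test_toy.shape)
```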
    

Creating Arrays for the Features and the Response Variable

With the data partitioned, the next step is to create arrays for the features and the response variable. The first line of code stores the name of the target variable in a list called target_column_train. The second line gives the list of all the features, excluding the target variable Sales. The next two lines create the arrays for the training data, and the last two lines print their shapes.

# Predictors keep the DataFrame's column order (a set difference would not guarantee it)
target_column_train = ['Sales']
predictors_train = [col for col in train.columns if col not in target_column_train]

X_train = train[predictors_train].values
y_train = train[target_column_train].values

print(X_train.shape)
print(y_train.shape)
    

Output:

(3442, 35)
(3442, 1)
    

Repeat the same process for the test data with the code below.

# Same column order as the training predictors, so the feature arrays line up
target_column_test = ['Sales']
predictors_test = [col for col in test.columns if col not in target_column_test]

X_test = test[predictors_test].values
y_test = test[target_column_test].values

print(X_test.shape)
print(y_test.shape)
    

Output:

(91, 35)
(91, 1)
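Because the models only see positional arrays, the training and test feature columns must appear in the same order. The following sketch illustrates one way to guarantee that, by building the predictor list once and reusing it. The frames `train_toy` and `test_toy` are hypothetical, and the 1-D target shown here is an alternative to the (n, 1) arrays used in this guide.

```python
import pandas as pd

# Hypothetical frames with the same columns stored in a different order.
train_toy = pd.DataFrame({"Sales": [10, 20], "day": [1, 2], "Inventory": [8, 16]})
test_toy = pd.DataFrame({"Inventory": [24], "Sales": [30], "day": [3]})

target = "Sales"
# A list comprehension keeps the DataFrame's column order; a set guarantees no order.
predictors = [col for col in train_toy.columns if col != target]

X_train_toy = train_toy[predictors].values
X_test_toy = test_toy[predictors].values  # reuse the training list so columns line up
y_train_toy = train_toy[target].values    # 1-D target array

print(predictors)
print(X_train_toy.shape, X_test_toy.shape, y_train_toy.shape)
```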
    

You are now ready to build machine learning models. Start by loading the libraries and the modules.

from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from math import sqrt
    

Decision Trees

Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by repeatedly splitting the data into two or more homogeneous subsets, each time choosing the variable and split point that best separates the observations. The best split is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the Gini index, whereas for a regression tree, the default metric is the mean squared error.
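The idea of choosing the split that minimizes the mean squared error can be illustrated for a single feature. This is a simplified sketch, not scikit-learn's implementation: the helper `best_split` is hypothetical, and the five Inventory and Sales pairs are the rows shown in the head output earlier in this guide.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) for the best single split on one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_cost = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for left_value, right_value in zip(values[:-1], values[1:]):
        threshold = (left_value + right_value) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        # Weighted average of each group's MSE around its own mean.
        cost = (len(left) * left.var() + len(right) * right.var()) / len(y)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# The five rows from the head output: Inventory as the feature, Sales as the target.
inventory = [40, 44, 74, 68, 45]
sales = [51, 56, 93, 86, 57]
threshold, cost = best_split(inventory, sales)
print(threshold, cost)
```

On these five rows the chosen threshold falls between 45 and 68 units of inventory, which separates the three low-sales days from the two high-sales days.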

Create a CART regression model using the DecisionTreeRegressor class. The first line of code below instantiates the algorithm, and the second line fits the model on the training set. The arguments used are max_depth, which indicates the maximum depth of the tree, and min_samples_leaf, which indicates the minimum number of samples required at a leaf node. Because min_samples_leaf is given here as a float, scikit-learn interprets it as a fraction of the training samples rather than as a count.

dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
dtree.fit(X_train, y_train)
    

Output:

DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=0.13,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=3, splitter='best')
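A float value for min_samples_leaf is read as a fraction, so each leaf must hold at least ceil(fraction × n_samples) rows. With 3,442 training rows and a fraction of 0.13, that is ceil(447.46) = 448 rows per leaf, which allows at most seven leaves; the leaf-size limit, rather than max_depth=8, is therefore the binding constraint. The sketch below checks the same rule on synthetic data; `X_demo`, `y_demo`, and `demo_tree` are illustrative names, not part of the guide's dataset.

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only (not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=200)

fraction = 0.13
demo_tree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=fraction, random_state=3)
demo_tree.fit(X_demo, y_demo)

# Minimum rows per leaf implied by the fraction.
min_rows = math.ceil(fraction * len(X_demo))

# Leaves are the nodes without children (children_left == -1 in the fitted tree).
fitted = demo_tree.tree_
is_leaf = fitted.children_left == -1
leaf_sizes = fitted.n_node_samples[is_leaf]
print(min_rows, leaf_sizes.min(), int(is_leaf.sum()))
```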
    

Once the model is built on the training set, you can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.

# Code lines 1 to 3
pred_train_tree = dtree.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
print(r2_score(y_train, pred_train_tree))

# Code lines 4 to 6
pred_test_tree = dtree.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
print(r2_score(y_test, pred_test_tree))
    

Output:

7.42649982627121
0.8952723676224715
13.797384350596744
0.4567663254694022
    

The above output shows that the RMSE is 7.4 for the training data and 13.8 for the test data. The R-squared value is 89% for the training data and 46% for the test data. There is a sizable gap between the training and test set results, which parameter tuning may help to narrow. Change the value of the max_depth parameter to see how that affects the model performance.
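For reference, both metrics can be computed directly from their definitions. The sketch below is a minimal illustration with hypothetical helper names (`rmse`, `r_squared`) and made-up predictions; in practice the scikit-learn functions used in this guide do the same job.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: typical error size in the target's own units."""
    # ravel() flattens (n, 1) arrays so the subtraction does not broadcast to (n, n).
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the target's variance explained by the predictions."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actuals and predictions, in thousands of dollars.
actual = np.array([[51.0], [56.0], [93.0], [86.0], [57.0]])  # shape (5, 1), like y_train
predicted = np.array([50.0, 58.0, 90.0, 88.0, 60.0])         # shape (5,), like predict()
print(rmse(actual, predicted), r_squared(actual, predicted))
```

RMSE is expressed in the units of Sales, so lower is better, while R-squared is unitless and equals 1 for perfect predictions.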

The first four lines of code below instantiate and fit two regression trees with a max_depth parameter of two and five, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code generate predictions on the test data.

# Code Lines 1 to 4: Fit the regression trees 'dtree1' and 'dtree2'
dtree1 = DecisionTreeRegressor(max_depth=2)
dtree2 = DecisionTreeRegressor(max_depth=5)
dtree1.fit(X_train, y_train)
dtree2.fit(X_train, y_train)

# Code Lines 5 to 6: Predict on training data
tr1 = dtree1.predict(X_train)
tr2 = dtree2.predict(X_train)

# Code Lines 7 to 8: Predict on testing data
y1 = dtree1.predict(X_test)
y2 = dtree2.predict(X_test)
    

The code below generates the evaluation metrics—RMSE and R-squared—for the first regression tree, 'dtree1'.

# Print RMSE and R-squared value for regression tree 'dtree1' on training data
print(np.sqrt(mean_squared_error(y_train, tr1)))
print(r2_score(y_train, tr1))

# Print RMSE and R-squared value for regression tree 'dtree1' on testing data
print(np.sqrt(mean_squared_error(y_test, y1)))
print(r2_score(y_test, y1))
    

Output:

7.146794965406164
0.9030125411762373
11.751081527241734
0.6059522633855321
    

The above output for the 'dtree1' model shows that the RMSE is 7.15 for the training data and 11.8 for the test data. The R-squared value is 90% for the training data and 61% for the test data. This model is better than the previous model on both evaluation metrics, and the gap between the training and test set results has also come down.

Now examine the performance of the second decision tree model, 'dtree2', by running the following lines of code.

# Print RMSE and R-squared value for regression tree 'dtree2' on training data
print(np.sqrt(mean_squared_error(y_train, tr2)))
print(r2_score(y_train, tr2))

# Print RMSE and R-squared value for regression tree 'dtree2' on testing data
print(np.sqrt(mean_squared_error(y_test, y2)))
print(r2_score(y_test, y2))
    

Output:

2.13305836695393
0.9913603049571774
11.236614430353763
0.6397001209411287
    

The above output shows a further improvement in accuracy over the earlier models. The R-squared values for the training and test sets increased to 99% and 64%, respectively. Note, however, that the gap between the training and test results has widened: it was 29 percentage points for 'dtree1' (90% versus 61%) and is 35 points here, a sign that the deeper tree fits the training data more closely than it generalizes. On the test set, the regression tree with a max_depth parameter of five still performs best of the three, demonstrating how parameter tuning can change model performance.
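Comparing depths on the test set works for a demonstration, but repeatedly tuning against the same test data can make the final score optimistic. One possible alternative, sketched below on synthetic data, is to compare max_depth values with time-ordered cross-validation on the training rows only. The use of TimeSeriesSplit, the candidate depths, and the names `X_demo` and `y_demo` are assumptions for illustration and are not part of this guide's workflow.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, time-ordered data for illustration only (not the sales dataset).
rng = np.random.default_rng(0)
n = 400
X_demo = np.column_stack([np.arange(n) % 7, rng.uniform(0, 100, n)])
y_demo = 5 * X_demo[:, 0] + 0.8 * X_demo[:, 1] + rng.normal(0, 5, n)

# Each fold trains on earlier rows and validates on the rows that follow them.
cv = TimeSeriesSplit(n_splits=5)

scores = {}
for depth in [2, 5, 8]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=3)
    fold_r2 = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
    scores[depth] = float(fold_r2.mean())

# Pick the depth with the highest mean validation R-squared.
best_depth = max(scores, key=scores.get)
print(scores, best_depth)
```

The test set can then be kept aside for a single final evaluation of the chosen depth.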

Random Forest

Decision Trees are useful, but they often tend to overfit the training data, leading to high variance in their predictions on unseen data. Random Forest algorithms overcome this shortcoming by averaging many trees, which reduces the variance. They are called a Forest because they are a collection, or ensemble, of several decision trees, each trained on a bootstrap sample of the training data. One major difference between a Decision Tree and a Random Forest model is how the splits happen. In a Random Forest, instead of trying splits on all the features, a random sample of features can be considered for each split, which decorrelates the trees and further reduces the variance of the model.

In scikit-learn, the RandomForestRegressor class is used for building Random Forest regression models. The first line of code below instantiates the model with an n_estimators value of 5000. The argument n_estimators indicates the number of trees in the forest. The second line fits the model to the training data.

The third line of code predicts, while the fourth and fifth lines print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.

# RF model
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel())  # ravel() passes a 1-D target and avoids a shape warning
pred_train_rf = model_rf.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
print(r2_score(y_train, pred_train_rf))

pred_test_rf = model_rf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print(r2_score(y_test, pred_test_rf))
    

Output:

0.5859145906944213
0.9993481291496544
8.717826785121657
0.7831249311061259
    

The above output shows that the RMSE and R-squared values on the training data are 0.59 and 99.9%, respectively. For the test data, the results for these metrics are 8.7 and 78%, respectively. On the test set, the Random Forest model clearly outperforms the Decision Tree models built earlier, although a gap between training and test performance remains.
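Two properties of the fitted forest are worth inspecting. A Random Forest regressor predicts by averaging its individual trees, and because the model above was created with oob_score=True, it also exposes an out-of-bag R-squared in the oob_score_ attribute. The sketch below illustrates both on synthetic data with a much smaller forest; `X_demo`, `y_demo`, and `forest` are illustrative names, not objects from this guide.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only (not the sales dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=300)

# A small forest keeps the sketch fast; the guide's model uses n_estimators=5000.
forest = RandomForestRegressor(n_estimators=50, oob_score=True, random_state=100)
forest.fit(X_demo, y_demo)

# The forest's prediction is the average of its individual trees' predictions.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
matches = bool(np.allclose(manual_average, forest.predict(X_demo)))
print(matches)

# Out-of-bag R-squared: each row is scored only by trees that did not train on it.
print(forest.oob_score_)
```

The out-of-bag score gives an estimate of generalization that does not require touching the test set, which makes it a useful companion to the training metrics.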

Conclusion

In this guide, you learned how to perform machine learning on time series data. You learned how to create features from the Date variable and use them as independent features for model building. You were also introduced to powerful non-linear regression tree algorithms like Decision Trees and Random Forest, which you used to build and evaluate a machine learning model.
