Machine Learning with Time Series Data in Python

Deepika Singh

  • May 18, 2020
  • 17 Min read

Introduction

Time series algorithms are used extensively for analyzing and forecasting time-based data. However, time-based data often depends on factors beyond time alone, and machine learning has emerged as a powerful method for uncovering these hidden complexities and generating good forecasts.

In this guide, you'll learn the concepts of feature engineering and machine learning from a time series perspective, along with the techniques to implement them in Python.

Data

To begin, get familiar with the data. In this guide, you'll be using a fictitious dataset of daily sales data at a supermarket that contains 3,533 observations and four variables, as described below:

  1. Date: daily sales date

  2. Sales: sales at the supermarket for that day, in thousands of dollars

  3. Inventory: total units of inventory at the supermarket

  4. Class: training and test data class for modeling

Start by loading the required libraries and the data.

import pandas as pd
import numpy as np

# Reading the data
df = pd.read_csv("ml_python.csv")
print(df.shape)
print(df.info())
df.head(5)

Output:

(3533, 4)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 4 columns):
Date         3533 non-null object
Sales        3533 non-null int64
Inventory    3533 non-null int64
Class        3533 non-null object
dtypes: int64(2), object(2)
memory usage: 110.5+ KB
None

        Date        Sales   Inventory   Class
0       29-04-2010  51      40          Train
1       30-04-2010  56      44          Train
2       01-05-2010  93      74          Train
3       02-05-2010  86      68          Train
4       03-05-2010  57      45          Train

Date Features

Sometimes classical time series algorithms won't suffice for making powerful predictions. In such cases, it's sensible to reframe the time series problem as a supervised machine learning problem by creating features from the time variable. The code below uses the pd.DatetimeIndex() function to create time features like year, day of the year, quarter, month, day, weekday, etc.

# Parse the dd-mm-yyyy dates once; dayfirst=True prevents ambiguous
# dates such as 01-05-2010 from being read as January 5
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['year'] = pd.DatetimeIndex(df['Date']).year
df['month'] = pd.DatetimeIndex(df['Date']).month
df['day'] = pd.DatetimeIndex(df['Date']).day
df['dayofyear'] = pd.DatetimeIndex(df['Date']).dayofyear
df['weekofyear'] = pd.DatetimeIndex(df['Date']).weekofyear
df['weekday'] = pd.DatetimeIndex(df['Date']).weekday
df['quarter'] = pd.DatetimeIndex(df['Date']).quarter
df['is_month_start'] = pd.DatetimeIndex(df['Date']).is_month_start
df['is_month_end'] = pd.DatetimeIndex(df['Date']).is_month_end
print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 13 columns):
Date              3533 non-null datetime64[ns]
Sales             3533 non-null int64
Inventory         3533 non-null int64
Class             3533 non-null object
year              3533 non-null int64
month             3533 non-null int64
day               3533 non-null int64
dayofyear         3533 non-null int64
weekofyear        3533 non-null int64
weekday           3533 non-null int64
quarter           3533 non-null int64
is_month_start    3533 non-null bool
is_month_end      3533 non-null bool

You don’t need the Date variable now, so you can drop it.

df = df.drop(['Date'], axis=1)

Dummy Encoding

Some of the variables in the dataset, such as year or quarter, need to be treated as categorical variables. So, you will convert these variables into binary indicator variables using a technique called dummy encoding. In this technique, each level of a categorical variable becomes its own 0/1 column, and one level per variable is dropped so that the encoding carries no redundant information. This is achieved by passing the argument drop_first=True to the .get_dummies() function, as done in the code below. The last line prints the information about the data, which indicates that the data now has 37 variables.

df = pd.get_dummies(df, columns=['year'], drop_first=True, prefix='year')
df = pd.get_dummies(df, columns=['month'], drop_first=True, prefix='month')
df = pd.get_dummies(df, columns=['weekday'], drop_first=True, prefix='wday')
df = pd.get_dummies(df, columns=['quarter'], drop_first=True, prefix='qrtr')
df = pd.get_dummies(df, columns=['is_month_start'], drop_first=True, prefix='m_start')
df = pd.get_dummies(df, columns=['is_month_end'], drop_first=True, prefix='m_end')

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3533 entries, 0 to 3532
Data columns (total 37 columns):
Sales           3533 non-null int64
Inventory       3533 non-null int64
Class           3533 non-null object
day             3533 non-null int64
dayofyear       3533 non-null int64
weekofyear      3533 non-null int64
year_2011       3533 non-null uint8
year_2012       3533 non-null uint8
year_2013       3533 non-null uint8
year_2014       3533 non-null uint8
year_2015       3533 non-null uint8
year_2016       3533 non-null uint8
year_2017       3533 non-null uint8
year_2018       3533 non-null uint8
year_2019       3533 non-null uint8
month_2         3533 non-null uint8
month_3         3533 non-null uint8
month_4         3533 non-null uint8
month_5         3533 non-null uint8
month_6         3533 non-null uint8
month_7         3533 non-null uint8
month_8         3533 non-null uint8
month_9         3533 non-null uint8
month_10        3533 non-null uint8
month_11        3533 non-null uint8
month_12        3533 non-null uint8
wday_1          3533 non-null uint8
wday_2          3533 non-null uint8
wday_3          3533 non-null uint8
wday_4          3533 non-null uint8
wday_5          3533 non-null uint8
wday_6          3533 non-null uint8
qrtr_2          3533 non-null uint8
qrtr_3          3533 non-null uint8
qrtr_4          3533 non-null uint8
m_start_True    3533 non-null uint8
m_end_True      3533 non-null uint8
dtypes: int64(5), object(1), uint8(31)
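To see on a small scale why drop_first=True avoids duplicated information, here is a quick standalone illustration; the demo frame below is made up for this example. With one level dropped, the row of all zeros unambiguously identifies the omitted level, so no information is lost.

demo = pd.DataFrame({'quarter': [1, 2, 3, 4]})
# Four levels become three indicator columns; quarter 1 is the all-zeros row
print(pd.get_dummies(demo, columns=['quarter'], drop_first=True))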

Data Partitioning

With the data prepared, you are ready to move to machine learning in the subsequent sections. However, before moving to predictive modeling techniques, it's important to divide the data into training and test sets.

train = df[df["Class"] == "Train"]
test = df[df["Class"] == "Test"]

print(train.shape)
print(test.shape)

Output:

(3442, 37)
(91, 37)

You don’t need the Class variable now, so that can be dropped using the code below.

train = train.drop(['Class'], axis=1)
test = test.drop(['Class'], axis=1)

Creating Arrays for the Features and the Response Variable

With the data partitioned, the next step is to create arrays for the features and response variables. The first line of code creates an object of the target variable called target_column_train. The second line builds the list of all the features, excluding the target variable Sales. The next two lines create the arrays for the training data, and the last two lines print its shape.

target_column_train = ['Sales']
# Keep column order deterministic so the train and test features line up
predictors_train = [col for col in train.columns if col not in target_column_train]

X_train = train[predictors_train].values
y_train = train[target_column_train].values

print(X_train.shape)
print(y_train.shape)

Output:

(3442, 35)
(3442, 1)

Repeat the same process for the test data with the code below.

target_column_test = ['Sales']
predictors_test = [col for col in test.columns if col not in target_column_test]

X_test = test[predictors_test].values
y_test = test[target_column_test].values

print(X_test.shape)
print(y_test.shape)

Output:

(91, 35)
(91, 1)

You are now ready to build machine learning models. Start by loading the libraries and the modules.

from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

Decision Trees

Decision Trees, also referred to as Classification and Regression Trees (CART), work for both categorical and continuous input and output variables. They work by splitting the data into two or more homogeneous sets based on the most significant splitter among the independent variables. The best differentiator is the one that minimizes the cost metric. The cost metric for a classification tree is often the entropy or the gini index, whereas for a regression tree, the default metric is the mean squared error.
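To make that cost calculation concrete, here is a minimal sketch on a made-up, single-feature example (not the supermarket data): each candidate threshold is scored by the weighted mean squared error of the two groups it creates, and the threshold at the level shift in y wins.

import numpy as np

def split_cost(x, y, threshold):
    """Weighted MSE of the two groups created by splitting x at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    cost = 0.0
    for group in (left, right):
        if len(group):
            cost += len(group) / len(y) * np.mean((group - group.mean()) ** 2)
    return cost

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([50.0, 55.0, 52.0, 90.0, 95.0, 92.0])  # level shifts after x = 3
print(min((split_cost(x, y, t), t) for t in x[:-1]))  # lowest cost at t = 3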

Create a CART regression model using the DecisionTreeRegressor class. The first step is to instantiate the algorithm, which is done in the first line of code below. The second line fits the model on the training set. The arguments used are max_depth, which indicates the maximum depth of the tree, and min_samples_leaf, which indicates the minimum number of samples required at a leaf node. Because min_samples_leaf is given here as the fraction 0.13, each leaf must contain at least 13% of the training samples.

dtree = DecisionTreeRegressor(max_depth=8, min_samples_leaf=0.13, random_state=3)
dtree.fit(X_train, y_train)

Output:

DecisionTreeRegressor(criterion='mse', max_depth=8, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=0.13,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=3, splitter='best')

Once the model is built on the training set, you can make the predictions. The first line of code below predicts on the training set. The second and third lines of code print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the fourth to sixth lines.

# Code lines 1 to 3
pred_train_tree = dtree.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_tree)))
print(r2_score(y_train, pred_train_tree))

# Code lines 4 to 6
pred_test_tree = dtree.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_tree)))
print(r2_score(y_test, pred_test_tree))

Output:

7.42649982627121
0.8952723676224715
13.797384350596744
0.4567663254694022

The above output shows that the RMSE is 7.4 for the training data and 13.8 for the test data. On the other hand, the R-squared value is 89% for the training data and 46% for the test data. There is a gap between the training and test set results, and there is room for improvement through parameter tuning. Change the value of the max_depth parameter to see how that affects the model performance.

The first four lines of code below instantiate and fit the regression trees with a max_depth parameter of two and five, respectively. The fifth and sixth lines of code generate predictions on the training data, whereas the seventh and eighth lines of code give predictions on the testing data.

# Code lines 1 to 4: Fit the regression trees 'dtree1' and 'dtree2'
dtree1 = DecisionTreeRegressor(max_depth=2)
dtree2 = DecisionTreeRegressor(max_depth=5)
dtree1.fit(X_train, y_train)
dtree2.fit(X_train, y_train)

# Code lines 5 to 6: Predict on training data
tr1 = dtree1.predict(X_train)
tr2 = dtree2.predict(X_train)

# Code lines 7 to 8: Predict on testing data
y1 = dtree1.predict(X_test)
y2 = dtree2.predict(X_test)

The code below generates the evaluation metrics—RMSE and R-squared—for the first regression tree, 'dtree1'.

# Print RMSE and R-squared value for regression tree 'dtree1' on training data
print(np.sqrt(mean_squared_error(y_train, tr1)))
print(r2_score(y_train, tr1))

# Print RMSE and R-squared value for regression tree 'dtree1' on testing data
print(np.sqrt(mean_squared_error(y_test, y1)))
print(r2_score(y_test, y1))

Output:

7.146794965406164
0.9030125411762373
11.751081527241734
0.6059522633855321

The above output for the 'dtree1' model shows that the RMSE is 7.1 for the training data and 11.8 for the test data. The R-squared value is 90% for the training data and 61% for the test data. This model beats the previous one on both evaluation metrics, and the gap between the training and test set results has also narrowed.

Now examine the performance of the second decision tree model, 'dtree2', by running the following lines of code.

# Print RMSE and R-squared value for regression tree 'dtree2' on training data
print(np.sqrt(mean_squared_error(y_train, tr2)))
print(r2_score(y_train, tr2))

# Print RMSE and R-squared value for regression tree 'dtree2' on testing data
print(np.sqrt(mean_squared_error(y_test, y2)))
print(r2_score(y_test, y2))

Output:

2.13305836695393
0.9913603049571774
11.236614430353763
0.6397001209411287

The above output shows a further gain on the test data: the R-squared values for the training and test sets increased to 99% and 64%, respectively, and the test RMSE dropped to 11.2. Note, however, that the spread between the training and test scores has widened, a sign that the deeper tree fits the training data very closely. Even so, the regression tree with a max_depth parameter of five performs best on the test set, demonstrating how parameter tuning can improve model performance.
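Rather than trying max_depth values one at a time, you can also search the parameter space systematically with cross-validation. Below is a minimal sketch using the model_selection module imported earlier; the grid values are illustrative rather than tuned recommendations, and TimeSeriesSplit keeps the validation folds in chronological order, which suits time-based data better than shuffled folds.

# A sketch of a systematic search over tree parameters
param_grid = {'max_depth': [2, 3, 5, 8], 'min_samples_leaf': [1, 5, 10]}
grid = model_selection.GridSearchCV(
    DecisionTreeRegressor(random_state=3),
    param_grid,
    cv=model_selection.TimeSeriesSplit(n_splits=5),  # chronological folds
    scoring='neg_mean_squared_error')
grid.fit(X_train, y_train.ravel())
print(grid.best_params_)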

Random Forest

Decision Trees are useful, but they often tend to overfit the training data, leading to high variance on the test data. Random Forest algorithms overcome this shortcoming by reducing the variance of the decision trees. The algorithm is called a forest because it is a collection, or ensemble, of several decision trees. One major difference between a Decision Tree and a Random Forest is how the splits happen: in a Random Forest, instead of trying splits on all the features, a random sample of features is considered at each split, thereby reducing the variance of the model.
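To make those mechanics concrete, here is a conceptual sketch of bagging with per-split feature sampling, built from numpy and the DecisionTreeRegressor imported earlier. It illustrates the idea only; scikit-learn's actual implementation is more sophisticated.

def toy_forest_predict(X_tr, y_tr, X_te, n_trees=100, seed=0):
    rng = np.random.RandomState(seed)
    preds = []
    for _ in range(n_trees):
        # Each tree is fit on a bootstrap sample of the rows...
        idx = rng.choice(len(X_tr), size=len(X_tr), replace=True)
        # ...and considers only a random subset of features at each split
        tree = DecisionTreeRegressor(max_features='sqrt',
                                     random_state=rng.randint(10**6))
        tree.fit(X_tr[idx], y_tr[idx].ravel())
        preds.append(tree.predict(X_te))
    # Averaging many decorrelated trees reduces the variance of the forecast
    return np.mean(preds, axis=0)

Calling toy_forest_predict(X_train, y_train, X_test) would return the averaged predictions for the test rows.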

In scikit-learn, the RandomForestRegressor class is used for building random forest regression models. The first line of code below instantiates the model with an n_estimators value of 5000. The argument n_estimators indicates the number of trees in the forest. The second line fits the model to the training data.

The third line of code predicts, while the fourth and fifth lines print the evaluation metrics—RMSE and R-squared—on the training set. The same steps are repeated on the test dataset in the sixth to eighth lines of code.

# RF model
model_rf = RandomForestRegressor(n_estimators=5000, oob_score=True, random_state=100)
model_rf.fit(X_train, y_train.ravel())  # ravel() flattens y to the 1-D shape sklearn expects
pred_train_rf = model_rf.predict(X_train)
print(np.sqrt(mean_squared_error(y_train, pred_train_rf)))
print(r2_score(y_train, pred_train_rf))

pred_test_rf = model_rf.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, pred_test_rf)))
print(r2_score(y_test, pred_test_rf))

Output:

0.5859145906944213
0.9993481291496544
8.717826785121657
0.7831249311061259

The above output shows that the RMSE and R-squared values on the training data are 0.59 and 99.9%, respectively. For the test data, the results for these metrics are 8.7 and 78%, respectively. The performance of the Random Forest model is far superior to the Decision Tree models built earlier.
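Because the model was fit with oob_score=True, each tree was also scored on the bootstrap samples it never saw during training, so you can read off an out-of-bag R-squared as a built-in estimate of generalization:

# Out-of-bag R-squared, computed from the samples left out of each tree's bootstrap
print(model_rf.oob_score_)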

Conclusion

In this guide, you learned how to perform machine learning on time series data. You learned how to create features from the Date variable and use them as independent features for model building. You were also introduced to powerful non-linear regression tree algorithms like Decision Trees and Random Forest, which you used to build and evaluate a machine learning model.

To learn more about data science using Python, please refer to the following guides.