Advanced Time Series Modeling (ARIMA) Models in Python

Jun 4, 2020 • 10 Minute Read

Introduction

Time series algorithms are used extensively for analyzing and forecasting time-based data. One set of popular and powerful time series algorithms is the ARIMA class of models, which are based on describing autocorrelations in the data.

ARIMA stands for Autoregressive Integrated Moving Average and has three components, p, d, and q, that are required to build the ARIMA model. These three components are:

p: Number of autoregressive lags

d: Order of differencing required to make the series stationary

q: Number of moving average lags
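To make the autoregressive idea concrete, the following sketch simulates a series in which each value depends on the previous one and measures its lag-1 autocorrelation. The coefficient 0.8, the seed, and the helper name lag_autocorrelation are illustrative assumptions and are not part of the sales data used in this guide.

```python
import numpy as np

def lag_autocorrelation(series, lag=1):
    """Sample correlation between a series and itself shifted by `lag` steps."""
    y = np.asarray(series, dtype=float)
    return np.corrcoef(y[:-lag], y[lag:])[0, 1]

rng = np.random.default_rng(42)   # illustrative seed
noise = rng.normal(size=5000)

# AR(1) process (p = 1): y_t = 0.8 * y_{t-1} + noise_t
ar1 = np.zeros_like(noise)
for t in range(1, len(noise)):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]

print("white noise lag-1 autocorrelation:", round(lag_autocorrelation(noise), 2))
print("AR(1) lag-1 autocorrelation:      ", round(lag_autocorrelation(ar1), 2))
```

The white noise value should come out close to 0 and the AR(1) value close to 0.8, which is the kind of dependence on past values that the p term of an ARIMA model is meant to capture.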

In this guide, you will learn the core concepts of ARIMA modeling and how to implement it in Python. Let's begin with understanding and loading the data.

Data

This guide uses the fictitious monthly sales data of a supermarket chain containing 564 observations and three variables, as described below:

  1. Date: the first date of every month

  2. Sales: monthly sales, in thousands of dollars

  3. Class: the variable denoting the training and test data set partition

The lines of code below import the required libraries and the data.

    import pandas as pd
    import numpy as np

    # Reading the data
    df = pd.read_csv("data.csv")
    print(df.shape)
    print(df.info())
    

Output:

    (564, 3)
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 564 entries, 0 to 563
    Data columns (total 3 columns):
    Date     564 non-null object
    Sales    564 non-null int64
    Class    564 non-null object
    dtypes: int64(1), object(2)
    memory usage: 13.3+ KB
    None
    

The next step is to create the training and test datasets for model building and evaluation.

    train = df[df["Class"] == "Train"]
    test = df[df["Class"] == "Test"]
    print(train.shape)
    print(test.shape)
    

Output:

    (552, 3)
    (12, 3)
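The data.csv file is not distributed with this guide, so the following sketch builds a synthetic stand-in with the same three columns and the same 552/12 split. Everything about the Sales values (seed, starting level, drift, and spread) is an assumption made purely so that the later steps can be run; results obtained from it will not match the outputs shown in this guide.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed

# 564 month-start dates; index 552 falls on January 2014, as in the guide's output
dates = pd.date_range("1968-01-01", periods=564, freq="MS")

# Synthetic sales: a random walk with a small upward drift (NOT the real data)
sales = 2000 + np.cumsum(rng.normal(loc=9, scale=100, size=564))

synthetic_df = pd.DataFrame({
    "Date": dates.strftime("%d-%m-%Y"),        # day-month-year strings
    "Sales": sales.round().astype("int64"),
    "Class": ["Train"] * 552 + ["Test"] * 12,  # last 12 months held out
})

synthetic_train = synthetic_df[synthetic_df["Class"] == "Train"]
synthetic_test = synthetic_df[synthetic_df["Class"] == "Test"]
print(synthetic_train.shape, synthetic_test.shape)  # (552, 3) (12, 3)
```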
    

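You should also extract the Sales column of each partition as a one-dimensional pandas Series, which is the input used for the stationarity test and the model. This is done with the code below.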

    train_array = train["Sales"]
    print(train_array.shape)

    test_array = test["Sales"]
    print(test_array.shape)
    

Output:

    (552,)
    (12,)
    

With the data prepared, you are ready to move to the forecasting techniques in the subsequent sections. However, before building ARIMA models, it's important to understand the statistical concept of stationarity.

Stationary Series

ARIMA assumes that the series is stationary once it has been differenced d times. A stationary series is one whose statistical properties, such as its mean, variance, and autocorrelation, do not change over time. There are several methods to check the stationarity of a series. The one you'll use in this guide is the Augmented Dickey-Fuller test.
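As a rough illustration of what "properties that do not change over time" means, the sketch below compares the mean of the first and second half of two simulated series. The series, the seed, and the helper name half_means are assumptions made for illustration; this is an informal check and not a substitute for a formal test.

```python
import numpy as np

def half_means(series):
    """Return the mean of the first and second half of a series."""
    y = np.asarray(series, dtype=float)
    mid = len(y) // 2
    return y[:mid].mean(), y[mid:].mean()

rng = np.random.default_rng(1)  # illustrative seed
t = np.arange(400)
stationary = rng.normal(size=400)            # constant mean and variance
trending = 0.5 * t + rng.normal(size=400)    # mean grows with time

print("stationary halves:", half_means(stationary))
print("trending halves:  ", half_means(trending))
```

For the stationary series the two half-means should be nearly equal, while for the trending series they should differ by roughly 100, since the trend adds 0.5 per step over 200 steps.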

Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller test is a type of statistical unit root test. The test uses an autoregressive model and optimizes an information criterion across multiple different lag values.

The null hypothesis of the test is that the time series is not stationary, while the alternate hypothesis (rejecting the null hypothesis) is that the time series is stationary.
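The idea behind the test can be sketched with plain NumPy. The code below runs the simpler, non-augmented Dickey-Fuller regression (no extra lagged differences), so it is a teaching aid and not a replacement for a full implementation. The function name, the simulated series, and the seed are assumptions; -2.86 is the approximate large-sample 5% critical value for the constant-only case.

```python
import numpy as np

def dickey_fuller_t_stat(series):
    """t-statistic for gamma in: diff(y)_t = alpha + gamma * y_{t-1} + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)                                   # left-hand side
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)     # ordinary least squares
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # approximate value, constant-only case, large samples

rng = np.random.default_rng(0)  # illustrative seed
white_noise = rng.normal(size=500)
random_walk = np.cumsum(rng.normal(size=500))

for name, data in [("white noise", white_noise), ("random walk", random_walk)]:
    t_stat = dickey_fuller_t_stat(data)
    print(name, round(t_stat, 2), "reject unit root:", t_stat < CRITICAL_5PCT)
```

A t-statistic far below the critical value, as expected for white noise, leads to rejecting the null hypothesis of a unit root. A random walk will usually, though not always, fail to reject it.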

The first step is to import the adfuller function from the statsmodels package. This is done in the first line of code below. The second line runs the test and prints its p-value, which is the second element of the returned tuple.

    from statsmodels.tsa.stattools import adfuller

    print("p-value:", adfuller(train_array.dropna())[1])
    

Output:

      p-value: 0.3440379665909026
    

The output above shows a p-value of about 0.34, which is greater than the significance level of 0.05, so we fail to reject the null hypothesis. The test provides no evidence that the series is stationary, so the next step is to difference it.

The series can be differenced using the diff() function. The first line of code below performs the first-order differencing and drops the missing value that differencing leaves in the first position, while the second line performs the Augmented Dickey-Fuller test on the result.

    diff_1 = train_array.diff().dropna()
    print("p-value:", adfuller(diff_1)[1])
    

Output:

      p-value: 0.001
    

The p-value is now below the significance level, so the null hypothesis is rejected. The first-differenced series can be treated as stationary, which suggests a differencing order of d = 1.
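A small worked example may help show what differencing does to the numbers. The five values below are made up for illustration and are not taken from the sales data.

```python
import pandas as pd

series = pd.Series([10, 12, 15, 19, 24])   # made-up values

first_diff = series.diff().dropna()        # y_t - y_{t-1}
second_diff = first_diff.diff().dropna()   # differencing applied twice (d = 2)

print(first_diff.tolist())    # [2.0, 3.0, 4.0, 5.0]
print(second_diff.tolist())   # [1.0, 1.0, 1.0]

# Differencing is reversible given the first value: cumulative sums restore the levels
restored = first_diff.cumsum() + series.iloc[0]
print(restored.tolist())      # [12.0, 15.0, 19.0, 24.0]
```

The last step matters for forecasting: a model fitted on differences produces forecasts of changes, which are accumulated back into levels.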

ARIMA Model

You are now ready to build the ARIMA model and make predictions. You will be using the auto_arima function from the pmdarima package, which searches for a suitable order for an ARIMA model. In simple terms, the function automatically determines the parameters p, d, and q of the ARIMA model.

The important parameters of the function are:

  1. The time-series to which you fit the ARIMA model.

  2. start_p: the starting value of p, the order of the auto-regressive (AR) model. This must be a positive integer.

  3. start_q: the starting value of q, the order of the moving-average (MA) model. This must be a positive integer.

  4. d: the order of differencing. The default setting is None, in which case the value is selected automatically based on the results of the unit root test, in this case the Augmented Dickey-Fuller test.

  5. test: the type of unit root test used to detect stationarity when stationary is False and d is None.

You will now build the ARIMA estimator. The first step is to import the pmdarima library that contains the auto_arima function. The second step is to define a function that takes in the time series array and returns the auto-arima model. These steps are done in the code below.

    import pmdarima as pmd

    def arimamodel(timeseriesarray):
        autoarima_model = pmd.auto_arima(timeseriesarray,
                                         start_p=1,
                                         start_q=1,
                                         test="adf",
                                         trace=True)
        return autoarima_model
    

The next step is to use the function defined above and build the ARIMA estimator on the training data.

    arima_model = arimamodel(train_array)
    arima_model.summary()
    

Output:

    Fit ARIMA: order=(1, 1, 1); AIC=7974.318, BIC=7991.565, Fit time=0.425 seconds
    Fit ARIMA: order=(0, 1, 0); AIC=7975.310, BIC=7983.934, Fit time=0.011 seconds
    Fit ARIMA: order=(1, 1, 0); AIC=7973.112, BIC=7986.047, Fit time=0.177 seconds
    Fit ARIMA: order=(0, 1, 1); AIC=7973.484, BIC=7986.419, Fit time=0.084 seconds
    Fit ARIMA: order=(2, 1, 0); AIC=7974.012, BIC=7991.259, Fit time=0.274 seconds
    Fit ARIMA: order=(2, 1, 1); AIC=7973.626, BIC=7995.185, Fit time=0.989 seconds
    Total fit time: 2.000 seconds

    ARIMA Model Results
    Dep. Variable:               D.y    No. Observations:          551
    Model:            ARIMA(1, 1, 0)    Log Likelihood       -3983.556
    Method:                  css-mle    S.D. of innovations    333.866
    Date:           Wed, 27 May 2020    AIC                   7973.112
    Time:                   11:37:46    BIC                   7986.047
    Sample:                        1    HQIC                  7978.166

                    coef   std err        z    P>|z|    [0.025    0.975]
    const         9.0108    13.085    0.689    0.491   -16.636    34.658
    ar.L1.D.y    -0.0871     0.042   -2.053    0.041    -0.170    -0.004

    Roots
                Real    Imaginary    Modulus    Frequency
    AR.1    -11.4797     +0.0000j    11.4797       0.5000
    

The output above shows that the final model fitted was an ARIMA(1,1,0) estimator, where the values of the parameters p, d, and q were one, one, and zero, respectively. The auto_arima function fits the time series with different combinations of p and q, with d chosen by the unit root test, and compares the candidates using AIC. AIC stands for Akaike Information Criterion, which estimates the relative amount of information lost by a given model. In simple terms, a lower AIC value is preferred. In the above output, the lowest AIC value of 7973.112 was obtained for the ARIMA(1, 1, 0) model, and that is used as the final estimator.
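The selection can be reproduced by hand from the numbers printed above. The sketch below picks the order with the lowest AIC from the search trace and then recomputes AIC and BIC for the chosen model from its log likelihood. The parameter count k = 3 (the constant, the AR coefficient, and the innovation variance) is an assumption, chosen because it reproduces the reported values.

```python
import math

# AIC values reported in the search trace, keyed by (p, d, q)
aic_by_order = {
    (1, 1, 1): 7974.318,
    (0, 1, 0): 7975.310,
    (1, 1, 0): 7973.112,
    (0, 1, 1): 7973.484,
    (2, 1, 0): 7974.012,
    (2, 1, 1): 7973.626,
}
best_order = min(aic_by_order, key=aic_by_order.get)
print(best_order)  # (1, 1, 0)

# Recompute the criteria for ARIMA(1, 1, 0) from the summary table
log_likelihood = -3983.556
n_obs = 551
k = 3  # assumed parameter count: const, ar.L1 and the innovation variance

aic = 2 * k - 2 * log_likelihood                 # AIC = 2k - 2 ln(L)
bic = k * math.log(n_obs) - 2 * log_likelihood   # BIC = k ln(n) - 2 ln(L)
print(round(aic, 3), round(bic, 3))  # 7973.112 7986.047
```

Both criteria reward a higher likelihood and penalize extra parameters. BIC penalizes them more heavily, which is why the ranking of candidates can differ between the two.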

You have trained the model and will now use it to make predictions on the test data and perform model evaluation. One step before doing this is to create a utility function that will be used as an evaluation metric. The code below creates a utility function for calculating the mean absolute percentage error (MAPE), which is the metric to be used. The lower the MAPE value, the better the forecasting model performance.

    def mean_absolute_percentage_error(y_true, y_pred):
        y_true, y_pred = np.array(y_true), np.array(y_pred)
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    

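The next step is to make predictions on the test data, which is done using the code below. The first line copies the test set so that a new column can be added safely, the second line stores the forecasts for the 12 test months, and the last line displays the first five observations.

    test = test.copy()
    test["ARIMA"] = np.asarray(arima_model.predict(n_periods=len(test)))

    test.head(5)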
    

Output:

         Date         Sales   Class   ARIMA
    552  01-01-2014   6785    Test    6882.9
    553  01-02-2014   6856    Test    6889.8
    554  01-03-2014   6853    Test    6898.9
    555  01-04-2014   6400    Test    6907.9
    556  01-05-2014   6442    Test    6916.9
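In the table above, successive forecasts rise by 6.9, 9.1, 9.0, and 9.0, settling near the const estimate of 9.0108 from the model summary. The sketch below shows one way an ARIMA(1, 1, 0) forecast can be built recursively, assuming that const is the mean of the differenced series (treat that convention as an assumption). The function name and the input values are hypothetical and chosen so the arithmetic is easy to follow; they are not the fitted model's values.

```python
def forecast_arima_110(last_value, last_diff, mu, phi, steps):
    """Recursive ARIMA(1, 1, 0) forecast.

    Each forecast change is pulled toward the mean change `mu`:
        diff_h = mu + phi * (diff_{h-1} - mu)
    and the level forecast is the running sum of those changes.
    """
    forecasts = []
    level, diff = last_value, last_diff
    for _ in range(steps):
        diff = mu + phi * (diff - mu)   # forecast of the next change
        level = level + diff            # undo the differencing
        forecasts.append(level)
    return forecasts

# Hypothetical inputs: last observed level 100, last observed change 5
print(forecast_arima_110(last_value=100.0, last_diff=5.0, mu=1.0, phi=0.5, steps=3))
# [103.0, 105.0, 106.5]
```

With a small coefficient such as the fitted -0.0871, the deviation from the mean change dies out after one or two steps, which would explain why the forecasts quickly become an almost straight line.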
    

The final step is to evaluate the predictions on the test data using the utility function as shown below.

      mean_absolute_percentage_error(test.Sales, test.ARIMA)
    

Output:

      9.7846
    

The output above shows that the MAPE for the test data is 9.8%, meaning that the forecasts miss the actual sales by about 9.8% on average. Whether that is accurate enough depends on the application.
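To see how individual months contribute to the metric, the sketch below applies the same formula to the five test rows printed earlier. Only those five rows are available in this guide, so the remaining seven months are not reproduced here.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true = np.array(y_true, dtype=float)
    y_pred = np.array(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# First five test rows as displayed by test.head(5)
actual = [6785, 6856, 6853, 6400, 6442]
forecast = [6882.9, 6889.8, 6898.9, 6907.9, 6916.9]

# Absolute percentage error of each row, then their average
per_row = np.abs((np.array(actual) - np.array(forecast)) / np.array(actual)) * 100
print(per_row.round(2))
print(round(mean_absolute_percentage_error(actual, forecast), 2))  # 3.58
```

These five months average about 3.6%, while the full 12-month MAPE is about 9.8%, so the remaining seven months must average roughly 14%. Note also that MAPE divides by the actual value, so it is undefined when any actual value is zero.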

Conclusion

In this guide, you learned about forecasting time series data using ARIMA. You learned about the stationarity requirement of time series and how to make a non-stationary series stationary through differencing. Finally, you learned how to build and interpret the ARIMA estimator for forecasting using Python.

To learn more about data science using Python, please refer to the following guides.

  1. Scikit Machine Learning

  2. Linear, Lasso, and Ridge Regression with scikit-learn

  3. Non-Linear Regression Trees with scikit-learn

  4. Machine Learning with Neural Networks Using scikit-learn

  5. Validating Machine Learning Models with scikit-learn

  6. Ensemble Modeling with scikit-learn

  7. Preparing Data for Modeling with scikit-learn

  8. Data Science Beginners

  9. Interpreting Data Using Descriptive Statistics with Python

  10. Importing Data in Python