Tuning and Forecasting with (S)ARIMA Models

TL;DR: In this post I go through the process of creating a (S)ARIMA model for a toy dataset. This involves making the timeseries stationary, tuning the model parameters, and running model diagnostics to check that the model is plausible. Finally, we use the most promising model to forecast future values.

Note: This notebook assumes some familiarity with the concept of (Seasonal-)ARIMA models. If you need a quick refresher, take a look at FFP2 (‘Forecasting: Principles and Practice’, 2nd edition): Chapter 8, which gives a good overview of important concepts such as stationarity, differencing, autoregressive models (=AR models), moving average models (=MA models), and how to add seasonality to ARIMA models (=SARIMA models).
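For reference, one standard way to write a $SARIMA(p, d, q)(P, D, Q)_m$ model compactly is in backshift notation (this is the general form of the example equations in FFP2):

$$\phi_p(B)\,\Phi_P(B^m)\,(1 - B)^d\,(1 - B^m)^D\, y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t$$

where $\phi_p, \theta_q$ are the non-seasonal $AR$/$MA$ polynomials of order $p$ and $q$, $\Phi_P, \Theta_Q$ are their seasonal counterparts, $B$ is the backshift operator, and $\varepsilon_t$ is white noise.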

So given a timeseries dataset, how would we go about creating a (S)ARIMA model and doing forecasts with it? We will follow the modelling procedure from FFP2: Chapter 8.7 (the left branch of the flowchart shown there).

As in the previous timeseries notebooks, we will continue using the same toy dataset (cycling data from Seattle).

from itertools import product

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.tsa.api as smts
import statsmodels.tsa.stattools as smtst

from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox

import tqdm

sns.set()
hourly = pd.read_csv('https://raw.githubusercontent.com/jakevdp/SeattleBike/master/FremontHourly.csv', index_col='Date', parse_dates=True)
hourly.columns = ['northbound', 'southbound']
hourly['total'] = hourly['northbound'] + hourly['southbound']
daily = hourly.resample('d').sum()
len(daily)
607
def tsplot(y, lags=None, figsize=(20, 10), title=None):
    ''' Creates 3 subplots: 
        - timeseries plot (with Dickey-Fuller test for stationarity)
        - ACF plot
        - PACF plot
    '''
    fig = plt.figure(figsize=figsize)

    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))

    y.plot(ax=ts_ax)
    p_value = sm.tsa.stattools.adfuller(y)[1]
    if title is None:
        title = 'Time Series Analysis Plots'
    ts_ax.set_title('{0}\n Dickey-Fuller: p={1:.5f}'.format(title, p_value))
    
    smts.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smts.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    plt.tight_layout()
series = daily['total']
tsplot(series, lags=60)

[Figure: daily totals with the Dickey-Fuller p-value in the title, plus ACF and PACF plots]

Here are some observations from the plots above:

  • There are no big outliers. However, towards the more recent days the variance seems a little higher. One might consider doing a Box-Cox transformation here (a short sketch follows this list), but we will omit it for simplicity’s sake.
  • The series is clearly non-stationary as there is a seasonal trend, namely a weekly pattern, which is clearly visible in the ACF plot. This means that we will have to apply some differencing in order to make the data more stationary.
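If we did want to stabilise the variance, a Box-Cox transformation is only a few lines with scipy. A minimal sketch (note that boxcox requires strictly positive values; the +1 offset to guard against zero-count days is my assumption, not part of the original analysis):

from scipy.stats import boxcox

# Box-Cox needs strictly positive data, hence the +1 offset for possible zero-count days
transformed, lmbda = boxcox(series + 1)
print('Estimated lambda: {0:.3f}'.format(lmbda))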

Stationarity & Differencing

Seasonal differencing

We will first apply seasonal differencing (FFP2 recommends applying the seasonal difference first, since the subsequent ‘normal’ first difference may then turn out to be unnecessary).

daily_sdiff = series - series.shift(7)
daily_sdiff = daily_sdiff.dropna()
tsplot(daily_sdiff, lags=60)

[Figure: seasonally differenced series with ACF and PACF plots]

The Dickey-Fuller test rejects the null hypothesis that the seasonally differenced series is non-stationary (p ≈ 0). The autocorrelations in the ACF plot oscillate around zero for a while before dying out. Perhaps we could try a first (non-seasonal) difference on top?
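To see the full test output rather than just the p-value embedded in the plot title, we can call adfuller directly (a quick sketch using the smts/smtst aliases imported above):

stat, pvalue, usedlag, nobs, crit_values, icbest = smtst.adfuller(daily_sdiff)
print('ADF statistic: {0:.3f}, p-value: {1:.5f}'.format(stat, pvalue))
print('Critical values:', crit_values)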

First difference

daily_2diffs = daily_sdiff - daily_sdiff.shift(1)
daily_2diffs = daily_2diffs.dropna()
tsplot(daily_2diffs, lags=60)

[Figure: twice-differenced series (seasonal + first difference) with ACF and PACF plots]

Looking at the ACF/PACF plots it is not obvious that the series became ‘more stationary’ (the number of significant spikes is similar). But the timeseries plot after the first difference looks more like white noise than the previous one, so we will go with the twice-differenced data in the next sections: $d=1, D=1$, to use the parameter naming convention of (S)ARIMA models.
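As a sanity check on this choice of $d$ and $D$, the pmdarima package offers unit-root-test based estimates of the required differencing orders. A sketch (pmdarima is not used elsewhere in this notebook, and the particular test choices below are assumptions rather than part of the original analysis):

# pip install pmdarima
from pmdarima.arima.utils import ndiffs, nsdiffs

D_hat = nsdiffs(series, m=7, test='ocsb')  # seasonal differences suggested for the raw series
d_hat = ndiffs(daily_sdiff, test='adf')    # non-seasonal differences suggested after seasonal differencing
print(D_hat, d_hat)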

Candidate models

First candidate model $ARIMA(0, 1, 6)(0, 1, 1)_7$

When choosing candidates based on the ACF/PACF plots, it’s good to remind ourselves of the following (FFP2 Chapter 8.5):

If the data are from an $ARIMA(p, d, 0)$ or $ARIMA(0, d, q)$ model, then the ACF and PACF plots can be helpful in determining the value of $p$ or $q$. If $p$ and $q$ are both positive, then the plots do not help in finding suitable values of $p$ and $q$.

So we’ll start with a candidate model that either has a non-zero $p$ ($P$) or non-zero $q$ ($Q$), but not both.

Let’s first only look at the seasonal lags (7, 14, 21, …) for the seasonal parameters. The spike at lag 7 in the ACF and the decay in the seasonal lags of the PACF (at 7, 14, 21) indicate that we should add a seasonal $MA$ term to our model, i.e. a seasonal $MA(1)_7$ component.

Now let’s consider the (non-seasonal) lags for the non-seasonal parameters. The significant spikes at lags 1 to 6 in the ACF suggest a non-seasonal $MA(6)$ component. Note that choosing an order at or above the seasonal period ($7$ in our case) would lead to an error, as the seasonal $MA$ component already contributes an $MA$ term at lag 7.

The above two readings would translate to the candidate model $ARIMA(0, 1, 6)(0, 1, 1)_7$, so $q=6, Q=1, d=1, D=1$. We will try out this model, compute its AIC and take a look at the residuals for diagnostics.

(We could consider another candidate model based on an $AR$ component, but the many spikes in the PACF plot make it hard to come up with a single clear candidate.)

candidate_model = SARIMAX(
    series, order=(0, 1, 6), seasonal_order=(0, 1, 1, 7)).fit(method='powell', disp=False)
candidate_model.summary()
SARIMAX Results
===============================================================================
Dep. Variable:                            total   No. Observations:        607
Model:          SARIMAX(0, 1, 6)x(0, 1, [1], 7)   Log Likelihood     -4671.226
Date:                          Mon, 14 Sep 2020   AIC                 9358.451
Time:                                  11:57:30   BIC                 9393.613
Sample:                              10-02-2012   HQIC                9372.140
                                   - 05-31-2014
Covariance Type:                            opg
===============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
ma.L1     -0.3736      0.031    -12.124      0.000      -0.434      -0.313
ma.L2     -0.2536      0.038     -6.718      0.000      -0.328      -0.180
ma.L3     -0.0435      0.034     -1.268      0.205      -0.111       0.024
ma.L4     -0.2356      0.035     -6.642      0.000      -0.305      -0.166
ma.L5      0.0214      0.037      0.575      0.565      -0.051       0.094
ma.L6      0.0504      0.035      1.449      0.147      -0.018       0.119
ma.S.L7   -0.9311      0.015    -61.014      0.000      -0.961      -0.901
sigma2  3.395e+05   1.35e+04     25.221      0.000    3.13e+05    3.66e+05
===============================================================================
Ljung-Box (L1) (Q):         0.00   Jarque-Bera (JB):        219.60
Prob(Q):                    0.95   Prob(JB):                  0.00
Heteroskedasticity (H):     2.29   Skew:                     -0.20
Prob(H) (two-sided):        0.00   Kurtosis:                  5.94

Note the AIC of the model in the summary above (9358.451). We will use this metric later on to compare different models.
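As a quick sanity check on where this number comes from: the model has $k = 8$ estimated parameters (six non-seasonal MA terms, one seasonal MA term, and $\sigma^2$), so with the log likelihood from the summary,

$$AIC = 2k - 2\log L = 2 \cdot 8 - 2 \cdot (-4671.226) = 9358.452,$$

which matches the reported value up to rounding.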

Model diagnostics and tuning

Let’s take a look at the candidate model’s residuals and see if everything is in order:

candidate_model.plot_diagnostics(figsize=(20,10))
plt.show()

[Figure: standard SARIMAX diagnostics: standardized residuals, histogram with density estimate, normal Q-Q plot, correlogram]

The residuals look like white noise and are quite close to the standard normal distribution. There are no significant autocorrelations visible in the correlogram.

Let’s also print both the ACF and PACF plot of the residuals with more lags shown:

lags = 60
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
smts.graphics.plot_acf(candidate_model.resid, lags=lags, ax=axes[0])
smts.graphics.plot_pacf(candidate_model.resid, lags=lags, ax=axes[1])
plt.show()

[Figure: ACF and PACF of the first candidate model’s residuals]

We see very few significant spikes in both the ACF and PACF plot. This indicates that the residuals of our candidate model are indeed likely to be white noise.

In addition to checking the plots, we can also do a so-called portmanteau test, specifically the Ljung-Box test, which thankfully also has an implementation in the statsmodels package. The null hypothesis of the test is that the data is independently distributed, i.e. has no autocorrelation up to a certain maximum lag (FFP2 Chapter 3.3 suggests taking $2m$ as the maximum lag, where $m$ is the period length of the season).

acorr_ljungbox(candidate_model.resid, lags=[2*7], return_df=True, model_df=6)  # model_df = p + q
      lb_stat  lb_pvalue
14  15.567169   0.049011

We see that the p-value is larger than 0.01 (though smaller than 0.05), hence we cannot reject the null hypothesis that the residuals are independently distributed at a significance level of 0.01. Therefore, we can be reasonably comfortable forecasting with this model.

Second candidate model $ARIMA(0, 1, 4)(0, 1, 1)_7$

We could try to manually improve this first candidate model by looking at the plots or at the coefficients of the model. Since the residuals suggest that we have already captured most of the information in the data, let’s focus on the coefficients (shown in the model summary above). Some coefficients are not significant, e.g. ma.L5 with a p-value of 0.565, so we could try removing some $MA$ terms from the first candidate model. Let’s, for example, try $ARIMA(0, 1, 4)(0, 1, 1)_7$.

candidate_model2 = SARIMAX(
    series, order=(0, 1, 4), seasonal_order=(0, 1, 1, 7)).fit(method='powell')
candidate_model2.summary()
Optimization terminated successfully.
         Current function value: 7.698039
         Iterations: 2
         Function evaluations: 140
SARIMAX Results
===============================================================================
Dep. Variable:                            total   No. Observations:        607
Model:          SARIMAX(0, 1, 4)x(0, 1, [1], 7)   Log Likelihood     -4672.709
Date:                          Mon, 14 Sep 2020   AIC                 9357.419
Time:                                  11:57:33   BIC                 9383.790
Sample:                              10-02-2012   HQIC                9367.686
                                   - 05-31-2014
Covariance Type:                            opg
===============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
ma.L1     -0.3718      0.030    -12.425      0.000      -0.430      -0.313
ma.L2     -0.2391      0.036     -6.567      0.000      -0.310      -0.168
ma.L3     -0.0410      0.031     -1.322      0.186      -0.102       0.020
ma.L4     -0.2160      0.032     -6.819      0.000      -0.278      -0.154
ma.S.L7   -0.9050      0.016    -56.651      0.000      -0.936      -0.874
sigma2  3.421e+05   1.33e+04     25.675      0.000    3.16e+05    3.68e+05
===============================================================================
Ljung-Box (L1) (Q):         0.00   Jarque-Bera (JB):        232.91
Prob(Q):                    0.95   Prob(JB):                  0.00
Heteroskedasticity (H):     2.29   Skew:                     -0.22
Prob(H) (two-sided):        0.00   Kurtosis:                  6.02

Let’s look at the diagnostics again including the ACF/PACF plots:

candidate_model2.plot_diagnostics(figsize=(20,10))
plt.show()

[Figure: SARIMAX diagnostic plots for the second candidate model]

fig, axes = plt.subplots(2, 1, figsize=(20, 10))
smts.graphics.plot_acf(candidate_model2.resid, lags=lags, ax=axes[0])
smts.graphics.plot_pacf(candidate_model2.resid, lags=lags, ax=axes[1])
plt.show()

[Figure: ACF and PACF of the second candidate model’s residuals]

acorr_ljungbox(candidate_model2.resid, lags=[2*7], return_df=True, model_df=4)  # model_df = p + q
      lb_stat  lb_pvalue
14  15.240807   0.123526

Notice the slight decrease in AIC (now 9357.419), while the autocorrelations in the ACF/PACF plots remain small and the residuals still pass the Ljung-Box test. This second model is likely to be a slightly better model than the first candidate model.

Rather than continuing these manual optimisation steps, we will take a more automatic approach in the section below, namely a grid search over a range of parameters.

Grid search

We could be happy with the model we have chosen manually – or we can try to find an even better one by grid-searching over a set of parameters. We take the best candidate model’s parameters as a starting point and search the parameter space around them. Note that this kind of search doesn’t guarantee a better model, let alone the optimal one, but at the cost of more computation it might find an acceptable local optimum in the space we specified.

ps = range(0, 2)
qs = range(3, 6)
d = 1  # non-seasonal differencing order (fixed)

Ps = range(0, 2)
Qs = range(0, 3)
D = 1  # seasonal differencing order (fixed)
s = 7  # length of the season

parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)
36
def gs_arima(series, parameters_list, d, D, s, opt_method='powell'):
    """
        Return a dataframe with parameters and corresponding AIC,
        and the best model according to AIC.

        parameters_list - list with (p, q, P, Q) tuples
        d - integration order in ARIMA model
        D - seasonal integration order
        s - length of season
    """
    results = []
    best_aic = float("inf")
    best_model = None

    for param in tqdm.tqdm(parameters_list):
        # we need try-except because on some combinations the model might fail to converge
        try:
            model = SARIMAX(
                series,
                order=(param[0], d, param[1]),
                seasonal_order=(param[2], D, param[3], s)).fit(method=opt_method, disp=False)
        except Exception:
            continue

        aic = model.aic
        if aic < best_aic:
            best_model = model
            best_aic = aic
        results.append([param, aic])

    result_table = pd.DataFrame(results, columns=['parameters', 'aic'])
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)

    return result_table, best_model
result_table, best_model = gs_arima(series, parameters_list, d, D, s)

Top ten models’ AIC values:

result_table.head(10)
     parameters           aic
0  (0, 4, 1, 1)   9356.886157
1  (0, 4, 0, 1)   9357.418905
2  (0, 4, 0, 2)   9357.463126
3  (0, 5, 0, 1)   9357.876800
4  (0, 5, 1, 1)   9357.878544
5  (0, 5, 0, 2)   9358.202317
6  (1, 4, 0, 1)   9360.101223
7  (1, 5, 0, 1)   9360.784689
8  (1, 5, 1, 1)   9360.925114
9  (1, 4, 0, 2)   9361.035242

Best grid-search model $ARIMA(0, 1, 4)(1, 1, 1)_7$

According to the grid search, the best model as measured by AIC is $ARIMA(0, 1, 4)(1, 1, 1)_7$ (AIC now 9356.886). Let’s take a look at the diagnostic plots of this best model:

best_model.summary()
SARIMAX Results
===============================================================================
Dep. Variable:                            total   No. Observations:        607
Model:          SARIMAX(0, 1, 4)x(1, 1, [1], 7)   Log Likelihood     -4671.443
Date:                          Mon, 14 Sep 2020   AIC                 9356.886
Time:                                  11:58:42   BIC                 9387.653
Sample:                              10-02-2012   HQIC                9368.864
                                   - 05-31-2014
Covariance Type:                            opg
===============================================================================
             coef    std err          z      P>|z|      [0.025      0.975]
ma.L1     -0.3618      0.031    -11.843      0.000      -0.422      -0.302
ma.L2     -0.2419      0.037     -6.578      0.000      -0.314      -0.170
ma.L3     -0.0361      0.031     -1.176      0.239      -0.096       0.024
ma.L4     -0.2099      0.032     -6.459      0.000      -0.274      -0.146
ar.S.L7    0.0747      0.040      1.855      0.064      -0.004       0.154
ma.S.L7   -0.9453      0.016    -58.005      0.000      -0.977      -0.913
sigma2  3.392e+05   1.33e+04     25.589      0.000    3.13e+05    3.65e+05
===============================================================================
Ljung-Box (L1) (Q):         0.06   Jarque-Bera (JB):        235.39
Prob(Q):                    0.80   Prob(JB):                  0.00
Heteroskedasticity (H):     2.31   Skew:                     -0.16
Prob(H) (two-sided):        0.00   Kurtosis:                  6.05

best_model.plot_diagnostics(figsize=(20, 10))
plt.show()

[Figure: SARIMAX diagnostic plots for the best grid-search model]

acorr_ljungbox(best_model.resid, lags=[2*7], return_df=True, model_df=4)  # model_df = p + q
      lb_stat  lb_pvalue
14  16.872429   0.077235

We don’t see red flags in the residuals of the best model, hence, we can go ahead and use it for forecasting purposes.

Forecasting with best model

Having determined the most promising model through the grid search, we will now use it for forecasting, i.e. answering the question: how many cyclists do we expect in the next $k$ days based on the historical cyclist numbers?

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def plot_predictions(df, model, n_steps):
    """
        Plots actual values, in-sample fitted values and out-of-sample forecasts.

        df - single-column dataframe with the timeseries
        model - fitted model
        n_steps - number of steps to predict into the future
    """
    data = df.copy()
    data.columns = ['actual']
    data['arima_model'] = model.fittedvalues

    # the first s+d fitted values are unreliable due to the differencing, so mask them
    data.loc[data.index[:s + d], 'arima_model'] = np.nan
    # append the out-of-sample forecast to the in-sample fitted values
    forecast = pd.concat([data['arima_model'], model.forecast(n_steps)])

    # calculate the error, again skipping the first s+d steps
    error = mean_absolute_percentage_error(
        data['actual'].iloc[s + d:-n_steps], data['arima_model'].iloc[s + d:-n_steps])

    plt.figure(figsize=(20, 10))
    plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
    plt.plot(forecast, color='r', label="model")
    plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
    plt.plot(data.actual, label="actual", alpha=0.8)
    plt.legend()
    plt.grid(True)
plot_predictions(pd.DataFrame(series), best_model, 60)

[Figure: actual series, in-sample fit and 60-day forecast, with the MAPE in the title]
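The plot above only shows point forecasts. If we also want uncertainty bands, the fitted results object exposes get_forecast, which returns both the forecast mean and confidence intervals. A minimal sketch (the 95% level is an arbitrary choice here):

forecast_res = best_model.get_forecast(steps=60)
mean_forecast = forecast_res.predicted_mean
conf_int = forecast_res.conf_int(alpha=0.05)  # 95% intervals

plt.figure(figsize=(20, 10))
plt.plot(series, label='actual')
plt.plot(mean_forecast, color='r', label='forecast')
plt.fill_between(conf_int.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], color='r', alpha=0.2)
plt.legend()
plt.show()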

Summary

In this notebook we went through the process of tuning a (S)ARIMA model for a toy dataset. I specifically wanted to see the parameter tuning and diagnostics process in action, as the usual theoretical guides on these topics seem a bit confusing and lacking in intuitive explanations. If you are aware of good guides/tutorials that also convey the intuition, please do let me know! In the end, even if the manual parameter tuning feels opaque, good ol’ gridsearch can still find you a reasonable model, as long as you make sure to go through your diagnostics checklist.

We didn’t do a train-test split in this post in order to keep the focus on the parameter tuning. There may be a future post on the evaluation procedure in forecasting as well as adding exogenous predictors to the forecasting model. Stay tuned!

References

  • There are many resources out there giving guidelines about how to come up with the ARIMA parameters. Many of them are confusing and unclear. I found this one to be one of the more helpful ones: https://people.duke.edu/~rnau/arimrule.htm

This notebook follows the book ‘Forecasting: Principles and Practice’ (2nd edition), Chapter 8. You can view and download this post as a Jupyter notebook on GitHub Gist.