Tuning and Forecasting with (S)ARIMA Models
TL;DR: In this post I go through the process of creating a (S)ARIMA model for a toy dataset. This involves making the timeseries stationary, tuning the model parameters, and running model diagnostics to check the plausibility of the model. Finally, we use the most promising model to forecast future values.
Note: This notebook assumes that you have some familiarity with the concept of (Seasonal-)ARIMA models already. If you need a quick refresher, take a look at FFP2: Chapter 8, which gives a good overview of important concepts such as stationarity, differencing, autoregressive models (=AR models), moving average models (=MA models), and how to add seasonality to the ARIMA models (=SARIMA models).
So given a timeseries dataset, how would we go about creating a (S)ARIMA model and doing forecasts with it? We will follow the process illustrated in the left part of the diagram below (from FFP2: Chapter 8.7):
As in the previous timeseries notebooks, we will continue using the same toy dataset (cycling data from Seattle).
from itertools import product
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.tsa.api as smts
import statsmodels.tsa.stattools as smtst
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox
import tqdm
sns.set()
hourly = pd.read_csv('https://raw.githubusercontent.com/jakevdp/SeattleBike/master/FremontHourly.csv', index_col='Date', parse_dates=True)
hourly.columns = ['northbound', 'southbound']
hourly['total'] = hourly['northbound'] + hourly['southbound']
daily = hourly.resample('d').sum()
len(daily)
607
def tsplot(y, lags=None, figsize=(20, 10), title=None):
    ''' Creates 3 subplots:
        - timeseries plot (with Dickey-Fuller test for stationarity)
        - ACF plot
        - PACF plot
    '''
    fig = plt.figure(figsize=figsize)
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))
    y.plot(ax=ts_ax)
    p_value = smtst.adfuller(y)[1]  # second element of adfuller's result is the p-value
    if title is None:
        title = 'Time Series Analysis Plots'
    ts_ax.set_title('{0}\n Dickey-Fuller: p={1:.5f}'.format(title, p_value))
    smts.graphics.plot_acf(y, lags=lags, ax=acf_ax)
    smts.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
    plt.tight_layout()
series = daily['total']
tsplot(series, lags=60)
Here are some observations from the plots above:
- There are no big outliers. However, towards the more recent days, the variance seems a little higher. One might consider doing a Box-Cox transformation here (a minimal sketch follows below), but we will omit it for simplicity’s sake.
- The series is clearly non-stationary as there is a seasonal pattern, namely a weekly cycle. This is clearly visible in the ACF plot. It means that we will have to apply some differencing in order to make the data more stationary.
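For completeness, here is a minimal sketch of what the omitted Box-Cox step could look like, assuming scipy is available (we won't use the transformed series in the rest of this post):
from scipy import stats
from scipy.special import inv_boxcox

# Variance-stabilising Box-Cox transform; scipy estimates lambda by
# maximum likelihood. Note that boxcox requires strictly positive data.
transformed, lam = stats.boxcox(series[series > 0])
print('estimated lambda: {0:.3f}'.format(lam))
# forecasts made on the transformed scale would be mapped back with
# inv_boxcox(values, lam)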
Stationarity & Differencing
Seasonal differencing
We will first apply a seasonal differencing (FFP2 recommends applying seasonal differencing before ‘normal’ differencing, as the normal differencing might then turn out to be unnecessary).
# seasonal difference with lag 7 (equivalent to series.diff(7))
daily_sdiff = series - series.shift(7)
daily_sdiff = daily_sdiff.dropna()
tsplot(daily_sdiff, lags=60)
The Dickey-Fuller test rejects the null hypothesis of a unit root (the p-value is 0), so the seasonally differenced series can already be considered stationary. The autocorrelations in the ACF plot oscillate around zero for a few lags before dying out. Perhaps we could try a first difference (non-seasonal difference) as well?
First difference
daily_2diffs = daily_sdiff - daily_sdiff.shift(1)
daily_2diffs = daily_2diffs.dropna()
tsplot(daily_2diffs, lags=60)
Looking at the ACF/PACF plots, it is not quite clear whether the series became ‘more stationary’ (the number of spikes in the plots is similar). But the timeseries plot after first differencing looks more like white noise than the previous one. So we will go with the first-differenced data in the next sections, i.e. $d=1, D=1$ in the parameter naming convention of (S)ARIMA models.
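As a cross-check of the choice $d=1, D=1$, one could also let unit-root tests suggest the differencing orders. A minimal sketch, assuming the pmdarima package is installed:
from pmdarima.arima import ndiffs, nsdiffs

# estimate the differencing orders with unit-root tests instead of eyeballing
print(nsdiffs(series, m=7, test='ocsb'))  # suggested seasonal D
print(ndiffs(series, test='adf'))         # suggested non-seasonal d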
Candidate models
First candidate model $ARIMA(0, 1, 6)(0, 1, 1)_7$
When choosing candidates based on the ACF/PACF plots, it’s good to remind ourselves of the following (FFP2 Chapter 8.5):
If the data are from an $ARIMA(p, d, 0)$ or $ARIMA(0, d, q)$ model, then the ACF and PACF plots can be helpful in determining the value of $p$ or $q$. If $p$ and $q$ are both positive, then the plots do not help in finding suitable values of $p$ and $q$.
So we’ll start with a candidate model that either has a non-zero $p$ ($P$) or non-zero $q$ ($Q$), but not both.
Let’s first only look at the seasonal lags (7, 14, 21, …) for the seasonal parameters. The spike at lag 7 in the ACF and the decay in the seasonal lags of the PACF (at 7, 14, 21) indicate that we should add a seasonal $MA$ term to our model, i.e. a seasonal $MA(1)_7$ component.
Now let’s consider the (non-seasonal) lags for the non-seasonal parameters. The significant spikes at lags 1 to 6 in the ACF suggest a non-seasonal $MA(6)$ component. Note that an $MA$ order equal to or higher than the seasonal period ($7$ in our case) would lead to an error, as we already have an $MA$ term at lag 7 through the seasonal $MA$ component (a short demonstration follows below).
The above two readings would translate to the candidate model $ARIMA(0, 1, 6)(0, 1, 1)_7$, so $q=6, Q=1, d=1, D=1$. We will try out this model, compute its AIC and take a look at the residuals for diagnostics.
(We could consider another candidate model built around an $AR$ component instead, but the many spikes in the PACF plot make it hard to come up with one single clear candidate.)
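As an aside, the lag-overlap error mentioned above is easy to reproduce (the exact error message may vary with your statsmodels version):
# a non-seasonal MA(7) overlaps with the seasonal MA(1) term at lag 7,
# which statsmodels rejects when the model is constructed
try:
    SARIMAX(series, order=(0, 1, 7), seasonal_order=(0, 1, 1, 7))
except ValueError as e:
    print(e)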
candidate_model = SARIMAX(
    series, order=(0, 1, 6), seasonal_order=(0, 1, 1, 7)).fit(method='powell', disp=False)
candidate_model.summary()
Dep. Variable: | total | No. Observations: | 607 |
---|---|---|---|
Model: | SARIMAX(0, 1, 6)x(0, 1, [1], 7) | Log Likelihood | -4671.226 |
Date: | Mon, 14 Sep 2020 | AIC | 9358.451 |
Time: | 11:57:30 | BIC | 9393.613 |
Sample: | 10-02-2012 | HQIC | 9372.140 |
- 05-31-2014 | |||
Covariance Type: | opg |
 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
ma.L1 | -0.3736 | 0.031 | -12.124 | 0.000 | -0.434 | -0.313 |
ma.L2 | -0.2536 | 0.038 | -6.718 | 0.000 | -0.328 | -0.180 |
ma.L3 | -0.0435 | 0.034 | -1.268 | 0.205 | -0.111 | 0.024 |
ma.L4 | -0.2356 | 0.035 | -6.642 | 0.000 | -0.305 | -0.166 |
ma.L5 | 0.0214 | 0.037 | 0.575 | 0.565 | -0.051 | 0.094 |
ma.L6 | 0.0504 | 0.035 | 1.449 | 0.147 | -0.018 | 0.119 |
ma.S.L7 | -0.9311 | 0.015 | -61.014 | 0.000 | -0.961 | -0.901 |
sigma2 | 3.395e+05 | 1.35e+04 | 25.221 | 0.000 | 3.13e+05 | 3.66e+05 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 219.60 |
---|---|---|---|
Prob(Q): | 0.95 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.29 | Skew: | -0.20 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 5.94 |
Note the AIC of the model in the summary above (9358.451). We will use this metric later on to compare different models.
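As a quick sanity check of where this number comes from: $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of estimated parameters. Our model estimates $k = 8$ parameters (six non-seasonal $MA$ terms, one seasonal $MA$ term, and $\sigma^2$), so $AIC = 2 \cdot 8 - 2 \cdot (-4671.226) \approx 9358.45$, which matches the summary up to rounding.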
Model diagnostics and tuning
Let’s take a look at the candidate model’s residuals and see if everything is in order:
candidate_model.plot_diagnostics(figsize=(20,10))
plt.show()
The residuals look like white noise and are quite close to the standard normal distribution. There are no significant autocorrelations visible in the correlogram.
Let’s also print both the ACF and PACF plot of the residuals with more lags shown:
lags = 60
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
smts.graphics.plot_acf(candidate_model.resid, lags=lags, ax=axes[0])
smts.graphics.plot_pacf(candidate_model.resid, lags=lags, ax=axes[1])
plt.show()
We see very few significant spikes in both the ACF and PACF plot. This indicates that the residuals of our candidate model are indeed likely to be white noise.
In addition to checking the plots, we can also do a so-called portmanteau test, specifically the Ljung-Box test, which thankfully also has an implementation in the statsmodels package. The null hypothesis of the test is that the data is independently distributed, hence has no underlying autocorrelations up to a certain lag (FFP2 Chapter 3.3 suggests taking $2m$ as the maximum lag, where $m$ is the period length of the season).
acorr_ljungbox(candidate_model.resid, lags=[2*7], return_df=True, model_df=6) # model_df = p + q
 | lb_stat | lb_pvalue |
---|---|---|
14 | 15.567169 | 0.049011 |
We see that the p-value is larger than 0.01 (but smaller than 0.05); hence, we cannot reject the null hypothesis that the data is independently distributed at a significance level of 0.01. Therefore, we can be reasonably comfortable forecasting with this model.
Second candidate model $ARIMA(0, 1, 4)(0, 1, 1)_7$
We could try to manually optimise this starting candidate model by looking at the plots or the coefficients of the model. Since the residuals suggest that we have already captured most of the information in the data, let’s focus on the coefficients (shown in the summary of the model above). We can see that we have some non-significant coefficients, e.g. ma.L5 with a p-value of 0.565. We could try to remove some $MA$ terms from our first candidate model. Let’s for example try $ARIMA(0, 1, 4)(0, 1, 1)_7$.
candidate_model2 = SARIMAX(
    series, order=(0, 1, 4), seasonal_order=(0, 1, 1, 7)).fit(method='powell')
candidate_model2.summary()
Optimization terminated successfully.
Current function value: 7.698039
Iterations: 2
Function evaluations: 140
Dep. Variable: | total | No. Observations: | 607 |
---|---|---|---|
Model: | SARIMAX(0, 1, 4)x(0, 1, [1], 7) | Log Likelihood | -4672.709 |
Date: | Mon, 14 Sep 2020 | AIC | 9357.419 |
Time: | 11:57:33 | BIC | 9383.790 |
Sample: | 10-02-2012 | HQIC | 9367.686 |
- 05-31-2014 | |||
Covariance Type: | opg |
 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
ma.L1 | -0.3718 | 0.030 | -12.425 | 0.000 | -0.430 | -0.313 |
ma.L2 | -0.2391 | 0.036 | -6.567 | 0.000 | -0.310 | -0.168 |
ma.L3 | -0.0410 | 0.031 | -1.322 | 0.186 | -0.102 | 0.020 |
ma.L4 | -0.2160 | 0.032 | -6.819 | 0.000 | -0.278 | -0.154 |
ma.S.L7 | -0.9050 | 0.016 | -56.651 | 0.000 | -0.936 | -0.874 |
sigma2 | 3.421e+05 | 1.33e+04 | 25.675 | 0.000 | 3.16e+05 | 3.68e+05 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 232.91 |
---|---|---|---|
Prob(Q): | 0.95 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.29 | Skew: | -0.22 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 6.02 |
Let’s look at the diagnostics again including the ACF/PACF plots:
candidate_model2.plot_diagnostics(figsize=(20,10))
plt.show()
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
smts.graphics.plot_acf(candidate_model2.resid, lags=lags, ax=axes[0])
smts.graphics.plot_pacf(candidate_model2.resid, lags=lags, ax=axes[1])
plt.show()
acorr_ljungbox(candidate_model2.resid, lags=[2*7], return_df=True, model_df=4) # model_df = p + q
 | lb_stat | lb_pvalue |
---|---|---|
14 | 15.240807 | 0.123526 |
Notice the slight decrease in AIC (now 9357.419), while the autocorrelations in the ACF/PACF plots remain small and the residuals still pass the Ljung-Box test. This second model is likely to be a slightly better model than the first candidate model.
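To make the comparison explicit, here is a quick side-by-side of the two candidates’ information criteria (the values match the summaries above; BIC penalises extra parameters more heavily, so it favours the smaller model even more clearly):
# compare the two candidate models' information criteria
pd.DataFrame({
    'aic': [candidate_model.aic, candidate_model2.aic],
    'bic': [candidate_model.bic, candidate_model2.bic],
}, index=['ARIMA(0,1,6)(0,1,1)_7', 'ARIMA(0,1,4)(0,1,1)_7'])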
We could keep optimising by hand, but rather than doing more manual steps, we will take a more automatic approach in the section below, namely a grid search over a range of parameters.
Model selection with grid search
We could be happy with the model that we have manually chosen – or we can try to find an even better one by grid-searching over a set of parameters. We take our best candidate model’s parameters as a starting point and search the parameter space around them. Note that this kind of search doesn’t guarantee a better model, let alone the optimal one, but it might allow us to find an acceptable local optimum within the specified space, at the cost of more computation.
# search ranges centred around the best manual candidate so far,
# i.e. ARIMA(0, 1, 4)(0, 1, 1)_7
ps = range(0, 2)
qs = range(3, 6)
p = 0
d = 1  # non-seasonal differencing order (fixed)
q = 4
Ps = range(0, 2)
Qs = range(0, 3)
P = 0
D = 1  # seasonal differencing order (fixed)
Q = 1
s = 7  # season length (weekly pattern)
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)
36
def gs_arima(series, parameters_list, d, D, s, opt_method='powell'):
    """
    Return dataframe with parameters and corresponding AIC, and the best model according to AIC

    parameters_list - list with (p, q, P, Q) tuples
    d - integration order in ARIMA model
    D - seasonal integration order
    s - length of season
    """
    results = []
    best_aic = float("inf")
    # tqdm wraps the loop in a progress bar -- the grid search is slow
    for param in tqdm.tqdm(parameters_list):
        # we need try-except because some combinations might fail to converge
        try:
            model = SARIMAX(
                series,
                order=(param[0], d, param[1]),
                seasonal_order=(param[2], D, param[3], s)).fit(method=opt_method, disp=False)
        except Exception:
            continue
        aic = model.aic
        if aic < best_aic:
            best_model = model
            best_aic = aic
        results.append([param, model.aic])
    result_table = pd.DataFrame(results, columns=['parameters', 'aic'])
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
    return result_table, best_model
result_table, best_model = gs_arima(series, parameters_list, d, D, s)
Top ten models’ AIC values:
result_table.head(10)
 | parameters | aic |
---|---|---|
0 | (0, 4, 1, 1) | 9356.886157 |
1 | (0, 4, 0, 1) | 9357.418905 |
2 | (0, 4, 0, 2) | 9357.463126 |
3 | (0, 5, 0, 1) | 9357.876800 |
4 | (0, 5, 1, 1) | 9357.878544 |
5 | (0, 5, 0, 2) | 9358.202317 |
6 | (1, 4, 0, 1) | 9360.101223 |
7 | (1, 5, 0, 1) | 9360.784689 |
8 | (1, 5, 1, 1) | 9360.925114 |
9 | (1, 4, 0, 2) | 9361.035242 |
Best grid-search model $ARIMA(0, 1, 4)(1, 1, 1)_7$
According to the grid search, the best model is actually $ARIMA(0, 1, 4)(1, 1, 1)_7$, as measured by AIC (now 9356.886). Let’s take a look at the diagnostic plots of this best model then:
best_model.summary()
Dep. Variable: | total | No. Observations: | 607 |
---|---|---|---|
Model: | SARIMAX(0, 1, 4)x(1, 1, [1], 7) | Log Likelihood | -4671.443 |
Date: | Mon, 14 Sep 2020 | AIC | 9356.886 |
Time: | 11:58:42 | BIC | 9387.653 |
Sample: | 10-02-2012 | HQIC | 9368.864 |
- 05-31-2014 | |||
Covariance Type: | opg |
 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
ma.L1 | -0.3618 | 0.031 | -11.843 | 0.000 | -0.422 | -0.302 |
ma.L2 | -0.2419 | 0.037 | -6.578 | 0.000 | -0.314 | -0.170 |
ma.L3 | -0.0361 | 0.031 | -1.176 | 0.239 | -0.096 | 0.024 |
ma.L4 | -0.2099 | 0.032 | -6.459 | 0.000 | -0.274 | -0.146 |
ar.S.L7 | 0.0747 | 0.040 | 1.855 | 0.064 | -0.004 | 0.154 |
ma.S.L7 | -0.9453 | 0.016 | -58.005 | 0.000 | -0.977 | -0.913 |
sigma2 | 3.392e+05 | 1.33e+04 | 25.589 | 0.000 | 3.13e+05 | 3.65e+05 |
Ljung-Box (L1) (Q): | 0.06 | Jarque-Bera (JB): | 235.39 |
---|---|---|---|
Prob(Q): | 0.80 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.31 | Skew: | -0.16 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 6.05 |
best_model.plot_diagnostics(figsize=(20, 10))
plt.show()
acorr_ljungbox(best_model.resid, lags=[2*7], return_df=True, model_df=4) # model_df = p + q
 | lb_stat | lb_pvalue |
---|---|---|
14 | 16.872429 | 0.077235 |
We don’t see any red flags in the residuals of the best model, so we can go ahead and use it for forecasting purposes.
Forecasting with best model
Having determined the most promising model through grid search, we will now use it for forecasting, answering the question: how many cyclists do we expect in the next $k$ days, based on the historical cyclist numbers?
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
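For reference, this computes $MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|$. Note that the metric is undefined whenever $y_t = 0$, so it would need special handling if a day ever had zero riders.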
def plot_predictions(df, model, n_steps):
    """
    Plots actual values against the model's fitted values and forecast

    df - dataframe with the timeseries (single column)
    model - fitted model
    n_steps - number of steps to predict in the future
    """
    data = df.copy()
    data.columns = ['actual']
    data['arima_model'] = model.fittedvalues
    # the first s+d fitted values are unreliable due to the differencing
    data['arima_model'][:s+d] = np.NaN
    # append the out-of-sample forecast to the in-sample fitted values
    forecast = data['arima_model'].append(model.forecast(n_steps))
    # calculate error, again having shifted s+d steps from the beginning
    error = mean_absolute_percentage_error(data['actual'][s+d:-n_steps], data['arima_model'][s+d:-n_steps])
    plt.figure(figsize=(20, 10))
    plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
    plt.plot(forecast, color='r', label="model")
    plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
    plt.plot(data.actual, label="actual", alpha=0.8)
    plt.legend()
    plt.grid(True)
plot_predictions(pd.DataFrame(series), best_model, 60)
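The plot above shows only point forecasts. As a final sketch, the fitted model’s get_forecast method also provides confidence intervals for the forecast horizon (95% by default), which give a better sense of the forecast uncertainty:
# point forecast plus 95% confidence interval for the next 60 days
fc = best_model.get_forecast(steps=60)
ci = fc.conf_int()  # dataframe with lower/upper bounds per step

plt.figure(figsize=(20, 10))
plt.plot(series.iloc[-120:], label='actual')
plt.plot(fc.predicted_mean, color='r', label='forecast')
plt.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1],
                 color='lightgrey', alpha=0.5, label='95% CI')
plt.legend()
plt.grid(True)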
Summary
In this notebook we went through the process of tuning a (S)ARIMA model for a toy dataset. I specifically wanted to see the parameter tuning and diagnostics process in action, as the usual theoretical guides for these topics seem a bit confusing and lacking in intuitive explanations. If you are aware of good guides/tutorials that also convey the intuition, please do let me know! In the end, even if manual parameter tuning remains somewhat opaque to you, good ol’ gridsearch can still help you find a reasonable model, as long as you make sure to go through your diagnostics checklist.
We didn’t do a train-test split in this post in order to keep the focus on the parameter tuning. There may be a future post on the evaluation procedure in forecasting as well as adding exogenous predictors to the forecasting model. Stay tuned!
References
- There are many resources out there giving guidelines on how to come up with the ARIMA parameters. Many of them are confusing and unclear; I found this to be one of the more helpful ones: https://people.duke.edu/~rnau/arimrule.htm
This notebook follows the book 'Forecasting: Principles and Practice', Chapter 8. You can view and download this post as a Jupyter notebook on GitHub Gist.