Penalised regressions vs. autoregressive moving average models for forecasting inflation

This article relates Seasonal Autoregressive Moving Average (SARMA) models to linear regression. Based on this relation, the paper shows that penalized linear models may surpass the out-of-sample forecast accuracy of the best SARMA models in forecasting inflation based on past values, due to penalization and cross-validation. The paper constructs a minimal working example using ridge regression to compare both competing approaches when forecasting monthly inflation in 35 selected countries of the Organization for Economic Cooperation and Development and in three groups of countries. The results empirically verify the hypothesis that penalized linear regression, and ridge regression in particular, can outperform the best standard SARMA models computed through a grid search when forecasting inflation. Thus, a new and effective technique for forecasting inflation based on past values is provided for use by financial analysts and investors. The results indicate that more attention should be given to machine learning techniques for forecasting inflation time series, even as basic as penalized linear regressions, due to their superior empirical performance.


Keywords: Ridge regression; penalised linear model; ARMA; SARMA; inflation forecasting
JEL: C53, C52

Ospina-Holguín y Padilla-Ospina / Económicas CUC, vol. 41 no. 1, pp. 65-80, Enero-Junio, 2020

Introduction
When forecasting inflation, central banks tend not to rely exclusively on time series methods because their primary goal is not only short-term forecasting but also long-term inflation control by means of monetary policy. Consequently, they frequently complement (or even avoid) time series methods, favouring instead structural models that are based on macroeconomic variables or alternatives such as indicator analysis (Quinn, Kenny, & Meyler, 1999). Meanwhile, financial market agents often require updated and frequent short-term forecasts without the need for insights into the underlying causes of inflation. For these agents, time series methods are a convenient and expeditious option, given their relative simplicity. There are more than a dozen inflation forecasting methods (Faust & Wright, 2013; Gómez, Sánchez & Millán, 2019; Gil, Castellanos & González, 2019), and here we provide one more, related to the recursive autoregression method and the direct method. It is a method based on the forecasting improvements that standard machine learning techniques achieve when applied to penalised linear regressions, which relate to the more standard econometric technique of (seasonal) autoregressive moving average models.
This paper argues that a simple penalised linear model can surpass the forecasting performance of standard econometric models for time series, such as ARMA or SARMA models, when forecasting inflation based on past values. To illustrate this, we compare a minimal example of a penalised linear regression with standard (seasonal) autoregressive moving average (SARMA) models for forecasting monthly inflation in 35 selected countries of the Organisation for Economic Co-operation and Development (OECD) and 3 groups of countries. In the minimal example, the out-of-sample predictive performance of a ridge regression with one-fold cross-validation is compared with that of the best SARMA model obtained through a grid search. The results support our argument.
Penalised linear regression is shown to be mathematically related to (S)ARMA modeling, yet empirically more effective for forecasting inflation. The increase in effectiveness seems to come from the penalisation procedure and the empirical regularisation hyperparameter tuning through cross-validation, something which (S)ARMA modeling lacks. In empirical tuning, an out-of-sample experiment inside the original sample is created, where one part of the data is fitted and the level of regularisation that leads to the best performance on the other part of the data is chosen (Mullainathan & Spiess, 2017).
The recent popularisation of machine learning has brought with it the introduction of several new algorithms for predicting time series, from general models such as supervised learning models applied to sliding windows of the data to more specific models such as long short-term memory (LSTM) neural networks. The basic idea behind supervised machine learning in the context of prediction lies in improving its performance by using training data during its construction. As expressed by Mullainathan and Spiess (2017), supervised machine learning seeks to predict well out of sample by taking a loss function $L(\hat{y}, y)$ as an input, where $y$ is the dependent variable and $\hat{y}$ its prediction, and looking for a function $\hat{f}$ that has a low expected loss $E_{(y,x)}[L(\hat{f}(x), y)]$ on new data points from the same distribution, where $x$ is the vector of predictors.
Penalised linear models are one supervised machine learning technique that has received relatively little attention for the prediction of time series. In these models, a penalty is added to the loss function in order to decrease the complexity of the model and achieve greater parsimony. The purpose is to attain greater generality in the model in such a way that it better captures the signal or implicit pattern in the data over the noise that the data itself contains. The "regularisation" that is performed by adding the penalty should achieve a better fit out of sample even while the fit within the sample deteriorates due to the very same penalty.

Development and analysis: relation between the SARMA model and penalised linear regression
SARMA models are among those preferred for modelling and forecasting stationary processes that are also seasonal. SARMA models can be expressed as higher order ARMA models (in turn, an ARMA model can be approximated by a higher order AR model). The AR part attempts to model the variable in question as the result of a linear combination of its own past values. The MA part models the error term as a linear combination of contemporaneous and lagged white noise terms. The ARMA model attempts to express the variable in question as a parsimonious combination of an AR part and an MA part.
Penalised regression models, on the other hand, attempt to model and predict any dependent variable as a linear combination of independent variables. Unlike the least squares estimation of the standard linear regression, however, the estimation in penalised regressions adds a term to the minimisation that has the general effect of decreasing the values of the coefficients, a method called shrinkage or regularisation. The ideal amount of shrinkage is determined in practice by cross-validation, that is, by holding out part of the sample as a validation sample in which the generality of the prediction achieved by the method in the remaining (training) sample is examined and corroborated. Penalised regression models are now part of the basic arsenal of machine learning.
The observation that a SARMA model can be approximated by a higher order AR model, and therefore by a linear regression of lagged terms, raises the question of whether the use of penalised regression can improve the predictive capacity of an original SARMA model. We address this question empirically using inflation data from several countries and groups of countries, and leave open the debate about which method is superior in other samples.

SARMA models
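SARMA models, introduced by Box and Jenkins (1976), are a generalisation of ARMA models that consider seasonality in the data generating process. Seasonality is defined as a periodic pattern in the time series. We use here the multiplicative SARMA model SARMA$(p, q) \times (P, Q)_s$, defined as a model for a stationary time series $y_t$ which obeys (1):

$$\phi(L)\,\Phi(L^s)\,(y_t - \mu) = \theta(L)\,\Theta(L^s)\,\varepsilon_t \tag{1}$$

where $L$, defined by (2),

$$L^k y_t = y_{t-k} \tag{2}$$

is the lag operator, $\varepsilon_t$ is white noise, and the AR, MA, seasonal AR and seasonal MA polynomials are defined as (3):

$$\begin{aligned}
\phi(L) &= 1 - \phi_1 L - \cdots - \phi_p L^p, & \theta(L) &= 1 + \theta_1 L + \cdots + \theta_q L^q,\\
\Phi(L^s) &= 1 - \Phi_1 L^s - \cdots - \Phi_P L^{sP}, & \Theta(L^s) &= 1 + \Theta_1 L^s + \cdots + \Theta_Q L^{sQ}.
\end{aligned} \tag{3}$$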

• A SARMA model approximated as an AR model
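It should be noted that the SARMA$(p, q) \times (P, Q)_s$ model can always be expressed as an ARMA model of order $(p + sP,\ q + sQ)$ of the form (4):

$$\phi^*(L)\,(y_t - \mu) = \theta^*(L)\,\varepsilon_t \tag{4}$$

where (5)

$$\phi^*(L) = \phi(L)\,\Phi(L^s) \quad \text{and} \quad \theta^*(L) = \theta(L)\,\Theta(L^s). \tag{5}$$

By the Wold representation theorem, if $y_t - \mu$ is covariance-stationary, $y_t - \mu$ can be expressed as (6):

$$y_t - \mu = \sum_{j=0}^{\infty} \psi_j\,\varepsilon_{t-j} + \eta_t \tag{6}$$

where $\eta_t$ is a deterministic time series. In the context of the ARMA model (4), this representation takes the form (7):

$$y_t - \mu = \phi^*(L)^{-1}\,\theta^*(L)\,\varepsilon_t \tag{7}$$

If $y_t - \mu$ is invertible, one can also introduce (9):

$$\pi(L) = \theta^*(L)^{-1}\,\phi^*(L) = 1 - \sum_{k=1}^{\infty} \pi_k L^k \tag{9}$$

in order to write (10):

$$\pi(L)\,(y_t - \mu) = \varepsilon_t \tag{10}$$

In this way, $y_t - \mu$ can be expressed as an AR($\infty$) process (12):

$$y_t - \mu = \sum_{k=1}^{\infty} \pi_k\,(y_{t-k} - \mu) + \varepsilon_t \tag{12}$$

This expression can in turn be truncated to yield an approximation of $y_t - \mu$ as an AR($m$) process (13):

$$y_t - \mu \approx \sum_{k=1}^{m} \pi_k\,(y_{t-k} - \mu) + \varepsilon_t \tag{13}$$

This shows that the SARMA$(p, q) \times (P, Q)_s$ model can be approximated as an AR($m$) process, which is no more than a linear regression on $m$ lagged terms. In this sort of approximation, it is convenient to have $m$ at least as large as $s$ in order to approximate at least a SARMA$(p, q) \times (1, 1)_s$ model.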
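The passage from a seasonal ARMA model to AR weights can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes the sign conventions $\phi(L) = 1 - \sum_i \phi_i L^i$ and $\theta(L) = 1 + \sum_j \theta_j L^j$, represents each lag polynomial as a list of coefficients in increasing powers of $L$, and the helper names `poly_mul` and `ar_weights` are hypothetical. It multiplies the regular and seasonal polynomials and then solves $\theta^*(L)\,\pi(L) = \phi^*(L)$ recursively for the first $m$ weights $\pi_k$.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists (powers of L)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[i + j] += a_i * b_j
    return out


def ar_weights(ar_poly, ma_poly, m):
    """First m weights pi_k in y_t = sum_k pi_k * y_{t-k} + eps_t (mean removed).

    ar_poly and ma_poly both start with 1.0. The recursion matches the
    coefficients of L^k on both sides of ma_poly(L) * pi(L) = ar_poly(L),
    where pi(L) = 1 - sum_k pi_k L^k.
    """
    a = [1.0]  # coefficients of pi(L); a[k] equals -pi_k for k >= 1
    for k in range(1, m + 1):
        c_k = ar_poly[k] if k < len(ar_poly) else 0.0
        acc = sum(ma_poly[j] * a[k - j]
                  for j in range(1, min(k, len(ma_poly) - 1) + 1))
        a.append(c_k - acc)
    return [-a_k for a_k in a[1:]]


# ARMA(1, 1) with phi_1 = 0.5 and theta_1 = 0.4 (illustrative values only).
arma_weights = ar_weights([1.0, -0.5], [1.0, 0.4], 3)

# Purely autoregressive seasonal model with s = 4 (illustrative values only):
# (1 - 0.5 L)(1 - 0.3 L^4) = 1 - 0.5 L - 0.3 L^4 + 0.15 L^5.
seasonal_ar = poly_mul([1.0, -0.5], [1.0, 0.0, 0.0, 0.0, -0.3])
seasonal_weights = ar_weights(seasonal_ar, [1.0], 5)
```

In the ARMA(1, 1) case the weights decay geometrically at rate $\theta_1$, which is why a moderate truncation order $m$ can already give a close approximation when the MA part is clearly invertible.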

Penalised linear regression
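It has just been shown that $y_t - \mu$ can be expressed approximately as a linear regression on $m$ lagged values (14):

$$y_t - \mu \approx \sum_{k=1}^{m} \pi_k\,(y_{t-k} - \mu) + \varepsilon_t \tag{14}$$

Rearranging, this is a regression of $y_t$ on a constant and its own $m$ lags, with intercept $\mu\,(1 - \sum_{k=1}^{m} \pi_k)$. The representation of $y_t$ as a linear regression motivates the use of machine learning methods to predict $y_t$, in particular the use of a penalised linear regression with regularisation and cross-validation. In this paper, we use Tikhonov regularisation based on the $L_2$-norm.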
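To make the regression view concrete, the following sketch builds the design matrix of a constant and $m$ lags from a univariate series. The function name `lag_matrix` and the toy series are assumptions introduced for illustration; the paper does not specify an implementation.

```python
def lag_matrix(series, m):
    """Return (rows, targets) for a regression of y_t on 1, y_{t-1}, ..., y_{t-m}.

    Each row is [1.0, y_{t-1}, ..., y_{t-m}] and the matching target is y_t.
    The first m observations are used only as lags, never as targets.
    """
    rows, targets = [], []
    for t in range(m, len(series)):
        rows.append([1.0] + [series[t - k] for k in range(1, m + 1)])
        targets.append(series[t])
    return rows, targets


# Toy series (illustrative values only), with m = 2 lags.
rows, targets = lag_matrix([1, 2, 3, 4, 5], 2)
```

With monthly data and annual seasonality, choosing `m` of at least 12 keeps the seasonal lag among the predictors.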

• Tikhonov regularisation
Tikhonov regularisation (Hoerl & Kennard, 1970; Tikhonov & Arsenin, 1977) consists in adding a new penalty term to the standard quadratic loss function in the least squares minimisation problem. So, instead of minimising just the sum of squares of the residuals (15):

$$\min_{\pi}\ \|e\|^2 = \min_{\pi}\ \|y - x\pi\|^2 \tag{15}$$

in order to find the vector $\pi$ of estimated coefficients, the following minimisation is performed (16):

$$\min_{\pi}\ \|y - x\pi\|^2 + \lambda\,\|\tilde{\pi}\|^2 \tag{16}$$

where $\lambda$ is a non-negative hyperparameter, $\tilde{\pi}$ is the vector of lag coefficients $\{\pi_k\}_{k=1}^{m}$, $\|\cdot\|$ is the Euclidean or $L_2$-norm defined as $\|\tilde{\pi}\| = \sqrt{\pi_1^2 + \cdots + \pi_m^2}$, $e$ is the vector of estimated residuals of the linear regression, $y$ is the vector of $y_t$ values in time, and $x$ is the matrix of predictors made of the column vectors $(1\ \ y_{-1}\ \cdots\ y_{-m})$, where $1$ is a vector of ones and $y_{-k}$ is the vector of lagged values $y_{t-k}$.
This kind of regression is also referred to as ridge regression (Hoerl & Kennard, 1970; Anzola, Vargas & Morales, 2019). Its purpose is to shrink the size of the coefficients to avoid their being excessively large. This tends to deteriorate the in-sample performance for the sake of a better out-of-sample performance, improving the fit to the signal in the data, instead of its noise.
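A minimal sketch of the ridge estimator follows, assuming that the first column of the predictor matrix is the column of ones and that the intercept is left unpenalised (the penalty norm above runs only over the lag coefficients). It solves the penalised normal equations with a small Gaussian elimination routine so that it needs only the standard library; the names `solve` and `ridge_fit` and the toy data are hypothetical and not taken from the paper.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_fit(X, y, lam):
    """Coefficients minimising ||y - X pi||^2 + lam * (pi_1^2 + ... + pi_m^2).

    X is a list of rows whose first entry is 1.0 (intercept, not penalised).
    """
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    for i in range(1, p):  # add lam to the diagonal, skipping the intercept
        A[i][i] += lam
    b = [sum(row[i] * y_t for row, y_t in zip(X, y)) for i in range(p)]
    return solve(A, b)


# Toy data lying exactly on y = 1 + 2 x (illustrative values only).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
ols_coefficients = ridge_fit(X, y, 0.0)    # no penalty: ordinary least squares
ridge_coefficients = ridge_fit(X, y, 5.0)  # the slope is shrunk towards zero
```

Setting `lam` to zero recovers ordinary least squares, while larger values trade in-sample fit for smaller coefficients, which is the shrinkage effect described above.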
Other more general regularisation schemes exist, such as the elastic net (Zou & Hastie, 2005), but we focus on a minimal working example. In the elastic net, the minimisation involves two simultaneous penalty terms (17):

$$\min_{\pi}\ \|e\|^2 + \lambda\left[\rho\,\|\tilde{\pi}\|^2 + (1 - \rho)\,|\tilde{\pi}|\right] \tag{17}$$

where $|\cdot|$ is the $L_1$-norm defined as $|\tilde{\pi}| = |\pi_1| + \cdots + |\pi_m|$. The elastic net regularisation contains as a special case the $L_1$-regularised linear regression or LASSO (Santosa & Symes, 1986; Tibshirani, 1996) when $\rho = 0$, which entails a variable selection method, since it attempts to make one or more of the $\pi_k$ equal to zero.

• Cross-validation in the ridge regression
We restrict ourselves to one-fold cross-validation for the choice of the right $\lambda$ parameter in the ridge regression. The data used to estimate the linear model are divided into two subsets: a training subset with time indices $\tau_1 = \{1, \ldots, t_1\}$ and a validation subset with time indices $\tau_2 = \{t_2 = t_1 + 1, \ldots, t_3 - 1\}$. The penalised least squares minimisation problem is solved repeatedly in the training subset $\tau_1$ for a sequence of values of $\lambda$. The performance for each value of $\lambda$ is evaluated both in $\tau_1$ and in $\tau_2$.
The potential values of $\lambda$ to be explored are computed using Bayesian optimisation (Mockus, 1989), a kind of optimisation that attempts to minimise the number of evaluations while preserving the breadth of the exploration of the range of possible parameters. It is expected that, at first, the minimum found decreases both in $\tau_1$ and in $\tau_2$. Soon enough, however, the minimisation progresses only in $\tau_1$ and not out of sample (in $\tau_2$), where the performance deteriorates. At that step, the search for a better $\lambda$ value is stopped. In effect, the cross-validation mechanism validates that the solution of the minimisation problem in $\tau_1$ is generalisable (to the validation sample $\tau_2$). This avoids overfitting the parameters and selection bias, since the performance of the vector $\pi$ for predicting the estimated $\hat{y}_t$ is evaluated in data that was not used to construct the model, just as in real life.
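The validation logic can be sketched in a few lines. To stay short, the sketch makes several simplifying assumptions that differ from the paper: it uses a single lag and no intercept, so the ridge solution has the closed form $\sum y_{t-1} y_t / (\sum y_{t-1}^2 + \lambda)$; it scans a fixed grid of $\lambda$ values rather than using Bayesian optimisation; and the function names and the toy series are hypothetical.

```python
def ridge_slope(y_lag, y_now, lam):
    """One-predictor ridge estimate without intercept."""
    num = sum(a * b for a, b in zip(y_lag, y_now))
    den = sum(a * a for a in y_lag)
    return num / (den + lam)


def mse(y_lag, y_now, slope):
    """Mean squared error of the forecasts slope * y_{t-1}."""
    return sum((b - slope * a) ** 2 for a, b in zip(y_lag, y_now)) / len(y_now)


def select_lambda(series, t1, grid):
    """Fit on the first t1 observations, score on the rest, keep the best lam.

    Returns (best_lam, scores), where scores maps each lam in the grid to its
    validation mean squared error.
    """
    lag, now = series[:-1], series[1:]
    # Pair j is (y_j, y_{j+1}); targets with index below t1 are training data.
    train_lag, train_now = lag[:t1 - 1], now[:t1 - 1]
    valid_lag, valid_now = lag[t1 - 1:], now[t1 - 1:]
    scores = {}
    for lam in grid:
        slope = ridge_slope(train_lag, train_now, lam)
        scores[lam] = mse(valid_lag, valid_now, slope)
    best_lam = min(scores, key=scores.get)
    return best_lam, scores


# Toy series (illustrative values only): 6 training and 4 validation points.
toy_series = [1.0, 0.8, 0.5, 0.45, 0.2, 0.25, 0.1, 0.12, 0.05, 0.04]
best_lam, scores = select_lambda(toy_series, 6, [0.0, 0.1, 1.0])
```

The key design point is that the coefficient is always estimated on the training block only, while the choice among candidate $\lambda$ values is made on the held-out block.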

Methodology
The inflation rates of 35 OECD countries and three groups of countries (OECD, 2019) were selected for the out-of-sample performance comparison of the SARMA and ridge regression predictions. These are all the inflation rates of the OECD sample which exhibited no seasonal integration according to the OCSB test (Osborn, Chui, Smith, & Birchenhall, 2009), allowing us to apply a SARMA model more appropriately, since a SARIMA model would be more suitable in the case of seasonal integration. The inflation rates were all stationary according to standard tests such as the ADF test (Dickey & Fuller, 1979; MacKinnon, 1996). Since we want to include annual seasonality, we set $s = 12$ and include at least 12 lagged predictors in the ridge regression.
In order to evaluate the performance of the ridge regression completely out of sample, the model is used to generate one-step ahead forecasts. A realistic forecaster does not fit the forecasting model once on the sample $\tau_1 \cup \tau_2$ as a whole in order to predict a new value for the next period; instead, every new period the forecaster expands $\tau_1$ with the next value in the time sequence and moves $\tau_2$ one value to the right in time. This method of evaluating forecasts was reproduced with an expanding window of one-step forecasts with re-estimation. That is, every new period, the whole model is re-estimated in order to produce the one-step ahead forecast. This sequence of one-step ahead forecasts $\hat{y}_t$ can be indexed with a new set $\tau_3 = \{t_3, \ldots, T\}$, so that the true out-of-sample evaluation uses this set of indices $\tau_3$. The performance of the SARMA model is evaluated similarly, except that the whole sample $\tau_1 \cup \tau_2$ is used to estimate the model, expanding it one period to the right in time when forming every forecast, since no cross-validation is performed. The initial $\tau_1$ sample starts in February 2010 and ends in January 2015, while the final $\tau_1$ sample starts in February 2010 and ends in October 2016. The $\tau_2$ sample always has a fixed length of 25 months, so the initial $\tau_2$ sample starts in February 2015 and ends in February 2016, and the final $\tau_2$ sample starts in October 2016 and ends in November 2017. Finally, the $\tau_3$ sample starts in March 2016 and ends in December 2017. One has to bear in mind that for the ridge regression the twelve past months are used for every one-step ahead forecast, so, for example, for every cross-validation performed in the $\tau_2$ sample, 13 ridge regression forecasts are computed.
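The expanding-window scheme with re-estimation can be sketched generically. In the sketch below, `fit_predict` stands for any routine that takes the history available so far and returns a one-step ahead forecast; that callable, the function names and the toy series are assumptions introduced for illustration, and the expanding mean is used only as a stand-in model.

```python
def expanding_window_forecasts(series, first_forecast_index, fit_predict):
    """One-step ahead forecasts with full re-estimation at every period.

    For each period t from first_forecast_index onwards, the model sees only
    series[:t] (strictly past data) and returns a forecast for series[t].
    """
    forecasts = []
    for t in range(first_forecast_index, len(series)):
        history = series[:t]
        forecasts.append(fit_predict(history))
    return forecasts


def expanding_mean(history):
    """Stand-in model: forecast the average of everything observed so far."""
    return sum(history) / len(history)


# Toy series (illustrative values only); forecasts are made for the last two points.
toy_forecasts = expanding_window_forecasts([1.0, 2.0, 3.0, 4.0], 2, expanding_mean)
```

For the ridge regression, `fit_predict` would itself split the history into a training block and a validation block before forecasting, whereas for the SARMA model it would use the whole history, as described above.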
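The predictive performance of both models in all countries and groups of countries is assessed via the overall out-of-sample $R^2$ (18) (Gu, Kelly, & Xiu, 2018):

$$R^2_{oos} = 1 - \frac{\sum_{i}\sum_{t+1 \in \tau_3} \left(y_{i,t+1} - \hat{y}_{i,t+1}\right)^2}{\sum_{i}\sum_{t+1 \in \tau_3} \left(y_{i,t+1} - \bar{y}_{i,t+1}\right)^2} \tag{18}$$

where $i$ denotes the country/group of countries and $\bar{y}_{i,t+1} = \frac{1}{t}\sum_{j=1}^{t} y_{i,j}$ is the naive forecast made of the average inflation so far for each period since the beginning of $\tau_1$.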
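A small sketch of the pooled out-of-sample $R^2$ against the naive expanding-mean benchmark follows. The data layout (one list per country, aligned on the evaluation periods), the function names and the toy numbers are assumptions made for illustration.

```python
def naive_forecasts(series, first_forecast_index):
    """Expanding-mean benchmark: forecast for period t is the mean of series[:t]."""
    return [sum(series[:t]) / t
            for t in range(first_forecast_index, len(series))]


def pooled_r2_oos(actual, forecast, naive):
    """Pooled out-of-sample R^2 across countries.

    Each argument maps a country label to a list of values aligned on the
    evaluation periods. Squared errors are summed over countries and periods.
    """
    sse = sum((a - f) ** 2
              for c in actual for a, f in zip(actual[c], forecast[c]))
    sst = sum((a - n) ** 2
              for c in actual for a, n in zip(actual[c], naive[c]))
    return 1.0 - sse / sst


# Toy data (illustrative values only) for two hypothetical countries.
actual = {"A": [1.0, 3.0], "B": [2.0, 2.0]}
naive = {"A": [2.0, 2.0], "B": [1.0, 3.0]}
perfect_r2 = pooled_r2_oos(actual, actual, naive)  # forecasts equal outcomes
benchmark_r2 = pooled_r2_oos(actual, naive, naive)  # forecasts equal the benchmark
```

A positive value means the model beats the naive benchmark on pooled squared error, and a negative value means it does worse.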
We also calculated the out-of-sample $R^2$ for each country or group of countries (19) (Kvalseth, 1985):

$$R^2_{oos,i} = 1 - \frac{\sum_{t+1 \in \tau_3} \left(y_{i,t+1} - \hat{y}_{i,t+1}\right)^2}{\sum_{t+1 \in \tau_3} \left(y_{i,t+1} - \bar{y}_{i,t+1}\right)^2} \tag{19}$$

with $\bar{y}_{i,t+1}$ the naive forecast. In order to compare the predictive models, we used the Diebold and Mariano (2002) test for differences in out-of-sample predictive accuracy between two models. The sample of inflation rates should have strong cross-sectional correlations due to the global macroeconomic factors influencing the OECD countries, so the weak dependence assumption of the test is violated. Thus, we used the version of the test of Gu et al. (2018) to compare two methods of forecasting, where the test statistic is $DM_{12} = \bar{d}_{12} / \hat{\sigma}_{\bar{d}_{12}}$, where (20):

$$d_{12,t+1} = \frac{1}{N}\sum_{i=1}^{N}\left[\left(\hat{e}^{(1)}_{i,t+1}\right)^2 - \left(\hat{e}^{(2)}_{i,t+1}\right)^2\right] \tag{20}$$

$\hat{e}^{(1)}_{i,t+1}$ and $\hat{e}^{(2)}_{i,t+1}$ are the prediction errors for the inflation of country/group of countries $i$ at time $t+1$ using each method, the first (1) or the second (2), $N$ is the number of countries/groups of countries, and $\bar{d}_{12}$ and $\hat{\sigma}_{\bar{d}_{12}}$ are the mean and Newey-West standard error of $d_{12,t+1}$ over the $\tau_3$ sample. According to Gu et al. (2018), due to the low potential autocorrelation in the $d_{12,t+1}$ time series, the asymptotic normality of the statistic is more likely to be guaranteed, which in turn allows for appropriate p-values in the comparisons.
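The cross-sectionally averaged statistic can be sketched as follows. This is an illustration under assumptions: errors are stored as one list per country, the Newey-West standard error uses Bartlett weights, and the number of lags (`lags`, defaulting to zero) is a free choice that the text does not specify. All function names and the toy numbers are hypothetical.

```python
import math


def cross_sectional_loss_diff(err1, err2):
    """d_t: average over countries of squared error (1) minus squared error (2).

    err1[i][t] and err2[i][t] are the forecast errors of methods 1 and 2 for
    country i in evaluation period t.
    """
    n = len(err1)
    periods = len(err1[0])
    return [sum(err1[i][t] ** 2 - err2[i][t] ** 2 for i in range(n)) / n
            for t in range(periods)]


def newey_west_se(d, lags):
    """Newey-West standard error of the mean of d with Bartlett weights."""
    T = len(d)
    mean = sum(d) / T
    dev = [x - mean for x in d]
    long_run_var = sum(v * v for v in dev) / T
    for lag in range(1, lags + 1):
        gamma = sum(dev[t] * dev[t - lag] for t in range(lag, T)) / T
        long_run_var += 2.0 * (1.0 - lag / (lags + 1)) * gamma
    return math.sqrt(long_run_var / T)


def dm_statistic(err1, err2, lags=0):
    """Mean loss differential divided by its Newey-West standard error."""
    d = cross_sectional_loss_diff(err1, err2)
    return (sum(d) / len(d)) / newey_west_se(d, lags)


# Toy errors (illustrative values only) for a single hypothetical country.
toy_dm = dm_statistic([[2.0, 1.0, 2.0, 1.0]], [[1.0, 0.0, 1.0, 1.0]])
```

A positive statistic indicates that method 1 has the larger average squared error, so with method 1 as the SARMA model and method 2 as the ridge regression, large positive values favour the ridge regression.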
A grid search was conducted to select the best SARMA model. The grid search looked for the SARMA model with the minimal Akaike information criterion with a correction for small samples (AICc) (Burnham & Anderson, 2004) among all SARMA models with $p \in \{0, 1, 2, 3, 4, 5\}$, $q \in \{0, 1, 2, 3, 4, 5\}$, $P \in \{0, 1, 2\}$ and $Q \in \{0, 1, 2\}$ and $p + q + P + Q \le 5$. It is worth noting that, asymptotically, minimising the AIC is equivalent to minimising the out-of-sample one-step forecast mean square error (MSE) (Hyndman, 2013). The grid search is meant to be more comprehensive than popular stepwise procedures such as the Hyndman & Khandakar (2008) selection procedure of the auto.arima function of the R forecast package.
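The candidate set of this grid search and the selection criterion can be sketched as follows. The sketch only enumerates the admissible orders and states the AICc formula; fitting each SARMA model is left to whatever estimation routine is available and is not shown. The function names are hypothetical, and what is counted in `k` (for example, whether the constant and the innovation variance are included) is an assumption left to the user.

```python
from itertools import product


def sarma_order_grid(max_total=5):
    """All (p, q, P, Q) with p, q in 0..5, P, Q in 0..2 and p + q + P + Q <= max_total."""
    return [(p, q, P, Q)
            for p, q, P, Q in product(range(6), range(6), range(3), range(3))
            if p + q + P + Q <= max_total]


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC for k estimated parameters and n observations."""
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


grid = sarma_order_grid()
```

The model retained for each forecast would be the element of `grid` whose fitted log-likelihood gives the smallest `aicc` value.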
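Two common error measures were also computed for the sample countries. These measures were the Mean Absolute Error (MAE) (21) and the Root Mean Square Error (RMSE) (22), defined as:

$$MAE_i = \frac{1}{n(\tau_3)}\sum_{t \in \tau_3} \left|y_{i,t} - \hat{y}_{i,t}\right| \tag{21}$$

$$RMSE_i = \sqrt{\frac{1}{n(\tau_3)}\sum_{t \in \tau_3} \left(y_{i,t} - \hat{y}_{i,t}\right)^2} \tag{22}$$

where $n(\tau_3)$ is the cardinality of $\tau_3$. The mean absolute percentage error (MAPE) was not computable because the monthly inflation rate is often zero in different months for several countries.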
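Both error measures are straightforward to compute from aligned lists of outcomes and forecasts. The following sketch uses hypothetical function names and toy numbers chosen only for illustration.

```python
import math


def mean_absolute_error(actual, forecast):
    """Average absolute forecast error over the evaluation periods."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def root_mean_square_error(actual, forecast):
    """Square root of the average squared forecast error."""
    return math.sqrt(
        sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    )


# Toy values (illustrative only): forecast errors are 0, 1 and -2.
toy_mae = mean_absolute_error([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
toy_rmse = root_mean_square_error([1.0, 2.0, 3.0], [1.0, 1.0, 5.0])
```

The RMSE weights large errors more heavily than the MAE, so the two measures can rank forecasting methods differently for the same country.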

Results
The overall out-of-sample $R^2$ was greater for the forecasts based on the ridge regression (19.9%) than for the forecasts based on the best SARMA models (10.6%). Furthermore, in 75.8% of the cases in which the out-of-sample $R^2_{oos}$ for a country or group of countries was positive for the ridge regression forecasts, it was higher than for the forecasts of the best SARMA models (Table 1). The value of $DM_{12}$ was 2.73. Method 1 was the best SARMA model and method 2 was the ridge regression. This rejects the one-sided null hypothesis that the SARMA model has better out-of-sample predictive accuracy than the ridge regression, with p = 0.00311. Although the Diebold-Mariano test is oversized in small samples, the Appendix performs a Monte Carlo analysis which shows that p < 0.02 can be guaranteed even in an oversized scenario. Table 2 and Table 3 show the mean absolute error and the root mean squared error per country or group of countries for both forecasting methods. The mean absolute error of the ridge regression forecasts was better (lower) in 57.9% of the countries/groups of countries examined, while the root mean squared error was better in 63.2% of the countries/groups of countries. The results clearly show the superiority of the ridge regression in out-of-sample forecasting terms in the illustrative case chosen.

Conclusions
The hypothesis of this paper was that a SARMA model is mathematically related to a linear regression, so that using penalised linear regression for forecasting out of sample should surpass the forecast performance of the best SARMA models when forecasting inflation. The mathematical relation between both kinds of models was shown and a minimal working example based on ridge regression was built and applied to selected OECD countries and groups of countries as an empirical illustration of our hypothesis.
The illustrative case decisively showed the better forecasting performance of the ridge regression for forecasting inflation, introducing a new forecasting method. Our work can also be seen as a new way of estimating SARMA models with machine learning methods, by first expressing the SARMA model as an AR model, and then treating it as a penalised regression which uses a penalty in the least squares minimisation and cross-validation to fine-tune the penalty. Our results indicate that more attention should be given to machine learning techniques for time series forecasting of inflation, even as basic as penalised linear regressions, due to their superior empirical performance.

Appendix
Here we replicate the Monte Carlo analysis that was developed by Diebold and Mariano (2002) for Gaussian forecast errors, for the sample size of the forecast series T = 22. The original Diebold-Mariano (DM) test was found to be oversized for small sample sizes. Nevertheless, this simulation shows that in our results the Diebold-Mariano test between the best SARMA model forecast and the ridge regression is significant with p < 0.02, even in a bad oversized scenario, if Gaussian errors are assumed. After simplification, the original Diebold-Mariano test statistic DM for one-step ahead forecasts is equal to (23):

$$DM = \frac{\bar{d}}{\sqrt{\hat{\gamma}_d(0) / T}} \tag{23}$$

where $d_t = (e_{1t})^2 - (e_{2t})^2$ is the difference between the squares of the residuals of the two kinds of forecasts, 1 and 2, the bar represents the sample mean of the loss differential, and $\hat{\gamma}_d(0)$ is the sample autocovariance of the loss differential at displacement 0 (i.e. its sample variance).
For the Monte Carlo analysis, we draw realisations of $\{e_{1t}, e_{2t}\}_{t=1}^{T}$ following a bivariate forecast-error process with different degrees of contemporaneous and serial correlation in the forecast errors. First, we generate $\{v_{1t}, v_{2t}\}_{t=1}^{T}$, with $v_t \sim N(0, R)$, where (24):

$$R = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \tag{24}$$

is the desired contemporaneous correlation matrix. Second, we introduce MA(1) serial correlation with parameter $\theta$ as (25):

$$e_{it} = \frac{v_{it} + \theta\,v_{i,t-1}}{\sqrt{1 + \theta^2}}, \quad i = 1, 2 \tag{25}$$

where the scalar $1/\sqrt{1 + \theta^2}$ normalises the unconditional variance to 1. We use $v_{1,0} = v_{2,0} = 0$. It can be shown that the correlation between $v_1$ and $v_2$ is the same as the correlation between $e_1$ and $e_2$. Given this procedure, we calculate the empirical size of the DM test for all combinations of $\rho$ = 0, 0.5 and 0.9, and $\theta$ = 0, 0.5 and 0.9, at the same level as the p value obtained in the Diebold-Mariano test between the best SARMA model forecast and the ridge regression (i.e. at the 0.311% level). This shows us the expected p values for the chosen sample size of T = 22 and the chosen level under different degrees of contemporaneous and serial correlation. Table 4 illustrates that a p value under 0.02 is guaranteed in a bad oversized scenario for the Diebold-Mariano test between the best SARMA model forecast and the ridge regression.

Note: $\rho$ is the contemporaneous correlation between the innovations underlying the forecast errors, and $\theta$ is the parameter of the MA(1) forecast error. All of the tests are at the same level as the original Diebold-Mariano test between the best SARMA model forecast and the ridge regression (i.e. at the 0.311% level). A total of 50,000 Monte Carlo replications are performed.

References