Neural Networks for Macroeconomic Forecasting:
A Complementary Approach to Linear Regression Models
By Steven Gonzalez
Working Paper 2000-07

An application to real GDP forecasting.
- The linear regression model.
- The neural network equivalent.
- Producing forecasts.
- Relative forecasting performance.
An empirical example is perhaps the best way to illustrate the differences
between a neural network and a linear regression model. A neural network has
therefore been constructed to forecast quarterly growth of Canada's real GDP.
This model is compared to a linear regression model developed in the Department
of Finance by Lamy (1999). To facilitate the comparison, the neural network
uses exactly the same explanatory variables and the same sample period as the
linear regression model. Any differences in the results can therefore be
attributed solely to the estimation procedure.
The linear regression model.
Lamy (1999) has developed an accurate model for one-quarter
ahead forecasts of quarterly growth of Canada's real GDP. The model has
performed very well, both in-sample and out-of-sample. Over the period from
1978Q1 to 1998Q2, his model explains 82 per cent of the variance of real GDP
growth. The estimated coefficients are also very stable when the model is
estimated over different sample periods. In addition, the model is quite
parsimonious, as it contains only the following six explanatory variables (with
their abbreviations in parentheses):
- The quarterly growth rate of Finance Canada's index of leading indicators of economic activity (one-quarter lag) (L_{t-1})
- Employment growth (contemporaneous) (E_t)
- Employment growth (one-quarter lag) (E_{t-1})
- The Conference Board's index of consumer confidence (contemporaneous) (C_t)
- The first difference of the real long-term interest rate (nine-quarter lag) (R_{t-9})
- The first difference of the federal government budgetary balance as a share of GDP (three-quarter lag) (F_{t-3})
Four dummy variables were added to control for four quarters
considered outliers (the four quarters in question are 1980Q3, 1981Q1,
1986Q4 and 1991Q3). For the purposes of the present illustration, and in order
to leave some data for out-of-sample forecasts, the linear regression model was
estimated using data from 1978Q1 to 1993Q2 (62 observations). The estimation
results are given in equation (1):
GDP_t = -1.695 + 0.075•L_{t-1} + 0.304•E_t + 0.251•E_{t-1} + 0.019•C_t - 0.175•R_{t-9} - 0.320•F_{t-3}
        - 1.155•D_1 + 1.168•D_2 - 0.906•D_3 - 0.843•D_4 + e_t    (1)

where D_1, D_2, D_3 and D_4 are the dummy variables and e_t is the residual
term. All coefficients are significantly different from zero at a confidence
level of 95 per cent.
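For readers who wish to replicate this kind of estimation, the fit of equation (1) amounts to a standard ordinary least squares regression. The following sketch, in Python with NumPy rather than the econometric software originally used, shows the computation on placeholder arrays; the variable ordering is an assumption and no data are supplied here.

```python
import numpy as np

def estimate_ols(X_vars, y):
    """OLS fit of a model like equation (1).

    X_vars : (62, 10) array of regressors in the assumed order
             [L(t-1), E(t), E(t-1), C(t), R(t-9), F(t-3), D1, D2, D3, D4]
             (placeholder data for 1978Q1-1993Q2)
    y      : (62,) vector of quarterly real GDP growth
    """
    X = np.column_stack([np.ones(len(y)), X_vars])  # prepend the constant
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares coefficients
    resid = y - X @ coef                            # e(t) in equation (1)
    r_squared = 1.0 - resid.var() / y.var()         # in-sample fit
    return coef, resid, r_squared
```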
The neural network equivalent.
Determining the number of inputs and outputs of the network is a
straightforward process. The linear regression model above has six regressors
and four dummy variables. Our neural network will therefore contain ten inputs.
Since we have only one dependent variable, the network will have a single
output. As explained in Section 2.2, the number of hidden units to include in
the network is largely the result of trial and error. For the present
illustration, an architecture with only two hidden units was selected. Although
this architecture is probably too simplistic to capture the complexity of the
problem at hand, its simplicity will facilitate the comprehension of the
dynamics of the model. The purpose of this section is thus to present a
concrete example of a network, not to develop a high-performance forecasting
tool. Hence, the network presented here should not be considered as the best
possible model that neural networks could offer.
Figure 9 illustrates the architecture of the network. This is a
fully connected network, i.e. all inputs are connected to all hidden units.
Bias terms have been included for both the hidden units (Bias_H) and the output
unit (Bias_O). In addition, direct connections have been added between the inputs and
outputs (dashed arrows), thus yielding an augmented neural network (cf. Figure
6). As explained above, the augmented neural network nests the linear
regression model. The hidden units have a hyperbolic tangent activation
function while the output unit has an identity activation function.
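To make the structure of Figure 9 concrete, the forward pass of the augmented network can be written out directly. The sketch below is an illustration in Python/NumPy, not the MATLAB/Netlab implementation used for the paper; the array names and shapes are assumptions that follow the architecture just described (ten inputs, two tanh hidden units, direct connections and the two bias terms, i.e. the 35 weights in total).

```python
import numpy as np

def augmented_net_forecast(x, W_h, b_h, w_o, b_o, w_direct):
    """Forward pass of an augmented network like the one in Figure 9.

    x        : (10,) vector of inputs (6 regressors + 4 dummies)
    W_h      : (2, 10) input-to-hidden weights
    b_h      : (2,) hidden-unit biases (Bias_H)
    w_o      : (2,) hidden-to-output weights
    b_o      : scalar output bias (Bias_O)
    w_direct : (10,) direct input-to-output weights (the dashed arrows)
    """
    h = np.tanh(W_h @ x + b_h)           # hyperbolic tangent activation
    return b_o + w_o @ h + w_direct @ x  # identity activation at the output

# With w_o set to zero, the forecast collapses to b_o + w_direct @ x,
# i.e. the linear regression model that the augmented network nests.
```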
The early stopping procedure was used to estimate the 35 weights
of the network (The neural network literature suggests that rescaling the data
is often beneficial to improve forecasting accuracy because the estimation
algorithms tend to perform better when the input values are small and centred
around zero. In the network presented here, only the Conference Board index of
consumer confidence was modified by dividing all values by 100. Therefore, the
base value of the index is 1 instead of 100.) In order to implement this
procedure, the sample was divided into three separate parts: a training set
(1978Q1 to 1993Q2), a validation set (1993Q3 to 1995Q4) and a test set (1996Q1
to 1998Q2). The training set, which corresponds to the in-sample period of the
linear regression model in equation (1), is the only portion of the data that
the training algorithm used to estimate the network weights.
[Figure 9: Architecture of the augmented neural network]
Figure 10 illustrates the evolution of the mean absolute error
(MAE) (The software used to estimate the network weights (MATLAB with the
Netlab toolbox) was programmed to provide the mean absolute forecasting error,
rather than the mean squared error as discussed in Section 3. This does not
have a significant effect on the results. The Netlab toolbox can be downloaded
free of charge from http://www.ncrg.aston.ac.uk/netlab/index.html.) in the
training and validation sets throughout the iteration process. The MAE in the
validation set reaches a minimum after 819 iterations, while the MAE in the
training set continues to decline. To reduce the risk of overfitting the
network, the procedure was therefore stopped after 819 iterations, with an MAE
of 0.118 in the validation set.
[Figure 10: Mean absolute error in the training and validation sets over the iteration process]
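The logic of the procedure can be sketched as a simple training loop that tracks the validation MAE and retains the weights at which it is lowest. This is a schematic Python version under assumed interfaces (a `net` object with a `forecast` method and mutable `weights`, and a `grad_step` update routine), not the Netlab routine actually used:

```python
import copy
import numpy as np

def train_with_early_stopping(net, grad_step, train, valid, max_iter=5000):
    """Keep the weights with the lowest mean absolute error on the validation set.

    net          : object with .forecast(X) and a mutable .weights attribute
    grad_step    : function that updates net.weights in place on the training set
    train, valid : (X, y) tuples
    """
    best_mae, best_weights, best_iter = np.inf, None, 0
    for it in range(1, max_iter + 1):
        grad_step(net, *train)                       # one training iteration
        mae = np.mean(np.abs(valid[1] - net.forecast(valid[0])))
        if mae < best_mae:                           # new validation minimum
            best_mae = mae
            best_weights = copy.deepcopy(net.weights)
            best_iter = it
    net.weights = best_weights    # e.g. iteration 819 with MAE 0.118 in the text
    return best_iter, best_mae
```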
In Figure 11, the estimated weights are presented for various sections of the
network. The connections from the inputs and Bias_H to the hidden unit H_1 are
presented in Panel A. Panel B presents a similar diagram for the connections
between the inputs and H_2. Panel C displays the estimated weights between the
hidden units and the output unit, and Panel D illustrates the direct
connections from the inputs to the output.
[Figure 11: Estimated weights of the network (Panels A to D)]
Producing forecasts.
Suppose we want to produce a forecast of real GDP growth for 1998Q2. The values
of the regressors/inputs for 1998Q2 are given below:

L_{t-1} = 0.66, E_t = 0.67, E_{t-1} = 0.75, C_t = 114.97 (1.1497 after the
rescaling by 100 used in the network), R_{t-9} = 1.13, F_{t-3} = 0.84,
D_1 = D_2 = D_3 = D_4 = 0.
In the case of the linear regression model, the forecast for 1998Q2 is
straightforward:

GDP_t = -1.695 + 0.075•(0.66) + 0.304•(0.67) + 0.251•(0.75) + 0.019•(114.97) - 0.175•(1.13)
        - 0.320•(0.84) - 1.155•(0) + 1.168•(0) - 0.906•(0) - 0.843•(0) = 0.46    (2)

The linear regression model forecasts real GDP growth of 0.46 per cent in
1998Q2. Actual real GDP growth in 1998Q2 was 0.44 per cent.
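The arithmetic in equation (2) reduces to a dot product between the coefficient vector and the input values, as the following check (an illustration, not part of the original paper) confirms:

```python
import numpy as np

# Constant, L(t-1), E(t), E(t-1), C(t) (unscaled index), R(t-9), F(t-3), D1-D4.
coef = np.array([-1.695, 0.075, 0.304, 0.251, 0.019, -0.175,
                 -0.320, -1.155, 1.168, -0.906, -0.843])
x = np.array([1.0, 0.66, 0.67, 0.75, 114.97, 1.13, 0.84, 0, 0, 0, 0])
print(round(float(coef @ x), 2))   # 0.46, matching equation (2)
```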
In the neural network, the first step consists in calculating the value of the
hidden units. To obtain a value for H_1, the network must first multiply the
value of each input by the corresponding weight and sum the results, as
depicted in Panel A of Figure 11. This sum is denoted Z_1:

Z_1 = -0.401 - 0.058•(0.66) + 0.292•(0.67) - 0.207•(0.75) + 0.550•(1.1497) + 0.754•(1.13)
      + 0.294•(0.84) - 1.038•(0) + 0.487•(0) - 0.290•(0) - 0.209•(0) = 1.332    (3)
The value of H_1 is obtained by inserting Z_1 into the hyperbolic tangent
activation function:

H_1 = tanh(Z_1) = (e^{Z_1} - e^{-Z_1}) / (e^{Z_1} + e^{-Z_1})
    = (e^{1.332} - e^{-1.332}) / (e^{1.332} + e^{-1.332}) = 0.8697    (4)
In a similar fashion, using the weights in Panel B of Figure 11, H_2 can be
shown to equal:

H_2 = tanh(-0.6999) = (e^{-0.6999} - e^{0.6999}) / (e^{-0.6999} + e^{0.6999}) = -0.6043    (5)
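These hidden-unit calculations are easy to verify numerically. The sketch below reproduces equations (3) to (5); the Panel A weights are taken from equation (3), while for H_2 only the net input Z_2 = -0.6999 reported in the text is used, since the Panel B weights are not listed here:

```python
import numpy as np

# Bias_H followed by the ten input-to-H1 weights from Panel A (equation (3)).
w1 = np.array([-0.401, -0.058, 0.292, -0.207, 0.550, 0.754,
               0.294, -1.038, 0.487, -0.290, -0.209])
x = np.array([1.0, 0.66, 0.67, 0.75, 1.1497, 1.13, 0.84, 0, 0, 0, 0])
z1 = float(w1 @ x)             # 1.332, as in equation (3)
h1 = np.tanh(z1)               # 0.8697, as in equation (4)
h2 = np.tanh(-0.6999)          # -0.6043, as in equation (5); Z2 taken from the text
print(round(z1, 3), round(float(h1), 4), round(float(h2), 4))
```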
Using the analogy of the human brain and remembering that the
values produced by the hyperbolic tangent function are bounded between –1 and
1, it could be said that the neuron H1 is being strongly stimulated
by the specific stimuli provided by the input values, while H2 is
moderately stimulated.
The output unit GDP_t has an identity activation function, meaning that the
hyperbolic tangent function will not be used to process the linear combination
of the "stimuli" reaching the output unit. The network's forecast will simply
be equal to the linear combination of the hidden units (Panel C of Figure 11)
and the inputs (Panel D):

GDP_t = -0.081 + 1.45•H_1 + 0.604•H_2 + 0.084•L_{t-1} + 0.335•E_t + 0.751•E_{t-1} + 0.339•C_t
        - 1.067•R_{t-9} - 0.649•F_{t-3} + 0.048•D_1 + 0.21•D_2 - 0.256•D_3 - 0.55•D_4    (6)
Evaluating this equation with the values for 1998Q2 yields:

GDP_t = -0.081 + 1.45•(0.8697) + 0.604•(-0.6043) + 0.084•(0.66) + 0.335•(0.67) + 0.751•(0.75)
        + 0.339•(1.1497) - 1.067•(1.13) - 0.649•(0.84) + 0.048•(0) + 0.21•(0) - 0.256•(0) - 0.55•(0) = 0.30    (7)
The network therefore forecasts real GDP growth of 0.30 per cent
for 1998Q2, which is considerably less accurate than the linear model's
forecast. Obviously, the calculations required to make a forecast using a
neural network are significantly more complex than in the case of a linear
regression model. Fortunately, these calculations can be executed
instantaneously by the same software that estimated the network weights.
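The whole forward pass of equations (6) and (7) can be re-computed in the same way; the following check is again illustrative rather than the software actually used:

```python
import numpy as np

h = np.array([0.8697, -0.6043])                   # hidden-unit values from (4) and (5)
w_hidden = np.array([1.45, 0.604])                # Panel C: hidden-to-output weights
w_direct = np.array([0.084, 0.335, 0.751, 0.339,  # Panel D: direct connections
                     -1.067, -0.649, 0.048, 0.21, -0.256, -0.55])
x = np.array([0.66, 0.67, 0.75, 1.1497, 1.13, 0.84, 0, 0, 0, 0])
gdp = -0.081 + w_hidden @ h + w_direct @ x        # Bias_O plus the two sums
print(round(float(gdp), 2))                       # 0.30, as in equation (7)
```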
Equation (6) clearly illustrates how the augmented neural network encompasses
the linear regression model. If the second and third terms of the right-hand
side of this equation are removed, equation (6) becomes a simple linear
combination of the inputs. These two terms, which are nonlinear transformations
of the inputs, help the network to capture nonlinear relationships among the
variables. If the network had not detected any nonlinearities in the data
generating process, the estimated weights for H1 and H2 would
have been zero and the network would have become a standard linear regression
model. In the current example, the non-zero values of the weights for H1
and H2 suggest the presence of some nonlinearities.
Relative forecasting performance.
Is all this computational effort worthwhile? A comparison of the
forecasting accuracy of both methods provides some insight into this question.
Table 1 compares the forecasting performance of this neural network and the
linear regression model for one-quarter ahead forecasts. Three common criteria
were used to compare the two models: the mean absolute error, the mean squared
error and the Theil inequality coefficient. The table suggests that the network
was more accurate than the linear regression model, both in-sample and
out-of-sample. The network reduced the forecasting errors by between 13 and 25
per cent for the in-sample period and by 20 to 40 per cent for out-of-sample
forecasts.
[Table 1: Relative forecasting performance of the neural network and the linear regression model (mean absolute error, mean squared error and Theil inequality coefficient, in-sample and out-of-sample)]
1. The Theil inequality coefficient is bounded between 0 and 1. The value of
the coefficient approaches zero as forecasting accuracy increases.
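The three criteria are straightforward to compute from the actual and forecast series. The sketch below uses one common definition of the Theil inequality coefficient (the U1 form); the exact variant behind Table 1 is not stated, so the formula should be treated as an assumption:

```python
import numpy as np

def mae(actual, forecast):
    return np.mean(np.abs(actual - forecast))

def mse(actual, forecast):
    return np.mean((actual - forecast) ** 2)

def theil_u(actual, forecast):
    # Theil's U1: bounded between 0 and 1, with 0 indicating a perfect forecast.
    num = np.sqrt(np.mean((actual - forecast) ** 2))
    den = np.sqrt(np.mean(forecast ** 2)) + np.sqrt(np.mean(actual ** 2))
    return num / den
```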
From a theoretical point of view, this result is not too
surprising. Given that an augmented neural network encompasses the linear
regression model, the network should not, in theory, perform worse than a
linear model. Two factors could nonetheless have led the network to perform
worse than the linear model. First, the optimizing algorithm could have
remained trapped in a local minimum that was far from the global minimum. To
avoid this problem, the network weights were re-estimated using 800 sets of
random starting values. The network with the greatest accuracy in the
validation set was retained. Second, the network could have overfit the data
despite the use of the early stopping procedure. When the number of
observations in the training and validation sets is small, overfitting might
still occur when using the early stopping procedure. This does not seem to have
occurred in the current example, given the relatively good out-of-sample
forecasting performance of the network.
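In code, the multi-start strategy simply wraps the early stopping routine sketched earlier in a loop over random initializations, keeping the weights with the lowest validation MAE. The initialization scale below is an arbitrary assumption, and `net.weights` is assumed to be a list of weight arrays:

```python
import copy
import numpy as np

def multi_start(net, grad_step, train, valid, n_starts=800, seed=0):
    """Re-estimate the network from many random starting values and keep the
    run with the lowest validation MAE, to reduce the risk of stopping in a
    poor local minimum."""
    rng = np.random.default_rng(seed)
    best_mae, best_weights = np.inf, None
    for _ in range(n_starts):
        # Random restart: redraw every weight array (scale 0.5 is an assumption).
        net.weights = [rng.normal(0.0, 0.5, w.shape) for w in net.weights]
        _, mae = train_with_early_stopping(net, grad_step, train, valid)
        if mae < best_mae:
            best_mae, best_weights = mae, copy.deepcopy(net.weights)
    net.weights = best_weights
    return best_mae
```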
Four statistical tests were used to assess whether the
improvement in forecasting accuracy was statistically significant: Wilcoxon's
signed-rank test, a non-parametric sign test, the Diebold and Mariano (1995)
test and the Ashley, Granger and Schmalensee (1980) test. In each of these
tests, the null hypothesis postulates that the forecasting accuracy of both
models is the same. All tests were two-tailed and were applied to the mean
squared error and the mean absolute error. The results, which are summarized in
Table 2, suggest that there is only limited evidence that the improvement in
forecasting accuracy was statistically significant. At a 90-per-cent level of
confidence, the null hypothesis could only be rejected in four or five of the
fourteen possibilities presented in the table. Although the point estimates in
Table 1 suggest that the neural network outperformed the linear model, the
standard deviations of the test statistics were too large to conclude that the
improvement was statistically significant.
[Table 2: Statistical tests of the difference in forecasting accuracy between the neural network and the linear regression model]
1. The significance level (p-value) provides the probability of
observing a given difference in forecasting accuracy between the two models, if
the null hypothesis is true (i.e. both models have the same accuracy).
2. The Diebold and Mariano test might not be accurate for the in-sample mean
squared error because this test requires that the difference in the MSE between
both models have a normal distribution. A Jarque-Bera test on the in-sample MSE
differential led us to strongly reject a normal distribution.
3. The Ashley, Granger and Schmalensee test only applies to the mean squared
error.
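As an illustration, the simplest one-step-ahead version of the Diebold and Mariano (1995) statistic can be written as follows. For h = 1 forecasts the loss differential requires no autocorrelation correction and the statistic is compared with the standard normal distribution; this textbook form is not necessarily the exact procedure behind Table 2:

```python
import numpy as np
from scipy import stats

def diebold_mariano(e1, e2, loss="mse"):
    """Two-sided DM test of equal forecast accuracy (one-step-ahead case).

    e1, e2 : arrays of forecast errors from the two competing models
    """
    if loss == "mse":
        d = e1 ** 2 - e2 ** 2        # squared-error loss differential
    else:
        d = np.abs(e1) - np.abs(e2)  # absolute-error loss differential
    dm = d.mean() / np.sqrt(d.var(ddof=1) / len(d))
    p_value = 2 * (1 - stats.norm.cdf(abs(dm)))
    return dm, p_value
```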
Additionally, forecast encompassing tests (cf. Chong and Hendry,
1986) were conducted for both the in-sample and out-of-sample periods. The
results did not allow us to reject the null that neither model encompassed the
other. Hence, based on all our tests, we cannot conclude that the improvement
in forecasting accuracy is statistically significant.
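One common regression form of a forecast encompassing test estimates the weight λ in y_t = λ•f1_t + (1 - λ)•f2_t by regressing (y - f2) on (f1 - f2); λ near zero suggests the second model encompasses the first, and λ near one the reverse. The sketch below is a generic variant and not necessarily the exact specification of Chong and Hendry (1986):

```python
import numpy as np

def encompassing_lambda(y, f1, f2):
    """Estimate lambda in y = lambda*f1 + (1 - lambda)*f2 + error by
    regressing (y - f2) on (f1 - f2), without an intercept."""
    x = f1 - f2
    lam = float(x @ (y - f2)) / float(x @ x)
    resid = (y - f2) - lam * x
    se = np.sqrt(resid @ resid / (len(y) - 1) / (x @ x))  # conventional OLS s.e.
    return lam, se
```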
The neural network developed in this section should be viewed as
a complement to the linear regression model, rather than a substitute, because
it used the explanatory variables of the linear model as a starting point.
Although this may seem like a trivial point, it is in fact very important. The
computational effort required to design a neural network makes it virtually
impossible to build the model from scratch, without the help of a linear
regression model. As explained in Section 3.2 (cf. Figure 8), the design of a
neural network is a lengthy process of trial and error. For a given set of
explanatory variables and a given network architecture, a neural network must
be re-estimated hundreds or thousands of times with different sets of starting
values in order to avoid a local minimum (in the current example, a 350 MHz
Pentium took about 25 hours to re-estimate the network with 800 sets of
starting values). This entire process of re-estimation must be repeated for each
different network architecture under consideration before a conclusion can be
made as to the out-of-sample forecasting accuracy associated with a given set
of explanatory variables. Thus, each time a change is made to the set of
explanatory variables, the network must be re-estimated under several different
architectures, each one requiring several hundreds or thousands of starting
values. This process would be much too long if it were followed to the letter.