Neural Network Models

Neural Network Structure

Neural network models are considered “black-box” models because they are complex, and the relationships between the predictor variables and the target variable are hard to decipher. However, if the focus is on prediction, these models have the potential to capture very complicated patterns in data sets with either continuous or categorical targets.

The concept of neural networks was well received back in the 1980s. However, it didn’t live up to expectations. Support vector machines (SVMs) overtook neural networks in the early 2000s as the popular “black-box” model. Recently there has been renewed growth of neural network models in image and visual recognition problems. There is now a lot of research in the area of neural networks and “deep learning” - feedforward, convolutional, recurrent, and other architectures.

Neural networks were originally proposed as a structure to mimic the human brain. We have since found out that the human brain is much more complex. However, the terminology is still the same. Neural networks are organized as a network of neurons (or nodes) arranged in layers. The input variables are the neurons on the bottom layer. The output variable is the neuron on the top layer. The layers in between, called hidden layers, transform the input variables through non-linear methods to try to best model the output variable.

Single Hidden Layer Neural Network

All of the nonlinearities and complication of the variables get added to the model in the hidden layer. Each line in the above figure is a weight that connects one layer to the next and needs to be optimized. For example, the first variable \(x_1\) is connected to all of the neurons (nodes) in the hidden layer with a separate weight.

Let’s look in more detail at what is happening inside the first neuron of the hidden layer.

Each of the variables is weighted coming into the neuron, and the weighted variables are combined in a linear combination. We then apply a non-linear transformation to this linear combination.

There are many different nonlinear functions this could be; the main goal is to add complexity to the model. Each of the hidden nodes applies different weights to each of the input variables. This means that certain nonlinear relationships are highlighted for certain variables more than others. This is why we can have many neurons in the hidden layer, so that many nonlinear relationships can be built.

From there, each of the hidden layer neurons passes this nonlinear transformation to the next layer. If that next layer is another hidden layer, then the nonlinear transformations from each neuron in the first hidden layer are combined linearly in a weighted combination and another nonlinear transformation is applied to them. If the output layer is the next layer, then these nonlinear transformations are combined linearly in a weighted combination for a last time.
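The computation described above can be sketched in a few lines of code. The specific weights and the choice of tanh as the nonlinear transformation are illustrative assumptions - the text does not commit to a particular nonlinear function:

```python
import math

def forward(x, hidden_weights, output_weights):
    """One forward pass through a single-hidden-layer network.

    hidden_weights[j] holds the weights connecting every input variable
    to hidden neuron j; output_weights holds one weight per hidden neuron.
    tanh is just one common choice of nonlinear transformation.
    """
    hidden = []
    for w in hidden_weights:
        linear = sum(wi * xi for wi, xi in zip(w, x))  # weighted linear combination
        hidden.append(math.tanh(linear))               # nonlinear transformation
    # output layer: one last weighted (linear) combination
    return sum(wo * h for wo, h in zip(output_weights, hidden))

# Two inputs, two hidden neurons, one output (all weights hypothetical)
prediction = forward([0.5, -1.0], [[0.8, 0.2], [-0.4, 0.6]], [1.0, 0.5])
```

Stacking another hidden layer would simply repeat the same weighted-combination-plus-transformation step on the `hidden` values before the final output combination.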

Now we have the final prediction from our model. All of the weights that we have collected along the way are optimized to minimize the sum of squared error. This optimization is done through a process called backpropagation.

Backpropagation

Backpropagation is the process that is used to optimize the coefficients (weights) in the neural network. There are two main phases to backpropagation - a forward and backward phase.

In the forward phase we have the following steps:

  1. Start with some initial weights (often random)
  2. Calculations pass through the network
  3. The predicted value is computed

In the backward phase we have the following steps:

  1. Predicted value compared with actual value
  2. Work backward through the network to adjust weights to make the prediction better
  3. Repeat forward and backward until process converges

Let’s look at a basic example with 3 neurons in the input layer, 2 neurons in the hidden layer, and one neuron in the output layer.

Imagine 3 variables that take the values of 3, 4, and 5 with the corresponding weights assigned to each line in the graph above. For the top neuron in the hidden layer you have \(3\times 1 + 4 \times 0 + 5 \times 1 = 8\). The same process can be taken to get the bottom neuron. The hidden layer node values are then combined (with no nonlinear transformation here) to get the output layer.
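The arithmetic in this small example can be checked directly. The top-neuron weights (1, 0, 1) come from the text; the bottom-neuron and output-layer weights below are hypothetical values chosen so that the prediction works out to the 28 discussed next:

```python
x = [3, 4, 5]

w_top = [1, 0, 1]     # weights stated in the text
w_bottom = [1, 1, 1]  # hypothetical: not stated in the text
w_out = [2, 1]        # hypothetical output-layer weights

top = sum(wi * xi for wi, xi in zip(w_top, x))        # 3*1 + 4*0 + 5*1 = 8
bottom = sum(wi * xi for wi, xi in zip(w_bottom, x))  # 3 + 4 + 5 = 12
prediction = w_out[0] * top + w_out[1] * bottom       # 2*8 + 1*12 = 28
```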

For the backward phase of backpropagation, let’s imagine the true value of the target variable for this observation was 34. That means we have an error of 6 (\(34-28=6\)). We will now work our way back through the network changing the weights to make this prediction better. To see this process, let’s use an even simpler example.

Imagine you have a very simple equation, \(y = \beta_1 x\). Now let’s imagine you know that \(y = 20\) and \(x = 5\). However, you forgot how to do division! So you need backpropagation to find the value of \(\beta_1\). To start with the forward phase, let’s just randomly guess 3 for \(\beta_1\) - our random starting point. Going through the network, we will use this guess of \(\beta_1\) to get our final prediction of 15 (\(=3 \times 5\)). Since we know the true value of y is 20, we start with an error of \(20-15 = 5\). Now we look at the backward phase of backpropagation. First we need the derivative of our loss function (sum of squared error). Without going through all the calculus details here, the derivative of the squared error at a single point is 2 multiplied by the error itself.
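The calculus step being glossed over works out as follows. For a single observation with prediction \(\hat{y}\), the squared error is \((y - \hat{y})^2\), and its derivative with respect to the prediction is

\[ \frac{d}{d\hat{y}}\,(y - \hat{y})^2 = -2\,(y - \hat{y}) \]

so the magnitude of the slope is 2 multiplied by the error \(y - \hat{y}\), and reducing the loss means nudging the prediction in the direction that shrinks that error.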

The next step in backpropagation is to adjust our original value of \(\beta_1\) to account for this error and get a better estimate of \(\beta_1\). To do this we multiply the slope of the error curve by the learning rate (set at some small constant like 0.05 to start) and then adjust the value of \(\beta_1\).

Based on the figure above, our \(\beta_1\) was 3, but has been adjusted to 3.5 based on the learning rate and slope of the loss function. Now we repeat the process and go forward through the network. This makes our prediction 17.5 instead of 15. This reduces our error from 5 to 2.5. The process goes backwards again to adjust the value of \(\beta_1\) again.

We will still multiply the slope of the loss function (with our new error) by the learning rate. This learning rate should be adjusted to be smaller (from 0.05 to 0.04 above) to account for us being closer to the real answer. We will not detail how the backpropagation algorithm adjusts this learning rate here. However, this process will continue until some notion of convergence. In this example, it would continue until the estimate of \(\beta_1\) (the slope of our line) reaches 4 and the error would be 0 (its minimum).
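The whole iteration can be sketched as a short loop. For simplicity this sketch keeps the learning rate fixed at 0.05 rather than shrinking it as described above; the process still converges to \(\beta_1 = 4\):

```python
x, y = 5, 20
beta = 3.0   # random starting guess for beta_1
lr = 0.05    # learning rate (kept constant here for simplicity;
             # the text shrinks it as the error shrinks)

history = []
for _ in range(50):
    pred = beta * x           # forward phase: 3 * 5 = 15 on the first pass
    error = y - pred          # 20 - 15 = 5 on the first pass
    slope = 2 * error         # derivative of squared error (magnitude)
    beta = beta + lr * slope  # backward phase: adjust the weight
    history.append(round(beta, 4))
```

The first update moves \(\beta_1\) from 3 to 3.5, matching the figure, and the error halves on every pass until \(\beta_1\) is essentially 4.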

Although simple in concept, in practice this is much more complicated. To start, we have many more than just one single observation, so we have to calculate the error across all of the observations to evaluate our loss function (sum of squared error in our example). Changing a weight then impacts all of the observations, not just a single one. Next, we have more than one variable and weight to optimize at each step, making the process all the more complicated. Finally, this gets even more complicated as we add many neurons in hidden layers, so the algorithm has many layers to step backward through in its attempt to optimize the solution. These hidden layers also have complicated nonlinear functions, which make the derivative calculations much more complicated than in the simple example above. Luckily, this is what the computer does for us!

Tuning a Neural Network

When it comes to tuning machine learning models like neural networks, there are many options. Cross-validation should be included in each of these approaches. Two of the most common tuning approaches are:

  • Grid search

  • Bayesian optimization

Grid search algorithms look at a fixed grid of values for each parameter you can tune - called hyperparameters. Every possible combination of values is evaluated, no matter whether it is good or bad, because the algorithm does not learn from previous results. This approach is best suited to small samples with a limited number of variables, as it can be time consuming.

Bayesian optimization, on the other hand, is typically less computationally expensive than a grid search approach. The Bayesian optimization approach starts with random values for each hyperparameter you need to tune. It then fits a probabilistic model (a type of sequential model-based optimization) that “learns” the relationship between the hyperparameters and performance. This process tries to estimate which hyperparameters are likely to produce good results and points the search in the “correct” direction for values of hyperparameters. Each step in the optimization essentially “learns” from the previous combinations of hyperparameters. This approach is much more valuable for large samples with large numbers of variables.
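A minimal sketch of the grid search idea, with a hypothetical scoring function standing in for cross-validated model performance (in practice this function would train and validate the network):

```python
from itertools import product

# Hypothetical search space for two neural-network hyperparameters.
grid = {
    "hidden_neurons": [2, 4, 8],
    "learning_rate": [0.01, 0.05, 0.1],
}

def cv_score(params):
    """Stand-in for cross-validated error; lower is better. A real
    version would fit the network with these hyperparameters."""
    return (params["hidden_neurons"] - 4) ** 2 + params["learning_rate"]

# Grid search: evaluate every combination, keeping no memory of past results.
names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]
best = min(combos, key=cv_score)
```

A Bayesian optimizer would instead propose each new combination based on a probabilistic model fit to the scores seen so far, rather than exhaustively sweeping all nine combinations.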

Let’s see how to build a neural network in each software!

Variable Selection

Neural networks typically do not require variable selection. All variables are used by default in a complicated and mixed way. However, if you want to do variable selection, you can examine the weights for each variable. If all of the weights for a single variable are low, then you might consider deleting that variable, but again, this is typically not required.
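A minimal sketch of this weight-inspection idea, using hypothetical first-layer weights:

```python
# Hypothetical first-layer weights: rows are hidden neurons,
# columns are the input variables x1, x2, x3.
weights = [
    [0.9, 0.02, -1.1],
    [-0.7, 0.01, 0.8],
]

# Aggregate the absolute weight each variable carries into the hidden layer.
importance = [
    sum(abs(row[j]) for row in weights)
    for j in range(len(weights[0]))
]
# x2's weights are uniformly small, so it is a candidate for removal.
```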

One way to visualize all the weights in a variable would be to use a Hinton diagram. This diagram is really only good for smaller numbers of variables. With hundreds of variables, a Hinton diagram becomes burdensome to view.

From the diagram above we see there are few instances of variables having low weights across all of the inputs to the hidden layers. The only ones we see are specific categories in a larger categorical variable. In this scenario, we would probably keep all of our variables.

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data - plugging the new data into the equation and recording the output. This means that our data must be in the same format as the data put into the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be taken with the new data you are about to score.
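A minimal sketch of this principle, using a hypothetical imputation step whose value is learned on the training data and then reused, unchanged, at scoring time:

```python
# Training data (hypothetical): one numeric variable with a missing value.
train = [{"income": 50.0}, {"income": 70.0}, {"income": None}]

# Imputation value learned from the training data only.
observed = [row["income"] for row in train if row["income"] is not None]
impute_value = sum(observed) / len(observed)  # training mean: 60.0

def prepare(row):
    """Apply the training-time preparation to any new row before scoring."""
    income = row["income"] if row["income"] is not None else impute_value
    return {"income": income}

# A new observation to score gets the training-derived imputation,
# never a value recomputed from the scoring data itself.
new_row = prepare({"income": None})
```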

For this problem we will score our test dataset that we have previously set aside as we were building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. This would no longer make the test dataset an honest assessment of how good your model is actually doing. That also means that we should NOT just build hundreds of iterations of our same model to compare to the test dataset. That would essentially be doing the same thing as going back and rebuilding the model as you would be letting the test dataset decide on your final model.

Let’s score our data in each software!

Summary

In summary, neural network models are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using neural network models:

  • Used for both categorical and numeric target variables

  • Capable of modeling complex, nonlinear patterns

  • No assumptions about the data distributions

There are some disadvantages though:

  • No insights for variable importance

  • Extremely computationally intensive (very slow to train)

  • Tuning of parameters is burdensome

  • Prone to overfitting training data