## Introduction

The advantage of neural networks over other methods comes from their non-linearity, which is introduced by applying non-linear activation functions to the linear combinations of inputs and weights. The activation functions that we will be using here are Sigmoid and ReLU.

Let $z$ be the output of a neuron after a linear combination of its input neurons and weights, and let $f$ be a mapping function where $f$ is one of the activation functions mentioned above. $f$ can be:

##### Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

##### ReLU

$$f(z) = \max(0, z)$$
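The two activation functions above can be written as a couple of lines of numpy (a minimal sketch; the function names are my own):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU activation: passes positive values through, zeroes out negatives."""
    return np.maximum(0.0, z)
```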

We will be working with neural networks of 1 and 2 hidden layers.

## Implementation

The following is my implementation of the networks mentioned above. We'll be using Stochastic Gradient Descent (SGD) to update the weights. Theoretically, the input data has to be shuffled for SGD to work, but in my case it doesn't work if it's shuffled (I need some clarification and help here). Since the input data fed into the network is nicely ordered (data of the same class are grouped together), it's easy for the error to fall into a local minimum.

As before, we'll still be using the same dataset (Database of Faces) provided by AT&T Laboratories. We'll also reduce the dimension of the images to 2 using PCA to better illustrate how the boundaries are drawn. After PCA, I usually whiten my data (divide each component by the square root of its eigenvalue) before feeding it into the network. (I once forgot the square root and my network couldn't be trained successfully; I still have no idea why such a minor difference would cause training to fail.) After various attempts, what surprises me is that not performing whitening (sometimes) produces a better result. Observing the output of the activation function of the first layer, not whitening does indeed produce abrupt changes after the softmax for the first layer. I also noticed that not performing whitening causes the ReLU values of the 2-layer network to overflow.
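The PCA-plus-whitening step described above can be sketched as follows (a minimal example with made-up data; the function name and `eps` guard are my own, not the code actually used):

```python
import numpy as np

def pca_whiten(X, n_components=2, eps=1e-8):
    """Project data onto the top principal components, then divide each
    component by the square root of its eigenvalue (whitening)."""
    X = X - X.mean(axis=0)                       # centre the data
    cov = np.cov(X, rowvar=False)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    proj = X @ eigvecs[:, order]                 # plain PCA projection
    return proj / np.sqrt(eigvals[order] + eps)  # whitening: unit variance per axis

# toy data with very different variances per dimension
X = np.random.RandomState(0).randn(100, 5) * [1.0, 2.0, 3.0, 4.0, 5.0]
Z = pca_whiten(X)
print(Z.shape)        # (100, 2)
print(Z.var(axis=0))  # each retained component now has (roughly) unit variance
```

Forgetting the square root would divide by the eigenvalue itself, shrinking the dominant components far too aggressively, which may be related to the training failure mentioned above.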

### Stochastic Gradient Descent (SGD)

SGD works by feeding data into the network one by one and updating the weights for every sample. Compared to the other methods (mini-batch, batch), SGD is the fastest to converge if tuned correctly. The downside is that it might not converge at all: if the changes are too abrupt, it can jump away from a local minimum whenever a single highly influential sample is fed in.
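The one-sample-at-a-time update loop can be sketched as follows (the gradient function and the toy 1-D model are illustrative, not the actual network):

```python
import numpy as np

def sgd(w, X, y, grad_fn, lr=0.1, epochs=50):
    """Update the weights once per training sample."""
    for _ in range(epochs):
        for xi, yi in zip(X, y):          # one sample at a time
            w = w - lr * grad_fn(w, xi, yi)
    return w

# toy example: fit y = w * x by least squares, true w is 2.0
grad = lambda w, x, y: 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X
w = sgd(0.0, X, y, grad, lr=0.05, epochs=50)
print(round(w, 3))  # close to 2.0
```

Note that the samples here are visited in a fixed order, matching the unshuffled setup described above; a standard implementation would reshuffle `X` each epoch.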

#### 1-Hidden Layer Network

For the implementation of a network with a single hidden layer, the number of hidden neurons I tried that worked is in the range of 4~50; it seems that the likelihood of overfitting is really low here. The sigmoid activation function is used.
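The forward pass of this network can be sketched as follows (a minimal example with a sigmoid hidden layer and a softmax output; the layer sizes, initialisation scale, and names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """2-D input -> hidden layer (sigmoid) -> class probabilities (softmax)."""
    h = sigmoid(W1 @ x + b1)         # hidden activations
    return softmax(W2 @ h + b2)      # output probabilities

rng = np.random.RandomState(0)
n_in, n_hidden, n_classes = 2, 10, 3          # e.g. 10 hidden neurons (within 4~50)
W1, b1 = rng.randn(n_hidden, n_in) * 0.1, np.zeros(n_hidden)
W2, b2 = rng.randn(n_classes, n_hidden) * 0.1, np.zeros(n_classes)

p = forward(np.array([0.5, -1.0]), W1, b1, W2, b2)
print(p.sum())  # probabilities sum to 1
```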

The training error and validation error are shown in the graph below:

It can be observed that the error converges to a local minimum and stays there once it is found.

The decision boundary for this method is shown below:

Observe the non-linearity of the decision boundary, which is different from what we observed before for the linear classification methods.

The accuracy obtained here is 86.67%.

#### 2-Hidden Layer Network

The implementation of the 2-hidden-layer network using ReLU as the activation function is a little tricky. The input values have to be whitened before being fed into the network. This is not as important when the sigmoid activation is used, because large values end up close to 1 after passing through a sigmoid. With ReLU the value is not clipped and grows linearly, so the only way to deal with this problem is to whiten the data (divide by the square root of the eigenvalues after PCA).
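A small numeric illustration of why unscaled inputs are a problem for ReLU but not for sigmoid (toy magnitudes, not values from the actual dataset):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(0.0, z)

x_raw = 1e4                       # an un-whitened PCA component can be this large
x_white = x_raw / np.sqrt(1e8)    # divide by the square root of the eigenvalue

print(sigmoid(x_raw))   # 1.0 -- the sigmoid saturates, clipping large values
print(relu(x_raw))      # 10000.0 -- ReLU lets the value grow unbounded
print(relu(x_white))    # 1.0 -- after whitening the value stays in a sane range
```

Stacking two ReLU layers compounds the growth, which is consistent with the overflow mentioned above.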

The error function for this method is shown below:

As we can observe, the validation error did fall to a very low value but bounced back up. This is due to the nature of SGD, where it is easy to bounce away from a local minimum. To deal with this problem, I always store the weights that produce the smallest validation error.
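The keep-the-best-weights trick can be sketched as follows (the training-step and validation-error functions are placeholders; the scalar toy example below just demonstrates the checkpointing):

```python
import copy

def train_with_checkpoint(weights, train_step, val_error, epochs=10):
    """Run training but remember the weights with the lowest validation
    error, since later SGD updates can bounce away from a good minimum."""
    best_err = float("inf")
    best_weights = copy.deepcopy(weights)
    for _ in range(epochs):
        weights = train_step(weights)
        err = val_error(weights)
        if err < best_err:                        # new best: keep a deep copy
            best_err = err
            best_weights = copy.deepcopy(weights)
    return best_weights, best_err

# toy example: a scalar "weight" whose updates step past the optimum at w = 1
step = lambda w: w - 0.3
err = lambda w: (w - 1.0) ** 2
best_w, best_e = train_with_checkpoint(3.0, step, err, epochs=10)
print(round(best_w, 3))  # 0.9 -- the closest the trajectory got to the optimum
```

The deep copy matters when the weights are mutable arrays: storing a reference instead of a copy would let later updates overwrite the "best" snapshot.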

The decision boundary of this method is shown below:

As expected, we can observe that the 2-layer network tends to produce a more non-linear boundary than the 1-layer network.

The accuracy obtained for this method is 80%, which is a little below the 1-layer network. From here, we can see that the number of neurons is important.

### Mini-Batch Gradient Descent

This method falls between SGD and batch gradient descent. If an epoch consists of all the training data, mini-batch cuts the data into groups that are fed into the network, and the weights are updated once per group. All the steps used here are the same as in SGD except the gradient descent step, where the gradient of the error is averaged over a single batch. This method is comparatively more stable than SGD because it takes many data points into account before making a single update.
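The per-batch averaging described above can be sketched as follows (illustrative names, same toy 1-D least-squares problem as in the SGD sketch):

```python
import numpy as np

def minibatch_gd(w, X, y, grad_fn, lr=0.02, batch_size=2, epochs=200):
    """Update the weights once per batch, using the average gradient."""
    n = len(X)
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # average the per-sample gradients over the batch
            g = np.mean([grad_fn(w, xi, yi) for xi, yi in zip(xb, yb)], axis=0)
            w = w - lr * g            # one update per group of samples
    return w

# toy example: fit y = w * x, true w is 2.0
grad = lambda w, x, y: 2 * (w * x - y) * x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X
w = minibatch_gd(0.0, X, y, grad)
print(round(w, 3))  # close to 2.0
```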

#### 1-Hidden Layer Network

The implementation here is the same as before; the only difference is the one mentioned above, so I'll dive straight into showing the plots.

The error function of this method is shown below:

It can be observed that both errors fall steadily. The error is still falling and would produce a better result if I continued training, but as long as the idea is clear, I don't see any point in doing so.

The decision boundary is shown below:

It can be observed that the decision boundary here is similar to the 1-layer implementation with SGD. This is because nothing much more can be learned from 2D data that is clustered so clearly.

The accuracy obtained here is 84%.

#### 2-Hidden Layer Network

The error function is shown below:

It can be observed that it converges nicely.

The decision boundary is shown below:

It can be observed that lots of blue points are misclassified by this decision boundary, which is an indicator that a bad local minimum was found. The accuracy obtained here is 74.33%, which supports our deduction above.
