# Linear Models for Classification

## Introduction

There are three major approaches to classification:

1. Discriminant Function
2. Probabilistic Generative Model
3. Probabilistic Discriminative Model

The first approach is the brute-force method used by neural networks. It consists of least-squares classification, Fisher’s linear discriminant, and the perceptron algorithm.

Our focus will be on the following two models, which take a probabilistic viewpoint.

## Probabilistic Generative Model

The main goal of this model is to model the density function entirely from the given training data points. By entirely, we mean $p(C_k, x)$, the joint PDF of class $k$ and data point $x$. For the binary case, Bayes’ theorem gives $p(C_1 | x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}$. This can be written as $\frac{1}{1 + e^{-a}}$, known as the sigmoid function, where $a = \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$. For multi-class problems, it becomes $p(C_k | x) = \frac{p(x|C_k)p(C_k)}{\sum_j p(x|C_j)p(C_j)}$, which is known as the softmax function.
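As a quick numerical sketch, the two posterior forms above can be written as small helper functions (a sketch in Python; the names `sigmoid` and `softmax` are our own):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps a real activation a to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over a vector of activations a_k = ln p(x|C_k)p(C_k).
    Subtracting the max before exponentiating keeps the computation stable."""
    e = np.exp(a - np.max(a))
    return e / e.sum()
```

For two classes, `softmax` on the activations reduces to `sigmoid` applied to their difference, which is exactly the relation between the two formulas above.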

As always, we’ll try to solve for the optimal parameters of the posterior probability $p(C_1 | x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}$. We’ll model each class-conditional density as a Gaussian with a shared covariance matrix: $p(x | C_k) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}e^{-\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x-\mu_k)}$

For the multi-class case, we are trying to solve for:

$p(C_k|x) = \frac{e^{a_k}}{\sum_j e^{a_j}}, \quad a_k = w_k^T x + w_{k_0}$

Where by solving for the maximum likelihood for a Gaussian Distribution, we obtain:

$w_k = \Sigma^{-1}\mu_k$

$w_{k_0} = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \ln p(C_k)$

$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N}t_{nk} x_n$, where $t_{nk}$ is 1 when $x_n$ belongs to class $k$ and 0 otherwise

$p(C_k) = \pi_k = \frac{N_k}{N}$

For the maximum likelihood solution for a Gaussian Distribution:

$\Sigma = S$

$S = \frac{N_1}{N}S_1 + \frac{N_2}{N}S_2 + ... + \frac{N_k}{N}S_k$

where

$S_k = \frac{1}{N_k}\sum_{n\in C_k}(x_n - \mu_k)(x_n - \mu_k)^T$
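Putting the maximum-likelihood formulas above together, a minimal sketch of fitting the shared-covariance generative model could look like this (function names are our own; labels are assumed to be integers $0,\dots,K-1$):

```python
import numpy as np

def fit_generative(X, t, K):
    """Maximum-likelihood parameters of the shared-covariance Gaussian model.
    X: (N, D) data matrix, t: (N,) integer class labels in {0, ..., K-1}."""
    N, D = X.shape
    priors = np.zeros(K)           # pi_k = N_k / N
    means = np.zeros((K, D))       # mu_k
    S = np.zeros((D, D))           # shared covariance: sum_k (N_k / N) S_k
    for k in range(K):
        Xk = X[t == k]
        Nk = len(Xk)
        priors[k] = Nk / float(N)
        means[k] = Xk.mean(axis=0)
        # bias=True gives the 1/N_k normalization used in S_k above
        S += (Nk / float(N)) * np.cov(Xk, rowvar=False, bias=True)
    return priors, means, S

def posterior(x, priors, means, S):
    """Softmax posterior p(C_k | x) with a_k = w_k^T x + w_k0."""
    Sinv = np.linalg.inv(S)
    a = np.array([means[k] @ Sinv @ x
                  - 0.5 * means[k] @ Sinv @ means[k]
                  + np.log(priors[k]) for k in range(len(priors))])
    e = np.exp(a - a.max())
    return e / e.sum()
```

Because every class shares the same $\Sigma$, the quadratic term in $x$ cancels in the activations and the decision boundaries are linear.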

## Probabilistic Discriminative Model

For this model, we try to reach the posterior probability $p(C_1 | x)$ without going through all the steps mentioned above (obtaining the maximum likelihood solution and the prior). Instead, we model the posterior probability directly by finding the optimal $w$. Although this simplifies the procedure, discriminative models usually end up with a better solution than generative models. The main reason is that fewer parameters need to be estimated, so under limited computational resources we can afford a much more non-linear model than the generative approach allows.

Our goal in this model is to optimize for $w$ in the posterior PDF below (logistic regression):

$p(C_1 | \phi) = y(\phi) = \sigma(w^T\phi)$

where $\phi$ is a basis-function transformation of $x$ (used to increase non-linearity)

For a binary classification example, our goal is to minimize the cross-entropy error function given below:

$E(w) = -\ln p(\mathbf{t}|w) = - \sum_{n=1}^{N}\{t_n \ln y_n + (1-t_n)\ln(1-y_n)\}$
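The error function is simple to evaluate directly; a sketch (the name `cross_entropy` is our own, and `eps` is a numerical guard we add, not part of the formula):

```python
import numpy as np

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)].
    eps clips y away from 0 and 1 so the logarithms stay finite."""
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```

At $w = 0$ every $y_n = 0.5$, so the error is $N\ln 2$, a useful sanity check for any implementation.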

The $w$ that we are optimizing controls the “placement” of the sigmoid function, i.e. the decision boundary.

Due to the non-linearity of the logistic sigmoid function, we are unable to solve for $w$ in closed form. There are two approaches to dealing with this problem.

#### Sequential Learning:

$w^{\tau+1} = w^\tau - \eta\nabla E_n$

The term $\eta$ here is the learning rate of the discriminative model, and it is the only user-defined parameter in this update.
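A single sequential update can be sketched as follows (the function name `sgd_step` is our own; $\nabla E_n = (y_n - t_n)\phi_n$ for the cross-entropy error):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(w, phi_n, t_n, eta):
    """One sequential-learning update for logistic regression:
    grad E_n = (y_n - t_n) * phi_n, so w <- w - eta * (y_n - t_n) * phi_n."""
    y_n = sigmoid(w @ phi_n)
    return w - eta * (y_n - t_n) * phi_n
```

Each step nudges $w$ so that the activation $w^T\phi_n$ moves toward the observed target $t_n$.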

#### Newton-Raphson method:

The sequential learning method is actually a simplified version of the Newton-Raphson method, whose update equation is shown below:

$w^{new} = w^{old} - H^{-1}\nabla E(w)$

where $H = \nabla\nabla E(w)$ is the Hessian matrix.

We can see that $\eta$ has been replaced by $H^{-1}$. In practice, model training is usually done with the former method, because computing the Hessian matrix is relatively expensive and slows down training. Sequential learning is also called gradient descent, and it is the most commonly used method for training neural networks. The terms $\eta$ and $H^{-1}$ determine the learning rate of the algorithm, so a larger value means faster progress toward the global minimum. There is a catch, though: a learning rate that is too large might cause the algorithm to oscillate, so it may never reach the global minimum. We’ll be using the Newton-Raphson method for the rest of the implementation, so a detailed elaboration on this method is given below.

Gradient of cross-entropy error function

$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\phi_n = \Phi^T(y-t)$

Hessian Matrix

$H = \nabla\nabla E(w) = \sum_{n=1}^{N}y_n(1-y_n)\phi_n\phi_n^T = \Phi^TR\Phi$

R: diagonal matrix with $R_{nn} = y_n(1-y_n)$

Since $0 < y_n < 1$, the elements of $R$ are positive, so $u^THu > 0$ for an arbitrary vector $u$. The error function is therefore a convex function of $w$ and has a unique minimum.

We can substitute all of the equations above into $w^{new} = w^{old} - H^{-1}\nabla E(w)$ and obtain (after some algebraic manipulation):

$w^{new} = (\Phi^T R \Phi)^{-1}\Phi^T Rz$

where

$z = \Phi w^{old} - R^{-1}(y-t)$
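The update above is known as iterative reweighted least squares (IRLS) and can be sketched in a few lines (a sketch, not the exact implementation used later; since $R$ is diagonal we store only its diagonal entries $r_n = y_n(1-y_n)$):

```python
import numpy as np

def irls(Phi, t, n_iter=10):
    """Newton-Raphson / IRLS for binary logistic regression.
    Phi: (N, M) design matrix, t: (N,) targets in {0, 1}."""
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        r = y * (1.0 - y)                      # diagonal of R
        z = Phi @ w - (y - t) / r              # z = Phi w_old - R^{-1}(y - t)
        # w_new = (Phi^T R Phi)^{-1} Phi^T R z
        w = np.linalg.solve(Phi.T @ (Phi * r[:, None]), Phi.T @ (r * z))
    return w
```

Note that on perfectly separable data the maximum-likelihood $w$ is unbounded, so in practice a few iterations (or some regularization) is advisable.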

For the multi-class case:

$p(C_k|\phi) = y_k(\phi) = \frac{e^{a_k}}{\sum_j e^{a_j}}$

$\frac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)$

where

$a_k = w_k^T \phi$

From that, we’re able to get the cross-entropy error function:

$E(w_1,...,w_K) = - \ln p(T|w_1,...,w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk}$

$\nabla_{w_j}E(w_1,...,w_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\phi_n$

$\nabla_{w_k}\nabla_{w_j}E(w_1,...,w_K) = \sum_{n=1}^{N}y_{nk}(I_{kj} - y_{nj})\phi_n\phi_n^T$

There are a few things we have to pay attention to. One of them is the $R$ inside the Hessian matrix, because we now have multiple classes. We end up with multiple blocks $R_{km}$, where $1\leq k,m \leq K-1$:

When $k=m$:

$R_{kk}=p(C_k|\phi)(1 - p(C_k|\phi))$

When $k \neq m$:

$R_{km}=-p(C_k|\phi)\,p(C_m|\phi)$
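The multi-class gradient $\nabla_{w_j}E$ can be sketched and checked against a finite-difference approximation (names are our own; `T` is the one-hot target matrix $t_{nk}$, and `W` stacks the $w_k$ as columns):

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax: each row of A holds the activations a_k for one sample."""
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def grad_E(W, Phi, T):
    """Gradient of the multi-class cross-entropy w.r.t. each w_j:
    sum_n (y_nj - t_nj) phi_n, stacked into an (M, K) matrix."""
    Y = softmax_rows(Phi @ W)      # (N, K) posteriors y_nk
    return Phi.T @ (Y - T)         # (M, K)
```

Comparing this analytic gradient against a numerical one is a cheap way to validate the derivation before wiring it into Newton-Raphson.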

## Implementation

### Introduction

We will be working with images from the Database of Faces (AT&T Laboratories Cambridge), which are of size 30×30. We’ll apply both the generative and discriminative models to our data.

### Reading image using Python

We will be using Python 2.7 for the development of this algorithm. To work with the given data, we first have to read an image file (.bmp) in Python. For this we need the library Pillow, which enables us to read image files. The next step is converting the read image into a single numpy array.

The following code will convert the given image into a numpy array:

```python
import numpy as np
from PIL import Image

img = Image.open("Data_Train/Class1/faceTrain1_755.bmp")
data = np.asarray(img, dtype="int32")
```

You can also read all images from a directory iteratively using the following method:

```python
import glob
import numpy as np
from PIL import Image

imageFolderPath = 'Data_Train/Class1'
imagePath = glob.glob(imageFolderPath + '/*.bmp')
train_size = 100
im_array_1 = np.array([np.array(Image.open(imagePath[i])) for i in range(train_size)])
```

The glob library is used here to find pathnames matching a specified pattern, following the rules used by the Unix shell.

### Dimension Reduction using PCA

To ease the computational load of processing the images, we have to reduce the dimension of the given dataset. Principal Component Analysis (PCA) is used for this dimensionality reduction.

The implementation of PCA can be referred to here.

PCA must be fitted on the training data only, to ensure that we have a general model. The mean of the training data, the eigenvectors used for projection, and the corresponding eigenvalues will be reused later during testing, so they have to be stored. Here’s a detailed explanation of how PCA is applied when there are both training and test datasets: when a test point comes in, it is first subtracted by the training mean, then it’s projected onto the basis (eigenvectors), and then it’s whitened by dividing by the corresponding eigenvalue.
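A minimal sketch of this fit/transform split (function names are our own; note that the sketch whitens by the square root of each eigenvalue, which is the usual whitening convention and makes each projected component have unit variance):

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on training data only: store the mean, the top eigenvectors,
    and the corresponding eigenvalues for reuse at test time."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)
    vals, vecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    return mean, vecs[:, order], vals[order]

def pca_transform(X, mean, vecs, vals):
    """Centre with the *training* mean, project onto the basis, then whiten."""
    return (X - mean) @ vecs / np.sqrt(vals)
```

Crucially, `pca_transform` is applied to test data using the statistics stored from `pca_fit` on the training set, never statistics recomputed from the test set.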

### Probabilistic Generative Model

By using this model, we are able to model the data completely using Gaussian distributions. An illustration of a scatter plot with a contour plot, and a scatter plot with a surface plot, is shown below:

#### Scatter Plot + Surface Plot

It can be observed that the red (group 2) and green (group 3) data points are closely clustered together, which is a problem when we try to classify them.

We got an accuracy of 81.3%.

### Probabilistic Discriminative Model

For this model, we solve for the weights directly, so we cannot recover the means and variances of the Gaussians that model our data. However, we are still able to plot the decision boundary. There are two ways to implement this model: one is through Newton-Raphson iterative optimization, the other uses a fixed learning rate by replacing the Hessian matrix with a fixed value.

#### Newton-Raphson

It can be observed that the boundaries are linear.

Throughout training with this method, we are able to observe the error function and check for convergence. Since it’s a convex function, it will fall to the global minimum.

The final log probability error we got here is: 964.4353

The accuracy we got here is 85%

#### Fixed Learning Rate

The learning rate we used here is 0.002. Finding a good learning rate is very important for this method: with a bad learning rate, the algorithm might oscillate and never converge. Even when it does converge, it takes much longer than the Newton-Raphson method, because the step size is now fixed rather than adapted by the Hessian matrix. In the end, it converges to the same global minimum as the previous method if the initial conditions are the same.
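The fixed-rate variant is just full-batch gradient descent with the update $w \leftarrow w - \eta\,\Phi^T(y - t)$; a sketch (the name `batch_gd` is our own, using the same $\eta = 0.002$ as above):

```python
import numpy as np

def batch_gd(Phi, t, eta=0.002, n_iter=5000):
    """Full-batch gradient descent for logistic regression with a fixed
    learning rate eta instead of the Hessian-based Newton-Raphson step."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        w -= eta * Phi.T @ (y - t)     # grad E(w) = Phi^T (y - t)
    return w
```

Compared with the IRLS sketch earlier, this needs many more iterations to approach the same minimum, which illustrates the trade-off discussed above.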

### Unbalanced Data

We now want to see how unbalanced training data affects performance. Here, we reduce the number of data points in the 3rd training dataset from 900 to 300 and see how this affects the results.

#### Probabilistic Generative Model

It can be observed that datasets 2 and 3 (red and green) are closely blended together. This of course affects the overall performance of the model. The accuracy we got here is 77.3%, which is lower, as expected.

#### Probabilistic Discriminative Model

The final log-probability error we got here is 640.7852, which is lower than before. This is probably caused by the closely clustered datasets 2 and 3 (red and green), where both classes are assigned a high posterior probability.

The accuracy we got here is 76%, which is expected, since the model is no longer optimal.

## References

http://sites.stat.psu.edu/~jiali/course/stat597e/notes2/logit.pdf