# The Softmax Function and Multinomial Logistic Regression

In this post, we will introduce the softmax function and discuss how it can help us in a logistic regression analysis setting with more than two classes. This is known as multinomial logistic regression and should not be confused with multiple logistic regression which describes a scenario with multiple predictors.

## What is the Softmax Function?

In standard logistic regression, the sigmoid function returns a single probability, and you typically classify with a threshold of 0.5. Observations with a probability below that threshold go into class A; those with a probability above the threshold go into class B. Accordingly, you are limited to a prediction between two classes.

In multinomial logistic regression, we want to classify observations into more than two classes.

To be able to classify between more than two classes, you need a function that returns a probability value for every class, and these probabilities need to sum to one. The softmax function meets these requirements.

**Softmax Function Formula**

softmax(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}

where z is a vector of inputs with length equivalent to the number of classes k.

Let’s do an example with the softmax function by plugging in a vector of numbers to get a better intuition for how it works.

z = [1,3,4,7]

If we want to calculate the probability for the second entry, which is 3, we plug our desired values into the formula

softmax(z_2) = \frac{e^3}{e^1+e^3+e^4+e^7} = 0.017

Applying the softmax function to all values in z gives us the following vector which sums to 1:

softmax(z) = [0.002, 0.017, 0.047, 0.934]

As you see, the last entry has an associated probability of more than 90%. In a classification setting, you would assign your observation to the last class.
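The calculation above can be reproduced with a few lines of NumPy. This is a minimal sketch of the softmax function; subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1, 3, 4, 7])
probs = softmax(z)
print(np.round(probs, 3))  # [0.002 0.017 0.047 0.934]
print(round(probs.sum(), 6))  # 1.0
```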

## Multinomial Logistic Regression

You perform multinomial logistic regression by creating a regression model of the form

z = \beta^T x

and applying the softmax function to it:

\hat y = softmax(\beta^T x)
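As a sketch of what this model looks like in code, assume 3 classes and 2 predictors, with the coefficients collected in a hypothetical 3×2 matrix `beta` (one row per class). In practice `beta` would be learned from data; the values below are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical coefficients: 3 classes, 2 predictors
beta = np.array([[0.5, -1.0],
                 [1.0,  0.2],
                 [-0.3, 0.8]])
x = np.array([2.0, 1.0])  # one observation

z = beta @ x          # one raw score per class
y_hat = softmax(z)    # class probabilities summing to 1
print(y_hat.argmax())  # prints 1: the class with the highest probability
```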

### Multinomial Logistic Regression Loss Function

The loss function in a multinomial logistic regression model takes the general form

Cost(\beta) = -\sum_{j=1}^{k} y_j \log(\hat y_j)

with y being the vector of actual outputs. Since we are dealing with a classification problem, y is a so-called one-hot vector. This means all positions in the vector are 0 except the entry representing the class that the observation falls into, which is 1.

Let’s illustrate this with an example:

Suppose you want to classify fruits into one of three categories and the actual fruit is a banana. For the sake of simplicity, we will only look at one observation. The vector y would look like this:

y = \begin{bmatrix} 0\\ 1\\ 0 \end{bmatrix} = \begin{bmatrix} apple\\ banana\\ orange \end{bmatrix}
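Building such a one-hot vector from a class label is straightforward; here is a small sketch, assuming the class order from the example above.

```python
import numpy as np

classes = ["apple", "banana", "orange"]  # class order from the example
label = "banana"                         # the actual fruit

# One-hot encode: zeros everywhere except at the index of the true class
y = np.zeros(len(classes))
y[classes.index(label)] = 1
print(y)  # [0. 1. 0.]
```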

You have some data that you train your logistic regression model on, and it returns the following prediction vector of probabilities:

\hat y = \begin{bmatrix} 0.2\\ 0.7\\ 0.1 \end{bmatrix}

Now, we plug this into our cost function:

Cost(\beta) = -(0 \times log(0.2) + 1 \times log(0.7) + 0 \times log(0.1))

A very convenient feature of this function is that all terms that do not relate to the actual true class disappear, because their entries in y are 0:

Cost(\beta) = -log(0.7) = 0.36
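This cost calculation is a one-liner in NumPy (note that log here is the natural logarithm):

```python
import numpy as np

y = np.array([0, 1, 0])            # one-hot: the true class is "banana"
y_hat = np.array([0.2, 0.7, 0.1])  # predicted probabilities

# Cross-entropy cost: only the true class contributes
cost = -np.sum(y * np.log(y_hat))
print(round(cost, 2))  # 0.36
```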

This loss rewards confident, correct predictions: the larger the predicted probability \hat y_j associated with the true class, the smaller the cost.

To find the gradient, we take the derivative of the cost with respect to the parameters. With respect to the vector of scores z, the derivative turns out to be remarkably simple:

\frac{\partial Cost(\beta)}{\partial z} = \hat y - y

Applying the chain rule with z = \beta^T x then gives the gradient for the coefficient vector \beta_j of each class j:

\frac{\partial Cost(\beta)}{\partial \beta_j} = (\hat y_j - y_j)\, x

Finally, we can apply gradient descent to iteratively minimize the cost, updating each coefficient vector by its gradient multiplied by a learning rate α:

\beta_j = \beta_j - \alpha \frac{\partial Cost(\beta)}{\partial \beta_j}
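Putting the pieces together, here is a toy sketch of gradient descent on a single observation. It assumes a 3×2 coefficient matrix `beta` (one row per class), a made-up feature vector `x`, and a one-hot label `y`; the per-class gradient (ŷ_j − y_j)·x is computed for all classes at once as an outer product.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([2.0, 1.0])       # one observation with 2 predictors
y = np.array([0.0, 1.0, 0.0])  # one-hot label: the true class is index 1
beta = np.zeros((3, 2))        # 3 classes, 2 predictors
alpha = 0.1                    # learning rate

for _ in range(100):
    y_hat = softmax(beta @ x)
    # Gradient w.r.t. each row beta_j is (y_hat_j - y_j) * x
    grad = np.outer(y_hat - y, x)
    beta -= alpha * grad

print(softmax(beta @ x).argmax())  # prints 1: the model learned the true class
```

A real implementation would average the gradient over a whole training set (or mini-batches) rather than fitting a single observation, but the update rule is the same.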

That’s it, we now know how to perform multiclass classification with logistic regression. Next, we’ll look at an implementation of logistic regression in Python.