Logistic regression is a binary classifier for categorical labels that builds on linear regression: instead of predicting a continuous output, it applies the sigmoid function (specifically, the logistic sigmoid) to the linear output to produce a probability. This transformation maps any real-valued input to a value between 0 and 1, which allows logistic regression to output the probability that a given input belongs to a particular class.

As with linear regression and polynomial regression, logistic regression adjusts its parameters in order to minimize the loss. The parameters are the following:

  • Coefficients for each feature in the input vector
  • Intercept (or bias), which represents a constant offset in the decision boundary.
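
As a minimal sketch of this parametrization (the names sigmoid, predict_proba, coef, and intercept are illustrative, not from the source):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, coef, intercept):
    # Linear score (coefficients + intercept), squashed into a probability
    return sigmoid(np.dot(coef, x) + intercept)

x = np.array([2.0, -1.0])     # input feature vector
coef = np.array([0.5, 1.5])   # one coefficient per feature
intercept = -0.3              # constant offset (bias)
print(predict_proba(x, coef, intercept))  # P(y = 1 | x)
```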

The logistic regression loss is defined, as in linear regression, as the squared error between predictions and labels:

$$L(\theta) = \sum_{i=1}^{N} \left( y_i - \sigma(\theta^T x_i) \right)^2$$

where $\sigma$ is the logistic sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

(Figure: plot of the logistic sigmoid, an S-shaped curve saturating at 0 and 1.)

The sigmoid function is used to squash the output of $\theta^T x$ into a value between $0$ and $1$. This means that the function has a saturation effect, as it maps $\mathbb{R} \to (0, 1)$.

Note that this new loss is non-linear with respect to the parameters, since they appear in the non-linear sigmoid function, and it’s also non-convex.

We want the loss to be convex in order to perform gradient descent, so we can write a new loss function with this property that outputs a high value if the predicted label is wrong, and a low value if it is right:

$$L(\hat{y}, y) = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}$$

where $\hat{y} = \sigma(\theta^T x)$.

This works well because if the prediction $\hat{y}$ is near $1$ and the true label is $y = 1$, then $-\log(\hat{y}) \to 0$. On the other hand, if the true label is $y = 0$, then $-\log(1 - \hat{y}) \to \infty$.
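
As a quick numerical check (values chosen just for illustration): with $\hat{y} = 0.9$, a true label $y = 1$ gives $-\log(0.9) \approx 0.105$, while a true label $y = 0$ gives $-\log(1 - 0.9) \approx 2.303$, so confident wrong predictions are penalized heavily.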

We can rewrite the function on a single line:

$$L(\hat{y}, y) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$

The function is still non-linear with respect to the parameters $\theta$, but it is convex. This is known as the Binary Cross-Entropy loss.

The loss for the entire dataset will be:

$$L(\theta) = -\sum_{i=1}^{N} \left[ y_i \log\left(\sigma(\theta^T x_i)\right) + (1 - y_i) \log\left(1 - \sigma(\theta^T x_i)\right) \right]$$
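
A minimal sketch of this dataset loss in code (assuming the intercept is absorbed into theta via a leading column of ones in X; the names bce_loss and sigmoid are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, X, y):
    # Binary cross-entropy summed over all N samples
    y_hat = sigmoid(X @ theta)
    eps = 1e-12  # numerical guard against log(0)
    return -np.sum(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
```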

Closed form solution for Logistic Regression

As we did for the linear regression case, we want to find a closed form solution by setting the gradient of the loss equal to zero. Since the loss is convex, we will find the global minimum.

Math

Finding $\theta$ such that the loss is minimized: $\theta^* = \arg\min_\theta L(\theta)$.

We set the gradient to zero and we want to solve it for $\theta$:

$$\nabla_\theta L(\theta) = 0$$

Since the gradient operator is linear, the gradient of the summation is just the sum of the gradients of the individual terms, so we can consider just the gradient of a single term.

We can apply the chain rule in order to take the derivative of the composition of three functions (the logarithm, the sigmoid, and the linear function). For the term $\log(\sigma(\theta^T x_i))$:

$$\frac{\partial}{\partial \theta_j} \log\left(\sigma(\theta^T x_i)\right) = \frac{\partial \log(u)}{\partial u} \cdot \frac{\partial \sigma(z)}{\partial z} \cdot \frac{\partial (\theta^T x_i)}{\partial \theta_j}, \quad u = \sigma(z), \; z = \theta^T x_i$$

We solve the first partial derivative:

$$\frac{\partial \log(u)}{\partial u} = \frac{1}{u} = \frac{1}{\sigma(\theta^T x_i)}$$

We solve the second partial derivative:

$$\frac{\partial \sigma(z)}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right)$$

Finally we solve the third partial derivative:

$$\frac{\partial (\theta^T x_i)}{\partial \theta_j} = x_{ij}$$

Plugging everything into the chain rule expression above:

$$\frac{\partial}{\partial \theta_j} \log\left(\sigma(\theta^T x_i)\right) = \frac{1}{\sigma(\theta^T x_i)} \cdot \sigma(\theta^T x_i)\left(1 - \sigma(\theta^T x_i)\right) \cdot x_{ij}$$

And so:

$$\frac{\partial}{\partial \theta_j} \log\left(\sigma(\theta^T x_i)\right) = \left(1 - \sigma(\theta^T x_i)\right) x_{ij}$$

Now this should also be done for the other term of the loss, $(1 - y_i)\log(1 - \sigma(\theta^T x_i))$, and everything should be repeated w.r.t. each $\theta_j$ in order to compute the full gradient.

Combining both terms, the partial derivative which constitutes the gradient is:

$$\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{N} \left( \sigma(\theta^T x_i) - y_i \right) x_{ij}$$

We can see that the system of equations we obtain when we set the gradient to $0$ is not a linear system, since both $\theta$ and $x_i$ are involved in a non-linear function, and so it cannot be easily solved.
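
One way to sanity-check this expression is to compare it with a finite-difference approximation of the loss (a sketch; all names and the toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, X, y):
    y_hat = sigmoid(X @ theta)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(theta, X, y):
    # dL/dtheta_j = sum_i (sigma(theta^T x_i) - y_i) * x_ij
    return X.T @ (sigmoid(X @ theta) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
theta = rng.normal(size=3)

# Finite-difference check of the first coordinate of the gradient
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numeric = (bce_loss(theta + e0, X, y) - bce_loss(theta - e0, X, y)) / (2 * eps)
print(numeric, analytic_grad(theta, X, y)[0])  # the two values should agree closely
```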

Furthermore, the equations are transcendental, because they involve the exponential function inside the sigmoid, whose definition is the sum of an infinite series. Because of that, an analytical solution, and so a closed form solution, doesn't exist.

In order to find the parameters that minimize the loss, we need to use an iterative method like stochastic gradient descent.
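
A minimal stochastic gradient descent loop for this loss might look as follows (a sketch, assuming X includes a bias column; the learning rate, epoch count, and all names are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=100, seed=0):
    # One parameter update per sample, using the single-sample gradient
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad_i = (sigmoid(X[i] @ theta) - y[i]) * X[i]
            theta -= lr * grad_i
    return theta
```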


Multinomial Logistic Regression

If the output labels are not binary, we want to perform multi-class classification, and so we need multinomial logistic regression.

The cost function for more than 2 classes translates to:

$$L(\theta) = -\sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{1}\{y_i = j\} \log P(y_i = j \mid x_i)$$

Where $P(y_i = j \mid x_i)$ can be substituted with the softmax function, and so the final loss function becomes (a code sketch follows the list of symbols below):

$$L(\theta) = -\sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{1}\{y_i = j\} \log \frac{e^{\theta_j^T x_i}}{\sum_{k=1}^{K} e^{\theta_k^T x_i}}$$

Where:

  • $x_i$ is the feature vector
  • $K$ is the number of classes
  • $\theta_j$ is the weight vector for class $j$
  • $\mathbb{1}\{y_i = j\}$ is the indicator function, which returns $1$ if $y_i = j$ (the true class for $x_i$ is $j$), and $0$ otherwise.
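
Here is the sketch referenced above: a direct translation of the multinomial loss, where the indicator simply selects the log-probability of the true class (it assumes y holds integer class indices from 0 to K-1; Theta, softmax, and multinomial_loss are illustrative names):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def multinomial_loss(Theta, X, y):
    # Theta has one weight vector (row) per class
    loss = 0.0
    for x_i, y_i in zip(X, y):
        p = softmax(Theta @ x_i)   # class probability vector for x_i
        loss -= np.log(p[y_i])     # only the true-class term survives the indicator
    return loss
```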

Note

The softmax function is used instead of the sigmoid since the former can squash multiple values into the interval $(0, 1)$ while considering them all jointly, whereas the latter works on each value independently. Using the sigmoid would not return a vector that sums to 1, hence not a probability distribution.
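
A tiny check of this claim (the scores are chosen arbitrarily):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])        # scores for three classes

sig = 1.0 / (1.0 + np.exp(-z))       # element-wise sigmoid
soft = np.exp(z) / np.exp(z).sum()   # softmax over all scores jointly

print(sig.sum())   # about 2.14: not a probability distribution
print(soft.sum())  # 1 (up to floating point): a valid probability distribution
```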


tags: machine-learning