Mathematical Derivation of Binary Cross-Entropy Loss

1. The Cost Function (Binary Cross-Entropy)

For a single training example:

L(y, \hat{y}) = -[y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})]

For m training examples, the cost function is the average:

J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \cdot \log(\hat{y}^{(i)}) + (1-y^{(i)}) \cdot \log(1-\hat{y}^{(i)})]

Where:

  • y \in \{0, 1\} is the true label
  • \hat{y} = \sigma(z) is the predicted probability
  • z = w^T x + b is the linear combination
  • \sigma(z) = \frac{1}{1 + e^{-z}} is the sigmoid function

2. The Forward Pass

z = w^T x + b

\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}
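As a concrete sketch of the forward pass and cost in NumPy (the values of X, w, b, and y are hypothetical toy data, with one example per row of X):

import numpy as np

def sigmoid(z):
    # elementwise logistic function; in practice, clip z to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5, 1.2],
              [-0.3, 0.8]])    # hypothetical inputs, shape (m=2, n=2)
w = np.array([0.1, -0.2])      # hypothetical weights, shape (n,)
b = 0.05                       # hypothetical bias
y = np.array([1.0, 0.0])       # hypothetical true labels

z = X @ w + b                  # linear combination, shape (m,)
y_hat = sigmoid(z)             # predicted probabilities, each in (0, 1)
loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # J(w, b)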

3. Derivative of Sigmoid

First, let’s derive the sigmoid derivative (we’ll need this):

\sigma(z) = \frac{1}{1 + e^{-z}}

\frac{d\sigma}{dz} = \frac{d}{dz}\left[\frac{1}{1 + e^{-z}}\right]

= -(1 + e^{-z})^{-2} \cdot (-e^{-z})

= \frac{e^{-z}}{(1 + e^{-z})^2}

= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}

= \sigma(z) \cdot [1 - \sigma(z)]

Key result:

\boxed{\frac{d\sigma}{dz} = \sigma(z) \cdot (1 - \sigma(z)) = \hat{y} \cdot (1 - \hat{y})}
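A quick numerical sanity check of this identity using a central finite difference (a minimal sketch; the test point z0 is arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z0, eps = 0.7, 1e-6
numeric = (sigmoid(z0 + eps) - sigmoid(z0 - eps)) / (2 * eps)
analytic = sigmoid(z0) * (1 - sigmoid(z0))
assert abs(numeric - analytic) < 1e-9  # both are approximately 0.2217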

4. Derivative of the Loss w.r.t. \hat{y}

Now let’s find \frac{\partial L}{\partial \hat{y}} for a single example:

L = -[y \cdot \log(\hat{y}) + (1-y) \cdot \log(1-\hat{y})]

\frac{\partial L}{\partial \hat{y}} = -\left[y \cdot \frac{1}{\hat{y}} + (1-y) \cdot \frac{1}{1-\hat{y}} \cdot (-1)\right]

= -\left[\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right]

= -\frac{y(1-\hat{y}) - \hat{y}(1-y)}{\hat{y}(1-\hat{y})}

= -\frac{y - y\hat{y} - \hat{y} + y\hat{y}}{\hat{y}(1-\hat{y})}

= -\frac{y - \hat{y}}{\hat{y}(1-\hat{y})}

= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}

Key result:

\boxed{\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}}
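As a sanity check: for y = 1 the loss reduces to L = -\log(\hat{y}), whose derivative is -1/\hat{y}; the boxed formula gives (\hat{y} - 1)/(\hat{y}(1-\hat{y})) = -1/\hat{y}, matching. The y = 0 case reduces to 1/(1-\hat{y}) the same way.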

5. Chain Rule: \frac{\partial L}{\partial z}

Now apply the chain rule:

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}

= \frac{\hat{y} - y}{\hat{y}(1-\hat{y})} \cdot \hat{y}(1-\hat{y})

= \hat{y} - y

This is the beautiful simplification! The \hat{y}(1-\hat{y}) terms cancel out perfectly.

Key result:

\boxed{\frac{\partial L}{\partial z} = \hat{y} - y}
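For a concrete feel: with y = 1 and z = 0, we get \hat{y} = \sigma(0) = 0.5 and \partial L/\partial z = 0.5 - 1 = -0.5, so gradient descent increases z, pushing \hat{y} toward the true label.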

6. Gradient w.r.t. Weights

Using the chain rule again:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w}

Since z = w^T x + b, we have \frac{\partial z}{\partial w} = x.

\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x

For m examples:

\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}

With the examples stacked as rows of X (so X is m \times n), this sum is exactly the matrix product:

= \frac{1}{m} X^T (\hat{y} - y)

Result:

\boxed{dw = \frac{1}{m} X^T (\hat{y} - y)}

7. Gradient w.r.t. Bias

Similarly:

\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b}

= (\hat{y} - y) \cdot 1

= \hat{y} - y

For m examples:

\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})

Result:

\boxed{db = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})}
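Both boxed results in one place, as a minimal NumPy sketch (shapes assume X is (m, n) and y, y_hat are (m,)):

import numpy as np

def compute_gradients(X, y, y_hat):
    # dw = (1/m) X^T (y_hat - y),  db = (1/m) sum(y_hat - y)
    m = X.shape[0]
    error = y_hat - y           # shape (m,)
    dw = (X.T @ error) / m      # shape (n,)
    db = np.sum(error) / m      # scalar
    return dw, db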

Summary

The final gradient formulas are remarkably simple:

\text{error} = \hat{y} - y

dw = \frac{1}{m} X^T \cdot \text{error}

db = \frac{1}{m} \sum \text{error}

In NumPy (with y_pred, y, X, and m already in scope):

import numpy as np

error = y_pred - y            # prediction error, shape (m,)
dw = (1 / m) * (X.T @ error)  # gradient w.r.t. weights, shape (n,)
db = (1 / m) * np.sum(error)  # gradient w.r.t. bias, scalar

The key insight is that binary cross-entropy loss paired with sigmoid activation produces this clean gradient because the sigmoid derivative \hat{y}(1-\hat{y}) cancels with the denominator of the cross-entropy derivative.

This is one reason why cross-entropy is preferred over MSE for classification—the gradients are cleaner and don’t suffer from the vanishing gradient problem that MSE + sigmoid would have!
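To see the contrast concretely, pair MSE with a sigmoid output, L_{\text{MSE}} = \frac{1}{2}(\hat{y} - y)^2. The same chain rule as above gives:

\frac{\partial L_{\text{MSE}}}{\partial z} = (\hat{y} - y) \cdot \hat{y}(1 - \hat{y})

The extra \hat{y}(1 - \hat{y}) factor vanishes as \hat{y} saturates near 0 or 1, so a confidently wrong prediction (say \hat{y} \approx 0.99 when y = 0) learns very slowly; the cross-entropy gradient \hat{y} - y stays large in exactly that case.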

Derivation: From Log-Likelihood to Cost Function

Step 1: Start with Log-Likelihood (Maximization)

Here the labels are encoded as y^{(i)} \in \{-1, +1\}, so that P(y^{(i)} | x^{(i)}, w) = \sigma(y^{(i)} w^T x^{(i)}):

\log P(y | X, w) = \sum_{i=1}^{N} \log P(y^{(i)} | x^{(i)}, w)

= \sum_{i=1}^{N} \log \sigma(y^{(i)} w^T x^{(i)})

Gradient (for ascent):

\nabla_w \log P(y|X,w) = X^T(\sigma(-y \odot Xw) \odot y)

Step 2: Define Cost as Negative Log-Likelihood

To convert to a minimization problem:

J(w) = -\log P(y | X, w)

= -\sum_{i=1}^{N} \log \sigma(y^{(i)} w^T x^{(i)})

Step 3: Take Gradient of Cost Function

\nabla_w J(w) = -\nabla_w \log P(y|X,w)

= -X^T(\sigma(-y \odot Xw) \odot y)

Step 4: Gradient Descent Update Rule

w_{t+1} = w_t - \alpha \nabla_w J(w)

= w_t - \alpha \cdot [-X^T(\sigma(-y \odot Xw) \odot y)]

= w_t + \alpha X^T(\sigma(-y \odot Xw) \odot y)
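As a minimal NumPy sketch of one such step under the \pm 1 label convention (all values are hypothetical):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
X = np.array([[0.5, 1.2],
              [-0.3, 0.8]])        # hypothetical inputs, one example per row
y_pm = np.array([1.0, -1.0])       # hypothetical labels in {-1, +1}
w = np.zeros(2)
alpha = 0.1

margins = y_pm * (X @ w)                        # y^(i) w^T x^(i), shape (N,)
grad_logP = X.T @ (sigmoid(-margins) * y_pm)    # gradient of the log-likelihood
w = w + alpha * grad_logP                       # ascent on log P == descent on -log P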

Wait, That’s the Same as Gradient Ascent!

Yes! Because:

\text{Gradient Descent on } (-\log P) = \text{Gradient Ascent on } (\log P)

They’re mathematically identical operations.

Converting to Standard Form

To get the familiar X^T(\hat{y} - y) form, you need to work through the algebra. Let me show you:

For binary classification with labels y \in \{0, 1\} (switching back from the \pm 1 convention):

\log P(y^{(i)} | x^{(i)}, w) = y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})

where \hat{y}^{(i)} = \sigma(w^T x^{(i)}).

So the negative log-likelihood, averaged over the N examples (dividing by the constant N doesn't change the minimizer), is:

J(w) = -\frac{1}{N} \sum_{i=1}^{N} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]

Taking the gradient (as I showed in my earlier derivation):

\nabla_w J(w) = \frac{1}{N} X^T(\hat{y} - y)

Gradient Descent Update

w_{t+1} = w_t - \alpha \nabla_w J(w)

\boxed{w_{t+1} = w_t - \frac{\alpha}{N} X^T(\hat{y} - y)}
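Putting the update rule into a complete training loop (a minimal sketch, not a definitive implementation; the bias b is updated the same way, per the earlier derivation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=1000):
    # batch gradient descent with w <- w - (alpha/N) X^T (y_hat - y)
    N, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        y_hat = sigmoid(X @ w + b)   # forward pass
        error = y_hat - y            # shape (N,)
        w -= alpha * (X.T @ error) / N
        b -= alpha * np.sum(error) / N
    return w, b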

Summary Table

| Framework | Objective | Gradient | Update Rule |
|---|---|---|---|
| Log-Likelihood (Maximize) | \max \log P(y\|X,w) | X^T(\sigma(-y \odot Xw) \odot y) | w \leftarrow w + \alpha \nabla \log P |
| Negative Log-Likelihood (Minimize) | \min -\log P(y\|X,w) | -X^T(\sigma(-y \odot Xw) \odot y) | w \leftarrow w - \alpha \nabla(-\log P) |
| Cross-Entropy Cost (Minimize) | \min J(w) | \frac{1}{N} X^T(\hat{y} - y) | w \leftarrow w - \alpha \nabla J |

All three approaches are mathematically equivalent and will converge to the same solution!

The cross-entropy form is most common in practice because:

  1. The gradient formula X^T(\hat{y} - y) is very clean
  2. It’s easier to interpret as “error times features”
  3. It generalizes nicely to multi-class problems (see the note below)
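On point 3: the same cancellation carries over to the multi-class case. For \hat{y} = \text{softmax}(z) with a one-hot label y, the categorical cross-entropy L = -\sum_k y_k \log \hat{y}_k has the logit gradient

\frac{\partial L}{\partial z} = \hat{y} - y

the same clean "prediction minus target" form.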