Mathematical Derivation of Binary Cross-Entropy Loss
1. The Cost Function (Binary Cross-Entropy)
For a single training example:
$$L(y, \hat{y}) = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$
For m training examples, the cost function is the average:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[\, y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)}) \,\right]$$
Where:
- $y \in \{0, 1\}$ is the true label
- $\hat{y} = \sigma(z)$ is the predicted probability
- $z = w^T x + b$ is the linear combination
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function
2. The Forward Pass
$$z = w^T x + b$$
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
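As a quick sanity check, here is a minimal NumPy sketch of this forward pass (the helper names `sigmoid` and `forward` are just illustrative):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    # X: (m, n) batch of examples, w: (n,) weights, b: scalar bias
    z = X @ w + b        # z = w^T x + b for each row of X
    return sigmoid(z)    # predicted probabilities in (0, 1)
```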
3. Derivative of Sigmoid
First, let’s derive the sigmoid derivative (we’ll need this):
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$$\frac{d\sigma}{dz} = \frac{d}{dz}\left[\frac{1}{1 + e^{-z}}\right]$$
$$= -(1 + e^{-z})^{-2} \cdot (-e^{-z})$$
$$= \frac{e^{-z}}{(1 + e^{-z})^2}$$
$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}$$
$$= \sigma(z) \cdot [1 - \sigma(z)]$$
Key result:
$$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})$$
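If you want to convince yourself of this identity numerically, a central-difference check takes a few lines (test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma * (1 - sigma)
assert abs(numeric - analytic) < 1e-8                        # both ~0.2217
```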
4. Derivative of the Loss w.r.t. $\hat{y}$
Now let’s find $\frac{\partial L}{\partial \hat{y}}$ for a single example:
$$L = -\left[\, y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \,\right]$$
$$\frac{\partial L}{\partial \hat{y}} = -\left[ y \cdot \frac{1}{\hat{y}} + (1 - y) \cdot \frac{1}{1 - \hat{y}} \cdot (-1) \right]$$
$$= -\left[ \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right]$$
$$= -\frac{y(1 - \hat{y}) - \hat{y}(1 - y)}{\hat{y}(1 - \hat{y})}$$
$$= -\frac{y - y\hat{y} - \hat{y} + y\hat{y}}{\hat{y}(1 - \hat{y})}$$
$$= -\frac{y - \hat{y}}{\hat{y}(1 - \hat{y})}$$
$$= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$
Key result:
$$\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$
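As a quick numeric example: for $y = 1$ and $\hat{y} = 0.8$, this gives $\frac{0.8 - 1}{0.8 \cdot 0.2} = -1.25$; the negative sign means gradient descent pushes $\hat{y}$ upward, toward the true label.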
5. Chain Rule: $\frac{\partial L}{\partial z}$
Now apply the chain rule:
$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$$
$$= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y})$$
$$= \hat{y} - y$$
This is the beautiful simplification! The $\hat{y}(1 - \hat{y})$ terms cancel out perfectly.
Key result:
$$\frac{\partial L}{\partial z} = \hat{y} - y$$
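The same finite-difference trick verifies the full chain: differentiating the loss directly with respect to $z$ should land on $\hat{y} - y$ (test values below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.3, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y        # dL/dz = y_hat - y
assert abs(numeric - analytic) < 1e-8
```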
6. Gradient w.r.t. Weights
Using the chain rule again:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w}$$
Since $z = w^T x + b$, we have $\frac{\partial z}{\partial w} = x$, so:
$$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$$
For m examples:
$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}$$
$$= \frac{1}{m} X^T (\hat{y} - y)$$
Result:
$$dw = \frac{1}{m} X^T (\hat{y} - y)$$
7. Gradient w.r.t. Bias
Similarly:
$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial b}$$
$$= (\hat{y} - y) \cdot 1$$
$$= \hat{y} - y$$
For m examples:
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$
Result:
$$db = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$
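Both gradients vectorize in a couple of lines; a minimal sketch with toy data (all names and sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                                   # m examples, n features
X = rng.normal(size=(m, n))
y = rng.integers(0, 2, size=m).astype(float)  # labels in {0, 1}
w, b = rng.normal(size=n), 0.0

y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # forward pass
error = y_hat - y

dw = (X.T @ error) / m                        # (1/m) X^T (y_hat - y)
db = error.mean()                             # (1/m) sum(y_hat - y)
```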
Summary
The final gradient formulas are remarkably simple:
$$\text{error} = \hat{y} - y$$
$$dw = \frac{1}{m} X^T \cdot \text{error}$$
$$db = \frac{1}{m} \sum \text{error}$$
In NumPy:

```python
import numpy as np

# Assumes y_pred = sigmoid(X @ w + b); X is (m, n), y and y_pred are (m,)
error = y_pred - y            # error = y_hat - y
dw = (1 / m) * (X.T @ error)  # dw = (1/m) X^T (y_hat - y)
db = (1 / m) * np.sum(error)  # db = (1/m) sum(error)
```
The key insight is that binary cross-entropy loss paired with sigmoid activation produces this clean gradient: the sigmoid derivative $\hat{y}(1 - \hat{y})$ cancels exactly with the denominator of the cross-entropy derivative.
This is one reason why cross-entropy is preferred over MSE for classification: the gradients are cleaner, and they avoid the vanishing-gradient problem of MSE + sigmoid, whose gradient retains a $\sigma'(z)$ factor that shrinks toward zero as the sigmoid saturates.
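Putting the whole derivation to work, here is a minimal batch-gradient-descent trainer built only on these formulas (function name, toy data, and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, steps=1000):
    # Batch gradient descent with dw = (1/m) X^T (y_hat - y), db = mean(error)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error) / m
        b -= lr * error.mean()
    return w, b

# Toy usage: a linearly separable problem (label = sign of first feature).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)
w, b = train_logistic(X, y)
accuracy = ((sigmoid(X @ w + b) > 0.5) == (y == 1)).mean()
print(f"train accuracy: {accuracy:.2f}")   # near 1.0 on this toy set
```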
Derivation: From Log-Likelihood to Cost Function
Step 1: Start with Log-Likelihood (Maximization)
$$\log P(y \mid X, w) = \sum_{i=1}^{N} \log P(y^{(i)} \mid x^{(i)}, w)$$
$$= \sum_{i=1}^{N} \log \sigma(y^{(i)} w^T x^{(i)})$$
(This form uses the label convention $y \in \{-1, +1\}$, under which $P(y \mid x, w) = \sigma(y \, w^T x)$.)
Gradient (for ascent):
$$\nabla_w \log P(y \mid X, w) = X^T \big(\sigma(-y \odot Xw) \odot y\big)$$
Step 2: Define Cost as Negative Log-Likelihood
To convert to a minimization problem:
$$J(w) = -\log P(y \mid X, w)$$
$$= -\sum_{i=1}^{N} \log \sigma(y^{(i)} w^T x^{(i)})$$
Step 3: Take Gradient of Cost Function
$$\nabla_w J(w) = -\nabla_w \log P(y \mid X, w)$$
$$= -X^T \big(\sigma(-y \odot Xw) \odot y\big)$$
Step 4: Gradient Descent Update Rule
$$w_{t+1} = w_t - \alpha \nabla_w J(w)$$
$$= w_t - \alpha \cdot \left[ -X^T \big(\sigma(-y \odot Xw) \odot y\big) \right]$$
$$= w_t + \alpha \, X^T \big(\sigma(-y \odot Xw) \odot y\big)$$
Wait, That’s the Same as Gradient Ascent!
Yes! Because:
$$\text{Gradient Descent on } (-\log P) = \text{Gradient Ascent on } (\log P)$$
They’re mathematically identical operations.
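A tiny numerical demonstration that the two updates coincide (random toy data, labels in $\{-1, +1\}$ to match the likelihood form above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
w, alpha = rng.normal(size=3), 0.1

grad_logP = X.T @ (sigmoid(-y * (X @ w)) * y)   # gradient of log-likelihood

w_ascent = w + alpha * grad_logP                # ascent on log P
w_descent = w - alpha * (-grad_logP)            # descent on -log P
assert np.allclose(w_ascent, w_descent)         # identical updates
```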
Converting to Standard Form
To get the familiar $X^T(\hat{y} - y)$ form, switch back to the label convention $y \in \{0, 1\}$ and work through the algebra. Let me show you:
For binary classification with $y \in \{0, 1\}$:
$$\log P(y^{(i)} \mid x^{(i)}, w) = y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})$$
Where $\hat{y}^{(i)} = \sigma(w^T x^{(i)})$
So the negative log-likelihood is:
$$J(w) = -\sum_{i=1}^{N} \left[\, y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \,\right]$$
Taking the gradient (as in the earlier derivation) and dividing by $N$ to average over examples:
$$\nabla_w J(w) = \frac{1}{N} X^T (\hat{y} - y)$$
Gradient Descent Update
$$w_{t+1} = w_t - \alpha \nabla_w J(w)$$
$$w_{t+1} = w_t - \frac{\alpha}{N} X^T (\hat{y} - y)$$
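To see that the two gradient forms really agree, map the labels between conventions ($y_{01} = (y_{\pm} + 1)/2$) and compare numerically; a sketch under those assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 3))
y_pm = rng.choice([-1.0, 1.0], size=10)   # labels in {-1, +1}
y01 = (y_pm + 1) / 2                      # same labels mapped to {0, 1}
w = rng.normal(size=3)

grad_logP = X.T @ (sigmoid(-y_pm * (X @ w)) * y_pm)  # ascent direction
grad_J = X.T @ (sigmoid(X @ w) - y01)                # descent direction (sum form)

assert np.allclose(grad_logP, -grad_J)    # same vector, opposite sign
```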
Summary Table
| Framework | Objective | Gradient | Update Rule |
|---|---|---|---|
| Log-Likelihood (Maximize) | $\max \log P(y \mid X, w)$ | $X^T(\sigma(-y \odot Xw) \odot y)$ | $w \leftarrow w + \alpha \nabla \log P$ |
| Negative Log-Likelihood (Minimize) | $\min \, -\log P(y \mid X, w)$ | $-X^T(\sigma(-y \odot Xw) \odot y)$ | $w \leftarrow w - \alpha \nabla (-\log P)$ |
| Cross-Entropy Cost (Minimize) | $\min J(w)$ | $\frac{1}{N} X^T(\hat{y} - y)$ | $w \leftarrow w - \alpha \nabla J$ |
All three approaches are mathematically equivalent and will converge to the same solution!
The cross-entropy form is most common in practice because:
- The gradient formula $X^T(\hat{y} - y)$ is very clean
- It’s easier to interpret as “error times features”
- It generalizes nicely to multi-class problems
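On the last point, the multi-class story mirrors the binary one: with softmax activation $\hat{y} = \mathrm{softmax}(z)$, one-hot labels $y$, and cross-entropy loss $L = -\sum_k y_k \log \hat{y}_k$, the same cancellation occurs:
$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$$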