Softmax + NLL and Cross-Entropy Theory
1. Key Purpose of Using Softmax + NLL
Main idea:
To convert raw model outputs into a valid probability distribution and then measure how well the model matches the true class — in a numerically stable way.
1.1 Softmax Turns Logits into Probabilities
A neural network outputs raw scores (logits), for example:
[2.3, -1.2, 0.7]
These are not probabilities. Softmax converts them into a probability distribution:

\[
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}
\]
This is required because NLL operates on probabilities.
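A minimal sketch of this in NumPy (the `softmax` helper below is just an illustrative function written for this example, not a specific library API):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.3, -1.2, 0.7])   # the example logits from above
probs = softmax(logits)
print(probs)        # approximately [0.81, 0.02, 0.16]
print(probs.sum())  # 1.0 -- a valid probability distribution
```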
1.2 NLL Measures the Likelihood of the Correct Class
NLL looks only at the predicted probability \( \hat{p}_{y} \) of the true class \( y \):

\[
\text{NLL} = -\log \hat{p}_{y}
\]
- High probability → small loss
- Low probability → large loss
Thus, it encourages the model to increase the probability of the correct class.
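A quick numeric check (NumPy, with probabilities made up for illustration) makes this asymmetry concrete:

```python
import numpy as np

# Predicted probability assigned to the true class in two scenarios (illustrative values).
high_prob_for_true_class = 0.9
low_prob_for_true_class = 0.01

print(-np.log(high_prob_for_true_class))  # ~0.105 -> small loss
print(-np.log(low_prob_for_true_class))   # ~4.605 -> large loss
```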
1.3 Combined Effect: Probabilistic Learning
Softmax + NLL together lead to:
- Sharper probability distributions
- Higher confidence in correct predictions
- Better class separation
1.4 Numerical Stability
Frameworks compute:
log_softmax → NLL
instead of:
softmax → log → NLL
to avoid overflow/underflow issues with exponentials and logarithms.
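A sketch of why this matters, using NumPy and deliberately extreme logits chosen for illustration: the naive path overflows, while the log-sum-exp formulation used inside log_softmax stays finite.

```python
import numpy as np

logits = np.array([1000.0, 0.0, -1000.0])  # deliberately extreme values

# Naive path: softmax -> log. exp(1000) overflows to inf, destroying the result.
naive_probs = np.exp(logits) / np.exp(logits).sum()
print(np.log(naive_probs))   # [nan -inf -inf], plus overflow warnings

# Stable path: log_softmax via the log-sum-exp trick (shift by the max first).
shifted = logits - logits.max()
log_probs = shifted - np.log(np.exp(shifted).sum())
print(log_probs)             # [0. -1000. -2000.] -- finite and essentially exact
```

Frameworks apply the same trick internally; in PyTorch, for example, `F.log_softmax` + `F.nll_loss`, or the combined `F.cross_entropy`, work directly on logits for this reason.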
1.5 One-Sentence Summary
Softmax produces probabilities, and NLL evaluates how well they match the true class — giving a strong and stable learning signal.
2. Theory of Cross-Entropy
2.1 Origin from Information Theory
Cross-entropy measures how different two probability distributions are: a true distribution \( p \) and a predicted distribution \( q \):

\[
H(p, q) = -\sum_{i} p_i \log q_i
\]
Interpretation: How many bits are needed to encode data from \( p \) when using a code optimized for \( q \)?
2.2 Cross-Entropy in Machine Learning
For classification, the true distribution is usually one-hot. Example:
p = [0, 0, 1, 0]
Cross-entropy simplifies to:

\[
H(p, q) = -\log q_{c}
\]

where \( c \) is the index of the correct class (the position of the 1 in \( p \)).
This is exactly the negative log-likelihood (NLL). Thus:
Cross-entropy = Softmax + NLL.
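A minimal NumPy check, with an arbitrary predicted distribution `q` chosen for illustration, confirms that the full cross-entropy sum collapses to the negative log-probability of the single correct class:

```python
import numpy as np

p = np.array([0.0, 0.0, 1.0, 0.0])      # one-hot true distribution (correct class at index 2)
q = np.array([0.1, 0.2, 0.6, 0.1])      # some predicted distribution (illustrative)

cross_entropy = -np.sum(p * np.log(q))  # full definition: -sum_i p_i * log(q_i)
nll = -np.log(q[2])                     # NLL of the correct class only

print(cross_entropy, nll)               # both ~0.511 -- identical
```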
2.3 Why Use Cross-Entropy?
- Rewards high probability for the correct class
- Penalizes confident wrong predictions
2.4 Relationship with Entropy and KL Divergence
Entropy:

\[
H(p) = -\sum_{i} p_i \log p_i
\]

Cross-entropy:

\[
H(p, q) = -\sum_{i} p_i \log q_i
\]

KL divergence:

\[
D_{\mathrm{KL}}(p \parallel q) = \sum_{i} p_i \log \frac{p_i}{q_i} = H(p, q) - H(p)
\]
In classification, \( H(p) \) is a constant of the data (zero for one-hot labels), so minimizing \( H(p,q) \) also minimizes the KL divergence.
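A small NumPy sketch verifying the decomposition \( H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q) \), using a soft (non-one-hot) true distribution chosen for illustration so that \( H(p) \) is nonzero:

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.05, 0.05, 0.85, 0.05])   # soft "true" distribution (illustrative)
q = np.array([0.10, 0.20, 0.60, 0.10])   # predicted distribution (illustrative)

print(cross_entropy(p, q))                # H(p, q)
print(entropy(p) + kl_divergence(p, q))   # H(p) + KL(p || q) -- same value
```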
2.5 Why Cross-Entropy Is the Best Loss for Classification
- Correct probabilistic interpretation
- Convex for logistic regression
- Provides strong, informative gradients (with respect to the logits, the gradient is simply \( q - p \); see the sketch after this list)
- Derived directly from maximum likelihood estimation
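A short PyTorch sketch (illustrative logits and target) makes the gradient claim concrete: the gradient of the cross-entropy loss with respect to the logits equals \( q - p \), the predicted probabilities minus the one-hot target.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.3, -1.2, 0.7]], requires_grad=True)  # shape (batch=1, classes=3)
target = torch.tensor([0])                                     # true class index (illustrative)

loss = F.cross_entropy(logits, target)   # log_softmax + NLL in one call
loss.backward()

q = F.softmax(logits.detach(), dim=1)
p = F.one_hot(target, num_classes=3).float()

print(logits.grad)   # gradient w.r.t. the logits
print(q - p)         # identical: softmax(logits) - one_hot(target)
```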
2.6 Cross-Entropy as Maximum Likelihood
Softmax model:

\[
q_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}
\]

Likelihood of true class \( y \):

\[
\mathcal{L} = q_{y}
\]

Maximizing likelihood is equivalent to minimizing:

\[
-\log q_{y}
\]
This is exactly cross-entropy.
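Over a dataset, the likelihood is the product of the per-example probabilities of the true class, so maximizing it is the same as minimizing the sum of the per-example losses. A tiny numeric sanity check (NumPy, made-up probabilities for three training examples):

```python
import numpy as np

# Predicted probability of the *true* class for three training examples (illustrative values).
q_true = np.array([0.8, 0.6, 0.9])

likelihood = np.prod(q_true)            # quantity MLE maximizes
total_nll = -np.sum(np.log(q_true))     # summed cross-entropy loss the optimizer minimizes

print(-np.log(likelihood), total_nll)   # same number: -log(product) == sum of -log terms
```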
2.7 Summary Table
| Concept | Meaning |
|---|---|
| Softmax | Converts logits into probabilities |
| NLL | Penalizes low probability for correct class |
| Cross-entropy | Measures mismatch between true and predicted distributions |
| Theory | From information theory + maximum likelihood |