What is an LSTM?
LSTM (Long Short-Term Memory) is a special kind of Recurrent Neural Network (RNN) designed to remember important information over long time sequences and forget useless stuff.
Think of it as a smart notebook that decides what to remember, what to forget, and what to use right now.
Classic RNNs forget quickly. LSTMs were invented to fix that.
Mathematical Intuition
Each LSTM cell has 3 gates + memory.
Let:
- x_t = input at time t
- h_{t-1} = previous hidden state
- c_{t-1} = previous memory (cell state)
Forget Gate – What should I forget?
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
- Outputs values between 0 and 1
- 0 → forget completely
- 1 → keep completely
Input Gate – What new information should I store?
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
~c_t = tanh(W_c [h_{t-1}, x_t] + b_c)
Update Memory
c_t = f_t * c_{t-1} + i_t * ~c_t
This is the key step: the cell state is updated additively, so memory flows almost unchanged from step to step, which is what prevents vanishing gradients.
Output Gate – What should I output right now?
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(c_t)
Key idea:
- Cell state c_t = long-term memory
- Hidden state h_t = short-term output
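To make the gate equations concrete, here is a minimal NumPy sketch of one LSTM cell step. The function name, argument layout, and per-gate weight matrices are illustrative choices, not a real library API; production implementations (e.g. PyTorch's nn.LSTM) fuse these operations into larger matrix multiplies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    # One LSTM cell step, following the equations above.
    # Shapes (illustrative): x_t (input_dim,), h_prev/c_prev (hidden_dim,),
    # each W_* (hidden_dim, hidden_dim + input_dim), each b_* (hidden_dim,)
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]

    f_t = sigmoid(W_f @ z + b_f)             # forget gate: what to drop from memory
    i_t = sigmoid(W_i @ z + b_i)             # input gate: how much new info to write
    c_tilde = np.tanh(W_c @ z + b_c)         # candidate memory ~c_t
    c_t = f_t * c_prev + i_t * c_tilde       # update long-term memory
    o_t = sigmoid(W_o @ z + b_o)             # output gate: how much memory to expose
    h_t = o_t * np.tanh(c_t)                 # short-term output
    return h_t, c_t
```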
Why LSTM works
Vanilla RNN problem
- Gradients → 0 (vanishing gradient)
- Can’t learn long-range dependencies
LSTM solution
- Memory path with multiplicative gates
- Gradient flows smoothly
- Can remember events hundreds of steps back
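A toy numerical sketch (not a proof) of the difference: over many steps a vanilla RNN's gradient is roughly a product of factors that are often well below 1, while the gradient along the LSTM's cell-state path is scaled by the forget gates, which the network can keep close to 1. The factor ranges below are made-up illustrative numbers.

```python
import numpy as np

T = 200                                       # number of time steps (toy value)
rng = np.random.default_rng(0)

# Vanilla RNN: the backward pass multiplies ~T factors (tanh derivatives times
# recurrent weights) that are typically well below 1, so the product collapses.
rnn_factors = rng.uniform(0.3, 0.9, size=T)
print("RNN-style product of factors:", np.prod(rnn_factors))         # ~0 (vanished)

# LSTM cell-state path: each step scales the gradient by the forget gate f_t,
# which the network can learn to keep near 1 for information it must retain.
forget_gates = rng.uniform(0.95, 1.0, size=T)
print("LSTM-style product of forget gates:", np.prod(forget_gates))  # still a usable size
```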
Where LSTMs are used
Time Series
- Inflation forecasting
- Stock prices
- Weather
- Energy demand
- Sales forecasting
NLP
- Language modeling
- Text generation
- Sentiment analysis
- Chatbots (pre-Transformer era)
Signal Processing
- Speech recognition
- ECG / EEG signals
Anomaly Detection
- Network traffic
- Fraud detection
- Sensor failures
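As a concrete example of the time-series case above, here is a minimal one-step-ahead forecasting sketch in PyTorch (assumed here as the framework; the class name, layer sizes, and window length are arbitrary illustrative choices).

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    # Hypothetical one-step-ahead forecaster: 1 input feature, 64 hidden units.
    def __init__(self, input_size=1, hidden_size=64, num_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                   # x: (batch, seq_len, input_size)
        out, (h_n, c_n) = self.lstm(x)      # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1, :])     # predict from the last time step

model = LSTMForecaster()
window = torch.randn(8, 30, 1)              # 8 series, 30 past steps, 1 feature (toy data)
next_value = model(window)                  # shape (8, 1): one-step-ahead forecast
```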
LSTM vs Others
LSTM vs Vanilla RNN
| Feature | RNN | LSTM |
|---|---|---|
| Long memory | ❌ | ✅ |
| Handles vanishing gradients | ❌ | ✅ |
| Complexity | Low | Higher |
| Used in practice | Rarely | Commonly |
LSTM vs GRU
| Feature | LSTM | GRU |
|---|---|---|
| Gates | 3 | 2 |
| Memory cell | Separate | Combined |
| Speed | Slower | Faster |
| Data needed | More | Less |
Rule of thumb: Small dataset → GRU, Long sequences → LSTM
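This swap is easy to try because most frameworks expose the two as near drop-in layers. A small PyTorch sketch (assumed framework, toy shapes): the main API difference is that the LSTM returns a (hidden, cell) state pair while the GRU returns a single hidden state, and the GRU has roughly 25% fewer parameters at the same hidden size.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 50, 16)                  # (batch, seq_len, features) - toy shapes

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)              # LSTM: separate hidden and cell states

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
out_gru, h_n_gru = gru(x)                   # GRU: a single combined state

# LSTM has 4 weight blocks (3 gates + candidate), GRU has 3 (2 gates + candidate),
# so the GRU is roughly 25% smaller for the same hidden size.
print(sum(p.numel() for p in lstm.parameters()),
      sum(p.numel() for p in gru.parameters()))
```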
LSTM vs Transformer
| Feature | LSTM | Transformer |
|---|---|---|
| Sequence handling | Sequential | Parallel |
| Long context | Limited | Excellent |
| Training speed | Slow | Fast |
| Data needed | Less | More |
| Time series | Excellent | Good |
LSTMs are great for small/medium datasets; Transformers require huge data.
When should YOU use LSTM?
Use LSTM if:
- Data is sequential
- Order matters
- Dataset is not massive
- You want temporal patterns
Avoid LSTM if:
- You have millions of samples
- Very long contexts (>1000 steps)
- NLP at scale → use Transformers