Stacked LSTM vs Transformer for Time Series Prediction
1. Stacked LSTM
LSTM (Long Short-Term Memory) is a type of Recurrent Neural Network (RNN) designed to mitigate the vanishing-gradient problem and capture long-range dependencies in sequential data.
Architecture
Input sequence → LSTM layer 1 → LSTM layer 2 → Dense → Output
Mathematics
LSTM uses gates to control information flow:
- Forget gate \(f_t\):
  \[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
- Input gate \(i_t\) and candidate state \(\tilde{C}_t\):
  \[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \]
  \[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \]
- Cell state update \(C_t\):
  \[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \]
- Output gate \(o_t\) and hidden state \(h_t\):
  \[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \]
  \[ h_t = o_t \odot \tanh(C_t) \]
Here \(x_t\) is the input at time \(t\), \(h_{t-1}\) is the previous hidden state, \(\sigma\) is the sigmoid activation, and \(\odot\) denotes element-wise multiplication.
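The gate equations above can be sketched directly in NumPy. The function and parameter names below (`lstm_step`, the `p` dictionary) are illustrative, not from any particular library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step, mirroring the gate equations above."""
    z = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])           # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])           # input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])     # candidate state
    C_t = f * C_prev + i * C_tilde                 # cell state update (element-wise)
    o = sigmoid(p["W_o"] @ z + p["b_o"])           # output gate
    h_t = o * np.tanh(C_t)                         # new hidden state
    return h_t, C_t

# Tiny demo: 3 input features, 4 hidden units, 5 timesteps run sequentially.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
p = {f"W_{g}": rng.standard_normal((n_h, n_h + n_in)) * 0.1 for g in "fiCo"}
p.update({f"b_{g}": np.zeros(n_h) for g in "fiCo"})
h, C = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((5, n_in)):
    h, C = lstm_step(x_t, h, C, p)
```

Stacking simply means feeding the sequence of \(h_t\) values from one layer as the input sequence of the next.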
2. Transformer-based Time Series Predictor
Transformers use self-attention to model dependencies across all timesteps simultaneously. They are highly effective for long sequences and allow parallel computation.
Architecture
Input sequence → Positional Encoding → Transformer Encoder → Feedforward → Output
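As a sketch of the positional-encoding step in this pipeline, here is the standard sinusoidal encoding (the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

# Added to the input embeddings before the encoder:
X = np.random.randn(50, 16)                            # (timesteps, d_model)
X_pos = X + sinusoidal_positional_encoding(50, 16)
```

Because attention itself is order-agnostic, this added signal is what lets the encoder distinguish timestep 3 from timestep 30.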
Mathematics
- Compute queries (Q), keys (K), and values (V) from the input \(X\):
  \[ Q = X W_Q, \quad K = X W_K, \quad V = X W_V \]
- Compute scaled dot-product attention:
  \[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]
- Multi-head attention:
  \[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W_O \]
- Feedforward & LayerNorm:
  \[ \mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1) W_2 + b_2 \]
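The projection and attention formulas above can be sketched in NumPy. This is a single head with no masking, and the names are illustrative:

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (T, T): every pair of timesteps
    # Row-wise softmax (subtract the max for numerical stability).
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = w / w.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(1)
T, d, d_k = 6, 8, 4
X = rng.standard_normal((T, d))
out, weights = attention(X, *(rng.standard_normal((d, d_k)) for _ in range(3)))
```

Note that the `(T, T)` weight matrix is why every timestep can attend to every other in one shot, and also why memory grows quadratically with sequence length.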
3. Key Differences
| Feature | Stacked LSTM | Transformer-based Predictor |
|---|---|---|
| Temporal modeling | Sequential, step-by-step | Global, all timesteps at once |
| Parallelization | Hard to parallelize | Fully parallelizable |
| Long-range dependencies | Hard for very long sequences | Excellent via attention |
| Memory | Hidden states carry info | Attention scores model relationships |
| Training speed | Slower | Faster for long sequences |
| Complexity | Simpler, fewer parameters | More parameters; needs more data |
| Use-case | Short-medium sequences | Long sequences, complex patterns |
Summary
- Stacked LSTM: Good for short-term sequential prediction like daily stock prices. Simpler and requires less data.
- Transformer: State-of-the-art for long-term forecasting, captures global dependencies, and allows parallel training. Needs more computation and data.
LSTM vs Encoder-only Transformer
LSTM = reads sequence step-by-step (memory-based)
Transformer = looks at entire sequence at once (attention-based)
How They Process Data
LSTM (Sequential)
x₁ → x₂ → x₃ → x₄ → x₅
h₁ → h₂ → h₃ → h₄ → h₅
Transformer (Parallel)
[x₁, x₂, x₃, x₄, x₅] → processed together
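The contrast above can be shown in a few lines of NumPy (an illustrative `tanh` recurrence stands in for the full LSTM):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 5, 4
X = rng.standard_normal((T, d))    # [x1, x2, x3, x4, x5]

# RNN/LSTM style: each h_t needs h_{t-1}, so the loop is inherently serial.
W = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
hidden = []
for t in range(T):                 # x1 -> x2 -> x3 -> x4 -> x5, in order
    h = np.tanh(W @ h + X[t])
    hidden.append(h)

# Transformer style: one matrix product transforms every timestep at once.
W_q = rng.standard_normal((d, d))
Q = X @ W_q                        # all rows computed independently, in parallel
```

The serial loop cannot start step t until step t-1 finishes; the matrix product has no such dependency, which is what makes Transformer training parallelizable.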
Summary
| Feature | LSTM | Transformer |
|---|---|---|
| Processing | Sequential | Parallel |
| Dependencies | Weak for long sequences | Strong |
| Speed | Slow | Fast |
| Core idea | Memory | Attention |
LSTM = memory over time
Transformer = attention over relationships