Key Concepts: Minibatches, DataLoader, and the Limits of Fully Connected Networks
This document summarizes several fundamental ideas in deep learning training pipelines,
including minibatch gradient descent, PyTorch’s DataLoader,
model capacity, and the limitations of fully connected (dense) networks for image data.
These concepts motivate the transition to convolutional neural networks (CNNs).
1. Minibatch Gradient Descent
Training with minibatches means computing gradients on a small subset of the dataset rather than the full dataset. This introduces noise, which has important benefits:
- Efficiency: Computing gradients on the entire dataset is slow; minibatches make training fast and scalable.
- Useful Noise: Minibatch gradients are noisy approximations of the full-dataset gradient. This stochasticity helps the optimizer escape poor local minima and, in practice, often improves generalization.
- Learning Rate Requirements: Because minibatch gradients fluctuate, a reasonably small learning rate prevents instability.
Shuffling the dataset each epoch ensures the sequence of minibatches remains representative of the overall data distribution.
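A minimal sketch of one epoch of minibatch gradient descent written by hand (assuming the dataset is held in tensors x and y, and that model, loss_fn, and optimizer are defined elsewhere):

import torch

def train_one_epoch(model, loss_fn, optimizer, x, y, batch_size=64):
    # Shuffle once per epoch so the sequence of minibatches stays representative
    perm = torch.randperm(x.shape[0])
    for start in range(0, x.shape[0], batch_size):
        idx = perm[start:start + batch_size]   # indices of this minibatch
        xb, yb = x[idx], y[idx]                # small subset of the data
        loss = loss_fn(model(xb), yb)          # noisy estimate of the full-dataset loss
        optimizer.zero_grad()
        loss.backward()                        # gradient computed from the minibatch only
        optimizer.step()

In practice, the shuffling and slicing is rarely written by hand: PyTorch's DataLoader, described next, handles it.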
2. PyTorch DataLoader
The DataLoader automates:
- Batching of samples
- Shuffling each epoch
- Iterating over data easily within training loops
A typical DataLoader setup:
import torch

# cifar2: a two-class subset of CIFAR images (assumed to be defined earlier)
train_loader = torch.utils.data.DataLoader(
    cifar2, batch_size=64, shuffle=True
)
Each iteration returns a minibatch of images and labels, ready for processing in the forward pass.
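As a quick sanity check (a sketch, assuming the loader defined above), you can pull a single minibatch and inspect its shapes:

imgs, labels = next(iter(train_loader))
print(imgs.shape)    # torch.Size([64, 3, 32, 32])
print(labels.shape)  # torch.Size([64])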
3. The Training Loop
Each training step consists of:
- Forward pass
- Loss computation
- Zeroing gradients
- Backward propagation
- Optimizer step
Example batch shapes:
imgs: 64 × 3 × 32 × 32
labels: 64
After training, accuracy is measured on a separate validation set without tracking gradients.
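A minimal sketch of one training epoch followed by a gradient-free validation pass, assuming model, loss_fn, optimizer, train_loader, and val_loader are defined elsewhere, and that the model is a fully connected network expecting flattened inputs:

import torch

for imgs, labels in train_loader:
    outputs = model(imgs.view(imgs.shape[0], -1))  # flatten images for a fully connected model
    loss = loss_fn(outputs, labels)                # loss computation
    optimizer.zero_grad()                          # zero old gradients
    loss.backward()                                # backward propagation
    optimizer.step()                               # optimizer step

correct = 0
total = 0
with torch.no_grad():                              # no gradient tracking during validation
    for imgs, labels in val_loader:
        outputs = model(imgs.view(imgs.shape[0], -1))
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.shape[0]
print("validation accuracy:", correct / total)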
4. Increasing Model Capacity and Overfitting
Adding more layers or larger layers increases the model’s capacity. This leads to:
- Near-perfect training accuracy
- Limited improvement in validation accuracy
This behavior indicates overfitting: the model memorizes the training set rather than learning generalizable patterns.
You can inspect the number of trainable parameters by summing p.numel() over model.parameters().
Fully connected layers tend to produce extremely large parameter counts.
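For example, a small (hypothetical) fully connected model on 32×32 RGB inputs already reaches millions of parameters:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 1024),  # 3072 inputs -> 1024 hidden units
    nn.Tanh(),
    nn.Linear(1024, 2),            # two output classes
)
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_params)  # about 3.1 million trainable parameters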
5. Why Fully Connected Networks Fail for Images
A. They Ignore Spatial Relationships
Flattening an image into a 1D vector removes the natural 2D structure. The network must learn pixel relationships independently for every location:
- An airplane at one position must be learned separately from an airplane shifted by a few pixels.
- The model is not translation invariant.
B. They Require Massive Numbers of Parameters
Every output neuron connects to every input pixel, so the parameter count grows with the product of the number of input pixels and the number of output units. For example, a single fully connected layer on a 1024×1024 RGB image could require billions of parameters.
This is computationally and memory-wise impractical.
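A rough back-of-the-envelope calculation (the hidden-layer size of 1,024 is chosen here purely for illustration):

in_features = 3 * 1024 * 1024       # flattened 1024×1024 RGB image: ~3.1 million inputs
out_features = 1024                 # a modest hidden layer, size chosen for illustration
print(in_features * out_features)   # 3221225472 weights, i.e. over 3 billion, before biases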
6. Motivation for Convolutional Layers
The shortcomings of fully connected layers lead naturally to the need for convolutional neural networks (CNNs):
- They exploit local patterns through small receptive fields.
- They reuse parameters across spatial positions (weight sharing).
- They are naturally translation invariant.
- They scale efficiently to large images.
Convolutional layers are therefore the standard architecture for image tasks.
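A small sketch of the parameter-count difference (layer sizes chosen here for illustration): the convolution reuses the same 16 filters at every spatial position, so its size does not depend on the image resolution, while the dense layer is tied to one fixed input size.

import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # parameter count independent of image size
dense = nn.Linear(3 * 32 * 32, 16)                 # tied to one specific (small) input size

print(sum(p.numel() for p in conv.parameters()))   # 448   = 16*3*3*3 weights + 16 biases
print(sum(p.numel() for p in dense.parameters()))  # 49168 = 3072*16 weights + 16 biases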
Conclusion
- Minibatches provide efficiency and useful randomness during training.
- PyTorch’s DataLoader simplifies data handling.
- Fully connected networks are prone to overfitting and do not scale well to images.
- CNNs solve these issues by leveraging the 2D structure of images and promoting translation invariance.
These concepts form the foundation for understanding modern deep learning approaches to image classification.
What Is Translation Variance?
Translation variance refers to a model’s tendency to produce different outputs when an input image is shifted (translated) left, right, up, or down.
This is often an undesirable property in image recognition because the meaning of the image does not change if an object shifts a few pixels.
Why Translation Variance Happens
Fully connected (dense) neural networks treat an image as a large 1D vector, ignoring the spatial relationships between neighboring pixels. As a result:
- A feature learned at one location must be relearned at every other location.
- Shifting the object in the input produces a completely different pattern of values.
- The model often fails to recognize the same object in a different position.
As a result, the model generalizes poorly to translated images.
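A tiny illustration of the problem (a toy 8×8 image, values chosen purely for illustration): after flattening, the original and shifted images share no active inputs at all, so a dense layer sees two unrelated patterns.

import torch

img = torch.zeros(8, 8)
img[2, 2] = 1.0                              # an "object" at one position
shifted = torch.roll(img, shifts=3, dims=1)  # the same object, moved 3 pixels to the right

# To a dense layer, the flattened vectors have nothing in common:
print(torch.dot(img.flatten(), shifted.flatten()))  # tensor(0.)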
Translation Invariance vs. Translation Variance
| Concept | Meaning | Example Behavior |
|---|---|---|
| Translation Invariance | The model's prediction does not change when the input image is shifted. | A CNN recognizes a cat regardless of whether it appears at the top-left or center. |
| Translation Variance | The model's prediction does change when the image is shifted. | A fully connected network fails to identify the same plane if it moves a few pixels. |
Why CNNs Fix Translation Variance
Convolutional neural networks naturally achieve translation invariance because they use:
- Local receptive fields – small regions of the image are processed at a time.
- Weight sharing – one filter slides across the whole image.
This means the same pattern can be detected anywhere in the image, allowing the model to recognize objects regardless of their position.
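A small sketch of this behavior (toy example; strictly speaking, a convolution by itself is translation equivariant, meaning a shifted input yields a correspondingly shifted feature map, and pooling or global aggregation turns that into invariance of the final prediction):

import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
img = torch.zeros(1, 1, 8, 8)
img[0, 0, 2, 2] = 1.0                        # a single "object" pixel
shifted = torch.roll(img, shifts=3, dims=3)  # the same object, moved 3 pixels to the right

out = conv(img)
out_shifted = conv(shifted)
# The feature map of the shifted image is just the shifted feature map of the original:
print(torch.allclose(torch.roll(out, shifts=3, dims=3), out_shifted))  # True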
Summary
- Translation variance → predictions change when the image is shifted.
- Fully connected networks → translation-variant (bad for images).
- CNNs → translation-invariant (ideal for vision tasks).