
Batch Size and Minibatches in Machine Learning


Key Concepts: Minibatches, DataLoader, and the Limits of Fully Connected Networks

This document summarizes several fundamental ideas in deep learning training pipelines, including minibatch gradient descent, PyTorch’s DataLoader, model capacity, and the limitations of fully connected (dense) networks for image data. These concepts motivate the transition to convolutional neural networks (CNNs).

1. Minibatch Gradient Descent

Training with minibatches means computing gradients on a small subset of the dataset rather than the full dataset. This introduces noise, which has important benefits:

  • Efficiency: Computing gradients on the entire dataset is slow; minibatches make training fast and scalable.
  • Useful Noise: Minibatch gradients are noisy approximations of the full-dataset gradient. This stochasticity can help the optimizer escape shallow local minima and often improves generalization.
  • Learning Rate Requirements: Because minibatch gradients fluctuate from batch to batch, the learning rate must be kept reasonably small to avoid instability.

Shuffling the dataset each epoch ensures the sequence of minibatches remains representative of the overall data distribution.
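The shuffle-then-batch idea can be sketched without any framework. The following minimal illustration uses a toy integer "dataset" and a hypothetical batch size of 4 (all names here are illustrative):

```python
import random

# Toy dataset: 10 samples, batch size 4.
data = list(range(10))
batch_size = 4

def minibatches(data, batch_size, seed=0):
    """Yield shuffled minibatches; the last batch may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # reshuffle once per epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

batches = list(minibatches(data, batch_size))
# Every sample appears exactly once per epoch, in shuffled order.
```

This is essentially what PyTorch's DataLoader automates, with the added machinery of worker processes and tensor collation.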

2. PyTorch DataLoader

The DataLoader automates:

  • Batching of samples
  • Shuffling each epoch
  • Iterating over data easily within training loops

A typical DataLoader setup (here cifar2 is a Dataset prepared earlier, e.g. two classes extracted from CIFAR-10):

train_loader = torch.utils.data.DataLoader(
    cifar2, batch_size=64, shuffle=True
)

Each iteration returns a minibatch of images and labels, ready for processing in the forward pass.

3. The Training Loop

Each training step consists of:

  • Forward pass
  • Loss computation
  • Zeroing gradients
  • Backward propagation
  • Optimizer step

Example batch shapes:

  • imgs: 64 × 3 × 32 × 32
  • labels: 64

After training, accuracy is measured on a separate validation set without tracking gradients.
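The steps above can be sketched in a few lines. The model, optimizer, and data below are hypothetical stand-ins (a tiny linear classifier and a synthetic batch with the shapes quoted above), not the original code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny model and one synthetic CIFAR-like batch.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

imgs = torch.randn(64, 3, 32, 32)    # minibatch of images: 64 x 3 x 32 x 32
labels = torch.randint(0, 2, (64,))  # minibatch of labels: 64

# One training step.
outputs = model(imgs)            # forward pass
loss = loss_fn(outputs, labels)  # loss computation
optimizer.zero_grad()            # zero accumulated gradients
loss.backward()                  # backward propagation
optimizer.step()                 # optimizer step

# Validation-style evaluation: no gradient tracking.
with torch.no_grad():
    preds = model(imgs).argmax(dim=1)
    accuracy = (preds == labels).float().mean().item()
```

In a real run, the training step loops over train_loader for many epochs, and accuracy is computed on a held-out validation loader rather than on training images.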

4. Increasing Model Capacity and Overfitting

Adding more layers or larger layers increases the model’s capacity. This leads to:

  • Near-perfect training accuracy
  • Limited improvement in validation accuracy

This behavior indicates overfitting: the model memorizes the training set rather than learning generalizable patterns.

You can inspect the number of trainable parameters by summing p.numel() over model.parameters(). Fully connected layers tend to produce extremely large parameter counts.
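As a sketch, here is that count for a hypothetical two-layer dense classifier on 32×32 RGB inputs (the layer sizes are illustrative, not taken from the original model):

```python
import torch.nn as nn

# A small fully connected classifier for flattened 3x32x32 inputs (sketch).
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# 3072*512 + 512 (first layer) + 512*2 + 2 (second layer) = 1,574,402
```

Over 1.5 million parameters for a tiny 32×32 image already; the first dense layer alone accounts for nearly all of them.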

5. Why Fully Connected Networks Fail for Images

A. They Ignore Spatial Relationships

Flattening an image into a 1D vector removes the natural 2D structure. The network must learn pixel relationships independently for every location:

  • An airplane at one position must be learned separately from an airplane shifted by a few pixels.
  • The model is not translation invariant.

B. They Require Massive Numbers of Parameters

Every output neuron connects to every input pixel, so the number of weights is the product of the input and output sizes. For image inputs, especially high-resolution ones, this product becomes enormous: a single fully connected layer on a 1024×1024 RGB image (about 3.1 million input values) can require billions of parameters.

This is computationally and memory-wise impractical.
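The arithmetic behind that claim, assuming a hypothetical hidden layer of 1024 units:

```python
# Weight count for one dense layer on a 1024x1024 RGB image,
# with a (hypothetical) hidden layer of 1024 units.
inputs = 3 * 1024 * 1024   # 3,145,728 input values
hidden = 1024
weights = inputs * hidden  # over 3.2 billion weights (biases excluded)
```

At 4 bytes per float32 weight, that single layer would need roughly 12 GB of memory before any activations or gradients are stored.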

6. Motivation for Convolutional Layers

The shortcomings of fully connected layers lead naturally to the need for convolutional neural networks (CNNs):

  • They exploit local patterns through small receptive fields.
  • They reuse parameters across spatial positions (weight sharing).
  • Combined with pooling, they are approximately translation invariant.
  • They scale efficiently to large images.

Convolutional layers are therefore the standard architecture for image tasks.
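To make the contrast concrete, the following sketch compares the parameter count of a small convolutional layer with that of a dense layer producing an output of the same size (the layer sizes are chosen for illustration):

```python
import torch.nn as nn

# Conv layer vs. dense layer on a 3x32x32 input, same output size (16x32x32).
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 16 filters of size 3x3x3
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)       # fully connected equivalent

conv_params = sum(p.numel() for p in conv.parameters())
dense_params = sum(p.numel() for p in dense.parameters())
# conv:  3*3*3*16 + 16        = 448
# dense: 3072*16384 + 16384   = 50,348,032
```

The convolutional layer needs about 100,000 times fewer parameters because its 448 weights are reused at every spatial position.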

Conclusion

  • Minibatches provide efficiency and useful randomness during training.
  • PyTorch’s DataLoader simplifies data handling.
  • Fully connected networks are prone to overfitting and do not scale well to images.
  • CNNs solve these issues by leveraging the 2D structure of images and promoting translation invariance.

These concepts form the foundation for understanding modern deep learning approaches to image classification.

What Is Translation Variance?

Translation variance refers to a model’s tendency to produce different outputs when an input image is shifted (translated) left, right, up, or down.

This is often an undesirable property in image recognition because the meaning of the image does not change if an object shifts a few pixels.

Why Translation Variance Happens

Fully connected (dense) neural networks treat an image as a large 1D vector, ignoring the spatial relationships between neighboring pixels. As a result:

  • A feature learned at one location must be relearned at every other location.
  • Shifting the object in the input produces a completely different pattern of values.
  • The model often fails to recognize the same object in a different position.

This makes the model not generalize well to translated images.

Translation Invariance vs. Translation Variance

  • Translation invariance: the model's prediction does not change when the input image is shifted. Example: a CNN recognizes a cat whether it appears at the top-left or the center.
  • Translation variance: the model's prediction does change when the image is shifted. Example: a fully connected network fails to identify the same plane if it moves a few pixels.

Why CNNs Fix Translation Variance

Convolutional neural networks achieve approximate translation invariance because they use:

  • Local receptive fields – small regions of the image are processed at a time.
  • Weight sharing – one filter slides across the whole image.

This means the same pattern can be detected anywhere in the image, allowing the model to recognize objects regardless of their position.
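Weight sharing can be demonstrated directly: shifting the input shifts the convolutional response by the same amount (strictly, convolution is translation *equivariant*; pooling then turns this into invariance). A minimal sketch with a single-pixel "feature":

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # one 3x3 filter, shared everywhere

img = torch.zeros(1, 1, 8, 8)
img[0, 0, 2, 2] = 1.0                                  # a "feature" at one position
shifted = torch.roll(img, shifts=(3, 3), dims=(2, 3))  # same feature, moved

with torch.no_grad():
    out1 = conv(img)
    out2 = conv(shifted)
# out2 is exactly out1 shifted by (3, 3): the same pattern is
# detected with the same weights, just at the new location.
```

A dense layer offers no such guarantee: the shifted input activates a completely different set of weights, which is the translation variance described above.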

Summary

  • Translation variance → predictions change when the image is shifted.
  • Fully connected networks → translation-variant (bad for images).
  • CNNs → translation-invariant (ideal for vision tasks).
