
Batch Size and Minibatches in Machine Learning


Key Concepts: Minibatches, DataLoader, and the Limits of Fully Connected Networks

This document summarizes several fundamental ideas in deep learning training pipelines, including minibatch gradient descent, PyTorch’s DataLoader, model capacity, and the limitations of fully connected (dense) networks for image data. These concepts motivate the transition to convolutional neural networks (CNNs).

1. Minibatch Gradient Descent

Training with minibatches means computing gradients on a small subset of the dataset rather than the full dataset. This introduces noise, which has important benefits:

  • Efficiency: Computing gradients on the entire dataset is slow; minibatches make training fast and scalable.
  • Useful Noise: Minibatch gradients are noisy approximations of the full-dataset gradient. This stochasticity can help the optimizer escape shallow local minima and often improves generalization.
  • Learning Rate Requirements: Because minibatch gradients fluctuate from batch to batch, the learning rate must be kept reasonably small to avoid instability.

Shuffling the dataset each epoch ensures the sequence of minibatches remains representative of the overall data distribution.
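The shuffle-then-batch idea can be sketched without any framework. The following minimal illustration uses a toy integer "dataset" and a hypothetical batch size of 4 (all names here are illustrative):

```python
import random

# Toy dataset: 10 samples, batch size 4.
data = list(range(10))
batch_size = 4

def minibatches(data, batch_size, seed=0):
    """Yield shuffled minibatches; the last batch may be smaller."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # reshuffle once per epoch
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

batches = list(minibatches(data, batch_size))
# Every sample appears exactly once per epoch, in shuffled order.
```

This is essentially what PyTorch's DataLoader automates, with the added machinery of worker processes and tensor collation.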

2. PyTorch DataLoader

The DataLoader automates:

  • Batching of samples
  • Shuffling each epoch
  • Iterating over data easily within training loops

A typical DataLoader setup (here cifar2 is a Dataset prepared earlier, e.g. two classes extracted from CIFAR-10):

train_loader = torch.utils.data.DataLoader(
    cifar2, batch_size=64, shuffle=True
)

Each iteration returns a minibatch of images and labels, ready for processing in the forward pass.

3. The Training Loop

Each training step consists of:

  • Forward pass
  • Loss computation
  • Zeroing gradients
  • Backward propagation
  • Optimizer step

Example batch shapes:

  • imgs: 64 × 3 × 32 × 32
  • labels: 64

After training, accuracy is measured on a separate validation set without tracking gradients.
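The steps above can be sketched in a few lines. The model, optimizer, and data below are hypothetical stand-ins (a tiny linear classifier and a synthetic batch with the shapes quoted above), not the original code:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny model and one synthetic CIFAR-like batch.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

imgs = torch.randn(64, 3, 32, 32)    # minibatch of images: 64 x 3 x 32 x 32
labels = torch.randint(0, 2, (64,))  # minibatch of labels: 64

# One training step.
outputs = model(imgs)            # forward pass
loss = loss_fn(outputs, labels)  # loss computation
optimizer.zero_grad()            # zero accumulated gradients
loss.backward()                  # backward propagation
optimizer.step()                 # optimizer step

# Validation-style evaluation: no gradient tracking.
with torch.no_grad():
    preds = model(imgs).argmax(dim=1)
    accuracy = (preds == labels).float().mean().item()
```

In a real run, the training step loops over train_loader for many epochs, and accuracy is computed on a held-out validation loader rather than on training images.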

4. Increasing Model Capacity and Overfitting

Adding more layers or larger layers increases the model’s capacity. This leads to:

  • Near-perfect training accuracy
  • Limited improvement in validation accuracy

This behavior indicates overfitting: the model memorizes the training set rather than learning generalizable patterns.

You can inspect the number of trainable parameters by summing p.numel() over model.parameters(). Fully connected layers tend to produce extremely large parameter counts.
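As a sketch, here is that count for a hypothetical two-layer dense classifier on 32×32 RGB inputs (the layer sizes are illustrative, not taken from the original model):

```python
import torch.nn as nn

# A small fully connected classifier for flattened 3x32x32 inputs (sketch).
model = nn.Sequential(
    nn.Linear(3 * 32 * 32, 512),
    nn.Tanh(),
    nn.Linear(512, 2),
)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
# 3072*512 + 512 (first layer) + 512*2 + 2 (second layer) = 1,574,402
```

Over 1.5 million parameters for a tiny 32×32 image already; the first dense layer alone accounts for nearly all of them.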

5. Why Fully Connected Networks Fail for Images

A. They Ignore Spatial Relationships

Flattening an image into a 1D vector removes the natural 2D structure. The network must learn pixel relationships independently for every location:

  • An airplane at one position must be learned separately from an airplane shifted by a few pixels.
  • The model is not translation invariant.

B. They Require Massive Numbers of Parameters

Every output neuron connects to every input pixel, so the number of weights is the product of the input and output sizes. For image inputs, especially high-resolution ones, this product becomes enormous: a single fully connected layer on a 1024×1024 RGB image (about 3.1 million input values) can require billions of parameters.

This is computationally and memory-wise impractical.
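The arithmetic behind that claim, assuming a hypothetical hidden layer of 1024 units:

```python
# Weight count for one dense layer on a 1024x1024 RGB image,
# with a (hypothetical) hidden layer of 1024 units.
inputs = 3 * 1024 * 1024   # 3,145,728 input values
hidden = 1024
weights = inputs * hidden  # over 3.2 billion weights (biases excluded)
```

At 4 bytes per float32 weight, that single layer would need roughly 12 GB of memory before any activations or gradients are stored.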

6. Motivation for Convolutional Layers

The shortcomings of fully connected layers lead naturally to the need for convolutional neural networks (CNNs):

  • They exploit local patterns through small receptive fields.
  • They reuse parameters across spatial positions (weight sharing).
  • Combined with pooling, they are approximately translation invariant.
  • They scale efficiently to large images.

Convolutional layers are therefore the standard architecture for image tasks.
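To make the contrast concrete, the following sketch compares the parameter count of a small convolutional layer with that of a dense layer producing an output of the same size (the layer sizes are chosen for illustration):

```python
import torch.nn as nn

# Conv layer vs. dense layer on a 3x32x32 input, same output size (16x32x32).
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)  # 16 filters of size 3x3x3
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)       # fully connected equivalent

conv_params = sum(p.numel() for p in conv.parameters())
dense_params = sum(p.numel() for p in dense.parameters())
# conv:  3*3*3*16 + 16        = 448
# dense: 3072*16384 + 16384   = 50,348,032
```

The convolutional layer needs about 100,000 times fewer parameters because its 448 weights are reused at every spatial position.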

Conclusion

  • Minibatches provide efficiency and useful randomness during training.
  • PyTorch’s DataLoader simplifies data handling.
  • Fully connected networks are prone to overfitting and do not scale well to images.
  • CNNs solve these issues by leveraging the 2D structure of images and promoting translation invariance.

These concepts form the foundation for understanding modern deep learning approaches to image classification.

What Is Translation Variance?

Translation variance refers to a model’s tendency to produce different outputs when an input image is shifted (translated) left, right, up, or down.

This is often an undesirable property in image recognition because the meaning of the image does not change if an object shifts a few pixels.

Why Translation Variance Happens

Fully connected (dense) neural networks treat an image as a large 1D vector, ignoring the spatial relationships between neighboring pixels. As a result:

  • A feature learned at one location must be relearned at every other location.
  • Shifting the object in the input produces a completely different pattern of values.
  • The model often fails to recognize the same object in a different position.

This makes the model not generalize well to translated images.

Translation Invariance vs. Translation Variance

  • Translation invariance: the model's prediction does not change when the input image is shifted. Example: a CNN recognizes a cat whether it appears at the top-left or the center.
  • Translation variance: the model's prediction does change when the image is shifted. Example: a fully connected network fails to identify the same plane if it moves a few pixels.

Why CNNs Fix Translation Variance

Convolutional neural networks achieve approximate translation invariance because they use:

  • Local receptive fields – small regions of the image are processed at a time.
  • Weight sharing – one filter slides across the whole image.

This means the same pattern can be detected anywhere in the image, allowing the model to recognize objects regardless of their position.
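Weight sharing can be demonstrated directly: shifting the input shifts the convolutional response by the same amount (strictly, convolution is translation *equivariant*; pooling then turns this into invariance). A minimal sketch with a single-pixel "feature":

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # one 3x3 filter, shared everywhere

img = torch.zeros(1, 1, 8, 8)
img[0, 0, 2, 2] = 1.0                                  # a "feature" at one position
shifted = torch.roll(img, shifts=(3, 3), dims=(2, 3))  # same feature, moved

with torch.no_grad():
    out1 = conv(img)
    out2 = conv(shifted)
# out2 is exactly out1 shifted by (3, 3): the same pattern is
# detected with the same weights, just at the new location.
```

A dense layer offers no such guarantee: the shifted input activates a completely different set of weights, which is the translation variance described above.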

Summary

  • Translation variance → predictions change when the image is shifted.
  • Fully connected networks → translation-variant (bad for images).
  • CNNs → translation-invariant (ideal for vision tasks).
