Batch Size and Minibatches in Machine Learning


Key Concepts: Minibatches, DataLoader, and the Limits of Fully Connected Networks

This document summarizes several fundamental ideas in deep learning training pipelines, including minibatch gradient descent, PyTorch’s DataLoader, model capacity, and the limitations of fully connected (dense) networks for image data. These concepts motivate the transition to convolutional neural networks (CNNs).

1. Minibatch Gradient Descent

Training with minibatches means computing gradients on a small subset of the dataset rather than the full dataset. This introduces noise, which has important benefits:

  • Efficiency: Computing gradients on the entire dataset is slow; minibatches make training fast and scalable.
  • Useful Noise: Minibatch gradients are noisy approximations of the full-dataset gradient. This stochasticity helps the optimizer escape shallow local minima and can improve generalization.
  • Learning Rate Requirements: Because minibatch gradients fluctuate from batch to batch, a reasonably small learning rate is needed to keep the updates stable.

Shuffling the dataset each epoch ensures the sequence of minibatches remains representative of the overall data distribution.
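
To make the idea concrete, here is a minimal, self-contained sketch of minibatch gradient descent with per-epoch shuffling, using a toy linear model on synthetic data (all names here are illustrative and appear nowhere else in this document):

import torch

# Toy data and model, just to make the sketch self-contained.
x, y = torch.randn(1000, 10), torch.randn(1000, 1)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 64
for epoch in range(5):
    perm = torch.randperm(x.shape[0])          # fresh shuffle each epoch
    for i in range(0, x.shape[0], batch_size):
        idx = perm[i:i + batch_size]           # one minibatch of indices
        loss = loss_fn(model(x[idx]), y[idx])  # gradient from this subset only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

The next section shows how PyTorch's DataLoader automates the shuffling and slicing done by hand above.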

2. PyTorch DataLoader

The DataLoader automates:

  • Batching of samples
  • Shuffling each epoch
  • Iterating over data easily within training loops

A typical DataLoader setup:

train_loader = torch.utils.data.DataLoader(
    cifar2, batch_size=64, shuffle=True
)

Each iteration returns a minibatch of images and labels, ready for processing in the forward pass.
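
For example, one minibatch can be pulled from the loader like this (assuming cifar2 contains 32×32 RGB images, as in CIFAR-10):

imgs, labels = next(iter(train_loader))
print(imgs.shape)    # torch.Size([64, 3, 32, 32])
print(labels.shape)  # torch.Size([64])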

3. The Training Loop

Each training step consists of:

  • Forward pass
  • Loss computation
  • Zeroing gradients
  • Backward propagation
  • Optimizer step

Example batch shapes:

  • imgs: 64 × 3 × 32 × 32
  • labels: 64

After training, accuracy is measured on a separate validation set without tracking gradients.
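
Put together, a training loop following these steps might look like the sketch below. The names model, loss_fn, optimizer, n_epochs, and val_loader are assumed to be defined elsewhere:

for epoch in range(n_epochs):
    for imgs, labels in train_loader:
        outputs = model(imgs)            # forward pass
        loss = loss_fn(outputs, labels)  # loss computation
        optimizer.zero_grad()            # zero the accumulated gradients
        loss.backward()                  # backward propagation
        optimizer.step()                 # optimizer step

# Validation accuracy, computed without tracking gradients.
correct, total = 0, 0
with torch.no_grad():
    for imgs, labels in val_loader:
        preds = model(imgs).argmax(dim=1)  # most likely class per image
        correct += (preds == labels).sum().item()
        total += labels.shape[0]
print(f"Validation accuracy: {correct / total:.2f}")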

4. Increasing Model Capacity and Overfitting

Adding more layers, or making existing layers wider, increases the model's capacity. On a small dataset this typically leads to:

  • Near-perfect training accuracy
  • Limited improvement in validation accuracy

This behavior indicates overfitting: the model memorizes the training set rather than learning generalizable patterns.

You can count a model's trainable parameters by summing p.numel() over its parameters. Fully connected layers tend to produce extremely large parameter counts.
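
A common idiom for this (model can be any torch.nn.Module):

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")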

5. Why Fully Connected Networks Fail for Images

A. They Ignore Spatial Relationships

Flattening an image into a 1D vector removes the natural 2D structure. The network must learn pixel relationships independently for every location:

  • An airplane at one position must be learned separately from an airplane shifted by a few pixels.
  • The model is not translation invariant.

B. They Require Massive Numbers of Parameters

Every output neuron connects to every input pixel, so the parameter count grows with the product of the input and output sizes. For high-resolution inputs this explodes: a single fully connected layer mapping a 1024×1024 RGB image (about 3.1 million input values) to just 1,024 hidden units already requires over 3 billion parameters.

This is computationally and memory-wise impractical.
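
The arithmetic is easy to verify (the layer sizes below are the illustrative ones from the example above):

in_features = 3 * 1024 * 1024                         # flattened 1024x1024 RGB image
out_features = 1024                                   # a modest hidden layer
n_params = in_features * out_features + out_features  # weights + biases
print(f"{n_params:,}")                                # 3,221,226,496 -- over 3 billion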

6. Motivation for Convolutional Layers

The shortcomings of fully connected layers lead naturally to the need for convolutional neural networks (CNNs):

  • They exploit local patterns through small receptive fields.
  • They reuse parameters across spatial positions (weight sharing).
  • Combined with pooling, they are approximately translation invariant.
  • They scale efficiently to large images.

Convolutional layers are therefore the standard architecture for image tasks.
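
A quick comparison makes the weight-sharing point concrete: a convolutional layer's parameter count depends only on its kernel size and channel counts, never on the image resolution (the layer sizes below are illustrative):

import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)  # 3 input channels, 16 filters, 3x3 kernels
print(sum(p.numel() for p in conv.parameters()))  # 448 = 16*(3*3*3) + 16 biases

The same 448 parameters serve a 32×32 image or a 1024×1024 one, because each filter slides over the whole input.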

Conclusion

  • Minibatches provide efficiency and useful randomness during training.
  • PyTorch’s DataLoader simplifies data handling.
  • Fully connected networks are prone to overfitting and do not scale well to images.
  • CNNs solve these issues by leveraging the 2D structure of images and promoting translation invariance.

These concepts form the foundation for understanding modern deep learning approaches to image classification.

What Is Translation Variance?

Translation variance refers to a model’s tendency to produce different outputs when an input image is shifted (translated) left, right, up, or down.

This is often an undesirable property in image recognition because the meaning of the image does not change if an object shifts a few pixels.

Why Translation Variance Happens

Fully connected (dense) neural networks treat an image as a large 1D vector, ignoring the spatial relationships between neighboring pixels. As a result:

  • A feature learned at one location must be relearned at every other location.
  • Shifting the object in the input produces a completely different pattern of values.
  • The model often fails to recognize the same object in a different position.

As a result, the model generalizes poorly to translated images.

Translation Invariance vs. Translation Variance

  • Translation invariance: the model's prediction does not change when the input image is shifted. Example: a CNN recognizes a cat regardless of whether it appears at the top-left or the center.
  • Translation variance: the model's prediction does change when the image is shifted. Example: a fully connected network fails to identify the same plane if it moves a few pixels.

Why CNNs Fix Translation Variance

Convolutional neural networks largely eliminate translation variance because they use:

  • Local receptive fields – small regions of the image are processed at a time.
  • Weight sharing – one filter slides across the whole image.

This means the same pattern can be detected anywhere in the image, allowing the model to recognize objects regardless of their position.
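
This is easy to verify directly. The sketch below uses a circular shift and circular padding so that border effects do not cloud the comparison; it shows that a convolution's output simply shifts along with its input (translation equivariance, which pooling then turns into approximate invariance), while a fully connected layer's output changes arbitrarily:

import torch
import torch.nn as nn

torch.manual_seed(0)
img = torch.randn(1, 1, 8, 8)                # a tiny single-channel "image"
shifted = torch.roll(img, shifts=2, dims=3)  # shift the image 2 pixels to the right

# Circular padding makes the equivariance exact for circular shifts.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular', bias=False)
fc = nn.Linear(64, 64, bias=False)

# Convolution: shifting the input just shifts the output.
print(torch.allclose(torch.roll(conv(img), 2, dims=3), conv(shifted)))  # True

# Fully connected layer: the shifted input produces an unrelated output.
print(torch.allclose(fc(img.flatten(1)), fc(shifted.flatten(1))))       # False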

Summary

  • Translation variance → predictions change when the image is shifted.
  • Fully connected networks → translation-variant (bad for images).
  • CNNs → translation-invariant (ideal for vision tasks).
