1. Preamble

In many specialized fields, such as medical imaging analysis, industrial defect detection, or specific scientific research projects, we often face a common challenge: small-scale image classification. Unlike public datasets with millions of samples, the data volume in these scenarios might only be a few hundred or a few thousand images, making it extremely precious.

Under such data scarcity, model training is prone to two failure modes. The most common is over-fitting, where the model memorizes the details and noise of the training data and consequently performs poorly on unseen data. Occasionally, if the data itself is very noisy or the chosen model is too simple, under-fitting may occur instead, where the model fails to capture the underlying patterns in the data. Our core objective is to improve the model’s generalization through a set of practical techniques while keeping training resources and development costs under control.

2. Under-fitting vs. Over-fitting

To apply the right remedy, we must first accurately understand, diagnose, and differentiate between under-fitting and over-fitting. They represent two forms of imbalance between a model’s learning capacity and the complexity of the data.

| Dimension | Under-fitting | Over-fitting |
| --- | --- | --- |
| Definition | The model fails to capture the underlying patterns in the data, exhibiting high bias. | The model learns the noise in the training data as if it were a pattern, exhibiting high variance. |
| Learning Curve | Both training and validation loss are high and converge early. | Training loss decreases continuously, while validation loss starts to increase after a certain point. |
| Analogy | Trying to fit a complex “S”-shaped curve with a straight line. | Trying to draw a complex path that passes through every single training point. |
| Trigger | The model’s capacity is much lower than the complexity of the data. | The model’s capacity is much higher than the amount of information in the data. |
| Diagnosis | Learning curves are parallel and high; bias cannot be reduced. | The gap between training and validation accuracy is large (>5-10%); predictions are often wrong but with high confidence. |
| Solutions | ① Increase model capacity or add features. ② Train longer or tune hyperparameters. ③ Reduce regularization. | ① Data augmentation or collect more data. ② Use regularization (L1/L2, Dropout, Early Stopping). ③ Simplify the model. ④ Use model ensembling or distillation. |

3. Data Augmentation under Data Scarcity

Among all strategies for mitigating over-fitting, data augmentation is one of the most direct and effective methods. It expands the information content of the dataset without additional labeling costs by applying a series of random transformations to the existing training data, creating more diverse samples that the model has “never seen” before.

3.1 Geometric Transformations

These transformations simulate variations in an object’s pose, scale, and position that might occur in the real world. Common operations include: random rotation (±15° to ±30°), random scaling (0.8× to 1.2×), and random translation (≤10% of image dimensions). More complex transformations like Elastic Distortion and Perspective Transformation can provide even stronger generalization capabilities.
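For reference, these ranges map almost one-to-one onto torchvision’s RandomAffine. A minimal sketch (the exact values are illustrative, not tuned recommendations):

import torchvision.transforms as T

# One transform covering rotation, scaling, and translation
geometric_tf = T.RandomAffine(
    degrees=20,            # random rotation in [-20°, +20°]
    scale=(0.8, 1.2),      # random scaling between 0.8x and 1.2x
    translate=(0.1, 0.1),  # shift by up to 10% of width/height
)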

3.2 Pixel-level Perturbations

These transformations aim to improve the model’s robustness to changes in image quality. For instance, adding Gaussian Noise (e.g., σ=0.01-0.05 × 255) or Salt-and-Pepper Noise (e.g., ratio=0.3-0.5%) to the image, and applying GaussianBlur (kernel size=3 or 5) or MotionBlur.
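GaussianBlur ships with torchvision, but Gaussian noise is usually added with a small custom transform placed after ToTensor. A minimal sketch, assuming tensors scaled to [0, 1] (the sigma range mirrors the 0.01-0.05 figure above):

import torch

class AddGaussianNoise:
    def __init__(self, sigma=(0.01, 0.05)):
        self.sigma = sigma  # range of noise standard deviations

    def __call__(self, img):
        # Sample a random sigma, add zero-mean noise, clamp back to [0, 1]
        sigma = torch.empty(1).uniform_(*self.sigma).item()
        return (img + torch.randn_like(img) * sigma).clamp(0.0, 1.0)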

3.3 Color Space Transformations

Transformations in the color space enhance the model’s adaptability to variations in lighting, contrast, and color. ColorJitter is a common tool that randomly adjusts an image’s brightness, contrast, and saturation (e.g., between 0.8 and 1.2 times the original). Converting an image to grayscale and then duplicating it across three channels is another effective technique to force the model to focus on shape rather than color.
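Both operations are available out of the box in torchvision; a minimal sketch (the 0.1 grayscale probability is an assumption, not a recommendation from above):

import torchvision.transforms as T

color_tf = T.Compose([
    # Jitter brightness/contrast/saturation by up to ±20%
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    # Occasionally drop color entirely; the output keeps 3 identical
    # channels, forcing the model to rely on shape rather than color
    T.RandomGrayscale(p=0.1),
])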

3.4 Synthetic/Mixing Methods

In recent years, augmentation methods that mix information from multiple images have become very popular. They create samples that do not exist in the real world but are highly beneficial for model regularization.

| Method | Core Idea | Recommended Probability (p) | Notes |
| --- | --- | --- | --- |
| Cutout/GridMask | Randomly erases one or more rectangular patches from an image. | p=0.3 | Be careful not to occlude key objects in detection/segmentation tasks. |
| Mixup | Linearly interpolates two images and their labels using a ratio λ sampled from a Beta distribution. | 1.0 (applied to every batch) | α=0.2 is a common hyperparameter. |
| CutMix | Cuts a patch from one image and pastes it onto another, with label weights adjusted by the patch area. | p=0.5 | α=1.0 is a common hyperparameter. |

3.5 Auto-Augmentation

Manually designing augmentation policy combinations is time-consuming. Thus, auto-augmentation methods were developed. RandAugment is a simple yet effective method that randomly selects N transformations from a predefined pool and applies them with a uniform magnitude M. It often yields significant gains on small datasets (e.g., with N=2, M=9). The more complex AutoAugment uses reinforcement learning to search for the optimal policy combination but is computationally expensive.
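RandAugment is built into torchvision (0.11+), so trying the N=2, M=9 setting above is a one-line change to the pipeline. A minimal sketch:

import torchvision.transforms as T

auto_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandAugment(num_ops=2, magnitude=9),  # N=2 ops per image at magnitude M=9
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])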

4. Implementation in PyTorch

Here, we demonstrate how to build a transform pipeline in PyTorch that integrates several augmentation strategies, and how to implement plug-and-play Mixup/CutMix helpers.

4.1 Transform Pipeline

import torchvision.transforms as T
import torch
import numpy as np
from PIL import Image

# Custom Salt-and-Pepper noise transform. It operates on PIL images,
# so it must sit before ToTensor in the pipeline.
class SaltPepperNoise:
    def __init__(self, ratio=0.003):
        self.ratio = ratio

    def __call__(self, img):
        arr = np.array(img)
        mask = np.random.rand(*arr.shape[:2])
        arr[mask < self.ratio / 2] = 0        # "pepper": black pixels
        arr[mask > 1 - self.ratio / 2] = 255  # "salt": white pixels
        return Image.fromarray(arr)

train_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(20),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    # SaltPepperNoise(0.003), # Custom transform
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])

This pipeline integrates various geometric and color transformations. Note that the Normalize step is typically placed at the end, and its mean and standard deviation should be set based on the dataset used for the pre-trained model (e.g., ImageNet).

4.2 Plug-and-Play Mixup/CutMix

def mixup_data(x, y, alpha=0.2, use_cuda=True):
    '''Returns mixed inputs, pairs of targets, and lambda'''
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size()[0]
    if use_cuda:
        index = torch.randperm(batch_size).cuda()
    else:
        index = torch.randperm(batch_size)

    mixed_x = lam * x + (1 - lam) * x[index, :]
    y_a, y_b = y, y[index]
    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

In your training loop, you can fetch inputs and targets from the DataLoader, call mixup_data to generate mixed data, and then compute the mixed loss using mixup_criterion.
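A minimal sketch of that loop (model, optimizer, criterion, and train_loader are hypothetical stand-ins for your own objects):

# Hypothetical training-loop fragment showing where Mixup slots in
for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    inputs, y_a, y_b, lam = mixup_data(inputs, targets, alpha=0.2)
    outputs = model(inputs)
    loss = mixup_criterion(criterion, outputs, y_a, y_b, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

CutMix plugs into the same loop; only the mixing step differs. A sketch following the original paper’s formulation (α=1.0, as in the table in Section 3.4) — note that mixup_criterion can be reused unchanged for the loss:

def rand_bbox(size, lam):
    '''Sample a random box covering roughly (1 - lam) of the image area.'''
    H, W = size[2], size[3]
    cut_rat = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(H * cut_rat), int(W * cut_rat)
    cy, cx = np.random.randint(H), np.random.randint(W)
    y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
    x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
    return y1, y2, x1, x2

def cutmix_data(x, y, alpha=1.0):
    '''Paste a random patch from a shuffled copy of the batch into each image.'''
    lam = np.random.beta(alpha, alpha)
    index = torch.randperm(x.size(0), device=x.device)
    y1, y2, x1, x2 = rand_bbox(x.size(), lam)
    x[:, :, y1:y2, x1:x2] = x[index, :, y1:y2, x1:x2]
    # Recompute lambda from the actual patch area so the label weights match
    lam = 1 - ((y2 - y1) * (x2 - x1)) / (x.size(2) * x.size(3))
    return x, y, y[index], lam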

4.3 Learning Curve Monitoring

Throughout the training process, it is crucial to continuously monitor the accuracy and loss curves for both the training and validation sets. A key practice is Early Stopping: when the validation loss stops decreasing and starts to rise for N consecutive epochs, training should be halted, and the model weights with the best previous validation performance should be saved. This effectively prevents the model from over-fitting in the later stages of training.
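A minimal early-stopping sketch (train_one_epoch, evaluate, max_epochs, and the patience value are hypothetical illustrations, not a prescribed API):

best_val_loss, patience, wait = float('inf'), 10, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)    # hypothetical helper
    val_loss = evaluate(model, val_loader)  # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss, wait = val_loss, 0
        torch.save(model.state_dict(), 'best_model.pt')  # keep the best weights
    else:
        wait += 1
        if wait >= patience:  # N consecutive epochs without improvement
            break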

5. Integrated Tuning Workflow

A systematic tuning process can help you maximize the effectiveness of data augmentation:

  1. Establish Baseline: First, train the model without any data augmentation and record its accuracy and loss as a baseline.
  2. Phase 1: Add basic geometric and color transformations; the learning rate can be kept the same or slightly increased.
  3. Phase 2: Introduce Mixup or CutMix, often in conjunction with Label Smoothing for better results (see the snippet after this list).
  4. Phase 3: If computational resources permit, try using RandAugment or AutoAugment to automatically search for the optimal augmentation policy.
  5. Phase 4: Finally, fine-tune the model’s capacity (e.g., network depth) and regularization parameters (e.g., Dropout rate, weight decay) based on the augmented data distribution.
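For reference, the Label Smoothing mentioned in Phase 2 is built into PyTorch’s cross-entropy loss (1.10+); 0.1 is a common default rather than a tuned value:

import torch.nn as nn

# Targets become a blend of the one-hot label and a uniform distribution
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)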

6. Common Pitfalls & Practical Tips

  • Over-augmentation: Excessive augmentation can cause a significant distribution shift between the training set and the actual test set, thereby hurting performance. It’s advisable to start with a small application probability (e.g., p=0.3).
  • Label Synchronization: In object detection or semantic segmentation tasks, when applying geometric transformations to an image, the exact same transformations must be applied to the bounding boxes or masks.
  • Validation Set Purity: The validation set should be kept as “clean” as possible to accurately reflect the model’s performance on the original data distribution. Typically, only necessary resizing and center cropping are applied, without heavy augmentation (a typical pipeline is sketched after this list).
  • Mixup with Small Batches: Using Mixup with a very small batch size can excessively dilute the original signal. In such cases, consider increasing the batch size or adjusting the learning rate.
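A typical “clean” validation pipeline to pair with the training transform from Section 4.1 — fully deterministic, with only resizing, cropping, and normalization:

val_tf = T.Compose([
    T.Resize(256),        # deterministic resize, no random scaling
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])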

7. Conclusion & Further Exploration

When faced with the challenge of data scarcity, our core objective is to expand the diversity of the data through augmentation, not just to increase its quantity. The struggle against under-fitting and over-fitting is essentially about finding a delicate balance between the model’s capacity and the data’s information content.

The strategies introduced in this article are a classic starting point for solving this problem. Building on this foundation, you can further explore more advanced directions, such as: using unlabeled data for Self-supervised Pre-training, synthesizing new data with generative models (like Diffusion Models) or 3D rendering, and applying Semi-supervised Learning techniques.

Ge Yuxu • AI & Engineering


Data Desensitization Notice: All table names, field names, API endpoints, variable names, IP addresses, and sample data appearing in this article are fictitious and intended solely to illustrate technical concepts and implementation steps. The sample code is not actual company code. The proposed solutions are not complete or actual company solutions but are summarized from the author's memory for technical learning and discussion.
    • Any identifiers shown in the text do not correspond to names or numbers in any actual production environment.
    • Sample SQL, scripts, code, and data are for demonstration purposes only, do not contain real business data, and lack the full context required for direct execution or reproduction.
    • Readers who wish to reference the solutions in this article for actual projects should adapt them to their own business scenarios and data security standards, using configurations that comply with internal naming and access control policies.


Copyright Notice: The copyright of this article belongs to the original author. Without prior written permission from the author, no entity or individual may copy, reproduce, excerpt, or use it for commercial purposes in any way.
    • For non-commercial citation or reproduction of this content, attribution must be given, and the integrity of the content must be maintained.
    • The author reserves the right to pursue legal action against any legal disputes arising from the commercial use, alteration, or improper citation of this article's content.

Copyright © 1989–Present Ge Yuxu. All Rights Reserved.