Small sample size problems: learning from very few training examples
By Patrick Glauner  |  Dec 02, 2022
Conventional wisdom often holds that huge amounts of data are necessary to train models and algorithms to make them more accurate. However, this is not always feasible. Sometimes one must figure out how to learn and do the best one can from data that are available, even if they are not ideal.

DEGGENDORF, GERMANY - The Big Data paradigm followed in modern machine learning (ML) reflects the desire to derive better conclusions by simply analyzing more data. Big Data has enabled the field of ML to achieve breakthroughs in computer vision, natural language processing, and other disciplines in recent years. These include high-accuracy image detectors such as YOLO [1] and powerful sequence-to-sequence models such as GPT-3 [2], to name just a few. However, in many real-world problems, only a small number of training examples is available.

Examples include biometric identification, the analysis of computed tomography (CT) scans, and microarray experiments. Reasons for this lack of data include the cost of collecting such data as well as regulatory constraints like data protection.


Small sample sizes

In a typical Big Data-driven ML use case, such as learning to classify images of the ImageNet database [3], the number of training examples m is much greater than the number of features n per example. In small sample size problems, however, this is not the case: m is smaller than n, or possibly even much smaller than n. Other definitions of small sample size problems refer instead to the number of training examples m_c for a class c, and call the problem small sample size when m_c is smaller than n.
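Both definitions can be checked directly on a data set. The following is a minimal sketch, assuming the data are held in a NumPy array X of shape (m, n) with one label per row in y; the function name small_sample_report and the synthetic data are illustrative assumptions, not part of the original article.

```python
import numpy as np


def small_sample_report(X, y):
    """Check both small sample size definitions for a data set.

    X is an (m, n) array of m examples with n features each,
    y holds one class label per example.
    Returns (m, n, is_small, small_classes), where is_small is True when
    m < n and small_classes maps each class c with m_c < n to its count m_c.
    """
    m, n = X.shape
    classes, counts = np.unique(y, return_counts=True)
    small_classes = {
        c.item(): int(m_c) for c, m_c in zip(classes, counts) if m_c < n
    }
    return m, n, m < n, small_classes


# Hypothetical data: 20 examples with 100 features each, two classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.array([0] * 15 + [1] * 5)

m, n, is_small, small_classes = small_sample_report(X, y)
print(m, n, is_small, small_classes)
```

Here m = 20 is smaller than n = 100, and both classes also have fewer than n examples, so the data set counts as small under either definition.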


Consequences and mitigation

Models trained on small sample size training sets may overfit the training set and thus generalize poorly to unseen data. Some models, however, may not even be trainable at all on small sample size training sets! This will be shown in detail in the following paragraph for linear regression. The most obvious mitigation strategy might seem to be to simply collect more data, but looking at the examples mentioned earlier of biometric identifica
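One common way to see why linear regression can fail here is the normal equation formulation, in which the n x n matrix X^T X must be inverted. This is a hedged sketch under that assumption, and it is not necessarily the derivation the article itself goes on to give. With m < n, the rank of X^T X is at most m, so the matrix is singular. The data below are synthetic and the labels are pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 20                      # fewer examples than features
X = rng.normal(size=(m, n))       # synthetic design matrix
y = rng.normal(size=m)            # labels are pure noise

# The normal equation needs the inverse of the n x n matrix X^T X.
# Its rank is at most m, so for m < n it is singular.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank, "of", n)

# A least-squares solver still returns a (minimum-norm) solution,
# but with m < n it can fit even random labels exactly: overfitting.
theta, _, rank_X, _ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.max(np.abs(X @ theta - y))
print("largest training residual:", train_error)
```

The near-zero training residual on random labels illustrates the overfitting risk described above: the model has enough free parameters to reproduce any set of m targets, so a perfect fit says nothing about performance on unseen data.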

The content herein is subject to copyright by The Yuan. All rights reserved. The content of the services is owned or licensed to The Yuan. The copying or storing of any content for anything other than personal use is expressly prohibited without prior written permission from The Yuan, or the copyright holder identified in the copyright notice contained in the content.