Small sample size problems: learning from very few training examples


By Patrick Glauner | Dec 02, 2022


Conventional wisdom often holds that huge amounts of data are necessary to train models and algorithms to make them more accurate. However, this is not always feasible. Sometimes one must figure out how to learn and do the best one can from data that are available, even if they are not ideal.

DEGGENDORF, GERMANY - The Big Data paradigm followed in modern machine learning (ML) reflects the desire to derive better conclusions by simply analyzing more data. In recent years, Big Data has enabled ML to achieve breakthroughs in computer vision, natural language processing, and other disciplines. These include high-accuracy object detectors such as YOLO¹ and powerful language models such as GPT-3,² to name just a few. In many real-world problems, however, only a small number of training examples is available.

Examples of such problems include biometric identification, the analysis of computed tomography (CT) scans, and microarray experiments. The lack of data may stem from the cost of collecting it or from regulatory constraints such as data protection.

Small sample sizes

In a typical Big Data-driven ML use case, such as learning to classify images of the ImageNet database,³ the number of training examples m is much greater than the number of features n per example. In small sample size problems, however, this is not the case: m is smaller than n, or possibly even much smaller than n. Other definitions of small sample size problems refer instead to the number of training examples m_c of a class c, and apply when m_c is smaller than n.
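Both definitions can be checked directly on a dataset. The following is a minimal sketch, assuming the data are held in a NumPy array X with one row per training example and one column per feature, plus a label vector y. The function name sample_size_report and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np


def sample_size_report(X, y):
    """Compare the number of examples (overall and per class) with the number of features."""
    m, n = X.shape  # m training examples, n features per example
    classes, counts = np.unique(y, return_counts=True)
    return {
        "m": m,
        "n": n,
        # First definition: fewer training examples than features
        "small_overall": m < n,
        # Second definition: classes c whose example count m_c is below n
        "small_classes": [
            c for c, m_c in zip(classes.tolist(), counts.tolist()) if m_c < n
        ],
    }


# Synthetic illustration: 20 examples with 100 features each, two classes
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))
y = np.array([0] * 15 + [1] * 5)

report = sample_size_report(X, y)
print(report)
```

Note that a dataset can satisfy the per-class definition for a rare class even when m is larger than n overall.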

Consequences and mitigation

Models trained on small sample size training sets may overfit to the training set and thus generalize poorly to unseen data. Some models, however, may not even be trainable at all on small sample size training sets! This will be shown in detail in the following paragraph for linear regression. The most obvious mitigation strategy might seem to be to simply collect more data, but looking at the examples mentioned earlier of biometric identifica
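The linear regression case mentioned above can be illustrated numerically. The sketch below assumes the argument concerns ordinary least squares, which the text available here does not spell out: with m examples and n features, the design matrix X has rank at most m, so for m smaller than n the n-by-n matrix X^T X is singular and the least-squares solution is not unique. All variable names and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 20  # fewer training examples than features
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

# rank(X^T X) equals rank(X), which cannot exceed m. For m < n the
# n-by-n matrix X^T X is therefore singular and has no inverse.
rank = np.linalg.matrix_rank(X)
print("rank of X:", rank, "| number of features:", n)

# lstsq still returns a solution: the one with the smallest norm.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last row of Vt from the full SVD lies in the null space of X,
# so adding any multiple of it leaves the training predictions unchanged.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]
theta_alt = theta + 3.0 * null_vec

print("theta fits training data:    ", np.allclose(X @ theta, y))
print("theta_alt fits training data:", np.allclose(X @ theta_alt, y))
print("parameters identical:        ", np.allclose(theta, theta_alt))
```

Two different parameter vectors reproduce the training targets exactly, so the training data alone cannot determine which of them will generalize.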

