More data is better? Bias in big datasets impacts ML models
By Patrick Glauner  |  Oct 24, 2022
AI professor and entrepreneur Patrick Glauner examines why more data is not always better, arguing that addressing biases and other underlying shortcomings matters more than simply broadening sample sizes or increasing the amount of data analyzed.

DEGGENDORF, GERMANY - The underlying paradigm of big data-driven machine learning (ML) reflects the desire to derive better conclusions simply by analyzing more data, without needing to look at theory and models. However, the question remains: Is it always helpful to simply have more data?

This article discusses the issue of bias, which occurs whenever the distributions of a training set and a test set differ. Such biases often trigger a cascade of consequences: models trained on biased datasets will likely be biased too, and thus end up making biased predictions. Examining the issue of bias in datasets will help raise awareness of this topic among researchers and practitioners, and thereby help them derive more reliable models for their learning problems.
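To make the effect of differing training and test distributions concrete, here is a minimal sketch in Python. It is an illustration only and not taken from the article: the one-dimensional data, the midpoint-threshold classifier, and the names `fit_threshold` and `accuracy` are all hypothetical. The same classifier is scored on a test set that resembles the training data and on one that has shifted.

```python
# Hypothetical 1-D data: each example is (feature value, label).
# All numbers are invented for illustration.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
# Test set drawn from roughly the same distribution as the training set.
test_same = [(1.5, 0), (2.5, 0), (6.5, 1), (7.5, 1)]
# Shifted test set: both classes now sit at higher feature values.
test_shifted = [(4.0, 0), (5.0, 0), (5.5, 0), (9.0, 1), (10.0, 1)]


def fit_threshold(data):
    """Place the decision threshold midway between the two class means."""
    class0 = [x for x, y in data if y == 0]
    class1 = [x for x, y in data if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def accuracy(threshold, data):
    """Fraction of examples where 'x > threshold' matches the label."""
    correct = sum(1 for x, y in data if (x > threshold) == (y == 1))
    return correct / len(data)


threshold = fit_threshold(train)
acc_same = accuracy(threshold, test_same)
acc_shifted = accuracy(threshold, test_shifted)
print(f"threshold={threshold}, same={acc_same}, shifted={acc_shifted}")
```

In this toy setup the threshold learned from the training data misclassifies some of the shifted examples, so accuracy drops on the shifted test set even though the model itself has not changed.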

For approximately the last two decades, the big data paradigm that has dominated research in ML can be summarized as follows: It is not who has the best algorithm that wins, but who has the most data.

Examples of bias

Biases can take many forms. Consider a simple example: a spam filter is trained on a big dataset of positive (spam) and negative (non-spam) examples. However, that training set was created a few years ago. Recent spam emails differ in two ways: their content has changed, and so has the proportion of spam among all emails sent. As a result, the spam filter not only fails to detect spam reliably, but also becomes less reliable over time.
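The two kinds of change in the spam example can be measured separately. The following Python sketch is a hypothetical illustration rather than the author's method: the toy emails are invented, and total variation distance over spam word frequencies is just one assumed way to quantify a change in content. It compares the spam proportion and the spam vocabulary of an old training set with those of recent mail.

```python
from collections import Counter

# Hypothetical toy corpora of (text, is_spam) pairs, invented for illustration.
old_emails = [
    ("win a free prize now", True),
    ("free prize waiting", True),
    ("meeting at noon", False),
    ("lunch tomorrow", False),
    ("project status update", False),
    ("see you at the meeting", False),
]
recent_emails = [
    ("crypto wallet verify now", True),
    ("verify your crypto account", True),
    ("crypto bonus inside", True),
    ("meeting at noon", False),
    ("project status update", False),
]


def spam_rate(emails):
    """Proportion of emails labeled as spam."""
    return sum(1 for _, is_spam in emails if is_spam) / len(emails)


def spam_word_distribution(emails):
    """Relative word frequencies over the spam emails only."""
    counts = Counter(
        word for text, is_spam in emails if is_spam for word in text.split()
    )
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 for identical, 1 for disjoint distributions."""
    words = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in words)


old_rate = spam_rate(old_emails)
recent_rate = spam_rate(recent_emails)
content_shift = total_variation(
    spam_word_distribution(old_emails), spam_word_distribution(recent_emails)
)
print(f"spam rate: {old_rate:.2f} -> {recent_rate:.2f}")
print(f"content shift (total variation): {content_shift:.2f}")
```

A large gap between the two spam rates points to a change in the proportion of spam, while a total variation distance close to 1 indicates that the spam vocabulary has changed almost entirely. Either signal suggests the old training set no longer represents current mail.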

‘Bias’ can mean many different things

More generally, in the field of ML the term ‘bias’ is multifaceted and describes different matters:
