More data is better? Bias in big datasets impacts ML models

By Patrick Glauner | Oct 24, 2022

AI professor and entrepreneur Patrick Glauner examines why more data is not always better, and argues that addressing biases and other underlying shortcomings matters more than simply broadening sample sizes or increasing the amount of data analyzed.

DEGGENDORF, GERMANY - The underlying paradigm of big data-driven machine learning (ML) reflects the desire to derive better conclusions simply by analyzing more data, without needing to look at theory and models. However, the question remains: Is it always helpful to simply have more data?

This article discusses the issue of bias, which occurs whenever the distributions of a training set and a test set differ. Such biases often trigger a cascade of consequences: models trained on biased datasets are likely to be biased themselves, and therefore to make biased predictions. Examining bias in datasets will help raise awareness of the topic among researchers and practitioners, and thereby help them derive more reliable models for their learning problems.
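One way to make this definition concrete is to measure how far apart two empirical distributions are. The sketch below is only an illustration, not a method taken from the article: it uses total variation distance as one possible measure, and the function names and label counts are hypothetical.

```python
from collections import Counter


def empirical_distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(train_values, test_values):
    """Total variation distance between two empirical distributions.

    0.0 means identical frequencies; 1.0 means the two sets share no values.
    """
    p = empirical_distribution(train_values)
    q = empirical_distribution(test_values)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Hypothetical label sets: 20% positives in training, 50% at test time.
train_labels = ["positive"] * 20 + ["negative"] * 80
test_labels = ["positive"] * 50 + ["negative"] * 50

shift = total_variation_distance(train_labels, test_labels)
print(f"Total variation distance: {shift:.2f}")
```

A value close to zero suggests the two sets have similar label frequencies. A larger value is a hint, not proof, that a model trained on one set may behave differently on the other.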

For approximately the last two decades, the big data paradigm that has dominated research in ML can be summarized as follows: It is not who has the best algorithm that wins, but who has the most data.

Examples of bias

Biases can appear in many different forms. A simple example is a good place to start: A spam filter is trained on a big dataset that consists of positive and negative examples. However, that training set was created a few years ago. Recent spam emails differ in two ways: their content has changed, and so has the proportion of spam among all emails sent. As a result, the spam filter not only fails to detect spam reliably, but also becomes even less reliable over time.
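The effect of the second change, a shift in the proportion of spam, can be sketched with a little arithmetic. The example below assumes a filter whose detection rates stay fixed while the share of spam falls. All numbers and names are hypothetical, and the article does not say in which direction the proportion moved.

```python
def precision(spam_rate, true_positive_rate, false_positive_rate):
    """Fraction of flagged emails that really are spam.

    spam_rate is the share of spam among all emails; the two rates
    describe how often the filter flags spam and legitimate email.
    """
    flagged_spam = true_positive_rate * spam_rate
    flagged_ham = false_positive_rate * (1.0 - spam_rate)
    return flagged_spam / (flagged_spam + flagged_ham)


# Hypothetical filter: flags 95% of spam and 2% of legitimate email.
TPR, FPR = 0.95, 0.02

old = precision(0.50, TPR, FPR)  # assumed spam share when the data was collected
new = precision(0.10, TPR, FPR)  # assumed spam share today

print(f"Precision at training-time spam share: {old:.3f}")
print(f"Precision at current spam share:       {new:.3f}")
```

Under these assumptions, a larger share of flagged emails turns out to be legitimate even though the filter itself has not changed. A change in spam content would come on top of this, since it would also lower the detection rate that is held fixed here.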

‘Bias’ can mean many different things

More generally, in the field of ML the term ‘bias’ is multifaceted and describes different matters:
