DEGGENDORF, GERMANY - The underlying paradigm of big data-driven machine learning (ML) reflects the desire to derive better conclusions simply by analyzing more data, without the need to consult theory and models. However, the question remains: Is it always helpful to simply have more data?
This article discusses the issue of bias, which occurs whenever the distributions of a training set and test set are different. Such biases often result in a cascade of consequences, since models trained on biased datasets will likely be biased as well, and thus end up making biased predictions. Examining the issue of bias in datasets will help increase awareness of this topic among researchers and practitioners, and thereby help them derive more reliable models for their learning problems.
For approximately the last two decades, the big data paradigm that has dominated research in ML can be summarized as follows: It is not who has the best algorithm that wins, but who has the most data.
Examples of bias
Biases can appear in many different forms. One can begin by looking at a simple one: A spam filter is trained on a big dataset that consists of positive and negative examples. However, that training set was created a few years ago. Recent spam emails differ from it in two ways: the content of spam emails has changed, and so has the proportion of spam among all emails sent. As a result, the spam filter not only fails to detect spam reliably, but also becomes even less reliable over time.
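To make the second of these two changes concrete, the following is a minimal sketch of how a filter's output could be re-weighted when only the proportion of spam has changed. It is an illustration and not a method taken from this article: the function name `adjust_for_prior_shift` and its parameters are hypothetical, and the sketch assumes that the filter outputs a well-calibrated spam probability and that the content of spam and non-spam emails is unchanged. It therefore does nothing about the first change, the shift in content.

```python
def adjust_for_prior_shift(p_spam, train_prior, new_prior):
    """Re-weight a spam probability for a changed overall spam rate.

    Assumption (not from the article): only the class proportions changed,
    while the emails within each class look the same as in training.

    p_spam      -- probability of spam predicted by the old filter
    train_prior -- fraction of spam in the training set (0 < value < 1)
    new_prior   -- fraction of spam in current traffic (0 < value < 1)
    """
    # Scale each class by the ratio of its new share to its training share.
    spam_weight = p_spam * new_prior / train_prior
    ham_weight = (1.0 - p_spam) * (1.0 - new_prior) / (1.0 - train_prior)
    # Renormalize so the two class probabilities sum to one again.
    return spam_weight / (spam_weight + ham_weight)


if __name__ == "__main__":
    # The filter was trained on 50% spam, but only 20% of today's mail is spam.
    # An email the old filter considers a coin flip becomes less suspicious.
    print(adjust_for_prior_shift(p_spam=0.5, train_prior=0.5, new_prior=0.2))
```

When the two proportions are equal, the function returns the original probability unchanged, which is a quick sanity check on the re-weighting. A change in the content of spam emails cannot be repaired this way and would generally call for newer training data.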
‘Bias’ can mean many different things
More generally, in the field of ML the term ‘bias’ is multifaceted and describes different matters:
The content herein is subject to copyright by The Yuan. All rights reserved. The content of the services is owned or licensed to The Yuan. Such content from The Yuan may be shared and reprinted but must clearly identify The Yuan as its original source. Content from a third-party copyright holder identified in the copyright notice contained in such third party’s content appearing in The Yuan must likewise be clearly labeled as such.