Dataset’s Critical Role in Creating Correct Predictive Models


By Jan Sevcik | Oct 21, 2021

Image courtesy of and under license from Shutterstock.com

The digitization of medical records, new machine learning techniques, and more robust hardware are creating opportunities to answer many open questions in healthcare, but using the correct dataset is critical to creating predictive models that draw the correct conclusions. Even the most advanced AI techniques will yield poor results if the wrong dataset is used. Jan Sevcik discusses the conundrum of variables and the problems scientists face.

CHATTANOOGA, TENNESSEE - Two important considerations when choosing a dataset are the number of variables and the volume of data. Because current artificial intelligence (AI) techniques allow more efficient analysis of datasets with many variables, data from electronic health records (EHRs) are becoming the norm in healthcare for creating clinical models, displacing less robust datasets such as claims databases.

A dataset has to contain a sufficient volume of data to support a predictive decision, but it is generally more important that the dataset contain robust variables. If the data do not contain the correct variables, a higher volume will not solve the problem. It is also important for a team to have clinical expertise in addition to data sources.
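The point about volume versus variables can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration: the variable names `a` and `b`, the outcome formula, and the helper functions are hypothetical and do not come from any clinical dataset. The "model" is simply the average outcome for each combination of the variables it is allowed to see.

```python
import random
from collections import defaultdict
from statistics import mean


def make_rows(n, seed=0):
    """Build n synthetic rows whose outcome depends on two binary variables."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.randint(0, 1)
        b = rng.randint(0, 1)
        # Hypothetical relationship: variable b drives most of the outcome.
        rows.append({"a": a, "b": b, "outcome": 10 + 5 * a + 20 * b})
    return rows


def fit_group_means(rows, variables):
    """Toy model: mean outcome for each combination of the chosen variables."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[v] for v in variables)].append(row["outcome"])
    return {key: mean(values) for key, values in groups.items()}


def mean_abs_error(rows, variables, model):
    """Average absolute gap between the real outcome and the model's guess."""
    return mean(
        abs(row["outcome"] - model[tuple(row[v] for v in variables)])
        for row in rows
    )


def compare(n):
    """Return (error without variable b, error with variable b) for n rows."""
    rows = make_rows(n)
    without_b = mean_abs_error(rows, ["a"], fit_group_means(rows, ["a"]))
    with_b = mean_abs_error(rows, ["a", "b"], fit_group_means(rows, ["a", "b"]))
    return without_b, with_b


if __name__ == "__main__":
    for n in (200, 20000):
        print(n, compare(n))
```

In this toy setup, a hundredfold increase in rows leaves the error of the model that cannot see `b` essentially unchanged, while the model that sees both variables has no error at all, because the outcome is fully determined by them. Real clinical data are noisier, so the contrast would be less clean.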

Anyone who has used a navigation application in a large metropolitan area during rush hour is familiar with decisions made on incomplete data. A navigation app relies on many different data points, such as the distance between points, travel speed, accidents, and construction projects, to suggest the most efficient route. Data points like weather are not included, however.

If a large storm with heavy rain is set to cross an interstate in a metropolitan area, it will reduce traffic speed and statistically result in more accidents, both of which will increase travel time. Suppose a navigation app's user is at a decision point where a local road might be an alternative to the interstate, and the storm has not yet crossed the interstate and slowed traffic. The app will still consider the interstate to be the most efficient route, because at that moment that is the correct decision based on the available data.
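The storm scenario can be sketched in a few lines of code. This is only an illustration under stated assumptions, not how any real navigation app works: the `Route` class, the distances and speeds, and the `STORM_SPEED_FACTOR` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    distance_miles: float
    current_speed_mph: float
    storm_expected: bool = False  # weather variable, ignored by default


# Hypothetical assumption: an incoming storm halves the travel speed.
STORM_SPEED_FACTOR = 0.5


def eta_minutes(route, use_weather=False):
    """Estimate travel time, optionally accounting for an expected storm."""
    speed = route.current_speed_mph
    if use_weather and route.storm_expected:
        speed *= STORM_SPEED_FACTOR
    return 60 * route.distance_miles / speed


def best_route(routes, use_weather=False):
    """Pick the route with the lowest estimated travel time."""
    return min(routes, key=lambda route: eta_minutes(route, use_weather))


interstate = Route("interstate", 20, 60, storm_expected=True)
local_road = Route("local road", 18, 35)

if __name__ == "__main__":
    print(best_route([interstate, local_road]).name)
    print(best_route([interstate, local_road], use_weather=True).name)
```

With these made-up numbers, the estimate that ignores weather picks the interstate (20 minutes against roughly 31), while the weather-aware estimate picks the local road, because the expected storm doubles the interstate's travel time to 40 minutes. The algorithm is the same in both cases; only the variables available to it differ.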

If the app had incorporated a weather radar into the decision, it might have suggested a more efficient route base

The content herein is subject to copyright by The Yuan. All rights reserved. The content of the services is owned or licensed to The Yuan. Such content from The Yuan may be shared and reprinted but must clearly identify The Yuan as its original source. Content from a third-party copyright holder identified in the copyright notice contained in such third party’s content appearing in The Yuan must likewise be clearly labeled as such.
