According to a Stanford study published in Nature Medicine last week, some AI-powered medical devices approved by the U.S. Food and Drug Administration (FDA) are susceptible to data shifts and bias against underrepresented patients.
While the academic community has begun developing guidelines for AI clinical trials, there are no established practices for evaluating commercial algorithms. In the U.S., the FDA is responsible for approving AI-powered medical devices and releasing information on them.
The coauthors of the Stanford research created a database of FDA-approved medical AI devices and examined how they were tested before approval. According to the researchers, 126 of the 130 devices approved between January 2015 and December 2020 underwent only retrospective studies at the time of submission. None of the 54 approved high-risk devices were evaluated in prospective studies, which means that test data was collected before the devices were approved rather than concurrently with their deployment.
The coauthors argue that prospective studies are particularly necessary for AI medical devices because real-world use may deviate from the intended use. For example, a prospective study may show that clinicians are misusing a particular device for diagnosis, in which case the results would differ from what would otherwise be expected.
There is evidence that these deviations can lead to errors. The Pennsylvania Patient Safety Authority in Harrisburg found that from January 2016 to December 2017, EHR systems were responsible for 775 problems in laboratory testing in the state. Human-computer interactions were responsible for 54.7% of the events, and the remaining 45.3% were caused by a computer. In addition, a draft report issued by the U.S. government in 2018 found that clinicians commonly miss alerts, which range from minor notices about drug interactions to warnings that carry considerable risk.
The Stanford researchers also found a lack of patient diversity in the tests conducted on FDA-approved devices. Among the 130 devices, 93 did not undergo a multisite assessment, while 4 were tested at only one site and 8 at only two sites. The reports for 59 devices didn’t mention the sample size of the studies. Of the 71 device studies that did report it, the median sample size was 300, and just 17 considered the algorithm’s performance on different patient groups.
Partly because of a reluctance to release code, datasets and methods, much of the data used today to train disease-diagnosing algorithms may perpetuate inequalities. A team of UK scientists found that almost all eye disease datasets come from patients in China, Europe and North America, which implies that algorithms for diagnosing eye disease are less certain to work well for racial groups from underrepresented countries. In another study, researchers from the University of Toronto, the Vector Institute and MIT showed that widely used chest X-ray datasets encode gender, racial and socioeconomic bias.
Apart from basic dataset challenges, models that lack adequate peer review can run into obstacles when deployed in the real world. Scientists at Harvard found that algorithms trained to identify and categorize CT scans could be biased toward scan formats from certain CT machine manufacturers.
In addition, a whitepaper published by Google described challenges in implementing an eye disease-predicting system in hospitals in Thailand, including issues with scan accuracy. Studies conducted by companies such as Babylon Health, which claims it can triage a range of diseases from text messages, have also been questioned.
The coauthors argue that information about the number of sites in an evaluation must be ‘consistently reported’ so that clinicians, researchers and patients can make informed decisions about how reliable an AI medical device is. Multisite evaluations are essential for understanding algorithmic bias and reliability, and can also help account for variations in disease prevalence, image storage formats, technician standards, demographic makeup and equipment.
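To illustrate why per-site reporting matters, here is a minimal Python sketch of how a single aggregate accuracy figure can hide large differences between clinical sites. It is not the method used in the Stanford study; the records, the site names and the helper `accuracy_by_group` are all hypothetical and invented purely for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping the records by the given key."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        if record["prediction"] == record["label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up evaluation records (not real data): each one holds the site where
# the scan was taken, the device's prediction and the ground-truth label.
records = [
    {"site": "A", "prediction": 1, "label": 1},
    {"site": "A", "prediction": 0, "label": 0},
    {"site": "A", "prediction": 1, "label": 1},
    {"site": "A", "prediction": 0, "label": 1},
    {"site": "B", "prediction": 1, "label": 0},
    {"site": "B", "prediction": 0, "label": 1},
    {"site": "B", "prediction": 1, "label": 1},
    {"site": "B", "prediction": 0, "label": 1},
]

# Accuracy pooled over every record, ignoring where it came from.
overall = sum(r["prediction"] == r["label"] for r in records) / len(records)

# Accuracy broken down by site.
per_site = accuracy_by_group(records, "site")

print(f"overall accuracy: {overall:.2f}")
for site, accuracy in sorted(per_site.items()):
    print(f"site {site}: {accuracy:.2f}")
```

In this toy data the pooled figure sits halfway between a site where the device does well and one where it does poorly. The same helper could group by a demographic field instead of `site` to compare performance across patient groups.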
‘Evaluating the performance of AI devices in multiple clinical sites is important for ensuring that the algorithms perform well across representative populations’, the coauthors wrote. ‘Encouraging prospective studies with comparison to standard of care reduces the risk of harmful overfitting and more accurately captures true clinical outcomes. Postmarket surveillance of AI devices is also needed for understanding and measurement of unintended outcomes and biases that are not detected in prospective, multicenter trial.’
By Marvellous Iwendi.