Holding artificial intelligence to account

Lancet Digital Health
May 2022 Volume 4 Number 5 e290-e397


The Lancet Digital Health
In this issue of The Lancet Digital Health, Xiaoxuan Liu and colleagues give their perspective on global auditing of medical artificial intelligence (AI). They call for the focus to shift from demonstrating the strengths of AI in health care to proactively discovering its weaknesses.

Machines make unpredictable mistakes in medicine, which differ significantly from those made by humans. Liu and colleagues state that errors made by AI tools can have far-reaching consequences because of the complex and opaque relationships between the analysis and the clinical output. Given that there is little human control over how an AI generates results and that clinical knowledge is not a prerequisite in AI development, there is a risk of an AI learning spurious correlations that seem valid during training but are unreliable when applied to real-world situations.

Lauren Oakden-Rayner and colleagues analysed the performance of an AI across a range of relevant features for hip fracture detection. This preclinical algorithmic audit identified barriers to clinical use, including a decrease in sensitivity at the prespecified operating point. The study highlighted several “failure modes”, defined as the propensity of an AI to fail recurrently under certain conditions. Oakden-Rayner told The Lancet Digital Health that their study showed that “the failure modes of AI systems can look bizarre from a human perspective. Take, for example, in the hip fracture audit (figure 5), the recognition that the AI missed an extremely displaced fracture … the sort of image even a lay person would recognise as completely abnormal.” These errors can drastically affect clinician and patient trust in AI. Another example demonstrating the need for auditing was highlighted last month in an investigation by STAT and the Massachusetts Institute of Technology, which found that an Epic health algorithm used to predict sepsis risk in the USA deteriorated sharply in performance, from an AUC of 0·73 to an AUC of 0·53, over 10 years. This deterioration was caused by changes in the hospital coding system, increased diversity and volume of patient data, and changes in the operational behaviours of caregivers. There was little to no oversight of the AI tool once it reached the market, potentially causing harm to patients in hospital. Liu commented, “without the ability to observe and learn from algorithmic errors, the risk is that it will continue to happen and there’s no accountability for any harm that results.”
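
To make the idea of an audit at a prespecified operating point concrete, the following is a minimal sketch of how sensitivity might be broken down by subgroup. It is not the method used in the hip fracture audit: the function name, the record layout, and the example subgroup labels are all assumptions made for illustration.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records, threshold):
    """Return sensitivity per subgroup at a fixed operating point.

    records: iterable of (subgroup, label, score) tuples (hypothetical layout),
             where label is 1 if the condition is truly present, else 0.
    threshold: the prespecified operating point; score >= threshold counts
               as a positive prediction.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for subgroup, label, score in records:
        if label != 1:
            continue  # sensitivity only considers truly positive cases
        if score >= threshold:
            true_pos[subgroup] += 1
        else:
            false_neg[subgroup] += 1

    result = {}
    for subgroup in set(true_pos) | set(false_neg):
        positives = true_pos[subgroup] + false_neg[subgroup]
        result[subgroup] = true_pos[subgroup] / positives
    return result


# Illustrative, made-up records: (subgroup, true label, model score).
example_records = [
    ("displaced", 1, 0.9),
    ("displaced", 1, 0.2),
    ("undisplaced", 1, 0.8),
    ("undisplaced", 0, 0.1),
]
example_result = sensitivity_by_subgroup(example_records, threshold=0.5)
```

A subgroup whose sensitivity falls well below the overall figure is a candidate failure mode that would merit inspection of the individual missed cases by a clinician.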

Auditing medical AI is essential, but whose responsibility is it to ensure that AI is safe to use? Some experts think that AI developers are responsible for providing guidance on managing their tools, including how and when to check the system’s performance, and for identifying vulnerabilities that might emerge after the tools are put into practice. Others argue that not all the responsibility lies with AI developers, and that health providers must test AI models on other data to verify their utility and assess potential vulnerabilities. Liu says, “we need clinical teams to start playing an active role in algorithmic safety oversight. They are best placed to define what success and failure looks like for their health institution and their patient cohort.”

Three challenges must be overcome for AI auditing to be implemented successfully.

First, in practice, auditing will require professionals with both clinical and technical expertise to investigate, thoughtfully interrogate, and prevent AI errors before and during real-world deployment. However, experts with both computational and clinical skill sets are not yet commonplace. Health-care institutions, AI companies, and governments must invest in upskilling health-care workers so that these experts can become an integral part of the medical AI development process.

Second, industry-wide standards for monitoring medical AI tools over time must be enforced by key regulatory bodies. Tools to identify when an algorithm becomes miscalibrated because of changes in data or environment are being developed by researchers, but these tools must be endorsed in a sustained and standardised way, led by regulators, health systems, and AI developers.
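
As an illustration of what such monitoring could look like, the sketch below checks one simple signal, calibration-in-the-large (the gap between the mean predicted risk and the observed event rate), for each time window and flags windows where the gap exceeds a tolerance. It is a deliberately simplified stand-in for the tools researchers are developing: the function names, the data layout, and the tolerance of 0.1 are assumptions, not an endorsed standard.

```python
def calibration_gap(labels, probabilities):
    """Mean predicted risk minus observed event rate for one time window."""
    mean_predicted = sum(probabilities) / len(probabilities)
    observed_rate = sum(labels) / len(labels)
    return mean_predicted - observed_rate


def flag_miscalibrated_windows(windows, tolerance=0.1):
    """Return (period, gap) pairs whose absolute gap exceeds the tolerance.

    windows: dict mapping a period name to (labels, probabilities), where
             labels are 0 or 1 outcomes and probabilities are model outputs.
             This layout and the default tolerance are hypothetical.
    """
    flagged = []
    for period, (labels, probabilities) in windows.items():
        gap = calibration_gap(labels, probabilities)
        if abs(gap) > tolerance:
            flagged.append((period, gap))
    return flagged


# Illustrative, made-up data: the model overestimates risk in the later window.
example_windows = {
    "early": ([1, 0, 0, 0], [0.3, 0.2, 0.3, 0.2]),
    "late": ([1, 0, 0, 0], [0.7, 0.6, 0.6, 0.5]),
}
example_flags = flag_miscalibrated_windows(example_windows)
```

A check like this says nothing about why calibration has shifted. A flagged window would be a prompt for clinical and technical teams to investigate changes in coding practice, case mix, or workflow.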

Third, the main issue that can exacerbate errors in AI is the lack of transparency about data, code, and parameters because of intellectual property concerns. Liu and colleagues emphasise that much of the benefit that software and data access would provide can instead be obtained through a web portal that allows users to test the model on new data and receive model outputs. Oakden-Rayner said, “AI developers have a responsibility to make auditing easier for clinicians, especially by providing clear details of how their system works and how it was built.”