
AI startup Mendel and the University of Massachusetts Amherst (UMass Amherst) have jointly published a study on detecting hallucinations in AI-generated medical summaries.
The study evaluated medical summaries generated by two large language models (LLMs), GPT-4o and Llama-3. It categorises hallucinations into five groups based on where they occur in the structure of medical notes: patient information; patient history; symptoms, diagnosis, and surgical procedures; medicine-related instructions; and follow-up.
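For illustration only, here is a minimal Python sketch, not taken from the study, of how those five note sections and a flagged faithfulness hallucination might be represented when labelling summaries; every name in it is an assumption rather than the study's actual schema.

```python
# Illustrative sketch only: one way to represent the five note-section categories
# and a flagged span of an AI-generated summary. Not the study's schema.
from dataclasses import dataclass
from enum import Enum


class NoteSection(Enum):
    PATIENT_INFORMATION = "patient information"
    PATIENT_HISTORY = "patient history"
    SYMPTOMS_DIAGNOSIS_PROCEDURES = "symptoms, diagnosis, surgical procedures"
    MEDICINE_INSTRUCTIONS = "medicine-related instructions"
    FOLLOW_UP = "follow-up"


@dataclass
class FlaggedSpan:
    """A span of a generated summary flagged as a faithfulness hallucination."""
    summary_text: str     # the offending sentence or phrase from the summary
    section: NoteSection  # where in the note structure the claim belongs
    incorrect: bool       # contradicted by the source clinical note
    too_general: bool     # vaguer than the source clinical note supports


# Example: a dosing claim that the source note does not support.
span = FlaggedSpan(
    summary_text="Patient was advised to double the metformin dose.",
    section=NoteSection.MEDICINE_INSTRUCTIONS,
    incorrect=True,
    too_general=False,
)
print(span.section.value, "-", "incorrect" if span.incorrect else "too general")
```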
The study found that, when summarising, the models can "generate content that is incorrect or too general according to information in the source clinical notes", which the authors term faithfulness hallucination.
AI hallucinations are a well-documented phenomenon. Google's use of AI in its search engine has produced some absurd responses, such as recommending "eating one small rock per day" and "adding non-toxic glue to pizza to stop it from sticking". In medical summaries, however, such hallucinations can undermine the reliability and accuracy of the medical record.
The pilot study prompted GPT-4o and Llama-3 to create 500-word summaries of 50 detailed medical notes. The researchers found that GPT-4o produced 21 summaries with incorrect information and 50 summaries with generalised information, while Llama-3 produced 19 and 47, respectively. They noted that Llama-3 tended to report details "as is" in its summaries, whilst GPT-4o made "bold, two-step reasoning statements" that can lead to hallucinations.
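As a quick back-of-the-envelope check, not stated in the study itself, these counts translate into the following rates, assuming each model produced one 500-word summary per note, i.e. 50 summaries per model.

```python
# Rough per-model rates derived from the reported counts, assuming 50 summaries
# per model (one per note). These percentages are derived here, not quoted.
counts = {
    "GPT-4o":  {"incorrect": 21, "too_general": 50},
    "Llama-3": {"incorrect": 19, "too_general": 47},
}
for model, c in counts.items():
    print(f"{model}: {c['incorrect'] / 50:.0%} incorrect, "
          f"{c['too_general'] / 50:.0%} too general")
# GPT-4o: 42% incorrect, 100% too general
# Llama-3: 38% incorrect, 94% too general
```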
The use of AI in healthcare has been increasing in recent years, and GlobalData expects global revenue for AI platforms across healthcare to keep growing. There have also been calls to integrate AI with electronic health records to support clinical decision-making.

The UMass Amherst and Mendel study establishes the need for a hallucination detection system to boost the reliability and accuracy of AI-generated summaries. The research found that it took a well-trained clinician 92 minutes on average to label a single AI-generated summary, which makes manual review expensive. To overcome this, the research team employed Mendel's Hypercube system to detect hallucinations.
It also found that, while Hypercube tended to overestimate the number of hallucinations, it detected hallucinations that human experts otherwise missed. The research team proposed using the Hypercube system as "an initial hallucination detection step, which can then be integrated with human expert review to enhance overall detection accuracy".
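Below is a minimal sketch of that proposed two-stage workflow, with a toy heuristic standing in for Mendel's Hypercube: an automated first pass that deliberately over-flags candidate hallucinations, followed by expert confirmation of only the flagged claims. The function names and the number-matching rule are illustrative assumptions, not the study's method.

```python
# Sketch of a two-stage hallucination review workflow. The toy first pass flags
# summary sentences containing numbers absent from the source note, so it
# over-flags, mirroring the over-estimation reported for the automated detector.
import re
from typing import Callable, List


def toy_first_pass(summary: str, source_note: str) -> List[str]:
    """Flag summary sentences whose numeric values never appear in the source note."""
    note_numbers = set(re.findall(r"\d+(?:\.\d+)?", source_note))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        numbers = re.findall(r"\d+(?:\.\d+)?", sentence)
        if any(n not in note_numbers for n in numbers):
            flagged.append(sentence)
    return flagged


def two_stage_review(summary: str, source_note: str,
                     expert_confirms: Callable[[str], bool]) -> List[str]:
    """The expert reviews only the flagged candidates, not the whole summary."""
    return [s for s in toy_first_pass(summary, source_note) if expert_confirms(s)]


note = "Patient prescribed metformin 500 mg twice daily. Follow-up in 2 weeks."
summary = "Metformin 1000 mg prescribed. Follow-up in 2 weeks."
print(two_stage_review(summary, note, expert_confirms=lambda s: True))
# -> ['Metformin 1000 mg prescribed.']
```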