Assessment of ChatGPT performance in orbital MRI reporting with multimetric evaluation of transformer-based language models
Tel A.; Robiony M.
2025-01-01
Abstract
Transformer-based large language models (LLMs), such as ChatGPT-4, are increasingly used to streamline clinical practice, with radiology reporting a prominent application. However, their performance in interpreting complex anatomical regions from MRI data remains largely unexplored. This study investigates the capability of ChatGPT-4 to produce clinically reliable reports from orbital MR images, applying a multimetric, quantitative evaluation framework in 25 patients with orbital lesions. Owing to inherent limitations of the current version of GPT-4, the model was not fed volumetric MR data but only key 2D images. For each case, ChatGPT-4 generated a free-text report, which was then compared with the corresponding ground-truth report authored by a board-certified radiologist. Evaluation included established NLP metrics (BLEU-4, ROUGE-L, BERTScore), clinical content recognition scores (RadGraph F1, CheXbert), and expert human judgment. Among the automated metrics, BERTScore demonstrated the highest language similarity, while RadGraph F1 best captured clinical entity recognition. Clinician assessment revealed moderate agreement with the LLM outputs, with performance decreasing in complex or infiltrative cases. The study highlights both the promise and the current limitations of LLMs in radiology, particularly their inability to process volumetric data and maintain spatial consistency. These findings suggest that while LLMs may assist in structured reporting, effective integration into diagnostic imaging workflows will require coupling with advanced vision models capable of full 3D interpretation.
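The text-similarity metrics named in the abstract (BLEU-4, ROUGE-L, BERTScore) can be computed with standard open-source packages. The following is a minimal sketch, assuming the nltk, rouge-score, and bert-score libraries; the two one-sentence reports are hypothetical stand-ins, as the study's actual reports and evaluation pipeline are not reproduced here.

```python
# Hedged sketch: scoring a generated report against a reference report
# with the three NLP metrics cited in the abstract. Example strings are
# invented placeholders, not data from the study.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = "Well-defined intraconal mass displacing the optic nerve medially."
reference = "Intraconal orbital lesion with medial displacement of the optic nerve."

# BLEU-4: geometric mean of 1- to 4-gram precisions; smoothing avoids
# zero scores on short clinical sentences with no 4-gram overlap.
bleu4 = sentence_bleu(
    [reference.split()], generated.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: F-measure over the longest common subsequence of tokens.
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
    .score(reference, generated)["rougeL"].fmeasure

# BERTScore: cosine similarity of contextual token embeddings, which is
# why it tolerates paraphrase better than n-gram metrics do.
_, _, f1 = bert_score([generated], [reference], lang="en")

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}  BERTScore F1: {f1.item():.3f}")
```

On such paraphrased sentence pairs, BLEU-4 and ROUGE-L typically stay low while BERTScore remains high, consistent with the abstract's finding that BERTScore showed the highest language similarity.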
| File | Type | License | Size | Format |
|---|---|---|---|---|
| s41598-025-19669-1.pdf (open access) | Publisher's version (PDF) | Creative Commons | 2.66 MB | Adobe PDF |