Assessment of ChatGPT performance in orbital MRI reporting with multimetric evaluation of transformer-based language models

Tel A.; Robiony M.
2025-01-01

Abstract

Transformer-based large language models (LLMs), such as ChatGPT-4, are increasingly used to streamline clinical practice, with radiology reporting a prominent application. However, their performance in interpreting complex anatomical regions from MRI data remains largely unexplored. This study investigates the capability of ChatGPT-4 to produce clinically reliable reports from orbital MR images, applying a multimetric, quantitative evaluation framework in 25 patients with orbital lesions. Because of inherent limitations of the current version of GPT-4, the model was not given volumetric MR data but only key 2D images. For each case, ChatGPT-4 generated a free-text report, which was then compared to the corresponding ground-truth report authored by a board-certified radiologist. Evaluation included established NLP metrics (BLEU-4, ROUGE-L, BERTScore), clinical content recognition scores (RadGraph F1, CheXbert), and expert human judgment. Among the automated metrics, BERTScore demonstrated the highest language similarity, while RadGraph F1 best captured clinical entity recognition. Clinician assessment revealed moderate agreement with the LLM outputs, with performance decreasing in complex or infiltrative cases. The study highlights both the promise and the current limitations of LLMs in radiology, particularly their inability to process volumetric data and maintain spatial consistency. These findings suggest that while LLMs may assist in structured reporting, effective integration into diagnostic imaging workflows will require coupling with advanced vision models capable of full 3D interpretation.
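For context, the following is a minimal sketch of how the automated language metrics named in the abstract (BLEU-4, ROUGE-L, BERTScore) can be computed for a generated report against a reference report, using the common nltk, rouge-score, and bert-score Python packages. The example report texts are invented for illustration, and the study's exact evaluation pipeline is not specified here; RadGraph F1 and CheXbert require dedicated clinical-NLP models and are not reproduced.

```python
# Minimal sketch: scoring a model-generated report against a radiologist's
# ground-truth report with BLEU-4, ROUGE-L, and BERTScore.
# Requires: pip install nltk rouge-score bert-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Hypothetical example texts (not from the study).
reference = ("Well-defined intraconal mass of the right orbit, "
             "hyperintense on T2, displacing the optic nerve medially.")
generated = ("Right intraconal orbital lesion with high T2 signal "
             "causing medial displacement of the optic nerve.")

# BLEU-4: geometric mean of 1- to 4-gram precisions,
# smoothed because short clinical texts often lack 4-gram overlap.
bleu4 = sentence_bleu(
    [reference.split()], generated.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence F-measure.
rouge_l = (rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
           .score(reference, generated)["rougeL"].fmeasure)

# BERTScore: token similarity in contextual embedding space,
# which rewards paraphrases that n-gram metrics miss.
_, _, f1 = bert_score([generated], [reference], lang="en")

print(f"BLEU-4:    {bleu4:.3f}")
print(f"ROUGE-L:   {rouge_l:.3f}")
print(f"BERTScore: {f1.item():.3f}")
```

Such a comparison illustrates why BERTScore tends to rate clinically equivalent paraphrases higher than BLEU-4, consistent with the abstract's finding that BERTScore showed the highest language similarity.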
Files in this product:
File: s41598-025-19669-1.pdf (open access)
Type: Publisher's version (PDF)
License: Creative Commons
Format: Adobe PDF
Size: 2.66 MB

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1318229
Citations
  • PubMed Central: n/a
  • Scopus: 0
  • Web of Science: 0