
Original Article

Performance of Large Language Models on Diagnostic Radiology Board–Style Questions: A Comparative Evaluation of GPT-4o, Perplexity AI, and OpenEvidence

Authors: Randall Aziz, BS, Sydney Stewart, BS, Rebecca Liscomb, MS, Beecher Baldwin, BS, Katie Bailey, MD, Karim Hanna, MD

Abstract

Objective: The objective of this study was to compare the diagnostic accuracy and internal consistency of GPT-4o (Generative Pre-trained Transformer 4 omni), Perplexity AI (artificial intelligence), and OpenEvidence when applied to text-based, specialty-level radiology board questions.

Methods: A total of 161 text-based multiple-choice questions from the American College of Radiology (ACR) Diagnostic Radiology In-Training Examination were administered across three independent runs for each large language model (LLM). Questions containing images were excluded. All three models were accessed through their respective public Web interfaces. A final answer was assigned to each model by majority vote across the three runs (two out of three); if all three responses differed, the third (last) response was selected. Each model's final answer was then compared with the ACR reference key. Internal consistency and agreement between each model's final answer and the ACR reference key were assessed using Cohen's kappa. In addition, descriptive statistics were used to analyze performance by radiology subspecialty. SPSS version 30 was used for all statistical analyses, and P<0.05 was considered statistically significant.
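For readers interested in the scoring logic, the following is a minimal sketch of the procedure described above (it is not the authors' code): a 2-of-3 majority vote with fallback to the last response on a three-way split, followed by Cohen's kappa agreement against the reference key. The function and variable names, and the toy data, are hypothetical; the kappa calculation assumes scikit-learn's cohen_kappa_score.

```python
# Illustrative sketch only, not the study's actual analysis pipeline.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def final_answer(runs: list[str]) -> str:
    """Return the majority (2-of-3) answer; if all three runs differ, return the last response."""
    answer, count = Counter(runs).most_common(1)[0]
    return answer if count >= 2 else runs[-1]

# Hypothetical toy data: three runs per question and the ACR-style reference key.
runs_per_question = [["A", "A", "C"], ["B", "B", "B"], ["D", "A", "C"], ["C", "C", "C"]]
reference_key = ["A", "B", "C", "C"]

model_answers = [final_answer(runs) for runs in runs_per_question]
kappa = cohen_kappa_score(model_answers, reference_key)  # agreement with the reference key
print(model_answers, kappa)
```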

Results: Perplexity AI demonstrated the highest agreement with the ACR reference key (κ=0.883, P<0.001), followed by OpenEvidence (κ=0.858, P<0.001) and GPT-4o (κ=0.709, P<0.001). All models showed high internal consistency; however, OpenEvidence was the only LLM to demonstrate absolute internal consistency (κ=1.00 across all three runs). Perplexity AI showed the least variability across the 14 radiology subspecialties.

Conclusion: Emerging LLMs such as Perplexity AI and OpenEvidence may offer greater diagnostic reliability than general-purpose models in radiology-specific contexts.

