The Southern Medical Journal (SMJ) is the official, peer-reviewed journal of the Southern Medical Association. It has a multidisciplinary and inter-professional focus that covers a broad range of topics relevant to physicians and other healthcare specialists.
Original Article
Performance of Large Language Models on Diagnostic Radiology Board–Style Questions: A Comparative Evaluation of GPT-4o, Perplexity AI, and OpenEvidence
Abstract
Objective: The objective of this study was to compare the diagnostic accuracy and internal consistency of GPT-4o (Generative Pre-Trained Transformer-4 omni), Perplexity AI (artificial intelligence), and OpenEvidence when applied to text-based, specialty-level radiology board questions.
Methods: A total of 161 text-based multiple-choice questions from the American College of Radiology (ACR) Diagnostic Radiology In-Training Examination were administered across three independent runs for each large language model (LLM). Questions containing images were excluded. All three models were accessed through their respective public Web interfaces. For each model, a final answer to each question was assigned by majority vote across the three runs (two out of three). If all three responses differed, the third (last) response was selected. This final answer was then compared with the ACR reference key. Internal consistency, as well as agreement between each model’s final answer and the ACR reference key, was assessed using Cohen’s kappa. In addition, descriptive statistics were used to analyze performance by radiology subspecialty. SPSS version 30 was used for all statistical analyses, and P<0.05 was considered statistically significant.
Results: Perplexity AI demonstrated the highest agreement with the ACR reference key (κ=0.883, P<0.001), followed by OpenEvidence (κ=0.858, P<0.001) and GPT-4o (κ=0.709, P<0.001). All models showed high internal consistency; however, OpenEvidence was the only LLM to demonstrate absolute internal consistency (κ=1.00 for all three runs). Perplexity AI showed the least variability across the 14 radiology subspecialties.
Conclusion: Emerging LLMs such as Perplexity AI and OpenEvidence may offer greater diagnostic reliability than general-purpose models in radiology-specific contexts.
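The answer-selection rule and the agreement statistic described in the Methods can be illustrated with a short sketch. It is not the authors' analysis, which was run in SPSS; it assumes that Cohen's kappa is computed over the answer-choice labels, with the model's final answers and the reference key treated as two raters. The function names `final_answer` and `cohen_kappa` and the toy answers are hypothetical and are not study data.

```python
import math
from collections import Counter


def final_answer(runs):
    """Return the majority answer across runs; if no answer repeats, use the last run.

    Mirrors the rule stated in the Methods: two out of three wins,
    otherwise the third (last) response is selected.
    """
    answer, count = Counter(runs).most_common(1)[0]
    return answer if count >= 2 else runs[-1]


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_e == 1.0:
        return math.nan  # kappa is undefined when chance agreement is total
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): three runs per question and a reference key.
runs_per_question = [
    ["A", "A", "A"],
    ["B", "C", "B"],
    ["C", "C", "D"],
    ["D", "D", "D"],
    ["A", "B", "A"],
    ["A", "B", "C"],  # all three differ, so the last response is used
]
reference_key = ["A", "B", "C", "D", "A", "B"]

model_final = [final_answer(runs) for runs in runs_per_question]
kappa = cohen_kappa(model_final, reference_key)
print(model_final, round(kappa, 3))
```

In this toy example the model's final answers match the key on five of six questions, and kappa discounts that raw agreement by the agreement expected from the label frequencies alone.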
