The Southern Medical Journal (SMJ) is the official, peer-reviewed journal of the Southern Medical Association. It has a multidisciplinary and inter-professional focus that covers a broad range of topics relevant to physicians and other healthcare specialists.

Original Article

Comparing Speed and Accuracy of Artificial Intelligence Large Language Models on the Orthopedic In-Training Examination

Authors: Fahad Nadeem, BS, Saad Ibrahim, BS, Sean Taylor, MS, Saurabh Rawall, MBBS, Zuhair Mohammad, MD, Humza Pirzadah, BS, José Ayala-Ortiz, MD, Sakthivel Rajaram, MD

Abstract

Objectives: Large language models (LLMs), such as OpenAI's Chat Generative Pre-Trained Transformer (GPT)-4 and Google Gemini, have gained significant attention for their ability to process complex language patterns and are being used increasingly in fields such as medicine, where they assist in learning, collaboration, and patient care. Although prior studies have evaluated LLMs on medical licensing examinations, limited research compares their performance on orthopedic-specific assessments. This study aims to assess the accuracy and response speed of ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Gemini on the Orthopedic In-Training Examination (OITE).

Methods: Questions from the 2020-2022 OITE were extracted from the American Academy of Orthopaedic Surgeons' question bank. Each question, along with four answer choices, was manually input into the LLMs without response prompts or feedback. Response accuracy and speed were recorded, with timing measured from the moment the question was submitted until an answer was generated.
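As a rough illustration of the measurement described in the Methods, the sketch below times one question from submission to answer and then summarizes accuracy and response time for a model. It is not the authors' procedure: the study entered questions by hand, so the `ask` callable, the function names, and the record layout are all hypothetical, and reporting the spread as a standard error of the mean is an assumption, because the abstract does not say what its "±" values represent.

```python
import math
import time
from statistics import mean, stdev


def timed_answer(ask, question):
    """Return (answer, elapsed_seconds) for one question.

    `ask` is a hypothetical callable standing in for submitting the
    question to a model; timing runs from submission to the answer.
    """
    start = time.perf_counter()
    answer = ask(question)
    return answer, time.perf_counter() - start


def summarize(records):
    """Summarize a list of (is_correct, seconds) pairs for one model."""
    n = len(records)
    n_correct = sum(1 for is_correct, _ in records if is_correct)
    times = [seconds for _, seconds in records]
    # Standard error of the mean response time (an assumed choice of spread).
    sem = stdev(times) / math.sqrt(n) if n > 1 else 0.0
    return {
        "n": n,
        "accuracy_pct": 100.0 * n_correct / n,
        "mean_seconds": mean(times),
        "sem_seconds": sem,
    }


# Made-up records for illustration only; these are not study data.
example = [(True, 4.0), (False, 6.0), (True, 5.0), (True, 5.0)]
summary = summarize(example)
```

With these made-up records, `summary["accuracy_pct"]` is 75.0 and `summary["mean_seconds"]` is 5.0.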

Results: Out of 1582 prompts, ChatGPT-4 demonstrated the highest accuracy (67.09% ± 0.08%), significantly outperforming ChatGPT-3.5, Microsoft Copilot, and Gemini (P < 0.001). ChatGPT-3.5 was the fastest, with an average response time of 5.41 ± 0.10 seconds. Both ChatGPT-3.5 and ChatGPT-4 responded significantly faster than Gemini and Microsoft Copilot (P < 0.001).
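The abstract reports P values but does not name the statistical test that produced them. As one way to compare accuracy between two models, the sketch below runs a standard two-sided, two-proportion z-test using only the Python standard library. The choice of test, the function name, and the counts are assumptions for illustration and are not taken from the study.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Return (z, two_sided_p) for the difference between two accuracies."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        # Both models all correct or all wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided P value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: model A answers 67 of 100 correctly, model B 50 of 100.
z, p_value = two_proportion_z_test(67, 100, 50, 100)
```

Because every model answers the same questions, a paired method such as McNemar's test may suit this design better; the unpaired z-test is shown only because it is the simplest to sketch.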

Conclusions: ChatGPT-4 exhibited the highest accuracy on OITE questions, and ChatGPT-3.5 was the fastest. Gemini and Copilot were generally less accurate in their responses and had a slower response time. These findings highlight the potential of LLMs in orthopedic education and emphasize the need for further research to explore their broader applications in medical training and decision making.

Posted in: Rheumatology and Orthopedics
