Original Article

Comparison of the Usability and Reliability of Answers to Clinical Questions: AI-Generated ChatGPT versus a Human-Authored Resource

Authors: Farrin A. Manian, MD, MPH, Katherine Garland, MD, Jimin Ding, PhD

Abstract

Objectives: Our aim was to compare the usability and reliability of answers generated by Chat Generative Pre-trained Transformer (ChatGPT) with those of a human-authored Web source (www.Pearls4Peers.com) in response to “real-world” clinical questions raised during the care of patients.

Methods: Two domains of clinical information quality were studied: usability, based on organization/readability, relevance, and usefulness, and reliability, based on clarity, accuracy, and thoroughness. The top 36 most viewed real-world questions from a human-authored Web site (www.Pearls4Peers.com [P4P]) were posed to ChatGPT 3.5. Anonymized answers by ChatGPT and P4P (without literature citations) were separately assessed for usability by 18 practicing physicians (“clinician users”) in triplicate and for reliability by 21 expert providers (“content experts”) on a Likert scale (“definitely yes,” “generally yes,” or “no”) in duplicate or triplicate. Participants also directly compared the usability and reliability of paired answers.
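To make the rating scheme concrete, the following is a minimal, hypothetical sketch (in Python) of how Likert responses and paired head-to-head verdicts could be tallied into the kinds of percentages reported in the Results; the data values, verdict labels, and the `proportion` helper are invented for illustration and do not reproduce the study's actual analysis.

```python
from collections import Counter

# Hypothetical ratings, invented for illustration only; they are not the study data.
# Each ChatGPT answer received Likert ratings ("definitely yes", "generally yes", "no")
# for one attribute (e.g., usefulness), plus a direct head-to-head verdict against the
# paired P4P answer ("chatgpt better", "p4p better", or "equal").
likert_ratings = ["definitely yes", "generally yes", "no", "generally yes", "no"]
head_to_head = ["p4p better", "equal", "p4p better", "chatgpt better", "equal"]

def proportion(items, value):
    """Fraction of ratings equal to `value`."""
    return Counter(items)[value] / len(items)

# Share of ratings in which the ChatGPT answer was judged not useful ("no"),
# and share of paired comparisons in which it was ranked below the P4P answer.
print(f"Rated 'no': {proportion(likert_ratings, 'no'):.1%}")
print(f"Inferior to P4P: {proportion(head_to_head, 'p4p better'):.1%}")
```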

Results: The usability and reliability of ChatGPT answers varied widely depending on the question posed. ChatGPT answers were not considered useful or accurate in 13.9% and 13.1% of cases, respectively. In within-individual rankings for usability, ChatGPT was inferior to P4P in organization/readability, relevance, and usefulness in 29.6%, 28.3%, and 29.6% of cases, respectively; for reliability, it was inferior to P4P in clarity, accuracy, and thoroughness in 38.1%, 34.5%, and 31.0% of cases, respectively.

Conclusions: The quality of ChatGPT responses to real-world clinical questions varied widely, with nearly one-third or more of the answers considered inferior to a human-authored source in several aspects of usability and reliability. Caution is advised when using ChatGPT in clinical decision making.

