Supplementary Material for: Trends in accuracy and appropriateness of alopecia areata information obtained from a popular online large language model, ChatGPT
dataset posted on 2023-09-18, 08:18, authored by O'Hagan R., Kim R.H., Abittan B.J., Caldas S., Ungar, Ungar B.
Patients with alopecia areata (AA) may access a wide range of sources for information about AA, including the recently developed ChatGPT. Assessing the quality of health information provided by these sources is crucial, as patients are using them in increasing numbers. This study aimed to evaluate the appropriateness and accuracy of responses generated by ChatGPT to common patient questions about AA. Responses generated by ChatGPT 3.5 and ChatGPT 4.0 to 25 questions addressing common patient concerns were assessed for appropriateness and accuracy by multiple attending dermatologists at an academic center. Appropriateness of responses from both models was measured for use in two hypothetical contexts: 1) patient-facing general information websites, and 2) electronic health record (EHR) message drafts. Overall accuracy across all responses was 4.41 out of 5. Responses generated by ChatGPT 3.5 had a mean accuracy score of 4.29, whereas those generated by ChatGPT 4.0 had a mean accuracy score of 4.53. Appropriateness assessments ranged from 100% of responses rated as appropriate for the general question category to 79% for questions about management in the EHR message draft context. Raters largely preferred responses generated by ChatGPT 4.0 over those generated by ChatGPT 3.5. Reviewer agreement was moderate across all questions, with 53.7% agreement and a Fleiss’ κ coefficient of 0.522 (p < 0.001). The large language model ChatGPT produced mostly appropriate information for common patient concerns. While not all responses were accurate, the trend toward improvement with newer iterations suggests potential future utility for patients and dermatologists.
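For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Fleiss’ κ is computed from a table of rating counts. It is illustrative only and is not the authors' analysis code: the function name `fleiss_kappa`, the two-category layout, and the toy counts are all assumptions, and none of the numbers below come from the study's data.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who placed item i in category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])

    # Share of all ratings that fall in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items      # mean observed agreement
    p_expected = sum(p * p for p in p_cat)  # agreement expected by chance

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical toy table: 4 responses, 3 raters, categories
# (appropriate, not appropriate). Not data from the study.
toy_counts = [[3, 0], [2, 1], [3, 0], [1, 2]]
print(round(fleiss_kappa(toy_counts), 3))
```

κ equals 1 when all raters agree on every item and falls to 0 or below when agreement is no better than chance, which is why it is reported alongside the raw percentage agreement.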