Abstract
Background Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care.
Objectives To compare the performance of ChatGPT, version GPT-4, with that of real doctors.
Design and setting A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared.
Participants Anonymous responses from the Swedish family medicine specialist examination 2017–2022 were used.
Outcome measures Primary: the mean difference in scores between GPT-4’s responses and randomly selected responses by human doctors, as well as between GPT-4’s responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories.
Results The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044).
Conclusion In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.
- Artificial Intelligence
- Primary Health Care
- Health informatics
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
STRENGTHS AND LIMITATIONS OF THIS STUDY
- Each response was scored by two independent, blinded reviewers.
- Detailed scoring guides provided excellent inter-rater reliability.
- Evaluation of long-form free-text responses to complex cases, relevant for primary care.
- The results may not be fully generalisable to other countries and languages.
- No penalty was applied during scoring for extra, unnecessary information in the responses.
Background
Artificial intelligence (AI) in medicine has been the subject of increasing research, even though real-world applications are relatively few.1–3 Over the last few years, large AI models called generative pretrained transformers (GPT) have demonstrated remarkable abilities beyond simple text generation, such as answering questions and participating in chat conversations. ChatGPT from OpenAI is arguably one of the most well-known models. At the time of this study, OpenAI’s two latest models were GPT-3.5 and GPT-4, with GPT-4 being the more advanced.
Countless clinical applications could be envisioned for an AI system that can accurately answer questions from healthcare staff and patients. The impact could be enormous in primary healthcare, where healthcare staff need to keep themselves up-to-date on a broad spectrum of medical conditions.
GPT-3.5 and GPT-4 have demonstrated human-level performance on several professional benchmarks4 and achieved moderate to excellent results in various medical examinations5–10 but did not pass the general practice licensing examinations of Taiwan and the UK.11 12 However, the medical questions in these assessments have typically been multiple-choice questions, which differ from a clinician asking the chatbot for advice on managing real patient cases. Additionally, the studies focusing on general practice have tested GPT-3.5, which may perform significantly worse than GPT-4.6 9 At the time of writing, research has not explored GPT-4’s ability to provide comprehensive free-text assessments of primary care cases.
The Swedish family medicine specialist examination is not mandatory, but it is a valuable credential taken by resident doctors in general medicine as they become certified specialists. One part of the examination is a written test with eight complex cases that often involve intricate symptoms combined with social or behavioural factors, requiring comprehensive long-form responses. Our research question investigates how GPT-4 performs in comparison to real doctors taking the examination.
Methods
Study design
This study compared the performance of GPT-4 with responses from human doctors on cases from the Swedish family medicine specialist examination. The responses from three distinct groups were scored and compared: (A) randomly selected doctor responses, (B) top-tier doctor responses and (C) responses generated by GPT-4.
Objective and outcome measures
The objective was to compare GPT-4 with real doctors regarding their ability to write comprehensive assessments of complex cases from primary care.
Primary outcome measure
The mean difference in scores between GPT-4 and randomly selected responses by human doctors, as well as between GPT-4 and top-tier responses.
Secondary outcome measures
The correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories.
Data collection
Sourcing of medical cases
All cases from the Swedish family medicine specialist examination from 2017 to 2022 were used for this study, totalling 48 cases (see online supplemental file 1 for examples). These examinations are publicly available on the Swedish Association of General Practice (SFAM)’s website.13 The cases require comprehensive responses, typically consisting of several paragraphs of free text. They are often complex, involving symptoms indicative of various diseases and complicating factors such as social problems, addiction, poor compliance, legal aspects and patients near the end of life. Table 1 provides a summary of the number of cases addressing different topics.
Table 1. Number of cases addressing different topics
Sourcing of doctor responses, groups A and B
Anonymous responses from past examinations were used. Group A: a digital random choice function was used to draw a single anonymous response for each case from all the human responses submitted when the examination took place. Group B: the Swedish Association of General Practice (SFAM) has published an example of a top-tier response for each case; these responses, selected by the examination reviewers as what they judged to be the best answer to each question, were used for group B.13
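To illustrate, the random draw for group A could be reproduced along the lines of the sketch below. The data layout, seed and use of Python’s random module are assumptions for illustration, not details reported by the study.

```python
import random

# Hypothetical layout: one list of anonymous human responses per case.
responses_per_case = {
    "2022_case_1": ["response text 1", "response text 2", "response text 3"],
    "2022_case_2": ["response text 4", "response text 5"],
    # ... remaining cases
}

random.seed(2023)  # a fixed seed keeps the draw reproducible (assumption)

# Draw a single random human response for each case (group A).
group_a = {case: random.choice(responses)
           for case, responses in responses_per_case.items()}
```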
Obtaining GPT-4 responses, group C
Medical cases were sent to GPT-4 in an automated manner through OpenAI’s application programming interface,14 using the version of GPT-4 released on 3 August 2023. Apart from the case itself, additional instructions were sent along with each case to provide some context, based on the written instructions included in the 2022 examination (see online supplemental file 2). A single response was collected for each case, without any follow-up questions (see online supplemental file 1). A separate chat session was created for each case.
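A minimal sketch of this kind of automated collection, using OpenAI’s official Python client, is shown below. The model identifier, the instruction text and the surrounding variable names are placeholders; the actual instructions sent with each case are described in online supplemental file 2.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

MODEL = "gpt-4"    # placeholder; the study used the GPT-4 version released on 3 August 2023
INSTRUCTIONS = "..."  # context based on the written instructions from the 2022 examination

def get_single_response(case_text: str) -> str:
    """Send one case in its own chat session and return the model's single free-text answer."""
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": case_text},
        ],
    )
    return completion.choices[0].message.content

# One response per case, no follow-up questions, a fresh chat session each time:
# gpt4_responses = {case_id: get_single_response(text) for case_id, text in cases.items()}
```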
Scoring the responses
For each case, SFAM has published an evaluation guide listing a few main points that should be included in a good answer, although the precise scoring guide used for the examination is not public. To quantify the performance of each examination response, the published evaluation guide for each case was adapted into a criteria-based scoring guide that could award a score ranging from 0 to 10 points. This adaptation involved rephrasing each evaluation guide into a set of true-or-false criteria. The original evaluation guide was followed as closely as possible, but where it was vaguely phrased, official Swedish medical guidelines were consulted to help formulate the criteria. For each criterion met, a specific number of points was awarded (see online supplemental file 1). A group of three medical doctors, blinded to the origins of the responses, rated the responses using the scoring guides. Each response was scored by two of the three raters, and the average of their scores was used for the statistical analysis. The same pair of raters assessed all responses pertaining to the same case. The doctor who created the scoring guides is a specialist in general practice; two of the reviewers are residents nearing the end of their residency, and one is a licensed doctor working in general practice. The evaluators were selected based on their expertise and availability.
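As a minimal illustration of how a criteria-based scoring guide of this kind can be represented and applied, consider the sketch below; the criteria texts, point values and rater judgements are hypothetical and not taken from the actual guides.

```python
# Hypothetical scoring guide for one case: true-or-false criteria with point values,
# summing to a maximum of 10 points in the real guides.
scoring_guide = [
    ("Suggests a relevant differential diagnosis", 2.0),
    ("Orders appropriate laboratory tests", 1.5),
    ("Asks about medication compliance", 1.0),
]

def score_response(criteria_met: list[bool]) -> float:
    """Sum the points for every criterion a reviewer judged as met."""
    return sum(points for (_, points), met in zip(scoring_guide, criteria_met) if met)

# Each response was rated by two blinded reviewers; their scores were averaged.
rater_1 = [True, True, False]
rater_2 = [True, False, True]
final_score = (score_response(rater_1) + score_response(rater_2)) / 2
print(final_score)
```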
During the review process for this paper, OpenAI released GPT-4o, its latest flagship model. The experiment was subsequently repeated to include responses from GPT-4o. Due to limited availability, it was not possible to reassemble the original panel of evaluators; instead, a single evaluator scored the responses across all groups, including the new GPT-4o group.
Statistical analysis
Sample size calculation
For the primary research question, we aimed to make two group comparisons, each producing a p value. Using the Bonferroni approach to adjust for multiple testing, the level of significance was set to 0.025. The power was set to 0.8 and the minimal difference between groups to be detected was set to one point, which resulted in a required sample size of 48 cases.
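For illustration, the implications of these settings can be explored with the statsmodels power module. This is a sketch under the stated assumptions of a paired t-test, α=0.025 and power 0.8; the standard deviation of the paired differences is not reported here, so the code solves for the standardised effect size detectable with 48 cases rather than reproducing the authors’ exact calculation.

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()

# Smallest standardised effect size (difference / SD of paired differences)
# detectable with 48 pairs at alpha=0.025 (Bonferroni-adjusted) and power=0.8.
effect_size = analysis.solve_power(nobs=48, alpha=0.025, power=0.8,
                                   alternative='two-sided')

# A one-point minimal difference corresponds to this effect size if the SD of
# the paired differences is roughly 1 / effect_size points.
print(effect_size, 1 / effect_size)
```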
Data analysis
After scoring the responses to all 48 cases, the difference between each doctor group and GPT-4 was calculated for each case. A paired t-test was used to compare each doctor group with GPT-4, pairing the scores by question.
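A sketch of this comparison with SciPy is shown below; the score vectors are placeholders standing in for the 48 paired case scores used in the actual analysis.

```python
from scipy import stats

# Placeholder score vectors (one value per case, aligned by question).
random_doctor = [6.5, 5.0, 7.0, 4.5]
top_tier      = [8.0, 6.5, 7.5, 6.0]
gpt4          = [5.0, 4.0, 5.5, 3.5]

# Paired t-tests comparing each doctor group with GPT-4, paired by question.
t_a, p_a = stats.ttest_rel(random_doctor, gpt4)   # group A vs group C
t_b, p_b = stats.ttest_rel(top_tier, gpt4)        # group B vs group C
print(p_a, p_b)
```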
To assess the reliability of the averaged scores derived from the raters’ use of the scoring guide, we conducted an intraclass correlation coefficient (ICC) analysis, specifically employing the two-way mixed-effects model for the mean of k raters, using the psych package in R.15 16
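The authors performed this analysis with the psych package in R; the sketch below shows an equivalent computation in Python using the pingouin package, with placeholder data and assumed column names, and it reports the ICC3k estimate (two-way mixed effects, mean of k raters).

```python
import pandas as pd
import pingouin as pg

# Placeholder long-format data: one row per (response, rater) score.
df = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":       ["R1", "R2"] * 6,
    "score":       [6.5, 7.0, 4.0, 4.5, 8.0, 7.5, 5.0, 5.5, 9.0, 8.5, 3.0, 3.5],
})

icc = pg.intraclass_corr(data=df, targets="response_id", raters="rater", ratings="score")
print(icc[icc["Type"] == "ICC3k"])   # two-way mixed effects, mean of k raters
```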
In addition, we examined the differences in response length (number of words) between the top-tier and GPT-4 responses, using a paired t-test paired by question. As a measure of information density, we divided the score by the number of words for each response. Finally, a linear regression analysis was performed to explore the relationship between the difference in lengths and the difference in scores, with the latter as the dependent variable and the former as the independent variable. The OLS function from the statsmodels library was employed for this analysis.17
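A sketch of this regression with statsmodels is shown below; the paired difference vectors are placeholders for the per-case differences between the top-tier and GPT-4 responses.

```python
import numpy as np
import statsmodels.api as sm

# Placeholder per-case differences, paired by question (top-tier minus GPT-4).
length_diff = np.array([60, 20, 110, -15, 45])       # difference in number of words
score_diff  = np.array([2.5, 1.0, 3.0, -0.5, 2.0])   # difference in score

X = sm.add_constant(length_diff)      # independent variable plus intercept
model = sm.OLS(score_diff, X).fit()   # score difference regressed on length difference
print(model.summary())
```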
Each individual true-or-false scoring criterion was assigned to a category by the author RA, such as ‘suggest diagnosis’ for points awarded for mentioning a possible diagnosis, and ‘patient history inquiry’ for points awarded for mentioning questions that should be asked of the patient. For more details and definitions of the categories, see online supplemental file 3. The top nine most common categories were used, and the rest were grouped under ‘other’. These categories were then used to compare performance across different subject areas. For each category, we calculated the maximum score and the percentage of that score achieved by each group. The Wilcoxon signed-rank test was used to assess the significance of the difference between top-tier and random doctor responses, as well as between GPT-4 and random doctor responses, using the differences in scores paired by scoring criteria.
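A sketch of this comparison with SciPy’s Wilcoxon signed-rank test is shown below; the per-criterion score vectors are placeholders for the paired per-criterion scores within one category.

```python
from scipy.stats import wilcoxon

# Placeholder per-criterion scores, paired by scoring criterion.
random_doc = [1.0, 0.5, 2.0, 0.0, 1.5]
top_tier   = [2.0, 1.0, 2.0, 1.0, 1.5]
gpt4       = [0.5, 0.0, 1.0, 0.0, 1.5]

# Wilcoxon signed-rank tests of the paired differences against the random doctor responses.
print(wilcoxon(top_tier, random_doc))   # top-tier vs random doctors
print(wilcoxon(gpt4, random_doc))       # GPT-4 vs random doctors
```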
Results
GPT-4 scored lower than any doctor group (table 2). The differences between groups were statistically significant (table 3). For examples of responses, see online supplemental file 1. The complete scores are available in a public repository.18
Table 2. Mean score, length and points per 100 words of each group
Table 3. Differences in scores between GPT-4 and the doctor groups
The intraclass correlation coefficient for the scores from the three raters was 0.92 (95% CI 0.90 to 0.94, p<0.001), demonstrating the excellent reliability of the scoring guide.
The results of the repeated experiment with GPT-4o are not included in the above tables, as a single evaluator scored all groups, making these scores not directly comparable with the original results. However, the original findings were confirmed. Additionally, GPT-4o scored an average of 0.7 points higher than GPT-4 (p=0.024), though random doctor responses continued to outperform GPT-4o, with an average of 0.7 points higher (p=0.044).
The top-tier responses were on average 60 words longer than GPT-4’s (p<0.001, 95% CI 30 to 97). The correlation between differences in length and differences in scores of responses between GPT-4 and the top-tier answers was not statistically significant (p=0.11).
The percentage of the total maximum score for each subject category achieved by each group is illustrated in figure 1. More details about the definition of each category, as well as illustrative examples, are available in online supplemental file 3.
Figure 1. The percentage of the maximum score for each subject category achieved by each group. Statistically significant differences (p<0.05) compared with group A, the random doctor responses, are marked by an asterisk (*).
Discussion
The main finding was that GPT-4 scored significantly lower than any group of doctors on the Swedish family medicine specialist examination, with top-tier responses scoring almost three points higher (table 3). This statistically significant difference indicates that graduating specialists in general practice perform better than GPT-4 in writing comprehensive assessments of complex primary care cases.
What such a difference corresponds to in practice varies considerably from case to case. For example, in one case, GPT-4 scored 2.75 points lower than the top-tier response due to mentioning one fewer important differential diagnosis and two fewer aspects of treatment and follow-up. Generally, it appears that GPT-4 significantly lags behind the random doctor responses in critical areas such as suggesting relevant diagnoses, laboratory tests, physical examinations, referrals and addressing legal matters. For any general practitioners currently using GPT-4, this finding is concerning, as these are precisely the areas where one might seek guidance. For patients and the general public, these findings underscore the importance of maintaining human oversight in medical decision-making.
The information density was higher for the two doctor groups than for GPT-4, indicating that human doctors are better at conveying relevant information concisely. Despite these limitations, GPT-4’s performance is impressive, considering it is not a registered medical device and has not been specifically trained for medical use. The repeated experiment with GPT-4o demonstrates a meaningful advancement, suggesting that the performance of general-purpose chatbots is approaching that of graduating specialists in general medicine, though it has not yet reached equivalent levels.
There was also a significant difference between the top-tier and randomly selected doctor responses, raising the question of what requirements should be met by a medical chatbot. Is it enough for it to perform better than the average doctor, or should it aim to match or exceed the best responses from a group of doctors?
Comparison with the existing literature
In one study, GPT-4 passed every test in a series of dermatology licensing examinations, achieving over 80% for the English version (pass level: 60%).6 No data were presented on the performance of real dermatologists for comparison. On the other hand, the average score of GPT-3.5 was only 60.17% on the general practice licensing examination of the UK (pass level ≈ 70%),12 and it scored 41.6% on the corresponding Taiwanese licensing examination (pass level=60%).11 This aligns well with our results, even though we used GPT-4. Both these studies, and several similar studies in other medical disciplines,7–10 used multiple-choice questions, a task very different from providing free-text responses to complex clinical cases. Providing free-text answers more closely resembles the requirements of a chatbot used for decision support in clinical practice. Many of these studies also used GPT-3.5, which may perform significantly worse than GPT-4.
One study examined questions posted by patients online on a forum to which volunteering doctors responded.19 In that study, three licensed healthcare professionals evaluated the free-text responses. In 79% of the cases, they favoured the GPT-3.5 responses over the doctors’, and the quality score for the doctors was, on average, 21% lower on a five-category ordinal scale. These findings are the opposite of ours, where the randomly selected doctors’ responses scored higher in 71% of the cases, even though GPT-4 was used. The questions and responses in the patient forum were typically shorter and simpler than the primary care cases used in our study, and the responses were not assessed against specific medical criteria. In a recent preprint, a novel AI chatbot named AMIE was fine-tuned to perform a diagnostic interview with a patient through chat.20 It was compared with general practitioners on objective structured clinical examination cases and outperformed them on most metrics, including suggesting relevant differential diagnoses. This suggests that higher performance is already possible from AI models, but evaluating GPT-4 remains highly relevant, since it is widely accessible and may already be used by patients and clinicians.
Strengths and limitations
This is the first study of GPT-4’s performance on complex primary care cases requiring long-form free-text responses rather than multiple-choice answers. As such, it mimics the scenario in which a clinician posts a case summary of a real patient to get input on its management. The scoring system was a relatively clear way to quantify the amount of useful content in each answer and demonstrated excellent reliability. However, no penalty was given for superfluous content, which could favour respondents who wrote longer but less relevant responses. The cases used in our study are representative of Swedish primary care, which may differ somewhat from that of other countries.21 This should be taken into account when generalising our results to other countries.
The set of instructions sent to GPT-4 with each case, sometimes called the ‘prompt’, may influence the quality of responses.22 This is its own area of research, and optimising the prompt was beyond the scope of this study (see online supplemental file 2 for the rationale behind our choice of prompt). The cases used in the study are publicly available online and could have been part of GPT-4’s training data, but the correct answers are not available in direct association with the questions, so we find it unlikely that this would have affected the result. In some cases, the reviewers could guess which answer was written by GPT-4, which may have introduced some bias. However, the impact of this bias was likely reduced by the use of the scoring guide, which focused on the presence and absence of specific criteria rather than an overall subjective assessment of the answer quality.
The categorisation of the scoring criteria was conducted by a single researcher. While the extensive number of individual criteria may have mitigated the impact of any potential misclassification, it remains a limitation. Alternative categorisation methods, such as organising criteria by the field of medicine or broader categories like ‘diagnostics’, might have highlighted different aspects of GPT-4’s performance.
Implications for current practice and future research
GPT-4 falls short in medical accuracy when writing comprehensive assessments of complex primary care cases, compared with human doctors. The difference in performance is both statistically significant and clinically relevant. Hence, case assessments by GPT-4 should not be used directly by primary care doctors. Nor should GPT-4 be implemented as a doctor or nurse substitute for patients. However, newer versions, such as GPT-4o, show promising improvements, and continued advancements in general-purpose chatbots may bring their performance closer to that of human specialists in primary care.
Future research on medical chatbots should focus on evaluating emerging models on representative questions asked by clinicians and patients in a clinical setting. At the same time, in line with the previously mentioned AMIE medical chatbot,20 researchers and developers should aim to optimise the performance of such chatbots, for example, by training them specifically on reliable medical information, optimising prompt engineering techniques,22 23 using algorithms for processing a single question in multiple steps or allowing the chatbots access to external sources of information and tools, including other categories of AI models.24 25 Our study indicates that significant enhancements over GPT-4’s performance are necessary, particularly in the areas of suggesting relevant diagnoses, laboratory tests, physical examinations, referrals and addressing legal matters. If reliable medical chatbots are developed, they could profoundly impact general practice. Initial contact, triage and management of simple cases could conceivably be handled directly by a medical chatbot. Additionally, these chatbots could serve as constantly available expert advisors for medical staff.
Data availability statement
All data relevant to the study are included in the article or uploaded as supplementary information. The scores are published in the Swedish National Data Service’s Data Organisation and Information System repository. Three examples of cases and their corresponding scoring guides and GPT-4 responses have been translated into English and included as supplemental file 1. The original cases, evaluation guides and top-tier responses are publicly available in Swedish on SFAM’s website, from where they were used in this study with permission.
Ethics statements
Patient consent for publication
Ethics approval
Not applicable.
Acknowledgments
Special thanks to The Swedish Association of General Practice, SFAM, for granting permission to use the Swedish family medicine specialist examination. Specifically, Karin Lindhagen was very helpful in compiling the randomly selected responses. Special thanks also to Dr Abed Alsabbagh, who participated in the group reviewing and scoring the responses.
Footnotes
Contributors The study was planned by RA, RG, AE, DS and CW. RA collected the cases and responses, and compiled the scoring guides. DS and AE participated in the group of raters who scored the responses. RA performed the statistical analysis and created the draft of the manuscript. RA, RG, AE, DS and CW participated in discussing results and refining the manuscript. RA is the guarantor of this work and accepts full responsibility for the integrity of the study and the manuscript. ChatGPT was used to proofread and suggest improvements in wording and grammar for this manuscript. All sections were initially written by the human authors and were revised by them after artificial intelligence assistance. GitHub Copilot was used for autocompletion and troubleshooting when writing Python scripts for the statistical analysis.
Funding Västra Götaland region, Sweden. Grant no: NA.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.