Introduction
Dupuytren contracture (DC) is a common progressive fibroproliferative condition with an estimated prevalence of 8.2% (Townley et al. 2006a; Salari et al. 2020a). The condition results in a flexion deformity of the digits of the hand, particularly the fourth and fifth digits, which may lead to severely limited hand function. Given the high prevalence of DC, many patients may turn to free online patient-education resources to answer questions about their condition. One potential online resource is ChatGPT 3.5 (OpenAI, San Francisco, California), an artificial intelligence (AI) large language model that was first released for public use in November 2022. ChatGPT was designed to be an easily accessible and convenient chatbot that responds to individualized written prompts. Given its versatility, patients may reasonably turn to this technology for answers to their healthcare concerns and to educate themselves on their diagnoses and treatment options.
The use of AI in this capacity may influence healthcare, patient education, and the physician-patient relationship. Other AI platforms have been shown to significantly improve the quality of shared decision making without negatively impacting clinical efficiency (Bozic et al. 2013; Jayakumar et al. 2021; Crook et al. 2023; Hurley et al. 2023; Mika et al. 2023; Christy et al. 2023; Anastasio et al. 2023; Kaarre et al. 2023). Jagiella-Lodise et al. investigated ChatGPT’s role in patient education regarding orthopedic hand pathologies (Jagiella-Lodise, Suh, and Zelenski 2024). The goal of this study is to expand upon these findings and assess ChatGPT 3.5’s responses to patient questions on the causes, treatments, and prognosis of DC using a standardized grading system. Furthermore, this study analyzes the readability of the responses to determine the efficacy of its use in patient education and shared decision making.
Methods
The “Frequently Asked Questions” pages from ten well-known healthcare institution websites were reviewed. The authors then collected the ten most common and clinically relevant questions. The questions were entered into ChatGPT (https://chat.openai.com/chat) on December 21, 2023 (Appendix). ChatGPT responses were recorded after the first query, without follow-up prompts or requests for clarification.
The senior authors analyzed the responses for accuracy and reliability using the DISCERN instrument (Table 1) and the JAMA Benchmark criteria (Charnock et al. 1999). The DISCERN scores were classified as: Excellent (scores 64-80), Good (scores 52-63), Fair (scores 41-51), Poor (scores 30-40), and Very Poor (scores 16-29) (Table 2) (Tahir et al. 2020). The Flesch-Kincaid Grade Level was used to assess readability (Table 3) (Flesch 1948; Kincaid et al. 1975). If disagreements in the grading occurred, consensus was reached through further discussion. Inter-rater agreement was assessed using Cohen’s kappa coefficient, which was calculated to be 0.818. This indicates near-perfect agreement.
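For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how an unweighted Cohen's kappa can be computed for two raters. The text does not state whether an unweighted or weighted form was used, so the unweighted form here is an assumption. The function name `cohen_kappa` and the item-level ratings are hypothetical and are not the study's data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same category.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical item-level ratings on a 1-5 scale (NOT the study's data).
rater_a = [1, 2, 3, 3, 5, 4, 2, 1, 5, 3]
rater_b = [1, 2, 3, 4, 5, 4, 2, 1, 5, 3]
kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

Under the commonly used Landis and Koch interpretation, values from 0.81 to 1.00 are described as almost perfect agreement, which is the band the reported value of 0.818 falls into.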
Results
All ChatGPT responses failed to include sources for the information provided, yielding JAMA Benchmark criteria scores of zero. The average DISCERN score was 38.4, indicating overall poor responses (Table 4). However, all responses received 5 points for question 15 of the DISCERN score since they encouraged consultation with a professional as soon as possible for further evaluation and shared decision making (Dutta et al. 2020).
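As an illustration of the scoring arithmetic, the sketch below applies the DISCERN bands stated in the Methods to the ten per-question totals reported in this section and recomputes the average. The names `DISCERN_BANDS` and `classify_discern` are hypothetical, and rounding the average before classifying it is an assumption, since the bands are defined only for whole-number totals.

```python
from statistics import mean

# Band boundaries as stated in the Methods (total score range 16-80).
DISCERN_BANDS = [
    (64, 80, "Excellent"),
    (52, 63, "Good"),
    (41, 51, "Fair"),
    (30, 40, "Poor"),
    (16, 29, "Very Poor"),
]


def classify_discern(score):
    """Map a whole-number DISCERN total to its quality band."""
    for low, high, label in DISCERN_BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside the 16-80 DISCERN range")


# Per-question DISCERN totals as reported for Questions 1-10.
discern_scores = [35, 40, 32, 39, 40, 43, 34, 37, 47, 37]
labels = [classify_discern(s) for s in discern_scores]
average = mean(discern_scores)

# Assumption: round the average to a whole number before banding it.
average_label = classify_discern(round(average))
print(average, average_label)
```

With these inputs, eight responses fall in the Poor band and two in the Fair band, consistent with the per-question labels reported below.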
Question 1: What is Dupuytren’s contracture?
DISCERN Score: 35 (Poor)
Flesch-Kincaid Grade Level: 13.0
Analysis: The ChatGPT response provided information that was generally correct but also provided additional information about the treatment of DC that was not immediately relevant to the posed question. The model correctly identified the basic pathophysiology of DC and the common manifestations of the condition such as the fourth and fifth digits being the most commonly affected. However, the chatbot did not mention that both hands could be affected by DC (Townley et al. 2006b). ChatGPT also accurately stated that DC occurs in men more often than women and occurs later in life. It also correctly listed the strong genetic component of its etiology and the association with Peyronie’s disease and Ledderhose disease (Mohede et al. 2020; Ruettermann et al. 2021b).
However, ChatGPT’s response was incomplete in the discussion of risk factors as it failed to include diabetes mellitus, liver disease, epilepsy, heavy alcohol consumption, smoking, and the use of vibrating power tools (Ruettermann et al. 2021b; Burge et al. 1997; An et al. 1988). The response also failed to provide a comprehensive list of treatment options, omitting digital amputation, medical management with steroids or collagenase, and continued observation (Denkler, Park, and Alser 2022). In terms of operative management, the response identified that surgical release of connective tissues was an option, likely alluding to fasciotomy (Denkler, Park, and Alser 2022).
Question 2: What are the symptoms of Dupuytren Contracture?
DISCERN Score: 40 (Poor)
Flesch-Kincaid Grade Level: 13.3
Analysis: ChatGPT’s response correctly identified the main symptoms of DC and its association with Peyronie’s and Ledderhose diseases (Townley et al. 2006b). However, the model did not list other potential causes of limited finger mobility or of cords and nodules of the hand.
Question 3: What causes Dupuytren Contracture?
DISCERN Score: 32 (Poor)
Flesch-Kincaid Grade Level: 16.20
Analysis: The response contained several components that were misleading or required additional clarification. The AI model correctly identified most of the known risk factors for DC: genetic factors, age, male sex, diabetes mellitus, liver disease, exposure to vibrations from manual labor, smoking, and alcohol (Ruettermann et al. 2021b; Burge et al. 1997; An et al. 1988). It did not, however, include injury to the hand or wrist as a risk factor (Samulėnas et al. 2020). While ChatGPT’s response correctly states that DC is most common among individuals of Northern European ancestry, the additional comment that DC is especially common among individuals of “Scandinavian or Celtic Ancestry” does not take into account more recent evidence that DC may not truly be a “Disease of the Vikings,” as was previously believed or suggested (Ågren et al. 2023; Hindocha, McGrouther, and Bayat 2009; Ng et al. 2020).
To the casual reader, ChatGPT’s response may suggest that Ledderhose and Peyronie’s Disease are in themselves risk factors for DC as opposed to comorbid associated conditions (Gelbard and Rosenbloom 2021). Additionally, ChatGPT’s response may suggest to the reader that females require a family history of DC in order to develop the disease, which may be misleading. Although males are affected more often than females, a family history of DC is not required for diagnosis in female patients (Salari et al. 2020a).
The response also mentions an association between epilepsy and DC, likely alluding to the use of antiepileptic agents as a risk factor for DC. Long-term administration of phenobarbital for the treatment of epilepsy has been linked to the development of fibroproliferative diseases such as DC (Broekstra et al. 2018; Strzelczyk et al. 2008). The response fails to specify that DC appears to be associated with a single class of medications and may suggest to the casual reader that the risk of DC from antiseizure medication therapy is generalizable across all classes of agents. Additionally, the response does not suggest consulting a physician before changing or stopping an antiepileptic medication, posing a potential safety concern (Ågren et al. 2023; Hindocha, McGrouther, and Bayat 2009; Ng et al. 2020).
Question 4: How is Dupuytren Contracture treated?
DISCERN Score: 39 (Poor)
Flesch-Kincaid Grade Level: 12.44
Analysis: The chatbot correctly identified the most common treatments for DC (Ruettermann et al. 2021a; Soreide et al. 2018). While a brief explanation for each treatment option is included, the response failed to include their associated indications. Additionally, more detail for the “observation” intervention is needed to clarify how long observation would be maintained and what signs or symptoms would encourage an intervention (Bayat and McGrouther 2006). The response also failed to discuss recurrence rates following treatment. There is no known cure for DC. However, it can be managed surgically and non-surgically (Denkler 2022; Wong et al. 2022; Kan et al. 2017). Recurrence rates following surgical intervention vary from 10% to 58% depending on the specific intervention (Wong et al. 2022). The chatbot did reinforce that the individuality of a patient influences which treatment a physician will select.
Question 5: What happens if I don’t treat my Dupuytren Contracture?
DISCERN Score: 40 (Poor)
Flesch-Kincaid Grade Level: 16.18
Analysis: The chatbot successfully addresses the consequences of DC, should it go untreated (Bayat and McGrouther 2006). A satisfactory explanation is offered, allowing the reader to fully understand long-lasting effects. The chatbot acknowledges that the condition progresses differently in each patient, and severity may range, worsening with time.
Question 6: How will having Dupuytren’s affect my quality of life and ability to do everyday tasks?
DISCERN Score: 43 (Fair)
Flesch-Kincaid Grade Level: 16.55
Analysis: The chatbot does an excellent job highlighting the various ways that DC may impede one’s lifestyle with satisfactory explanations of each (Wilburn et al. 2013). The chatbot, once again, emphasizes the individuality of the disease, explaining that symptoms can vary. It remains imperative to monitor the condition with a healthcare professional who will then determine when treatment is required to prevent further disease progression, as stated in the response (Zyluk and Jagielski 2007; Dutta et al. 2020).
Question 7: Who is most likely to develop Dupuytren?
DISCERN Score: 34 (Poor)
Flesch-Kincaid Grade Level: 14.71
Analysis: ChatGPT did an excellent job of highlighting important etiologies and risk factors of DC, as well as giving appropriate clarifications for each. It did, however, fail to mention some common associations such as nodular plantar fibromatosis, nodular fasciitis of popliteal fascia, and Peyronie’s disease (Gelbard and Rosenbloom 2021), although research regarding the relationship between these conditions is still ongoing. Additionally, the chatbot’s statement regarding higher prevalence in certain populations appears to be outdated. Recent studies have shown DC to be most prevalent in Africa and Asia rather than among those of northern European descent. With an overall prevalence of 8%, the highest reported prevalence rates are 17% in Africa, 15% in Asia, 10% in Europe, and 2% in the Americas (Salari et al. 2020b). Nevertheless, the overall response by the chatbot was informative and correctly identified numerous important risk factors, such as genetics, diabetes, alcohol, smoking, and previous hand injury (Ruettermann et al. 2021a).
Question 8: What are the common complications for Dupuytren Contracture treatment?
DISCERN Score: 37 (Poor)
Flesch-Kincaid Grade Level: 12.30
Analysis: The ChatGPT response regarding complications of DC treatment is informative and thorough. Each treatment option is appropriately addressed with its specific complication risks. However, one key point that is overlooked by the chatbot response is the risk of each of these complications. When attempting to make an informed decision about treatment and future management, it is vital to know all complications, but it is also just as important to understand the percentage of cases that have these complications. For example, segmental fasciotomy complication rates range from 0% to 5.6%. Additionally, the risk of recurrence within five years following surgery ranges from 12% to 73% (Ruettermann et al. 2021a). Recurrence rates are important considerations, as this may guide patient decisions regarding treatment options.
Question 9: Which treatment for Dupuytren’s contracture is better: surgical or non-surgical?
DISCERN Score: 47 (Fair)
Flesch-Kincaid Grade Level: 13.07
Analysis: The chatbot correctly identifies the many factors that determine whether surgical or conservative methods are better for the treatment of Dupuytren’s, including disease severity, patient preferences, and overall goals. ChatGPT also acknowledges that the choice is individualized and may change as the disease progresses. Furthermore, the response does a good job of summarizing both surgical and non-surgical treatment options, as well as their risks and benefits. What ChatGPT fails to cover is the evolution of treatment algorithms, as fasciectomies and fasciotomies have decreased while XIAFLEX® injections have increased, with more than 50,000 being delivered between 2010 and 2017 (Lipman, Carstensen, and Deal 2017). Cost is an important factor that is also not mentioned. Non-surgical treatments are often much cheaper and may be done at home, while more invasive procedures can cost upwards of $6000-7000 (Camper et al. 2019).
Question 10: How is Dupuytren’s contracture diagnosed?
DISCERN Score: 37 (Poor)
Flesch-Kincaid Grade Level: 15.04
Analysis: ChatGPT is correct in that the diagnosis of DC is clinical, largely based on history and physical exam. The response includes the major factors looked for in diagnosis and what the patient may expect upon examination, including family history, comorbidities, presence of nodules or cords, and loss of extension. All these points covered in the patient history and physical exam have been shown to be primary diagnostic criteria or associated factors (Mandel and DeMarco 2014). Although details are not provided on the differential diagnoses, the response was accurate, and the overall procedures would not change.
The chatbot response does a great job of addressing the possible need for imaging such as ultrasound (US), magnetic resonance imaging (MRI), and computed tomography (CT). It is correct that imaging is often unnecessary but may be more useful for surgical purposes and tracking disease progression. While US is inexpensive and non-invasive, it is most commonly used to assist with minimally invasive procedures, as there is little evidence that echogenicity is associated with disease progression (Molenkamp et al. 2019a). MRI corresponds to disease progression by histological stage much better than the other imaging options, but it is more expensive and time-consuming (Molenkamp et al. 2019b).
Discussion
Given the potential role of ChatGPT in patient education, it is vital for patients and healthcare workers alike to be aware of the technology’s abilities and limitations. Our primary investigation was aimed at determining the educational potential and accuracy of ChatGPT responses regarding DC to determine the efficacy of its use in shared decision making.
The average DISCERN score was 38.4, which is categorized as “Poor.” Overall, we found that most responses required only minimal clarification, suggesting that ChatGPT may have potential to be used as an accessory to patient education. However, inconsistencies in its responses were observed, and some answers required further elaboration to avoid misleading the reader. Additionally, ChatGPT failed to provide citations for its information. This resulted in a JAMA Benchmark criteria score of zero and reduced the DISCERN scores for all responses. It also restricts the opportunity for patients to verify the information provided. Similarly, several recent studies have investigated ChatGPT’s role in patient education in orthopedic surgery, including the model’s ability to provide accurate and high-quality responses to commonly asked patient questions. Crook et al. and Hurley et al. found that ChatGPT was capable of providing easily understandable responses to questions about common hand procedures and shoulder stabilizations, respectively. However, the model’s responses fell short on benchmarks of attribution, reliability, and currency of information (Crook et al. 2023; Hurley et al. 2023). Furthermore, studies have shown that ChatGPT is inconsistent in the accuracy of its responses, at times providing inaccurate responses or responses requiring further clarification on topics including total hip arthroplasty, anterior cruciate ligament surgeries, rotator cuff repair, distal radius fractures, and foot and ankle surgeries (Mika et al. 2023; Christy et al. 2023; Anastasio et al. 2023; Kaarre et al. 2023; Eng et al. 2024).
Additionally, the average Flesch-Kincaid Grade Level was 14.3, corresponding to a college reading level. It has been established that the average American reads at an 8th grade level (Cotugna, Vickery, and Carpenter-Haefele 2005; Weiss et al. 1994; Brega et al. 2015). Therefore, the National Institutes of Health (NIH) recommends that patient education materials not exceed a 6th grade reading level (Walsh and Volsko 2008; Weiss and Coyne 1997). Eng et al. found that ChatGPT provided responses with a literacy level comparable to that of a college freshman. Furthermore, Fahy et al. found that ChatGPT-4 provides responses at a higher readability index, which may be confusing to some users. The high literacy level required to comprehend ChatGPT’s responses greatly limits its utility and its accessibility across patient populations. Therefore, proper patient education should be completed by a board-certified orthopedic surgeon. This underscores the need for patient education materials that focus on accuracy, readability, and accessibility.
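To make the readability metric concrete, the sketch below implements the standard Flesch-Kincaid Grade Level formula and recomputes the average from the ten per-question grade levels reported in the Results. The function name `flesch_kincaid_grade` and the example word, sentence, and syllable counts are hypothetical. The study does not state which tool produced its counts, so the function takes the counts as inputs rather than guessing at a syllable-counting method.

```python
from statistics import mean


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Standard Flesch-Kincaid Grade Level from raw text counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts for a single passage (NOT taken from the study).
example_grade = flesch_kincaid_grade(
    total_words=120, total_sentences=5, total_syllables=210
)

# Per-question grade levels as reported for Questions 1-10.
fk_levels = [13.0, 13.3, 16.20, 12.44, 16.18, 16.55, 14.71, 12.30, 13.07, 15.04]
average_level = mean(fk_levels)
print(round(average_level, 1))
```

Because the formula rewards short sentences and short words, lowering either ratio is the most direct way to bring a response closer to the 6th grade target cited above.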
Limitations
While the questions posed to ChatGPT were largely straightforward and uncomplicated, the chatbot’s ability to gather and interpret the information needed to answer questions requiring a higher level of critical thinking is highly debated. A previous study by Kung et al. challenged ChatGPT with orthopedic examination questions and determined that the chatbot often outperformed medical residents within the field (Kung et al. 2023). A similar study, however, concluded that the chatbot underperformed and would not pass board examinations (Massey, Montgomery, and Zhang 2023). Additional investigation into this area is needed to determine ChatGPT’s ability to answer questions requiring higher-order critical thinking. Additionally, components of the DISCERN scoring and the determination of the quality of ChatGPT answers in this study were subjective and therefore open to debate. Furthermore, AI software such as ChatGPT is constantly evolving, and its responses vary depending on question phrasing and any additional prompts entered. This opens the possibility for future research to evaluate the reliability of various software versions, such as comparing the responses of ChatGPT 3.5 to those of ChatGPT 4.0.
Conclusion
Given the growing role ChatGPT plays in patient education, it is vital for patients and healthcare workers alike to be aware of the technology’s abilities and limitations. Our findings suggest that ChatGPT possesses the ability to function as a supplementary educational tool for patients with DC. However, it may generate responses that omit important information or are too complex for the general public. Additionally, it is important to understand that every patient is unique. Therefore, it is recommended that patients seek diagnosis and treatment, and address any concerns, with an appropriate medical provider.