Introduction
Knee osteoarthritis (OA), or degenerative joint disease, is one of the most common reasons for presentation to orthopaedic and primary care offices (Weinstein et al. 2013; Turkiewicz et al. 2015; Van Manen, Nace, and Mont 2012). The prevalence of knee arthritis has grown considerably since the mid-twentieth century, affecting more than 50% of individuals over the age of 65 and approximately 80% of those over the age of 75 (Wallace et al. 2017; Arden and Nevitt 2006). Standing knee radiographs with multiple views are typically the first imaging study obtained to evaluate the presence and severity of OA (Boegård and Jonsson 1999; Duncan et al. 2015). At most institutions and radiology centers, a radiologist interprets the radiographs, and a written report is made available to the referring physician (primary care physician or orthopaedic surgeon). In some, but not all, cases the radiographs are reviewed by a radiologist specializing in musculoskeletal imaging. Additionally, radiographs are commonly repeated at the initial orthopaedic evaluation, even if ordered previously by a different physician (Yayac et al. 2021). Unfortunately, there is at present no gold standard radiographic scale for osteoarthritis, and this difficulty is compounded by the involvement of physicians across different specialties.
While there is no single gold standard classification method, multiple systems have been described for classifying radiographic knee osteoarthritis based on etiology, symptom duration and severity, and radiographic findings (Lespasio et al. 2017). The Kellgren-Lawrence (KL) system, described in 1957, is one of the most widely used systems for research purposes (Kellgren and Lawrence 1957). Despite its common use in research, the KL system is not widely used in clinical practice (Riddle, Jiranek, and Hull 2013). Its limitations include overemphasis of osteophytes and underemphasis of joint space narrowing, which has been shown to be a more reliable indicator of OA (Heng, Bin Abd Razak, and Mitra 2015; Kallman et al. 1989; Wright and The MARS Group 2014). In our experience, OA is commonly graded in clinical practice as “mild,” “moderate,” or “severe” disease without the use of any specific classification system. Thus, differences in each physician’s interpretation of the same image may lead to discrepancies in clinical documentation, patient management, and prior authorization.
The purpose of this study was to investigate agreement in the interpretation of knee radiographs between orthopaedic surgeons and radiologists using the simple, subjective terms commonly employed in practice. We examined agreement among orthopaedic surgeons specializing in arthroplasty, musculoskeletal radiologists, and general radiologists. Specifically, we investigated agreement in (1) the severity of OA and (2) the location of OA. We hypothesized that there would be moderate to strong agreement between physicians of the same specialty but lower agreement between those of different specialties.
Methods
Study Setting and Participants
One hundred five patients presenting to a single orthopaedic practice for unilateral knee pain were identified. Mean age was 62 ± 16 years, 65 patients (62%) were female, and mean body mass index (BMI) was 28 ± 7 kg/m². Standing anteroposterior (AP) and lateral radiographs were obtained for each patient. Patient history and demographic information were blinded before evaluation by each reviewer.
Six physicians independently reviewed the radiographs to characterize the severity and location of OA. Reviewers included two high-volume adult reconstruction orthopaedic surgeons, two fellowship-trained musculoskeletal (MSK) radiologists, and two general radiologists. For each set of radiographs, osteoarthritis was graded as “mild,” “moderate,” or “severe,” mirroring the language used in the providers’ clinical documentation. The location of degenerative changes was recorded as medial compartment, lateral compartment, patellofemoral (PF), or any combination thereof. Although blinded to the entirety of the patient’s clinical presentation, reviewers were also asked to indicate the perceived need for total knee arthroplasty (TKA) based solely on the severity noted on the knee radiographs.
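For illustration, these responses can be organized as one row per patient with one set of columns per reviewer. The sketch below is a hypothetical layout only; the object and column names are illustrative and are not taken from the study’s data:

```r
# Hypothetical layout only: one row per patient, one set of columns per reviewer.
# Severity is an ordered categorical grade; compartment involvement and
# perceived need for TKA are recorded as logical (yes/no) flags.
ratings <- data.frame(
  patient_id     = 1:105,
  severity_surg1 = factor(NA, levels = c("mild", "moderate", "severe")),
  medial_surg1   = NA,  # medial compartment OA noted
  lateral_surg1  = NA,  # lateral compartment OA noted
  pf_surg1       = NA,  # patellofemoral OA noted
  tka_surg1      = NA   # perceived need for TKA
  # ... analogous columns for the remaining five reviewers
)
```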
Statistical Analysis
Agreement was assessed using Fleiss’ kappa for categorical variables. Kappa values less than 0.3 were considered no true agreement, 0.3 to 0.5 weak agreement, 0.5 to 0.8 moderate agreement, and greater than 0.8 strong agreement. Moderate or strong agreement was considered reliable. We compared agreement among readers of the same specialty and between specialties using the following groups: (1) surgeons and all radiologists, (2) surgeons and MSK radiologists, (3) surgeons and general radiologists, and (4) MSK radiologists and general radiologists. All statistical analyses were performed in R (version 3.6.3; R Foundation for Statistical Computing, Vienna, Austria).
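For readers interested in reproducing this type of analysis, a minimal sketch of the Fleiss’ kappa calculation is shown below; it assumes a subjects-by-raters matrix of severity grades and uses the irr package, which is an illustrative choice rather than the study’s actual analysis code:

```r
# Minimal sketch: Fleiss' kappa for multi-rater categorical agreement.
# Assumes 'severity_ratings' is a 105 x 6 matrix (patients x reviewers)
# containing "mild" / "moderate" / "severe" grades.
library(irr)

set.seed(1)
severity_ratings <- matrix(
  sample(c("mild", "moderate", "severe"), 105 * 6, replace = TRUE),
  nrow = 105, ncol = 6,
  dimnames = list(NULL, c("surg1", "surg2", "msk1", "msk2", "gen1", "gen2"))
)

# Agreement across all six reviewers
kappam.fleiss(severity_ratings)

# Agreement within a subgroup, e.g., the two arthroplasty surgeons
kappam.fleiss(severity_ratings[, c("surg1", "surg2")])
```

The resulting kappa value is then interpreted against the thresholds above (less than 0.3 no true agreement, 0.3 to 0.5 weak, 0.5 to 0.8 moderate, greater than 0.8 strong).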
Results
Overall Agreement
When comparing reads across all reviewers, we found weak agreement in the assessment of severity and of PF OA (Table 1). There was no true agreement in the assessment of medial, lateral, or tricompartmental OA. No assessment reached moderate or strong agreement across the entire group of reviewers.
Agreement Within Orthopaedic Surgeons
Orthopaedic surgeons demonstrated weak agreement in assessment of severity and lateral compartment OA, but no true agreement for medial, PF, or tricompartmental OA (Table 2). They did, however, show moderate agreement (κ = 0.503) in their perceived need for TKA based on radiographic findings.
Agreement Among All Radiologists (MSK and General)
The radiologists combined as a single group showed moderate agreement in the assessment of PF OA, weak agreement in severity, and no true agreement for all other locations and for the recommendation of TKA (Table 3). MSK radiologists had weak agreement in the assessment of severity and PF OA and no true agreement for medial, lateral, or tricompartmental OA (Table 4). Similar to the orthopaedic surgeons, they demonstrated moderate agreement in the perceived need for TKA (κ = 0.568), the strongest agreement within any single specialty.
Agreement Across Specialties
There was weak agreement in the assessment of severity in all four comparison groups (Table 5), with the highest values between general and MSK radiologists (κ = 0.438) and between surgeons and MSK radiologists (κ = 0.438). Location of OA showed the lowest agreement of any between-group comparison, with κ values ranging from 0.126 to 0.183.
Discussion
The most important finding of this study was that the commonly used designations of mild, moderate, and severe arthritis reported for knee radiographs are neither consistent nor reproducible. We found generally weak agreement both among and between orthopaedic surgeons and radiologists in the interpretation of radiographic knee arthritis based on an assessment of the severity and location of disease. Only three comparisons reached moderate agreement: (1) assessment of PF arthritis by radiologists, (2) perceived need for TKA by orthopaedic surgeons, and (3) perceived need for TKA by MSK radiologists. The perceived need for TKA is merely a subjective judgment of whether the patient would be a candidate for TKA based solely on radiographic OA severity; it is not clinically actionable on its own, as that decision requires the patient’s full clinical picture and physical examination. No comparison resulted in strong agreement within or between specialties. These findings suggest that adoption of, and adherence to, a standard classification system that is both reliable and practical for clinical use is needed for consistent interpretation of radiographic knee arthritis.
The Kellgren and Lawrence grading system is widely used and was found to have the highest inter- and intra-observer correlation coefficients for the severity of knee arthritis (0.83 for both) compared with other joints (Kellgren and Lawrence 1957). Despite its popularity, the KL grading system is not without drawbacks. Wright et al. evaluated the interrater reliability of six classification systems (KL, International Knee Documentation Committee (IKDC), Fairbank, Brandt et al., Ahlback, and Jager-Wirth) for degenerative changes in patients undergoing revision anterior cruciate ligament reconstruction (Wright and The MARS Group 2014). For the KL classification, they reported intraclass correlation coefficients of 0.38 for AP radiographs and 0.54 for Rosenberg flexion radiographs (Wright and The MARS Group 2014). They found the IKDC classification, which is based on the degree of joint space narrowing, to have the best combination of interrater reliability and correlation with arthroscopic findings (Wright and The MARS Group 2014; Mehta et al. 2007). A study by Riddle et al. examined the interrater reliability of the KL system among arthroplasty surgeons, finding moderate to high agreement in knees indicated for TKA but lower agreement in contralateral knees (Riddle, Jiranek, and Hull 2013).
While different radiographic classification systems have demonstrated various advantages and disadvantages for research purposes, the lack of a gold standard system for grading radiographic arthritis leads to a variety of clinical approaches to interpreting knee radiographs. Although widely recognized and used in practice, the KL system has been shown to underpredict the degree of OA observed intraoperatively at the time of arthroplasty (Abdelaziz et al. 2019; Blackburn et al. 1994). Complicating matters further, even within the KL system, different versions of the same criteria have led to lower agreement between readers (Schiphof et al. 2011). While studies have investigated the reliability of these systems, little has been reported on how frequently they are used in the routine evaluation of radiographs. It is critical that reviewers, both within and between specialties, speak the same language when communicating and documenting in patient notes. Given that only variable agreement has been demonstrated in the literature using these well-established systems, our findings suggest even lower agreement when reviewers grade OA subjectively, as is commonly the case in practice.
From a patient’s perspective, inconsistent reporting and charting of radiographic findings can have real consequences. Now that patients can readily access their charts and read clinical documentation, these inconsistencies can be anxiety-provoking (Meyer et al. 2021). When one provider reports “mild” disease and another reports a different grade for the same knee, the diagnostic uncertainty can cause unnecessary stress and potential mistrust. Additionally, insurance companies often deny surgery based on the radiologist’s reading, for example when OA is documented as mild to moderate rather than severe. Therefore, communication with loosely defined terms such as “mild,” “moderate,” and “severe” can lead to prior authorization issues for the patient and payor.
Previous studies have investigated agreement between surgeons and radiologists in other orthopaedic subspecialties, with variable results. One study demonstrated higher agreement among radiologists than among surgeons when evaluating chondral knee lesions on magnetic resonance imaging (MRI) (Cavalli et al. 2011). Another showed that experienced surgeons were more accurate at identifying shoulder lesions on MRI when compared with intraoperative findings (van Grinsven et al. 2015). A study of the radiographic diagnosis of femoroacetabular impingement demonstrated higher interobserver reliability within the same specialty but poor agreement between radiologists and surgeons (Ayeni et al. 2014). Assessment of hip fracture healing, another area with poor agreement and no reliable standard, was shown to improve with a standardized union score (Chiavaras et al. 2013). These studies demonstrate inconsistent agreement across multiple orthopaedic subspecialties, which we found to be true for knee OA as well.
There are several limitations to note. The radiographs reviewed included AP and lateral views; the inclusion of Rosenberg flexion and posteroanterior (PA) views would have given the reviewers a more complete assessment, which could alter the agreement findings. A sunrise view would also have been helpful for assessing PF osteoarthritis. Additionally, to limit the workload on our reviewers, this study did not include a measure of intra-rater reliability. There is also potential for experience bias, in that the orthopaedic surgeons and MSK radiologists may be more accustomed to interpreting knee radiographs; nevertheless, all six reviewers were from a high-volume institution and had prior experience interpreting knee osteoarthritis. Another limitation is the selection of radiographs from a single orthopaedic practice in a metropolitan area. Radiographs were not reviewed before the study to assess quality of alignment; therefore, some inconsistencies may have existed between images. Additionally, the two orthopaedic surgeons were fellowship-trained, high-volume arthroplasty surgeons, so our findings may not be generalizable to other subspecialty or generalist orthopaedic surgeons. Finally, the instruction to grade osteoarthritis as mild, moderate, or severe was likely interpreted subjectively by each reviewer; however, this subjective interpretation contributed to our main finding of inconsistent agreement and mirrors ordinary practice.
Utilizing and strictly adhering to a standard grading system for evaluating radiographic knee arthritis remains a challenge, both within and between specialties. While the ultimate decision to undergo TKA must incorporate the patient’s history and physical examination, radiographs play an integral role in the evaluation and grading of arthritis, and without objectively stated findings the utility of such a report must be questioned. Radiographic findings also frequently factor into third-party payers’ determinations of surgical necessity, raising issues of prior authorization. Our study demonstrated that, even between physicians of the same specialty, there remains a high degree of inconsistency, and this inconsistency was even more pronounced when comparing across specialties. This may in turn create obstacles to obtaining third-party payor approval, leading to challenges in providing timely and appropriate care. Future research should seek to identify how often language such as “mild,” “moderate,” and “severe” is used to grade OA rather than standardized grading systems. Establishing and adhering to a reliable and efficient gold standard for clinical use is critical to improving decision making and communication among physicians, patients, and third-party payers.