Introduction
Pre-operative planning for shoulder arthroplasty has evolved from the use of two-dimensional computed tomography (CT) scans to three-dimensional CT reconstructions that can be manipulated within templating software for accurate implant positioning. Glenoid component placement within optimal parameters of version and inclination dramatically affects longevity in both anatomic total shoulder arthroplasty (aTSA) and reverse total shoulder arthroplasty (rTSA). Studies have demonstrated that three-dimensional templating software can improve the accuracy of glenoid component placement in both aTSA and rTSA (Iannotti et al. 2014, 2015; Walch et al. 2015; Venne et al. 2015). The software can also allow for the manufacturing of patient-specific instrumentation, the use of which can increase operative reproducibility relative to a surgeon’s initial plan (Hendel et al. 2012; Levy et al. 2014; Throckmorton et al. 2015). Planning software influences implant positioning accuracy as well as selection of the appropriate implant type (i.e. aTSA versus rTSA), both of which can affect the outcomes and survival rates of shoulder arthroplasty.
A surgeon’s decision of whether to perform an aTSA or an rTSA is based on an assessment of the patient’s lifestyle and demographic background, as well as specific anatomic features. The presence of a massive rotator cuff tear or significant glenoid deformity is generally accepted as an indication for an rTSA. However, surgeons are often faced with borderline cases for which no clear guidelines exist. For example, in a patient with an intact but degenerated rotator cuff, decision-making is driven by a surgeon’s intuition and based on evolving data, with recent trends supporting the use of an rTSA in elderly cuff-intact patients. A recent study demonstrated that the rate of secondary rotator cuff dysfunction at ten years was 16.8%, with a 55% risk of rotator cuff dysfunction or revision surgery at 15 years (Young et al. 2012). Furthermore, recently published data have shown that the outcomes following the two types of arthroplasty may not be as markedly different as once thought. Schoch et al. reported greater satisfaction, fewer complications and fewer reoperations following rTSA compared to aTSA, with similar patient-reported outcomes between the two (Schoch et al. 2020).
Machine learning (ML) algorithms are gaining popularity within orthopedic surgery. Machine learning is commonly referred to as ‘artificial intelligence’ (AI). Reported uses include the prediction of outcomes after total joint arthroplasty (Fontana et al. 2019; Huber, Kurz, and Leidl 2019; Kunze, Polce, Patel, et al. 2021; Kunze et al. 2020) and hip arthroscopy (Kunze, Polce, Clapp, et al. 2021; Kunze, Polce, Nwachukwu, et al. 2021), as well as the prediction of readmission rates (Arvind et al. 2021) and post-operative complications (Gowd et al. 2019), among other applications. ML models have also been applied to assist in patient selection (Biron et al. 2019) and in the prediction of unplanned readmissions (Arvind et al. 2021) and complications (Gowd et al. 2019) following shoulder arthroplasty. To our knowledge, the application of ML algorithms to the prediction of shoulder arthroplasty type, specifically whether to perform an anatomic or a reverse shoulder arthroplasty, has not been studied.
The primary objective of this study was to determine whether an ML-based software’s prediction of the optimal shoulder arthroplasty type based on CT imaging would affect overall agreement amongst a group of expert surgeons. We hypothesized that providing the ‘artificial intelligence’ software’s surgical prediction would increase agreement among six surgeons compared with overall agreement in the absence of an algorithmic prediction. The secondary objective of this study was to evaluate the accuracy of the ML algorithm’s prediction compared to an experienced surgeon’s decision. We hypothesized that the artificial intelligence software could accurately predict an experienced surgeon’s decision regarding procedure type.
Methods
Eighty-four shoulders with a diagnosis of primary glenohumeral osteoarthritis were used in this study. The mean age of the 84 patients was 68 (range: 34 to 89) years. The images were obtained from different institutions using a standardized protocol, resulting in uniform output from the previously validated three-dimensional planning software Glenosys (Imascap, Brest, France) (Walch et al. 2015).
The 84 cases were divided into two subsets of 42 cases, and two groups of three surgeons were created randomly. Each of the six surgeons is considered a high-volume arthroplasty surgeon at his or her institution, and all were within the first ten years of practice. Four of the surgeons practice in the United States and two practice in France. None of the surgeons were involved in the conception or design of the software.
Surgeons 1, 2 and 3 formed group 1 and surgeons 4, 5 and 6 formed group 2. Each group of surgeons received one subset of 42 cases. During a first round, each surgeon in each group planned the 42 cases with the software’s prediction blinded, resulting in a total of 252 plans without prediction (2 groups of 3 surgeons, each surgeon planning 42 cases). In a second round, each surgeon in each group planned the same 42 cases with the prediction visible prior to surgical decision-making, resulting in a total of 252 plans with prediction. A five-month interval separated the two planning rounds of the same cases to reduce recall bias.
The software provided a prediction based on several factors: age, gender, glenoid orientation, glenoid sphere radius, glenoid version, glenoid inclination, humeral subluxation, glenoid direction and glenoid area. It was previously trained to predict the majority vote of a group of shoulder specialists. The software’s performance was initially assessed using 210 cases that were not used during training. These cases were performed by four surgeons who are considered international experts in the field of shoulder arthroplasty, none of whom were involved in this study. The software analyzed the factors mentioned above and learned which combinations of values typically resulted in the decision for an aTSA versus an rTSA. The agreement between the prediction and the majority vote of the shoulder specialists was measured using Cohen’s kappa. A value of 0.71 was found, indicating substantial agreement between the algorithmic prediction and the majority vote of the surgeons.
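To illustrate the type of supervised model described above, the following is a minimal sketch, not the vendor’s implementation: it assumes scikit-learn, a hypothetical table of de-identified training cases, and placeholder column names for the features listed above.

```python
# Sketch only: training a support vector machine to predict the experts'
# majority vote (aTSA vs rTSA) from pre-operative features. The CSV file and
# column names are hypothetical placeholders, not the actual Glenosys pipeline.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import cohen_kappa_score

cases = pd.read_csv("training_cases.csv")  # hypothetical de-identified dataset
features = ["age", "gender", "glenoid_orientation", "glenoid_sphere_radius",
            "glenoid_version", "glenoid_inclination", "humeral_subluxation",
            "glenoid_direction", "glenoid_area"]
X = pd.get_dummies(cases[features], columns=["gender"])  # one-hot encode the categorical feature
y = cases["majority_vote"]                               # "aTSA" or "rTSA"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# SVM with probability estimates, so the output can be displayed as a percentage
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True, random_state=0))
model.fit(X_train, y_train)

# Agreement between the model and the held-out majority vote, as described in the text
print("Cohen's kappa:", cohen_kappa_score(y_test, model.predict(X_test)))
```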
Pre-operative planning without ML-prediction
From November to December 2020 (round 1), planning was performed without the AI prediction displayed, as is currently typical in clinical practice. The decision of whether to perform an anatomic or reverse shoulder arthroplasty was based on the automated three-dimensional segmented image, the glenoid measurements and CT-based assessment of rotator cuff integrity. The details regarding the measurements generated automatically by the software (i.e. glenoid version, glenoid inclination and humeral head subluxation) have been previously described in the peer-reviewed literature (Boileau et al. 2018; Shukla, McLaughlin, Lee, Nguyen, et al. 2019). In general, a reverse shoulder arthroplasty was selected in shoulders with Goutallier grade 3 or 4 fatty infiltration (Goutallier et al. 1994), superior glenoid inclination over 10 degrees given the potential for eccentric loading-related early implant failure (Kandemir et al. 2006; Walch, Young, et al. 2012), glenoid biconcavity with glenoid retroversion over 27 degrees (or if an aTSA polyethylene glenoid could not be positioned with less than 10 degrees of retroversion), or humeral head subluxation over 80% (Mizuno et al. 2013; Walch, Moraga, et al. 2012).
Pre-operative planning with ML-prediction
From May to June 2021 (round 2), a software-generated prediction of whether an anatomic or reverse shoulder arthroplasty would be most appropriate was visible to each surgeon at the time of case planning. This prediction was expressed as a probability (Figure 2) in favor of aTSA or rTSA. Each case was then planned at the surgeon’s discretion.
Statistical Analysis
Statistical analysis was performed with MedCalc® Statistical Software version 19.6.4 (MedCalc Software Ltd, Ostend, Belgium; https://www.medcalc.org; 2021). Cohen’s and Fleiss’ kappa values were interpreted according to previously published boundaries (26): < 0 (less than chance agreement); 0.01 – 0.20 (slight agreement); 0.21 – 0.40 (fair agreement); 0.41 – 0.60 (moderate agreement); 0.61 – 0.80 (substantial agreement); 0.81 – 0.99 (almost perfect agreement). The kappa coefficient has been routinely used to assess agreement within the orthopedic literature (Atoun et al. 2017; Heuer et al. 2014; Parsons et al. 2020; Ricchetti et al. 2021; Shukla, McLaughlin, Lee, Cofield, et al. 2019).
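As a small illustration of how a given kappa value maps onto these published boundaries, the following helper is a sketch only and is not part of the MedCalc analysis used in this study.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the agreement categories used in this study."""
    if kappa < 0:
        return "less than chance agreement"
    bounds = [(0.20, "slight agreement"),      # values between 0 and 0.01 are treated as slight here
              (0.40, "fair agreement"),
              (0.60, "moderate agreement"),
              (0.80, "substantial agreement"),
              (0.99, "almost perfect agreement")]
    for upper, label in bounds:
        if kappa <= upper:
            return label
    return "perfect agreement"  # a kappa of 1.0 falls outside the published bands

# Example: interpret_kappa(0.71) returns "substantial agreement",
# matching the interpretation of the 0.71 value reported in the Methods.
```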
Agreement between each surgeon and ML prediction
Cohen’s kappa was used to quantify the agreement between each surgeon and the ML prediction during round 1 (no prediction displayed) and round 2 (prediction displayed).
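For instance, the agreement between one surgeon’s decisions and the ML prediction could be computed along the following lines; this sketch uses scikit-learn and made-up decision vectors rather than the MedCalc workflow actually used in this study.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: one surgeon's decisions and the ML predictions for the
# same cases, coded as "aTSA" or "rTSA" (real vectors would contain 42 entries).
surgeon = ["aTSA", "aTSA", "rTSA", "rTSA", "aTSA", "rTSA", "aTSA", "rTSA"]
ml_pred = ["aTSA", "aTSA", "rTSA", "aTSA", "aTSA", "rTSA", "aTSA", "aTSA"]

# Cohen's kappa corrects the raw agreement (6/8 here) for chance agreement.
print(cohen_kappa_score(surgeon, ml_pred))  # 0.50 for these made-up data
```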
Agreement between the majority vote of the surgeons and ML prediction
The surgeons’ majority vote (aTSA versus rTSA) was recorded for each case in each of the two rounds. During round 1, the majority vote was obtained on each of the 84 cases without the software’s prediction displayed; during round 2, the majority vote was obtained on the same 84 cases with the prediction displayed. Cohen’s kappa between the majority vote of the surgeons and the ML prediction was calculated for each round, in order to evaluate how the display of the AI prediction could affect the majority vote among a group of surgeons.
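A sketch of this majority-vote comparison, again with hypothetical data (coded 0 = aTSA, 1 = rTSA) and NumPy/scikit-learn rather than the MedCalc analysis, is shown below.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical decisions of the three surgeons in one group for the same cases
# (rows = surgeons, columns = cases); real data would have 42 columns.
decisions = np.array([[0, 1, 1, 0, 1, 0],
                      [0, 1, 0, 0, 1, 0],
                      [1, 1, 1, 0, 0, 0]])
ml_prediction = np.array([0, 1, 0, 0, 1, 0])

# Majority vote per case: with three raters, the vote is whichever option
# at least two surgeons selected.
majority = (decisions.sum(axis=0) >= 2).astype(int)

print("majority vote:", majority)
print("kappa vs ML prediction:", cohen_kappa_score(majority, ml_prediction))  # ~0.67 here
```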
Agreement between the surgeons of each group
For each group of surgeons (group 1 and group 2), Fleiss’ kappa was used to quantify the agreement among the surgeons of the group, without (round 1) and with (round 2) the ML prediction displayed.
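Within-group agreement could be computed as in the following sketch, which uses the statsmodels implementation of Fleiss’ kappa with hypothetical ratings, not the MedCalc computation reported below.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical decisions of the three surgeons in one group (rows = cases,
# columns = raters), coded 0 = aTSA, 1 = rTSA. Real data would have 42 rows.
decisions = np.array([[0, 0, 0],
                      [1, 1, 1],
                      [1, 0, 1],
                      [0, 0, 0],
                      [1, 1, 0],
                      [0, 0, 0]])

# aggregate_raters converts the raw ratings into a subjects x categories
# count table, which is the input format fleiss_kappa expects.
table, _ = aggregate_raters(decisions)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```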
Results
Machine Learning Predictability
The surgeon’s decision matched the software-predicted surgery type in 395 of 504 plans (78%), with disagreement in 109 of 504 (22%). In round 1 (prediction blinded), the surgeon and algorithm agreed in 194 of 252 plans (77%), and in round 2 they agreed in 201 of 252 plans (80%). For four of the six surgeons, the kappa value increased when cases were planned with the software prediction (WP) compared to having no prediction (NP) available (Figure 1), while the kappa values decreased for the other two surgeons.
There were two cases for which there was complete disagreement with the software both with and without the software prediction displayed, meaning that every surgeon who planned these cases disagreed with the proposed plan in both rounds. In both cases, an rTSA was preferred over the proposed plan of an aTSA.
The availability of a software-generated ML prediction increased the agreement with the prediction for four of the six surgeons, though these increases were modest, as no increase of more than one grade (e.g. ‘moderate’ to ‘substantial’, or ‘substantial’ to ‘almost perfect’) was observed.
Agreement between the majority vote of the surgeons and the ML prediction (Table I)
The agreement between the majority vote and the software’s prediction increased from moderate (k = 0.56; 95% CI 0.45 to 0.77) in round 1 (no prediction displayed) to substantial (k = 0.61; 95% CI 0.45 to 0.77) in round 2 (prediction displayed).
Agreement between the surgeons of each group (Table II)
During round 1, there was ‘almost perfect’ agreement in group 1 (k = 0.87) and ‘substantial’ agreement in group 2 (k = 0.77). Similar agreement grades were found during round 2, with the kappa increasing in group 1 (0.94) and decreasing slightly in group 2 (0.74).
Discussion
The ML algorithm evaluated in this study provided a recommendation on shoulder arthroplasty type with 78% accuracy compared with the surgeons’ recommendations. Additionally, the algorithm’s prediction influenced surgical decision-making in 10% of cases. Though the presence of a prediction resulted in only a modest change in agreement between rounds of planning, the data suggest that the software’s ability to predict the most appropriate surgery, or the surgery that best aligned with the preferences of the shoulder specialists in this study, was very good. These data also highlight the persistent variation in decision-making amongst surgeons, consistent with the data reported by Parsons et al. (Parsons et al. 2020), who demonstrated substantial variation within and amongst surgeons planning rTSA and emphasized that surgeons largely rely on intuition and experience. The primary hypothesis was validated, in that the ML prediction did increase agreement among the surgeons compared to overall agreement in the absence of the algorithm’s prediction. The secondary hypothesis was validated as well, in that there was very strong agreement between the software’s predicted surgery type and the surgeons’ preference, with the algorithm and the surgeon agreeing in 78% of cases.
These findings highlight the potential utility of ML for surgical decision-making prior to shoulder arthroplasty. To our knowledge, the use of ML for this purpose within orthopedic surgery has not yet been reported, and this study therefore represents a contribution to the body of literature on this evolving subject.
The use of ML in orthopedic surgery has been reported with increasing frequency (Fontana et al. 2019; Huber, Kurz, and Leidl 2019; Kunze, Polce, Patel, et al. 2021; Kunze et al. 2020; Kunze, Polce, Clapp, et al. 2021; Kunze, Polce, Nwachukwu, et al. 2021; Arvind et al. 2021; Gowd et al. 2019). However, there is a relative paucity of literature on the use of ML with respect to total shoulder arthroplasty. Reddy et al. identified administration, clinical decision support, patient monitoring and healthcare interventions as the areas in which ML may have the greatest influence in healthcare (Reddy, Fox, and Purohit 2018). Machine learning categories include supervised learning, unsupervised learning, semi-supervised learning and artificial neural networks/deep learning (33).
Arvind et al. reported that predictive analytics algorithms could acceptably predict unplanned readmission following shoulder arthroplasty (Arvind et al. 2021). Biron et al. developed an ML model to predict suitable candidates for short-stay or outpatient shoulder arthroplasty (Biron et al. 2019). Gowd et al. recommended continued validation efforts to refine intelligent models, such as those used to calculate patient-specific risk of complications, which was the focus of their work (Gowd et al. 2019).
The type of supervised ML used in this study was the support vector machine (SVM). Several studies have demonstrated the applicability of this type of algorithm; Polce et al. demonstrated that an SVM performed best amongst five ML algorithms employed to predict satisfaction following TSA (Polce et al. 2021).
Surgical decision-making is dependent on many clinical features that were not included in this algorithm’s analysis, such as patient activity and physiologic age. Additionally, the value of surgeon experience and comfort level could not be studied here, and both are critically influential in planning. While a tentative surgical plan can be created pre-operatively, the final decision is sometimes not known until the time of surgery, and challenging cases are often discussed with colleagues or mentors. No software-derived algorithm can or should supplant this process or claim to definitively direct a surgeon towards one arthroplasty type. Rather, the ML prediction is another data point that a surgeon can choose to use to augment the decision-making process. Because the prediction is embedded in the planning software, it can provide an additional measure of confidence during planning for those surgeons who choose to consult it.
The decision of whether to select an aTSA or an rTSA can be challenging, particularly in borderline cases. The decision is further complicated by the fact that it is becoming less clear whether the results of an rTSA are truly inferior to outcomes after an aTSA, such as in patients with an intact rotator cuff (Schoch et al. 2020). Cox et al. reported high satisfaction rates and similar outcomes in patients with an aTSA and a contralateral rTSA (Cox et al. 2018). Schoch et al. reported higher patient satisfaction, fewer reoperations and fewer complications following rTSA versus aTSA (Schoch et al. 2020). As the outcomes and complication rates of the two arthroplasty types continue to approximate each other, surgeons consider other parameters. This study demonstrates that ML can help us to better understand the parameters that influence surgeons in their choice of implant. The ML algorithm studied here provided a recommendation based on several factors: age, gender, glenoid orientation, glenoid sphere radius, glenoid direction, glenoid area, glenoid inclination and version, and humeral head subluxation. Though there are other considerations that influence a surgeon’s decision, the algorithm studied here can provide some measure of guidance based on the quantifiable parameters that were incorporated, and with further refinement of the software it may help to guide this decision. The software-generated prediction feature is embedded within a widely available pre-operative planning software; when the surgeon initiates case planning, the prediction is auto-generated, so no alteration in planning workflow is required.
One strength of this study was that multiple surgeons planned the same cases, a benefit also observed in other multi-surgeon studies (Throckmorton et al. 2015; Parsons et al. 2020). The heterogeneity in results, namely the software’s prediction not uniformly increasing surgeon agreement, supports the validity of our methodology. Importantly, the study also provides a degree of confidence in the software and its ability to align with the surgeons’ decisions in 78% of cases.
Limitations
One limitation of this study was that surgical decisions were based on CT imaging only, without any potentially influential clinical data, as discussed above. As a natural evolution of this work, patient-specific functional parameters and clinical outcomes data will need to be incorporated into the algorithm, which is an area of future development and study. Additionally, more precise analysis of pre-operative soft-tissue quality would greatly augment the software’s value to surgeons, which is an additional area of future research. Despite these current deficiencies in the algorithm’s dataset, this study demonstrated that machine learning can potentially provide some measure of guidance in surgical decision-making for shoulder arthroplasty, with the understanding that ultimately this choice is a matter of experience and preference.

Additionally, though there was the potential for recall bias during the second round of case planning, measures were taken to minimize this risk, including a five-month interval between planning rounds of the same cases, alteration of case order and anonymization through the use of alphanumeric case identifiers.

Another limitation was that there were no defined parameters to which each surgeon adhered when selecting an anatomic versus reverse shoulder arthroplasty. However, this limitation potentially results in a more clinically realistic or translatable result. As the indications for reverse shoulder arthroplasty continue to evolve, surgeons are likely less inclined to make decisions solely on strict anatomic parameters. General guidelines do exist in the literature, and it has been purported that a reverse shoulder arthroplasty should be considered in the presence of Goutallier grade 3 and 4 fatty infiltration (Goutallier et al. 1994), superior inclination over 10 degrees (Kandemir et al. 2006; Walch, Young, et al. 2012), and glenoid biconcavity with retroversion over 27 degrees or humeral head subluxation over 80% (Mizuno et al. 2013; Walch, Moraga, et al. 2012). Though these factors were taken into consideration, each surgeon independently used his or her own discretion. Each surgery was planned by a high-volume surgeon with similar experience and expertise in this particular field. Though the algorithm had little influence on the choices of high-volume surgeons, it might have a greater influence on surgeons who perform a lower number of shoulder arthroplasties, and who could therefore benefit greatly from this guidance.
ML within orthopedic surgery is in its infancy in terms of development and refinement, and ML-based models are not yet widely available to many orthopedic surgeons. However, available data are increasing, and it is likely that ML will be integrated into the clinical workflow in the near future. Therefore, despite the limitations outlined above, these data are useful in determining which clinical applications are appropriate for ML, identifying areas of improvement and focusing future research efforts.
Conclusion
The ML software studied here predicted with 78% accuracy whether an aTSA or an rTSA should be used in select cases, using the decisions of six shoulder arthroplasty specialists as the reference standard. However, these expert surgeons were not substantially influenced by the software’s prediction. Nonetheless, this demonstrates that machine learning has value in this application and can help to guide surgeons who perform shoulder arthroplasty infrequently, particularly as there are no clear guidelines on arthroplasty type in many cases.