Which Behaviors Generate The Best Reviews? A Sentiment Analysis of Online Reviews on AOSSM Surgeons

Justin E Tang; Ting Cong; Arielle Hall; Jun Kim; James Gladstone

doi:10.60118/001c.87964

Introduction

With progressive efforts to transition orthopaedic surgical practice in the United States from the traditional fee-for-service model to value-based (pay-for-performance) compensation models, there has been a push at the provider level to understand, and ultimately quantify, what constitutes “value”, “performance”, and “quality” in orthopaedic surgery. With this trend, surgeons have become more attentive, and perhaps more sensitive, to the tactics and data that comprise these measures. Taking a step back and reflecting on what fundamentally constitutes good patient care and on the teachings of the Hippocratic Oath, a provider can conclude that the determination of “value”, “performance”, and “quality” relates in great part to the patient experience.

Patient experience, in one of its simplest and purest forms, is captured in patient ratings of orthopaedic surgeons. Over the past decade, online rating platforms such as Healthgrades© and Zocdoc© have become a prevalent source of information for prospective patients making decisions about the doctors they choose. This democratization of the patient-doctor relationship can, to a measurable degree, influence patients’ choice of physician (Noble et al. 2006), and can lead to changes at the level of the surgical practice.

Quantifying the patient experience is not a novel concept. Prior studies have correlated positive patient reviews with factors such as a surgeon’s social media presence (Donnally, Li, et al. 2018), academic rather than non-academic practice, and mid- rather than early- or late-career practice (Frost and Mesfin 2015). These studies were generally performed manually, by tabulating numerical “star ratings” and correlating them with qualities of the surgeon and their practice (Arthur, Etzioni, and Schwartz 2019; Ramkumar et al. 2017). A minority of studies analyzed the anecdotal accounts in online surgeon reviews in order to identify modifiable risk factors for poor patient experience (Liu et al. 2019). Furthermore, the majority of these studies were descriptive, and so provide less value for practice modification (Hong et al. 2019a). To date, few data in the literature summarize, in a comprehensive and quantifiable manner, the qualities of an orthopaedic sports medicine surgeon and their practice that lead to a good patient experience.

The “Valence Aware Dictionary and sEntiment Reasoner” (VADER) is a widely used and accepted Python package for obtaining compound sentiment scores of written text (Hutto and Gilbert 2014). Written text is given a numerical score based on the positivity and negativity of its language, evaluated against an established lexicon of natural language. Using this tool, we can distill online written patient accounts down to a number, giving measurable value to existing online reviews of orthopaedic sports medicine surgeons. In this study, we applied VADER sentiment analysis to online reviews of American Orthopaedic Society for Sports Medicine (AOSSM)-member sports medicine surgeons in order to identify the patient experiential factors most strongly associated with higher surgeon ratings. We hope this information can aid surgeon practice modification to better the patient experience.

Methods

Study Design

The names of the physicians in the AOSSM were extracted manually from the AOSSM directory. These names were matched to profiles on Healthgrades© and Zocdoc©, and web-scraping code was used to obtain the written reviews and star ratings for each surgeon. Other websites, such as Google© reviews, Wellness©, Vitals©, and rateMD©, were excluded because they contained firewalls that prevented mass web-scraping of the publicly available data. Healthgrades© and Zocdoc© are two popular review websites that permitted this data extraction. In the remainder of this study, star-rating reviews refer to the ratings out of five stars given to surgeons on these websites. These data are publicly available, and star-rating reviews provide an overall average rating that acts as a standardized reference value. To validate our sentiment analysis model, we compared our automated quantitative sentiment analysis scores against each surgeon’s overall star rating by linear regression. Inclusion criteria were all surgeons listed in the directory of the AOSSM website who also had online profiles on healthgrades.com. Exclusion criteria were surgeons who had no online star ratings and those who had no written reviews.

The outcomes of the study are:

Primary outcome: Validation of our sentiment analysis model of anecdotal reviews by comparing against star ratings. This would permit use of the sentiment analysis model for analysis of online reviews in order to identify patient experiential factors that contribute to good reviews.
Secondary outcomes: Compare sentiment analysis scores against other surgeon qualities including gender, age, time in practice, practice type, and practice geography. Also, additional review words and phrases that may affect these scores were determined a priori by author opinion and correlated to sentiment analysis.

Surgeon Demographics

All demographic characteristics of the surgeons were also extracted from healthgrades.com and zocdoc.com. The physician’s age and gender identity were taken directly from what was reported online (Table 1). Physician state of practice, MD/DO status, and practice type (categorized as academic/hospital or health system employed, private practice, or military/VA) were pulled manually from online websites. Surgeons without a given characteristic listed were excluded from the corresponding analysis. The states of practice were then separated into six geographic categories (West, Midwest, Northeast, South, Other, and International). Practice location was also classified as metropolitan or non-metropolitan. The four main regions (West, Midwest, Northeast, and South) were defined based on categories previously established in the literature (Kirkpatrick et al. 2017); because certain US states and territories fall outside those categories, an “Other” category and an “International” category were added.

Table 1. Demographic Data on AOSSM Surgeons Analyzed*

Demographics	Counts
Gender
Male	1703 (92.8%)
Female	132 (7.2%)
Age
<50 y/o	773 (47.6%)
>= 50 y/o	850 (52.4%)
Geographic Location
West	351 Surgeons, 3459 Reviews
Midwest	425 Surgeons, 4154 Reviews
South	670 Surgeons, 7761 Reviews
Northeast	401 Surgeons, 4213 Reviews
Other**	9 Surgeons, 64 Reviews
International***	17 Surgeons, 139 Reviews
Degree
MD	1798 (95.3%)
DO/DPM	88 (4.7%)
Type of Practice
Private	1099 (58.6%)
Academic/Hospital Employed	745 (39.8%)
Military/VA	10 (0.5%)
Other	20 (1.1%)

*Some physicians did not have certain demographics listed and were therefore not included in the respective analyses
**Other states/US territories include: Hawaii, Alaska, Virgin Islands, & Puerto Rico
***International locations include: Canada, Argentina, Brazil, Colombia, Germany, Israel, Japan, Santiago, and Taiwan

Sentiment Analysis

The “Valence Aware Dictionary and sEntiment Reasoner” (VADER) package assigns numerical scores to the “sentiment” of each individual sentence and then outputs an overall average sentiment for a given paragraph. VADER was used to score each written review for every surgeon in the healthgrades.com and zocdoc.com data. It takes sentences as input and outputs a compound score based on the positivity or negativity of specific words, with rules that account for punctuation, capitalization, and modifiers.

VADER Score Calculation

VADER relies on a previously defined dictionary of specific, common words that carry inherently positive or negative qualities. The dictionary was developed with ten independent human raters, who assigned each word a score ranging from -4 to +4, where 0 represents neutral sentiment (Hutto and Gilbert 2014). The algorithm scans the input sentences for these words, sums their scores, and normalizes the sum to a value between -1 and +1, where -1 indicates the most negative sentiment and +1 the most positive.
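The sum-then-normalize step can be illustrated with a minimal sketch. The lexicon below is a toy with made-up valences, not the real VADER dictionary, and the smoothing constant of 15 is assumed from the open-source reference implementation rather than stated in this paper.

```python
import math

# Toy lexicon with made-up valences on the -4..+4 scale (hypothetical values,
# not entries from the real VADER dictionary).
TOY_LEXICON = {"caring": 2.0, "friendly": 2.0, "rude": -2.0, "pain": -2.0}


def normalize(raw_sum, alpha=15.0):
    """Squash an unbounded valence sum into the open interval (-1, +1).

    alpha=15 is assumed from the reference VADER implementation.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def compound_score(text):
    """Sum the valence of every known word, then normalize the sum."""
    words = text.lower().split()
    raw_sum = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    return normalize(raw_sum)


print(compound_score("caring and friendly"))  # positive value
print(compound_score("the staff was rude"))   # negative value
```

Because of the square-root denominator, the score approaches but never reaches -1 or +1, so very long reviews with many sentiment words saturate rather than grow without bound.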

The VADER calculation also accounts for modifiers. An empirically derived positive or negative adjustment is factored into the calculation when an emphasizing or negating adverb is used, respectively. This means that a phrase such as “very informative” is given a higher score than “informative” alone. Negation is also taken into account when it precedes a word in the VADER dictionary, by reversing the sign during calculation, so that positive sentiment words are calculated as negative and vice versa.
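A rough sketch of how such modifiers might be applied is shown here. The word lists, the valences, and the boost size are illustrative placeholders, not VADER's published constants, and only the word immediately before a lexicon word is checked.

```python
# Illustrative placeholders (assumptions), not VADER's actual lexicon or constants.
LEXICON = {"informative": 1.5, "rude": -2.0}
BOOSTERS = {"very", "extremely"}
NEGATIONS = {"not", "never"}
BOOST = 0.3  # hypothetical size of the emphasis adjustment


def score_tokens(tokens):
    """Sum word valences, adjusting each for an immediately preceding modifier."""
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token)
        if valence is None:
            continue
        previous = tokens[i - 1] if i > 0 else ""
        if previous in BOOSTERS:
            # Push the valence further from zero in its own direction.
            valence += BOOST if valence > 0 else -BOOST
        elif previous in NEGATIONS:
            # Reverse the sign, as described in the text above.
            valence = -valence
        total += valence
    return total


print(score_tokens("very informative".split()))
print(score_tokens("not informative".split()))
```

The real package looks further back than one word and handles more cases (capitalization, punctuation, contrastive conjunctions); this sketch only shows the direction of the two adjustments described above.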

Model Validation

Linear regression analysis was performed comparing the average sentiment analysis score for every doctor to their average star score, in order to assess the correlation between the scores calculated in this study and the online rating.
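As an illustration of this validation step, the sketch below fits an ordinary least-squares line and computes r² for per-surgeon averages. The five data points are hypothetical and the helper name `linear_fit` is ours; the paper does not state which software was used.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical per-surgeon averages: sentiment score (-1..1) and stars (1..5).
mean_sentiment = [0.10, 0.35, 0.50, 0.62, 0.80]
mean_stars = [3.1, 3.9, 4.2, 4.6, 4.8]

slope, intercept, r_squared = linear_fit(mean_sentiment, mean_stars)
print(f"stars = {slope:.2f} * sentiment + {intercept:.2f}, r^2 = {r_squared:.3f}")
```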

Word Frequency Analysis

Frequencies of the most used words recognized by NLTK are also reported. In order to focus this analysis on words that reflect the behaviors and practices of surgeons, some words that generally describe a patient’s experience but are not clinically relevant were removed. For example, “great” and “worst” may be used to describe a patient’s visit with a provider, but they do not indicate what factored into that experience.
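A word and bigram count of this kind can be sketched with the standard library alone. The study used NLTK; the tokenizer, the stop-word list, and the list of generic descriptors below are simplified stand-ins chosen for illustration.

```python
import re
from collections import Counter

# Simplified stand-ins (assumptions), not the lists used in the study.
STOP_WORDS = {"i", "am", "and", "the", "was", "a", "is", "very", "my"}
GENERIC_DESCRIPTORS = {"great", "worst", "best", "good", "bad"}


def tokenize(review):
    """Lowercase, keep alphabetic tokens, drop stop words and generic descriptors."""
    words = re.findall(r"[a-z']+", review.lower())
    return [w for w in words if w not in STOP_WORDS and w not in GENERIC_DESCRIPTORS]


def word_and_bigram_counts(reviews):
    """Count single words and adjacent word pairs across all reviews."""
    words, bigrams = Counter(), Counter()
    for review in reviews:
        tokens = tokenize(review)
        words.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


sample_reviews = [
    "Great doctor, very kind and caring.",
    "I am pain free and the staff was friendly.",
    "Kind and caring, and I am pain free.",
]
word_counts, bigram_counts = word_and_bigram_counts(sample_reviews)
print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
```

Forming bigrams after stop-word removal means a pair such as “cares patients” can arise from “cares for his patients”, which appears consistent with the bigrams the study reports.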

Multiple Logistic Regression

Finally, a multiple logistic regression was performed on clinically relevant keywords to determine how their presence in a review changed the odds of a positive sentiment score (Table 4). Specifically, the model estimated whether each clinically relevant word or phrase, when included in a review, increased or decreased the odds of receiving a sentiment analysis score >0.5, which indicates a largely positive review.
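To make the setup concrete, the sketch below builds one binary indicator per keyword, dichotomizes the score at 0.5, and fits a logistic model by plain gradient descent. The reviews, scores, and the fitting routine are all illustrative assumptions; the study does not describe its software, and a statistics package would normally be used to obtain confidence intervals and p-values as well.

```python
import math

# Hypothetical (review text, compound score) pairs; real scores would come from VADER.
REVIEWS = (
    [("quick visit", s) for s in (0.8, 0.7, 0.2, -0.1)]
    + [("pain free now", s) for s in (0.9, 0.8, 0.6, 0.1)]
    + [("long wait", s) for s in (0.7, 0.3, -0.4, -0.6)]
    + [("long wait but pain free", s) for s in (0.6, 0.4)]
)
KEYWORDS = ["pain free", "long wait"]


def design_matrix(reviews, keywords, threshold=0.5):
    """One 0/1 column per keyword; outcome is 1 when the score exceeds the threshold."""
    X = [[1.0 if k in text else 0.0 for k in keywords] for text, _ in reviews]
    y = [1.0 if score > threshold else 0.0 for _, score in reviews]
    return X, y


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit intercept and coefficients by batch gradient descent on the log-loss."""
    n, m = len(X), len(X[0])
    w = [0.0] * (m + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (m + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


X, y = design_matrix(REVIEWS, KEYWORDS)
weights = fit_logistic(X, y)
# Exponentiating a coefficient gives the odds ratio for that keyword.
odds_ratios = {k: math.exp(b) for k, b in zip(KEYWORDS, weights[1:])}
print(odds_ratios)
```

The toy data were constructed so that “pain free” roughly triples the odds of a positive review and “long wait” cuts them to about a third, each adjusted for the other keyword.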

Statistical Analysis

Student t-tests were performed to determine the relationship between demographic variables (age, gender, degree type) and average sentiment score of written reviews. A one-way ANOVA was used to test for potential differences among average scores with respect to geographic regions as well as practice type.
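The two test statistics can be sketched in a few lines. This is an assumption-laden illustration: the group values are hypothetical, equal variances are assumed (the classic Student form), and only the statistics are computed, since p-values require the t and F distributions from a statistics package.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Student t statistic for two independent samples with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def anova_f(*groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical average sentiment scores per surgeon, grouped by age.
younger = [0.60, 0.50, 0.70, 0.40]
older = [0.40, 0.50, 0.30, 0.60, 0.20]

print(pooled_t(younger, older))
print(anova_f(younger, older))
```

With exactly two groups the ANOVA F statistic equals the square of the pooled t statistic, which is a convenient sanity check; the ANOVA form is what generalizes to the six regions or the practice types.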

Results

Of 2623 AOSSM surgeons identified, 2080 remained after application of the inclusion and exclusion criteria; these surgeons accounted for 19,664 online reviews.

Model Validation: Linear Regression

The linear regression analysis of average sentiment analysis scores to average star scores showed a positive correlation between the scores (Figure 1, r² = 0.586, p < 0.01), indicating good concordance between sentiment scores and reported overall star reviews.

Figure 1. Linear regression analysis of average online reported star score compared to calculated sentiment analysis score

Demographic Analysis

A Student t-test was run to check for a significant difference between the mean sentiment analysis scores given to male and female surgeons. The test showed no significant difference by gender (mean sentiments: male = +0.496, female = +0.546; p = 0.132). A Student t-test on average star scores by gender was also not significant (average stars: male = 4.29, female = 4.34; p = 0.391). These results are summarized in Table 2.

Table 2. Student t-test comparing Star and Written Reviews to Gender and Age

	Male Average	Female Average	P Val
Written Reviews*	+0.496	+0.546	0.132
Star Reviews**	4.29	4.34	0.391
	>50 y/o Average	<50 y/o Average	P Val
Written Reviews*	+0.460	+0.536	<0.01
Star Reviews**	4.14	4.42	<0.01

*A more positive sentiment analysis number means a more positive review (scale of -1 to 1)
**On a scale of 1-5 stars, a greater number means a better review

Further t-tests were conducted to check for a significant difference between the mean sentiment analysis scores given to surgeons older and younger than 50. The age cutoff of 50 years was selected to make the cohorts below and above the cutoff as equal in size as possible. Older surgeons had significantly lower sentiment analysis scores (mean sentiments: < 50y = +0.536, > 50y = +0.460; p < 0.01). The same comparison on star ratings was also significant (average stars: < 50y = 4.42, > 50y = 4.14; p < 0.01). Written sentiment about surgeons older than 50 is therefore significantly less positive. These results are summarized in Table 2.

The average sentiment analysis scores by geographic region were West: 0.58, Midwest: 0.59, South: 0.59, Northeast: 0.57, Other: 0.54, and International: 0.54 (p = 0.81), indicating no significant difference among surgeons practicing in these regions. The average score of doctors practicing in metropolitan areas was 0.57 and that of doctors in non-metropolitan areas was 0.59 (p = 0.08), also not a statistically significant difference. The Student t-test for degree type likewise indicated no significant difference between how MDs and DOs are written about (MD mean = 0.59, DO mean = 0.59; p = 0.97). Finally, the ANOVA for practice type was also not significant (Private: 0.58, Academic/Hospital: 0.60, Military: 0.65; p = 0.16).

Table 3. Clinically-Relevant Single Word Frequency Analysis of Best and Worst Reviews

Best Reviews		Worst Reviews
Word	Frequency	Word	Frequency
Care	2395	Pain	1575
Caring	1381	Injury	332
Friendly	1191	Rude	256
Pain	1182	Care	249
Kind	970	Unprofessional	99
Bigram	Frequency	Bigram	Frequency
Pain free	380	No pain	238
Kind caring	230	Severe pain	92
Feel comfortable	228	Knee pain	91
Cares patients	222	Shoulder pain	90
No pain	208	Pain free	82

Word Frequency Analysis

Frequencies of the most used clinically relevant words and bigrams recognized by NLTK in the best and worst reviews are reported in Table 3. As described in the Methods, generic descriptors such as “great” and “worst” were removed because they do not indicate what factored into the patient’s experience.

Multiple Logistic Regression

Finally, a multiple logistic regression was performed on clinically relevant keywords to determine how each changed the odds of a positive sentiment score (Table 4).

Table 4. Multiple logistic regression analysis on clinically relevant keywords

	2.5% CI	97.5% CI	OR	P val
Long Wait	0.172	0.718	0.351	<0.01
Pain	0.270	0.333	0.300	<0.01
No pain	0.667	1.321	0.939	0.718
Severe pain	0.174	0.743	0.359	<0.01
Pain free	2.537	5.202	3.633	<0.01
Friendly staff	1.267	23.422	5.448	0.02
Warm	1.866	7.562	3.756	<0.01
Confident	3.462	8.030	5.273	<0.01
Listens	1.533	2.429	1.930	<0.01
Bedside manner	1.300	2.239	1.706	<0.01
Knowledgeable	1.221	1.743	1.459	<0.01
Recover	0.947	4.326	2.02	0.06
Accessible	0.500	4.547	1.509	0.46
Front desk	0.757	2.199	1.290	0.34
Parking	0.323	7.738	1.582	0.57
Nurse	0.510	0.922	0.686	0.01
Receptionist	0.253	0.693	0.419	<0.01
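When reading Table 4, it can help to recall that intervals for odds ratios are usually symmetric on the log scale, so the reported odds ratio should sit at the geometric midpoint of its bounds. The sketch below checks this for three rows copied from the table and backs out the implied standard error of the log odds ratio; it assumes Wald-type 95% intervals (z = 1.96), which the paper does not state explicitly.

```python
import math

# (keyword, lower bound, upper bound, reported OR) copied from Table 4.
ROWS = [
    ("Long Wait", 0.172, 0.718, 0.351),
    ("Pain free", 2.537, 5.202, 3.633),
    ("Friendly staff", 1.267, 23.422, 5.448),
]


def implied_estimate(lower, upper, z=1.96):
    """Return the OR implied by the interval midpoint on the log scale, and its SE.

    Assumes a Wald-type interval: exp(log_or - z * se) to exp(log_or + z * se).
    """
    log_or = (math.log(lower) + math.log(upper)) / 2
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return math.exp(log_or), se


for name, lower, upper, reported in ROWS:
    implied, se = implied_estimate(lower, upper)
    print(f"{name}: reported OR {reported}, implied OR {implied:.3f}, SE(log OR) {se:.3f}")
```

A large implied standard error, as for “Friendly staff”, signals that few reviews contained the phrase, so the point estimate should be read with caution.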

Discussion

In this study we used a computer algorithm to analyze, in a comprehensive and consistent manner, online review sentiments that correlate with good patient experience for AOSSM surgeons. As patients increasingly rely on online sentiments in choosing a practitioner for their treatment, this democratization of the patient-surgeon relationship should be well-understood by surgeons and be factored strategically into practice behavior.

According to Nielsen, a global performance management company, recommendations from friends and family remain the most credible form of advertising, followed closely by online reviews (“Audience Is Everything®,” n.d.). There is evidence that word-of-mouth reputation drives a significant portion of physician referrals in the United States (Tu and Lauer 2008). Specifically for orthopedic surgeons, prior literature has suggested that female sex, fewer years in practice or mid-career practice, and a greater number of reviews are correlated with higher website physician ratings (Frost and Mesfin 2015; Heimdal et al. 2021). Though our data did not demonstrate sex differences in online ratings, we did find that younger surgeon age (less than 50) is correlated with higher sentiment scores. This may reflect cultural progression in the field of orthopaedic surgery as well as advancements in modern surgical training, which may make surgeons more likeable to patients, though there is little data to support this in the literature (Blasier 2009). One study demonstrated that academic practice is associated with better surgeon reviews (Frost and Mesfin 2015), a finding we were not able to replicate using our algorithm. Neither geographic region nor metropolitan versus non-metropolitan practice location correlated significantly with sentiment scores. We also found no difference between DO and MD surgeon sentiment scores, though the small DO pool is a limiting factor.

Based on our word frequency analysis (Table 3), the current study agrees with previous work in which descriptors such as warm, confident, knowledgeable, and having a proper bedside manner were associated with significantly higher overall review scores (Kalagara et al. 2019). This study and the previous literature should therefore encourage surgeons to continue to practice these interpersonal characteristics and to place emphasis on them. Additionally, multiple logistic regression demonstrated that many positive behavioral characteristics, such as “warm,” “confident,” “listens,” and “bedside manner,” significantly improved the odds of receiving a positive review. Patients are clearly still cognizant and appreciative of these behavioral characteristics when their surgeons exemplify them. It is prudent for sports medicine doctors to be aware of these influences and to continue to incorporate them into their practice.

Our multiple logistic regression analysis of common review phrases revealed a number of points for discussion (Table 4). Studies have indicated that written reviews attributed to specific surgeons often reflect less the surgeons themselves or their skills than factors such as the office environment or ancillary staff (Donnally, McCormick, et al. 2018). Hong et al. reported that office efficiency, long wait times, staff interactions, and other non-physician-related factors contribute significantly to online reviews (Hong et al. 2019b). The present study reflects this as well: when the phrase “long wait” appeared in a review, the odds of a largely positive review were roughly one third of those for reviews without it (OR 0.351). Comments that mentioned a “nurse” or “receptionist” also had lower odds of being positive. This does not mean that nurses and receptionists will always decrease the odds of a positive review; rather, it indicates that the staff of a facility, and not only the physician, influence the reviews attributed to the physician. To that end, reviews that included “friendly staff” had roughly five times the odds of being positive (OR 5.448), although the wide confidence interval (1.267 to 23.422) calls for caution. Both findings accord with Hong et al., in which positive interactions between patients and office personnel were often noted in reviews, as were negative non-physician factors such as wait times (Hong et al. 2019b). Surgeons should therefore be cognizant that, despite their best efforts toward their own behavior, patients often assess their visits holistically. Reviews may not focus solely on the surgeon but may reflect the office as a whole, which underscores the need to share these findings with everyone involved in a patient visit.

The bigram and word frequency analysis indicates that one of the largest factors driving both positive and negative reviews of sports medicine surgeons is pain and pain management. Feucht et al., in their study of patient expectations prior to primary and revision anterior cruciate ligament surgery, found that patients generally have very high postoperative expectations in terms of pain and return to normal function (Feucht et al. 2016). Additionally, it has been noted throughout orthopedics that unrealistic preoperative expectations that go unfulfilled after surgery result in significant dissatisfaction (Culliton et al. 2012; Iversen et al. 1998; Noble et al. 2006). This study highlights the importance of establishing proper pain expectations prior to surgery and adds to the growing body of literature on pain expectations in orthopedics. Both the word frequency analyses and the multiple logistic regression support this: the top-rated surgeons, although described in terms of behavioral factors, are also lauded for pain resolution, the most used bigram being “pain free.” The reviews also suggest that patients are most dissatisfied with lasting postoperative pain: reviews that used the word “pain” had 0.30 times the odds of being positive (Table 4). It is prudent, therefore, for surgeons to practice proper pain expectation management during preoperative visits. By correcting misconceptions and addressing potentially unrealistic expectations of pain resolution before surgery, physicians may be able to quell patient anxieties preemptively and prevent future dissatisfaction. Pain resolution is highly unpredictable, and making this clear to patients early may reduce the negative feedback communicated online.

This study is not without its limitations. Some physicians, knowing the importance of social media, may have artificially altered their online profiles, so the reviews displayed may not be representative of all of their patients. However, these are the reviews that future patients see and use to make decisions, so the results of this study remain relevant to surgeons seeking to improve their own profiles. Additionally, many surgeons may encourage their “successful” cases to leave a review and neglect to ask others, contributing further to this bias. Finally, there is a possibility that the reviews being written do not reflect the actions of the doctor. Because pain is inherently difficult to treat and predict, a surgeon may have followed proper protocol in treating pain, yet the patient may not have seen complete resolution and may have written a review because of that experience. At present, there is no way to distinguish patients who had lasting pain due to lack of care from those who have pain despite proper management.

Conclusion

In this study we used computerized sentiment analysis to calculate numerical scores for online written reviews of AOSSM surgeons in order to identify qualities that improve online surgeon reviews. As expected, compassionate interpersonal skills and efficacy in controlling pain were associated with the best online review scores. These potentially modifiable factors should be considered in surgeon practice as part of efforts to improve patient satisfaction.
