Data Science in the Real World
A Data Science Approach to Choosing a Physician
My doctor was great. Was, because I moved. Or he moved. Or my work changed health insurance and he isn’t covered anymore. Or he retired, opened a food truck, and adopted six poodles. Anyway, it doesn’t matter why; I need to find a new doctor.
It turns out my health insurance, Oscar, has a pretty helpful search tool: lots of filters, in-network doctors, and even reviews. Perfect! Well, not totally perfect. Tons of doctors don’t have reviews, and those that do don’t have much availability, and I’d definitely be more comfortable with a physician who has plenty of time to get to know me. One study found that most people need only a couple of reviews before they trust a business, that the overall star rating factors into a consumer’s decision more than anything else, and that a large majority of people trust online reviews just as much as a personal recommendation.
So, if I want to find a doctor who is both great and available, maybe I can leverage Oscar’s database to predict who would be reviewed highly, even though they haven’t had any reviews yet.
With a quick web-scraping script, we’re able to collect a very detailed view of every doctor in their network — location, specialty, age, and…patient reviews.
That’s a labelled dataset.
Now it’s important to be very clear about what we are predicting. Patient reviews — especially through a healthcare provider — are a measure of how satisfied the patient is with their treatment experience. They are NOT only a measure of the quality of medical care the doctor delivered. Check-in, hospitality, cost, and perceived thoroughness all factor into how satisfied a patient may be with a visit, just as your favorite restaurant is likely judged on its atmosphere, presentation, courtesy, and value. That’s not necessarily misleading: most patients are not fully qualified to critique their doctor’s medical ability, but they can perceive all the other factors that make for a good visit. When a new patient is looking for a physician, this aggregated patient experience is definitely useful.
Another bias to recognize is who is actually leaving reviews. Thankfully, all reviewers are verified patients under the same health provider (though not necessarily the same healthcare plan). Still, who takes the time to give feedback when they’re not directly incentivized? Unfortunately, our scraped dataset does not include individual reviews we can analyze; however, it’s not a far reach to continue our restaurant analogy and consider the type of consumer who generally leaves reviews elsewhere. Essentially, strong experiences (positive or negative) are more likely to push patients to leave reviews — we’ll see how that comes into play very soon.
Let’s jump into the data. Before we even begin modeling, we want to get as much information out of our data as possible. We’ll cover the highlights here; the full code is available on GitHub.
The API call has given us a massive nested dictionary. After importing and flattening it, we’re looking at 82 feature columns and 3133 unique physicians in the New York City area.
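As a rough illustration of that flattening step (the file name below is a placeholder, and the exact response structure isn’t shown here), pandas can turn a nested JSON payload into dot-separated columns:

import json
import pandas as pd

# Load the raw (nested) API response saved by the scraping script;
# the file name here is an assumption for illustration.
with open('oscar_providers.json') as f:
    raw = json.load(f)

# json_normalize flattens nested dictionaries into dot-separated columns,
# e.g. {'provider': {'num_reviews': 4}} becomes 'provider.num_reviews'.
df = pd.json_normalize(raw)
print(df.shape)  # roughly (3133, 82) in our case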
Our first meaningful step here is to inspect how many doctors actually have reviews — anything else we’ll have to exclude from our supervised learning experiment. The following couple of lines will quickly drop any rows which don’t have reviews and print out what’s remaining.
df.dropna(subset=['provider.num_reviews'], inplace=True)
print(df.shape)

(1206, 82)
And now we have 1206 data points. Roughly 2/3 of doctors did not have any reviews at all through Oscar. Let’s make a histogram of our patient reviews now and see what the distribution looks like:
import plotly_express as px

px.histogram(df,
             x='provider.percent_recommend',
             title='Physician review distribution')
Definitely a heavily skewed dataset — unsurprising, as we’ve touched on how strong experiences are more likely to provoke reviews. It’s worth noting the lack of strong negative reviews, though; there are too many possible explanations and no way for me to verify or correct for them, so we’ll move on. The important thing is that we must account for this bias in our training; more on that later.
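The modeling code later passes a sample_weight array to the estimators to handle this. One plausible way to build such weights (a sketch — the repo may do this differently) is to weight each sample by the inverse frequency of its rating bin, so the over-represented high ratings don’t dominate training:

import pandas as pd

# Bin the ratings and weight each row by 1 / (size of its bin).
bins = pd.cut(df['provider.percent_recommend'], bins=10)
counts = df.groupby(bins)['provider.percent_recommend'].transform('count')
sample_weight = (1.0 / counts).to_numpy()

# Normalize so the weights average to 1 (purely cosmetic).
sample_weight = sample_weight / sample_weight.mean()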
Let’s now examine one feature that especially stands out in the data: specialties. From the bar graph below, it’s very clear that this feature is heavily skewed as well, almost entirely towards primary care physicians (PCPs). That’s understandable: PCPs generally have much more patient contact, and most patients don’t (or even can’t) pick their own specialist, which results in fewer patient reviews for specialists.
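The bar graph’s code isn’t reproduced here; a minimal sketch of how it could be generated (assuming the primary specialty has already been pulled into the 'primary specialty' column used in the scatter plot below) would be:

# Count reviewed physicians per primary specialty and plot with plotly express.
specialty_counts = (df.groupby('primary specialty')
                      .size()
                      .reset_index(name='count'))
px.bar(specialty_counts, x='primary specialty', y='count',
       title='Reviewed physicians by primary specialty')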
Let’s see specifically how a doctor’s primary specialty affects their rating. With the following code we can generate a scatter plot and see the statistical differences.
px.scatter(df,
x="provider.percent_recommend",
y="provider.relative_cost_to_member",
color="primary specialty",
marginal_x="box")
It turns out there isn’t much difference between specialties once outliers are excluded (for the interactive graph, please see the GitHub repo).
Feature Engineering
With relatively few rows with which to teach our algorithm, we need to extract as much useful information from each feature column as possible — information which can’t be learned implicitly without more data.
First, let’s look at all of the features available to us:
df.columns

['availabilities.ANY', 'availabilities.NEXT_TWO_DAYS',
'availabilities.NEXT_TWO_WEEKS', 'availabilities.THIS_MONTH',
'availabilities.THIS_WEEK', 'availabilityConfidenceLevels.ANY',
'availabilityConfidenceLevels.NEXT_TWO_DAYS',
'availabilityConfidenceLevels.NEXT_TWO_MONTHS',
'availabilityConfidenceLevels.NEXT_TWO_WEEKS',
'availabilityConfidenceLevels.THIS_MONTH',
'availabilityConfidenceLevels.THIS_WEEK',
'availabilityConfidenceLevels.TODAY_OR_TOMORROW',
'provider.INTERNAL_cost_to_member_scores',
'provider.INTERNAL_member_experience_scores',
'provider.INTERNAL_preventive_care_quality_scores',
'provider.INTERNAL_score', 'provider.absolute_cost_to_member',
'provider.affiliations', 'provider.after_hours_indicator',
'provider.biography', 'provider.board_certified',
'provider.can_show_reviews', 'provider.certifications',
'provider.cleveland_clinic_pcp_capacity',
'provider.cost_to_member_bucket', 'provider.dea_info',
'provider.deduped_specialties', 'provider.educations',
'provider.exclusively_pcp', 'provider.first_name', 'provider.gender',
'provider.isoDate', 'provider.languages', 'provider.last_edit_source',
'provider.last_edit_time', 'provider.last_name', 'provider.licenses',
'provider.medconds_treated', 'provider.medprocs_treated',
'provider.member_feedback_facet_infos', 'provider.middle_name',
'provider.npi', 'provider.num_reviews', 'provider.offices',
'provider.partner_program', 'provider.patient_demographics',
'provider.pcp', 'provider.percent_0_12', 'provider.percent_13_17',
'provider.percent_18_30', 'provider.percent_31_50',
'provider.percent_51_up', 'provider.percent_female',
'provider.percent_male', 'provider.percent_recommend',
'provider.photo_avatar_id', 'provider.photo_full_body_id',
'provider.preferred_partner', 'provider.preventive_care_quality',
'provider.primary_city', 'provider.primary_state',
'provider.provider_attributes', 'provider.provider_hash',
'provider.provider_id', 'provider.relative_cost_to_member',
'provider.residency_status_primary_specialty',
'provider.residents_attending_physicians_license_number',
'provider.specialist', 'provider.specialties',
'provider.supervising_physician_specialty', 'provider.tin_infos',
'provider.top_provider', 'provider.years_experience']
It’s definitely a lot all at once, but let’s start by seeing what we can extract from some text features. For example:
‘provider.biography’ is a huge resource: with a deliberate effort of stemming, lemmatization, and some combination of word2vec and bag of words, we could end up with some very useful features.
‘provider.languages’ is ideal for a bag-of-words approach; after all, many of these doctors have high patient contact, and being able to communicate with your patients is crucial to understanding their needs (a minimal sketch follows this list).
‘provider.certifications’ lists the physician’s certifications, but interestingly, there seem to be a fair number of typos when digging into the data. If the field is self-reported, a typo might itself be useful information; if it happens frequently, maybe there’s a pattern of carelessness. But if an assistant or Oscar agent is entering this information, it may be irrelevant. The important questions to ask are:
- Is it relevant that a typo was made?
- Is it important that two people made the same typo?
- Is it worth correcting the typos?
The answer to all of these is — we don’t know how relevant the features are until we make them! In our case, we have hundreds of specialties and hundreds of typos, so…we’ll come back and extract these features if we absolutely have to.
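As a minimal, hedged sketch of the languages and biography ideas above (it assumes ‘provider.languages’ holds lists of language names and ‘provider.biography’ holds free text, which may not match the repo exactly):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Bag of words over languages: one binary column per language spoken.
mlb = MultiLabelBinarizer()
lang_lists = df['provider.languages'].apply(lambda x: x if isinstance(x, list) else [])
lang_features = pd.DataFrame(mlb.fit_transform(lang_lists),
                             columns=[f'lang_{l}' for l in mlb.classes_],
                             index=df.index)

# A simple TF-IDF over the biography text; word2vec or heavier NLP
# (stemming, lemmatization) could replace or supplement this.
tfidf = TfidfVectorizer(max_features=200, stop_words='english')
bio_matrix = tfidf.fit_transform(df['provider.biography'].fillna(''))
bio_features = pd.DataFrame(bio_matrix.toarray(),
                            columns=[f'bio_{w}' for w in tfidf.get_feature_names_out()],
                            index=df.index)

df = pd.concat([df, lang_features, bio_features], axis=1)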
Please visit the GitHub link for more detail on implementing text feature engineering.
Next, since we know NYC is a diverse place with quickly-changing neighborhoods, let’s see how we can better incorporate location information into our data. We have access to each physician’s primary workplace zip code (awesome), which is fairly useful categorical information in itself, especially in conjunction with features like languages, age, and specialty. Let’s take it two steps further: NYU has actually published some in-depth demographic information online, including details about the population, housing, and services by district. With a little (actually, a lot) of manual data entry, we can do a fairly straightforward merge to tie all of that to our single zip-code feature.
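A rough sketch of that merge, assuming the manually entered demographics were saved as a CSV with a 'zip' column plus district-level statistics (the file name and column names are placeholders):

import pandas as pd

# District demographics compiled by hand from the NYU data, keyed by zip code.
demographics = pd.read_csv('nyc_district_demographics.csv', dtype={'zip': str})

# Make sure both zip columns share a dtype, then left-join so every physician
# keeps their row even if their zip code is missing from the demographics table.
df['provider.zip'] = df['provider.zip'].astype(str)
df = df.merge(demographics, how='left', left_on='provider.zip', right_on='zip')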
Please visit the GitHub link for more detail on implementing location feature engineering.
Using mapbox we can clearly see the relationship between location and our target.
loc_df = data.groupby(['provider.zip']).mean()
loc_df['n'] = data.groupby(['provider.zip'])["n_member_feedback_facet_infos"].sum()

px.set_mapbox_access_token(open('.mapbox_token').read())
px.scatter_mapbox(loc_df, lat="LAT", lon="LNG", color='rating', size="n",
                  color_continuous_scale=px.colors.sequential.Viridis)
Although our dataset is biased towards high ratings, we can definitely see there are portions of Brooklyn and Queens which, on average, have lower scores. The plot also emphasizes the high concentration of doctors in lower Manhattan.
I want to draw attention to a couple more features in particular: the ‘provider.INTERNAL’ features and ‘provider.member_feedback_facet_infos’. The internal score features are calculated by Oscar, and the latter is actually another form of review — reviewers can leave specific feedback badges such as “Easy to book” and “Nice office environment”. My primary concern is that these may not be independent from the patient review score, even though we’ve already removed any rows without reviews. Luckily, the covariance plot below shows that the only features which are strongly correlated are the review badges and the number of reviews.
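The plot’s code isn’t shown above; one way such a correlation check could be produced is sketched below. It assumes the internal scores and badge counts have already been expanded into numeric columns, and the column selection here is a guess based on the naming in the text:

import plotly.express as px  # px.imshow lives in plotly.express (plotly >= 4)

# Gather the score/review-related columns plus our target ('rating' after feature
# engineering), keep only numeric ones, and compute pairwise correlations.
score_cols = [c for c in data.columns
              if 'score' in c or 'feedback' in c or 'num_reviews' in c]
corr = data[score_cols + ['rating']].select_dtypes('number').corr()
px.imshow(corr, title='Correlation between internal scores, review badges, and rating')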
However, that doesn’t mean we can use all of this data. For one, since we’re trying to predict reviews where none exist yet, ‘provider.num_reviews’ obviously has to go. Also, the positive correlation between ‘member_scoreALL’ and our target indicates a possible data leak; as we can’t be sure whether it’s calculated from our target directly, it’s best to exclude it from our dataset.
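Concretely, that amounts to dropping the leaky columns before modeling (the ‘member_scoreALL’ name follows the text above and may differ slightly in the repo):

# Drop the review count and the internal score suspected of leaking the target.
data = data.drop(columns=['provider.num_reviews', 'member_scoreALL'], errors='ignore')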
Finally, we’re ready to begin fitting models. We still have to be cognizant of our small number of samples. Our approach is to fit a number of different regression models, each of which can find different patterns in the data, and then ensemble their outputs in a higher-level model that leverages all of the distinct predictions. The bulk of this code is shown below: we import our model architectures from the sklearn library, fit them on slices of our data, predict on the remaining data, repeat until we’ve generated predictions for the entire dataset, and then fit our top-level model on those predictions.
from math import sqrt

import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor)

# k-fold predictions for each data sample for each model type so we can ensemble
kf = KFold(n_splits=10)

# selection of regression models from sklearn with updated default values
models = [KNeighborsRegressor(n_neighbors=100),
          BayesianRidge(n_iter=300, tol=0.001),
          DecisionTreeRegressor(),
          RandomForestRegressor(n_estimators=100),
          GradientBoostingRegressor(),
          AdaBoostRegressor()]

# split the features (columns up to 'LNG') from the rating target
X, y = data.loc[:, :'LNG'], data.loc[:, 'rating']

# initialize scaler
scaler = StandardScaler()

# create k-fold out-of-fold predictions for each model
# (sample_weight is defined in the full code on GitHub; it's how we account for
#  the rating skew discussed earlier)
for train_index, test_index in kf.split(X):
    # split data according to the fold
    x_train, y_train, x_test, y_test = X.loc[train_index], y[train_index], X.loc[test_index], y[test_index]
    # learn scaling from training data
    scaler.fit(x_train)
    # scale training and test data
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    # loop over every model on every fold and generate predictions based on the available training data
    for model in models:
        print(type(model).__name__)
        # KNeighborsRegressor does not accept sample_weight
        if type(model).__name__ != 'KNeighborsRegressor':
            model.fit(x_train, y_train, sample_weight=sample_weight[train_index])
        else:
            model.fit(x_train, y_train)
        # generate test predictions
        y_preds = model.predict(x_test)
        # save predictions to the corresponding rows
        data.loc[test_index, type(model).__name__] = y_preds
        # generate overall score for this model/fold (for observation)
        rmse = sqrt(mean_squared_error(y_test, y_preds))

# move rating column to the back
data['rating'] = data.pop('rating')

# create meta model: use only the base-model prediction columns as features
meta_X, meta_y = data.iloc[:, -(len(models) + 1):-1], data.iloc[:, -1]

# split data into train and test
meta_x_train, meta_x_test, meta_y_train, meta_y_test = train_test_split(meta_X, meta_y, test_size=0.2)

# initialize and fit the AdaBoost meta model
meta_model = AdaBoostRegressor()
meta_model.fit(meta_x_train, meta_y_train)

# generate predictions
meta_preds = meta_model.predict(meta_x_test)

# calculate overall ensemble rmse score
meta_score = sqrt(mean_squared_error(meta_y_test, meta_preds))

# plot the distribution of the ensemble's predictions
px.histogram(pd.DataFrame({'pred': meta_preds}), x='pred')
And our new predicted values look very similar to our training set! With an overall RMSE of 0.181, translated to a 5-star rating system, we would, on average, be accurate to a doctor’s rating within about one star. We can also see how each model contributed to the ensemble with the following line of code:
print(meta_model.feature_importances_)

[0.18907524 0.03818779 0.17733793 0.20602109 0.26546008 0.12391788]
With each value corresponding to its respective model, we can see that our meta-level model gained very little information from the DecisionTreeRegressor prediction features, while the remaining models all contributed significantly, especially AdaBoost.
And that’s it! I can now predict ratings for every unreviewed doctor and find the best fit for me. That said, this model still has significant room to grow with further investment. First, there are a number of features we did not address:
- From a doctor’s name or picture we could extract information such as age and race, which is especially helpful for connecting with patients from a similar background.
- Medical school, residency program, and all other education were also excluded — obviously these are significant factors hospitals use when recruiting. Combining a doctor’s education with a list of internationally ranked schools would absolutely be useful, as would simply label-encoding the schools and measuring how school and patient reviews are correlated in practice — though that would need a much larger sample size to execute.
However, the biggest improvement left to be made is to transform our unpersonalized recommendations into a proper recommender system. What’s preventing that at the moment is that the information publicly available from Oscar does not include patient-specific reviews, only an aggregate across all patients. With patient information, their history, and their individual reviews, we would be able to build a model which predicts whether a doctor is a good fit for that specific patient, without the need for filters.
There’s definitely more potential here than just my own gain, too. In the specific case of finding a physician, helping new doctors and private clinics attract patients quickly distributes demand away from already-established doctors, ideally bringing down healthcare costs. In a larger scope, offering a positive or negative recommendation for new businesses and products lets consumers make quick, educated decisions about otherwise uncurated options. We see this kind of recommendation in services such as Netflix, where it’s unclear how they’ve decided we might like a specific film, but we can see the percent match regardless. Amazon, on the other hand, does not offer recommendations for unreviewed products, leaving many consumers to do a significant amount of research or ignore the product entirely (which in turn leads businesses to desperately chase reviews, sometimes by bribing customers or discounting their products).