Evidence-based medicine and patient-centred care — Dealing with Personally Identifying Information

Charles Copley
The Patient Experience Studio at Cedar
5 min read · Nov 9, 2020
Photo by National Cancer Institute on Unsplash

Evidence-based medicine is the “conscientious, explicit, judicious and reasonable use of modern, best evidence in making decisions about the care of individual patients” (Masic et al. 2008) and has completely transformed the way that medicine is practiced. Patient-centered care is another philosophy of medical care, one that focuses on the individual’s specific health needs and health outcomes and, in doing so, elevates the patient’s role in medical treatment. Unfortunately, these two paradigms have often not been well integrated in practice. Bridging the gap between the two is one of the challenges in contemporary medical care.

One challenge in achieving this goal is making greater use of data that contain Protected Health Information (PHI). More judicious and extensive use of such data can advance evidence-based, patient-centered care. PHI is found in almost all forms of medical data and therefore presents a significant challenge to ensuring patient confidentiality. Dealing with the issue in structured data is relatively easy: identifying fields are flagged, and the data they contain are removed. Free-form text, such as clinical notes or patient feedback about their experience, presents a more difficult problem. On the one hand, these data are an extremely rich source of information, particularly around the nuances in medical care and patient experience that are not easily captured in a standard form. On the other hand, the data are not easily de-identified, and so can be risky to analyze. The unfortunate result is an increase in the cost of services, a decrease in the overall potential level of service and a consequent challenge in synthesizing these otherwise extremely valuable data. This blog post describes the state of automated processes for dealing with this issue, and gives results from an implementation of an automated PHI flagger used on Cedar patient billing data.
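
For structured data, the field-level approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than Cedar's implementation: the record layout, the `PHI_FIELDS` set and the `redact_record` helper are all hypothetical, and a real system would derive its list of identifying fields from its own schema and the applicable regulations.

```python
# Hypothetical set of field names treated as identifying; not taken from any real schema.
PHI_FIELDS = {"name", "email", "phone", "date_of_birth", "address"}


def redact_record(record, phi_fields=PHI_FIELDS, placeholder="[REDACTED]"):
    """Return a copy of `record` with the values of identifying fields replaced."""
    return {
        key: (placeholder if key in phi_fields else value)
        for key, value in record.items()
    }


# Made-up example record, for illustration only.
record = {
    "name": "John Smith",
    "email": "john@example.com",
    "balance_due": 120.50,
    "note": "Called John on his cell to discuss the bill.",
}
redacted = redact_record(record)
```

Note that the free-text `note` field still contains a name after redaction, which is exactly the gap that free-form text creates.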

Photo by Markus Spiske on Unsplash

We spent quite a while reading through solutions that have been used elsewhere. In doing so, we found that the promise of applying machine learning to the problem, as well as the potential upside of solving it, has led to a number of efforts. One that shows particular promise is Philter (Norgeot et al. 2020). Philter is a Python package released under an open-source license. The code repository is available here.

Evaluation of Philter against the i2b2 (Informatics for Integrating Biology and the Bedside) competitions of 2006 and 2014, from Norgeot et al.

Norgeot et al. compared Philter against similar packages. One example is Physionet (open access, from the MIMIC II group, written in Perl); another is Scrubber (provided by the U.S. National Library of Medicine; not open source as far as I can tell).

Comparison of Philter, Physionet and Scrubber from Norgeot et al.

Despite the above work, many of these packages are not easily deployed within a production environment, since they are written in Perl (rather than a more accessible, production-oriented language such as Python) or are not open source. It is often useful to evaluate simple processes that are open source and can be deployed easily. Scrubadub is such a process: an open-source Python package designed to “Remove personally identifiable information from free text”.

Below we have a simple example using Scrubadub.

import scrubadub

text = "John wrote to Doug using his email doug@gmail.com."
cleaned_text = scrubadub.clean(text, replace_with='identifier')
# u"{{NAME}} wrote to {{NAME}} using his email {{EMAIL}}"
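
When reviewing scrubbed output in bulk, it can help to tally which placeholder types appear. The sketch below uses only the Python standard library and assumes placeholders follow the `{{TYPE}}` form shown above, optionally with a numeric suffix such as `{{NAME-0}}`. The helper `count_placeholders` is hypothetical and is not part of Scrubadub.

```python
import re
from collections import Counter

# Matches {{NAME}} as well as suffixed forms like {{NAME-0}} (assumed format).
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)(?:-\d+)?\}\}")


def count_placeholders(cleaned_text):
    """Count how often each placeholder type occurs in already-scrubbed text."""
    return Counter(PLACEHOLDER.findall(cleaned_text))


sample = "{{NAME}} wrote to {{NAME}} using his email {{EMAIL}}"
counts = count_placeholders(sample)
```

A tally like this gives a quick sense of which kinds of identifiers dominate a data set before any word-by-word review.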

We evaluated Scrubadub’s usefulness on our patient response data. We randomly selected 120 text-based conversations that patients had with our patient help desk. In total, the data set comprised 13,520 words. We then manually assessed Scrubadub’s evaluation of each word. A contingency table is presented below.

Descriptive data of the Patient billing data set used to evaluate Scrubadub

Unfortunately, we have not evaluated Scrubadub against the same data set used by Norgeot et al. (saving that for a future post!). However, when used against our own data set, it achieves fairly impressive results:

Scrubadub Performance on Patient billing questions

The above can be compared to Table 1 from Norgeot et al. for perspective. A future blog post would evaluate the two directly against each other using the same underlying data set.
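
For readers who want to reproduce this kind of evaluation, word-level metrics such as recall and precision can be derived from the four cells of a contingency table. The sketch below is a generic illustration with made-up labels, not the data set or counts described above: each word carries a manual label (`True` if it is PHI) and a flag from the tool (`True` if it was scrubbed). The function names are hypothetical.

```python
def contingency(manual_labels, tool_flags):
    """Count true/false positives and negatives over paired per-word labels."""
    tp = fp = fn = tn = 0
    for is_phi, flagged in zip(manual_labels, tool_flags):
        if is_phi and flagged:
            tp += 1  # PHI that was correctly scrubbed
        elif flagged:
            fp += 1  # harmless word that was scrubbed
        elif is_phi:
            fn += 1  # PHI that slipped through
        else:
            tn += 1  # harmless word left alone
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


def metrics(c):
    """Recall, precision and F1 from contingency counts (0.0 when undefined)."""
    recall = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
    precision = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Toy example: six words, illustrative values only.
manual = [True, False, True, False, False, True]
flags = [True, False, False, True, False, True]
table = contingency(manual, flags)
scores = metrics(table)
```

For de-identification, recall is usually the metric to watch most closely, since a missed identifier is a privacy leak while a false positive only costs some readability.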

Overall, this seems positive. Determining what is acceptable for a production system is a debate to save for another day, but it is worth noting that the metrics above are comparable to those shown in Table 1 of Norgeot et al. In addition, there are a number of easy optimizations that could improve these results with minimal effort. Future work could include evaluating the ease of implementation and performance of more sophisticated algorithms; the results could help determine whether the additional integration effort would be worthwhile.

In conclusion, it seems technically possible to make better use of data that contain PHI, with the overall aim of improving patient outcomes from both an evidence-based and a patient-centered care perspective!


  1. Norgeot, B., Muenzen, K., Peterson, T.A. et al. Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. npj Digit. Med. 3, 57 (2020). https://doi.org/10.1038/s41746-020-0258-y
  2. Masic I, Miokovic M, Muhamedagic B. Evidence based medicine – new approaches and challenges. Acta Inform Med. 2008;16(4):219–225. doi:10.5455/aim.2008.16.219-225
  3. Weaver RR. Reconciling evidence-based medicine and patient-centred care: defining evidence-based inputs to patient-centred decisions. J Eval Clin Pract. 2015 Dec;21(6):1076–80. doi: 10.1111/jep.12465. Epub 2015 Oct 12. PMID: 26456314; PMCID: PMC5057360.