MDSD4Health: a Machine Learning Model & Dataset Disclosure (MDSD) Curriculum for Healthcare & Public Health Contexts
Detailing the development of a free-access curriculum to promote algorithmic transparency in health contexts
--
Abstract
In recent years, machine learning (ML) methodologies have advanced in sophistication and utility, leading to increased adoption in research and development contexts across disciplines. However, while ML is useful, increasing attention is being paid to these methods’ vulnerabilities to bias through both the structure of models’ algorithms and the data on which they are trained. While there exist no industry standards in health contexts to document performance characteristics of trained ML models, nor the limitations or methods utilized in the curation and preprocessing of ML datasets, several disclosure methods and mediums have been proposed alongside efforts to standardize them. A key barrier to these standardization efforts, however, is a lack of educational resources about the purpose and value of using model and dataset disclosure methods (MDSDs) in health contexts, as well as limited training to promote their use. To fill this gap, we present MDSD4Health, a low-barrier, free-access MDSD curriculum for health contexts.
Background & Motivation
Machine Learning Bias in Healthcare and Public Health
Machine learning (ML) refers to and encompasses a variety of methods that enable a computational system to autonomously “learn” from large amounts of data to perform a task (e.g., make predictions or decisions) without being explicitly programmed to do so. In recent years, ML methodologies have advanced in sophistication and utility, leading to increased adoption in research and development contexts across disciplines. However, while ML is demonstrably useful, increasing attention is being paid to these methods’ vulnerabilities to bias through both the structure of models’ algorithms and the data on which they are trained. Propensity of ML models to perpetuate or exacerbate biases is especially consequential in healthcare and public health contexts for which key foci include mitigation of health disparities and promotion of health equity.
Several consequences of bias perpetuation by ML models and their implementers have been documented in the healthcare space in recent years, including the Arkansas Department of Human Services’ use of a staffing resource optimization algorithm that, in 2016, inadvertently cut care coverage and service hours for formerly qualifying patients with no clear explanation. Other examples include the use of models trained on historical healthcare service utilization data, which may perpetuate existing patterns of care disparities in their assessments and recommendations. One such case was a complex care needs assessment model which disproportionately misclassified eligible Black patients as ineligible for a high-risk care management program compared to eligible white patients. This error was attributed in part to using historical care expenditures as an indicator of care need while failing to account for underrepresentation stemming from historically unequal access to care.
Proposed bias mitigation strategies include reviewing datasets to assess data representation across contextually relevant categories (such as demographics or care continuity patterns); optimizing models for imbalanced data sets; testing models throughout all stages of development; focusing on clinically or practically relevant outcomes rather than performance metrics alone; making code and datasets available for replication and reproduction; and collaborating on interdisciplinary teams for feature engineering, choosing appropriate settings for ML use, interpreting findings, and conducting follow-up assessments.
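The first of these strategies, reviewing data representation across contextually relevant categories, can be sketched in a few lines of pandas. The column names, category shares, and expected distribution below are invented for illustration, not drawn from any dataset discussed here:

```python
# Minimal sketch of a representation check across a demographic category;
# the data, column names, and expected shares are invented for illustration.
import pandas as pd

records = pd.DataFrame({
    "age_group": ["18-44", "45-64", "65+", "18-44", "18-44", "45-64"],
    "outcome":   [0, 1, 1, 0, 1, 0],
})

# Compare each category's share of the dataset against an expected
# (e.g., census-derived) distribution to surface under-representation.
observed = records["age_group"].value_counts(normalize=True)
expected = pd.Series({"18-44": 0.45, "45-64": 0.35, "65+": 0.20})
gap = (observed - expected).sort_values()
print(gap)  # most under-represented categories appear first
```

The same pattern extends to any contextually relevant grouping, such as care continuity patterns or geographic coverage.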
As several of these strategies require in-depth review of a model’s development process or training datasets, they may be enabled by robust disclosure of model origins, including known development limitations, performance metrics, intended uses, and training, testing, and validation dataset composition. Given this, the development of robust disclosure methods and mediums has been proposed as a central strategy in the pursuit of achieving algorithmic transparency in healthcare and public health contexts.
Disclosure for Machine Learning Bias Mitigation
While there exist no industry standards to document performance characteristics of trained ML models, nor the limitations or methods utilized in the curation and preprocessing of ML datasets, several disclosure methods and mediums have been proposed [see here, here, here, here, and here]. In this work, we offer a term to refer to the vast collection of proposed methods and mediums which enable the deliberate communication and reporting of an ML model’s origins (limitations, performance metrics, intended uses), as well as the origin and composition of the datasets used for a model’s training, testing, and validation: Model and Dataset Disclosures (MDSDs).
MDSDs may be broadly divided into two subcategories:
- Dataset disclosures
- Model disclosures
Dataset Disclosures
Per our working definition, dataset disclosures are MDSDs which focus specifically on the origins, characteristics, composition, and recommended uses of datasets intended to be used in the training, testing, and validation of ML models. Select dataset disclosure methods and mediums proposed in the literature include datasheets for datasets, dataset cards, the dataset nutrition label [first generation and second generation], and data statements for natural language processing.
Model Disclosures
Model disclosures, per our working definition, are MDSDs which focus specifically on the provenance, limitations, performance metrics, and recommended usage of trained ML models. Select model disclosure methods and mediums proposed in the literature include model cards, FactSheets, and the TRIPOD statement (while not developed for ML models specifically, the TRIPOD Statement format can be applied to relevant supervised learning models).
Prior Work & Opportunities to Contribute
Recognizing the value of model disclosures in health-related machine learning model deployment, my friend and colleague Vivian Neilley, of Google Cloud Healthcare & Life Sciences, conceptualized a Fast Healthcare Interoperability Resources (FHIR) standard in pursuit of more meaningful integration and standardization of model card usage in healthcare contexts. She prepared and presented this work toward the completion of her Master of Science degree in health informatics and analytics (HIA) at Tufts University School of Medicine in 2021.
A substantial barrier to the standardization ideals proposed in Vivian’s work, however, was the lack of educational resources about the purpose and value of using model cards in health contexts, as well as limited training to promote their use. Lack of accessible education is a well-documented barrier to practice standardization [also see here and here]. Thus, one opportunity to expand upon this work was to develop an educational resource to help fill the gap.
A second opportunity to build on Vivian’s work was to expand the effort’s lens from model disclosures alone to include dataset disclosures. Given that data inform the performance of ML models, the collection, curation, and composition of datasets substantially influence the kind of models that can be built and how they are used. Directing focus to the role of dataset reporting in model transparency efforts may enable more robust reporting strategies and further the ideals of standardizing model transparency practices in health contexts.
Project Goals
The primary goal of this project was to develop a working ML model and dataset disclosure (MDSD) curriculum for public health and healthcare contexts, which included a learning guide, content modules, and coding exercises to reinforce concepts. The project aimed to build on prior work on model card usage standardization by centering two opportunities for improvement:
- limited education as a barrier to practice standardization
- lack of focus on dataset disclosures
This project represented an effort to use and develop free-access and open-source educational materials to promote equitable use, adoption, and eventual standardization of MDSD methods in healthcare and public health contexts (such as that initially proposed by Vivian).
A secondary goal of this project was to identify opportunities for project continuity. As this project built upon the work of a former graduate student, an additional component of this work was to identify pathways for future project iterations by those interested in healthcare or public health ML model transparency.
Methods
Website Development
The Model and Dataset Disclosure Curriculum for Health Contexts (MDSD4Health) was published online via Google Sites, a structured webpage creation tool offered by Google. After developing an initial structure for the curriculum to follow, the content was built out as a series of webpages. To date, the curriculum contains 28 webpages and subpages, including a home page, an “about the curriculum” page, resource pages and subpages, an acknowledgement page, and several pages dedicated to MDSD4Health curriculum content.
Content Curation & Development
Information available on the MDSD4Health website includes both original content and curated content sourced externally. Curated content was carefully selected from credible sources, including academic journals and reputable online education organizations. Content was also selected based on free-access availability to align with the curriculum’s commitment to no-cost access (i.e., no article paywalls or institutional affiliation required). Curated MDSD4Health content includes open access academic publications from journals like the Journal of the American Medical Association (JAMA), educational videos from creators like Complexly’s Crash Course, online articles from respected publishers like the MIT Technology Review, and webpages sourced from platforms like Google Developers. All curated content is accompanied by an original or externally sourced graphic, as well as key points to explicitly tie the content back to the objective of the lesson. All curated content is also accompanied by a recommended citation so that credit for the work goes to its original creators rather than to MDSD4Health.
Original content was developed to supplement or expand upon curated content. Original videos were created and recorded using Microsoft PowerPoint and uploaded to YouTube before being embedded in the MDSD4Health website. Original graphics were created using Canva. Original Python exercises were developed in Google Colaboratory (“Colab”). Two exercises were adapted from existing exercises and tailored to the objectives of their corresponding submodules; these adaptations are stated explicitly in each exercise’s introduction, alongside an overview of the changes made.
Curriculum Organization
MDSD4Health content is organized into modules, submodules, and sections. The curriculum is composed of five modules, each containing between one and three submodules. Two submodules contain further “mini” submodules with abridged content. Within each submodule are several sections housing content relating to the submodule. For a detailed overview of the curriculum’s organization, see the MDSD4Health Learning Guide.
Modules & Submodules
Modules and submodules offered in the MDSD4Health curriculum are organized as follows:
- Module 1: Concepts in Machine Learning
- Submodule 1.1: What is Machine Learning?
- Submodule 1.2: Bias & Fairness in Machine Learning
- Module 2: The Role of Disclosure
- Submodule 2.1: Machine Learning Replicability & Reproducibility
- Submodule 2.2: Machine Learning Generalization
- Submodule 2.3: Primer on Model & Dataset Disclosures (MDSDs)
- Module 3: Model & Dataset Disclosures, Pt. I (Dataset Disclosures)
- Submodule 3.1: Datasheets for Datasets
- Submodule 3.2: Other Dataset Disclosures
- Submodule 3.2.1: Dataset Cards
- Submodule 3.2.2: Dataset Nutrition Label
- Submodule 3.2.3: Data Statements for Natural Language Processing
- Module 4: Model & Dataset Disclosures, Pt. II (Model Disclosures)
- Submodule 4.1: Model Cards
- Submodule 4.2: Other Model Disclosures
- Submodule 4.2.1: IBM FactSheets
- Submodule 4.2.2: TRIPOD Statement
- Module 5: What Did We Miss?
Sections
For consistency, each submodule’s content is divided into uniform sections, beginning with an action-oriented directive, such as “read,” “explore,” or “exercise.” In the curriculum’s accompanying Learning Guide, all sections are supplemented with a corresponding, hyperlinked icon.
Exercises
The MDSD4Health curriculum includes five demonstration-style Python exercises, designed to be run in Google Colab or other Jupyter notebook environments. Modules 1 through 4 each contain one or two exercises designed to further demonstrate a concept introduced in the exercise’s corresponding submodule.
The first two exercises, offered in submodules 1.1 and 1.2 respectively, pertain to foundational concepts in machine learning.
In the first exercise, learners create a simple k-nearest neighbors (KNN) classifier model using the famed Wisconsin Breast Cancer (Diagnostic) Dataset. The exercise demonstrates importing and splitting a dataset with a target variable, creating a KNN classifier using the scikit-learn package in Python, and assessing performance using confusion matrix-based performance metrics.
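As a rough sketch of that workflow, assuming scikit-learn’s bundled copy of the dataset (the split ratio, random seed, and k value here are assumptions, not necessarily the exercise’s own):

```python
# Hedged sketch of the first exercise's workflow; the actual notebook may
# differ in parameters and structure.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Import the dataset and split it on its target variable
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create and fit a k-nearest neighbors classifier (k=5, scikit-learn's default)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Assess performance using confusion matrix-based metrics
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```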
In the second exercise, learners build a simple artificial neural network using synthetic survey data that inadvertently introduces bias. The exercise is based on an existing notebook-based lab offered by Crash Course Artificial Intelligence and is tailored to a healthcare context. The exercise demonstrates synthetic survey data generation, splitting datasets, building a simple neural network, and reviewing performance metrics. However, in addition to constructing and testing a model, the exercise introduces learners to the concept of model auditing beyond performance metrics by turning attention to a biased model’s impact on a healthcare-related funding decision. Learners find that their synthetic survey data was subject to both sampling bias and undetected feature correlation. Although confusion matrix-based performance metrics indicated good model performance, failure to account for the contextual nuances of the dataset resulted in distributional misrepresentation and, in turn, misleading model outputs.
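The auditing idea at the heart of that exercise, that aggregate metrics can look good while subgroup performance diverges, can be illustrated with a minimal sketch. The data, group labels, and bias mechanism below are invented for illustration and are not the exercise’s own:

```python
# Illustration of auditing beyond aggregate metrics: compare a
# confusion matrix-based metric (true positive rate) across subgroups.
# All data here is synthetic and invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)      # 0 = subgroup A, 1 = subgroup B
y_true = rng.integers(0, 2, n)     # 1 = genuinely in need of services

# Simulated predictions: accurate for subgroup A, but biased toward
# "no need" for subgroup B about half the time
y_pred = np.where(group == 0, y_true,
                  (y_true & (rng.random(n) < 0.5)).astype(int))

# True positive rate within each subgroup
tprs = {}
for g in (0, 1):
    positives = (group == g) & (y_true == 1)
    tprs[g] = (y_pred[positives] == 1).mean()
    print(f"subgroup {g}: TPR = {tprs[g]:.2f}")
```

A funding decision driven by this model would systematically underserve subgroup B even though overall accuracy remains high.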
The third exercise, offered in submodule 2.1, introduces learners to disclosure of data preprocessing methods to enable methods replication and reproduction. Learners import, merge, and subset two pre-pandemic (2017–2020) NHANES datasets to meet given inclusion criteria and are then offered sample disclosure language to accompany this subset, such that others could easily replicate their methods if needed. The exercise’s corresponding submodule aims to foster an interest in study transparency and a commitment to reliable science; the exercise itself offers a brief introduction to how disclosing data acquisition and preprocessing measures can contribute to that overarching effort.
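The merge-and-subset pattern in that exercise can be sketched with toy frames standing in for the real NHANES files. The variable names (SEQN, RIDAGEYR, BMXBMI) follow NHANES conventions, but the values and inclusion criteria here are invented:

```python
# Sketch of the merge-and-subset pattern, with toy frames standing in
# for real NHANES files; inclusion criteria are invented for illustration.
import pandas as pd

demographics = pd.DataFrame({
    "SEQN": [1, 2, 3, 4],            # respondent sequence number (NHANES ID)
    "RIDAGEYR": [25, 67, 40, 15],    # age in years
})
examination = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "BMXBMI": [22.1, 31.4, 27.8],    # body mass index
})

# Merge the two datasets on the shared respondent ID
merged = demographics.merge(examination, on="SEQN", how="inner")

# Subset to meet example inclusion criteria: adults (18+) with a recorded BMI
subset = merged[(merged["RIDAGEYR"] >= 18) & merged["BMXBMI"].notna()]
print(subset)
```

Disclosure language accompanying such a subset would then record the merge key, the join type, and each inclusion criterion so the preprocessing can be replicated exactly.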
The last two exercises, offered in submodules 3.1 and 4.1 respectively, relate to two MDSDs proposed in the literature: Datasheets for Datasets and Model Cards. The exercise offered in submodule 3.1 provides a script that produces a datasheet for the NHANES data subset created in the previous module. Learners upload their subset to acquire automated disclosures about the dataset’s characteristics (shape and data type frequencies) and offer manual disclosures where indicated in the script, based on the acquisition and preprocessing activities they conducted in the prior exercise. The script leverages the PyFPDF package to generate and export their datasheet as a PDF.
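The “automated disclosures” step of the datasheet exercise, deriving a dataset’s shape and data type frequencies programmatically, can be sketched as follows. The toy frame is invented, and the PDF export via PyFPDF is omitted here for brevity:

```python
# Hedged sketch of automated datasheet disclosures: dataset shape and
# data type frequencies. The toy frame is invented; the real exercise
# exports the full datasheet as a PDF via PyFPDF.
import pandas as pd

df = pd.DataFrame({
    "SEQN": [1, 2, 3],
    "RIDAGEYR": [25.0, 67.0, 40.0],
    "RIAGENDR": ["male", "female", "female"],
})

# Automated disclosures: shape and data type frequencies
rows, cols = df.shape
dtype_counts = df.dtypes.astype(str).value_counts().to_dict()

datasheet_lines = [
    f"Rows: {rows}",
    f"Columns: {cols}",
    "Data type frequencies: " + ", ".join(
        f"{dtype}: {count}" for dtype, count in dtype_counts.items()),
]
print("\n".join(datasheet_lines))
```

Manual disclosures (motivation, collection process, known limitations) would then be appended alongside these automated fields before export.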
In the curriculum’s final exercise, in submodule 4.1, learners create a model card for a classifier developed using the Wisconsin Breast Cancer (Diagnostic) Dataset and export it as an HTML page. The exercise is based on an existing notebook offered by the Google Cloud team as part of their Model Card Toolkit. Changes made to the original notebook include the addition of comments and annotations, expansion and rephrasing of code block explanations, and rearrangement of model evaluation steps; however, all executable code has been left unchanged.
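The real notebook relies on Google’s Model Card Toolkit; as a simplified stand-in for the pattern it automates, populating structured model card fields and rendering them as an HTML page might look like the sketch below. The field names are simplified assumptions, not the toolkit’s actual schema:

```python
# Simplified stand-in for the model-card-to-HTML pattern the Model Card
# Toolkit automates; these field names are invented, not the toolkit's schema.
from html import escape

model_card = {
    "Model name": "Breast cancer diagnostic classifier (demo)",
    "Intended use": "Educational demonstration only; not for clinical use.",
    "Training data": "Wisconsin Breast Cancer (Diagnostic) Dataset",
    "Known limitations": "Small, single-source dataset; no external validation.",
}

# Render each disclosure field as a table row, escaping the text for HTML
table_rows = "\n".join(
    f"<tr><th>{escape(k)}</th><td>{escape(v)}</td></tr>"
    for k, v in model_card.items()
)
html_page = (
    "<html><body><h1>Model Card</h1>"
    f"<table>{table_rows}</table></body></html>"
)

with open("model_card.html", "w") as f:
    f.write(html_page)
```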
Other Features
In addition to the main curriculum organized into modules, submodules, and sections, MDSD4Health content is available through other website features.
- Home Page: The MDSD4Health home page offers a welcome to MDSD4Health.com and orients a user to the curriculum’s purpose and guiding ideas, including “machine learning education for all,” “transparency in health-related automation,” and “information as a public good.” The page concludes with a section titled “Start Exploring” with hyperlinked icons to ease a user into the curriculum’s content.
- Learning Guide: The MDSD4Health Learning Guide is an accompanying curriculum document that offers a high-level summary of MDSD4Health content through brief section descriptions, key concepts, and completion time estimates. It is designed to be used by individuals who are either reviewing the MDSD4Health curriculum for themselves or integrating the content into an existing course to identify portions of interest more effectively.
- “About the Curriculum” page: The “About the Curriculum” page offers answers to frequently asked and anticipated questions about MDSD4Health, including “What is MDSD4Health.com?” “Who created this curriculum?” “Why focus on model and dataset disclosures?” “Where can I suggest revisions to the curriculum?” and others.
- Resources: The MDSD4Health curriculum offers a resources page that provides relevant information in each of the following domains:
- Python & Colab Help: Because all MDSD4Health exercises are provided in Python via Colab notebooks, MDSD4Health offers a page dedicated to supporting learners in using Python and Colab. Within this page are explanations of Python (a high-level general-purpose programming language) and Colab (a free-access, cloud-based, in-browser Jupyter notebook environment) as well as links to various support resources, including a beginner’s guide to Python, a Python cheat sheet, a Colab tutorial, and a direct link to Stack Overflow for troubleshooting.
- Report a Problem or Suggest a Revision: To enable continuous improvement, MDSD4Health leverages the collective insights of its learners via suggested revisions to content. Learners are encouraged to report problems and suspected content inaccuracies via the Feedback Form (through which they may also suggest or request that additional content be included). Learners with Python programming experience who believe there is a better or more elegant way to structure an exercise are also encouraged to suggest revisions via a direct link to the MDSD4Health GitHub repository.
- Twitter Participation Guide: MDSD4Health uses the social media platform Twitter to facilitate discussion through a series of “thought prompts” at the end of several submodules. Learners are encouraged to optionally participate in submodule-relevant discussions by using provided hashtags and tagging the MDSD4Health Twitter account. To ensure that learners who would like to participate on Twitter are equipped to do so, MDSD4Health offers a “Twitter Participation Guide” that provides information about how to tag us and how to use hashtags. For newcomers to Twitter, this page links to a “Getting Started with Twitter” guide developed and offered by the Twitter platform. Finally, this guide includes links to two curated Twitter Lists containing Tweets from people whom learners may be interested in following:
- MDSD Developers: a list of Twitter users who developed or published a proposed model or dataset disclosure (MDSD) method that was featured in MDSD4Health
- MDSD4Health Content Sources: a list of Twitter accounts for the creators and platforms referenced in MDSD4Health
Curriculum Review & Revision
The MDSD4Health curriculum website was developed and improved iteratively based on review and feedback from several ML-literate colleagues and mentors, including:
- Vivian Neilley, M.S. (MDSD4Health project preceptor and Product Manager at Google Cloud)
- Ramya Palacholla, M.D., M.P.H. (Director of the Master of Health Informatics and Analytics program at Tufts University School of Medicine)
- Vahab Vahdatzad, Ph.D., M.S. (Course co-director for Introduction to Artificial Intelligence and Big Data in Health Care at Tufts University School of Medicine)
Improvements made to MDSD4Health based on their feedback include:
- Addition of completion time estimates to the Learning Guide to help users more readily glean a rough time commitment required for each section
- Addition of information about how instructors of secondary or higher education who teach similar or adjacent material related to ML methods transparency can incorporate MDSD4Health materials into their own curricula
- Creation of Twitter Lists composed of users relevant to MDSD4Health for learners to follow
- Design and incorporation of a health-related classification use case to accompany the highlighted use case in submodule 1.1
- Removal of artificial neural network calculations from the main curriculum content in submodule 1.1 due to their complexity, and their relocation to “bonus material”
Discussion
Intended Outcomes
The primary objective of this work was to develop an educational resource that could be used to teach and learn about MDSDs in health contexts. The intended outcome of this work was to increase exposure to MDSDs and their relevance in health contexts in such a way that may promote equitable use, adoption, and eventual standardization of MDSD methods in health contexts (such as that initially proposed by the work of Vivian Neilley, upon which this project was based).
Next Steps
The secondary objective of this work was to identify opportunities for other people interested in healthcare or public health ML model transparency to build upon this work, as I did with Vivian’s initial contribution. As such, the next steps in this work are to compile feedback from users of the website to identify website and curriculum improvements. Feedback will be compiled via the website’s Feedback Form, and the provenance of suggested code revisions will be preserved via pull requests on GitHub.
Conclusion
While there exist no industry standards in healthcare contexts or otherwise to document performance characteristics of trained ML models, nor the limitations or methods utilized in the curation and preprocessing of ML datasets, several disclosure methods and mediums have been proposed along with efforts to standardize them [see here, here, here, here, and here]. A key barrier to these standardization efforts, however, is the lack of educational resources about the purpose and value of using MDSDs in health contexts, as well as limited training to promote their use. Through developing and offering a low-barrier, free-access MDSD curriculum for health contexts, MDSD4Health, we aimed to fill this gap. The MDSD4Health website is live and available for use at https://www.mdsd4health.com/.
Ways You Can Get Involved
All are welcome to review and critique the MDSD4Health curriculum.
If you would like to suggest a revision, report a content inaccuracy, or request new content topics, please do so via our Feedback Form.
If you would like to revise our exercises, please do so via a pull request on GitHub.
If you would like to volunteer with MDSD4Health, send us an email.
Acknowledgements
This work was executed toward the partial completion of my Master of Science degree in Health Informatics and Analytics at Tufts University School of Medicine in July 2022. This project was carried out under the preceptorship of Vivian Neilley, M.S., at Google Cloud and the course direction of Ramya Palacholla, M.D., M.P.H., at Tufts University School of Medicine.
I thank Vivian and Ramya immensely for their support throughout this project, along with Vahab Vahdatzad, Ph.D., M.S., for his thoughtful feedback on the curriculum’s content and organization.
I also thank my graduate school colleagues and classmates in the Master of Science in Health Informatics and Analytics (MS-HIA) and Master of Public Health (MPH) programs at Tufts University for their continued support and feedback at all stages of this project.
References
Google Cloud. What is machine learning? Google Cloud. Accessed June 5, 2022. https://cloud.google.com/learn/what-is-machine-learning
Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–260. doi:10.1126/science.aaa8415
Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A. A Survey on Bias and Fairness in Machine Learning. ACM Comput Surv. 2021;54(6):1–35. doi:10.1145/3457607
Mhasawade V, Zhao Y, Chunara R. Machine learning and algorithmic fairness in public and population health. Nat Mach Intell. 2021;3(8):659–666. doi:10.1038/s42256-021-00373-4
Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. JAMA Intern Med. 2018;178(11):1544. doi:10.1001/jamainternmed.2018.3763
Lecher C. What Happens When an Algorithm Cuts Your Healthcare? The Verge. Published online March 21, 2018. https://www.theverge.com/2018/3/21/17144260/healthcare-medicaid-algorithm-arkansas-cerebral-palsy
Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Ann Intern Med. 2018;169(12):866. doi:10.7326/M18-1990
Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447–453. doi:10.1126/science.aax2342
O’Reilly-Shah VN, Gentry KR, Walters AM, Zivot J, Anderson CT, Tighe PJ. Bias and ethical considerations in machine learning and the automation of perioperative risk assessment. Br J Anaesth. 2020;125(6):843–846. doi:10.1016/j.bja.2020.07.040
Beam AL, Manrai AK, Ghassemi M. Challenges to the Reproducibility of Machine Learning Models in Health Care. JAMA. 2020;323(4):305. doi:10.1001/jama.2019.20866
Mitchell M, Wu S, Zaldivar A, et al. Model Cards for Model Reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency. ACM; 2019:220–229. doi:10.1145/3287560.3287596
Gebru T, Morgenstern J, Vecchione B, et al. Datasheets for datasets. Commun ACM. 2021;64(12):86–92. doi:10.1145/3458723
Arnold M, Bellamy RKE, Hind M, et al. FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity. Published online 2018. doi:10.48550/ARXIV.1808.07261
Holland S, Hosny A, Newman S, Joseph J, Chmielinski K. The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards. Published online 2018. doi:10.48550/ARXIV.1805.03677
Bender EM, Friedman B. Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science. TACL. 2018;6:587–604. doi:10.1162/tacl_a_00041
Hugging Face. Create a dataset card. Hugging Face. https://huggingface.co/docs/datasets/dataset_card
Chmielinski KS, Newman S, Taylor M, et al. The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence. Published online 2022. doi:10.48550/ARXIV.2201.03954
Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med. 2015;162(1):55–63. doi:10.7326/M14-0697
HL7 FHIR Foundation. https://www.fhir.org/
Neilley V. What Health Care Must Learn from Meteorology about the Importance of R2O. STAT. Published March 31, 2022. https://www.statnews.com/2022/03/31/what-health-care-must-learn-from-meteorology-about-the-importance-of-r2o/
Conrad D, Hanson PA, Hasenau SM, Stocker-Schneider J. Identifying the barriers to use of standardized nursing language in the electronic health record by the ambulatory care nurse practitioner. Journal of the American Academy of Nurse Practitioners. 2012;24(7):443–451. doi:10.1111/j.1745-7599.2012.00705.x
Fischer F, Lange K, Klose K, Greiner W, Kraemer A. Barriers and Strategies in Guideline Implementation — A Scoping Review. Healthcare. 2016;4(3):36. doi:10.3390/healthcare4030036
The Future of Public Health. National Academies Press; 1988. doi:10.17226/1091
Shaveet E. MDSD4Health. MDSD4Health. Published July 2022. https://www.mdsd4health.com/
Google. Google Sites. https://sites.google.com/new
Journal of the American Medical Association. JAMA Network. https://jamanetwork.com/journals/jama
Green H, Green J. Crash Course. Crash Course. https://thecrashcourse.com/
MIT Technology Review. MIT Technology Review. https://www.technologyreview.com/
Google Developers. Google Developers. https://developers.google.com/
Microsoft PowerPoint. Microsoft. https://www.microsoft.com/en-us/microsoft-365/powerpoint
YouTube. Google. https://www.youtube.com/
Colaboratory. Google. https://colab.research.google.com/
Python. Python Foundation. https://www.python.org/
Jupyter. Project Jupyter. https://jupyter.org/
Brungard B. Cats vs Dogs? Let’s Make an AI to Settle This: Crash Course Ai #19. Vol 19. PBS Digital Studios; 2019. https://www.youtube.com/watch?v=_DZJV9ey1nE&list=PL8dPuuaLjXtO65LeD2p4_Sb5XQ51par_b&index=21