Summarizing Hospital Quality Data

The week 4 assignment of Johns Hopkins R Programming: Hospital Quality

Yao-Jen Kuo
DataInPoint Digest (數聚點文摘)


Photo by Hush Naidoo on Unsplash

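This assignment checks whether we can use R to draw a histogram and write three custom functions that work with the nationwide U.S. hospital quality data supplied by the course, in order to answer these questions:

  • Plot a histogram of the 30-day mortality rates for heart attack
  • Find the hospital with the best quality of care in a given state
  • Rank the hospitals in a given state by quality of care
  • Rank hospitals by quality of care across all states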
Johns Hopkins University Data Science Specialization: R Programming

Many people have gotten jobs in machine learning just by completing that MOOC. There’re other similar online courses that help; for example the John Hopkins Data Science specialization.

Andrew Ng answering How should you start a career in Machine Learning? on Quora

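The Data Science Specialization that Johns Hopkins University offers on Coursera is a thorough, high-quality program, but finishing all ten of its courses on your own is no small task for a beginner. The Programming Assignments are considerably harder than the lectures or the swirl exercises, so it is easy for the enthusiasm you started with to fizzle out the moment a Programming Assignment comes due.

Work through the course's Programming Assignments together with DataInPoint! Today we tackle the assignment due in week 4 of the second course, R Programming.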

Overview

The data for this assignment come from the Hospital Compare web site (http://hospitalcompare.hhs.gov) run by the U.S. Department of Health and Human Services. The purpose of the web site is to provide data and information about the quality of care at over 4,000 Medicare-certified hospitals in the U.S. This dataset essentially covers all major U.S. hospitals. This dataset is used for a variety of purposes, including determining whether hospitals should be fined for not providing high quality care to patients (see http://goo.gl/jAXFX for some background on this particular topic).

R Programming, Johns Hopkins University

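First, define a helper function get_hospital_data(), which has nothing to do with the questions themselves, to download the data and unzip it under a home directory you are comfortable with: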
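A minimal sketch of what such a helper could look like. The argument names url and dest_dir and the local file name hospital-data.zip are our own choices for illustration; the actual download link is the one given on the course's assignment page, so it is left as a required argument here rather than hard-coded.

```r
# Hypothetical helper: download a zip archive and extract it under dest_dir.
# `url` must be supplied by the caller (take it from the assignment page).
get_hospital_data <- function(url, dest_dir = path.expand("~")) {
  zip_path <- file.path(dest_dir, "hospital-data.zip")  # assumed file name
  # Skip the download if the archive is already there
  if (!file.exists(zip_path)) {
    download.file(url, destfile = zip_path, mode = "wb")
  }
  # unzip() returns the extracted file paths invisibly
  unzip(zip_path, exdir = dest_dir)
}
```

Calling get_hospital_data(url) once should leave the CSV files in the home directory, and later calls reuse the archive already on disk.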

Open outcome-of-care-measures.csv in a text editor and take a look. This is the main dataset for the four questions that follow: quality-of-care data for more than 4,000 hospitals in the United States.

Question 1: Plot a histogram of the eleventh variable, 30-day mortality rates for heart attack

To make a simple histogram of the 30-day death rates from heart attack (column 11 in the outcome dataset.) We need to coerce the column to be numeric. You may get a warning about NAs being introduced but that is okay.

R Programming, Johns Hopkins University

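This one is just a warm-up: practice drawing a histogram with either the base plotting system or ggplot2. A few things to watch for:

  • Missing values in the data are represented by the string Not Available, not by an empty string
  • Tell read.csv() not to read character columns in as factors
  • The variable plotted in a histogram must be numeric, so convert it with the built-in as.numeric() function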
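The points above can be sketched with the base plotting system as follows. To keep the snippet runnable without the course file, it reads a tiny made-up CSV from a string; the column name Heart.Attack.Rate and every value in it are invented for illustration. With the real file you would read outcome-of-care-measures.csv and use column 11 instead.

```r
# Toy stand-in for the course file (all values are made up).
# Missing rates appear as the string "Not Available", as in the real data.
csv_text <- "Hospital.Name,State,Heart.Attack.Rate
HOSPITAL A,TX,14.3
HOSPITAL B,TX,Not Available
HOSPITAL C,MD,16.1
HOSPITAL D,MD,15.2
HOSPITAL E,NY,13.8"

# colClasses = "character" keeps every column as plain text (no factors)
outcome <- read.csv(text = csv_text, colClasses = "character")

# "Not Available" cannot be parsed, so as.numeric() turns it into NA
# and warns; the warning is expected and can be suppressed.
rates <- suppressWarnings(as.numeric(outcome[, 3]))

# hist() silently drops the NA values
h <- hist(rates,
          main = "30-day mortality rates for heart attack",
          xlab = "30-day death rate")
```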

Question 2: Find the best hospital in a state

Write a function called best that take two arguments: the 2-character abbreviated name of a state and an outcome name. The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the best (i.e. lowest) 30-day mortality for the specified outcome in that state. The hospital name is the name provided in the Hospital.Name variable. The outcomes can be one of “heart attack”, “heart failure”, or “pneumonia”. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties. If there is a tie for the best hospital for a given outcome, then the hospital names should be sorted in alphabetical order and the first hospital in that set should be chosen (i.e. if hospitals “b”, “c”, and “f” are tied for best, then hospital “b” should be returned).

R Programming, Johns Hopkins University

The custom function for Question 2 is best(), which takes the state and outcome arguments. Example output:

best("TX", "heart failure")
## [1] "FORT DUNCAN MEDICAL CENTER"
best("MD", "heart attack")
## [1] "JOHNS HOPKINS HOSPITAL, THE"
best("MD", "pneumonia")
## [1] "GREATER BALTIMORE MEDICAL CENTER"
best("BB", "heart attack")
## Error in best("BB", "heart attack") : invalid state
best("NY", "hert attack")
## Error in best("NY", "hert attack") : invalid outcome

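A few things to watch for:

  • "Best" means the hospital has the lowest 30-day mortality rate for heart attack, heart failure, or pneumonia
  • If several hospitals share the lowest 30-day mortality rate, return the one whose name comes first alphabetically
  • When the user passes an invalid state or outcome name, use stop() to raise the custom error message invalid state or invalid outcome
  • When wrangling the data with dplyr, use get() to turn the function's string input into the object (column) it names
  • Use suppressWarnings() to silence the warnings produced by the type conversion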
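The points above can be sketched in base R as follows. This is not the dplyr version described in the list, so get() is not needed: the outcome string is mapped to a column name through a lookup vector instead. The long column names are what we assume read.csv() produces from the course file's header, and the extra file argument is our own addition so the function can be pointed at any copy of the data.

```r
# Assumed column names, as read.csv() renders the course file's header
outcome_cols <- c(
  "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
  "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
  "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
)

best <- function(state, outcome, file = "outcome-of-care-measures.csv") {
  df <- read.csv(file, colClasses = "character")

  # Validate inputs with the custom error messages
  if (!state %in% df$State) stop("invalid state")
  if (!outcome %in% names(outcome_cols)) stop("invalid outcome")

  in_state <- df[df$State == state, ]
  # "Not Available" becomes NA; the coercion warning is expected
  rate <- suppressWarnings(as.numeric(in_state[[outcome_cols[[outcome]]]]))

  # Drop hospitals without data, then sort by rate and break ties by name
  keep <- !is.na(rate)
  hospitals <- in_state$Hospital.Name[keep]
  hospitals[order(rate[keep], hospitals)][1]
}
```

Passing two keys to order() is what handles ties: hospitals with the same rate fall back to alphabetical order, so the first element is the one the assignment asks for.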

Check that the output matches the examples.

Question 3: Rank the hospitals within a single state

Write a function called rankhospital that takes three arguments: the 2-character abbreviated name of a state (state), an outcome (outcome), and the ranking of a hospital in that state for that outcome (num). The function reads the outcome-of-care-measures.csv file and returns a character vector with the name of the hospital that has the ranking specified by the num argument. For example, the call rankhospital(“MD”, “heart failure”, 5) would return a character vector containing the name of the hospital with the 5th lowest 30-day death rate for heart failure. The num argument can take values “best”, “worst”, or an integer indicating the ranking (smaller numbers are better). If the number given by num is larger than the number of hospitals in that state, then the function should return NA. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties. It may occur that multiple hospitals have the same 30-day mortality rate for a given cause of death. In those cases ties should be broken by using the hospital name. For example, in Texas (“TX”), the hospitals with lowest 30-day mortality rate for heart failure are shown here.

R Programming, Johns Hopkins University

The custom function for Question 3 is rankhospital(), which takes the state, outcome, and num arguments. Example output:

rankhospital("TX", "heart failure", 4)
## [1] "DETAR HOSPITAL NAVARRO"
rankhospital("MD", "heart attack", "worst")
## [1] "HARFORD MEMORIAL HOSPITAL"
rankhospital("MN", "heart attack", 5000)
## [1] NA

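A few things to watch for:

  • Carry over the approach and the caveats from the previous question
  • Handle the worst and best settings by sorting in descending or ascending order, and treat a num larger than the number of hospitals in the state as a special case that returns NA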
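One way to sketch this in base R, under the same assumptions as before (the column names are what we expect read.csv() to produce from the course file, and the file argument is our own addition). Rather than sorting in descending order for "worst", this variant sorts ascending once and takes the last element, which keeps the tie-breaking by name the same for every value of num.

```r
# Assumed column names, as read.csv() renders the course file's header
outcome_cols <- c(
  "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
  "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
  "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
)

rankhospital <- function(state, outcome, num = "best",
                         file = "outcome-of-care-measures.csv") {
  df <- read.csv(file, colClasses = "character")
  if (!state %in% df$State) stop("invalid state")
  if (!outcome %in% names(outcome_cols)) stop("invalid outcome")

  in_state <- df[df$State == state, ]
  rate <- suppressWarnings(as.numeric(in_state[[outcome_cols[[outcome]]]]))

  # Exclude hospitals without data; sort by rate, ties broken by name
  keep <- !is.na(rate)
  hospitals <- in_state$Hospital.Name[keep]
  ranked <- hospitals[order(rate[keep], hospitals)]

  # Translate "best" / "worst" into positions in the ascending ranking
  if (identical(num, "best")) num <- 1
  if (identical(num, "worst")) num <- length(ranked)

  # A rank outside the available range returns NA
  if (num < 1 || num > length(ranked)) return(NA_character_)
  ranked[num]
}
```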

Check that the output matches the examples.

Question 4: Rank hospitals across all states

Write a function called rankall that takes two arguments: an outcome name (outcome) and a hospital ranking(num). The function reads the outcome-of-care-measures.csv file and returns a 2-column data frame containing the hospital in each state that has the ranking specified in num. For example the function call rankall(“heart attack”, “best”) would return a data frame containing the names of the hospitals that are the best in their respective states for 30-day heart attack death rates. The function should return a value for every state (some may be NA). The first column in the data frame is named hospital, which contains the hospital name, and the second column is named state, which contains the 2-character abbreviation for the state name. Hospitals that do not have data on a particular outcome should be excluded from the set of hospitals when deciding the rankings.

Handling ties. The rankall function should handle ties in the 30-day mortality rates in the same way that the rankhospital function handles ties.

R Programming, Johns Hopkins University

The custom function for Question 4 is rankall(), which takes the outcome and num arguments.

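A few things to watch for:

  • Apply the approach from the previous question to every state
  • Sort the states alphabetically
  • Some states have fewer hospitals than the num argument asks for; return NA for those
  • Use the state abbreviations as the row names of the returned data frame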
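The points above can be sketched in base R as follows, again assuming the column names that read.csv() would produce from the course file and adding a file argument of our own. The inner helper pick() is a hypothetical name; it repeats the single-state ranking logic for each state in turn.

```r
# Assumed column names, as read.csv() renders the course file's header
outcome_cols <- c(
  "heart attack"  = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Attack",
  "heart failure" = "Hospital.30.Day.Death..Mortality..Rates.from.Heart.Failure",
  "pneumonia"     = "Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia"
)

rankall <- function(outcome, num = "best",
                    file = "outcome-of-care-measures.csv") {
  df <- read.csv(file, colClasses = "character")
  if (!outcome %in% names(outcome_cols)) stop("invalid outcome")

  rate <- suppressWarnings(as.numeric(df[[outcome_cols[[outcome]]]]))
  states <- sort(unique(df$State))  # alphabetical order

  # Hypothetical helper: the hospital at rank `num` in one state, or NA
  pick <- function(st) {
    keep <- df$State == st & !is.na(rate)
    hospitals <- df$Hospital.Name[keep]
    ranked <- hospitals[order(rate[keep], hospitals)]
    idx <- if (identical(num, "best")) 1
           else if (identical(num, "worst")) length(ranked)
           else num
    if (idx < 1 || idx > length(ranked)) NA_character_ else ranked[idx]
  }

  # Two columns, one row per state, with state abbreviations as row names
  data.frame(hospital = vapply(states, pick, character(1)),
             state = states,
             row.names = states,
             stringsAsFactors = FALSE)
}
```

Reading the file once and looping over the sorted state vector means every state gets a row, including those where pick() falls back to NA.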

Check that the output matches the examples.