Diagnosis Classification

Anapeshku
INST414: Data Science Techniques
5 min read · May 1, 2024

Technology is increasingly being incorporated into the medical field. Procedures that once depended heavily on human judgment are becoming automated, as seen in medical imaging and robotic surgery. This article examines one such area: the use of technology to diagnose chronic illnesses. The question studied is: can chronic illnesses be accurately classified given a list of symptoms?

Stakeholders in this question include patients working toward a diagnosis of their illness and diagnostic clinicians, such as nurses and doctors. Diagnosing a chronic illness is typically a time-consuming process that requires multiple appointments and expensive testing, and it may be out of reach for patients who lack the scheduling flexibility or funding the process demands. Some patients may not have a clinic in their area with the diagnostic capabilities to identify their illness. Additionally, during the coronavirus pandemic, hospitals and medical practices were overwhelmed by an unprecedented number of patients, so many diagnoses were delayed until resources became available.

If classification by symptoms produces an accurate diagnosis, the process would become much faster for patients. For some illnesses, timeliness of treatment is crucial: without an accurate diagnosis, a condition can spread or develop into something life-threatening. If classifying diseases by symptoms proves accurate, it could eventually lead to software where patients simply type in their symptoms and receive a diagnosis quickly, avoiding the costly process of scheduling multiple appointments. It could also make medical care more accessible: a patient without the necessary testing resources in their area, but with internet access, could still receive guidance toward a diagnosis if such software were developed further. Accurately classifying diseases by symptoms today could open the door to advancements like these.

Data containing patient symptoms and diagnoses is needed to answer the proposed question, with the diagnoses in the training dataset serving as the ground-truth labels. The data used for the disease classification can be found at this link: https://www.kaggle.com/datasets/marslinoedward/disease-prediction-data?select=Testing.csv. The dataset, found on Kaggle, contains the list of symptoms each patient was experiencing along with their diagnosis, which is all the information necessary to answer the question. Each column represents a common symptom, such as stomach pain, acidity, tongue ulcers, or skin rash: a value of one means the patient had that symptom, and a zero means they did not. To prepare the data for modelling, each patient was given an ID number, and each symptom with a value of 1 was added to a list of dictionary values, with the patient's diagnosis as the dictionary key. A classification model was used because the outcome to predict, the diagnosis, is a categorical variable taking values such as GERD, chronic cholestasis, and peptic ulcer disease. If the outcome were a continuous variable, regression would have been appropriate; since each disease is a discrete category for the model to assign, classification was used, with the list of symptoms as the features predicting the diagnosis.
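As a rough sketch of this setup, the snippet below builds toy one-hot symptom records and classifies a new patient by greatest symptom overlap (a simple nearest-neighbour stand-in for the classifier used in the project). The symptom names and records here are illustrative assumptions, not rows from the Kaggle dataset:

```python
# Toy training records: one-hot symptom vectors paired with a diagnosis label.
# These symptom names and rows are illustrative, not the actual Kaggle data.
SYMPTOMS = ["itching", "skin_rash", "stomach_pain", "acidity", "vomiting"]

train = [
    ({"itching": 1, "skin_rash": 1, "stomach_pain": 0, "acidity": 0, "vomiting": 0},
     "Fungal infection"),
    ({"itching": 0, "skin_rash": 0, "stomach_pain": 1, "acidity": 1, "vomiting": 1},
     "GERD"),
]

def predict(patient):
    """Return the diagnosis of the training row sharing the most symptoms."""
    def overlap(row):
        # Count symptoms present (value 1) in both the row and the new patient.
        return sum(1 for s in SYMPTOMS if row[s] == 1 and patient.get(s, 0) == 1)
    best_row, best_label = max(train, key=lambda rl: overlap(rl[0]))
    return best_label

print(predict({"itching": 1, "skin_rash": 1}))  # -> Fungal infection
```

In practice a library classifier (e.g. a decision tree) trained on the full one-hot matrix would replace this hand-rolled overlap rule, but the input shape, zeros and ones per symptom column with a categorical label, is the same.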

The model mislabeled the patients with IDs 0, 5, 8, 21, and 39. Patient 0 was experiencing itching, skin patchiness, a skin rash, and skin eruptions; the model predicted a drug reaction, when the actual diagnosis was a fungal infection. Common drug reaction symptoms include itchiness and effects on the skin, so the overlap between the two illnesses most likely caused the mislabel. A similar pattern appears in the other cases. Patient 5 was predicted to have chronic cholestasis but was actually experiencing peptic ulcer disease; both illnesses list abdominal pain, vomiting, and nausea among their symptoms. Patient 21 was predicted to have a drug reaction when the actual diagnosis was hepatitis C. Patient 39 was predicted to have a fungal infection, but the actual diagnosis was psoriasis; both conditions affect the skin with rashes, peeling, and inflammation. Lastly, patient 8 was predicted to be having a drug reaction but was actually experiencing gastroenteritis. The symptoms of a drug reaction vary considerably depending on the drug involved, so that diagnosis covers a very large range of symptoms, which is probably why the model overused it and incorrectly assigned it to illnesses with similar symptoms. Similarly, some illnesses, such as chronic cholestasis and peptic ulcer disease, affect the same part of the body, producing a large overlap in symptoms; the same was true of psoriasis being mislabeled as a fungal infection, since both impact the skin in similar ways.
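The mismatch inspection above can be sketched as a comparison of parallel lists of predicted and actual labels. The diagnosis values below are hypothetical placeholders, not the model's real output:

```python
# Parallel lists of predicted and actual diagnoses, indexed by patient ID.
# These values are hypothetical, standing in for the model's predictions.
predicted = ["Drug Reaction", "GERD", "Chronic cholestasis"]
actual    = ["Fungal infection", "GERD", "Peptic ulcer disease"]

# Collect every (id, predicted, actual) triple where the labels disagree.
mislabeled = [
    (pid, pred, act)
    for pid, (pred, act) in enumerate(zip(predicted, actual))
    if pred != act
]

for pid, pred, act in mislabeled:
    print(f"Patient {pid}: predicted {pred!r}, actual {act!r}")
# Patients 0 and 2 are flagged; patient 1 was classified correctly.
```

Pulling the mismatched IDs out this way is what makes the case-by-case error analysis above possible: each flagged ID can then be traced back to its symptom row to look for overlap with the predicted disease.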

The answer to the question of whether diseases can be accurately labeled based on their symptoms is: partially. The model can classify diseases from a patient's symptoms, but, as the cases above show, it can mislabel a disease when a large number of its symptoms overlap with another disease's. The model accurately predicted the diagnosis for some cases in the test dataset despite being trained only on patient symptoms and diagnoses, which suggests the approach could become more accurate in the future. The model may also be limited by factoring in only symptoms and no other patient information; certain demographics are more susceptible to certain diseases, and omitting that information could be contributing to incorrect diagnoses. Other limitations of this approach include the inability to physically examine a patient. Patients sometimes misunderstand the symptoms they are feeling and report the wrong thing to their doctor, and doctors may notice a symptom that a patient has missed. Relying solely on patient-reported symptoms, without the doctor's SOAP notes, may therefore be introducing bias into the model and is a limitation that would need to be addressed in the future.

Before any conclusion could be reached, the data had to be cleaned. Data cleaning involves searching for null values that would skew the analysis and for values that were mishandled or entered incorrectly into the dataset. The Kaggle dataset contained no null values, no mistyped values, and no corrupted or unusable columns. Because the dataset had no values that would cause an inaccurate analysis, there are no common bugs a reader is likely to encounter, and no fixes need to be implemented before the analysis can begin.
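A minimal version of that null check can be done with the standard library alone. The inline CSV text below is a stand-in for the dataset's Training.csv, assumed here to have one-hot symptom columns and a `prognosis` label column:

```python
import csv
import io

# Stand-in for Training.csv from the Kaggle dataset; column names assumed.
raw = """itching,skin_rash,stomach_pain,prognosis
1,1,0,Fungal infection
0,0,1,GERD
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Count empty/missing entries per column; a nonzero count would call for
# imputation or row-dropping before modelling.
nulls = {
    col: sum(1 for r in rows if r[col] in ("", None))
    for col in rows[0].keys()
}
print(nulls)  # every count is 0, so no cleaning fixes are needed here
```

With a dataframe library the same check is a one-liner over the loaded table, but the idea is identical: verify every column's missing-value count is zero before training.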

Github link:

https://github.com/anapetsmart/INST414/blob/main/module6.ipynb
