Eye Disease Analysis

Anapeshku
INST414: Data Science Techniques
6 min read · Apr 15, 2024

Data science and network analysis are applicable to many disciplines outside of the technology field. A growing use for these techniques is in the medical field, where patient cases are studied for overlapping attributes and patterns that can help healthcare professionals better diagnose their patients. An earlier diagnosis can be crucial for treating an illness before it develops and has more severe consequences. One specialization that could make use of this type of analysis is ophthalmology, the study of the eye. A question that can be answered by examining eye disease information is which diseases patients are most susceptible to if both of their eyes initially had a normal fundus. Stakeholders who could benefit from answering this question include ophthalmologists and patients who are prone to eye diseases. Due to factors such as genetic predisposition or lifestyle choices, some people may be very vulnerable to diseases such as retinal detachment, vitreous degeneration, epiretinal membrane, etc. Some patients may assume that because they currently have a normal fundus, they do not need to take any preventative care for their eyes. By finding the most frequently developed eye diseases, both patients and doctors can make more informed preventative decisions. Doctors can also learn which diseases are most common and become more aware of the earlier signs, making them less likely to miss any disease indicators.

The data needed for this question includes the diseases found in each eye, patient ID numbers, and ideally some patient demographic information. A necessary distinction for this question is between left and right eye diseases. Some patients may only have a condition, such as retinal detachment, in one of their eyes. Simply listing them as positive for retinal detachment when it is not present in both eyes would cause issues, especially when analyzing how frequently these disorders develop. If the disorder appears in both eyes, it needs to be listed twice; if it appears in only one, it is listed once. Another important aspect of this question is comparing which diseases patients are susceptible to when one eye has a normal fundus against which diseases show up when both eyes have an abnormality. The dataset was found on Kaggle and includes the per-eye diagnoses and patient ID numbers. Kaggle describes this data as being specifically useful for machine learning algorithms that attempt to predict eye diseases in humans. The dataset was downloadable in CSV format, which was used to conduct the analysis. Along with the left and right eye diagnoses, the dataset also includes many photos of the eye diseases. The photos were not taken into account in the analysis because the question centers on the frequency of eye diseases and on which eye diseases tend to be comorbid with others. The dataset does not include patient demographics. That information is not strictly necessary to answer the question, but it could have provided additional context and potentially fueled future research.
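The per-eye counting rule described above can be sketched in a few lines. This is only an illustration under assumptions: the `patients` records are made-up toy data, and the tuple layout and variable names are hypothetical rather than taken from the Kaggle file.

```python
from collections import Counter

# Hypothetical toy records: (patient_id, left-eye diagnoses, right-eye diagnoses).
patients = [
    (1, ["retinal detachment"], ["normal fundus"]),
    (2, ["cataract"], ["cataract"]),
    (3, ["normal fundus"], ["normal fundus"]),
]

NORMAL = "normal fundus"

eye_counts = Counter()        # one count per affected eye
with_normal_eye = Counter()   # diseases seen when at least one eye is normal
both_abnormal = Counter()     # diseases seen when neither eye is normal

for _patient_id, left, right in patients:
    # A disorder present in both eyes is counted twice, in one eye once.
    eye_counts.update(left)
    eye_counts.update(right)

    diseases = [dx for dx in left + right if dx != NORMAL]
    if NORMAL in left or NORMAL in right:
        with_normal_eye.update(diseases)
    else:
        both_abnormal.update(diseases)

print(eye_counts)
print(with_normal_eye)
print(both_abnormal)
```

Keeping the two eyes as separate lists is what makes the "listed twice" rule fall out naturally: the counter is simply updated once per eye.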

Euclidean distance was used to measure similarity. Euclidean distance is the length of the straight line connecting two points. The points in this scenario were the eye-related diagnoses that patients were given. To compute this measurement, the dataset was converted from the original CSV format to a dictionary. Each patient identification number was a key, and the values of the dictionary were the diseases/descriptions of each eye. Pandas was used to transform the data from a dictionary to a data frame. Once the data was in this format and the rows were normalized, scikit-learn (sklearn) was imported. Scikit-learn is a commonly used library in data science. It is very versatile and can be used for classification, regression, clustering, etc. In this case, the library was imported and the spatial distance method for Euclidean distance was run. The code segment centering around the implementation of scikit-learn is pasted below:

All of the values in the dataset were compared to the “normal fundus” diagnosis. Measured by left eye diagnosis, right eye diagnosis, and both eye diagnoses, the top ten diagnoses most similar to “normal fundus” are shown in the screenshot below:

This result means that if someone starts off with a normal fundus in their eye, they have the highest chance of developing moderate non proliferative retinopathy, followed by mild nonproliferative retinopathy, cataract, macular epiretinal membrane, lens dust, glaucoma, epiretinal membrane, pathological myopia, dry age-related macular degeneration, and drusen. These results also help illustrate the severity of the diseases listed, because a patient can still receive these diagnoses while having a normal fundus. By this measure, moderate non proliferative retinopathy would be less severe than drusen because it is the closest in similarity to a normal fundus. Even drusen would be less severe than the illnesses that are in the dataset but absent from the list above, because it is closer in similarity to the normal fundus than they are.
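The distance-based ranking described in this analysis can be sketched as follows. This is a minimal illustration under assumptions, not the code from the linked notebook: the `patient_diagnoses` dictionary is made-up toy data, the diagnosis-by-patient matrix layout and the helper `rank_by_distance` are hypothetical, and `normalize` and `euclidean_distances` are scikit-learn functions chosen here as one plausible way to do the normalization and distance steps.

```python
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import normalize

# Hypothetical toy data: patient ID -> diagnoses recorded across both eyes.
patient_diagnoses = {
    1: ["normal fundus", "cataract"],
    2: ["normal fundus", "cataract"],
    3: ["normal fundus", "glaucoma"],
    4: ["glaucoma", "glaucoma"],
    5: ["drusen", "glaucoma"],
}

# Dictionary -> long data frame -> diagnosis-by-patient count matrix.
records = [(pid, dx) for pid, dxs in patient_diagnoses.items() for dx in dxs]
long_df = pd.DataFrame(records, columns=["patient_id", "diagnosis"])
matrix = pd.crosstab(long_df["diagnosis"], long_df["patient_id"])

# Scale every diagnosis row to unit length so common and rare
# diagnoses are compared by pattern rather than by raw frequency.
unit_rows = normalize(matrix.to_numpy(dtype=float), norm="l2")


def rank_by_distance(target):
    """Return the other diagnoses sorted by Euclidean distance to `target`."""
    target_idx = matrix.index.get_loc(target)
    distances = euclidean_distances(unit_rows[[target_idx]], unit_rows)[0]
    ranked = pd.Series(distances, index=matrix.index).drop(target)
    return ranked.sort_values()


ranking = rank_by_distance("normal fundus")
print(ranking.head(10))  # smallest distance = most similar
```

In this toy data, cataract ranks closest to "normal fundus" because it occurs in the same patients, while drusen ranks farthest because it never co-occurs with a normal fundus.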

In order to achieve this result, the data needed to be cleaned. This was done by discarding all of the columns related to eye photos. While photos of the diseased eyes are beneficial when diagnosing a patient, in an analysis of how frequently these diagnoses occur they only overcomplicate the data frame and do not offer relevant information. Additional cleaning steps included stripping unnecessary spacing from each line, so that only the text of the diagnosis was stored in the column, and filling any rows without values with NA. Also, the commas used throughout the file were not a consistent character: some diagnoses were separated by a visually similar character rather than the standard “,”, so those separators had to be replaced with the standard comma before each list of diagnoses could be split and its items used as values in the data frame. This last step is where someone is most likely to encounter bugs. It is very easy to overlook that the look-alike character is not the common “,” used throughout the majority of the file, and failing to split on it would produce inaccurate results because the different diagnoses would not be separated properly. Debugging this issue is simple once someone familiarizes themselves with the dataset and notices the inconsistency. The uncommon comma character needs to be replaced with the regular comma using the .replace() method so that the .split() method works across the whole file instead of only select rows.
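A minimal sketch of these cleaning steps is shown here, under stated assumptions. The column names (`left_photo`, `left_diagnosis`, and so on) and the sample rows are hypothetical, and the look-alike separator is assumed to be the full-width comma (U+FF0C), since the text does not name the exact character.

```python
import pandas as pd

# Assumption: the look-alike separator is the full-width comma.
FULLWIDTH_COMMA = "\uff0c"

# Hypothetical toy rows standing in for the CSV contents.
raw = pd.DataFrame({
    "patient_id": [0, 1, 2],
    "left_photo": ["0_left.jpg", "1_left.jpg", "2_left.jpg"],
    "right_photo": ["0_right.jpg", "1_right.jpg", "2_right.jpg"],
    "left_diagnosis": [" cataract ", "normal fundus", None],
    "right_diagnosis": [
        "normal fundus",
        "drusen" + FULLWIDTH_COMMA + "glaucoma ",
        "cataract, lens dust",
    ],
})


def clean_keywords(cell):
    """Turn one raw diagnosis cell into a list of trimmed diagnoses."""
    if pd.isna(cell):
        return ["NA"]  # placeholder for rows without a value
    unified = cell.replace(FULLWIDTH_COMMA, ",")  # one separator everywhere
    return [part.strip() for part in unified.split(",") if part.strip()]


# Drop the photo columns, then clean both diagnosis columns.
cleaned = raw.drop(columns=["left_photo", "right_photo"])
for col in ["left_diagnosis", "right_diagnosis"]:
    cleaned[col] = cleaned[col].apply(clean_keywords)

print(cleaned)
```

Replacing the separator before calling `.split()` is the important ordering: without it, the second row's right-eye cell would stay a single unsplit string containing both diagnoses.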

A limitation of this analysis is that it fails to take into account other factors that could lead to eye diseases. The similarities are based only on which patients have specific diseases in one or both eyes. Patients may be similar in other ways that are not captured in the dataset but that could factor into their likelihood of developing certain diseases. These include a family history that creates a predisposition for eye disease and lifestyle factors that affect the eyes. Because there is no way to know which of these factors contributed to a patient's diagnosis, it is difficult to determine how much they would change the results.

GitHub Code Link: https://github.com/anapetsmart/INST414/blob/main/module_3.ipynb
