Bahmni is an open-source EMR (electronic medical record) system that has been deployed at a number of hospitals in rural, low-resource settings.
In this post we explore anonymised Bahmni EMR data to see whether data collected over a period of time tells a story. Storytelling with data is a fascinating subject in itself. If you would like to explore this topic further, you can refer to this book to get started.
For any analytics work, it is important to spend time understanding the data before jumping into descriptive or predictive analytics. Approximately 75% of the effort in any data analytics work goes into understanding, cleaning and preparing the data. This post focuses on the data-understanding part.
- Look at patient trends across various regions.
- Explore the top 10 diagnoses reported.
- Analyse the most reported diagnosis further for additional insights.
- Explore the observations/results reported for these patients.
- Take a quick peek at other insights that can be derived from this EMR data.
Watch the Videos
- Approximately 65% of patients in the top 10 cities are female.
- Gastritis is the most reported diagnosis.
- The male-to-female ratio for gastritis is close to 1:3.
- Between 2011 and 2016, the years 2013 and 2014 had the highest number of reported gastritis cases.
- On average, April through June or July of each year sees more gastritis cases than the remaining months.
- Haemoglobin levels show no evident difference between males and females.
- Spelling mistakes, or different ways of spelling the same thing, in free-text fields like city/village need to be corrected before those values are used in the overall patient count per city/village.
- The same diagnosis is often repeated multiple times for a single patient. The inability to mark a diagnosis as chronic vs acute, or as follow-up vs new, makes it challenging to include such entries in the overall count for the top 10 diagnoses.
- Most lab results need clinical correlation before any patterns or trends can be established. The validity of a lab result also varies from case to case. Hence it is difficult to decide which lab results (across various visits) should be considered when analysing trends for a particular diagnosis.
- For chronic illnesses like diabetes there is a lack of consistent data such as A1C levels, which makes it difficult to observe any particular trend. Hence we may have to look at mathematical ways of estimating average blood sugar levels from fasting and postprandial values.
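
The first two challenges above (inconsistent city/village spellings and repeated diagnoses) lend themselves to a small code sketch. The references for this post point to R; the Python below is only an illustration under stated assumptions, not the code used for the analysis. The function names, the similarity cutoff of 0.8, the list of known cities and the sample records are all hypothetical.

```python
from collections import Counter
from difflib import get_close_matches


def normalise_city(raw, known_cities, cutoff=0.8):
    """Map a free-text city/village value to a canonical spelling.

    Returns the closest entry in known_cities, or the cleaned input
    unchanged when nothing is similar enough (cutoff is an assumption).
    """
    cleaned = " ".join(raw.strip().split()).title()
    matches = get_close_matches(cleaned, known_cities, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


def diagnosis_counts(records):
    """Count each diagnosis at most once per patient.

    records is an iterable of (patient_id, diagnosis) pairs.
    """
    unique_pairs = {(pid, dx.strip().lower()) for pid, dx in records}
    return Counter(dx for _, dx in unique_pairs)


# Hypothetical sample data, not taken from the Bahmni dataset.
known = ["Rampur", "Sitapur"]
raw_cities = ["rampur", "Rampurr", " Sitapur ", "Korba"]
city_counts = Counter(normalise_city(c, known) for c in raw_cities)

visits = [(1, "Gastritis"), (1, "gastritis"), (2, "Gastritis"), (2, "Anaemia")]
dx_counts = diagnosis_counts(visits)

print(city_counts)
print(dx_counts)
```

Fuzzy matching of this kind can merge two genuinely different villages with similar names, so the resulting mapping would still need a manual review before being used for patient counts.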
- Better understanding of the domain and the data: for example, exploring data such as food habits and lifestyle might give us more insight into why more cases of gastritis are reported among females.
- Data cleaning and preparation: check the consistency and quality of the data and take corrective steps, such as fixing misspelled city/village names, treating missing values, and defining an outlier detection and replacement strategy.
- Take a stab at descriptive statistics, measures of central tendency, skewness, hypothesis testing.
- Feature transformation: try to extract new features, such as average sugar levels derived from fasting and postprandial blood sugar levels. See if we can make use of the ICD chronic indicator to classify diagnoses as chronic vs acute. Bin variables such as age into groups like infant, youth, adult, etc.
- Natural language processing (NLP) for fields such as chief complaints, which can hold multiple comma-delimited values.
- Move on to the predictive space to see if we can cluster patients for a particular diagnosis, or help classify diabetic patients as Type I or Type II based on the data available, without having to perform any additional tests on them.
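
Two of the steps above, binning age and splitting comma-delimited chief complaints, can be sketched as follows. This is a minimal Python illustration, not part of the original analysis. The age cut-offs, the extra labels beyond infant/youth/adult, the function names and the sample values are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cut-offs in years; the post does not define the age groups.
AGE_EDGES = [2, 13, 25, 60]
AGE_LABELS = ["infant", "child", "youth", "adult", "senior"]


def age_group(age_years):
    """Return the label of the bin that age_years falls into."""
    if age_years < 0:
        raise ValueError("age cannot be negative")
    return AGE_LABELS[bisect_right(AGE_EDGES, age_years)]


def split_complaints(text):
    """Split a comma-delimited chief-complaint field into clean tokens."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]


# Hypothetical sample values for illustration only.
groups = [age_group(a) for a in (1, 8, 19, 40, 72)]
complaints = Counter(
    token
    for field in ["Fever, Cough", "cough,  headache ,", "Fever"]
    for token in split_complaints(field)
)

print(groups)
print(complaints)
```

Splitting on commas is only a first pass. Free-text complaints would likely also need spelling normalisation and synonym handling before the counts are meaningful.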
- Installing R & RStudio: https://www.rstudio.com/products/rstudio/
- Installing MySQL: https://dev.mysql.com/doc/refman/5.6/en/osx-installation-pkg.html
- R-Blogs: https://www.r-bloggers.com/
- Source Code: https://github.com/karrtikiyer-tw/bahmni-eda
- Presentation on SlideShare.