Health Data Science Technology Tools Part 1 (FAQ 002)

Dalton Fabian
The Data Science Pharmacist
Sep 23, 2020

My most recent (and first) post in the data science FAQ series was about the programming languages that aspiring data scientists should know to pursue a data science career. FAQ 002, split into two parts, will focus on technology tools that I would consider “programming language adjacent” (to make up a phrase). These tools are used in conjunction with SQL, R, and Python to complete the data science lifecycle.

This article covers the fundamental tools that I believe data scientists should have a working knowledge of. Part 2 will highlight skills that I would consider nice-to-have. The three items below (visualization tools, version control, and development environments) are by no means exhaustive, since the technologies available to data scientists change constantly, but they are great starting points for aspiring data scientists.

Visualization Tools

Visualization tools are the lifeblood of a successful data science team. In FAQ 001, I shared how health data scientists use SQL to gather data from the Electronic Health Record and R/Python to perform machine learning to make predictions about the future health of patients. Visualization tools are how we take the data that SQL, R, and Python provide and share the results with our end users. At UnityPoint Health, our users include people in HR, health system leadership, and patient-facing healthcare professionals. It would be much easier to put all of the data into a simple Excel file or Word document without context and hand it to our users, but that would be a horrible user experience and the quickest way to get people to stop using our work.

Instead, visualization tools help us create intuitive dashboards and reports to guide our users’ actions. In our recent care management tool project, we were able to show care managers which patients were high risk and also give them additional context about each patient’s health. Care managers wanted to enroll patients who were deemed “high risk” but had a hierarchy for which patients should be contacted first. They wanted to reach out first to patients in the hospital or patients who had recently been discharged from the emergency room or hospital. They also preferred to talk to patients in person rather than over the phone. A simple Excel dump of data would require our users to manipulate the data themselves to prioritize it in the way that they needed. In the dashboard below, we highlight the high-risk patients and show care managers the additional context that they need to determine the order in which to contact patients, starting from the top and going downward. We set up the visualization so that patients in the hospital or recently discharged (patients with red diamonds) appear at the top of the list.

Dashboard for care managers that marks patients who are in the hospital, recently left the hospital or ER, and have appointments in the coming 30 days.
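To make the prioritization idea concrete, here is a minimal Python sketch of that kind of ordering. It is not the code behind our dashboard: the field names, the risk scores, and the rule of breaking ties by descending risk are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All fields are hypothetical stand-ins for illustration.
    name: str
    risk_score: float          # assumed model output, higher means higher risk
    in_hospital: bool          # currently admitted
    recently_discharged: bool  # recently left the hospital or ER


def contact_order(patients):
    """Put flagged patients first, then sort by descending risk score."""
    def sort_key(p):
        flagged = p.in_hospital or p.recently_discharged
        # False sorts before True, so `not flagged` puts flagged patients first.
        return (not flagged, -p.risk_score)

    return sorted(patients, key=sort_key)


patients = [
    Patient("A", 0.91, False, False),
    Patient("B", 0.72, True, False),
    Patient("C", 0.85, False, True),
    Patient("D", 0.60, False, False),
]

ordered_names = [p.name for p in contact_order(patients)]
print(ordered_names)  # ['C', 'B', 'A', 'D']
```

Patients B and C land at the top because they are flagged, even though patient A has the highest risk score. This is the same reordering a user would otherwise have to do by hand in an Excel dump.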

There are a number of popular visualization platforms available. They fall into two main groups: vendor-managed platforms and open-source platforms. Vendor tools include Tableau, Microsoft’s Power BI, Qlik, and Sisense. Tools like Shiny, Plotly Dash, Seaborn, ggplot2, and others are popular open-source visualization tools. The vendor tools tend to be more popular in workplaces, but a basic understanding of the wider variety doesn’t hurt. If you’re interested in going more in-depth, I would recommend Tableau, Power BI, Shiny, and Plotly Dash as platforms to target. Companies like Tableau have robust guides on their websites, but you can also find courses on sites like Udemy and DataCamp.

Power-up: Check out resources like Storytelling With Data (book and website) to identify best practices for data visualization.

Version Control

Version control is the technology that I had the most trouble placing on the “Need to Have” to “Nice to Have” spectrum. I decided to place it in “Need to Have” because proper use of version control can make your work life much easier. If you’re unfamiliar with version control but have ever named your documents Important Document 1–1–2020 and Important Document 2–1–2020 to save old versions of your work, congrats(!), you’ve used a manual form of version control!

In programming, version control keeps a history of your code each time you save (commit) a change, so you never lose old information and can restore an old version if something goes wrong. The nice thing about version control through platforms like GitHub and Bitbucket is that you don’t need to keep multiple versions of the same file in the same folder. The old versions are stored in what’s called a repository and can be accessed through a web interface or the command line. A recent example at work where version control came in handy involved the code we use to get medications for each patient. We rewrote some of the SQL code to gather data more quickly since the old code was slow. A couple of days after we got the new code into our tool, we realized that it was not pulling in the correct information. We were able to look back at the old code, compare it to the new code, and remedy the situation. Without access to the old code, this process would not have been as easy.

Example of GitHub version control (taken from github.com) that shows the old code on the left and the new code on the right. The code written here is CSS, a web development language.
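If you want a feel for what an old-versus-new comparison looks like without setting up a repository, Python’s built-in difflib module can produce a similar view. The sketch below is only an illustration: the two queries, the table and column names, and the file names are made up, and they are not the actual medication code from the story above.

```python
import difflib

# Hypothetical "old" version of a medication query.
old_sql = """SELECT patient_id, medication_name
FROM medications
WHERE status = 'active'
  AND end_date IS NULL
"""

# Hypothetical "new" version that accidentally dropped a filter.
new_sql = """SELECT patient_id, medication_name
FROM medications
WHERE status = 'active'
"""

# unified_diff marks removed lines with "-" and added lines with "+".
diff_lines = list(
    difflib.unified_diff(
        old_sql.splitlines(),
        new_sql.splitlines(),
        fromfile="medications_old.sql",
        tofile="medications_new.sql",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

The line starting with a minus sign in the output is the filter that went missing in the rewrite. A version control platform shows the same kind of information, just with a friendlier interface and the full history of every change.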

If you’re interested in learning more about version control, both Bitbucket and GitHub have resources on their websites.

Power-up: any file with code can be version controlled, no matter the amount of code. Try to use version control even if you’re just starting your programming journey.

Development Environments

Development environments are the easiest of the “Need to Have” items and mostly come down to personal preference. When I refer to a development environment, I am referring to the tools that you code in. These environments are also frequently called text editors or Integrated Development Environments (IDEs). I tend to prefer IDEs because they give you a full suite of features to quickly begin coding with your preferred language. A number of IDEs support multiple languages but I will highlight the most popular IDEs for each of the data science languages from FAQ 001.

  • SQL: SQL Server Management Studio (SSMS)
  • R: RStudio (by far the most popular), Jupyter Notebook/JupyterLab*
  • Python: VS Code, PyCharm, Spyder, Jupyter Notebook/JupyterLab*

The IDE is the primary place that you will write code as a data scientist. IDEs also allow you to run your code to see the output to make sure it’s coded properly. Most IDEs tend to be fairly intuitive and should be easy to pick up as you start programming.

A screenshot of RStudio, the most popular IDE for R, showing the coding window (upper left), output console (lower left), variable tracker (upper right), and file management/help section (lower right).

Power-up: Once you’ve been able to get comfortable with the IDE you like the most, try to research the terms “linting” and “debugging” to see how your IDE uses these tools.

* Jupyter Notebooks and JupyterLab are not normally labeled as IDEs but are a popular environment and exist in the gray area between a text editor and IDE.

Wrap-up

In Part 1 of FAQ 002, I have shared the technology tools that I believe are great starting points for aspiring data scientists. These tools help to complement the programming languages covered in FAQ 001. Programming languages will help you get the data needed for your data science work but visualization tools, version control, and development environments will add to your success.

Visualization tools are responsible for taking your data and presenting it to users in an intuitive, engaging format. We talked about popular vendor platforms and open-source platforms. Version control helps you manage code and provides a safeguard that lets you revert to old code if a problem arises. Development environments make your coding journey easier by providing everything you need to start coding.

In Part 2 of FAQ 002, I will cover some of the “Nice to Have” technology tools that will add even more benefit on top of the technologies discussed in this article. Stay tuned for the release of Part 2.

Happy coding! ✌️

Note: If you’re interested in more health data science content, make sure to check out the other articles in the Health Data Science FAQ series by visiting my FAQ Central page.

Dalton Fabian
The Data Science Pharmacist

I’m a pharmacist turned data science professional who is passionate about helping clinicians and health system leaders to take better care of patients.