Health Data Science Technology Tools Part 2 (FAQ 002)

Dalton Fabian
The Data Science Pharmacist
6 min read · Oct 2, 2020
Photo by Taylor Vick on Unsplash

Part 1 of FAQ 002 covered some of the technologies and tools that I consider necessary for aspiring data scientists. Part 2 will take a different approach and highlight two technologies that are “nice-to-have” and will make you stand out as a data scientist. In full disclosure, these are two items that even I haven’t mastered yet, being only a year into my data science career. But the problems my data science team has grappled with have shown me how these technologies would make our work easier. That’s the perspective I’ll take in this article.

Cloud Computing Platforms

Cloud computing platforms are all the rage in today’s technology landscape. Cloud computing allows data scientists to run code on cutting-edge hardware that they don’t have access to on their own laptop or at their employer. It also allows organizations to scale their technology needs up and down on a whim. Previously, organizations had to buy their own servers, maintain them, and upgrade every so often to make sure their IT applications and programs would run efficiently. This required a lot of coordination and did not allow organizations to be nimble. With cloud computing, the hardware and servers are managed by technology companies. In this setup, IT personnel and data scientists can use advanced technologies for short periods of time without needing to buy their own new equipment.

One example of the benefits of cloud computing services for data scientists is GPU access. GPUs have traditionally been most common in video gaming. Just as the processor (CPU) in your computer or laptop is suited to running applications like Word or Excel, a GPU is ideal for rendering images on a screen. In data science, GPUs are ideal for running complicated machine learning algorithms called neural nets, which require a substantial amount of mathematical calculation. With cloud computing, data scientists can essentially “rent” the use of a GPU for short periods of time without having to go out and have someone buy one. Our data science team does not have access to a GPU in our current setup. We could run more sophisticated algorithms like neural nets with our current technology, but the process would be incredibly slow. From real-life experience, think about a model that took 4 days to train on 10,000 patients when we would actually need to run it on 300,000 patients. Not going to happen. We are currently exploring access to cloud computing environments, and GPUs could be one of the tools that cloud computing gives us access to.
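
As a minimal sketch of what this looks like in practice (assuming the PyTorch library, which is one common choice for neural nets), the Python snippet below checks whether a GPU is available and uses it if so. The same code runs unchanged on a laptop CPU or on a rented cloud GPU; the layer sizes and data here are made-up placeholders.

```python
import torch
import torch.nn as nn

# Use a GPU if one is available (e.g., on a rented cloud instance),
# otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# A tiny example neural net (hypothetical layer sizes).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
).to(device)

# Input data must live on the same device as the model.
features = torch.randn(128, 20).to(device)
predictions = model(features)
print(predictions.shape)
```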

There are a few major players in the cloud computing arena: Amazon (AWS), Microsoft (Azure), and Google (Google Cloud Platform). On each company’s platform, you’ll find ways to host databases, run code to analyze data, store the results of your analysis, and serve up data to users. Most programming and data-related tasks are achievable in a cloud environment.
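
As one small example of the “store the results of your analysis” piece: assuming you were on AWS, the boto3 Python library can push a results file into cloud storage. This is only a sketch; the bucket and file names below are hypothetical.

```python
import boto3

# Create an S3 client (credentials come from your AWS configuration).
s3 = boto3.client("s3")

# Upload a local results file to a cloud storage bucket.
# "my-analysis-results" and "model_scores.csv" are hypothetical names.
s3.upload_file(
    Filename="model_scores.csv",
    Bucket="my-analysis-results",
    Key="2020-10/model_scores.csv",
)
print("Results uploaded to cloud storage")
```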

Azure (top left), Google Cloud (top right) and AWS (bottom) make up the major players in the cloud computing ecosystem

Luckily, each of the cloud providers offers a number of resources, including courses, on their websites. These courses will walk you through everything from setting up a cloud account to running their most popular services. You can also use these resources as an individual for your own side projects or portfolio projects; no corporate affiliation required!

Power-up: Host your personal projects on a cloud computing platform for future employers to see your cloud computing skills!

Containers

Containers are an emerging technology in data science but are much more popular in software engineering. The idea behind a container is to allow code to run the same way no matter what computer it is running on. As you get further into learning to program, you’ll become more familiar with the concept of package management. In R and Python, there are pieces of software called libraries (R) and packages (Python), which are pre-written code that others have made available for anyone to use.

Libraries and packages can be thought of loosely as recipes. It’s much easier to follow a recipe that lists the amount of each ingredient you need than to figure it out for yourself as you go. Maybe you’re a master chef, but I presume that most people reading this article are not. Packages and libraries offer a quicker way to program, much like a recipe gets you to dinner faster.
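
To make the recipe analogy concrete, here is a tiny Python sketch using the pandas package as one example of pre-written code: reading a data file and summarizing a column takes two lines instead of the many lines you would write from scratch. The file and column names are hypothetical.

```python
import pandas as pd

# pandas is pre-written code someone else maintains: reading a CSV and
# summarizing it takes two lines ("patients.csv" and "age" are placeholders).
patients = pd.read_csv("patients.csv")
print(patients["age"].mean())
```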

As technologies mature, libraries and packages are updated by the teams that created them. This might mean adding or removing features in each release of a package. That can become a problem if the code you wrote depends on a piece of code in a Python package that was removed in the latest update. If you try to run your code after you update the package, it will no longer work. Containers come in handy here by letting you specify exactly which version of the package should be used to run your code, no matter what the most recent release contains. This way, you can tell the container that you want to run version 1.2.3 of the package instead of version 1.2.4.
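
As a minimal sketch of what that pinning looks like, here is a hypothetical Dockerfile that bakes exact Python package versions into a container. The script name is a placeholder, and the pinned versions are just examples of how you lock a release in place.

```dockerfile
# Start from a fixed Python version so everyone builds the same base environment.
FROM python:3.8-slim

# Pin exact package versions so a new release (e.g., 1.2.4 replacing 1.2.3)
# can't silently break the code. These versions are illustrative.
RUN pip install pandas==1.1.2 scikit-learn==0.23.2

# Copy in the analysis script (hypothetical file name) and run it.
COPY analysis.py /app/analysis.py
CMD ["python", "/app/analysis.py"]
```

Anyone on the team can then run `docker build` and `docker run` against this file and get the same environment, regardless of what is installed on their own laptop.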

From personal experience, containers also help when you’re working on a team with other data scientists. Depending on how you update your system, you and a colleague may end up with different versions of many libraries and packages. This means that code that runs on your colleague’s computer might not run on yours. Using containers can help prevent this problem by specifying which versions should be used. My team worked on a number of COVID-19-related projects this spring. One of our projects was a report that provided key metrics for health system leaders to digest. The report was generated from an R script. On days that my teammate was out of the office, I had to generate this report. A number of times, I would run the report and have chunks of code fail. This would have been a great time for us to have containers managing our environments to prevent those errors, which were likely the result of different R environments on our computers.

Kubernetes and Docker offer services in the world of containerization

Popular services that offer containerization are Kubernetes and Docker. Docker provides containerization, and Kubernetes is a service for managing containers (including Docker containers). You can learn more about these tools on their websites and in the guides they provide.

Power-up: learning Docker and Kubernetes is a power-up in itself because it will save you a lot of code-running headaches in the future!

Wrap Up

This article highlighted two technologies that I think will set you apart as a data scientist. The first was cloud computing. My team is in the early stages of looking at cloud computing and how it could improve our workflow and tools. The second was containers, which let you run code in the same environment it was developed in, reducing the likelihood that package differences across computers will cause your code to error.

If you’re interested in learning more about data science as a career, give me a follow and check out my other articles in this FAQ series. I’m looking forward to continuing to share my passion for my non-traditional career as a data science pharmacist!

Dalton Fabian
The Data Science Pharmacist

I’m a pharmacist turned data science professional who is passionate about helping clinicians and health system leaders to take better care of patients.