Healthcare Machine Learning: Getting Started
If you’re in healthcare, you have no doubt heard about all the interesting new projects that involve machine learning. You may want to gain a basic understanding of healthcare machine learning, or you may want to actually create machine learning models. If you’re in the latter group, this guide is for you. If you’re simply looking to gain a basic understanding, check out this introduction.
We (Pierre Elias and Jenine John) start out with some pointers based on our experience as cardiologists who have gotten into data science, and then provide a compilation of some of the best resources we found over the past few years. It is far from comprehensive and we welcome additions.
Accessing Data and GPUs
If you’d like to engage in healthcare ML research, you’ll want to be in a group that will provide you access to guidance, data, and GPUs (graphics processing units, which handle the highly parallel arithmetic of model training much faster than CPUs). While you’re learning, though, you can start off with publicly available datasets (like those on Kaggle or PhysioNet) and use free GPU resources like Google Colab (AWS also provides free GPU use, but it can be tricky to avoid accidental charges).
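Once you have an environment such as Colab or a group server, a quick way to confirm that a GPU is actually visible is to ask your framework. The snippet below is a minimal sketch that assumes PyTorch is the framework in use; the function name `describe_compute_device` is made up for this guide, and the check is skipped gracefully if PyTorch is not installed.

```python
def describe_compute_device():
    """Return a short message describing the hardware PyTorch would use."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return "PyTorch is not installed, so no GPU check was run."
    if torch.cuda.is_available():
        # Device 0 is the first GPU that PyTorch can see.
        return f"GPU available: {torch.cuda.get_device_name(0)}"
    return "No GPU detected; PyTorch would fall back to the CPU."


print(describe_compute_device())
```

If this reports no GPU on a hosted notebook service, the runtime may simply not have a GPU enabled in its settings.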
Picking a Language and ML framework
If you’re in a group, it’s best to learn the language and ML framework your group is using.
Python is the language commonly used for machine learning. Some groups use R instead.
Two of the most commonly used machine learning frameworks are PyTorch and TensorFlow 2, which incorporates Keras. PyTorch tends to be favored for research because it is flexible, whereas TensorFlow tends to be favored for models that will be deployed. Also, PyTorch’s commands are closer to Python and may therefore be more intuitive, but TensorFlow’s commands are a little simpler. You can’t go wrong with either one as your first framework, and you can learn another framework later.
Recommended Resources for Coding
Coding is a skill that is learned best through practice. Try out multiple resources to see which ones work best for you. Once you choose a resource, you should also consider whether it is worth completing it to the end, because some of the later portions may not be relevant to your needs.
Jump into applying the skills to real data as early as possible. It is one thing to solve homework exercises and an entirely different one to tackle a problem with real data. The sooner you get to building things for yourself, the sooner it will begin to click. When you need to figure out how to do something, you’ll know which resources to turn to, or you’ll search online (Stack Overflow is incredibly useful). Jupyter notebooks are very helpful as you’re learning: they let you check your code carefully as you go along so you can avoid unexpected behaviors.
Python Basics
Jupyter notebooks are great for running short pieces of Python code and seeing the results. When writing scripts, you’ll want to use an Integrated Development Environment (IDE) or code editor. PyCharm Pro is an IDE that is free with a .edu e-mail address, and Sublime Text is a great code editor. If your group has set up a remote server, PyCharm can be used when working on the server (be careful with the settings, because it’s possible to accidentally delete the server’s data when syncing). Alternatively, Sublime Text can be used in combination with a file syncing tool like Forklift or Cyberduck to work on a remote server.
Here are some courses to get you started with Python — try some out and see which ones you like.
- Codecademy — Learn Python
While you will almost certainly code in Python 3 rather than Python 2, the differences are minimal. The Python 2 version is completely free while the Python 3 version has a free trial. This is a great interactive way to start learning Python.
- YouTube — Corey Schafer videos on Python
Excellent set of practical videos that includes topics like setting up Python and importing modules.
- Coursera — Python for Everybody
A good basic introduction to Python. The PDF of the accompanying book and other materials are available here.
- Hitchhiker’s Guide to Python
A resource for understanding some core topics better, such as organizing scripts and logging.
Python for Data Science
- Pandas tutorials
You’ll likely use the Pandas library, which handles tabular data in Python. The official Getting Started tutorials are a great resource.
- Coursera — Introduction to Data Science in Python
This course covers using the Pandas and NumPy libraries for data science in Python.
- Cookiecutter Data Science
Provides a template for organizing data science projects.
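To give a feel for what an organized project looks like, here is a minimal sketch that creates a small folder skeleton using only the Python standard library. The folder names are illustrative assumptions loosely in the spirit of such templates, not the exact layout that Cookiecutter Data Science generates, and `create_project` is a hypothetical helper written for this guide.

```python
import tempfile
from pathlib import Path

# Illustrative folder names (assumptions, not the template's exact layout).
PROJECT_FOLDERS = [
    "data/raw",         # original data, never edited by hand
    "data/processed",   # cleaned data that is ready for modeling
    "notebooks",        # exploratory Jupyter notebooks
    "src",              # reusable Python code
    "models",           # trained model files
    "reports/figures",  # generated plots
]


def create_project(root):
    """Create the folder skeleton under root and return the folders found."""
    root = Path(root)
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    # List every directory relative to the project root, in sorted order.
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir()
    )


if __name__ == "__main__":
    # A temporary directory keeps this demonstration from touching real files.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(Path(tmp) / "ecg_project"):
            print(folder)
```

Keeping raw data, processed data, notebooks, and code in separate folders makes it easier to rerun an analysis from scratch and to keep private data out of version control.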
Shell
One of the first things you should do is get comfortable with the command-line shell, which will enable you to do things like run your scripts. There are lots of online resources about this, such as this page.
GitHub
Git is an important tool used for versioning (tracking changes to scripts so that you can collaborate with others or go back to prior versions if needed). The website GitHub is commonly used to host Git repositories. Git can be a little tricky to get the hang of. There are lots of basic GitHub introductions online. If you’re looking for a detailed introduction, this YouTube playlist explains it well.
It’s a good idea to store private data and Jupyter notebooks in separate directories from git repositories so that you don’t accidentally upload sensitive information. An alternative is to list those files in a .gitignore file so that git leaves them out of commits.
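As an extra safety net, you can scan a list of file paths for anything that looks like data before you commit. The sketch below is one possible approach under stated assumptions: the patterns in `SENSITIVE_PATTERNS` and the function `flag_sensitive` are hypothetical, and the matching uses Python's `fnmatch` wildcards, which are similar to but not the same as git's own .gitignore rules.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns; adjust to whatever your group treats as sensitive.
SENSITIVE_PATTERNS = ["*.csv", "*.xlsx", "*.dcm", "*.ipynb", "data/*"]


def flag_sensitive(paths, patterns=SENSITIVE_PATTERNS):
    """Return the paths that match at least one sensitive pattern."""
    flagged = []
    for path in paths:
        posix_path = PurePosixPath(path).as_posix()
        # Unlike shell globbing, fnmatch's "*" also matches "/" characters.
        if any(fnmatchcase(posix_path, pattern) for pattern in patterns):
            flagged.append(path)
    return flagged


candidates = ["train.py", "data/patients.csv", "notebooks/explore.ipynb", "README.md"]
print(flag_sensitive(candidates))
```

Here the CSV file and the notebook are flagged, while the script and the README pass. A check like this complements careful directory organization and .gitignore rather than replacing them.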
Tmux (optional)
Tmux is a handy tool that allows processes (including Jupyter notebooks) to continue on a remote server even if you get disconnected. It also allows multiple processes to be run at the same time. This is a brief tutorial you can turn to if you’d like to use tmux.
Other
- MIT CSAIL — Missing Semester videos
These videos cover topics such as the shell and git. They may be helpful to fill in gaps in your knowledge.
Recommended Resources for Machine Learning
Focus first on gaining a conceptual understanding of machine learning, then create some basic models. Once you have the basics down, you can supplement with additional resources to flesh out your knowledge. The resources with asterisks are especially good.
Machine Learning Techniques
- *Coursera — Machine Learning Specialization by Andrew Ng
Highly recommended introduction to machine learning. It gets pretty detailed. If you get stuck, you can look on GitHub for how others approached a problem.
Here are videos of the more detailed full Stanford course, and here is its accompanying website.
- *Coursera — Deep Learning Specialization by Andrew Ng
Very detailed course on deep learning. If you go through it carefully, it will probably take a couple of weeks of full-time effort, and you’ll come out with a good understanding of deep learning concepts.
Here are videos of the full Stanford version of the course, and here is its accompanying website.
- *Book — Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron
Excellent book that explains both basic concepts of machine learning and practical aspects of programming. Whether you are using TensorFlow or PyTorch, you will likely benefit from this book.
- *YouTube — 3Blue1Brown Deep Learning videos
We cannot recommend 3Blue1Brown’s videos highly enough. They have phenomenal visualizations that help you understand what neural networks are doing.
- *Book — An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani
Along with its more detailed counterpart, The Elements of Statistical Learning, this is a classic book on the concepts of machine learning. The PDF of the latest edition, which came out in 2021, is available on the website.
- Fast.AI
Popular course that guides those with some coding experience to create machine learning models in a straightforward way.
- PyTorch tutorials and TensorFlow tutorials
The official tutorials for PyTorch and TensorFlow are very useful.
For PyTorch, once you get a handle on the basics, consider exploring PyTorch Lightning, a library that automates some of the frequently repeated PyTorch code.
- Google — Machine Learning Crash Course
Course that was made for Google engineers who want to learn more about ML. It’s very good and not too long, and is broken up into clear sections. The examples use TensorFlow.
- Neuromatch Academy
Course that teaches deep learning to neuroscientists. Uses PyTorch.
- Implementation Example — Tutorial for Building a Leaf Classification App
Guides you through the steps of using a pretrained computer vision model.
Machine Learning for Healthcare
- Coursera — AI in Healthcare Specialization
Excellent introduction to the concepts that are important when applying machine learning to healthcare. A must-watch for those doing healthcare ML.
- Coursera — AI for Medicine Specialization
Great continuation of Andrew Ng’s DeepLearning.AI work, run by one of his former PhD students.
- Getting Started in Medical AI
A list of healthcare AI resources similar to this one, focused on radiology.
Math for Healthcare Machine Learning
- YouTube — 3Blue1Brown Linear Algebra videos
Another set of 3Blue1Brown’s excellent videos. Provides a conceptual understanding of linear algebra through visualizations.
- Coursera — Math for Machine Learning by Imperial College London
This course focuses on the math skills relevant for machine learning.
- Book — Intuitive Biostatistics by Harvey Motulsky
Not specifically for machine learning, but does a great job of explaining biostatistical concepts that often trip people up, like interpreting p-values. It’s light on mathematics and focuses more on concepts.
- Book — Epidemiology: An Introduction by Kenneth Rothman
Learning epidemiology concepts is also very helpful, as it helps with understanding the factors that come into play when considering whether a model trained on one dataset can be used in other settings. This is one of the commonly used introductions to epidemiology, and a couple of others are Gordis Epidemiology and Essentials of Epidemiology in Public Health.
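One epidemiology concept that bears directly on moving a model between settings is that positive predictive value (PPV) depends on disease prevalence, even when sensitivity and specificity stay the same. The sketch below applies the standard formula, PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). The sensitivity, specificity, and prevalence values are made-up numbers for illustration, not results from any real model.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive prediction is a true positive."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical model with 90% sensitivity and 90% specificity.
# Referral clinic where 20% of patients have the disease:
print(f"{positive_predictive_value(0.90, 0.90, 0.20):.2f}")  # prints 0.69
# Screening population where only 1% have the disease:
print(f"{positive_predictive_value(0.90, 0.90, 0.01):.2f}")  # prints 0.08
```

In this example the same model goes from about two in three positive predictions being correct to fewer than one in ten, purely because the population changed.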
Other Resources
Keeping Up with the Field
- Doctor Penguin
Weekly newsletter on new ML in medicine papers.
- MIT Annual Deep Learning State of the Art
Great one-hour review of recent deep learning advances done by Lex Fridman, who also has a good podcast.
- YouTube — Yannic Kilcher
Once you start making models, this popular channel may be interesting to explore. It has videos explaining different types of machine learning models.
Datasets
- Kaggle has great datasets, along with lots of notebooks showing how others have worked with that data
- CIFAR is a classic database used for deep learning benchmarks
- PhysioNet waveform databases
- Stanford AIMI databases
- Grand Challenge datasets
- PapersWithCode Dataset Index has thousands of datasets, a few hundred of which are in medicine
Machine Learning Packages for Cardiology
- IntroECG
Pierre Elias’s deep learning library that allows users to create their own synthetic ECGs and teaches them how to build deep learning models for electrocardiograms. It provides multiple architecture examples and step-by-step instructions on how to review experimental findings and automate experimentation.
- EchoNet
David Ouyang’s deep learning library for determining ejection fraction from echocardiograms. This library comes with a large, freely available echocardiographic dataset and illustrates many of the important principles for doing deep learning research with echocardiograms.