Programming Languages in Data Science (FAQ 001)

Dalton Fabian
The Data Science Pharmacist
5 min read · Sep 16, 2020

The first question that I get from students and pharmacists about data science is usually about which programming languages they need to learn. I took my first programming class in high school, but I can appreciate how daunting it is to learn a programming language, especially if you have no previous coding experience. In this FAQ article, I will tackle the languages you should know to work in data science, where to learn them, what order to learn them in, and how they relate to the work of a Data Scientist.

Make sure to check the bottom of this article or click this affiliate link to get $20 off your first month of a paid DataCamp subscription. DataCamp is my favorite SQL/R/Python resource because of the breadth and depth of courses along with the career path tracks they have that organize your learning plan! More about DataCamp later!

SQL

The first language that I would recommend learning is SQL. This language is frequently overshadowed by the more “exciting” data science languages of R and Python. What makes SQL important is that it supplies the data on which R and Python make predictions about the future, aka machine learning. The SQL code that I write at work gathers data from the Electronic Health Record (e.g., each patient’s A1c, blood pressure, and medication list). Once the SQL code gives me the data about patients, R and Python can use that data to make predictions about each patient’s future health (e.g., how likely the patient is to go to the hospital in the next 12 months). SQL and the more advanced data science languages work in lockstep to create a data science project.

Luckily, SQL is the most straightforward language of the three that I’ll highlight in this article. In my opinion, its syntax is the simplest and easiest to learn. There are a number of resources available to help you learn SQL. My personal favorite data science language resource is DataCamp, and they have a number of SQL courses. You can also find popular SQL classes on platforms like Udemy. Below is an example of what SQL syntax looks like.

SELECT [information you want to gather]
FROM [database table you want data from]
WHERE [certain condition is met]
ORDER BY [data point to sort results by]
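
To make the template above concrete, here is a minimal sketch that fills it in and runs it from Python with the built-in sqlite3 module. The patients table, its columns, and every value in it are made up for illustration; a real Electronic Health Record database will have a different schema and may use a different SQL dialect.

```python
import sqlite3

# In-memory database with a hypothetical patients table (all values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER, name TEXT, a1c REAL)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?)",
    [(1, "Patient A", 6.1), (2, "Patient B", 9.4), (3, "Patient C", 8.2)],
)

# SELECT / FROM / WHERE / ORDER BY, filled in following the template.
query = """
    SELECT patient_id, a1c
    FROM patients
    WHERE a1c > 8.0
    ORDER BY a1c DESC
"""
high_a1c = conn.execute(query).fetchall()
print(high_a1c)  # prints [(2, 9.4), (3, 8.2)]
```

Each bracketed slot in the template maps to one clause: the columns to return, the table to read from, the filter, and the sort order.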

While SQL is important, there is a point past which learning additional syntax adds little to your day-to-day work. I would recommend learning the following keywords and concepts to start: SELECT, FROM, WHERE, GROUP BY, ORDER BY, AVG/SUM/MAX/MIN/COUNT, CASE WHEN statements, LIKE, INNER JOIN/LEFT JOIN, IN, DATEDIFF, DATEADD (the names of the date functions vary by SQL dialect), and Common Table Expressions (CTEs). This may seem like an overwhelming list at first, but most of these concepts should only take a few repetitions/coding exercises to master. These are the keywords and concepts that I use most often day-to-day. Other keywords will come up as you work in data science, but I have found that I can learn (or re-learn) them quickly once I’ve mastered the ones listed here.
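
As a rough illustration of how several of these pieces fit together, the sketch below combines a CTE, GROUP BY with AVG, a LEFT JOIN, and a CASE WHEN in one query, again run through Python's built-in sqlite3 module. The tables, columns, and values are hypothetical, and the 8.0 cutoff is an arbitrary number chosen for the example, not clinical guidance. DATEDIFF and DATEADD are left out because SQLite does not provide functions under those names.

```python
import sqlite3

# Two hypothetical tables: patients, and their A1c lab results (values are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (patient_id INTEGER, clinic TEXT);
    CREATE TABLE labs (patient_id INTEGER, a1c REAL);
    INSERT INTO patients VALUES (1, 'North'), (2, 'North'), (3, 'South');
    INSERT INTO labs VALUES (1, 6.0), (1, 7.0), (2, 9.5);
""")

query = """
    WITH avg_labs AS (      -- CTE: one row per patient who has lab results
        SELECT patient_id, AVG(a1c) AS avg_a1c
        FROM labs
        GROUP BY patient_id
    )
    SELECT p.patient_id,
           p.clinic,
           a.avg_a1c,
           CASE WHEN a.avg_a1c IS NULL THEN 'no labs'
                WHEN a.avg_a1c >= 8.0 THEN 'high'
                ELSE 'below cutoff'
           END AS a1c_group
    FROM patients AS p
    LEFT JOIN avg_labs AS a ON a.patient_id = p.patient_id  -- keeps patients without labs
    ORDER BY p.patient_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the query uses a LEFT JOIN, patient 3 still appears in the result even though there are no lab rows for that patient; an INNER JOIN would drop that row.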

POWER-UP: Learn how to write more efficient SQL by seeking out additional resources (hint: there is also a DataCamp class on writing more efficient SQL).

R & Python

After learning SQL, the next language that you should consider learning is R or Python. These languages are most often used in the machine learning stage of a project, the part where you try to predict the likelihood that something will happen in the future. Machine learning will serve as the foundation of the tools that you build as someone working in data science. Examples of machine learning predictions that can be made with R and Python include the likelihood that a patient will be readmitted to the hospital within 30 days of discharge or the likelihood that a patient will go to the emergency department (ED) or hospital in the next 12 months.
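
To give a feel for what "predicting a likelihood" means in code, here is a toy sketch of a logistic regression written in plain Python. It is not how a production model would be built (in practice you would normally reach for a library such as scikit-learn in Python), and the data, the single feature, and the function names are all made up for illustration.

```python
import math

# Made-up training data: admissions in the past year -> readmitted within 30 days (1) or not (0).
prior_admissions = [0, 1, 0, 3, 2, 4]
readmitted = [0, 0, 0, 1, 1, 1]

def predict_proba(weight, bias, x):
    """Logistic model: squash weight * x + bias into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(weight * x + bias)))

def train(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit the weight and bias with plain batch gradient descent on the log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every patient under the current parameters.
        errors = [predict_proba(weight, bias, x) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias

weight, bias = train(prior_admissions, readmitted)
low_risk = predict_proba(weight, bias, 0)   # hypothetical patient with no prior admissions
high_risk = predict_proba(weight, bias, 4)  # hypothetical patient with four prior admissions
print(f"no prior admissions: {low_risk:.2f}, four prior admissions: {high_risk:.2f}")
```

The model outputs a number between 0 and 1 for each patient, which is the "likelihood" described above. A real project would use many more patients and features (the kind of data gathered with SQL) and would evaluate the model on patients it has not seen before.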

There is a fierce debate in the data science community about which language is best, but both are used by highly successful organizations doing data science. R is often described as the easier of the two to learn but, from personal experience, both are similarly easy to learn. For what it’s worth, I learned Python first and prefer it over R. More recently, Python has become more popular than R among data science teams as more Data Scientists come from specialties other than statistics. My recommendation is to take a few lessons of each language and decide which one you like better. You can do this on DataCamp or any other coding platform. My experience has shown me that knowing one will help you transition to the other if you need to switch in the future. Switching languages might involve a little extra work, but it’s easier than learning your first language.

Like SQL, there are plenty of resources available to learn Python and R. Sites like DataCamp, Udemy, and Codecademy are all great resources. A benefit of learning R or Python on DataCamp’s platform is the ability to select the Data Scientist career track in R or Python that will walk you from learning the basics about each language to running advanced algorithms. The career path feature is a great resource to be able to prioritize and organize your learning process. If you want to check out DataCamp more in-depth, you can use this affiliate link to get $20 off your first month of a paid DataCamp subscription!

POWER-UP: Once you have the basics of Python or R down, learn how to “version” your code using Git. You can learn more about versioning and Git on DataCamp or directly from Git hosting providers such as GitHub or Bitbucket.

Wrap Up

The process of learning programming languages can be daunting. The good news is that there are a number of different resources available that will walk you from the basics of the three different languages I talked about in this article to more advanced functionality. I promise that the rewards of a career in data science are worth the perseverance through learning these programming languages. Every day, I get to do what I love — combining healthcare, programming, and data science. Please reach out to me through this post or connect with me on LinkedIn if you have questions about the content I covered.

Happy Coding! ✌️

Note: If you’re interested in more health data science content, make sure to check the other articles in the Health Data Science FAQ series by visiting my FAQ Central page.

I’m a pharmacist turned data science professional who is passionate about helping clinicians and health system leaders to take better care of patients.