Python or R? A Programming Language Overview for Machine Learning Beginners

Safira Fabilia
6 min read · Jul 9, 2024


So, you’re diving into the world of machine learning, and you’ve hit the age-old question: Python or R? As a novice, you might have come across two of the most popular machine learning textbooks published by O’Reilly: Introduction to Machine Learning with Python and Introduction to Machine Learning with R. Having used both programming languages, I will guide you through a few questions that will hopefully help you choose between them.

Figure 1. An Image of the Covers of the Books Mentioned

What are Python and R?

Figure 2. Python Logo

Python is a general-purpose, open-source programming language launched in 1991, widely used in domains like data science, web development, and gaming. Its popularity is reflected in top rankings on the TIOBE and PYPL indices, largely due to its vast community of users and developers who contribute to its growth and the continuous release of versatile libraries. Python’s readable syntax, which often resembles plain English, makes it an ideal choice for beginners with no coding experience. Over time, it has become increasingly popular in data science, offering simplicity and extensive specialized libraries for tasks such as data visualization, machine learning, and deep learning.

Figure 3. R Logo

R is an open-source programming language designed for statistical computing and graphics. Its development began in 1992, and it has been widely adopted in scientific research and academia. It remains a popular tool in both traditional and business analytics, ranking highly on the TIOBE and PYPL indices. With its extensive collection of packages available via the Comprehensive R Archive Network (CRAN), R allows complex statistical functions and models to be executed with just a few lines of code. Additionally, R excels in generating quality reports, data visualizations, and interactive web applications, making it ideal for producing detailed and aesthetically pleasing graphs.

Let’s break down how to choose one of them for machine learning through a few questions.

1. What do your colleagues use?

This is one of the simplest ways to decide whether you should use Python or R. Often, your choice will be influenced by the programming language used in your workplace.

Python is popular among programmers who want to delve into data analysis or apply statistical techniques, as well as developers who transition into data science (Data Science Central, 2016). On the other hand, R is primarily used in academia and research and is excellent for exploratory data analysis (DataCamp, 2022). Consider the industry you want to work in or are currently working in. You can refer to this graph to see which programming language is preferred by different industries.

Figure 4. A comparison of R and Python in Different Industries (source: GeeksForGeeks)

If you’re unsure where to start and none of your colleagues use either language, you can begin by comparing the capabilities of Python and R.

2. What libraries are available for machine learning?

Both Python and R offer extensive ecosystems of packages and libraries. Here is a list of some of the most popular machine learning libraries in both Python and R.

Popular Data Science Libraries in Python:

  • scikit-learn: A comprehensive library for machine learning, providing tools for classification, regression, clustering, and more.
  • TensorFlow: An open-source framework for machine learning and deep learning, developed by Google Brain.
  • Keras: A high-level neural networks API written in Python. Keras 3 runs on top of TensorFlow, JAX, or PyTorch; earlier versions also supported CNTK and Theano, which are no longer actively developed.
  • PyTorch: An open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing.
  • XGBoost: A scalable and flexible gradient boosting library, designed for speed and performance.

Popular Data Science Libraries in R:

  • caret: A package that streamlines the process of creating predictive models, providing a unified interface to numerous machine learning algorithms.
  • randomForest: An implementation of the random forest algorithm for classification and regression.
  • xgboost: An efficient and scalable implementation of the gradient boosting framework.
  • nnet: A package that provides functions for feed-forward neural networks with a single hidden layer, and for multinomial log-linear models.
  • glmnet: A package for fitting generalized linear models via penalized maximum likelihood, using lasso and elastic-net regularization.
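
Several of the libraries above, scikit-learn and caret in particular, are built around one idea: every model is trained and used through the same small set of calls. Here is a minimal Python sketch of that idea using only the standard library. The class names `MajorityClassifier` and `NearestNeighborClassifier`, the `accuracy` helper, and the data are all made up for illustration; they only mimic the `fit`/`predict` convention that scikit-learn estimators follow and are not part of any library.

```python
from collections import Counter
import math


class MajorityClassifier:
    """Toy baseline: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestNeighborClassifier:
    """Toy 1-nearest-neighbor: predicts the label of the closest training point."""

    def fit(self, X, y):
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for point in X:
            distances = [math.dist(point, seen) for seen in self.X_]
            predictions.append(self.y_[distances.index(min(distances))])
        return predictions


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up data: (height_cm, weight_kg) -> size label
X_train = [(150, 50), (152, 53), (155, 55), (180, 80), (185, 90)]
y_train = ["small", "small", "small", "large", "large"]
X_test = [(151, 52), (183, 85)]
y_test = ["small", "large"]

# The same loop works for both models because they share fit/predict.
results = {}
for model in (MajorityClassifier(), NearestNeighborClassifier()):
    model.fit(X_train, y_train)
    results[type(model).__name__] = accuracy(y_test, model.predict(X_test))

print(results)
```

Because both toy models expose the same two methods, the loop can compare them without knowing how either one works. That is roughly the convenience a unified interface gives you when you want to try several algorithms on the same data.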

Now you can consider which libraries will be most useful for you and choose the corresponding programming language. Additionally, if data visualization is important to you, consider how each programming language handles data visualization.

3. How does each programming language visualize data?

While data visualization isn’t exclusive to machine learning, it plays a crucial role in creating effective machine learning models. In Python, the go-to libraries for exploratory analysis are Seaborn and its lower-level cousin, Matplotlib. Seaborn is built on top of Matplotlib, so think of it as a more user-friendly layer over the same plotting engine.

On the other hand, the ggplot2 library is a standout feature of R. The syntax for creating plots may seem a bit strange at first, but once you get the hang of it, you’ll be producing beautiful and insightful visualizations in no time. With ggplot2, you create visualizations by adding layers to a plot.
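
If the idea of "adding layers" sounds abstract, the following toy Python sketch may help. It is not ggplot2 and not any real plotting library: the `Plot` and `Layer` classes, the dataset name `penguins`, and the column name `body_mass_g` are hypothetical, and nothing is actually drawn. It only shows the pattern of starting from data plus a mapping and then building up a plot description with `+`.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One piece of a plot description (hypothetical, for illustration only)."""
    kind: str
    options: dict = field(default_factory=dict)


@dataclass
class Plot:
    """A plot description: a data name, a column mapping, and ordered layers."""
    data: str
    mapping: dict
    layers: tuple = ()

    def __add__(self, layer):
        # Adding a layer returns a new Plot; the original is left unchanged.
        return Plot(self.data, self.mapping, self.layers + (layer,))

    def describe(self):
        lines = [f"data: {self.data}, mapping: {self.mapping}"]
        lines += [f"+ {layer.kind} {layer.options}" for layer in self.layers]
        return "\n".join(lines)


# Start with the data and which column goes on the x-axis...
base = Plot(data="penguins", mapping={"x": "body_mass_g"})

# ...then stack layers on top, one "+" at a time.
histogram = (
    base
    + Layer("histogram", {"bins": 20})
    + Layer("labels", {"title": "Body mass"})
)

print(histogram.describe())
```

Each `+` leaves the earlier plot untouched and returns a new one, so a base plot like `base` can be reused with different layers. That reuse is a large part of why the layered style feels natural once you get used to it.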

Choosing between Python and R for data visualization usually depends on your preference. To help you decide, I’ll compare how some visualizations look in both Python and R.

First, here is the same histogram produced in Python and in R:

Figure 5. Python Histogram
Figure 6. R Histogram
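
Whichever language draws it, a histogram comes down to the same computation: group the values into bins and count how many land in each. Here is a small standard-library Python sketch of that step. The `histogram_counts` helper and the `ages` values are made up for illustration and are not taken from the figures above.

```python
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    # Each value is mapped to the lower edge of its bin, e.g. 27 -> 20.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Made-up sample data
ages = [21, 22, 25, 27, 31, 33, 34, 38, 41, 45, 47, 52]
bin_width = 10
counts = histogram_counts(ages, bin_width)

# Print a rough text version of the bars.
for start, count in counts.items():
    print(f"{start}-{start + bin_width - 1} | {'#' * count}")
```

Plotting libraries in both languages handle the binning for you and then draw the bars, so in practice you would rarely write this by hand.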

Second, here is the same line plot produced in Python and in R:

Figure 7. Python Line Plot
Figure 8. R Line Plot

Both Python and R offer similar capabilities, but R can provide more expansive visualizations. If you need something straightforward, Python might be the better choice. However, if you’re looking for greater customization and flexibility, R is the way to go.

You can learn more about basic data visualization in R and Python through study materials provided by UCSD. If you are still undecided, consider the availability of online resources and community support for each language.

4. Is there a community and support for the programming language?

Learning Python or R on your own can be challenging and exhausting. That’s why having community support is essential.

Figure 9. Python Community Website

Python has a massive and active community. Whether you’re stuck on a bug or need a tutorial, there are countless resources available. One of the reasons for Python’s worldwide popularity is its community of users and developers who continuously improve the language and release new libraries for various purposes.

Figure 10. CRAN

R’s community might be smaller, but it’s mighty, especially in academia. There are numerous resources, forums, and dedicated users ready to help. The extensive capabilities R offers are largely due to its robust community, which has developed one of the richest collections of data-science-related packages, all available via the Comprehensive R Archive Network (CRAN).

Conclusion

Choosing between Python and R for machine learning ultimately depends on your specific needs and environment.

When deciding, consider the programming language preferred in your workplace or industry. Evaluate the available libraries for machine learning and how each language handles data visualization. Lastly, assess the community and support available, as both languages have robust ecosystems but cater to slightly different user bases.

By weighing these factors, you can make an informed decision that best suits your machine learning journey. If you need further guidance, resources from DataCamp and GeeksForGeeks provide excellent comparisons and study materials. Good luck!
