Python vs. R: Which is better for data scientists?

Arpit Omprakash
Published in Byte-Sized-Code
8 min read · Mar 13, 2020

I recently started learning R, and after a somewhat successful article on Python vs. Julia last month, I thought of writing an essay comparing Python and R. Both are excellent programming languages, and both find applications in many places, including (but not limited to) data science and machine learning. The number of people opting for careers in data science and machine learning has risen steadily over the past few years, and this trend is expected to continue for quite some time. As more people use programming and data analysis in their daily routines, it helps to have comparisons of the languages they code in. Today’s post answers the question: which one will suit your needs as a data scientist better, R or Python?

Python

Python was launched in 1991 with the vision of creating a general-purpose programming language that is both readable and efficient. As detailed in my other post, it is a simple and straightforward language that works well as an introduction to programming but is also capable of fairly advanced work.

Data science is just one part of what Python can be used for. Relevant libraries include popular tools like pandas, NumPy, scikit-learn, matplotlib, Keras, and TensorFlow. Web packages like beautifulsoup and requests make it relatively easy to scrape or download data using code. Python can be used as an object-oriented language, and its OOP capabilities allow for code modularity, which helps in building stable and readable code for data science modeling. Data visualization is a bit tricky, and the quality of the plots Python produces is not up to the mark.
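To make the modularity point concrete, here is a minimal sketch of two small classes that share a fit-then-use interface. The names `Standardizer` and `MeanBaseline` are hypothetical and chosen only for illustration; they do not come from any library, and the data is made up.

```python
from statistics import mean, pstdev


class Standardizer:
    """Rescale values to zero mean and unit variance (hypothetical helper)."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Fall back to 1.0 so constant data does not cause division by zero.
        self.std_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]


class MeanBaseline:
    """Predict the training mean for every row (hypothetical baseline model)."""

    def fit(self, targets):
        self.prediction_ = mean(targets)
        return self

    def predict(self, n_rows):
        return [self.prediction_] * n_rows


# Made-up measurements, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = Standardizer().fit(heights_cm).transform(heights_cm)
baseline = MeanBaseline().fit(heights_cm)

print(scaled)
print(baseline.predict(2))
```

Because each step lives in its own class, one piece (say, the scaling) can be swapped out without touching the rest of the analysis.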

R

R was launched in 1995 as an implementation of the S language developed at Bell Labs. It was designed specifically for statistical computing and data analysis in academia, and so it supports a large number of libraries for data manipulation and visualization.

R is an analysis-oriented procedural language. Developed by statisticians, it feels a bit different from a typical programming language, but it may be intuitive for academics without a coding background (who were the primary target audience). A large number of open-source packages for data analysis and visualization are available for tasks that are not generally covered in other languages, including Python. Just try building a spider plot (or radar chart) in both Python and R, and you will see the difference. As a procedural language, R makes it easy to understand how different models work, but this sometimes comes at the cost of performance.

General Comparison

TIOBE Index for March 2020

According to the TIOBE index for March 2020 (which ranks languages by popularity), Python ranks 3rd while R ranks 11th. R has climbed from 14th to 11th over the past year and is slowly gaining attention. This section compares the general feel of both languages and how easy or difficult it is to learn and write code in each.

Usability and Ease of Learning

People with prior coding experience may prefer Python as it feels more natural than R. That said, beginners with knowledge of statistics but no previous coding experience may find R easier.
The same piece of code can be written in many different ways in R, while Python tends to favor one obvious way of writing it. Thus, you can quickly check your code against someone else’s code in Python. R leaves more room for creativity, and you can choose which way you like to write a given piece of code.

Statistical models are easily written in R with very few lines; this can allow for short and concise code.

RStudio is the go-to IDE for R development, and Jupyter Notebook also supports R interactively. Python offers many IDEs, but people generally use Spyder or Jupyter Notebook with IPython.
The learning curve for Python is generally gentle at first but gets steeper after a point. In R, the initial learning curve is steep, but it becomes easy to learn complex functions and implementations once the basics are clear.

Ecosystem

Python has a robust ecosystem and is considered one of the easiest programming languages to learn. Python code is easy to read, interpret, and type. Python packages can be easily downloaded and installed using pip. R has a larger number of packages for data handling and a vibrant ecosystem that even allows users to interoperate with other open-source languages. Many R packages are available on CRAN (the Comprehensive R Archive Network) and can be installed with a one-line command.
R is excellent for standalone data analysis, which explains why it is the preferred language for academics. As a general-purpose programming language, Python has many other packages that help in developing and deploying web apps, which makes it a good fit for the business end of data science.
Python is flexible enough to create something new and out of the box. It can also be used to develop ML APIs and for scripting on websites. R offers better models and comprehensive statistical tests that can be used to write concise yet complex programs.
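As a rough idea of what an "ML API" can look like in Python, here is a minimal sketch using only the standard library's WSGI conventions. Everything specific in it is an assumption for illustration: the `predict` function stands in for a trained model, its weights are made up, and the JSON shape `{"features": [...]}` is hypothetical.

```python
import io
import json


def predict(features):
    """Toy linear model with made-up weights (stands in for a trained model)."""
    weights = [0.5, -0.25]
    bias = 1.0
    return bias + sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """WSGI application: expects a JSON body like {"features": [2, 4]}."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps({"prediction": predict(payload.get("features", []))})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body.encode("utf-8")]


def call_app(features):
    """Invoke the app in-process, without opening a network socket."""
    raw = json.dumps({"features": features}).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks))


status, response = call_app([2, 4])
print(status, response)
```

To serve it over HTTP, the same `app` could be passed to `wsgiref.simple_server.make_server("", 8000, app)`; in a real project a web framework would usually replace the hand-written request handling.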

Specific Comparison for Data Science

KDnuggets 2019 poll of Data scientists

Although the languages rank very differently in the overall TIOBE index, a 2019 KDnuggets poll shows a different picture among data scientists: 65% of the data scientists polled chose Python, and about 46% chose R (the figures add up to more than 100%, so some respondents picked both). This indicates that there is quite a bit of competition between the languages in the field of data science, so here is a detailed comparison.

Data Collection

Data collection is a significant step for data scientists. Being able to get hold of data from different sources, including the web, and to open and analyze different file formats in a single language is an important requirement.

Both R and Python can handle data from various sources, including Excel files and CSV files. R can handle outputs from other statistical analysis programs like Minitab or SPSS effortlessly, making the transition between different software easy.
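On the Python side, loading a CSV file usually takes one call to pandas. The sketch below reads from an inline string instead of a file on disk so that it runs anywhere; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Inline CSV text stands in for a file on disk; the data is made up.
csv_text = """species,weight_g,length_mm
finch,18.5,120
finch,20.1,125
sparrow,27.3,150
"""

# read_csv accepts a path or any file-like object and infers column types.
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)
print(df.shape)
```

For Excel files, `pd.read_excel` works in much the same way, although it needs an extra engine package such as openpyxl to be installed.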

“Python offers packages such as the requests library that can be used to fetch data from different web sources. Another package called beautifulsoup can be used to process and organize the data collected with the requests library into tables and data frames for further analysis.”

Although it is not as simple in R, many new packages have come up that address this problem. rvest allows users to perform basic web scraping and collect information that can then be processed in pipelines with other R packages like magrittr.
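The Python side of this workflow can be sketched roughly as follows. To keep the example offline, the HTML is an inline snippet rather than a downloaded page; in practice it would come from something like `requests.get(url).text`. The table contents reuse the TIOBE ranks quoted earlier in this post.

```python
from bs4 import BeautifulSoup

# In practice this HTML would be downloaded, e.g. with requests.get(url).text;
# an inline snippet keeps the sketch runnable without network access.
html = """
<table>
  <tr><th>language</th><th>rank</th></tr>
  <tr><td>Python</td><td>3</td></tr>
  <tr><td>R</td><td>11</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every cell, one list per table row.
rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# The first row is the header; pair it with each remaining row.
header, records = rows[0], rows[1:]
table = [dict(zip(header, record)) for record in records]

print(table)
```

A list of dictionaries like `table` can be handed straight to `pandas.DataFrame(table)` for further analysis.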

Data Exploration

After collecting data, the next step is to find insights, correlations, and functional relationships in the data. The pandas library in Python makes it easy to store data in data frame objects and then perform statistical analysis on the whole data or subsets of it. R provides a wider range of packages and more powerful tools for analysis than Python in this regard, as the language was made for exploration and data analysis. It supports probability distributions, various statistical tests, and other machine learning and data mining techniques.
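Here is a small sketch of what that exploration looks like with pandas: summary statistics, a per-group mean, a subset, and a correlation. The data frame is made up for illustration.

```python
import pandas as pd

# Made-up measurements, for illustration only.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b"],
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [2.0, 4.0, 6.0, 8.0],
    }
)

summary = df[["x", "y"]].describe()            # count, mean, std, quartiles
group_means = df.groupby("group")["x"].mean()  # mean of x within each group
subset = df[df["x"] > 2.0]                     # analyze only part of the data
correlation = df["x"].corr(df["y"])            # Pearson correlation by default

print(summary)
print(group_means)
print(len(subset), correlation)
```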

Data Modeling

Numerical modeling in Python can be carried out using NumPy, and scientific analysis can be carried out using SciPy. Python also boasts the scikit-learn library, which provides a range of models, from simple to advanced, for machine learning and data modeling. To do advanced modeling and data analysis in R, one has to rely on packages outside of R’s core functionality. That said, the community packages in R for audio and time-series data analysis are very good.
Overall, R is better at mathematical and statistical models. Still, some Python packages are well maintained for higher-level data manipulation and building machine learning models, including TensorFlow and scikit-learn.
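As a minimal sketch of numerical modeling with NumPy, the snippet below fits a straight line by ordinary least squares. The data is synthetic and generated from y = 2x + 1, so the fit should recover a slope of about 2 and an intercept of about 1.

```python
import numpy as np

# Synthetic data generated from y = 2x + 1, so the expected fit is known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for x and a column of ones for the intercept.
design = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem; only the coefficients are needed here.
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
slope, intercept = coefficients

print(float(slope), float(intercept))
```

scikit-learn's `LinearRegression` wraps the same idea in a `fit`/`predict` interface, which is more convenient once there are several features or a train/test split.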

Data Visualization

Visualization is probably the one realm where R outshines Python by a large margin. From simple bar plots to complex PCA graphs and hierarchical clusters, R provides unmatched data visualization. Although Python has a lot of libraries, including matplotlib and seaborn, the images and plots rendered are not that great compared to R. That said, Plotly offers solutions for effective data visualization in both R and Python through its intuitive API.
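For a sense of the manual work involved on the Python side, here is a sketch of a radar chart in matplotlib. There is no single radar-chart call used here; the chart is drawn as a closed line on polar axes. The labels and scores are made up, and the figure is written to an in-memory buffer so the example runs without a display.

```python
import io
import math

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Made-up scores, for illustration only.
labels = ["speed", "syntax", "plots", "stats", "deploy"]
scores = [3, 4, 2, 5, 1]

# One angle per axis, then repeat the first point to close the polygon.
angles = [2 * math.pi * i / len(labels) for i in range(len(labels))]
closed_angles = angles + angles[:1]
closed_scores = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(closed_angles, closed_scores)
ax.fill(closed_angles, closed_scores, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)

# Save as PNG into memory; pass a filename instead to write a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)

print(len(buffer.getvalue()), "bytes of PNG data")
```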

Academia vs. Industry

Both languages do a great job in general data analysis, visualization, and implementing machine learning models. However, the real difference is that Python is production-ready. R was designed for a particular reason, to analyze data, and it does the job well. But,

“Python is a general-purpose language and offers organizations the freedom to develop software stacks, web applications, and APIs that can deliver machine learning solutions to real-life businesses and problems.”

and,

“The statistics, graphical data visualizations, and mathematical models provided in R are much more accurate and better suited for analysis in the field of academia.”

R offers a better insight into the models that are present in data science, and thus, learning R will make the inner workings of different algorithms much clearer for people.

So what should you choose?

It doesn’t matter which language you choose if you are starting your journey with machine learning and data science. What matters is that you stick with the language you select and become better and better acquainted with it.

That said, if you intend to build an app, or you already develop in a different language, or you want to get into software development in the field of data science, you should opt for Python. On the other hand, if you want to learn about the algorithms in data science in detail, make incredible graphical visualizations, or get into research and academia, R will be much better.
I feel you should learn both if you can; it’s not that difficult. I use Python almost every day to tinker and develop games. I have recently been learning R to help fast-track some statistical analysis that I’m doing for my science experiments.


Arpit Omprakash
Byte-Sized-Code

I'm a Programming and Statistics enthusiast studying Biology. To find out if we have common interests, have a look at my thoughts: https://aceking007.github.io/