R 2020?
Jacek Chmiel, Director of Avenga Labs
Olena Domanska, Data Science Lead
Felix Bahr, ML expert
Not just another R vs Python article
Python has come to dominate the world of data science, and it is just as popular as a general-purpose programming language. No links and no quotes here; it’s obvious.
But if you look carefully at the language popularity statistics, you’ll also see a language called R.
There’s a clear trend of comparing Python and R in the blogosphere. Many beginners start to wonder: ‘Why should I even bother with R when Python, with its set of libraries, seems to be enough to enter the data science world?’
More people are switching from R to Python than the other way around. Still, the popularity of both languages is rising. There’s no clear decline in R popularity visible (yet?).
R was never meant to be general-purpose, so when there’s something not related to data and statistics, Python wins, always.
This article asks a sharper question: is it still worth learning R while Python keeps getting more powerful for data science?
When R is considered better than Python
R was built by statisticians for statisticians (not as a general-purpose language) almost thirty years ago, so it has accumulated tons of built-in features, and more than 10 000 libraries are available in its CRAN repository.
So, there’s a higher chance that a given statistical feature will be either built in or quickly available from this broad set of libraries. There are still areas where Python is missing the mark.
Another area where R is better (or “still better”) is the important aspect of data visualization. Many experts believe even the latest Python libraries are still worse than the R visualization libraries.
R is also believed to be faster for prototyping and verifying hypotheses using its command line and IDE.
When did you experience R being clearly better than Python?
Olena, Data Science Lead
The first thing that comes to my mind is predictive analytics, and specifically time series analysis.
The number of packages available in R for time series is astonishing. R covers time series analysis end to end, with almost all the algorithms needed for it already implemented. It allows you to do complicated time series analysis in a single line. For example, to convert data into a time series and decompose it into components, all you need to do is apply the ‘ready to use’ functions.
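To make the "single line" point concrete, here is a rough sketch of what a classical additive decomposition involves when written by hand in plain Python. It roughly follows the steps that R’s built-in decompose() wraps in one call: a centered moving average for the trend, per-position averages for the seasonal component, and a remainder. The function name decompose_additive and the toy data are illustrative assumptions, not code from any project mentioned here.

```python
from statistics import mean


def decompose_additive(series, period):
    """Classical additive decomposition: series = trend + seasonal + residual.

    Hypothetical helper for illustration only. Trend and residual are None
    at the edges, where the centered moving average is undefined.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # 1. Trend: centered moving average over one full period.
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        if period % 2 == 0:
            # Even period: the two end points get half weight each.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period

    # 2. Seasonal figure: average detrended value per position in the cycle,
    #    shifted so the figure sums to zero over one period.
    detrended = [x - t if t is not None else None for x, t in zip(series, trend)]
    raw = [
        mean(d for d in detrended[k::period] if d is not None)
        for k in range(period)
    ]
    offset = mean(raw)
    seasonal = [raw[i % period] - offset for i in range(n)]

    # 3. Residual: whatever trend and seasonality do not explain.
    residual = [
        x - t - s if t is not None else None
        for x, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, residual


# Toy data: a linear trend plus a repeating pattern with period 4.
pattern = [3.0, -1.0, -2.0, 0.0]
series = [10 + 0.5 * i + pattern[i % 4] for i in range(16)]
trend, seasonal, residual = decompose_additive(series, period=4)
print(seasonal[:4])
```

In R, the same analysis would typically be a call to ts() to build the series followed by decompose(), which is the convenience Olena is describing.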
What I’ve also experienced is that the newest models from statistics come to R first, and then the Python community tries to catch up. For example, consider wavelet decompositions or time series distance functions: they were not implemented in Python until just recently. This is probably because the academic community mostly uses R. Remember, R was written by statisticians, for statisticians. Thanks to this, R is the undisputed leader in statistical correctness.
However, if you are going to apply deep neural networks (which work really well on large time series datasets, especially LSTMs), it’s easier to work in Python, using TensorFlow, PyTorch or Keras.
Besides, I just can’t leave R’s elegant visualization unaddressed. Data visualization is indeed the first thing you need, even before running the first iteration of your model. There are various examples where graphs can tell a story better than a machine learning algorithm.
With ggplot2, R offers a relatively simple and intuitive way to create images. At its core is a “Grammar of Graphics” philosophy, a layered approach that allows you to build plots step by step: starting with the data, then adding “aesthetics” and style elements. The results can easily be shared using Shiny, a great tool for building interactive web apps that lets users explore the data through real-time interactive dashboards and visualizations.
Python, as a “general-purpose” programming language, does not include data visualization tools by default. However, Python provides many libraries for this purpose, such as Matplotlib and Seaborn. Fortunately, Python developers can now enjoy the benefits of ggplot2-style visualization as well. Python packages like plotnine and ggpy, which are equivalents of R’s ggplot2, allow you to create plots in Python according to the same “Grammar of Graphics” principle. However, Python plots, when saved as graphics, take up significantly more disk space than R-generated graphics, which may sometimes matter.
Felix, ML expert
I’ve never really used R in a project, so I can’t actually judge whether or where R would have significant advantages over Python.
In which tasks would it be hard to use R and not Python?
Olena, Data Science Lead
One of the main reasons for using Python in data science is Deep Learning. The majority of deep learning research is done in Python, so tools such as Keras and PyTorch have “Python-first” development.
Besides, it is much simpler to take advantage of cloud computing with Python.
Let’s take Amazon Web Services (AWS) as an example. AWS provides excellent support for Python developers through the AWS Python forums; there are no AWS R forums yet. With Python, it’s much easier to find resources for building successfully on AWS, ask questions, learn, share knowledge, and get answers from a community of developers.
Python is on the list of programming languages (together with JavaScript, PHP, .NET, Ruby, Java, Go, Node.js, and C++) for which Amazon provides a Software Development Kit (SDK).
When I want to connect to AWS I usually turn to Python. AWS’s boto3 is an excellent means of connecting to AWS and exploiting its resources. Python’s boto3 comes pre-installed on AWS SageMaker, the AWS service that provides the ability to build, train, and deploy machine learning models quickly. SageMaker provides a simple way to set up an R environment in the cloud, but it doesn’t give you the means to access other AWS products, such as AWS S3 and AWS Athena, out of the box.
Some time ago RStudio developed a package called reticulate that lets R interface with Python. It can be used in combination with boto3 to access AWS products from SageMaker, much as you would in Python. However, for the time being, neither reticulate nor R’s other awesome packages, like paws or botor, are in the AWS SageMaker image.
Google Cloud and Microsoft Azure embrace Python as a first-class language in their clouds as well.
Deploying models into other pieces of software is also easier with Python. Since Python is a general-purpose programming language, the whole application can be written in it, so including Python-based models is easy and smooth.
Felix, ML expert
Since I have limited experience with R, I can’t compare how difficult it would be to implement analyses in R, or specifically where potential problems or difficulties may arise.
Imagine R suddenly disappeared. What would be the impact on your current/recent data science project?
Olena, Data Science Lead
If that happened, we would have to rebuild our models in Python. The good news is that both languages, R and Python, are state-of-the-art programming languages for data science and suitable for almost any data science task. So, the only impact would be some unnecessary additional effort.
Felix, ML expert
We do not currently use R in our analyses or projects; all are done in Python. So there would be no noticeable effect on our projects.
Developers/data scientists: do they prefer Python for similar tasks that could be done with R?
Olena, Data Science Lead
It really depends on the task; it’s usually more straightforward to do non-statistical tasks in Python.
R was built primarily for statisticians and data scientists, so people coming from these fields (like myself) feel more comfortable using R, for instance with its terminology, basic data types, and model interpretability. Vectors and data frames (the data structures most commonly used by data scientists) are core native data types in R that work with everything.
The usefulness of Python for data science comes primarily from its large and active ecosystem of third-party packages such as NumPy, pandas, SciPy, and scikit-learn, which are not native to Python and can be tricky to configure.
The team is also a key factor when it comes to a programming language selection. When I started working as a Data Scientist, all my team members used R, so it was an obvious choice for me in order to contribute efficiently to team projects.
Felix, ML expert
With our projects written in Python and having years of experience in coding in Python, at the moment I would prefer Python over R for any new task.
Useful features that are readily available and easy to use in Python include support for Apache Spark, which we use to handle and process large amounts of data in a Python analysis. Furthermore, we perform data I/O from and to databases like Cassandra and Elasticsearch, which are also fully supported in Python.
We use the NumPy and SciPy libraries for statistical modelling and data analysis.
We employ machine learning tools provided in the scikit-learn library, such as various clustering algorithms and decision tree / random forest classifiers, for outlier detection and for identifying conspicuous or suspicious requests (i.e. potential attacks) made to a customer’s web pages.
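As a loose illustration of the outlier-detection idea, and explicitly not the scikit-learn clustering or random forest approach described above, the sketch below flags clients whose request counts are far from the typical value. It uses a much simpler stand-in technique (a modified z-score based on the median and the median absolute deviation) and only the Python standard library. The function name flag_outliers, the client names, the counts, and the 3.5 threshold are all assumptions made up for this example.

```python
from statistics import median


def flag_outliers(counts, threshold=3.5):
    """Return the keys whose modified z-score exceeds the threshold.

    Hypothetical helper: 'counts' maps a client identifier to its number
    of requests. The score uses the median and the median absolute
    deviation (MAD), so one extreme value cannot mask itself.
    """
    values = list(counts.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # At least half of the values equal the median; nothing to scale by.
        return []
    # 0.6745 rescales the MAD so scores are comparable to standard z-scores.
    return [
        key
        for key, value in counts.items()
        if 0.6745 * abs(value - med) / mad > threshold
    ]


# Made-up request counts per client for one time window.
requests_per_client = {
    "client-a": 12,
    "client-b": 15,
    "client-c": 11,
    "client-d": 14,
    "client-e": 13,
    "client-f": 950,
}
suspicious = flag_outliers(requests_per_client)
print(suspicious)
```

A production setup like the one Felix describes would work on many features per request rather than a single count, which is where clustering and tree-based models earn their keep.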
Final words
This time the truth is not in the middle. The popularity of Python is growing and areas where R is a must-have are shrinking fast. R is still a very good option for its specialities and focus, but it’s not as clear a choice as it used to be a few years ago.
Our data science teams use both languages and their ecosystems as effective tools for delivering the best output from our data science projects.