Python vs. R: What Cooks Your Noodles Better?
I am often asked which programming language or technology stack is best for data science projects and research. The two prominent open-source options, Python and R, usually come up as the main candidates.
Questions like this rarely have a black-and-white answer, but I do have some arguments in favor of Python as the stack for building and delivering industrial projects.
Over years of data science and software development practice, I have found the following approach useful:
- R mostly for rapid prototyping and quick exploratory data analysis
- Python for industrial and business solutions
In the sections below, I explain why I ended up with this approach.
As summarized in a great review at https://elitedatascience.com/r-vs-python-for-data-science, Python and R are on par in the functional capabilities they offer data scientists. You can argue that some things are more convenient in R (such as auto-ARIMA for time series forecasting) or in Python (many more ready-to-use big data and neural network interfaces), but the bottom line is that you can accomplish more or less the same goals in both languages.
At the same time, R is often more convenient in the early analysis and prototyping phases. It has plenty of tools for productive data wrangling (thanks to dplyr and its extended syntax), exploratory data analysis, and rapid prototyping of machine learning solutions (the caret package alone lets you quickly try 100+ models through a single universal interface).
Python-based solutions perform much faster (on average 10–17 times faster than comparable R applications on the same hardware; see https://www.theregister.co.uk/2017/02/16/r_sql_server_great_but_beware/, for example).
Another good performance booster for Python-based applications is the ability to easily use parallel computation. In my experience, R packages for parallel computation are still too unstable for production environments.
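To make the parallel computation point concrete, here is a minimal sketch using only Python's standard library. The function names (`sum_of_squares`, `run_parallel`) and the sample data are hypothetical, chosen for illustration; the chunked workload is just a stand-in for a real CPU-bound scoring step.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_of_squares(chunk):
    """CPU-bound work on one chunk of data (a stand-in for real scoring)."""
    return sum(x * x for x in chunk)


def run_parallel(chunks, executor_cls=ProcessPoolExecutor, workers=2):
    """Apply sum_of_squares to every chunk in a worker pool.

    pool.map returns results in the same order as the input chunks.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(sum_of_squares, chunks))


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    # One result per chunk, in input order.
    print(run_parallel(data))
```

The executor class is a parameter so the same code can run on separate processes (the default, suited to CPU-bound work) or on threads via `ThreadPoolExecutor` (usually a better fit for I/O-bound work).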
Integration into Business Operation Framework
Apart from performance (where Python definitely beats R), industrial analytical solutions in modern business organizations must meet other essential business requirements:
- Integration with enterprise software systems
- An intuitive, slick UI/UX
- Ease of maintenance and deployment
Python handles these better than R thanks to the following factors:
- Simple integration with a wide range of transactional business systems (CRM, ERP, MRP, etc.), BI platforms, and other corporate IT systems
- Availability of slick web app development frameworks for building a fantastic UI experience (Flask, Django, Frappe, you name it; Shiny in the R world is far behind, IMHO)
- Ability to integrate with many third-party systems via their Web APIs without major development overhead
- Ease of incorporating non-data-science packages into the application, making it part of the continuous business operation framework
- Ease of maintaining the application code: many Python IDEs have built-in capabilities to enforce well-standardized, enterprise-level Python coding conventions like PEP 8 (whereas R and its major IDEs, such as RStudio, will forgive sloppy code style)
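As a rough illustration of the integration points above, the sketch below exposes a model score as a JSON web endpoint. It is an assumption-laden example, not a recommended architecture: it uses only the standard library's WSGI support so it runs without extra installs, whereas a real project would more likely use Flask or Django. The `predict` function with its fixed weights is a hypothetical stand-in for a trained model, and `call_app` is a hypothetical helper that invokes the app in-process.

```python
import io
import json
from wsgiref.util import setup_testing_defaults


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


def app(environ, start_response):
    """Minimal WSGI app: accepts a JSON body {"features": [...]}, returns a score."""
    try:
        size = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(size) or b"{}")
        body = json.dumps({"score": predict(payload["features"])})
        status = "200 OK"
    except (KeyError, TypeError, ValueError):
        body = json.dumps({"error": "expected JSON with a 'features' list"})
        status = "400 Bad Request"
    data = body.encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(data)))])
    return [data]


def call_app(payload):
    """Invoke the app in-process, as a WSGI server would, and decode the reply."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": io.BytesIO(raw)}
    setup_testing_defaults(environ)  # fills in the remaining required WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    reply = b"".join(app(environ, start_response))
    return captured["status"], json.loads(reply)
```

To serve it over HTTP for a quick local trial, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should be enough; a CRM or BI tool could then call the endpoint like any other Web API.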
Will Microsoft Change the Rules?
Microsoft recently made a strong, potentially game-changing move by introducing in-database R Services in Microsoft SQL Server 2016. This eliminates many traditional R disadvantages (above all, in performance and in the ability to benefit from well-orchestrated parallel computing).
Microsoft plans to go further and do the same with Python in MS SQL Server 2017.
In this way, they are creating a stack of high-performance machine learning services of all sorts and flavors. Their obvious interest is to keep users paying for MS SQL Server licenses in a changing world. However, it could have a drastic impact on the adoption of data science and predictive analytics technologies across industries. With R and Python in-database services in place, such technologies will reach the masses and will no longer be the preserve of elite data science laboratories.
Right now, a real battle for data scientists' mind share is under way between Python and R. Although it is potentially productive to use the best of both worlds, it comes at a price: you need to identify where the line between the pros and cons of R and Python lies. In my past projects, that line has clearly followed the stages of the industrial software development life cycle:
- R is good at analysis, research and rapid prototyping
- Python is the stack of choice to build enterprise-ready data products