R for Statistics and Data Science: Why I Prefer R Programming

Simona Dee
Published in 365 Data Science
8 min read · Apr 26, 2018

It was during my fresher’s year at university that I was introduced to data analysis; this is when I first went head-to-head with R as a programming language and data analysis software. Around the same time, through living arrangements, I made a couple of important acquaintances. One of them was a graduate working in the IT sector as a Python developer. As university assignments started to roll in, I began to seek advice from him. He’d show me how he’d solve a problem easily in Python, and then I’d do it in R.

“Why are they making us do this in R…” became the alt text of our data analysis chats.

Eventually, however, R started to make more sense to me, and I became much more efficient. Six years later, I am teaching R to aspiring data scientists, and, as my friend is moving towards big data, he’s experiencing a change of hea-R-t.

R is good for handling data.

And I’d like to discuss why I have developed a strong preference for using R for statistics and data science.

1. Statistics is King

2. Packages are members of the court

3. Visualisations are Queen

4. Gateway to Data Science

5. A look to the Future

Statistics is King

At the core of data analysis, and data science, is statistics. R is a programming language developed by two statisticians, Ross Ihaka and Robert Gentleman, who set out to create a system for statistical computing and data visualisation: something that statisticians far and wide could use intuitively.

At its core, R is a vectorised language which offers a massive amount of statistical functionality in the form of base functions and a vast library of packages. While the language sometimes uses vocabulary that is not intuitive to the non-statistician, getting used to R’s structure does not take aeons, and the rewards are palpable.
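To make "vectorised" a little more concrete, here is a minimal sketch in base R. The heights are made-up illustrative values and the variable names are arbitrary; the point is only that arithmetic and statistical functions operate on whole vectors without an explicit loop.

```r
# Illustrative data: heights in centimetres (made-up values)
heights_cm <- c(158, 172, 165, 180, 169)

# Vectorised arithmetic: the division is applied to every element at once
heights_m <- heights_cm / 100

# Base statistical functions take the whole vector as input
avg <- mean(heights_cm)
spread <- sd(heights_cm)

# Standardise every value (z-scores) in one vectorised expression
z_scores <- (heights_cm - avg) / spread

print(round(z_scores, 2))
```

No package needs to be loaded for any of this: `mean()` and `sd()` ship with every R installation.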

In fact, the game changer for me, when I was first tackling R for statistics and data science, was learning to use one package very well, and then applying my knowledge and developed intuitions to newer and better computations. Adapting to new packages once you’ve learned one in depth absolutely makes the proverbial learning curve gentler.

That said, let’s get back to the point: for data analytics, statistics is King. If you don’t know statistics, then you will need to learn it. And to get statistical insight from your data creatively and effectively, whether as a data analyst or a data scientist, you will need to grasp the concepts in the environment where you will be using them: the programming language. R’s structure allows the learning statistician to apply their intuitions about the procedural order of manipulations and translate them into code. R was meant for statistics and data science.

Then come the packages. Because R is an open-source language with a buzzing (no, booming) community of contributors, developers, and well-wishers, there are packages for almost everything in data analysis. They promptly become the perfect partner in crime (here, crime means efficient data handling), but let’s discuss packages in their own section.

Packages are members of the court

You have heard the saying “too many cooks spoil the broth”, and if you haven’t, it roughly means that if too many people are involved in a shared task, that task will not be completed very well. I believe that in 2018 this phrase bears less and less significance, and R’s community support, and ensuing success, serve to corroborate that.

One of the most notable aspects of R as an open-source language is that it is backed by thousands of users with big code-sharing hearts. The main repository for packages is CRAN (the Comprehensive R Archive Network), and it is just what it sounds like. This is the place where you can find thousands of contributed packages, with functions on almost any data-related topic. Packages in R for statistics and data science range from hugely specific to generally versatile; there are ideologically coherent ecosystems of packages and families of functions that share the same theoretical underpinnings. This makes acquiring a large skillset in a medium-sized field relatively easy (think data analysis).

Let me give you an example. It is called the tidyverse.

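The tidyverse is an ecosystem of packages intended for quick and efficient data handling from scratch: from importing and tidying data, through the exploratory phases of data analysis, to string manipulation and visualisation. Closely related package families extend the same approach to time series analysis and machine learning.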

It contains a total of 20 packages for the everyday needs of the data analyst and a lot more you can load that work seamlessly with the core tidyverse. To give you an idea:

dplyr, tidyr, stringr and forcats are here for easy data wrangling, from tidying, to storing, to manipulating.

purrr and magrittr provide natural programming methods to carry out efficient computations.

readr, readxl, and haven are for importing various types of data into R, like flat files, Excel sheets, and other software-specific data formats (SAS, Stata, SPSS).

ggplot2 is the go-to tool pack for data visualisations. Built on the Grammar of Graphics, this is the most coherent and highly customisable approach to mapping variables to the aesthetics and graphical elements you want.
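As a small taste of how these packages read in practice, here is a hedged sketch of a dplyr pipeline. It assumes dplyr is installed and uses `mtcars`, a dataset that ships with R; the 100-horsepower cut-off and the column names `n_cars` and `mean_mpg` are arbitrary choices made for illustration.

```r
library(dplyr)

# mtcars ships with R: mpg = miles per gallon, cyl = cylinders, hp = horsepower
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%          # keep cars with more than 100 horsepower
  group_by(cyl) %>%             # one group per cylinder count
  summarise(
    n_cars   = n(),             # number of cars in each group
    mean_mpg = mean(mpg)        # average fuel economy in each group
  ) %>%
  arrange(desc(mean_mpg))       # most economical group first

print(summary_by_cyl)
```

Each verb does one thing, and the pipe (`%>%`, which comes from magrittr and is re-exported by dplyr) lets the steps read top to bottom in the order you would describe them out loud.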

In any case, learning one or a couple of these packages unlocks possibilities a novice data analyst can’t even fathom.

This is perhaps my favourite part of being an instructor in R for statistics and data science, because this is the stage where my students begin asking questions that go beyond the curriculum. Once you get to know the language, R is logical and systematic; it creates a perpetuum mobile of self-improvement.

I may be slightly biased towards R, of course — this was the language I learned as I was starting out.

If you are interested in starting out with R, specifically tailored towards statistics and data science, this is a coupon that gives you 95% off the price of my course on Udemy.

Visualisations are Queen

I will start off with this: creating stunning visualisations in R is not a walk in the park to begin with. It is a walk through an erupting volcano park. But. It. Gets. Better.

The results of data analysis are not incredibly convincing to the larger public unless presented in an accessible way. Right? Right. So, here is what R has to offer in this department. I will make this very simple:

R has one of the most comprehensive sets of data visualisation tools you can come across. Quite apart from the fact that it is free, the sheer number of graph galleries and capabilities is breath-taking. In fact, if you would like to make sure that this is the case, and that I am not exaggerating, just take a look at the R Graph Gallery.

Of course, the massive amount of capability comes at a price: learning to create graphs that are personalised in great detail in R is not easy. The ggplot2 package has its own syntax and logic and often, in the beginning at least, you would need to keep a cheat sheet a click away. However, as you learn to use the grammar of graphics and to apply layers to a visualisation, you will also be gaining a whole new perspective on what data visualisation is supposed to be. You will begin to think from the data up.
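To show what "applying layers" looks like, here is a minimal sketch, assuming ggplot2 is installed. It again uses the built-in `mtcars` dataset; the choice of variables, labels, and theme is purely illustrative.

```r
library(ggplot2)

# Start from the data and the mappings, then add one layer at a time
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                # data + aesthetic mappings
  geom_point(aes(colour = factor(cyl))) +                  # layer 1: one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: linear trend line
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()                                          # overall look of the plot

print(p)  # draws the plot on the active graphics device
```

Removing or swapping any single `+` line changes one aspect of the graph and leaves the rest intact, which is the sense in which you build the plot from the data up.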

In addition, R goes far beyond covering the basics of data visualisation really, really well. It integrates with many other libraries and APIs through packages like shiny, plotly, ggvis, googleVis, highcharter, and so on. This lets you jump over the static 2-D canvas and dive into a world of interactive, responsive graphics.

As long as you put the work in, R will prove a worthy visualisation companion for statistics and data science.

Gateway to Data Science

Finally, to bring all this back to the beginning: R might be the best language for learning statistics and data science. This is a language that has been designed for data, and reflects that in every nook, cranny, command, and package.

This is one of the main advantages R has over more general-purpose languages. In fact, many textbooks, universities, and MOOCs teach statistics and data science through R. Why? Because R’s structure, design, logic, syntax, and vocabulary reflect the theoretical foundations of statistical computing and data analysis, and the techniques and tools you use while learning are the practical application you need to understand the subject (be it frequentist statistics, Bayesian statistics, probability, machine learning, etc.).

The professional community in the tech, big data, and science industries also recognises R’s advantages for statistics and data science. R is one of the most widely used programming languages, and its popularity has been growing steadily.

Finally, a note on R’s adaptation to the future

Recently, there has been talk about R “not being enough”, in terms of the speed of its interpreter, how it handles its data, and vectorisation. There have also been suggestions to scrap R in its entirety and develop a new, better language from scratch.

Of course, this is a language developed in the early 1990s with a very specific goal in mind: handle data. The core purpose of the language has not changed much, in contrast to the demands of the marketplace.

But there are two important points to be made. First, R is adapting (through more vectorisation, use of multicores, etc.). And second:

R is still the leader in being “the data language”.

That said, the corporations and industries that currently use R, teach R to new recruits, and develop with R in mind will have an extremely difficult time switching to a new language when, and if, one comes to challenge R for statistics and data science.

Currently, there is nothing quite like R for statistics and data science (let alone better) on the programming languages shelf, and this will likely not change anytime soon.

Finally, finally, a disclaimer!

I teach R for Statistics and Data Science over at Udemy. If you’d like to know more about R, statistics, and data science, you can check out my course (the link below gets you 95% off the Udemy price).

Thanks for reading!

Graduated in Humanities but kept looking towards the Numbers; now I teach statistics and data analysis.