Image for post
Image for post

R vs Python: head to head data analysis

Which is better for data analysis?

Vik Paruchuri
Jun 27, 2016 · 12 min read

There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013–2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head matchup started!

Read in a csv file

nba <- read.csv("nba_2013.csv")
import pandas nba = pandas.read_csv("nba_2013.csv")

The above code will load the csv file nba_2013.csv, which contains data on NBA players from the 2013–2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to Dataframes. Dataframes are available in both R and Python, and are two-dimensional arrays (matrices) where each column can be of a different datatype. At the end of this step, the csv file has been loaded by both languages into a dataframe.

Find the number of players

[1] 481  31
(481, 31)

This prints out the number of players and the number of columns in each. We have 481 rows, or players, and 31 columns containing data on the players.

Look at the first row of the data

head(nba, 1)
player pos age bref_team_id
1 Quincy Acy SF 23 TOT
[output truncated]
player pos  age bref_team_id
0 Quincy Acy SF 23 TOT
[output truncated]

This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here, and head is a method on the dataframe object, and R has a separate head function. This is a common theme you’ll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.

Find the average of each statistic

Let’s find the average value for each statistic. The columns, as you can see, have names like fg (field goals made), and ast (assists). These are the season statistics for the player. If you want a fuller explanation of all the stats, look here.

sapply(nba, mean, na.rm=TRUE)
player NA
pos NA
age 26.5093555093555
bref_team_id NA
[output truncated]
age             26.509356
g 53.253638
gs 25.571726
[output truncated]

There are some major differences in approach here. In both, we’re applying a function across the dataframe columns. In python, the mean method on dataframes will find the mean of each column by default.

In R, taking the mean of string values will just result in NA — not available. However, we do need to ignore NA values when we take the mean (requiring us to pass na.rm=TRUE into the mean function). If we don’t, we end up with NA for the mean of columns like x3p.. This column is three point percentage. Some players didn’t take three point shots, so their percentage is missing. If we try the mean function in R, we get NA as a response, unless we specify na.rm=TRUE, which ignores NA values when taking the mean. The .mean() method in Python already ignores these values by default.

Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate to others. We’ll compare the ast, fg, and trb columns.

ggpairs(nba[,c("ast", "fg", "trb")])
Image for post
Image for post
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(nba[["ast", "fg", "trb"]])
Image for post
Image for post

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).

Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

isGoodCol <- function(col){
sum( == 0 && is.numeric(col)
goodCols <- sapply(nba, isGoodCol)
clusters <- kmeans(nba[,goodCols], centers=5)
labels <- clusters$cluster
from sklearn.cluster import KMeans
kmeans_model = KMeans(n_clusters=5, random_state=1)
good_columns = nba._get_numeric_data().dropna(axis=1)
labels = kmeans_model.labels_

In order to cluster properly, we remove any non-numeric columns, or columns with missing values (NA, Nan, etc). In R, we do this by applying a function across each column, and removing it if it has any missing values or isn’t numeric. We then use the cluster package to perform k-means and find 5 clusters in our data. We set a random seed using set.seed to be able to reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We perform very similar methods to prepare the data that we used in R, except we use the get_numeric_data and dropna methods to remove non-numeric columns and columns with missing values.

Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

nba2d <- prcomp(nba[,goodCols], center=TRUE)
twoColumns <- nba2d$x[,1:2]
clusplot(twoColumns, labels)
Image for post
Image for post
from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(good_columns) plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
Image for post
Image for post

Made a scatter plot of our data, and shaded or changed the icon of the data according to cluster. In R, the clusplot function was used, which is part of the cluster library. We performed PCA via the pccomp function that is builtin to R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.

Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.

trainRowCount <- floor(0.8 * nrow(nba))
trainIndex <- sample(1:nrow(nba), trainRowCount)
train <- nba[trainIndex,]
test <- nba[-trainIndex,]
train = nba.sample(frac=0.8, random_state=1)
test = nba.loc[~nba.index.isin(train.index)]

You’ll notice that R has many more data-analysis focused builtins, like floor, sample, and set.seed, whereas these are called via packages in Python (math.floor, random.sample, random.seed). In Python, the recent version of pandas came with a sample method that returns a certain proportion of rows randomly sampled from a source dataframe — this makes the code much more concise. In R, there are packages to make sampling simpler, but aren’t much more concise than using the built-in sample function. In both cases, we set a random seed to make the results reproducible.

Univariate linear regression

Let’s say we want to predict number of assists per player from field goals made per player.

fit <- lm(ast ~ fg, data=train)
predictions <- predict(fit, test)
from sklearn.linear_model import LinearRegression
lr = LinearRegression()[["fg"]], train["ast"])
predictions = lr.predict(test[["fg"]])

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in lm and predict functions. predict will behave differently depending on the kind of fitted model that is passed into it — it can be used with a variety of fitted models.

Calculate summary statistics for the model

lm(formula = ast ~ fg, data = train)
Min 1Q Median 3Q Max
-228.26 -35.38 -11.45 11.99 559.61
[output truncated]
import statsmodels.formula.api as sm
model = sm.ols(formula='ast ~ fga', data=train)
fitted =
OLS Regression Results                            
Dep. Variable: ast
R-squared: 0.568
Model: OLS
Adj. R-squared: 0.567
[output truncated]

If we want to get summary statistics about the fit, like r-squared value, we’ll need to do a bit more in Python than in R. With R, we can use the builtin summary function to get information on the model. With Python, we need to use the statsmodels package, which enables many statistical methods to be used in Python. We get similar results, although generally it’s a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don’t exist in Python.

Fit a random forest model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")
rf <- randomForest(train[predictorColumns], train$ast, ntree=100)
predictions <- predict(rf, test[predictorColumns])
from sklearn.ensemble import RandomForestRegressor
predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)[predictor_columns], train["ast"])
predictions = rf.predict(test[predictor_columns])

The main difference here is that we needed to use the randomForest library in R to use the algorithm, whereas it was built in to scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there’s usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.

Calculate error

Now that we’ve fit two models, let’s calculate error. We’ll use MSE.

mean((test["ast"] - predictions)^2)
from sklearn.metrics import mean_squared_error
mean_squared_error(test["ast"], predictions)

In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There’s a small difference in errors that almost certainly due to parameter tuning, and isn’t a big deal.

Download a webpage

Now that we have data on NBA players from 2013–2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.

library(RCurl) url <-
data <- readLines(url)
import requests
url = ""
data = requests.get(url).content

In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparisons’s sake.

Extract player box scores

Now that we have the web page, we’ll need to parse it to extract scores for players.

page <- read_html(url)
table <- html_nodes(page, ".stats_table")[3]
rows <- html_nodes(table, "tr")
cells <- html_nodes(rows, "td a")
teams <- html_text(cells)
extractRow <- function(rows, i){
if(i == 1){
row <- rows[i]
tag <- "td"
if(i == 2){
tag <- "th"
items <- html_nodes(row, tag)
scrapeData <- function(team){
teamData <- html_nodes(page, paste("#",team,"_basic", sep=""))
rows <- html_nodes(teamData, "tr")
lapply(seq_along(rows), extractRow, rows=rows)
data <- lapply(teams, scrapeData)
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'html.parser')
box_scores = []
for tag in soup.find_all(id=re.compile("[A-Z]{3,}_basic")):
rows = []
for i, row in enumerate(tag.find_all("tr")):
if i == 0:
elif i == 1:
tag = "th"
tag = "td"
row_data = [item.get_text() for item in row.find_all(tag)]

This will create a list containing two lists, the first with the box score for CLE, and the second with the box score for GSW. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our nba dataframe.

The R code is more complex than the Python code, because there isn’t a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using for loops in favor of applying functions along vectors. We use lapply to do this, but since we need to treat each row different depending on whether it’s a header or not, we pass the index of the item we want, and the entire rows list into the function.

We use rvest, a new and widely used R web scraping package to extract the data we need. Note that we can pass a url directly into rvest, so the last step wasn’t needed in R.

In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.

Python vs R in Conclusion

We’ve taken a look at how to analyze a dataset with R and Python. There are many tasks we didn’t dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. We’ll dive into these at a later date, which will let us make some more definitive conclusions. For now, here’s what we can say:

As we saw from functions like lm, predict, and others, R lets functions do most of the work. Contrast this to the LinearRegression class in Python, and the sample method on dataframes.

When we looked at summary statistics, we could use the summary built-in function in R, but had to import the statsmodels package in Python. The dataframe is a built-in construct in R, but must be imported via the pandas package in Python.

With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, lm, randomForest is a separate package, etc).

R was built as a statistical language, and it shows. statsmodels in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn’t look into closely, like saving to databases, deploying web servers, or running complex workflows.

There are clear points of inspiration between both R and Python (pandas Dataframes were inspired by R dataframes, the rvest package was inspired by BeautifulSoup), and both ecosystems continue to grow stronger. It’s remarkable how similar the syntax and approaches are for many common tasks in both languages.

Last word

At Dataquest, we primarily teach Python, but have recently been adding lessons on R. We see both languages as complementary, and although we think Python is stronger in more areas, R is an effective language. It can be used either as a complement for Python in areas like data exploration and statistics, or as your sole data analysis tool. As this walkthrough proves, both languages have a lot of similarities in syntax and approach, and you can’t go wrong with one, the other, or both.

Image for post
Image for post

Originally published at


Learn Data Science Online, In Your Browser

Vik Paruchuri

Written by

Founder @ Dataquest. Data science, machine learning, web/mobile development background. Worked edXonline, U.S. State Department.



Learn Data Science Online, In Your Browser

Vik Paruchuri

Written by

Founder @ Dataquest. Data science, machine learning, web/mobile development background. Worked edXonline, U.S. State Department.



Learn Data Science Online, In Your Browser

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store