# R vs Python: head to head data analysis

## Which is better for data analysis?

There have been dozens of articles written comparing Python and R from a subjective standpoint. We’ll add our own views at some point, but this article aims to look at the languages more objectively. We’ll analyze a dataset side by side in Python and R, and show what code is needed in both languages to achieve the same result. This will let us understand the strengths and weaknesses of each language without the conjecture. At Dataquest, we teach both languages, and think both have a place in a data science toolkit.

We’ll be analyzing a dataset of NBA players and their performance in the 2013–2014 season. You can download the file here. For each step in the analysis, we’ll show the Python and R code, along with some explanation and discussion of the different approaches. Without further ado, let’s get this head to head matchup started!

# Read in a csv file

## R

`nba <- read.csv("nba_2013.csv")`

## Python

`import pandas nba = pandas.read_csv("nba_2013.csv")`

The above code will load the csv file nba_2013.csv, which contains data on NBA players from the 2013–2014 season, into the variable nba in both languages. The only real difference is that in Python, we need to import the pandas library to get access to Dataframes. Dataframes are available in both R and Python, and are two-dimensional arrays (matrices) where each column can be of a different datatype. At the end of this step, the csv file has been loaded by both languages into a dataframe.

# Find the number of players

## R input

**dim**(nba)

## R output

`[1] 481 31`

## Python input

`nba`**.**shape

## Python output

`(481, 31)`

This prints out the number of players and the number of columns in each. We have **481** rows, or players, and **31** columns containing data on the players.

# Look at the first row of the data

## R input

**head**(nba, 1)

## R output

`player pos age bref_team_id`

1 Quincy Acy SF 23 TOT

[output truncated]

## Python input

`nba`**.**head(1)

## Python output

`player pos age bref_team_id`

0 Quincy Acy SF 23 TOT

[output truncated]

This is pretty much identical. Both print out the first row of the data, and the syntax is very similar. Python is more object-oriented here, and **head** is a method on the dataframe object, and R has a separate **head** function. This is a common theme you’ll see as you start to do analysis with these languages, where Python is more object-oriented, and R is more functional.

# Find the average of each statistic

Let’s find the average value for each statistic. The columns, as you can see, have names like **fg** (field goals made), and **ast** (assists). These are the season statistics for the player. If you want a fuller explanation of all the stats, look here.

## R input

**sapply**(nba, **mean**, na.rm**=TRUE**)

## R output

`player NA`

pos NA

age 26.5093555093555

bref_team_id NA

[output truncated]

## Python input

`nba`**.**mean()

## Python output

`age 26.509356`

g 53.253638

gs 25.571726

[output truncated]

There are some major differences in approach here. In both, we’re applying a function across the dataframe columns. In python, the mean method on dataframes will find the mean of each column by default.

In R, taking the mean of string values will just result in **NA** — not available. However, we do need to ignore **NA** values when we take the mean (requiring us to pass **na.rm=TRUE** into the **mean** function). If we don’t, we end up with **NA** for the mean of columns like **x3p**.. This column is three point percentage. Some players didn’t take three point shots, so their percentage is missing. If we try the **mean** function in R, we get **NA** as a response, unless we specify **na.rm=TRUE**, which ignores **NA** values when taking the mean. The **.mean()** method in Python already ignores these values by default.

# Make pairwise scatterplots

One common way to explore a dataset is to see how different columns correlate to others. We’ll compare the ast, fg, and trb columns.

## R input

`library(GGally)`

ggpairs(nba[,c("ast", "fg", "trb")])

## R output

## Python input

`import seaborn as sns`

import matplotlib.pyplot as plt

sns.pairplot(nba[["ast", "fg", "trb"]])

plt.show()

## Python output

We get very similar plots in the end, but this shows how the R data science ecosystem has many smaller packages (GGally is a helper package for ggplot2, the most-used R plotting package), and many more visualization packages in general. In Python, matplotlib is the primary plotting package, and seaborn is a widely used layer over matplotlib. With visualization in Python, there is usually one main way to do something, whereas in R, there are many packages supporting different methods of doing things (there are at least a half dozen packages to make pair plots, for instance).

# Make clusters of the players

One good way to explore this kind of data is to generate cluster plots. These will show which players are most similar.

## R

`library(cluster)`

set.seed(1)

isGoodCol <- function(col){

sum(is.na(col)) == 0 && is.numeric(col)

}

goodCols <- sapply(nba, isGoodCol)

clusters <- kmeans(nba[,goodCols], centers=5)

labels <- clusters$cluster

## Python

`from sklearn.cluster import KMeans`

kmeans_model = KMeans(n_clusters=5, random_state=1)

good_columns = nba._get_numeric_data().dropna(axis=1) kmeans_model.fit(good_columns)

labels = kmeans_model.labels_

In order to cluster properly, we remove any non-numeric columns, or columns with missing values (**NA**, **Nan**, etc). In R, we do this by applying a function across each column, and removing it if it has any missing values or isn’t numeric. We then use the cluster package to perform k-means and find 5 clusters in our data. We set a random seed using **set.seed** to be able to reproduce our results.

In Python, we use the main Python machine learning package, scikit-learn, to fit a k-means clustering model and get our cluster labels. We perform very similar methods to prepare the data that we used in R, except we use the **get_numeric_data** and **dropna** methods to remove non-numeric columns and columns with missing values.

# Plot players by cluster

We can now plot out the players by cluster to discover patterns. One way to do this is to first use PCA to make our data 2-dimensional, then plot it, and shade each point according to cluster association.

## R input

`nba2d <- prcomp(nba[,goodCols], center=TRUE)`

twoColumns <- nba2d$x[,1:2]

clusplot(twoColumns, labels)

## R output

## Python input

`from sklearn.decomposition import PCA`

pca_2 = PCA(2)

plot_columns = pca_2.fit_transform(good_columns) plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels) plt.show()

## Python output

Made a scatter plot of our data, and shaded or changed the icon of the data according to cluster. In R, the **clusplot** function was used, which is part of the cluster library. We performed PCA via the **pccomp** function that is builtin to R.

With Python, we used the PCA class in the scikit-learn library. We used matplotlib to create the plot.

# Split into training and testing sets

If we want to do supervised machine learning, it’s a good idea to split the data into training and testing sets so we don’t overfit.

## R

`trainRowCount <- floor(0.8 * nrow(nba))`

set.seed(1)

trainIndex <- sample(1:nrow(nba), trainRowCount)

train <- nba[trainIndex,]

test <- nba[-trainIndex,]

## Python

`train = nba.sample(frac=0.8, random_state=1)`

test = nba.loc[~nba.index.isin(train.index)]

You’ll notice that R has many more data-analysis focused builtins, like **floor**, **sample**, and **set.seed**, whereas these are called via packages in Python (**math.floor**, **random.sample**, **random.seed**). In Python, the recent version of pandas came with a sample method that returns a certain proportion of rows randomly sampled from a source dataframe — this makes the code much more concise. In R, there are packages to make sampling simpler, but aren’t much more concise than using the built-in **sample** function. In both cases, we set a random seed to make the results reproducible.

# Univariate linear regression

Let’s say we want to predict number of assists per player from field goals made per player.

## R

`fit <- lm(ast ~ fg, data=train)`

predictions <- predict(fit, test)

## Python

`from sklearn.linear_model import LinearRegression`

lr = LinearRegression()

lr.fit(train[["fg"]], train["ast"])

predictions = lr.predict(test[["fg"]])

Scikit-learn has a linear regression model that we can fit and generate predictions from. R relies on the built-in **lm** and **predict** functions. **predict** will behave differently depending on the kind of fitted model that is passed into it — it can be used with a variety of fitted models.

# Calculate summary statistics for the model

## R input

**summary**(fit)

## R output

Call:

lm(formula = ast ~ fg, data = train)Residuals:

Min 1Q Median 3Q Max

-228.26 -35.38 -11.45 11.99 559.61

[output truncated]

## Python input

**import** **statsmodels.formula.api** **as** **sm**

model **=** sm**.**ols(formula**=**'ast ~ fga', data**=**train)

fitted **=** model**.**fit()

fitted**.**summary()

## Python output

`OLS Regression Results `

============================

Dep. Variable: ast

R-squared: 0.568

Model: OLS

Adj. R-squared: 0.567

[output truncated]

If we want to get summary statistics about the fit, like r-squared value, we’ll need to do a bit more in Python than in R. With R, we can use the builtin **summary** function to get information on the model. With Python, we need to use the statsmodels package, which enables many statistical methods to be used in Python. We get similar results, although generally it’s a bit harder to do statistical analysis in Python, and some statistical methods that exist in R don’t exist in Python.

# Fit a random forest model

Our linear regression worked well in the single variable case, but we suspect there may be nonlinearities in the data. Thus, we want to fit a random forest model.

## R

`library(randomForest)`

predictorColumns <- c("age", "mp", "fg", "trb", "stl", "blk")

rf <- randomForest(train[predictorColumns], train$ast, ntree=100)

predictions <- predict(rf, test[predictorColumns])

## Python

`from sklearn.ensemble import RandomForestRegressor`

predictor_columns = ["age", "mp", "fg", "trb", "stl", "blk"]

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)

rf.fit(train[predictor_columns], train["ast"])

predictions = rf.predict(test[predictor_columns])

The main difference here is that we needed to use the randomForest library in R to use the algorithm, whereas it was built in to scikit-learn in Python. scikit-learn has a unified interface for working with many different machine learning algorithms in Python, and there’s usually only one main implementation of each algorithm in Python. With R, there are many smaller packages containing individual algorithms, often with inconsistent ways to access them. This results in a greater diversity of algorithms (many have several implementations, and many are fresh out of research labs), but with a bit of a usability hit.

# Calculate error

Now that we’ve fit two models, let’s calculate error. We’ll use MSE.

## R input

**mean**((test["ast"] **-** predictions)**^**2)

## R output

`4573.86778567462`

## Python input

**from** **sklearn.metrics** **import** mean_squared_error

mean_squared_error(test["ast"], predictions)

## Python output

`4166.9202475632374`

In Python, the scikit-learn library has a variety of error metrics that we can use. In R, there are likely some smaller libraries that calculate MSE, but doing it manually is pretty easy in either language. There’s a small difference in errors that almost certainly due to parameter tuning, and isn’t a big deal.

# Download a webpage

Now that we have data on NBA players from 2013–2014, let’s scrape some additional data to supplement it. We’ll just look at one box score from the NBA Finals here to save time.

## R

`library(RCurl) url <-`

"http://www.basketball-reference.com/boxscores/201506140GSW.html"

data <- readLines(url)

## Python

`import requests`

url = "http://www.basketball-reference.com/boxscores/201506140GSW.html"

data = requests.get(url).content

In Python, the requests package makes downloading web pages easy, with a consistent API for all request types. In R, RCurl provides a similarly simple way to make requests. Both download the webpage to a character datatype. Note: this step is unnecessary for the next step in R, but is shown for comparisons’s sake.

# Extract player box scores

Now that we have the web page, we’ll need to parse it to extract scores for players.

## R

library(rvest)

page<-read_html(url)

table<-html_nodes(page, ".stats_table")[3]

rows<-html_nodes(table, "tr")

cells<-html_nodes(rows, "td a")

teams<-html_text(cells)extractRow<-function(rows, i){

if(i==1){

return

}

row<-rows[i]

tag<-"td"

if(i==2){

tag<-"th"

}

items<-html_nodes(row, tag)

html_text(items)

}scrapeData<-function(team){

teamData<-html_nodes(page,paste("#",team,"_basic", sep=""))

rows<-html_nodes(teamData, "tr")

lapply(seq_along(rows), extractRow, rows=rows)

}data<-lapply(teams, scrapeData)

## Python

**from** **bs4** **import** BeautifulSoup

**import** **re**

soup **=** BeautifulSoup(data, 'html.parser')

box_scores **=** []

**for** tag **in** soup**.**find_all(id**=**re**.**compile("[A-Z]{3,}_basic")):

rows **=** []

**for** i, row **in** enumerate(tag**.**find_all("tr")):

**if** i **==** 0:

**continue**

**elif** i **==** 1:

tag **=** "th"

**else**:

tag **=** "td"

row_data **=** [item**.**get_text() **for** item **in** row**.**find_all(tag)]

rows**.**append(row_data)

box_scores**.**append(rows)

This will create a list containing two lists, the first with the box score for **CLE**, and the second with the box score for **GSW**. Both contain the headers, along with each player and their in-game stats. We won’t turn this into more training data now, but it could easily be transformed into a format that could be added to our **nba **dataframe.

The R code is more complex than the Python code, because there isn’t a convenient way to use regular expressions to select items, so we have to do additional parsing to get the team names from the HTML. R also discourages using **for** loops in favor of applying functions along vectors. We use **lapply** to do this, but since we need to treat each row different depending on whether it’s a header or not, we pass the index of the item we want, and the entire **rows** list into the function.

We use **rvest**, a new and widely used R web scraping package to extract the data we need. Note that we can pass a url directly into **rvest**, so the last step wasn’t needed in R.

In Python, we use BeautifulSoup, the most commonly used web scraping package. It enables us to loop through the tags and construct a list of lists in a straightforward way.

# Python vs R in Conclusion

We’ve taken a look at how to analyze a dataset with R and Python. There are many tasks we didn’t dive into, such as persisting the results of our analysis, sharing the results with others, testing and making things production-ready, and making more visualizations. We’ll dive into these at a later date, which will let us make some more definitive conclusions. For now, here’s what we can say:

## R is more functional, Python is more object-oriented

As we saw from functions like **lm**, **predict**, and others, R lets functions do most of the work. Contrast this to the **LinearRegression** class in Python, and the **sample** method on dataframes.

## R has more data analysis built-ins, Python relies on packages

When we looked at summary statistics, we could use the **summary** built-in function in R, but had to import the **statsmodels** package in Python. The dataframe is a built-in construct in R, but must be imported via the **pandas** package in Python.

## Python has “main” packages for data analysis tasks, R has a larger ecosystem of small packages

With Python, we can do linear regression, random forests, and more with the scikit-learn package. It offers a consistent API, and is well-maintained. In R, we have a greater diversity of packages, but also greater fragmentation and less consistency (linear regression is a builtin, **lm**, **randomForest** is a separate package, etc).

## R has more statistical support in general

R was built as a statistical language, and it shows. **statsmodels** in Python and other packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

## It’s usually more straightforward to do non-statistical tasks in Python

With well-maintained libraries like BeautifulSoup and requests, web scraping in Python is far easier than in R. This applies to other tasks that we didn’t look into closely, like saving to databases, deploying web servers, or running complex workflows.

## There are many parallels between the data analysis workflow in both

There are clear points of inspiration between both R and Python (pandas Dataframes were inspired by R dataframes, the *rvest* package was inspired by *BeautifulSoup*), and both ecosystems continue to grow stronger. It’s remarkable how similar the syntax and approaches are for many common tasks in both languages.

# Last word

At Dataquest, we primarily teach Python, but have recently been adding lessons on R. We see both languages as complementary, and although we think Python is stronger in more areas, R is an effective language. It can be used either as a complement for Python in areas like data exploration and statistics, or as your sole data analysis tool. As this walkthrough proves, both languages have a lot of similarities in syntax and approach, and you can’t go wrong with one, the other, or both.

*Originally published at **www.dataquest.io**.*