Collaborating between Python and R using Reticulate

Deepsha Menghani
Data Science at Microsoft
15 min read · Feb 7, 2023

My colleague Riesling Walker and I worked together on a Microsoft Hackathon project to analyze data around our common hobby using Python and R. We wanted to use our data science skills to analyze our knitting queues on Ravelry, a social networking and organizational website for yarn-related crafts. Through this project our goal was to:

  1. Learn to pull data with API calls in Python using Flask.
  2. Access the output of Python API calls from R easily using Reticulate.
  3. Analyze our project data using GGPlot in R and create a portfolio project.

This is the second article in a three-part series aimed at going in-depth into each of the above steps. This article provides an overview of how we used the Reticulate package to collaborate between R and Python.

You can find the first article in the series here, which provides a handy overview of APIs and a step-by-step guide on how to use APIs in Python for first-time API users. To keep an eye out for upcoming articles in this series, follow Deepsha, Riesling, or Data Science @ Microsoft (all on Medium.com).

Why collaborate between R and Python?

Both R and Python are powerful data science languages, and either can be used to accomplish end-to-end analytics projects. Although the languages overlap a lot in capability, each has its strengths. For example, you may use R for its powerful data visualization libraries and Python for its ML libraries, so within a single project there may be some things you want to do in Python and some in R. Additionally, each person has their own preference, so you might prefer R while a teammate on the same project prefers Python. Reticulate makes it easy to collaborate and talk between the two languages. I think of it as a translator at my fingertips.

In this project we divided Python and R components in the following way:

  • Riesling used Python to write functions that use the Ravelry API to return a JSON object. She then converted it into a Python data frame to be returned by the function. You can find these functions in a file called “python_api_functions”.
  • I then used R to source the Python function file created by Riesling and called the Python functions using the Reticulate package, which automatically returned an R data frame for further analysis in R.

Let’s get started on the details of how we accomplished this, the packages needed, and the various function calls.

Some important notes as you navigate this article

Switching between R and Python code

I will be switching between Python and R code snippets. The Python code is always from the Python functions file and the R code snippets are from the R file being used to call the Python functions. Each code snippet has a handy comment right on top as a reminder of what language that snippet is from.

Where to find the code files

You can find the code to recreate everything in this article in this GitHub repo.

Accessing Ravelry data through API

The code in this article accesses data through Ravelry API calls. To run the code, you need a pro Ravelry account (free to create), and then you must create an App as in the snapshot below. The “Basic Auth: read only access” should be sufficient for running most of the article code. However, if you also want to access the “Favorites list” on Ravelry, you will need an App with “Basic Auth: personal account access”. After you create the app, you will be provided with a username and password that allow access to the data using the API. This step is very quick!

For this article I created a personal account access app so I could also analyze favorite patterns.

Type of apps that can be created with free Ravelry pro account to access its APIs.

Loading the base libraries

The Reticulate package allows working between Python and R seamlessly. If you do not have the below packages installed, you can use the command install.packages("package name") to install them before calling the libraries.

# R

# For EDA
library(tidyverse)
library(DT)

# For interacting between python and R
library(reticulate)

# For storing any secret keys required by API calls
library(dotenv)

Sourcing the Python functions file from R

You can run the command below in your R file to source the Python file with all the functions. Then you will be able to call those Python functions from R as if they were written in R. Isn’t that neat! In this case my Python file is in the same folder as the R file I am calling it from.

# R

source_python("python_api_functions.py")

Once you have installed and called the “Reticulate” package, when trying to source the Python functions file, you may be asked to install “miniconda”. You can install it here.

While sourcing the Python file, if you get an error indicating that some Python packages are missing, you can install them using the following command from a Python notebook cell (in a terminal, drop the leading “!”). It is a good idea to check that the Python packages imported by the functions file are installed before sourcing it.

# Python

!pip install flask
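
One way to do that check up front is sketched below. It uses only the Python standard library; the helper name missing_packages and the list of required modules are assumptions for illustration (flask is named above and pandas is implied by the function signatures, while requests is a guess), so adjust the list to whatever the functions file actually imports.

```python
# Python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this Python environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumed list of modules; replace with the imports used in the real functions file.
required = ["flask", "pandas", "requests"]
print("Missing:", missing_packages(required))
```

Anything reported as missing can then be installed before calling source_python() from R.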

Let’s test the sourced Python file

There is a greeting function in the Python file that requires a simple string input. Let’s call that function from the R file to make sure that the Python file and its functions have been sourced correctly.

# R

greeting_fellow_humans("Optimus")

[1] "Hello, Optimus. Good morning!"

Great, looks like we successfully sourced a Python file from R and called a Python function! Let’s move on to other slightly more complex Python function calls from R.
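
For reference, a function like this could be as small as the sketch below on the Python side. The body is an assumption reconstructed from the output shown above; the actual function in the repo may be written differently.

```python
# Python
def greeting_fellow_humans(name: str) -> str:
    """Return a short greeting for the given name (illustrative body)."""
    return f"Hello, {name}. Good morning!"


print(greeting_fellow_humans("Optimus"))
```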

Calling Python function with username and password inputs

Let’s start with a simple Python function that Riesling wrote to get yarn weights using the Ravelry API. Below is what the Python function declaration and return looks like. We do not need to understand how it executes in Python, but we do need to know what information to give it and the data type of what will be returned in order to implement it in our R code, so to reduce confusion, we will just include a placeholder for the actual code logic.

# Python 

# Function that returns all valid yarn weights. This python function sits in the python_api_functions.py file.
def get_yarnweights(authUsername: str, authPassword: str) -> pd.core.frame.DataFrame:
    ...
    ...
    ...
    return pd.DataFrame.from_records(json.loads(r1.text)['yarn_weights'])

To call this function from my R file, I need to pass in the username and password for my Ravelry app. I stored those in a .env file inside my home folder and used the dotenv package to read them, which keeps the credentials out of my R code.

# R

library(dotenv)

load_dot_env("rav.env")

authUsername = Sys.getenv('authUsername')
authPassword = Sys.getenv('authPassword')
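
If you ever need the same credentials on the Python side, the idea behind a .env file can be sketched with the standard library alone. This is a simplified illustration, not how the dotenv package is implemented; the helper name load_env_file is hypothetical, and the file is assumed to contain plain KEY=VALUE lines.

```python
# Python
import os


def load_env_file(path):
    """Read KEY=VALUE lines from a file into os.environ and return them as a dict."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            # Drop surrounding whitespace and optional quotes around the value.
            loaded[key.strip()] = value.strip().strip("'\"")
    os.environ.update(loaded)
    return loaded


# Only runs if a rav.env file exists in the working directory.
if os.path.exists("rav.env"):
    load_env_file("rav.env")
    authUsername = os.environ.get("authUsername")
    authPassword = os.environ.get("authPassword")
```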

Now that I have my username and password stored secretly, I can call the Python function from R using the environment variables as input into the Python function.

# R 

yarnweights <- get_yarnweights(authUsername, authPassword)

yarnweights %>%
  select(id, name) %>%
  head() %>%
  datatable()

Unique yarn weights data available through the Ravelry API.

Awesome! I ran a Python function from my R file that required user input, and I received the output. What is even more exciting is that Reticulate is smart enough to return an R dataframe when I call the function from R instead of the Python dataframe that the original Python function returned! Now I can easily use GGPlot directly on this data frame to plot beautiful charts.

But before we do that, let’s try to call a few other Python functions from R. Additionally, I will resolve some errors that we encounter along the way.

Calling Python function with more inputs

There is a Python function that Riesling wrote called “get_queue()” that takes an input of a Ravelry username and returns the projects in that user’s queue. Let’s take a look at the Python function skeleton, and then the R code of how to call it.

# Python 

# This returns all the patterns from a given person's rav_username
def get_queue(authUsername: str, authPassword: str, rav_username = 'rieslingm', query = '', page = 1, page_size = 100) -> pd.core.frame.DataFrame:
    ...
    ...
    ...
    return pd.DataFrame.from_records(json.loads(r1.text)['queued_projects'])

My Ravelry username is ‘yarnsandcoffee’, so let’s look at my queue by calling this Python function from R.

# R

project_queue <- get_queue(authUsername, authPassword, rav_username = 'yarnsandcoffee')

project_queue %>%
  select(created_at, short_pattern_name, pattern_id, pattern_author_name) %>%
  head() %>% # Only return the first 6 rows
  datatable()

Works like a charm, and I get an R dataframe to work with. Now I want to get further details of each of the above patterns in my queue. Let’s try another Python function call where I input a pattern_id to get the details of the pattern in return, like the yardage of yarn required to make that pattern. Let’s take a look at the Python function skeleton.

# Python 

# for a given pattern id, this returns all the pattern details
def get_pattern_details(authUsername: str, authPassword: str, pattern_id: int) -> pd.core.frame.DataFrame:
    ...
    ...
    ...
    return pd.json_normalize(json.loads(r1.text)['pattern'], max_level = 1)

From the function signature above, we see that it expects an integer input for ‘pattern_id’ to return the pattern details. Let’s try that.

# R

pattern_details <- get_pattern_details(authUsername, authPassword, pattern_id = 1168477)

Running the above unfortunately threw an error for me that said “Error in py_call_impl(callable, dots$args, dots$keywords) : json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)”.

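Thankfully, a very simple edit to the function call fixed the error. A likely cause is that R stores a bare number such as 1168477 as a double, which Reticulate passes to Python as a float rather than an integer. Let’s try providing the pattern_id within quotes, as a string, instead.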

# R

pattern_details <- get_pattern_details(authUsername, authPassword, pattern_id = "1168477")

pattern_details %>%
  mutate(across(where(is.numeric), ~round(., 1))) %>% # Round columns for cleaner view
  select(name, yardage, created_at, difficulty_average, favorites_count, projects_count, rating_average) %>%
  datatable()

Great, that worked with a simple fix.
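
To see why the input type can matter, here is a small Python sketch of what happens when a float ends up in a request path, along with a defensive helper that would accept either form. This is an illustration under assumptions: the helper normalize_pattern_id and the path template are hypothetical, and the real function in python_api_functions.py may build its request differently.

```python
# Python
def normalize_pattern_id(pattern_id) -> str:
    """Return a pattern ID as a plain digit string, whatever type it arrived as."""
    if isinstance(pattern_id, float):
        if not pattern_id.is_integer():
            raise ValueError(f"pattern_id must be a whole number, got {pattern_id!r}")
        pattern_id = int(pattern_id)
    return str(pattern_id).strip()


# The path template below is hypothetical, for illustration only.
print(f"/patterns/{1168477.0}.json")                        # /patterns/1168477.0.json
print(f"/patterns/{normalize_pattern_id(1168477.0)}.json")  # /patterns/1168477.json
```

A helper like this on the Python side would make the function tolerant of callers that send a float, though passing a string from R, as done above, works without touching the Python code.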

Combine and iterate over API calls

Now let’s say I want to find out the yardage of yarn I will need for all the projects in my queue. For this I need to:

  1. Get my queue with project_ids using the get_queue() function API call.
  2. Take the pattern_ids from the queue and en-quote them before passing into the get_pattern_details() function API call for the yardage information. For this I created a handy R function called get_pattern_details_from_pattern_id() as below.
  3. Pass a list of pattern_ids one item at a time into the function call using purrr::map.

# R

# Function to take a pattern_id input, convert it to a string, and pass it to the get_pattern_details API call function
get_pattern_details_from_pattern_id <- function(z) {
  zid = paste0("", z, "") # coerce the pattern_id to a character string, so Python receives a str
  get_pattern_details(authUsername, authPassword, pattern_id = zid) # call the function with the string pattern_id
}

# Split the queue, and pass each pattern_id into the above function using map, then reduce the split dataset to a single dataframe
queue_yardage_df <- project_queue %>%
  split(.$id) %>% # Split rowwise into a list of dataframes
  map(~get_pattern_details_from_pattern_id(select(., pattern_id) %>% pull())) %>% # get pattern_id details for each row in the list
  map(~select(., id, name, yardage)) %>% # Select required columns from each data frame
  reduce(rbind) # Combine the split data frame list back

yardage_needed <- queue_yardage_df %>%
  unnest(yardage) %>%
  summarise(sum(yardage, na.rm = TRUE)) %>%
  pull()

queue_yardage_df %>%
  arrange(desc(yardage)) %>%
  head(4) %>%
  datatable()

Now I can sum up the yardage to find out whether I have enough yarn in my stash to finish all the projects in my queue. Looks like I need over 15,000 yards, and I definitely have that much!!!
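
The same look-up-and-sum logic can be sketched in plain Python, which may help if a teammate wants to do this step before handing data to R. Everything here is illustrative: total_yardage is a hypothetical helper, and the IDs and yardage values are invented stand-ins for real API responses.

```python
# Python
def total_yardage(pattern_ids, get_details):
    """Sum yardage across patterns, skipping patterns with no yardage recorded."""
    total = 0
    for pattern_id in pattern_ids:
        details = get_details(str(pattern_id))  # pass the ID as a string
        yardage = details.get("yardage")
        if yardage is not None:                 # mirrors na.rm = TRUE in the R code
            total += yardage
    return total


# Made-up stand-in for the API call; IDs and yardage values are invented.
fake_patterns = {
    "101": {"yardage": 400},
    "102": {"yardage": None},
    "103": {"yardage": 1250},
}
print(total_yardage([101, 102, 103], fake_patterns.__getitem__))  # 1650
```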

It seems that I can make everything in my queue. But do I have enough yardage to make every pattern that I have ever favorited? Let’s find out.

Favorite patterns: Making an API call over multiple pages

Another handy Python function that Riesling wrote is called “get_favorites()”. Similar to the previous get_queue function, it takes an input of a Ravelry username and returns the projects in their Favorites list. My Ravelry username is ‘yarnsandcoffee’, so let’s look at all the patterns I have ever favorited by calling this Python function from R. Note that for this API call, you need a personal account access app username and password, which is what I am using for my analysis.

Below is what the skeleton of the get_favorites() function looks like:

# Python 

# This returns all the favorites from a given person's rav_username
def get_favorites(authUsername, authPassword, rav_username, ..., page = 1, page_size = 100) -> pd.core.frame.DataFrame:
    ...
    ...
    ...
    return pd.json_normalize(json.loads(r1.text)['favorites'], max_level = 1)

A limitation of the API calls is that the information is returned one page at a time and each page has a limited number of entities. For instance, the above function will return the latest 100 patterns in my “Favorites” list. However, I know that I have over 1,300 favorites. So let’s iterate the “get_favorites()” function call over the first 14 pages of 100 patterns each.

# R

pagelist <- 1:14

project_favorite <- pagelist %>%
  map_dfr(~(get_favorites(authUsername, authPassword, rav_username = 'yarnsandcoffee', page = .x) %>%
    select(id, created_at, favorited.id, favorited.name))) %>%
  mutate(pattern_id = favorited.id)

project_favorite %>%
  select(created_at, pattern_id, favorited.name) %>%
  datatable()

The above table shows that there are a total of 1,381 favorite entries in my list and by iterating over multiple API calls to the same function we were able to get the entire list.
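
Hard-coding 14 pages works when you already know the size of the list. If you do not, one alternative is to keep requesting pages until a short or empty page comes back. The sketch below shows that idea in Python with a stand-in fetch function; fetch_all_pages and fake_fetch are hypothetical names, and the data is invented so the example runs offline.

```python
# Python
def fetch_all_pages(fetch_page, page_size=100, max_pages=50):
    """Call fetch_page(page, page_size) until a short or empty page is returned."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, page_size)
        items.extend(batch)
        if len(batch) < page_size:  # last page reached
            break
    return items


# Stand-in for the real API call: serves slices of an invented list of 250 IDs.
fake_favorites = list(range(250))


def fake_fetch(page, page_size):
    start = (page - 1) * page_size
    return fake_favorites[start:start + page_size]


print(len(fetch_all_pages(fake_fetch)))  # 250
```

The max_pages cap is a safety net so that a misbehaving endpoint cannot cause an endless loop.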

Bypassing iteration errors by using “possibly()”

Now that I have a list of my favorites, I will repeat the exercise I did above for my queue yardage, but this time for my favorites list. Spoiler alert: while iterating the entire list of my favorited patterns through the get_pattern_details_from_pattern_id() function using map, I will get an error for a few pattern_id API function calls, but we will see how to easily bypass that error.

# R

# Split the favorites, and pass each pattern_id into the above function using map
favorite_yardage_df <- project_favorite %>%
  split(.$id) %>% # Split rowwise into a list of data frames
  map(~get_pattern_details_from_pattern_id(select(., pattern_id) %>% pull())) # get pattern_id details

When running the code above, map will run through some of the pattern ids correctly. In case there is an issue with the function call for any pattern_id, map() will stop iterating and may throw an error with the following message:

Error received while calling the map function over all pattern IDs.

Let’s instead create a “safe” version of the function using the “possibly()” wrapper, so that map can keep iterating when an error is encountered for a single pattern_id. We want map to bypass that error and continue with the next pattern_id in the list.

While bypassing the error, possibly() also lets me supply an alternate return value through the “otherwise” parameter. In this case I am going to ask possibly() to return NULL if there is an error. Now when I use this function to iterate through my favorited pattern_ids, an error in an API call will not halt the iteration, which will instead continue with the next element in the list. I can then also easily calculate the number of IDs that gave me an error, or list the specific IDs that did so.

# R

# Wrap the function to get pattern details with possibly()
safer_get_pattern_details_from_pattern_id <- possibly(get_pattern_details_from_pattern_id, otherwise = NULL)

# Call the safer function to get pattern details for all the projects in my favorites list.
# Note that this command took a long time to run because I called the API over 1,000 times.
project_favorite_details <- project_favorite %>%
  # head(200) %>%
  split(.$id) %>%
  map(~select(., pattern_id) %>% distinct()) %>%
  map(~safer_get_pattern_details_from_pattern_id(select(., pattern_id) %>% pull()))

# Calculate the number of nulls returned by the safer function
number_of_errors <- project_favorite_details %>%
  keep(~is.null(.x)) %>%
  length()

# The list of ids that returned null
ids_with_errors <- project_favorite_details %>%
  keep(~is.null(.x)) %>%
  names()

ids_with_errors
# output

[1] "329666895" "361997652" "374122847" "383248046" "399779788" "405266136" "415107497"

Looks like there were a few IDs that had an error during the function call and returned NULL instead. Relative to the more than 1,000 patterns in my entire list of favorites, this is not too many errors. Hence, I will skip these IDs and go ahead with calculating the yardage of yarn I need to make everything else in my favorites list. I can use the purrr::compact() function call to gather only the non-NULL outputs and continue with the rest of yardage calculation similar to what I did with my project queue.

# R

favorites_yardage_df <- compact(project_favorite_details) %>%
  map(~select(., id, name, yardage)) %>% # Select required columns from each data frame
  reduce(rbind) # Combine the split data frame

favorites_yardage_needed <- favorites_yardage_df %>%
  unnest(yardage) %>%
  summarize(sum(yardage, na.rm = TRUE)) %>%
  pull() %>%
  scales::number(big.mark = ",")

favorites_yardage_df %>%
  unnest(yardage) %>%
  arrange(desc(yardage)) %>%
  head() %>%
  datatable()

So, it seems like I need greater than a million yards of yarn to make every project in my favorites list. Ummmm…. That’s A LOT! (Although I would not be surprised if I already have that much in my yarn stash.)
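
For readers on the Python side, the possibly() pattern used in this section can be approximated with a small wrapper. This is a sketch of the idea rather than a port of purrr: the possibly function below is my own stand-in, and fake_get_details is an invented replacement for the API call so the example runs offline.

```python
# Python
import functools


def possibly(func, otherwise=None):
    """Wrap func so that any exception returns `otherwise` instead of propagating."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return otherwise
    return wrapper


# Invented stand-in for an API call that fails for some IDs.
def fake_get_details(pattern_id):
    if pattern_id == "bad":
        raise ValueError("Expecting value: line 1 column 1 (char 0)")
    return {"id": pattern_id, "yardage": 100}


safe_get_details = possibly(fake_get_details, otherwise=None)
results = {pid: safe_get_details(pid) for pid in ["1", "bad", "2"]}

# Equivalent of keep(is.null) and compact() in the R code above.
ids_with_errors = [pid for pid, value in results.items() if value is None]
successful = {pid: value for pid, value in results.items() if value is not None}
print(ids_with_errors, len(successful))  # ['bad'] 2
```

Returning None for failures keeps the iteration going, and the failed IDs stay easy to list afterward for a retry or a closer look.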

What we covered in this article

We covered a lot in this article. In summary, we described how to:

  • Easily work with Python and R together and use the best of both worlds with Reticulate over multiple function calls.
  • Use dotenv to store and access secret keys from .env files.
  • Iterate through API function calls using purrr.
  • Handle errors while accessing API calls by changing input types or calling functions safely with possibly() wrappers.

What we learned about integrating Python and R

  • It is very easy to have a team of people working with whatever language they are comfortable with while still collaborating with each other.
  • In this scenario, Riesling wrote the API call functions in Python and I pulled those outputs by calling the functions from within R.
  • Riesling and I continued coding in two different files. I did not need to know the contents of the Python file or its functions to be able to call them from R; all I needed to know were the inputs and outputs of those function calls.
  • If Riesling made any edits to the internal mechanism of the Python functions, I would not see an impact. My R code to call those functions would not fail unless there were any major schema changes (which are good to communicate).
  • Even if a single person wants to do some of their coding in Python and some in R, it is very easy to pass variables from one to the other within the same file or across separate files.

A challenge we faced:

  • If you remember from our previous article, “How to access an API for first-time API users,” Riesling put all of the Ravelry functions within a class, with the API key stored as an attribute of the class. This way, the API key does not need to be an input parameter for each function. We tried to use this class from our R code through Reticulate, but got an error.
  • As we mentioned earlier, Reticulate dynamically changed the output of our functions from a Python dataframe to an R dataframe. Because we defined a new class called ‘raveleryutils’ in Python, there was no R equivalent, so Reticulate did not know what to convert this object into.
  • There might be a way to leverage Reticulate to use classes defined in Python within the R environment, but for the sake of time, we simply removed the need for the ‘raveleryutils’ class in our code.
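
To make the trade-off concrete, here is a rough Python sketch of the two designs. It is not the code from the repo: RavelryClient, request_for, the base URL, and the path are assumptions for illustration, and the methods only describe a request rather than sending one, so the example runs offline.

```python
# Python
class RavelryClient:
    """Holds credentials so individual methods do not need them as parameters."""

    def __init__(self, auth_username: str, auth_password: str):
        self.auth = (auth_username, auth_password)

    def request_for(self, path: str) -> dict:
        # Describe the request instead of sending it, to keep the sketch offline.
        return {"url": f"https://api.ravelry.com/{path}", "auth": self.auth}


def request_for(authUsername: str, authPassword: str, path: str) -> dict:
    """Plain-function wrapper: callers pass credentials and never see the class."""
    return RavelryClient(authUsername, authPassword).request_for(path)


print(request_for("user", "pass", "yarn_weights.json")["url"])
```

With the function-style wrapper, the caller only ever exchanges strings and simple return values, which is the shape of interface we ended up using between the Python file and the R code.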

Our biggest learning is that multi-language talent across a team is a strength, and Reticulate makes it very easy to put that strength to work in end-to-end projects. There is so much more that this package can do that we have not covered in this article, or really even learned for ourselves, like how to define and use classes across languages!

What’s next

Now that we have learned how to call the Ravelry API in Python and how to call those Python functions from R, we will move to the data analysis stage. In the next article of this series, we will dive into analyzing Ravelry queues and favorites using GGPlot to answer some very interesting questions like:

  • What year did Riesling and I start adding projects to our Ravelry queues and what has the yearly trend been since? (Hint: Surprisingly, it was a very similar yearly trend.)
  • What days of the week are Riesling and I most active in adding projects to our Ravelry queues?
  • What type of projects do Riesling and I gravitate toward, and has it changed over the years?
  • Who are our favorite pattern designers?
  • And my favorite question: What yarn weight type should I give Riesling as a gift? (Any guesses?)

Follow Deepsha, Riesling, or Data Science @ Microsoft (all on Medium.com) to see the previous article in this series and stay tuned for the next one.
