Code-switching

Cansu Ergün · Published in HYPATAI · 4 min read · Apr 20, 2021
Imaginary multi-kernel single notebook for working with Python and R at the same time 😜

Just like learning a second natural language, applying what you know in one programming language to another is much easier than learning your very first one from scratch. I felt this sense of déjà vu when I started to use R after working with Python for a long time. Wait! This is almost the same experience as studying Spanish, my second foreign language after English.

Words become like placeholders when you know what you are aiming to do.

That is why I added the ‘lorem ipsum’ humor to the image above and named this story code-switching, a linguistic term that literally 🤓 applies to our case here. Extending this analogy, I can say that working with the pandas library in Python and the dplyr package in R serves almost the same purpose, which I found fun and insightful, so it is practical to be able to use both in order to take full advantage of these languages.
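To make the analogy concrete, here is a tiny, made-up example of the kind of one-to-one mapping I mean; the data frame and column names are invented just for illustration, with rough pandas equivalents as comments:

```r
library(dplyr)

# A made-up data frame, just for illustration
df <- data.frame(name = c("Ada", "Grace", "Edsger"),
                 hours = c(10, 25, 7))

df %>%
  filter(hours > 8) %>%            # pandas: df[df["hours"] > 8]
  mutate(days = hours / 8) %>%     # pandas: df.assign(days=df["hours"] / 8)
  summarise(avg_days = mean(days)) # pandas: df["days"].mean()
```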

Let’s now jump to the learning task I assigned to myself for practicing R, after finishing some online courses on it, and after those super-duper, philosophical, and wise 😄 comments on learning a new programming language.

My task was simply re-implementing what I did in this story (which was about classifying income levels with XGBoost, implemented with the library’s Python API) in R, to gain some hands-on model building and evaluation experience in this new coding environment. However, I preferred not to move far away from the playground I am used to working in, so I implemented my code in Jupyter Notebook using the R kernel. I also downloaded the RStudio IDE, played around with it a little bit, and tried R Markdown (a tool similar to Jupyter Notebook for reporting your work in an organized and interactive way).
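In case you also want to run R inside Jupyter, the setup I used boils down to registering the R kernel via the IRkernel package (assuming Jupyter itself is already installed):

```r
# Run once inside an R session to register the R kernel with Jupyter
install.packages("IRkernel")
IRkernel::installspec()  # makes "R" appear in Jupyter's kernel list
```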

I will not walk through each step of my new R notebook, since I already did that in the story linked above. I will just mention some points that seemed important to me for getting comfortable in this new language and R environment. If interested, please visit here, the link pointing to my GitHub account, to see the whole content.

The exciting part after importing and preparing the relevant data was the model building phase, and especially its results, which made me feel proud ✌️ since there seemed to be full consistency between the two implementations. Namely:

Comparison of training results
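I won’t paste the whole notebook here (it’s on GitHub), but the heart of the R side looks roughly like the sketch below; the data frame names and parameter values are placeholders for illustration, not the exact ones from the original story:

```r
library(xgboost)

# Placeholders: train_x / train_y stand in for the prepared income data
dtrain <- xgb.DMatrix(data = as.matrix(train_x), label = train_y)

params <- list(objective = "binary:logistic",  # binary income-level target
               eval_metric = "auc",
               eta = 0.1, max_depth = 6)       # illustrative values only

set.seed(42)  # same seed number as the Python side, for comparability
model <- xgb.train(params = params, data = dtrain, nrounds = 100)
```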

Plots of the ROC curves were also aligned:

Comparison of ROC Curves on the validation set
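For reference, computing the ROC curve and AUC on the R side can be done with the pROC package; valid_x and valid_y below are placeholders for the prepared validation data:

```r
library(pROC)

# Predicted probabilities on the validation set
preds <- predict(model, xgb.DMatrix(as.matrix(valid_x)))

roc_obj <- roc(valid_y, preds)
auc(roc_obj)  # should match the Python side if everything is consistent
```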

Unfortunately, after this point I noticed that the optimal threshold values maximizing TPR − FPR were not equal, as in the following view:

Inconsistent threshold values on the validation set
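By the way, the threshold maximizing TPR − FPR is exactly Youden’s J statistic, and pROC can return it directly, which is roughly what I did (roc_obj comes from the previous snippet):

```r
# Find the threshold maximizing TPR - FPR (Youden's J statistic)
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```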

As I searched for the root cause, I saw that the predictions coming from the model trained in Python were different from those produced by the R model. Possible reasons could be:

  • The versions of the XGBoost package are not the same on both sides. Unfortunately, I could not reproduce the results in my Python notebook, since I had recently upgraded my Python XGBoost package for another task. (Note to self: use a virtual environment for each task.) A quick way to compare versions is sketched after this list.
  • Although the data and the seed number are the same on both sides, maybe R consumes the seed in a slightly different way, which creates some divergence. I actually found a similar issue here, so this could also be the reason.
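If you hit the same issue, the first things worth checking are the package versions on both sides and how each side uses the seed; a rough sketch:

```r
# R side: check the installed xgboost version
packageVersion("xgboost")
# Python side, for comparison:
#   import xgboost; print(xgboost.__version__)

# Note on seeds: the R xgboost package draws its randomness from R's
# own RNG (seeded via set.seed()), while the Python package takes a
# seed parameter of its own, so the same seed number does not
# guarantee the same random stream on both sides.
set.seed(42)
```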

Normally I cannot calm down after such a disagreement; however, this time I gave up on finding the main reason behind it, since my goal was getting used to R syntax and gaining some model building experience with it. Besides, I did not have enough time for such an investigation. Producing the same plots in R using ggplot was a challenging task on its own, and I must confess I typed a lot of ‘how-to’s into Google. 🔍 🐶
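For the curious, here is a stripped-down version of the kind of ggplot code those searches led me to; roc_obj is the pROC object from the earlier snippet:

```r
library(ggplot2)

# Build a small data frame of FPR/TPR points from the pROC object
roc_df <- data.frame(fpr = 1 - roc_obj$specificities,
                     tpr = roc_obj$sensitivities)

ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line(colour = "steelblue") +
  geom_abline(linetype = "dashed") +   # chance line
  labs(title = "ROC Curve (validation set)",
       x = "False Positive Rate", y = "True Positive Rate")
```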

That’s why I continued, expecting to see other differences in the upcoming steps. Well, at least those differences were small and the results were largely consistent. It is normally unacceptable to move on after detecting something like this, for example when your outputs differ between the development and production environments. But this was just a personal exercise I challenged myself to finish, and coming across such issues is a valuable and important part of learning, so I fully embrace my experience. 😇

Next, let’s look at the confusion matrices generated on both sides and finish our story. As I mentioned above, the differences are not huge anyway:

Small differences in predicted income levels on the validation set
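Producing such a confusion matrix on the R side boils down to thresholding the predicted probabilities and cross-tabulating; best_threshold below stands for the value returned by the coords() call earlier, and the other names are the same placeholders as before:

```r
# Turn probabilities into class labels using the chosen threshold,
# then cross-tabulate them against the true labels
pred_class <- ifelse(preds > best_threshold, 1, 0)
table(Predicted = pred_class, Actual = valid_y)
```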

Please share any personal experience with R in the comments section. 🙏

Lorem ipsum dolor sit amet! 🙋🏻‍♀️
