
Tech Stack Agnosticism

Many data scientists take pride in decorating their laptops with stickers of specific tech tools, a way of displaying both allegiance and hard-earned skills. However, is a tech stack's popularity likely to outlast the average laptop's lifetime? And what would this mean for the average data scientist?

Mehrdad Mamaghani
Swedbank AI
5 min read · Nov 8, 2019


From the Wiktionary entry we learn that agnosticism is:

The view that absolute truth or ultimate certainty is unattainable, especially regarding knowledge not based on experience or perceivable phenomena.

At the turn of the millennium, many data-scientists-to-be (statisticians, physicists, and bioinformaticians, to name a few) had rather few alternatives when looking for tools for statistical data analysis. Often, the choice came down to what was already available on campus or at the company. Comprehensive open source tools were almost nonexistent, although students and hobby practitioners had plenty of academic licenses and other ways to circumvent the heavy subscription fees. Regardless of the choice, however, one could be certain that the tools for statistical data analysis available at the time offered sufficiently good and reliable functionality and techniques.

Once this choice of tool was made, it was usually very difficult to convince a fellow data practitioner to try another one. The usual reasons for this persistence were the cons of the competing tool, the cherry-picked pros of the familiar tool, and a general reluctance to convert after having invested so much time and energy.
More often than not, the only thing that convinced a data practitioner to reconsider their choice was license termination or a change of employer.

This somewhat religious persistence stretches into our days, even as dominant academic and enterprise software solutions for data analysis such as SAS and Matlab have been replaced by open source libraries in R and Python.
Indeed, not everything is out of order with a healthy dose of pride and stubbornness. A data scientist with extreme proficiency in Python, but who finds R code as impenetrable as Klingon, should not be forced to change language overnight for the sake of agnosticism. Nevertheless, there are many instances in which the advancements and capabilities of one language's ecosystem can significantly enhance the understanding of practices and techniques that are absent from another language's repertoire.

A clear example of the above is the TensorFlow library, whose primary API is available via Python. Although the release of TensorFlow 2.0 allows for far greater cross-lingual code compatibility, the TensorFlow implementation in R has historically offered only partial access to its capabilities, and it still demands a Python back-end and libraries. Thus, for many R users, in-depth experimentation with deep learning techniques used to be doubly challenging: most tutorials were written in Python, and the R implementation did not always allow for one-to-one code translation.
Conversely, R has a vast spectrum of libraries and tools for rigorous statistical and bioinformatic analyses that remain inaccessible to those outside the R community.

The above merely serves to illustrate a few examples. Similar patterns are abundant for other languages, e.g. Julia and Scala, and between libraries, e.g. deep learning in TensorFlow and PyTorch. Moreover, the scope of such cross-lingual connections is likely to deepen given the increasing integration of data science stacks with deployment and DevOps tools.

Multilingual data science teams

Our belief is that the capability to handle several programming languages is a necessity for more efficient, better informed, and future-proof technical solutions.
Multilingualism in programming does not necessarily mean mastering multiple languages simultaneously; rather, it implies having a sufficiently stable foundation in the relevant languages to increase creativity and accelerate the exchange of ideas. Multilingualism should be a means, not the goal.

Much as for individuals who command languages other than their mother tongue, the richness of history, ideas, proverbs, and vocabulary of each language tends to cross-pollinate into the others, enhancing imagination and disruptive thinking.

In data science teams, this can be achieved by creating tightly knit multilingual teams, arranging mini-hackathons in “new” languages, or peer review of code across the language isles.

Multilingualism at Swedbank

When it comes to larger business-oriented cases at Swedbank, multilingualism is the routine rather than the exception.
The primary exploratory phases involve iterations in Hive SQL, followed by R and Python for more in-depth analyses. Once an analytical skeleton is in place, PySpark becomes the dominant tool until deployment concerns, i.e. containerization and monitoring, take over.
Custom flavors include the use of Java and Scala where deep learning and streaming techniques are employed.
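The exploratory phase of such a workflow can be sketched in miniature. In the snippet below, sqlite3 stands in for Hive SQL and plain Python for the follow-up analysis; the table and data are invented. The pattern it shows is the hand-off: aggregate in SQL first, then continue in a general-purpose language:

```python
# Miniature sketch of an SQL-first exploratory workflow.
# sqlite3 stands in for Hive SQL; the table and data are invented.
import sqlite3
from statistics import mean

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("alice", 120.0), ("alice", 80.0), ("bob", 50.0), ("bob", 150.0)],
)

# Step 1: aggregate in SQL, as one would in Hive.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM transactions "
    "GROUP BY customer ORDER BY customer"
).fetchall()

# Step 2: hand the aggregates over to Python for further analysis.
totals = {customer: total for customer, total in rows}
print(totals)                  # per-customer totals
print(mean(totals.values()))  # average total spend
```

In practice the same shape survives the move to PySpark: the SQL step becomes a Spark SQL query over Hive tables, and the Python step becomes DataFrame operations, which is why fluency in both halves matters.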

The “Most Popular Programming Language” charts and why they should be taken with a grain of salt

Some readers might still feel some inner resistance to the content of this post. One image that comes to mind is the “Most Popular Programming Language” chart. We all know what these charts look like. Indeed, they do serve a purpose and contain interesting information. However, such charts should not determine whether a data scientist places all their eggs in one language basket.

More than anything, the reader or aspiring data scientist should ask: i) what sort of tasks they are interested in, ii) what type of skills they aim to develop, iii) which language or tech stack offers the best tools for solving their specific problems, and iv) looking further ahead, how a solution could best serve business needs and be easily deployed.
More often than not, the optimal setup is likely to include more than one language.

Additionally, the current popularity of different languages is arguably correlated with common academic practices across disciplines; for instance, there are many more undergraduates in computer science and physics than in mathematical statistics. Such uneven numbers are a plausible source of bias in existing surveys on the popularity of programming languages.

In conclusion, with increasingly short release cycles and a growing number of technological disruptions, data science practitioners are likely, sooner or later, to be confronted with a wholesale reconsideration of their tool set. The best way to foresee such larger transformations, and to vaccinate against tech stack lock-in, is to exercise an agnostic, multilingual approach to data science practices and programming languages.
