Data Science in 2020: Technology
Applying Data Science to Data Science: a deep dive into the best-loved technologies in the world of Data Science in 2020.
This article takes a look at what the online Data Science community has been writing about over the last couple of years, with the aim of understanding the trends in technology over the course of 2020. To do this, I’ve sampled roughly 30,000 unique Data Science stories from across Medium between January 2019 and mid-December 2020.
This article is broken into two parts:
- Technology — This section takes a deep-dive into the technologies that the Data Science world has been writing about and responding to this year. It includes rankings for popular software tools, programming languages and platforms, plus some commentary too.
- Community — The Data Science community is historically vibrant and ambitious, but this year has been tough for many of us. This section looks at how the writing part of the community has responded to the events this year in terms of publications, quality and activity.
This post (Part 1) deals with Technology. Part 2 of this article will deal with Community and will be released in the coming weeks.
Technologies come and go in any field, but within a field such as Data Science that is still in the process of establishing itself and its ‘standard’ tools and methodologies, this is particularly true. New technologies sweep in on a wave of hype, and just as many fall silently back into obscurity. Given how fast the field is moving, understanding which technologies the community is discussing may give us a leading indicator of where the field is going to end up in the medium term.
That’s what this section is about: what technologies has the Data Science community (on Medium) been discussing over the last year? And which technologies regularly garner the most enthusiasm?
To keep things simple, I’ve split the technologies I’m considering into the following areas:
- Software tools — Tooling and libraries that constitute the modern Data Science software stack. These tools and libraries are taken from across multiple languages.
- Programming languages — Programming languages in common use with Data Science professionals.
- Cloud Platforms — Platforms designed to support the development of cloud infrastructure.
- Data Science Platforms — Platforms designed for the express purpose of facilitating modern Data Science workflows.
I’ve also adopted two ranking approaches when evaluating these areas. The first is simply ranking by the absolute number of mentions (i.e. the number of articles a technology is referenced in). This is fine for seeing how often a given tool or language surfaces in the sampled articles, and goes some way to indicating the ‘penetration’ of a technology, but it doesn’t necessarily highlight what the community is interested in.
To try and get a better grasp on this latter point, I’ve adopted a second ranking scheme that ranks a given set of technologies by their median popularity (in Medium speak: claps). This is still an imperfect measure, but tends to give a better indication of the relative popularity of a specific technology in terms of the enthusiasm shown towards it.
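As a sketch of the two ranking schemes, assuming a hypothetical table of (tool, claps) mention records (the tools and clap counts below are illustrative, not real data from the corpus):

```python
import pandas as pd

# Hypothetical sample: one row per (article, tool) mention with its clap count.
mentions = pd.DataFrame({
    "tool": ["pandas", "pandas", "streamlit", "streamlit", "docker"],
    "claps": [120, 40, 900, 450, 300],
})

# Scheme 1: rank by absolute mentions ('penetration').
by_mentions = mentions["tool"].value_counts()

# Scheme 2: rank by median claps ('enthusiasm').
by_median_claps = mentions.groupby("tool")["claps"].median().sort_values(ascending=False)
```

Under the first scheme, frequently-mentioned tools float to the top regardless of reception; under the second, a rarely-mentioned but consistently well-received tool can outrank them.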
Let’s dig in.
First up, software tools! For many Data Scientists, applying these tools to business problems makes up a good proportion of their working lives. I’ve drawn the tools considered here from a manually curated list of common search terms and of tools regularly discussed in the ever-popular ‘top 10’ blog format. This leaves me open to potential biases in my data collection, but I’ve made an effort to give equal attention to tools from across the Data Science software ecosystem.
Most mentioned tools
Let’s start by looking at the most commonly mentioned tools. Here’s the top 10 most mentioned technologies for 2020:
There was no change in the top 10 between 2019 and 2020. This is unsurprising in the sense that these tools are about as close to perennially popular as technologies in this field get. It’s also common for new Data Science bloggers to begin by writing introductory articles on precisely these technologies, which may explain the regularity of their mentions. Still, it is pretty striking that, by my count, pandas – a tabular data manipulation library for Python – appears in fully 15% of all articles published in the last year.
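A minimal sketch of how article-level mention counts like this can be derived, assuming a list of article texts (the tiny corpus here is illustrative, standing in for the ~30,000 sampled stories):

```python
import re

# Illustrative mini-corpus of article texts.
articles = [
    "An introduction to pandas for tabular data.",
    "Deploying models with Docker and Flask.",
    "Pandas tips and tricks for faster DataFrames.",
]

tools = ["pandas", "docker", "flask"]

# Count the number of *articles* each tool appears in (not total occurrences),
# using case-insensitive word-boundary matching to avoid partial-word hits.
counts = {
    tool: sum(bool(re.search(rf"\b{re.escape(tool)}\b", text, re.IGNORECASE))
              for text in articles)
    for tool in tools
}

# Express each count as a share of all sampled articles.
share = {tool: n / len(articles) for tool, n in counts.items()}
```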
Most popular tools
While the top tools by mentions have been pretty static over time, their popularity has been much more variable. Here’s the top 10 technologies in this case:
There are no huge surprises in this ranking, but there are some interesting trends. The ranking is skewed towards ‘infrastructure technologies’: technologies used to build and deploy productionised ML systems. I consider fully six tools in this list as falling into this category (Docker, Flask, Kubernetes, Kubeflow, Airflow, Dask). This is particularly noticeable given the relative lack of ‘actual’ ML technologies in the list, with only PyTorch and PyCaret (an interesting new ‘low-code’ ML library) making the cut. The list is rounded out by two visualisation tools: bokeh and Streamlit.
It isn’t surprising to see Streamlit sitting at the top of these rankings. In terms of buzz, Streamlit is one of the bigger Data Science technology stories of the last couple of years. Personally, I think it’s a fantastic bit of kit. It has made its way into many of my personal and professional projects over the last year, and it’s a joy to use. If you’d like to find out more about Streamlit and how to get started, here’s an article introducing it with a short tutorial on deploying Streamlit apps to Google Cloud:
Deploying Streamlit Apps to GCP
Streamlit is a minimal, modern data visualization framework for Python. Learn how to deploy a Streamlit app to GCP.
For me, the presence of the infrastructure tools on this list is particularly interesting. Anecdotally, there seems to have been a consistent rise in the interest in and awareness of ML Ops/ML Engineering-type concepts and tools within the broader Data Science community over the last couple of years, and this may be reflected here: the presence of modern software engineering tools such as Docker and Kubernetes, plus data/ML workflow management tools like Airflow and Kubeflow certainly point in that direction. Is the Data Science community at the start of a speciation event where the analysts split off from the engineers?
It’s also interesting to see Dask creep into this list. If you’re unaware, Dask provides distributed analytics capabilities for users of the Python numerical and scientific ecosystem. In other words: it is somewhat similar to the older (and heavier-duty) Spark in its ability to handle very large volumes of data, but unlike Spark, it provides great native interoperability with the well-established Python scientific computing tools. From this perspective, it’s an attractive proposition for practitioners unfamiliar with the Spark and Java ecosystem, but familiar with the ‘core’ Python scientific computing toolkit. I suspect Dask’s popularity within the community will continue to grow in the coming years.
Programming languages are one of the more contentious topics within the Data Science community. Everyone seems to have their own favourite, and we’re collectively quite quick to let others know. I suspect the ‘language wars’ typified by certain tired language feature comparisons are overwrought: the world has space for them all! For this section, I’ve drawn the languages considered in the analysis from the top 50 of the TIOBE language rankings.
Most discussed languages
Let’s start with the simplest of rankings: how often were different languages discussed over the last year? Here’s the top 5 languages from 2020 by mention:
As you may expect, articles referencing Python tend to dominate the conversation. This is unsurprising: Python is an ideal starter language for new Data Science fans and professional Data Science and engineering teams alike. It sits at something of a sweet spot between productivity and capability: it has a rich and relatively mature ecosystem of libraries and tools that cover most modern software use-cases, from cloud computing, through to Internet of Things applications, and — most relevant here — scientific and numerical computing too. In other words: it has a huge, active community of varying levels of skill and experience, which goes some way to explaining the preponderance of mentions in Data Science blogs.
Perhaps more surprisingly, the Java programming language pips R to the post for second place on this list. In the ‘modern’ Data Science toolkit, Java (and its unranked sister language Scala) is often regarded as the language of choice when tackling ‘Big Data’ problems. In a similar way that Python can act as a ‘common language’ for software engineers and Data Scientists, Java can play that role for ‘Big Data’ and data engineering applications. It’s a performant language with a strong pedigree and a very mature ecosystem. So long as the likes of Spark (and Databricks) continue to grow in usage, its popularity is likely to continue growing within the community too.
Of the remaining three languages in this top 5, R is the least surprising entry. As a language built from inception for statistical computing, it has some excellent features and tends to excel at exploring and modelling tabular data. For good or ill, it doesn’t have the generality (and corresponding ecosystem) or mass following of Python, but it’s invaluable for use-cases in the niche it fills. By my count, it’s seen a slight decrease in mentions this year relative to the other languages in the top 5, and it’s also grown the slowest of the five.
It’s hard to tell if this is part of a broader slow-down in the R community, though a cursory glance at CRAN (R’s open-source package registry) suggests that the number of new contributions (i.e. new packages) on CRAN has been trending down over the last couple of years. To crystallise this a bit, here’s a chart showing the number of new packages released from a sample of ~3,500 packages on CRAN (~20% of the total number of published packages):
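As a sketch of the underlying calculation, assuming a series of first-release dates for the sampled packages (the dates here are made up for illustration):

```python
import pandas as pd

# Hypothetical first-release dates for a handful of sampled CRAN packages;
# in practice these would be scraped from the package archives.
first_release = pd.Series(
    pd.to_datetime(["2017-03-01", "2018-06-15", "2018-09-30", "2019-02-10", "2020-05-20"])
)

# New packages per year: group each package's first-release date by year and count.
new_per_year = first_release.groupby(first_release.dt.year).size()
```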
It’s worth noting that the decline could be a sign of changes in the R ecosystem which may well be positive (consolidation towards standard tools, for example), though such a precipitous decline in new packages is unlikely to augur rapid growth in its user-base/community either way.
Most popular languages
While the top 5 languages by mentions alone have remained unchanged over the last year, the most popular languages (by median popularity of articles referencing the language) have changed a good deal more. Here’s the top 5 most popular languages from 2020:
To dig into this, I’ve done some topic and style extraction on the corpus and assigned each article a primary topic and a style. Without going into too much detail, the ‘optimal’ numbers of topics and styles were found through Topic Stability Analysis (using a TF-IDF and NMF extraction pipeline) and Silhouette Analysis (using hand-crafted features and K-Means clustering) respectively. This process identified five consistent topics and four predominant styles, shown in the tables below.
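A minimal sketch of the topic half of such a pipeline, using scikit-learn (the corpus, topic count and parameters here are illustrative, not the ones used for the article):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Tiny illustrative corpus standing in for the ~30,000 sampled stories.
docs = [
    "neural networks and deep learning with pytorch",
    "bayesian statistics and probability distributions",
    "deploying machine learning models with docker and kubernetes",
    "probability theory for statistics and inference",
    "training deep neural networks on gpus",
    "containers kubernetes and production deployment",
]

# TF-IDF features, then non-negative matrix factorisation into topic weights.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

# Three topics suit this toy corpus; for the article the count was chosen
# via stability analysis rather than by hand.
nmf = NMF(n_components=3, random_state=0)
doc_topics = nmf.fit_transform(X)

# Each article's primary topic is the component with the largest weight.
primary_topic = doc_topics.argmax(axis=1)
```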
If we take a look at the topics I’ve associated with articles mentioning each of the languages, the proportion of articles associated with each topic looks like this:
As you might expect, these language rankings are (partially) a function of topic, style and novelty. On these measures, both Python and R lose out, partly due to their ubiquity, but also due to the styles of writing common to each camp. With statistics & probability being the least popular of the identified topic groups, it’s unsurprising — if a bit sad — that R doesn’t get the visibility it may well deserve. This correlation can also be seen by comparing the predominant R and Python styles against the corpus as a whole:
From this we can see that Python has an excess of ‘Filler’-style articles (short, low-content articles — the second least popular variety) and a below-average number of ‘Informative’ articles (longer, content-rich articles — the most popular style). This makes sense: writing about some well-trodden quirk of Python is a common first-blog topic for beginners, and there are a huge number of these posts in the corpus.
In contrast, R has an above-average number of ‘Technical’-style articles (technically dense and typically inaccessible — the least popular article style). A high proportion of these are also relatively technical statistics pieces, so R articles tend to get less traction with the broader Data Science community.
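The style half of the pipeline can be sketched in a similar way: hand-crafted stylistic features, clustered with K-Means (the features and values below are illustrative assumptions, not the article’s actual feature set):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hand-crafted style features per article (illustrative): word count,
# number of code blocks, and average sentence length.
features = np.array([
    [300,  0,  12.0],   # short, low-content 'Filler'-like piece
    [2500, 8,  18.0],   # long, content-rich 'Informative'-like piece
    [3200, 15, 26.0],   # dense 'Technical'-like piece
    [350,  1,  11.0],
    [2400, 7,  19.0],
])

# Standardise so no single feature scale dominates the distance metric.
standardised = (features - features.mean(axis=0)) / features.std(axis=0)

# Cluster into candidate styles; in the article the cluster count was
# chosen via silhouette analysis rather than fixed in advance.
styles = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(standardised)
```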
The most interesting entry on this list for me is Swift. This is Apple’s open-sourced ‘replacement’ for the Objective-C language that underpins most Apple software products. It has recently been adopted as the language of choice for the next generation of lower-level TensorFlow tools. Swift is a good fit for this application: it is a statically typed language with some impressive features that make it fast and (generally) suitable for massive, high-performance applications. If you haven’t heard about it yet, it’s worth reading up on:
Swift for TensorFlow
Swift for TensorFlow is a next generation system for deep learning and differentiable computing.
It seems like everyone is building platforms these days — and in many cases for good reason. For the purposes of analysis, this section breaks the world of platforms into two halves: cloud platforms and Data Science platforms. Here, I’m considering the former to be a more general cloud platform, capable of building and deploying almost any form of modern software, while in the latter I’m considering platforms specifically tailored to facilitate Data Science work, and therefore aimed at Data Scientists as their primary persona. I’ve taken the platforms considered here directly from the Gartner 2020 Data Science Platform Magic Quadrant.
In less than 15 years, ‘The Big Three’ cloud giants have come to power much of the modern web. They also play an important role in the current wave of AI and ML technologies trying to find their way in the market (and in the research community too). By making computing cheap and transient, they’ve allowed many of these technologies to advance more rapidly than they may otherwise have done. It’s now possible to spin up a supercomputer-like machine in seconds and use it for a few dollars an hour. That level of cost and flexibility is pretty much magical.
Beyond the infrastructure itself, The Big Three are also playing a key role in the development of core technologies, with the likes of MxNet (AWS) and TensorFlow (Google) being pillars of their respective cloud’s AI strategy. Each also has a growing range of technical, low-code and no-code ML products in their portfolio, and it seems clear that AI & ML are regarded as central aspects of these platforms’ longer-term strategies. Given the amount of data these applications need to store, the compute resources often needed to process it, and the rate at which both are growing, it’s definitely a smart business move — provided there’s no new AI Winter.
By extension then, the attitudes of the Data Science community are likely to become increasingly important to these platforms, and are already illuminating. Let’s look at popularity here first:
Note: ‘Change (%)’ here (and below) refers to the year-over-year change in the total number of articles published that reference the given platform.
By this ranking, GCP comes out with a commanding lead in terms of popularity, and the number of articles it is mentioned in has grown quickly too. If you take a look at the popularity of the respective clouds over the last two years or so in Figure 2, you can see that GCP has developed a relatively consistent lead in overall popularity over the last quarter.
However, if we flip this and take a look at absolute mentions, the story is a little different. Figure 3 shows the number of mentions of each platform as a proportion of the total number of mentions of cloud platforms. In this case AWS clearly dominates, though there are signs that this lead is slowly being eroded by growth in both GCP and Azure.
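The proportions in Figure 3 boil down to a simple row-wise normalisation; here’s a sketch with hypothetical monthly mention counts:

```python
import pandas as pd

# Hypothetical monthly mention counts per cloud platform.
mentions = pd.DataFrame(
    {"AWS": [50, 48], "GCP": [30, 36], "Azure": [20, 24]},
    index=pd.to_datetime(["2020-11-01", "2020-12-01"]),
)

# Each platform's share of all cloud-platform mentions in a given month:
# divide each row by that month's total, so every row sums to 1.
share = mentions.div(mentions.sum(axis=1), axis=0)
```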
I suspect the competition for mindshare in the Data Science community will heat up in the coming years, particularly as the field continues to mature and ‘the ML stack’ begins to firm up. The majority of prime real-estate there is still up for grabs.
On a personal note, it’s interesting to see the relative popularity of GCP in the community. In my opinion, the onboarding process to GCP is cleaner and more straightforward than both Azure and AWS, the documentation is generally excellent, and the pricing transparent. With the established offerings of the likes of Cloud Run, DataFlow, Firebase (including Firebase ML), and tight integration with the TensorFlow ecosystem, plus their pedigree in the world of ML & AI, my instinct is that GCP is rapidly becoming the most compelling platform for AI infrastructure. I’ve certainly taken to using it in my personal projects too. If you’re interested in reading more in that vein, here’s one of my articles on the subject:
Serverless ML: Deploying Lightweight Models at Scale
Deploying ML models ‘into production’ as scalable APIs can be tricky. This post looks at one option to make model…
Data Science platforms
There have been many analytics ‘platforms’ over the last few decades. Indeed, there are still a number of incumbent analytics companies playing for market share in the shiny ‘new’ Data Science world. However, there’s a growing number of ‘new kids’ too. Perhaps the most established global players of this new breed are Databricks and Domino Data Labs. There are others too, like H2O.ai, Dataiku and DataRobot, all of whom are making a strong play for the ‘Enterprise AI’ crown, each with their own distinct flavour and perspective.
I’ve taken a look at several of these platforms and found Databricks and Domino to offer the most rounded experiences, though H2O.ai and DataRobot offer some interesting low-code features. Does the community agree? Here’s what they’ve been talking about:
Databricks leads the way here, though it’s also seen a smaller relative increase in articles referencing it. In contrast, the number of articles published referencing Domino has grown relatively rapidly, and it’s pretty much neck-and-neck with Databricks in terms of popularity. It’ll be interesting to see what the effect of the rapid growth in the number of publications referencing Dataiku and RapidMiner will be, and whether that’ll increase their relative popularity and awareness over the next few years.
Things look a little different if we look at absolute mentions over time, though, as shown in figure 4. In this case, Domino clearly has the lead in overall mentions. However, it’s also clear that for the last couple of years it has been pretty much a two-horse race: 70% or more of all platform mentions commonly fall to just Domino and Databricks. This could signal a growing duopoly on the Data Science community mindshare. That said, there have been a steadily growing number of articles posted about both Alteryx and DataRobot this year, so it’s far from a done deal.
See you next time
That’s it for this post. Make sure to check out Part 2, which will be available very soon. If you have any questions or feedback, or perhaps some ideas for what else I should have a look in the dataset for, feel free to drop me a message on Twitter or add me on LinkedIn.
This article was originally posted over here: