What software tools do scientists use to analyse data?

Bartosz Paszcza
from science import code
Jan 18, 2017

There are plenty of software packages that can perform data and text analysis, but which ones do scientists actually use? The answer depends strongly on discipline, but Excel, SPSS, R and Python dominate.

To answer this question, I’ll compare two surveys. The first is the Software Sustainability Institute’s (SSI) 2014 survey, sent to researchers at UK Russell Group universities, which received 417 answers. The second is the “101 innovations in scholarly communications” 2015–16 survey, which gathered responses from 14,896 researchers.

The two surveys were structured differently. The SSI survey asked the open question “What software do you use in your research?”, leaving researchers to type their answers. The 101innovations survey, on the other hand, had a list of seventeen questions related to different parts of the academic workflow. Here, I analyse only the question on “tools used for data and text analysis”. The question had seven pre-selected answers and an “others” category, where researchers could specify additional software. Note that this means software tools such as Endnote (which shows up in the SSI survey) are left out: they are not data or text analysis tools and so belong to other questions.

Overall counts

Let’s take a look at the totals. The 101innovations results are presented in the figure below:

Excel seems to be the clear winner, followed by the statistical tools SPSS and R (RTool and ROpenSci). Matlab and IPython environments come next.

How does that compare with the results from the SSI survey? Because the question was open, the results have a long tail, with a total of 566 distinct tools named! However, only 13 tools were chosen by ten or more respondents, so the distribution resembles a power law. Therefore (in the manner typical of clickbait articles on the Web) only the “top 12” tools were selected for analysis:
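
The tallying of open-ended answers described here can be sketched roughly as follows. This is a minimal illustration, not the script used for this post: the `responses` list is made-up toy data, the comma/semicolon answer format and the helper names (`normalise`, `tally_tools`, `frequent_tools`) are assumptions, and the threshold is lowered to 2 so that the toy data produces output (the cut-off mentioned above was ten respondents).

```python
from collections import Counter

# Hypothetical free-text answers, one string per respondent.
# Toy data for illustration only, NOT actual survey results.
responses = [
    "R, Python, Excel",
    "SPSS; excel",
    "matlab, python",
    "R",
    "NVivo, SPSS",
]


def normalise(name):
    """Collapse case and surrounding whitespace so 'excel' and 'Excel ' match."""
    return name.strip().lower()


def tally_tools(answers):
    """Count how many respondents mentioned each tool (once per respondent)."""
    counts = Counter()
    for answer in answers:
        # Assumption: tools are separated by commas or semicolons.
        parts = answer.replace(";", ",").split(",")
        tools = {normalise(p) for p in parts if p.strip()}
        counts.update(tools)
    return counts


def frequent_tools(counts, min_respondents):
    """Keep tools named by at least `min_respondents` people, most common first."""
    return [(tool, n) for tool, n in counts.most_common() if n >= min_respondents]


counts = tally_tools(responses)
top = frequent_tools(counts, min_respondents=2)
print(f"{len(counts)} distinct tools, {len(top)} above the threshold")
print(top)
```

In real free-text data the normalisation step would need far more care (spelling variants, version numbers, tools described in a sentence), which is one reason the long tail of distinct names grows so large.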

R, SPSS, Matlab and Python again make it into the top five. Excel is an interesting case, as it scores significantly lower than in the previously analysed survey. There are two possible explanations. Firstly, the SSI survey gathered only 417 responses, which is too few to draw firm quantitative conclusions (and may also explain why other top tools score differently than in the “101innovations” results). Secondly, it was an open question asking scientists to list the “research software” they use; some may not consider Microsoft Excel research software, as it is a general-purpose tool. The “101innovations” survey listed Excel among the pre-selected answers, thereby defining it as research software in the eyes of respondents.
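
To give a rough sense of how much the sample size matters, the sketch below computes the textbook 95% margin of error for a survey proportion at both sample sizes. It assumes simple random sampling and the worst-case proportion of 0.5; neither survey necessarily satisfies the sampling assumption, and the function name `margin_of_error` is purely illustrative.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Normal-approximation margin of error for a proportion p seen in n responses.

    Assumes simple random sampling; z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(p * (1 - p) / n)


for label, n in [("SSI", 417), ("101innovations", 14896)]:
    moe = margin_of_error(n) * 100
    print(f"{label}: n={n}, margin of error of about {moe:.1f} percentage points")
```

Under these assumptions the uncertainty is close to 5 percentage points for 417 responses and below 1 percentage point for 14,896. Once the smaller sample is split by discipline, each subgroup is smaller still and its uncertainty grows accordingly.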

What is interesting is the relatively high score of NVIVO, Stata and Mathematica, which were not among the pre-selected tools in the other survey. I therefore looked at the responses given under “other tools” in the 101innovations survey (details on how the field was structured can be found in my previous post). The results are as follows:

Although, as the authors of the survey themselves warn, we cannot compare the raw counts of “other tools” responses with those of the pre-selected answers (see details here), the fact that Stata and NVIVO top the list implies that they are indeed relatively popular tools in the research community. So why were they not pre-selected? I am not sure, but I’ll venture a guess below.

Disciplinary breakdown

What is more interesting, however, is how software tool usage varies between disciplines. Both surveys asked respondents about their field of research and (unsurprisingly) they draw the disciplinary boundaries in different places. To compare the results easily, some disciplines had to be aggregated.

The first category was formed by joining “Biological, Mathematical and Physical Sciences” with “Agriculture, Forestry & Veterinary” to form “Physical, Mathematical and Life sciences”. Similarly, Education, Social Sciences, Administration and Business Studies were joined into one category. Finally, Humanities, Design and Creative Arts formed the last defined field of study, whilst Architecture and Planning was left out (the category had only 6 respondents anyway).
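
The aggregation just described amounts to a lookup table from original discipline labels to merged ones. The sketch below is an assumption-laden illustration: the way the comma-separated names are split into individual labels, and the merged names “Social Sciences” and “Humanities and Arts”, are my placeholders here rather than labels taken from the survey data.

```python
# Assumed original category labels, split as read from the description above;
# the actual survey export may spell or group them differently.
MERGED = {
    "Biological, Mathematical and Physical Sciences": "Physical, Mathematical and Life sciences",
    "Agriculture, Forestry & Veterinary": "Physical, Mathematical and Life sciences",
    # Placeholder name for the joined social-science category.
    "Education": "Social Sciences",
    "Social Sciences": "Social Sciences",
    "Administration and Business Studies": "Social Sciences",
    # Placeholder name for the joined humanities category.
    "Humanities": "Humanities and Arts",
    "Design and Creative Arts": "Humanities and Arts",
}

# Left out of the comparison (only 6 respondents).
DROPPED = {"Architecture and Planning"}


def aggregate(discipline):
    """Return the merged category, or None for disciplines left out."""
    if discipline in DROPPED:
        return None
    # Categories that needed no merging (e.g. Medicine) pass through unchanged.
    return MERGED.get(discipline, discipline)


for label in ["Education", "Medicine", "Architecture and Planning"]:
    print(f"{label!r} -> {aggregate(label)!r}")
```

Keeping the mapping in one table makes the aggregation choices explicit and easy to revise if the boundaries between fields are drawn differently.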

In the 101innovations survey, Physical Sciences had to be joined with Life Sciences to enable comparison. Here are the results:

Detailed stats with consistent categories

Although the different nature of the questions (open versus closed) means we cannot meaningfully compare exact percentages, some trends are clearly visible. The popularity of SPSS in Social Sciences and Medicine persists in both studies. The distinctive popularity of Matlab and Python in Physical and Life Sciences and in Engineering and Technology can also be seen. In both studies, Medicine stood out as the field where Excel is used most commonly.

There are a few interesting differences as well. First of all, the first figure clearly shows that the popularity of NVIVO and Stata is owed to only two fields: Medicine and Social Sciences. My guess regarding the omission of these two from the 101innovations survey is that the choice of the seven pre-selected tools was influenced mainly by researchers from other fields, who generally do not use them. This is a pity: two seemingly popular software tools are not comparable in the 101innovations data, which makes it hard, for example, to perform a clustering analysis of the tools used to analyse data and then to archive or share it (which I’ll try to do at some point!).

Bartosz Paszcza
PhD student at Web Science Institute, University of Southampton; writing for jagielloński24.pl