Academic Pathways

Mabb
Mar 7, 2016

Researching Open Access can be difficult for those who are not familiar with the issue. Our team wanted to investigate the opinions of those who are directly involved in Open Access publishing. We were interested in what scholars think about it, which controversies engage them in debate and how those controversies have developed over time.

To gain some insight into these questions, we decided to look for the topics that academic authors discuss the most and to study how these topics developed over time. We also wanted to find out whether some authors have been more influential than others and how the topics relate to each other.

Finding a source

Our team chose academic articles as the most suitable source to represent academic authors’ opinions; to analyze them we needed to look into their content. However, not every article is free to read, so we decided to restrict our textual analysis to the articles’ abstracts. The fastest way to organize this kind of data was to work with article metadata, as almost every indexing database allows users to download the metadata of the articles returned by a search query.

We used Web of Science as the main database, thanks to the compatibility of its metadata with that of other databases and the ease of setting up queries. By using “Open Access” and “publishing” as keywords in a topic search covering the past 10 years, we obtained over 800 articles. By reading the abstracts of the resulting articles we made sure that the results did not include off-topic papers; we were then able to download all the metadata.
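As an illustration of what working with the downloaded metadata can look like, here is a minimal sketch that reads a field-tagged plain-text export and pulls out the abstracts. It assumes the export uses two-letter field tags (AU, TI, AB, PY), indented continuation lines and an ER line closing each record; the two sample records are made up and the function names are hypothetical.

```python
from typing import Dict, List

# Made-up two-record sample in the style of a field-tagged plain-text export.
SAMPLE_EXPORT = """\
PT J
AU Doe, J
   Roe, R
TI A hypothetical paper on open access
AB Open access publishing changes the
   subscription model.
PY 2012
ER

PT J
AU Poe, E
TI Another hypothetical paper
PY 2014
ER
"""


def parse_records(text: str) -> List[Dict[str, List[str]]]:
    """Split a field-tagged export into records mapping tag -> list of lines."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate records
        head, value = line[:2], line[3:].strip()
        if head == "ER":  # end of the current record
            records.append(current)
            current, tag = {}, None
        elif head.strip():  # a two-letter tag starts a new field
            tag = head
            current.setdefault(tag, []).append(value)
        elif tag is not None:  # an indented line continues the previous field
            current[tag].append(value)
    return records


def abstracts(records: List[Dict[str, List[str]]]) -> List[str]:
    """Return the abstract text of every record that has one."""
    return [" ".join(r["AB"]) for r in records if "AB" in r]


records = parse_records(SAMPLE_EXPORT)
print(abstracts(records))
```

Records without an abstract are simply skipped, which matters for a text analysis restricted to abstracts.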

Extracting topics

The abstracts were analyzed using VOSviewer, a software tool designed by a team of researchers from Leiden University for constructing and visualizing networks using bibliographic coupling and co-authorship of articles indexed in Scopus or Web of Science. We used the text mining functionality of this software to construct and visualize a co-occurrence network of important terms extracted from the text corpus.

To obtain clean results, we compiled a thesaurus file that merges synonyms and duplicates. The text analysis was run with a minimum number of occurrences of 15 and with “binary” counting (this means recording the presence or absence of a term in an abstract rather than counting every occurrence). The software’s algorithm then assigned a relevance score to each word or n-gram, connecting and clustering them based on semantic proximity and co-occurrence. The resulting visualization was then refined in Illustrator, producing a network where nodes (balloons) represent single topics, with size directly proportional to relevance, and connections (lines) represent the coupling with other topics in the corpus.
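The counting settings described above can be sketched in a few lines. This is not VOSviewer's implementation: it assumes the terms have already been extracted from each abstract, leaves out the relevance scoring and clustering, and uses a made-up thesaurus and a tiny made-up corpus with a threshold of 2 instead of 15.

```python
from collections import Counter
from itertools import combinations

# Hypothetical thesaurus: variant spelling -> preferred label.
THESAURUS = {"oa": "open access", "peer-review": "peer review"}


def normalize(term):
    """Lower-case a term and replace it with its thesaurus label, if any."""
    term = term.lower()
    return THESAURUS.get(term, term)


def binary_counts(docs_terms):
    """Count in how many documents each term appears (binary counting)."""
    counts = Counter()
    for terms in docs_terms:
        # A set keeps each term at most once per document.
        counts.update({normalize(t) for t in terms})
    return counts


def cooccurrence(docs_terms, keep):
    """Count, for each pair of kept terms, the documents containing both."""
    pairs = Counter()
    for terms in docs_terms:
        present = sorted({normalize(t) for t in terms} & keep)
        pairs.update(combinations(present, 2))
    return pairs


# Made-up corpus: one list of already-extracted terms per abstract.
docs = [
    ["OA", "peer review", "OA"],
    ["open access", "peer-review", "copyright"],
    ["open access", "copyright"],
]
MIN_OCCURRENCES = 2  # the analysis in this article used 15 on the real corpus

counts = binary_counts(docs)
kept = {term for term, n in counts.items() if n >= MIN_OCCURRENCES}
links = cooccurrence(docs, kept)
print(counts)
print(links)
```

With binary counting, the repeated “OA” in the first abstract counts once, so long abstracts that repeat a term do not inflate its weight.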

Findings

With this visual network we were able to identify the topics that were discussed in academic publications and gauge their influence; we then grouped them into four macro-topics representing the field of discussion:

• economic aspects (business model, subscription, copyright);

• scholarly publishing (developing models, repositories, archiving);

• publishing quality (peer review, plagiarism);

• scientific research quality (data, experiment results, bias).

This grouping was done manually, based on our knowledge of the subject, in order to exclude from our research controversies not specifically relevant to the Open Access publishing model (such as the Open Data topic).

Further exploration

In addition to identifying the topics discussed by academic researchers, we wanted to understand a few more things: how these topics developed over time, whether there are connections between discussion areas and whether these topics are relevant to a specific field of research.

To do so, we refined our search in Web of Science by adding specific topics to our original query in a cascade structure (e.g. “Open Access”, “Publishing”, “Peer review”). We added a total of seven topics chosen among the most prominent ones in the previous text corpus analysis: business model, subscription, copyright, discipline, peer review, impact and repository.
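The cascade structure can be pictured as one query per topic, each extending the two original keywords. The sketch below builds such query strings; it assumes the TS= (topic) field tag of the Web of Science advanced search and is only an illustration, since the exact query strings we typed are not reproduced here.

```python
# The two keywords of the original query, then the seven added topics.
BASE_TERMS = ["open access", "publishing"]
TOPICS = [
    "business model",
    "subscription",
    "copyright",
    "discipline",
    "peer review",
    "impact",
    "repository",
]


def topic_query(terms):
    """Join quoted topic terms with AND (assumed advanced-search syntax)."""
    return " AND ".join(f'TS=("{term}")' for term in terms)


# One refined query per topic, each built on top of the base terms.
queries = {topic: topic_query(BASE_TERMS + [topic]) for topic in TOPICS}

for topic, query in queries.items():
    print(f"{topic}: {query}")
```

Keeping the queries in a dictionary keyed by topic makes it easy to remember later which query returned which article.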

Mapping ways

The metadata collected from the refined queries was processed with CitNetExplorer, a software tool for visualizing and analyzing citation networks of scientific publications, developed by the same researchers who developed VOSviewer. The tool allows citation networks to be imported directly from the Web of Science database, but it does not yet allow a full exploration of the network, nor does it allow further customization.

To reach our goal we exported the network as a Pajek file, a plain-text file format for network data, and switched to Gephi (an open source software tool made for network analysis that also allows visual customization) for processing. This made it possible for us to keep the network interactive when exporting it for the web.
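For readers who have never opened a Pajek file, the following sketch writes a tiny citation network in that format: a list of numbered vertices followed by directed arcs. The paper labels are made up and `to_pajek` is a hypothetical helper, not part of CitNetExplorer or Gephi.

```python
def to_pajek(labels, citations):
    """Serialize a directed citation network as Pajek .net text.

    labels:    list of unique paper labels
    citations: list of (citing_label, cited_label) pairs
    """
    # Pajek numbers vertices starting from 1.
    index = {label: i for i, label in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{i} "{label}"' for label, i in index.items()]
    lines.append("*Arcs")
    lines += [f"{index[src]} {index[dst]}" for src, dst in citations]
    return "\n".join(lines) + "\n"


# Made-up example: two later papers citing an earlier one.
papers = ["paper_a", "paper_b", "paper_c"]
cites = [("paper_b", "paper_a"), ("paper_c", "paper_a")]
pajek_text = to_pajek(papers, cites)
print(pajek_text)
```

Note that the format carries only vertices and links, so attributes such as year or field of research have to travel separately.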

Processing the network

Since CitNetExplorer builds a network of citations but discards all other metadata, such as field of research or the original query, we had to manually reinstate some of the parameters in the dataset: year of publication of the papers, field of research, original query or topic searched. To obtain cross-query connections our team had to process all the metadata at the same time.
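We did this step by hand, but the idea can be sketched in code: pool the results of all queries, key each paper by an identifier and record every topic query that returned it. The identifiers, field names and records below are made up for illustration.

```python
def merge_query_results(results_by_topic):
    """Pool per-topic result lists into one node table keyed by paper id.

    results_by_topic: topic -> list of dicts with 'id', 'year' and 'field'.
    A paper returned by several queries keeps one entry listing all topics.
    """
    nodes = {}
    for topic, records in results_by_topic.items():
        for rec in records:
            node = nodes.setdefault(
                rec["id"],
                {"year": rec["year"], "field": rec["field"], "topics": set()},
            )
            node["topics"].add(topic)
    return nodes


# Made-up results: paper p1 is returned by two different topic queries.
results = {
    "peer review": [
        {"id": "p1", "year": 2008, "field": "Information Science"},
        {"id": "p2", "year": 2011, "field": "Medicine"},
    ],
    "copyright": [
        {"id": "p1", "year": 2008, "field": "Information Science"},
    ],
}
nodes = merge_query_results(results)
print(nodes["p1"])
```

Processing all queries in one pass is what exposes the cross-query connections: a paper with more than one topic links two discussion areas.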

Next we were able to organize our network along the x and y axes: on the x axis we displayed the year of publication of an article, while on the y axis we separated the different topics searched. We then applied colors based on the fields of research and defined the size of every article based on its citations (the more an article was cited by later articles, the bigger the balloon).
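The layout rule described above is simple enough to write down as a sketch. The code below is not what Gephi does internally; it only computes, for made-up nodes, the attributes we set there: x from the year, y from the topic row, size from the number of citations received within the network, and a color key from the field. All names are hypothetical.

```python
from collections import Counter


def layout(nodes, citations, topic_order, base_size=1.0):
    """Compute position, size and color key for every paper.

    nodes:       id -> dict with 'year', 'topic' and 'field'
    citations:   list of (citing_id, cited_id) pairs
    topic_order: list of topics, one horizontal row per topic
    """
    cited = Counter(dst for _, dst in citations)  # citations received
    rows = {topic: i for i, topic in enumerate(topic_order)}
    placed = {}
    for node_id, attrs in nodes.items():
        placed[node_id] = {
            "x": attrs["year"],                  # time runs left to right
            "y": rows[attrs["topic"]],           # one row per topic
            "size": base_size + cited[node_id],  # more cited, bigger balloon
            "color_key": attrs["field"],         # color chosen per field
        }
    return placed


# Made-up example with three papers and two citations.
example_nodes = {
    "p1": {"year": 2006, "topic": "peer review", "field": "Medicine"},
    "p2": {"year": 2009, "topic": "peer review", "field": "Medicine"},
    "p3": {"year": 2012, "topic": "copyright", "field": "Law"},
}
example_citations = [("p2", "p1"), ("p3", "p1")]
positions = layout(example_nodes, example_citations, ["peer review", "copyright"])
print(positions["p1"])
```

Adding a base size keeps papers that were never cited visible on the map.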

Further findings

This allowed us to see that topics like “business model” and “discipline”, although present from the start, never became the central controversy on Open Access; Copyright and Repository were stable topics, but they remained secondary.

The debate over the subscription methods opposing Open Access remained strong and fairly stable across the last decade; still, the main topic discussed by authors was Peer Review, a controversial theme that was debated right from the start and that never lost the authors’ attention over time.

Another result that emerged from this kind of research was that authors were not confined to a single discussion; it also allowed us to identify the key opinion leaders on the subject, such as Bjork, Solomon or Laakso.
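One possible way to make both observations measurable, given as a sketch rather than as the procedure we followed, is to add up the citations received by each author and to list the topic queries in which that author appears. The author names and numbers below are placeholders, not our data.

```python
from collections import defaultdict


def author_summary(papers):
    """Rank authors by citations received and list the topics they appear in.

    papers: list of dicts with 'authors', 'topic' and 'times_cited'.
    Returns (author, total_citations, sorted_topics) tuples, most cited first.
    """
    cited = defaultdict(int)
    topics = defaultdict(set)
    for paper in papers:
        for author in paper["authors"]:
            cited[author] += paper["times_cited"]
            topics[author].add(paper["topic"])
    # Sort by citations (descending), then by name to break ties.
    ranked = sorted(cited, key=lambda a: (-cited[a], a))
    return [(a, cited[a], sorted(topics[a])) for a in ranked]


# Placeholder records, not taken from the real dataset.
sample_papers = [
    {"authors": ["Author A", "Author B"], "topic": "peer review", "times_cited": 5},
    {"authors": ["Author A"], "topic": "business model", "times_cited": 3},
    {"authors": ["Author C"], "topic": "copyright", "times_cited": 1},
]
summary = author_summary(sample_papers)
print(summary)
```

An author with a high total and several topics is both influential and active across discussions.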

Applicability

Ultimately this protocol proved to be effective for our research: its first stage, the identification of topics from a text corpus, can be applied to any English text corpus; for instance, the same procedure could be used to extract topics from a Reddit post, provided that the whole text has been collected beforehand.

The second stage of this protocol, based on the evolution of a specific topic over time, proves to be slightly less flexible, since CitNetExplorer works exclusively with Web of Science metadata. This means it can be applied only to academic papers indexed in that database. Still, it proved to be extremely effective for exploring the evolution of almost any topic treated in academic research papers; for instance, it could be applied to find out how many authors wrote about lung cancer, how these authors are related to each other, who the most influential ones were, in which fields of research the topic is discussed or how much the topic was discussed over the years.
