Part one: Introducing the Health Informatics category
What is Health Informatics, and why is it interesting? Health Informatics is becoming an ever bigger part of the modern healthcare sector, but existing communities perceive and use the concept differently, which may create disagreement about what the term encompasses and whether it is explicitly discussed on internet forums.
We therefore aim to explore the topic of Health Informatics on Wikipedia and obtain an overview of its network as well as knowledge of how it is understood by different actors. We identify several thematic clusters; however, it must be noted that the obtained data is biased towards English articles solely from the Wikipedia platform. Furthermore, the algorithms used for interpreting and visualising the data are opinionated and carry biases of their own.
We explored the seed category of Health Informatics on Wikipedia and all its member pages by using a script that connects to the Wikipedia API (Application Programming Interface). The depth is set to 1, meaning that the script also crawls the member pages of the sub-categories associated with our seed category. Calling the API provides us with access to the HTML code for all 931 pages in our category and sub-categories.
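The crawl step above can be sketched against the MediaWiki API’s `categorymembers` endpoint. This is a minimal reconstruction, not the project’s actual script; the injectable `get` parameter is our addition so the logic can be exercised without network access:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def api_get(params):
    """Perform one GET against the MediaWiki API and return the parsed JSON."""
    url = API + "?" + urlencode({**params, "format": "json"})
    with urlopen(url) as resp:
        return json.load(resp)

def category_members(category, get=api_get):
    """List all members of a category, following the API's 'continue' paging."""
    params = {"action": "query", "list": "categorymembers",
              "cmtitle": f"Category:{category}", "cmlimit": "500"}
    members = []
    while True:
        data = get(params)
        members.extend(data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            return members
        params = {**params, **cont}

def crawl(category, depth=1, get=api_get):
    """Collect article titles in a category and, down to `depth`, its sub-categories."""
    pages, queue = [], [(category, depth)]
    while queue:
        cat, d = queue.pop()
        for m in category_members(cat, get=get):
            if m["ns"] == 14 and d > 0:           # namespace 14 = sub-category
                queue.append((m["title"].split(":", 1)[1], d - 1))
            elif m["ns"] == 0:                    # namespace 0 = article
                pages.append(m["title"])
    return pages
```

Namespace 14 marks sub-categories and namespace 0 marks articles, which is how a depth-1 crawl decides whether to recurse or to collect.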
In this case, we have chosen to run two different scripts: one that obtains (scrapes) all hyperlinks in the body text, and one that scrapes all hyperlinks on a page that refer to other Wikipedia categories or pages.
Each script produced two network files: one with the pages of the category connected to each other, and one where the pages are further connected to all other Wikipedia pages they cite (outside the category and sub-categories). We chose to proceed with the second type of file because we did not want to exclude the pages outside our category and sub-categories before we could see how they were connected.
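A sketch of how the scripts’ harvesting of internal links from a page’s HTML might look, using only the standard library; the rule of skipping hrefs containing “:” (special namespaces such as Category: and File:) is our assumption, not necessarily the project’s:

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collect internal /wiki/ links from a Wikipedia page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Internal article links look like /wiki/Telemedicine; hrefs with a
        # colon (e.g. /wiki/Category:EHealth) belong to special namespaces.
        if href.startswith("/wiki/") and ":" not in href:
            self.links.append(href[len("/wiki/"):])

def wiki_links(html):
    """Return the internal article links found in an HTML fragment."""
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links
```

The InText variant would feed only the body-text HTML into the parser, while the All_Links variant would feed the whole page, templates included.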
The Health Informatics Network
To get an overview of the category, we chose to visualise the network of the category and subcategories’ pages (i.e. the corpus) in two different versions: one with connections by hyperlinks in the text (InText) and one with connections by all hyperlinks on a page (All_Links).
On these, we applied the ForceAtlas2 layout to explore the networks’ structure. Both contained multiple disconnected nodes, which the force-directed layout pushed further away; applying the Giant Component filter solved this by removing the disconnected nodes.
To reduce excess noise, we added the partition filter to remove all “not a member” nodes from the network and changed the edge colour to grey to enhance the visibility of the nodes.
A comparison between the two visualisations (visualisation 1 and 2) indicated a difference in the formation of clusters appearing in the All_Links but not in the InText, which might be caused by the Wikipedia templates. These allow editors to collaborate on covering a topic comprised of multiple related Wikipedia pages.
For further insight, we continued working with the All_Links network, filtering the degree range to 150–672 in order to remove poorly connected nodes, followed by rerunning ForceAtlas2 to prevent overlap between nodes. Hereafter, the node size range was set to 7–50 according to degree, a measure of the number of edges a node has to other nodes, visualised by increasing or decreasing the node’s size. Then we calculated modularity and applied colours accordingly, and increased the upper margin of the node sizes to 120.
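The degree-range filter is a Gephi feature, but the same idea can be sketched outside Gephi; this is an illustrative reimplementation of the filtering step, not part of the original workflow:

```python
from collections import Counter

def filter_by_degree(edges, lo, hi):
    """Keep only nodes whose degree falls in [lo, hi], mirroring Gephi's
    degree-range filter, and return the induced edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if lo <= d <= hi}
    return [(u, v) for u, v in edges if u in keep and v in keep]
```

On the real network the bounds would be 150 and 672; the toy bounds below only demonstrate the mechanics.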
In order to juxtapose the two networks, the same procedure was applied to the InText network, which produced faint clusters compared to the All_Links network. This raised the suspicion that the difference lies in the way the two networks harvest hyperlinks, because All_Links includes references in the Wikipedia templates, which creates a media effect. By looking at some of the articles, we noticed a shared use of templates and hyperlinks connecting the pages.
The four big clusters might indicate that a few high-degree nodes draw specific crowds of pages together in the network.
Six derivatives of the All_Links network
We tried to get an overview of what type of pages our clusters in the All_Links network contained and discussed what might be interesting to gain an insight into, which resulted in a list of 15 keywords we wanted to harvest data on in relation to our corpus.
We ran the keyword-search script using the “category members.json-file” as our input, which contains all 931 page titles. The script was run on all pages in our data corpus while using a wildcard, thus accepting different endings to the keywords (e.g. policy, policies).
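The wildcard matching described above can be sketched with regular expressions; the actual script is not shown here, so treat this as a plausible reconstruction:

```python
import re

def compile_keywords(keywords):
    """Turn wildcard keywords such as 'polic*' into case-insensitive regexes.
    A trailing '*' accepts any word ending, so 'polic*' matches both
    'policy' and 'policies'."""
    patterns = {}
    for kw in keywords:
        if kw.endswith("*"):
            pattern = r"\b" + re.escape(kw[:-1]) + r"\w*"
        else:
            pattern = r"\b" + re.escape(kw) + r"\b"
        patterns[kw] = re.compile(pattern, re.IGNORECASE)
    return patterns

def count_keywords(text, patterns):
    """Count how often each keyword pattern occurs in a page's text."""
    return {kw: len(p.findall(text)) for kw, p in patterns.items()}
```

Run over every page text in the corpus, the per-keyword counts can then be imported into Gephi as node attributes.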
Afterwards, we imported the results into our All_Links network and visualised each keyword to see which were the most interesting, and ended up choosing the following for further investigation: efficien*, legislat*, law, polic*, safe*, secur*.
To better understand how the keywords were used in their respective articles, we manually looked through them. We see law, legislat* and secur* mostly in the green cluster, in the context of data security and privacy. Meanwhile, polic* and safe* are mostly used in the purple cluster in the context of medical trials, and efficien* is prominent in both purple and green clusters, in the context of technology and data efficiency, with a slight mention of healthcare quality.
In summary, there are indications of different focus areas in the green and purple clusters, where the former mostly deals with concerns about the privacy of data used in health information systems, whereas the latter is more concerned with medicine and clinical trials and only scarcely addresses health information systems.
Timelines for edit history of two selected Wikipedia pages
We created two timelines of the pages Medical Record (MR) and Evidence-Based Medicine (EBM), respectively, in relation to revision count and unique members, based on the findings from our keyword search networks. These showed that the words legislat* and law are clearly related to MR, while polic* is related to EBM, which is located in another cluster. The words are related to each other meaning-wise, but they seem to be context- or area-specific.
In both diagrams we observe the occasional high peaks in the number of revisions within a short amount of time by a relatively low number of users, potentially indicating a dispute within the community. This can be further examined in the revision histories of the pages, where we find the specific revisions and their comments as well as the talk-pages where users discuss revisions.
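Such timelines can be reconstructed from the records returned by the MediaWiki API’s `prop=revisions` query, where each revision carries a `timestamp` and a `user`; a minimal aggregation sketch:

```python
from collections import defaultdict

def revision_timeline(revisions):
    """Aggregate API revision records (dicts with an ISO 'timestamp' such as
    '2009-01-15T12:00:00Z' and a 'user') into per-month revision counts and
    unique-editor counts."""
    counts = defaultdict(int)
    editors = defaultdict(set)
    for rev in revisions:
        month = rev["timestamp"][:7]          # 'YYYY-MM'
        counts[month] += 1
        editors[month].add(rev.get("user", "?"))
    return {m: (counts[m], len(editors[m])) for m in sorted(counts)}
```

A month with many revisions but few unique editors is exactly the peak pattern described above, which makes it a candidate for closer qualitative reading.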
We decided to look at revisions in MR between 01–12–2007 and 01–01–2010 (dd-mm-yyyy). We later discovered that the page was merged in 2005, which might explain why the number of revisions is so low. However, we did not find anything of interest here, only a short remark regarding the content of the page in relation to ethics, showing bias against certain perspectives on the topic. Other than that, the talk threads are relatively short with few and polite replies.
While examining the revisions between 01–07–2009 and 01–01–2010 in EBM, we can see from the Contents table that the “Criticism” section has been the topic of a long debate. This can explain the rise in revisions. Among the users, there is one prominent reviewer, who has left the majority of the comments on the talk-page, often at the end of the threads. This user’s activity is probably the reason for the low unique user count in this period. Additionally, in section 2 we notice an argument where a user boldly expresses his displeasure at how his edits were handled. He has a similarly high count of replies as the aforementioned user, though heavily concentrated in his own thread.
A network of co-occurring noun phrases extracted through semantic analysis
It seems that the semantic analysis reshapes the clusters into smaller but more specific ones, which we will briefly describe.
Firstly, cluster 1 mainly consists of health information and technology related pages. It has edges to cluster 2, made up of information exchange and different health record systems, as well as edges to cluster 3, which comprises informatics and different associations engaged within the field of informatics and health sciences. Secondly, cluster 4 is concerned with health information and management systems. This cluster is connected to the network through only two nodes: the health systems node in cluster 5 and the health informatics node within cluster 3. The same is the case with cluster 6, concerning health care and care providers, which is connected to the rest of the network only by the test result node in the orange cluster. Cluster 7 is more mixed, with words mentioning different diseases along with data protection, patient safety, medicine and research, while cluster 8 is centred around the word management.
This reshaping of the clusters might indicate that even though some topics are well connected (and shown as one cluster in the All_Links network), the way they are discussed might differ, resulting in the smaller and more specific groupings/clusters visualised in the network of the semantic analysis.
What might be interesting here is that the node health informatics plays only a minor role in the network, even though it names the Wikipedia category and thus also the title of the main article of this category. Even though the algorithms of CorTexT force the clusters apart from each other, we still expected health informatics to play a bigger role and have a higher node degree.
Part two: Introducing user-revision networks
From the initial examination, we can say that Health Informatics as an overall topic does not leave clear signs of controversial issues on Wikipedia. We see this based on a stable network of the Health Informatics category with no clear clustering. Furthermore, a qualitative analysis of a selected sample from the data corpus showed a relatively stable discussion about page content on the Wikipedia talk-pages. This sparks curiosity, as we know from the world “outside” Wikipedia that certain topics, issues and cases within the framework of Health Informatics are controversial. In a Danish context, Sundhedsplatformen [Health Care Platform], telemedicine and the implementation of clinical logistics systems could be mentioned as examples.
An explanation for the stable conditions of the category might be found in the way “guardian” users convene around protecting the topic from “outsiders’” revisions. We see signs of the same unique users maintaining the pages included in our data corpus, and of Wikipedia templates shaping page content. In order to conclude anything about how certain elements create a media effect that influences the stability of our corpus, we need to examine this further.
The idea is to gather connected items and encircle them in what Latour et al. (2012) have called a monad. Thus we deal with a collecting activity, where the monad “gathers, assembles, specifies, grasps, encapsulates, envelops those attributes in a unique way” (Latour et al., 2012: 608), and these are what we wish to explore.
Looking at the media effects within the category makes it relevant to turn towards the Affirmative mode proposed by Marres & Moats (2015), which suggests considering “how platform-specific dynamics may be an indicator of issue activity” (Marres & Moats, 2015: 9). As we have already seen how Wikipedia templates might influence the structure of a network, we wonder: what else might platform-specific dynamics do?
Protocol: Exploring user-revision networks
- Input: “category members.json-file”
- Harvesting user and revision data on corpus pages
- Time period: 2004–2019
- Output: two networks with bots included and two networks with bots excluded
- Monopartite network: nodes are pages and edges are users
- Bipartite network: nodes are both pages and users, and edges are weighted by the number of revisions
- Timelines of 10 selected pages which were harvested and created as explained previously, c.f. visualisation 1
- Input: “category members.json-file”
- Harvesting revision history of data corpus pages from 2004–2019
- Search for keywords in the harvested data
- Output: .csv file of pages and keywords in relation to their occurrence in each page’s revision history
Extended user-revision networks:
Script 1: Creating user list based on revisions on pages in the corpus
- Input: “category members.json-file”
- Harvesting a list of users based on revisions on pages in the corpus, with bots blacklisted
- Time period: 2001–2019
- Output: .json file with list of users
Script 2: Harvest revisions from users on member and non-member pages of data corpus
- Input: .json file with list of users
- Blacklisting bots and filtering users and revisions to harvest only the top 10 most active users, with a maximum of 100,000 revisions per user
- Output: .json file of users and their revisions
Script 3: Build an extended bipartite network
- Input: .json file of users and their revisions
- Filtering out pages outside the corpus with fewer than 10 revisions
- Filtering out “Wikipedia about” pages
- Output: Bipartite network file of users and revisions beyond our corpus
Script 4: Timeline of top active users’ revisions
- Input: .json file of users and their revisions
- Output: .csv file with top most active users and their revisions from 2001–2019
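Scripts 1 and 2 above can be sketched as follows; the `user`/`page` record shape and the cap handling are our assumptions about how such a script might be organised, not the project’s actual code:

```python
from collections import Counter

def top_active_users(revisions, bots, n=10, cap=100_000):
    """Pick the n most active non-bot users, counting at most `cap`
    revisions per user (mirroring Script 2's 100,000-revision limit)."""
    per_user = Counter()
    for rev in revisions:
        user = rev["user"]
        if user not in bots:                       # blacklist bots
            per_user[user] = min(per_user[user] + 1, cap)
    return [user for user, _ in per_user.most_common(n)]

def bipartite_edges(revisions, users):
    """Build (user, page, weight) edges for the chosen users, where the
    weight is that user's revision count on that page."""
    weights = Counter((r["user"], r["page"]) for r in revisions
                      if r["user"] in users)
    return [(u, p, w) for (u, p), w in weights.items()]
```

The resulting weighted edge list is what a tool like Gephi would ingest as the extended bipartite network of Script 3.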
Revision history: the relationship between pages and number of users revising them
To gain insight into the relationship between the users and the category pages, we harvested data on all revisions made in the category, which produced a mono- and a bipartite network. The former visualises the pages as nodes, with edges representing users doing revisions (c.f. visualisation 8), whereas the latter visualises both users and pages as nodes, with edges representing revisions (c.f. visualisation 9).
The monopartite network shows an extremely interconnected network of 834 nodes and 232,983 edges, illustrating a very closely connected community. However, we can only see that the users tie the network together, not how they are related to the pages; thus another type of network is needed.
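The monopartite view can be derived from the same revision data by projecting the user–page relation onto pages alone; a sketch of such a projection (our reconstruction, with hypothetical `user`/`page` fields):

```python
from collections import defaultdict
from itertools import combinations

def project_to_pages(revisions):
    """Project user-page revisions onto a page-page network: two pages
    are linked when at least one user has revised both of them."""
    pages_by_user = defaultdict(set)
    for rev in revisions:
        pages_by_user[rev["user"]].add(rev["page"])
    edges = set()
    for pages in pages_by_user.values():
        # every pair of pages sharing this editor becomes an edge
        edges.update(frozenset(pair) for pair in combinations(sorted(pages), 2))
    return edges
```

With a few hundred pages and thousands of shared editors, this pairwise projection is exactly what produces the dense, hundreds-of-thousands-of-edges network described above.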
According to the two bipartite networks (Visualisation 9 and 10) and the four tables below, we can see how the presence of bots influences the list of top 10 users, seeing as they occupy four spaces on the list, whereas their presence does not affect the top 10 pages in the category (other than the weighted degree). This might be explained by the fact that some bots are created by users to enforce rulesets on pages and thus just extend the users’ reach on specific pages; however, these bots must fulfil the Wikipedia bot policy and be approved by the Bot Approvals Group.
To understand the role of the bots, we adopted a precautionary mode instead of excluding them before starting the data analysis (Marres & Moats, 2015). The idea is to see whether the bots distort the network or play an important role before eventually excluding them. However, it can be difficult to evaluate the influence of bots within the user-revision network, seeing as they take on different tasks: some fix grammar, misspellings and references, while others revert vandalism and the blanking of pages. The latter kind is, therefore, more likely to be sensitive to issues and threats that can destabilise Wikipedia pages, and thus plays an important role in the network (Marres & Moats, 2015). However, the exclusion of bots shows only minor differences in the interconnection of the network, so we expect that the network is mostly maintained and kept stable through revisions conducted by the users.
The bipartite network provided an overview of the top 10 pages with the most users (not including bots) doing revisions, which we decided to look further into by creating a timeline for each page, showing how the pages evolved over time. Here we observe a mixture of peaks and slow periods, as well as smaller peaks with a lot of users, from which we selected a sample of the, in our opinion, most interesting ones, which could point us in the direction of what caused them to attract a high number of unique users.
In summary, we see that some of the pages, especially ELIZA, Telemedicine and Robot-assisted surgery, have been frequent targets of vandalism, including “blanking”, a term for users deleting sections or entire pages on Wikipedia. This tends to be detected primarily by bots. Those aside, we also see some users maintaining the pages. But among the range of different users making revisions, we see only a few recurrences across the 10 pages, showing only initial signs of our supposed guardians.
On the main article, Health Informatics, an interesting peak is seen in January 2009, where one user does the majority of edits, e.g. replacing external links with internal links, indicating an effort to keep users within Wikipedia’s sphere.
Furthermore, we found disputes over external links as to what kind of links should be added to the pages. We tried to see if it was possible to find a pattern of a “reference codex” within the chosen journals and sources of the category, by doing a co-reference network, but this data proved difficult to harvest without errors.
Extended keyword search
During our qualitative analysis of the spikes on the 10 timelines, we looked for specific words in the revision histories and talk-pages that were common indicators or triggers of disputes over the content of the pages. This resulted in a list of keywords, which we wanted to examine further to see if they were present on other pages or unique to one. The keywords are the following: advertis*, COI, conflict of interest, edit war*, edit-war*, editwar*, meatpuppet*, sockpuppet*, revert*, self-promotion, self promotion, spam, vandalis*, blank*, disrupt*, repeating characters, undid.
Seeing as these keywords were mostly used in the comments for revisions, e.g. snapshot 5, we ran a script to search for the keywords on the revision history pages of our category pages.
Applying the results from the keyword search to our All_Links network allows us to visualise how many times the keywords are present on the revision history pages in the category. Due to the nature of our chosen keywords, we might be able to visualise the extent of issues such as vandalism or edit warring within our category.
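A sketch of how the keyword search over revision histories might score a page, using a hypothetical subset of the dispute keywords listed above:

```python
import re

# Illustrative subset of the dispute indicators; '*' wildcards from the
# keyword list become open-ended \w* patterns.
DISPUTE_PATTERNS = [re.compile(p, re.IGNORECASE) for p in
                    (r"\bvandalis\w*", r"\brevert\w*",
                     r"\bspam\b", r"\bedit[ -]?war\w*")]

def dispute_score(comments):
    """Total dispute-keyword hits across a page's revision comments."""
    return sum(len(pattern.findall(comment))
               for comment in comments
               for pattern in DISPUTE_PATTERNS)
```

Summed per page and imported into the network, such a score is what lets the visualisation distinguish relatively stable pages from contested ones.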
We might say that if the occurrence of the selected keywords is low in the category or on a page, then the category is relatively stable, while a high number would probably mean that the category (or its content) is somehow controversial or that someone is trying to destabilise it. A few examples of this:
- “Spam” occurs several times in all clusters and categories, which indicates that the pages might be “under attack” by users who are not serious about the content. The same can be said about the occurrence of “vandalism”, even though it’s not equally frequent.
- “Undid” and “Revert” are synonyms, and both occur on several pages within the category, which might suggest that edit warring is going on or that some users are very pernickety about other users’ changes.
- “Self promotion” and “edit war” are only evident in one page each, which might indicate that our category is not very susceptible to users self-promoting or using sockpuppet accounts.
However, when looking at the visualisation we should keep some factors in mind, e.g. that a high vandalism count might not necessarily represent the existence of controversies within a field or page but might just show that some users are very fond of deleting or inserting random things, being so-called “internet-trolls” provoking reactions for the fun of it.
Timeline of unique users doing revisions
The timeline illustrates the top 10 users and the number of revisions done within the category. We can see that there are few to no revisions in the beginning, with only a few users being active: the user Rjwilmsi became slightly active within the period 2006–2008, joined by Physadvoc, who was responsible for the peak in 2007. Obiwankenobi and Mikael Häggström started making a few revisions from late 2008–2009, but not a lot happened between late 2008 and 2010, until Häggström created a few peaks of 50+ revisions in 2011. The rest of the users sporadically started making revisions in the period of 2012–2015.
The timeline shows three top spikes of 300+ revisions connected to different users: Physadvoc, Obiwankenobi and Truebreath. The latter was only active within the peak period of 05–2014 to 08–2014 but was banned shortly after for plagiarising. From late 2013 onwards, the timeline shows a more consistent flow of revisions.
The users’ most active periods alternate almost in a way that allows them to fill in each other’s gaps where few revisions are made, nearly creating a revision baseline for the timeline and indicating a relatively stable number of revisions throughout the years.
Network of top 10 users and pages
To further examine our supposed 10 guardians’ relation to our category, we created a visualisation of a bipartite network containing the top 10 users and the pages, both within and outside our category, that they revised during the period 01–01–2001 to 01–03–2019. In the script scraping the users and pages, an edge-filter of 10 has been applied to the pages outside our category, thus excluding pages with fewer than a total of 10 revisions made by the top 10 users. Visualisation 23 illustrates the base network, and the following visualisations are derivatives of it, highlighting different interesting aspects.
From visualisation 23 we can see that some users (e.g. Rjwilmsi and Jytdog) are connected to a vast number of pages compared to the other users, which means that they revise many different pages; at this point we estimate that the number of pages these two users have revised is higher than the number of pages within the category. To figure out whether the users mainly focus on pages within or outside our corpus, we applied a filter that makes it possible to distinguish between member and non-member pages, c.f. Visualisation 24.
Visualisation 24 illustrates that the users are connected by only a handful of member pages, and in terms of revisions on pages outside the corpus, they are somewhat divided. Thus it can be said that the top 10 users seem to have diverging foci outside the category, since most of the non-member pages cluster tightly around each user. To gain insight into the vague division amongst the pages in our corpus and the sharp division among the pages outside it, we look further into the two networks, one by one.
Visualisation 25 illustrates how the corpus pages connecting the users are mostly broad topics, e.g. “Electronic Health Record” and “EHealth”. However, the pages bridging the users represent only a small share of the corpus pages, indicating that the users’ main focus is not necessarily the same.
From the network in visualisation 26 we can see that the top 10 users primarily revise their own user-pages, which in many cases contain a lot of information about how to conduct yourself on Wikipedia. The users are also concerned with Wikipedia-specific pages: templates and talk-pages about e.g. AutoWikiBrowser. But overall there are a lot of medically related pages among those revised by the users in our top 10.
A qualitative examination of the top 10 users
By looking at the user pages of the top 10 users, we can see that they are all engaged in editing health and medicine related pages, which is further supported by the fact that six users have received “The Cure Award” for bringing “free, complete, accurate, up-to-date medical information to the public” (e.g. Rjwilmsi and Mikael Häggström). Five of the users have explicitly stated that they revise pages within the topics of health and medicine while five users also make sure that the pages are clean and concise, fixing grammar, references, links and moving stub-pages to the Wiktionary, which is a free web-based dictionary (e.g. ShelleyAdams and Bluerasberry).
Two of the most active users, Jytdog and Obiwankenobi have decided to retire from Wikipedia, Jytdog did so in December 2018 after making a “bad error in judgement” and being accused of engaging in several edit wars, which is the term used when users keep reverting or overriding each other’s contributions. However this decision has led to several comments from other users expressing sadness about the decision and it is obvious that Jytdog was committed to Wikipedia, showing guidelines and the common code for being a Wikipedian. Thus, it seems like Jytdog has been highly appreciated amongst fellow Wikipedians.
But if you look outside the user-page, you see a different side of Jytdog. For example, on Wikipediocracy there is a long thread of people criticizing Jytdog.
The user Truebreath was blocked after several cases of copying and pasting from other sites but was highly active on pages included in our data corpus within a short period of time, as seen in visualisation 22. Additionally, some of our top users’ user and talk-pages are empty, making it difficult to analyse and understand their edits (e.g. Physadvoc and Obiwankenobi).
Controversies might not have the best conditions on Wikipedia, because the encyclopedia seems to have a stable group of users who take care of the content, revert vandalism, handle COI edits, etc. These users are supported by the bots that keep Wikipedia clean of the issues that might trigger disputes, especially seeing as most bots are created by prominent users. Furthermore, the users perform a kind of peer review on the pages, which seems to prevent controversies. Of course, some topics are controversial in themselves, and this will be present on Wikipedia as well, as this is what shapes the topic in general, but this is not the case with Health Informatics.
What seems to be a lack of controversies in the category of Health Informatics on Wikipedia might be down to users not having a specific interest in one particular category, and thus no clear incentive to protect the category from other users; it seems they have the opportunity to move on to other pages if a challenge arises (unless the topic is controversial in itself). So, even though we see some indications of guardians for the category, we can see that this is not their only concern.
Wikipedia is (still) controversial in itself. Users are not only doing revisions within a specific category but all over, across many different pages and categories, to a degree where it is doubtful that the users have knowledge within all areas. We have seen some examples of users discussing what constitutes proper page content, but it is also clear that there exist websites that are sceptical about how Wikipedia is maintained by its users. Furthermore, we see cases of very active communities or specific users being self-affirmative, which gives fuel to an anti-claim about Wikipedia: Wikipedia is not something you can trust, because everyone can edit it.
This is further supported by the fact that even though bots need to be approved by Wikipedia, they are still created to maintain the pages in their creators’ image of what Wikipedia should be, about which there are many opinions. Also, users’ efforts to limit the number of external links on pages by replacing them with internal links pose a controversial bias in relation to a potential conflict of interest within the Wikipedia sphere.
Latour, B., Jensen, P., Venturini, T., Grauwin, S., & Boullier, D. (2012). ‘The whole is always smaller than its parts’ — a digital test of Gabriel Tardes’ monads. The British Journal of Sociology, 63(4), 590–615. https://doi.org/10.1111/j.1468-4446.2012.01428.x
Marres, N., & Moats, D. (2015). Mapping Controversies with Social Media: The Case for Symmetry. Social Media + Society, 1(2), 205630511560417. https://doi.org/10.1177/2056305115604176