A visualisation example: History Flow

Published in The Data Experience, Oct 16, 2015

Background

In 2003, just two years after the online encyclopedia’s birth, Wikipedia was still not well known, and among those aware of it there was serious skepticism about its open authorship model. Researchers Martin Wattenberg and Fernanda Viégas felt some of this skepticism themselves, yet they found many of the articles interesting and helpful. To find out how such a haphazard process could yield a quality product, and to answer other related questions, they decided to investigate. They produced a lot of work, most of it excellent. I’ll focus mainly on the History Flow part, which I think is both beautiful and useful.

About the Data

To get started, they needed Wikipedia’s raw data. Fortunately, that data wasn’t a table of numbers in a database, but a set of document versions and edit histories. Wikipedia keeps the full version history of every page available to the public, which was one of the brilliant early decisions by Wikipedia’s founders and undoubtedly a boon for researchers.

The data they valued most was the edit history, which gives an overview of an article at each version and lets us trace the changes between versions. From it we can perhaps find out why Wikipedia is of such high quality, and answer other questions too.

Visual Design

I have to say that Wattenberg and Viégas did great work. The whole visual design is simple and clear, yet it contains everything: contents, versions and authors, each displayed in an especially fitting way. The figure below is the main panel of History Flow.

In the middle is the history flow diagram, the core part, which we will discuss in detail later. On the right is a panel showing the content of a specific article at a specific version, with different colors representing different contributors. The author list is placed on the left as a column, with the selected author highlighted. The upper panel holds options for the color and spacing of the diagram. The figure may be a little blurry, but you can still see that the selected article is titled “chocolate”, shown both in the upper left of the main panel and at the beginning of the content.

You can surely guess that the key part of the design is the diagram, and you may be wondering what these colored stripes mean. Let’s find out.

Imagine a scenario where three people will make contributions to a Wiki page at different points in time. Each person edits the page and then saves their changes to what becomes the latest version of that page.

In figure 1, the vertical red line represents the first version of the document. Since Mary created the page, all of its contents are shown in her author color. The length of the line indicates the amount of text Mary has written.

In figure 2, Suzanne adds some text to the end of Mary’s original entry; note that Suzanne’s blue line is appended to the end of Mary’s red line, indicating that her text was added at the end of the page. Suzanne saves her changes and this becomes the latest version of the page.

Now in figure 3, Martin finds the original text too verbose; he deletes some of it and writes his own shorter version between the introductory text and Suzanne’s contribution.

In version 4, Suzanne comes back and makes a small contribution in the middle of what remains of the introductory text.

History Flow connects text that has been kept the same between consecutive versions; in other words, it connects corresponding segments on the lines representing versions. Pieces of text that do not have correspondence in the next (or previous) version are not connected and the user sees a resulting “gap” in the visualization; this happens for deletions and insertions.
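The matching step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it treats each version as a list of sentences and uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever matching the real tool uses. The sentence labels (`m1`, `s1`, ...) and the helper names `connect` and `attribute` are hypothetical.

```python
from difflib import SequenceMatcher


def connect(prev, curr):
    """Return (i, j, n) blocks meaning prev[i:i+n] survives as curr[j:j+n]."""
    matcher = SequenceMatcher(a=prev, b=curr, autojunk=False)
    # The final block from difflib is a zero-length sentinel, so drop it.
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size > 0]


def attribute(prev, prev_authors, curr, editor):
    """Kept sentences inherit their author; everything else belongs to the editor."""
    authors = [editor] * len(curr)
    for i, j, n in connect(prev, curr):
        authors[j:j + n] = prev_authors[i:i + n]
    return authors


# The four versions from the Mary / Suzanne / Martin scenario (made-up sentences).
v1 = ["m1", "m2", "m3", "m4"]        # Mary creates the page
v2 = ["m1", "m2", "m3", "m4", "s1"]  # Suzanne appends at the end
v3 = ["m1", "m2", "t1", "s1"]        # Martin replaces m3, m4 with a shorter t1
v4 = ["m1", "s2", "m2", "t1", "s1"]  # Suzanne inserts s2 into the introduction

a1 = ["Mary"] * len(v1)
a2 = attribute(v1, a1, v2, "Suzanne")
a3 = attribute(v2, a2, v3, "Martin")
a4 = attribute(v3, a3, v4, "Suzanne")

print(connect(v2, v3))  # blocks that would be drawn as connected stripes
print(a4)               # author of each sentence in version 4
```

Each `(i, j, n)` block corresponds to one connected stripe between two neighbouring version lines; sentences that fall outside every block are the insertions and deletions that show up as gaps.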

Now scroll back up to the main panel figure: isn’t it much clearer? You can click a version and the corresponding content will appear on the right, with that version highlighted in the diagram (look for the brown highlighted line in the main panel figure). And as you may guess, the content panel and the diagram stripes use the same colors, so you can match them up.

Visual Mode

This work has more than one visual mode in order to satisfy different needs. To switch among them, use the options panel at the top of the main panel (mentioned before). Let’s have a quick look.

Community view is the default mode. It shows all contributions from different authors, using hue to indicate the author of each sentence. This view uses just one property of color, the hue. Of course, we can easily add brightness to represent the age of the text, with brighter colors being more recent.

Individual author view highlights the contributions of a single author and it depicts the persistence of these contributions over time.

Recent changes view highlights the new content in each version of the wiki page, independent of authorship. This view allows us to see which portions of the text have been edited the most over time.

Age view has no colors representing authorship; instead, the focus is on the persistence of different contributions. A grayscale gradient goes from white (brand new contribution) to dark gray (very old contribution). As I mentioned before, this view combines well with the community view; the combination shows more information but also brings more clutter, so I suggest using it only on small data sets.

One more thing worth mentioning is that the spacing between versions can be customized. You can either make it constant, so that you can see what happens between any two versions, or make it proportional to the time elapsed between them, which may show more clearly how the content changes over time. The check box in the upper right switches between the two. This turns out to be surprisingly helpful for one of the results discussed in the next section.
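To make the two color encodings concrete, here is a rough sketch using Python's standard `colorsys` module. The function names, the evenly spaced hues, the saturation of 0.8, the `min_value` floor and the darkest gray level of 0.25 are all my own assumptions for illustration, not values taken from the tool.

```python
import colorsys


def community_color(author_index, n_authors, age, max_age, min_value=0.4):
    """Hue identifies the author; brighter means more recent text.

    min_value is an assumed floor that keeps very old text from fading to black.
    Returns an (r, g, b) tuple with channels in 0..1.
    """
    hue = author_index / n_authors            # evenly spaced hues, one per author
    oldness = age / max_age if max_age else 0.0
    value = min_value + (1.0 - min_value) * (1.0 - oldness)
    return colorsys.hsv_to_rgb(hue, 0.8, value)


def age_gray(age, max_age, darkest=0.25):
    """Age view: white for brand new text, dark gray for the oldest text."""
    oldness = age / max_age if max_age else 0.0
    level = 1.0 - (1.0 - darkest) * oldness
    return (level, level, level)


print(community_color(0, 3, age=0, max_age=10))   # author 0, brand new
print(community_color(0, 3, age=10, max_age=10))  # same author, oldest text
print(age_gray(0, 10), age_gray(10, 10))
```

Hue works as a categorical channel (one value per author) while brightness works as an ordered one (age), which is why the two can be combined in a single view without colliding.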
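The difference between the two spacing modes is easy to see in code. The sketch below is my own illustration, not the tool's implementation; `version_positions`, the 1000-pixel width and the timestamps are made up.

```python
from datetime import datetime


def version_positions(timestamps, width, by_time=False):
    """Horizontal position of each version line, from 0 to width."""
    n = len(timestamps)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    span = (timestamps[-1] - timestamps[0]).total_seconds()
    if not by_time or span == 0:
        # Constant spacing: every version gets the same amount of room.
        step = width / (n - 1)
        return [i * step for i in range(n)]
    # Time spacing: distance is proportional to the time elapsed.
    start = timestamps[0]
    return [width * (t - start).total_seconds() / span for t in timestamps]


# Hypothetical history: an edit that is followed two minutes later by another.
stamps = [
    datetime(2003, 1, 1, 0, 0),
    datetime(2003, 1, 6, 0, 0),
    datetime(2003, 1, 6, 0, 2),
    datetime(2003, 1, 11, 0, 0),
]
print(version_positions(stamps, 1000))                # evenly spread
print(version_positions(stamps, 1000, by_time=True))  # versions 2 and 3 nearly coincide
```

With time spacing, two versions saved minutes apart land less than a pixel from each other, so whatever happened in between practically vanishes from the picture.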

Result

This great visualization brought out many interesting results. Here we focus on why there seems to be no vandalism in Wikipedia.

Vandalism is behavior that deletes or ruins the work done before, and it happens in many open authorship settings. Is there any vandalism in Wikipedia? The result says yes. Look at the figure below.

This is the diagram of the article “Abortion”. Notice that there are a few vertical gaps in the diagram. As you may guess, each one marks a deletion of the whole page. But then why do we never open up an empty page? This diagram uses constant spacing; what happens if we change the spacing mode?

Above is the same figure in time spacing mode. The gaps seem to disappear. In fact, the deletions are repaired so rapidly that you can’t even see them when versions are spaced by time. So the reason we didn’t see evidence of destructive behavior wasn’t that this behavior didn’t exist, but that it tended to be erased quickly from public view. This result is correct, but it is not so easily demonstrated; to establish it, the authors did a lot of additional work. Still, it is undeniable that the figure is intuitive enough to suggest the initial hypothesis that eventually led to this result.
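One way to check this kind of hypothesis on an edit history is to look for versions where the page suddenly shrinks to almost nothing and measure how long it takes to come back. The sketch below is only my guess at such a check, not the authors' method; the function name, both thresholds and the sample history are invented for illustration.

```python
def short_lived_deletions(versions, drop_ratio=0.1, restore_ratio=0.5):
    """Find mass deletions and how long they stayed visible.

    versions: list of (timestamp_in_seconds, page_length) in chronological order.
    Returns a list of (version_index, seconds_until_restored); the second item
    is None if the page never recovered within the given history.
    """
    found = []
    for k in range(1, len(versions)):
        prev_len = versions[k - 1][1]
        t, length = versions[k]
        if prev_len == 0 or length > drop_ratio * prev_len:
            continue  # not a mass deletion
        lifetime = None
        for later_t, later_len in versions[k + 1:]:
            if later_len >= restore_ratio * prev_len:
                lifetime = later_t - t
                break
        found.append((k, lifetime))
    return found


# Invented history: the page is blanked at t=7200 and restored two minutes later.
history = [(0, 5000), (3600, 5200), (7200, 0), (7320, 5200), (90000, 5300)]
print(short_lived_deletions(history))
```

Collecting these lifetimes over many pages would show how long a deletion typically stays in public view, which is the quantity the figure only hints at.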

My Favorite Part, Pros and Cons

My favorite part is the diagram, of course. The first time I saw it I was deeply attracted, and confused at the same time.

My main confusion was why it seemed so redundant. The vertical lines alone are enough to encode all the information, without the stripes. But as the data set grows larger, I realized that everything falls apart without the stripes. The lines are so thin on the screen that you have to zoom in to see the details, and when you do, you lose the overall view of the flow. Moreover, without the stripes I can only focus on the composition of a single version, or at most the changes between two adjacent versions, let alone the contributions of a specific author through time. The stripes use additional area to show the information, emphasizing the changes and, most importantly, running through all the versions, so that hidden information, such as the cooperation and conflict between authors, jumps out of the screen. This is really a genius hack.

Beyond that, the diagram has some other very good features. Colors are handled smartly, with hue as a discrete encoding and brightness as a continuous one, which suits human perception. Adjustable spacing and the different views add extra usability, especially the individual author view, which could serve as a badge of honor for contributors whose work has lasted.

By the way, I was interested in the zigzag part of the “chocolate” article shown at the very beginning (the main panel figure). In the diagram there are repeated insertions and deletions, which look like two people quarreling. The truth is close to what I imagined: two users fought over whether a kind of chocolate sculpture called “coulage” really existed and, consequently, whether or not the paragraph about it should appear on the page. Interesting, isn’t it? Such a small conflict draws such an obvious pattern in the diagram, which definitely provides inspiration for further research.

Advice for Improvement

There are three main points I want to mention.

The first point is about zooming. Zooming should be a fundamental function in any visual design, yet I didn’t find any description of it in this one. Maybe it is such an intrinsic part of visualization that the authors simply omitted it. But if it isn’t there, it should be.

The other two are about color. Since each color corresponds to an author, the diagram may become cluttered when many people collaborate. Notice that not all the authors appear at the same time: some were active in the earlier versions and others in the later ones, while only a few persisted through all the versions. Could we use the same color for different authors when they never appear in the same versions? This would lose some information, but it may be a good idea, especially when a lot of authors are working on an article together.

If you look carefully at the figure of “Abortion” above, you can see what look like gaps on the right. The one in the middle right is not a real gap; it is a stripe whose brightness is nearly zero, because it corresponds to a long-lasting piece of content and brightness is used to encode age. Longer-lasting content becomes darker, but it shouldn’t get so close to black. This is confusing; we could simply set a lower threshold on brightness so that such stripes are not mistaken for gaps.

There is always room for more improvement, so I’ll just stop here. Beyond all the above, I really appreciate this work. It inspired me in both visual design and pattern analysis, which definitely broadened my horizons and deepened my interest in visualization research.
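The color reuse idea can be sketched as a greedy interval coloring. This is just my own illustration of the suggestion, not a feature of History Flow: I assume we know, for each author, the first and last version in which their text is visible, and the author names and ranges below are made up.

```python
import heapq


def assign_colors(visible_ranges):
    """Give each author a color index so that overlapping authors never share one.

    visible_ranges: {author: (first_version, last_version)}, both inclusive.
    """
    colors = {}
    in_use = []      # heap of (last_version, color) for authors still visible
    free = []        # heap of color indices that can be reused
    next_color = 0
    for author, (first, last) in sorted(visible_ranges.items(), key=lambda kv: kv[1]):
        # Release the colors of authors whose text disappeared before this one starts.
        while in_use and in_use[0][0] < first:
            _, color = heapq.heappop(in_use)
            heapq.heappush(free, color)
        if free:
            color = heapq.heappop(free)
        else:
            color = next_color
            next_color += 1
        colors[author] = color
        heapq.heappush(in_use, (last, color))
    return colors


ranges = {
    "Mary": (1, 3),
    "Suzanne": (2, 9),
    "Martin": (3, 4),
    "Bo": (4, 6),
    "Ana": (5, 9),
}
print(assign_colors(ranges))  # five authors share three colors
```

The price is exactly the information loss mentioned above: a color alone no longer identifies an author, so the author list would have to tell the reader which versions each color assignment applies to.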

References

Fernanda B. Viégas, Martin Wattenberg, Kushal Dave. Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI), 2004.

http://www.research.ibm.com/visual/projects/history_flow This is the project page. Detailed information, news and more figures are available there.

http://www.bewitched.com/historyflow.html This is Martin Wattenberg’s page about this project. His descriptions of this and other interesting works are available there.

http://www.fernandaviegas.com/wikipedia.html This is the page of the other author, Fernanda Viégas, about this project.

Beautiful Visualization: Looking at Data Through the Eyes of Experts. This is the book where I came across and selected this visualization. It contains lots of high-quality visualizations with detailed descriptions and discussions. I strongly recommend reading it.

Originally published at hacker-yhj.github.io on October 4, 2013.