The endless #ParsingTragedy

I’m trying this Medium thing because I want to give my perspective on the somewhat chaotic #ParsingTragedy Twitter thread from yesterday, which was triggered by one of my tweets. Feel free to comment or even yoavgo me hehe.

Note that this reflects only my opinion and not that of my co-authors.

After getting the official results, we saw that we were 3 percentage points below the baseline, although we had been 2 percentage points above it on the dev sets. We realised there was a DyNet bug and that we could not reproduce our dev results on TIRA. An unofficial table opened up, and it was communicated to us that we could do fake runs on TIRA where we just copy files parsed on another machine. We thought all the groups assumed that this table could contain results obtained in a freestyle manner, as Djamé puts it. Once the test data is out, we can no longer control what people do, but it’s nice to keep results in one place. Clearly, we were wrong: people were not assuming this at all. This has been cleared up with Dan Zeman, who has now updated the unofficial results page.

This is all explained in the paper, but in short:
We first ran the exact same model as in our official results, but on the machine on which we trained it, and got an f-score of 69.66 (which can still be seen under ‘all runs’ on the unofficial results page).

Edit: my story about the tokenization bug was partially wrong and I had omitted an important part. (I wasn’t the one dealing with tokenization and had glossed over this but in the wrong way).

During the test phase, we had some problems tokenizing some test sets and (debugging on the dev sets) discovered a bug in our tok/seg model. We then made a somewhat hacky fix. After the test phase we couldn’t interpret our tokenization results and saw our fix was not correct. We made a clean fix to the tokenization bug. This clean fix gave us an f-score of 70.49, which is our current score in the unofficial ranking and also the results that we discuss in our paper.

I have read that some people find this tokenization bug fix sneaky and I understand their feeling. As far as I’m concerned, it was a difficult dilemma. In the end, we favored interpretability of results over strictness of the experimental protocol.

The parsing code is out there for anyone who wants to verify our claims, and the tokenizing code will be out soon (I can’t promise that, but my co-author said he would release it ;))

EDIT: the tokenization code is now here:
