I would argue that the BLEU score is an appropriate metric for Chinese poems, regardless of how it is used in this paper.
Classical Chinese poems have to follow very strict formal rules: each line has exactly 5 or 7 characters, and a poem typically has 4 lines. Every individual character is meaningful, and many rhetorical structures exist between words or word combinations.
If the goal is to generate similar content, the output should ideally be exactly what the ancient poets wrote (the gold standard). In this setting the task is much like machine translation, which is why using BLEU is OK.
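To make the comparison with machine translation concrete, here is a minimal sketch of a character-level BLEU between a generated line and a gold line. It is only an illustration, not the evaluation used in the paper: the function names `char_ngrams` and `char_bleu` are hypothetical, it handles a single reference only, and using n-grams up to length 2 (`max_n=2`) is my own assumption, chosen because a 5-character line has very few longer n-grams. The gold line is the opening of Li Bai's 静夜思; the candidate is a made-up variant with one character changed.

```python
from collections import Counter
import math


def char_ngrams(line, n):
    """Count the character n-grams of a line (hypothetical helper)."""
    return Counter(line[i:i + n] for i in range(len(line) - n + 1))


def char_bleu(candidate, reference, max_n=2):
    """Character-level BLEU of one candidate against one reference.

    Geometric mean of clipped n-gram precisions for n = 1..max_n,
    multiplied by the usual brevity penalty. max_n=2 is an assumption.
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


gold = "床前明月光"       # first line of Li Bai's 静夜思
generated = "窗前明月光"  # made-up candidate, one character differs

print(char_bleu(gold, gold))                 # identical line gives 1.0
print(round(char_bleu(generated, gold), 3))  # 4/5 unigrams, 3/4 bigrams match
```

Because every line has a fixed length of 5 or 7 characters, the brevity penalty rarely matters here, and the score is driven almost entirely by how many characters and character pairs match the gold line.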
However, if our goal is to evaluate how well a poem is written, to assess its literary value, and to judge the creativity of the so-called AI, that is a different story. In that case, perhaps no metric other than human evaluation is appropriate.