The Unbearable Heaviness of Text Recycling

With apologies to Milan Kundera.

Yeah, it’s complicated. https://bit.ly/2Jj3uZB

In one instance or another, those of us who … let’s say ‘strenuously investigate’ published work, have a variable task.

Specifically, it is either very easy or very complicated.

Data re-use is easy. Easy to investigate, easy to identify. Same numbers!

Data thuggery is hard. By this, I mean splitting up complicated statistics or patterns within numbers to detect possible erroneous values. Often, you must use unknown methods to convince the unwilling to prosecute the untouchable. [1]

Plagiarism is easy. I stand by my previous description of plagiarism as “the dumbest crime”.

1. You must steal something that is already in public view, which means everyone else can see it too, or something that was sent to you privately, which means a record exists proving it was sent exclusively to you.
2. It is 100% possible to conclusively prove. You did the deed. This isn’t a matter of ‘those results are unlikely’ or ‘this analysis is highly improbable’. It’s absolute. A = B, you = plagiarist.
3. ‘Sloppiness’ is not really a defense even if it’s a possible explanation, which is rarely the case. “Whoops, I accidentally copied three-quarters of her chapter! How sloppy!” No.
4. You cannot redact your work, remove it from the public record or alter it to make it acceptable. The genie leaves the box, and does not leave a forwarding address.
5. For the deed to benefit you, a maximal number of people must examine the evidence (i.e. must read what you ‘wrote’).
6. If you get caught once, the rest of your work will be heavily scrutinised — the whole thing will snowball if it ever starts.
This is a really, really bad crime.

Bad as in ‘very easy to prove, very damaging when proven’.

Text recycling (sometimes incorrectly referred to as self-plagiarism) is … well, it’s easy to detect and prove, then rapidly gets more complicated. In academia, where you work on the same topic for long periods of time, have many individual pieces of text, and send them to multiple and dissimilar outlets for publication, it’s really complicated. Here are the ballistics:

  1. Theft. Have you committed an intellectual theft or deception if you are reproducing your own work? Occasionally, this is likened to ‘breaking into your own house’. It’s your house. You are free to do what you like with it. This is why ‘plagiarism’ is necessarily the wrong word if you’re talking about something you wrote yourself. The etymology is straightforward: plagium, kidnapping, which implies that you are violating someone else’s rights. So, recycling it is. Language matters.
  2. Accuracy. If you write two papers from the same dataset, why would your description of the method and relevant part of the results differ between publications? If anything, they should be at least partly identical — they describe identical situations. Surely pretending they were different would be far more of a problem.
  3. Copyright. Like it or not (… OK, we don’t like it), published academic texts generally cede copyright to whoever is doing the publishing. This means that ‘your’ work ceases very quickly to be yours. If you write something then sign the rights to Wiley, then you do not own it, even if it has your name on the top and you stayed up nights and drank all your good Burgundy writing it. Which brings us to…
  4. Fair Use. This is the exception to copyright which allows limited reproduction of copyrighted material. It is affected by [a] the amount of copying being done, [b] the intended commercial or non-commercial use of the copied product, [c] the effect on market value of the first source, and [d] the satirical, commentative or parodic nature of the reproduction.
  5. Citation. Is there an acknowledgement that the reproduction occurs? Hence, is appropriate credit given to the previous reproduced source?
  6. Originality. We have an expectation (reasonable or otherwise) that a new publication, with a new name and topic, in a new venue, is newly created text. It may be only a close description of things which have happened previously, but we do assume that in a separate published text the act of writing has occurred.
  7. Corpus. You write long enough, you read long enough, and you’ll repeat yourself. There are only so many ways to put words in order.
  8. Nature of Reproduction. There is a continuum of identical vs. non-identical copying, usually expressed as the difference between patchwriting (note: Rebecca Moore Howard has been writing about this for decades), the piecemeal reconstruction / partial rewording of multiple sources, and more direct forms of reproduction. The catch-all description which has always stuck in my head (which, in a massive twist of irony, I cannot remember the reference for!) is “patchwriting is when you fail at paraphrasing”.

Undoubtedly there are more elements, but that covers most of what needs consideration in a thuggin’ context.

Now, that brings us to a blog post by Nick Brown, WHICH YOU NEED TO READ FOR ANY OF THIS TO MAKE SENSE:

(Seriously, you gotta read it.)

So, let’s be very clear about the above.

Facts: this is the exact wholesale reproduction of multiple works, written by the central author with a cohort of coauthors, disseminated under the copyright of various academic publishers. This is absolute. You cannot ‘accidentally’ reproduce something which is 100% identical beyond a single sentence, two if we consider the absolute mind-meltingly uncommon upper bound of probability.

Conjecture: as these were very easy for Nick to locate, it is very likely that the post linked above does not represent the sum total of the text recycling problems available to be found.

I tested this myself, by choosing exactly two unusual-sounding sentences from an article at random. It only needed two, because the second was “Moroccan and North American individuals remember patterns of Oriental rugs”, and it appears in:

Sternberg (2006) Cultural contexts of giftedness (where I found it),

Sternberg (2007) Intelligence and culture, and

Sternberg (2008) Culture, Instruction and Assessment.

The duplication goes beyond the single sentence, of course.

This process took six minutes. I timed it.
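If you already have the full texts in hand, the same probe can be run locally. What follows is a minimal sketch, not the method I used above: the function names and the `doc_a` / `doc_b` / `doc_c` placeholder texts are invented for illustration, and the sentence splitter is deliberately naive.

```python
import re


def normalise(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def contains(needle, haystack):
    """True if the normalised needle occurs in the haystack on word boundaries."""
    return f" {normalise(needle)} " in f" {normalise(haystack)} "


def sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_probe(probe, documents):
    """Labels of every document that contains the probe sentence."""
    return [label for label, text in documents.items() if contains(probe, text)]


def shared_sentences(text_a, text_b, min_words=8):
    """Sentences of text_a that reappear word-for-word in text_b."""
    hits = []
    for sentence in sentences(text_a):
        long_enough = len(normalise(sentence).split()) >= min_words
        if long_enough and contains(sentence, text_b):
            hits.append(sentence)
    return hits


PROBE = "Moroccan and North American individuals remember patterns of Oriental rugs"

# Placeholder texts, invented for this example; they are not the real articles.
documents = {
    "doc_a": "Placeholder opening. " + PROBE + ". Placeholder ending.",
    "doc_b": "Different placeholder. " + PROBE.lower() + "! More filler.",
    "doc_c": "Nothing relevant appears in this placeholder text.",
}

print(find_probe(PROBE, documents))
```

The `min_words` cut-off is there because short sentences repeat innocently all the time. Note what this would miss: it only catches word-for-word matches after case and punctuation are stripped, so a copy-edit that turns “2” into “two” would slip straight past it.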

So, let’s go back through the factors:

  1. Theft. Not a factor for external authors. Potentially not a factor for co-authors; we can’t tell much about the relative involvement.
  2. Accuracy. The reason to accurately reproduce specific details of a scientific method is not relevant here, as these papers are not presenting data — they are review papers. The analytically-driven decision to reproduce a procedure precisely is not relevant.
  3. Copyright. Publishers often get permission to reproduce graphs, tables, figures, or text from other sources. This is either cited in text (“This Table originally from XYZ, used with permission”) or in summary (“The above remarks were published elsewhere as the ABC Lecture”). Neither of these applies broadly here. The reproductions appear to mostly be in violation of the copyright of the existing text. It IS possible that permission is either [a] not required due to previous agreements or [b] officially given but not stated.
  4. Fair Use. This can be tricky. Fair Use generally applies to small amounts of text (not the case here) used for a different purpose than the copyrighted material (also not the case here), and is an easier case to make if the alleged violation is not for money (probably the case). There is another currency at work here, of course — fame, citations, publications, i.e. reputation — but the intent of any individual article is not to directly make money, as academics do not generally sell text. This can change if the text is for a textbook. And obviously the intent is not satirical, although that would be hilarious. This reproduction is extensive, bordering in some cases on entire. I don’t think this qualifies as fair use.
  5. Citation. This, to my mind, is the biggest problem here. Sternberg is obviously quite comfortable with self-citation (which is why a hundred-odd people recently signed a letter complaining about his propensity to do so, and the former editor of the journal he now edits just publicly called for him to be fired). But in all but one of these cases he has somehow omitted to self-cite the precise works which are being reproduced.
  6. Originality. Has an ‘act of writing’ occurred? In some cases, the absolute best we can do is partially. How much of an original contribution is necessary to define a new ‘work’ as being truly new?
  7. Corpus. Obviously, these are drawn from a tremendous body of source material.
  8. Nature of Reproduction. Save for copy editing related changes (“2” into “two”) the reproduction is absolute.

Now.

Is it OK?

Let’s examine this from three perspectives — first, from formal guidelines concerning reproduction, second, from the latest academic literature published, third, from the perspective of university policy more generally.

FORMAL GUIDES

There are certainly more, but I’ve gone with four:

  1. The Office of Research Integrity definition
  2. The APA Publication Manual
  3. The MLA Style Manual
  4. COPE Guidelines

ORI

^ They don’t have a section on ‘text recycling’, but that’s what they mean. Let’s get this out of the way first:

Well, that’s quite explicit. Now…

So, from the ORI perspective: ‘overlapping’ publication does not constitute misconduct, but is described clearly in the above as ‘malpractice’ and ‘laying in a continuum’ which varies by extent, type, and severity.

Is it OK? Unclear.

APA

The APA Code of Ethics does not mention text recycling or self-plagiarism. But the APA Publication Manual does (p.14, 6th Ed).

The test is ‘scientific necessity’, and meeting that has three relevant criteria — length, acknowledgement at point of duplication, and acknowledgement in reference section. My reading is that the work in question fails all of them.

Is it OK? No.

MLA

The above section directly cites MLA 8th as saying:

Again, I disagree with the etymology but never mind that now. The MLA Style Guide is not available online (at least, not to me) and I don’t have time to go and get the book out of the stacks. But I think we can assume the MLA website is an accurate representation of their own book.

Is it OK? No.

COPE

The relevant Committee on Publication Ethics (COPE) guidelines are extremely explicit about text recycling. They’re quite short and you should read the whole thing.

As for what to do about it, a question on which the other guidelines are very light, the following is offered:

If text recycling is discovered in a published article, it may be necessary to publish a correction to, or retraction of, the original article. This decision will depend on the degree and nature of the overlap, and several factors will need to be considered. As for text recycling in a submitted manuscript, editors should handle cases of overlap in data according to the COPE flowchart for dealing with suspected redundant publication in a published article [2].
Journal editors should consider publishing a correction article when:
  • Sections of the text, generally excluding methods, are identical or near identical to a previous publication by the same author(s);
  • The original publication is not referenced in the subsequent publication; but
  • There is still sufficient new material in the article to justify its publication.
The correction should amend the literature by adding the missing citation and clarifying what is new in the subsequent publication versus the original publication.
Journal editors should consider publishing a retraction article when:
  • There is significant overlap in the text, generally excluding methods, with sections that are identical or near identical to a previous publication by the same author(s);
  • The recycled text reports previously published data and there is insufficient new material in the article to justify its publication in light of the previous publication(s).
  • The recycled text forms the major part of the discussion or conclusion in the article.
  • The overlap breaches copyright.

Is it OK? No.

So, total: one unclear, three no.

LATEST LITERATURE

The most recent paper on this topic is Moskovitz (2018). It is both careful and comprehensive, and if you ever need to know a great deal about what does or does not constitute appropriate re-use, it is great.

Towards the end, the paper considers the COPE Guidelines (as above) on recycling as having crucial definitions hinging around the following terms:

  • “an author’s own”: sole author or other authors involved?
  • “sections of text”: major sections, patches, or occasional pieces of re-use?
  • “same”: identical, illegitimate paraphrase, legitimate paraphrase, or ‘sufficiently different’?
  • “text”: written words, or pics/diagrams/tables/videos/etc?
  • “un-attributed”: necessary and sufficient citation?
  • “publications”: are conference proceedings or web-published notes ‘publications’?

Does the material in question meet the definitions of recycling given these (very appropriate) six nuances?

Is it OK relative to these six points? Well, in order, I would say: mostly, yes, yes, yes, usually, and yes.

Thus: is it OK? No.

ANOTHER TEST

There is an obvious analogue to this situation at a university, and it is: what would it mean if a student did this?

So here we go with…

The Institutional Test a.k.a. Your Own Institution’s Academic Integrity Policy That It Enforces On Students

Every university has an academic integrity policy. Yours does, mine does. It’s a natural consequence of having grading and submission policies.

If you don’t know yours, Google *university name here* academic integrity and it should be the first link.

Here’s the one from my university:

This, if anything, is stricter than many of the formal guidelines. It lumps text recycling in with plagiarism — it would qualify as ‘other original academic material’ — and includes paraphrasing, and then applies the ‘any reuse without attribution’ test. The works in question would violate this policy.

Here’s my alma mater on the same subject.

The hinge here is ‘work already assessed’ or ‘received credit’, hence this is very similar to the ‘act of writing’ test.

And just for fun, here’s the policy at Sternberg’s institution (Cornell):

Here, submission is the key. Once something is submitted, it cannot be submitted in the same form elsewhere without agreement, and obviously this cannot be given without notice. The publication analogue of this is straightforward, as we also have a text with a crucial act of submission.

So— is it OK? No, no and no.

This work would likely violate the academic standards we set for undergraduates.

Try it on your own local academic integrity policy.

CONCLUSION

How does the work in question fare with my home-brewed definition of text recycling? Not very well.

Is it at odds with formal ethical standards? Three yes, one unclear.

Does it meet a very nuanced and recent conception of text recycling? Yes.

Is it at odds with the academic standards we set for students? Yes.

So what happens next?

I don’t know, but I suspect nothing whatsoever.

That’s right, I do not expect any of the above to result in concrete action from any party.

That probably bothers some of you, so I’ll explain what would have to happen if the story didn’t end here.

First, people must agree there is a problem.

Regardless of any of the above, a critical mass of people must agree that the above, and issues like it, are an actual problem rather than something we should shout about for twenty minutes and then have some tea. This is very hard to establish with text recycling due to, well, read the above 3000 words. It’s a slippery, difficult issue which compels the reader to make individual decisions about acceptability.

There are academic acts we all consider problematic under all circumstances (i.e. they involve demonstrable dishonesty) and recycling isn’t one of them. Academics [a] rarely care much about the enforcement of lesser standards on other academics, [b] are not reactive or militant people (rather cautious and often terrified of giving offence), and [c] communicate with each other poorly and infrequently.

A stupendous amount of checking must be done.

If this is indeed a career-wide problem for Sternberg, the corpus of work to be evaluated is massive. 105 publications would be a big task; 105 CV PAGES is the kind of task that would have made Hercules junk the Twelve Labors and take up knitting.

Trust me, I have done similar tasks before. Nick has done even more of them. And I’m not doing this one. Error in the bite/chew continuum. Nick has a very modest follow-up planned, I believe, but certainly nothing even approaching what would be necessary.

To look into this properly would take a coordinated effort between a lot of individual researchers making a collective decision to do so. And they’d need tools that don’t exist to do the task efficiently: software for detecting duplication is severely limited, and you’d need a network graph to keep track of the information and permutations of recycling (there are many potential types). So, for anything to happen, a large number of people would have to collectively decide to do it manually.
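To make the ‘network graph’ idea concrete, here is a toy sketch of the bookkeeping, assuming you somehow already had plain-text versions of every document (which is most of the real work). The function names, the five-word shingle size and the 0.1 threshold are arbitrary choices for illustration, not an existing tool or a validated method.

```python
from itertools import combinations


def shingles(text, n=5):
    """Set of all n-word sequences (shingles) in the text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlap_graph(documents, n=5, threshold=0.1):
    """Adjacency map linking documents whose shingle overlap meets the threshold."""
    sets = {label: shingles(text, n) for label, text in documents.items()}
    graph = {label: {} for label in documents}
    for a, b in combinations(documents, 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            graph[a][b] = score
            graph[b][a] = score
    return graph


def clusters(graph):
    """Connected components: documents linked by any chain of overlap."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node])  # neighbours of this document
        seen |= group
        groups.append(sorted(group))
    return groups
```

Each edge records how much two documents share, and each cluster is a family of texts connected by some chain of re-use. Comparing every pair is quadratic in the number of documents, and none of this says which text came first or whether the earlier one was cited. That part is still manual.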

Why? Because no other formal body exists to do this. There is no-one to help you, and no-one to appeal to.

Journals and publishers don’t care, and their gatekeepers (journal and book editors) are both generally suspicious of ‘troublemakers’ and often academically related to or affiliated with the authors they publish. Cornell certainly won’t care (we’ve seen that elsewhere). There are no external bodies (ORI, COPE, etc.) who could be compelled to pay attention.

And here’s one you don’t know, but I’ll level with you: no facility exists that we have found so far to fund this work. If one did, we could pay people to do it, of course.

Would you think that error detection tasks are critical in establishing the nature of the reproducibility crisis? Would you think that raising the general consciousness of the average researcher (who pays no attention whatsoever, regardless of what you’ve heard on Twitter) would be the start of an impetus to collectively focus on improved methods, reporting, openness?

We would.

But no-one will pay for it.

Not even the most progressive funders.

We have made both formal and informal overtures (mostly informal, because grant schemes relevant to error detection are vanishingly few). They have received a one hundred percent knockback rate for all projects that involve wholesale assessment of various problems in the scientific literature.

This is obviously disproportionate to the general interest. Error detection projects we’ve been involved in have been reported in something like three dozen major news outlets.

But there is no formal support, zero. Working on something like this needs it — unless, and I reiterate, you have some pretty serious people power.

Rights-holders and concerned parties then need to be contacted.

In many cases, it will be unclear who they even are. Who do you appeal to? Journal editors change, book editors move on to other projects, publication rights get bought. Can you find them? And can you adequately explain an issue like this, which might be 10, 15 or 20 years old, to their replacements? This is why COPE has a retrospective note on their guidelines, urging editors to consider that things were ‘different’ before about 2004.

Think of the environment.

Recognise the fact that, until now, nothing about this body of work has even been noticed. Or, if it’s been noticed, it hasn’t been said publicly (which is worse). There is no public discussion of this issue. Duplicate publications which are retracted (and there are so many that Retraction Watch admits they can’t cover them all) are overwhelmingly standard scientific papers rather than reviews or commentaries (as these are).

The corollary to this, of course, is that there is no mechanism or collective will to deal with an issue like this. It is foreign to the people involved, it is foreign to the systems we use to disseminate and capture academic information.

Hence, the center of my conclusion: absolutely nothing whatsoever will happen at present, unless the responsibility to look into this is shouldered by individuals.

And that won’t happen, because everyone is scared and tired.

Because it’s always someone else’s problem.

Because it ‘isn’t that bad’.

I may be proven wrong, but I don’t think I will be.

Enjoy your Wednesday.

My Twitter.

[1] Sorry, Oscar. https://www.goodreads.com/quotes/793129-the-unspeakable-in-pursuit-of-the-uneatable

Like what you read? Give James Heathers a round of applause.
