The Science of Detecting LLM-Generated Texts

Ruixiang Tang
5 min read · Feb 3, 2023

Recent advancements in natural language generation (NLG) technology have significantly improved the diversity, control, and quality of LLM-generated texts. A notable example is OpenAI’s ChatGPT, which demonstrates exceptional performance in tasks such as answering questions and composing emails, essays, and code. However, this newfound capability to produce human-like text at high efficiency also raises concerns about detecting and preventing misuse of LLMs in tasks such as phishing, disinformation, and academic dishonesty.

Full Paper Link

Figure 1. An overview of LLM-generated text detection.

Existing detection methods can be roughly grouped into two categories: black-box detection and white-box detection. Black-box methods are limited to API-level access to the LLM: they collect text samples from human and machine sources, respectively, and train a classification model to discriminate between LLM- and human-generated texts. White-box detection, by contrast, assumes the detector has full access to the LLM and can control the model’s generation behavior for traceability purposes. In practice, black-box detectors are commonly built by external entities, whereas white-box detection is generally carried out by the LLM developers themselves.

Figure 2. The top-k overlay using the visualization tool GLTR. There is a notable difference between the two texts. The human-written text is from Chalkbeat New York.

Black-box Detection

To construct an effective detector, black-box methods require the collection of text samples from both human-generated and machine-generated sources. Subsequently, a classifier is trained to differentiate between the two categories based on chosen features.
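As a concrete illustration, the sketch below trains a simple bag-of-words classifier to tell the two sources apart. The training texts, features, and model choice here are placeholder assumptions for illustration, not the pipeline of any specific detector; real systems train on large, topically balanced corpora and often use neural encoders.

```python
# A minimal black-box detector sketch: TF-IDF features plus logistic
# regression. All data below is hypothetical placeholder text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical corpora; a real detector needs many samples per source.
human_texts = ["an essay written by a student", "a reporter's news article"]
llm_texts = ["an answer produced by a chatbot", "a machine-written summary"]

X = human_texts + llm_texts
y = [0] * len(human_texts) + [1] * len(llm_texts)  # 0 = human, 1 = LLM

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression())
detector.fit(X, y)

# Score an unseen text; the output is [P(human), P(LLM)].
print(detector.predict_proba(["some new text to classify"]))
```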

Some commonly used detection features include statistical disparities and linguistic patterns. For example, GLTR [1] was developed to detect generation artifacts across common sampling methods, as demonstrated in Figure 2. Perplexity is another commonly used metric for LLM-generated text detection. It measures how well a language model predicts a text, defined as the exponentiated average negative log-likelihood of the text under the LLM. Studies have shown that language models tend to concentrate on the common patterns in the texts they were trained on, yielding low perplexity scores for LLM-generated text. Conversely, human authors express themselves in a much wider range of styles, resulting in higher perplexity values.
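The snippet below is a minimal sketch of perplexity scoring, assuming GPT-2 from Hugging Face transformers as the reference model; any causal LLM that exposes token likelihoods would work the same way. In this style of detection, unusually low scores are treated as a (weak) signal of machine generation.

```python
# Perplexity scoring sketch: exp of the average negative log-likelihood
# of a text under a reference LLM (here GPT-2, chosen for illustration).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids, the model returns the mean cross-entropy
        # (average negative log-likelihood) over the sequence.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Lower perplexity -> more predictable to the model; thresholds for
# flagging a text must be calibrated per domain.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```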

Figure 3. Inference-time watermarking.

White-box Detection

In white-box detection, the detector has full access to the target language model, allowing secret watermarks to be embedded into its outputs for monitoring suspicious or unauthorized activity. A representative example of this method is the work of Kirchenbauer et al. [2]. At each generation step, a hash code is computed from the previously generated token and used to seed a random number generator. This seed randomly divides the vocabulary into a “green list” and a “red list” of equal size, and the next token is sampled from the green list. In this way, the watermark is embedded into every generated word, as depicted in Figure 3. To detect the watermark, a third party with knowledge of the hash function and random number generator can reproduce the red list for each token and count violations of the red-list rule, thus verifying the authenticity of the text. The probability that a natural source produces N tokens without ever violating the red-list rule is only (1/2)^N, which is vanishingly small even for text fragments of a few dozen words. To remove the watermark, an adversary would need to modify at least half of the document’s tokens.
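The sketch below imitates the detection side of this scheme under simplified assumptions: Python’s built-in hash of the previous token stands in for the paper’s keyed hash function, and the vocabulary size is a placeholder; a real implementation would operate on the LLM’s tokenizer output.

```python
# Green-list counting sketch for the Kirchenbauer et al. [2] watermark.
# Assumptions: hash() as the hash function, a fixed placeholder vocabulary.
import random

VOCAB_SIZE = 50257  # e.g., GPT-2's vocabulary size; illustrative only

def green_list(prev_token: int) -> set[int]:
    # The previous token seeds the RNG, which splits the vocabulary
    # into equal-sized green and red halves.
    rng = random.Random(hash(prev_token))
    vocab = list(range(VOCAB_SIZE))
    rng.shuffle(vocab)
    return set(vocab[: VOCAB_SIZE // 2])

def count_green(tokens: list[int]) -> int:
    # Watermarked text should land in the green list at (almost) every
    # position; natural text does so only about half the time.
    return sum(tok in green_list(prev)
               for prev, tok in zip(tokens, tokens[1:]))
```

A green count close to the number of scored positions flags the text as watermarked. In practice the paper turns this count into a z-score rather than using the raw (1/2)^N bound, and its “soft” variant of the rule preserves text quality when the green list contains no good continuation.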

A taxonomy of LLM-generated text detection.

Authors’ Concerns:

(1) Data collection plays a vital role in the development of black-box detectors, as these systems learn to identify detection signals from the data they are trained on. However, the data collection process can introduce biases that hurt the detector’s performance and generalization. These biases take several forms. For example, many existing studies focus on only one or a few specific tasks, such as question answering or news generation, leading to an imbalanced distribution of topics in the data. Additionally, human artifacts can easily creep in during data collection: in the study by Guo et al. [3], the lack of style instruction led OpenAI’s ChatGPT to produce answers with a uniformly neutral sentiment. Such spurious correlations can be captured and even amplified by the detector, leading to poor generalization when deployed in real-world applications.

(2) Current detection methods assume that the LLM is controlled by its developers and offered as a service to end users; this one-to-many relationship is conducive to detection. However, developers open-sourcing their models, or models being stolen by hackers, poses a challenge to these approaches. Once an end user gains full access to the LLM, the ability to modify the model’s behavior prevents black-box detection from identifying generalized language patterns. Embedding a watermark in the open-sourced LLM is a potential solution, but it too can be defeated: users with full access can fine-tune the model or change sampling strategies to erase the watermark. Currently, the cost and effort involved in training LLMs make it unlikely that developers will release their most powerful models. Nonetheless, detecting texts generated by open-sourced LLMs remains a critical issue to be addressed in the future.

Conclusion

While black-box detection works at present because language models leave detectable signals in generated text, it will gradually become less viable as model capabilities advance and may ultimately become infeasible. In light of the rapid improvement in LLM-generated text quality, the future of reliable detection tools lies in white-box watermarking approaches.

References

[1] Gehrmann, Sebastian, Hendrik Strobelt, and Alexander M. Rush. “GLTR: Statistical Detection and Visualization of Generated Text.” Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2019.

[2] Kirchenbauer, John, et al. “A Watermark for Large Language Models.” arXiv preprint arXiv:2301.10226 (2023).

[3] Guo, Biyang, et al. “How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection.” arXiv preprint arXiv:2301.07597 (2023).
