| Main Authors: | Xie, Liyan; Siddeek, Muhammad; Seif, Mohamed; Goldsmith, Andrea J.; Wang, Mengdi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning |
| Online Access: | https://arxiv.org/abs/2510.01637 |
| _version_ | 1866912622915354624 |
|---|---|
| author | Xie, Liyan; Siddeek, Muhammad; Seif, Mohamed; Goldsmith, Andrea J.; Wang, Mengdi |
| author_facet | Xie, Liyan; Siddeek, Muhammad; Seif, Mohamed; Goldsmith, Andrea J.; Wang, Mengdi |
| contents | Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_01637 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking |
| topic | Machine Learning |
| url | https://arxiv.org/abs/2510.01637 |
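
The abstract describes partitioning the vocabulary into disjoint subsets, enforcing a deterministic pattern over those subsets during generation, and then using a global statistic to detect the watermark and local statistics to localize edits. The toy sketch below illustrates that general idea only; it is not the authors' construction. The modulo-based partition, the cyclic pattern, the match-rate statistic, and all function names are assumptions made for illustration, and the sketch handles only substitution edits (insertions or deletions would shift the pattern and need alignment, which is omitted here).

```python
from typing import List, Sequence


def subset_of(token_id: int, k: int) -> int:
    """Assumed partition: token ids are split into k disjoint subsets by id mod k."""
    return token_id % k


def expected_subset(position: int, pattern: Sequence[int]) -> int:
    """Assumed deterministic pattern: subset indices repeat cyclically over positions."""
    return pattern[position % len(pattern)]


def allowed_tokens(position: int, vocab_size: int, k: int,
                   pattern: Sequence[int]) -> List[int]:
    """Tokens a generator would be restricted to at this position under the pattern."""
    target = expected_subset(position, pattern)
    return [t for t in range(vocab_size) if subset_of(t, k) == target]


def global_match_rate(tokens: Sequence[int], k: int,
                      pattern: Sequence[int]) -> float:
    """Toy global statistic: fraction of positions whose token follows the pattern."""
    if not tokens:
        return 0.0
    hits = sum(subset_of(t, k) == expected_subset(i, pattern)
               for i, t in enumerate(tokens))
    return hits / len(tokens)


def flag_mismatches(tokens: Sequence[int], k: int,
                    pattern: Sequence[int]) -> List[int]:
    """Toy local statistic: positions whose token breaks the pattern."""
    return [i for i, t in enumerate(tokens)
            if subset_of(t, k) != expected_subset(i, pattern)]


# Hypothetical setup: 12-token vocabulary, 3 subsets, cyclic pattern of length 4.
K, VOCAB_SIZE, PATTERN = 3, 12, [0, 1, 2, 1]

# "Generate" 8 watermarked tokens by always taking the first allowed token.
watermarked = [allowed_tokens(i, VOCAB_SIZE, K, PATTERN)[0] for i in range(8)]

# Simulate a post-generation substitution at position 5 with a token from the wrong subset.
edited = list(watermarked)
edited[5] = 3

print(global_match_rate(watermarked, K, PATTERN))  # 1.0
print(global_match_rate(edited, K, PATTERN))       # 0.875
print(flag_mismatches(edited, K, PATTERN))         # [5]
```

In this toy setting an unedited sequence matches the pattern at every position, while a single substitution lowers the global match rate and is flagged at exactly the edited position. A substituted token can also land in the correct subset by chance and go unnoticed, which is one reason a practical scheme needs statistics with controlled error rates, such as the Type-I error rate the abstract mentions.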