It’s easy to tamper with watermarks from AI-generated text


AI language models work by predicting the next likely word in a sentence, producing one word at a time on the basis of those predictions. Watermarking algorithms for text divide the language model’s vocabulary into words on a “green list” and a “red list,” and then make the AI model choose words from the green list. The more words in a sentence that are drawn from the green list, the more likely it is that the text was generated by a computer. Humans tend to write sentences containing a more random mix of words.
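A toy sketch of the idea above: split the vocabulary into green and red lists with a deterministic hash, then score a passage by the fraction of its words that land on the green list. This is a simplified illustration, not any specific published scheme; the function names (`is_green`, `green_fraction`) and the hash-based split are assumptions for the example.

```python
import hashlib

def is_green(word: str, seed: str = "wm-seed") -> bool:
    """Deterministically assign each word to the green (True) or red (False) list."""
    digest = hashlib.sha256((seed + word).encode()).hexdigest()
    return int(digest, 16) % 2 == 0  # roughly half the vocabulary is green

def green_fraction(text: str) -> float:
    """Fraction of words drawn from the green list.

    Watermarked text, which was steered toward green-list words, should score
    well above the ~0.5 expected from ordinary human writing.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(is_green(w) for w in words) / len(words)
```

A detector built this way flags a passage as likely machine-generated when `green_fraction` is significantly above the chance level for its length.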

The researchers tampered with five different watermarks that work in this way. They were able to reverse-engineer the watermarks by using an API to access the AI model with the watermark applied and prompting it many times, says Staab. The responses allow an attacker to “steal” the watermark by building an approximate model of the watermarking rules. They do this by analyzing the AI outputs and comparing them with normal text.
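The comparison step described above can be sketched as follows: words that appear much more often in the watermarked model’s outputs than in ordinary reference text are probably on the green list. This is a hedged illustration of the general idea, not the paper’s actual method; the function name `estimate_green_list` and the frequency-ratio threshold are assumptions for the example.

```python
from collections import Counter

def estimate_green_list(watermarked_samples, reference_samples, boost=1.5):
    """Guess which words are green-listed by comparing word frequencies in
    watermarked model output against ordinary (unwatermarked) text."""
    wm_counts = Counter(w for s in watermarked_samples for w in s.lower().split())
    ref_counts = Counter(w for s in reference_samples for w in s.lower().split())
    wm_total = sum(wm_counts.values()) or 1
    ref_total = sum(ref_counts.values()) or 1

    guessed_green = set()
    for word, count in wm_counts.items():
        wm_rate = count / wm_total
        # Smooth unseen reference words to avoid dividing by zero.
        ref_rate = max(ref_counts.get(word, 0) / ref_total, 1 / ref_total)
        # Words over-represented in watermarked output are likely green-listed.
        if wm_rate > boost * ref_rate:
            guessed_green.add(word)
    return guessed_green
```

With a guessed green list in hand, an attacker can deliberately write with green words (spoofing) or swap them out for red synonyms (scrubbing), which is the basis of the two attacks described next.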

Once they have an approximate idea of which words are watermarked, the researchers can execute two kinds of attacks. The first, called a spoofing attack, allows malicious actors to use the information learned from stealing the watermark to produce text that can be passed off as watermarked. The second attack allows hackers to scrub AI-generated text of its watermark, so the text can be passed off as human-written.

The team had a roughly 80% success rate in spoofing watermarks, and an 85% success rate in stripping AI-generated text of its watermark.

Researchers not affiliated with the ETH Zürich team, such as Soheil Feizi, an associate professor and director of the Reliable AI Lab at the University of Maryland, have also found watermarks to be unreliable and vulnerable to spoofing attacks.

The findings from ETH Zürich confirm that these issues with watermarks persist and extend to the most advanced types of chatbots and large language models in use today, says Feizi.

The research “underscores the importance of exercising caution when deploying such detection mechanisms on a large scale,” he says.

Despite the findings, watermarks remain the most promising way to detect AI-generated content, says Nikola Jovanović, a PhD student at ETH Zürich who worked on the research. 

But more research is needed to make watermarks ready for deployment on a large scale, he adds. Until then, we should manage our expectations of how reliable and useful these tools are. “If it’s better than nothing, it is still useful,” he says.  

Update: This research will be presented at the International Conference on Learning Representations. The story has been updated to reflect that.
