Benchmarking Watermarking

An open-source standard for evaluating modern watermarks.

Trufo · Feb 20, 2024

As part of our mission to embed trust in digital media, Trufo has released an open-source watermark benchmark library. Detailed instructions for developers and researchers on how to use the library can be found in the README file in the repository.

Why a New Standard?

Digital watermarking has been around for half a century, and there is a substantial collection of literature on various techniques and technologies, as well as on the assessment of said methods. In the past few years, however, the raison d’être of digital watermarking has evolved significantly.

The culprit is the advancement of content generation capabilities. On one hand, the availability of phone cameras and the omnipresence of social media mean that the main content generators are no longer just a handful of companies but rather the general public at large. On the other hand, the rapid development of generative AI capabilities means that digital content will be increasingly easy to create, mimic, or alter.

Existing evaluations for watermarks are tailored to the old priorities. While they are designed well for their purpose, many of the tested properties are not all that relevant. Here are a few examples:

  • When assessing robustness, two common tests are the addition of (1) Gaussian noise and (2) salt & pepper noise. In practice, (2) most often arises in analog-to-digital conversion, and is rare nowadays due to the widespread use of error correction in data transmission. And while (1) may naturally arise at very small levels from color-space rounding, in practice color filters (including simple gamma correction) are more common and more impactful (the two are contrasted in the code sketch after this list).
  • Many concerns are raised over security, and in particular, over secrecy. Oftentimes, security is dependent solely on the secrecy of the specific methods and parameters of the watermark, which does not scale well with mass adoption. The data that is carried is likewise restricted to a private indication of ownership, with the occasional inclusion of an ID to track down an offending user in the case of a copyright violation.
  • PSNR is probably the most common way of assessing perceptibility. It is simple, and it is great because it is simple (we use it too!). However, dig a little deeper and it is not all that representative of how human visual perception works.
  • Overall, the focus is on preventing adversarial attacks in the context of controlled distribution. If a cable company wants to send a paywalled movie to a DVR console in a known AV encoding, they do not need to worry about the content being resized and compressed when sent over text or email, but they do need to worry about a hacker trying to pirate the content.
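To make the contrast in the first bullet concrete, here is a minimal sketch of the two kinds of perturbations, written with NumPy and Pillow rather than the benchmark library itself; the file name and parameter values are placeholders.

```python
import numpy as np
from PIL import Image

def add_gaussian_noise(img: Image.Image, sigma: float = 5.0) -> Image.Image:
    # Classic robustness test: zero-mean Gaussian noise on every pixel.
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def apply_gamma(img: Image.Image, gamma: float = 1.8) -> Image.Image:
    # The kind of benign color filter that is far more common in practice.
    arr = np.asarray(img).astype(np.float32) / 255.0
    corrected = np.power(arr, 1.0 / gamma)
    return Image.fromarray((corrected * 255.0).astype(np.uint8))

original = Image.open("example.jpg").convert("RGB")   # placeholder input
noisy = add_gaussian_noise(original)    # rarely seen at damaging levels today
filtered = apply_gamma(original)        # routine edit a watermark must survive
```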

The watermark benchmark library Trufo is releasing focuses instead on the modern problem, which includes:

  • Distinguishing generated (AI) content from genuine (camera, human) content.
  • Mitigating misinformation, by making it harder to create and reducing its impact once received.
  • Allowing the average creator to assert their content rights.

There are four primary characteristics associated with a good watermark for these purposes. This library serves as an effective tool to evaluate two of them: durability and invisibility.

The Dataset

Currently, only images are supported. There are four image datasets.

IMG_1: This dataset contains ten images, across a range of sizes, styles, and formats. Attributions are in the dataset mirror.

Note: two images are missing here because their file formats (.bmp, .tiff) are not supported.

IMG_VOC*: This is a dataset of 132 medium (HD-) images.

IMG_BIG*: This is a dataset of 17 large (4K+) images.

IMG_ART*: This is a dataset of 47 artistic works.

*These are compiled by Angela Tan, CS@Princeton ’25. The VOC and BIG selections are balanced across natural landscapes, urban landscapes, and people-centric photos.

Evaluating Durability

The focus here, at least for the initial release of the library, is on benign content alterations. The first reason is that watermarks, and in particular cryptographic watermarks, serve as a mark of authenticity, so users are generally incentivized to keep them on. The second reason is that benign alterations, such as file downsizing and screenshotting, constitute the lion’s share of content alterations in practice.

A number of image edits are evaluated. Due to the large number of possible edits and the large number of test images, some degree of reproducible randomness is added to obtain a reasonable balance. The full list of edits includes (a few representative edits are sketched in code below):

  • Simple exporting: lossless (PNG) and lossy (JPEG, 95Q).
  • Various JPEG compression levels: 99Q, 90Q, 80Q, …, 10Q.
  • Cropping: one side (2), all sides (2), random selections (4).
  • Resizing: enlarge (2), shrink (2), aspect change (2), fixed sizes (2).
  • Rotation: 90-degrees + reflections (3), fixed small angles (4), random angle (1).
  • Filters: brightening (2), saturation + hue (2), grayscale (1), negative (1), blurring (1), sharpening (1), discretizing (1).
  • Alterations: box overlay (2), lines overlay (2), lines removal (2), text addition (2).
  • Composites: “post” ~ filter + alter + compress (2), “share” ~ resize + alter + compress (2), “screenshot” ~ resize + crop (2).

There are a total of 60 tests. Smaller test selection settings are also available (2 tests, 12 tests), and so is a false positive test setting.
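For concreteness, here is a minimal sketch of a few representative edits from the list above, written with Pillow. The benchmark applies its own parameterized (and partly randomized) versions, so the file names and values below are placeholders.

```python
from PIL import Image, ImageFilter

img = Image.open("example.png").convert("RGB")  # placeholder input
w, h = img.size

# Lossy export: re-encode as JPEG at quality 80 (one of the compression levels).
img.save("export_q80.jpg", format="JPEG", quality=80)

# Cropping: trim 10% from every side.
cropped = img.crop((int(0.1 * w), int(0.1 * h), int(0.9 * w), int(0.9 * h)))

# Resizing: shrink with an aspect-ratio change.
resized = img.resize((w // 2, int(0.4 * h)))

# Rotation: a fixed small angle, expanding the canvas to keep the corners.
rotated = img.rotate(5, expand=True)

# Filter: a mild blur, standing in for the filter family of edits.
blurred = img.filter(ImageFilter.GaussianBlur(radius=2))
```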

Evaluating Invisibility

Three separate metrics are used. We have found that the composite score is usually quite representative of true visibility.

The first is PSNR, which is a log-scale measure of the average magnitude of the noise (i.e., the watermark) added to an image.
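PSNR has a standard closed form, 10 * log10(peak^2 / MSE). A minimal NumPy version for 8-bit images (not necessarily the exact implementation used in the library):

```python
import numpy as np

def psnr(original: np.ndarray, watermarked: np.ndarray, peak: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB, assuming 8-bit image arrays.
    diff = original.astype(np.float64) - watermarked.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```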

The second is SSIM, which factors in luminance, contrast, and structure into a similarity metric. The calculation is done over a number of windows covering an image.

To roughly place it onto a 0–100 scale (like PSNR), a transformation is applied in the analysis to obtain a scaled decimal version.
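A sketch using scikit-image's SSIM implementation, with a simple clamp-and-scale mapping onto 0–100; the mapping here is an assumption for illustration, not the exact transformation applied in the library's analysis.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_scaled(original: np.ndarray, watermarked: np.ndarray) -> float:
    # Windowed SSIM over the image; channel_axis=-1 treats the last axis as color.
    score = structural_similarity(
        original, watermarked, channel_axis=-1, data_range=255  # assumes 8-bit inputs
    )
    # Illustrative mapping onto a rough 0-100 scale (clamp negatives, then scale).
    return 100.0 * max(score, 0.0)
```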

The third is an in-house metric, which we call PCPA (perceptibility A). The first step is to calculate a local perceptibility score.

Because the human eye resolves luminosity at a higher precision than color, a larger blend parameter is used for the color channels. The scores across the three color channels are then summed, and a weighted mean is taken across the entire image. In order to achieve a reasonable balance between omnipresent light signals and localized heavy signals, a power factor of 1.5 is used in the mean.

Furthermore, because small images and large images are different, to both humans and computers, the computed scores of the raw image and of a resized version are combined. Finally, the value is placed onto a 0–100 scale.

The computation is neither simple nor elegant, unfortunately, but it does the job well enough.
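The details of PCPA are in-house, but the aggregation step described above reads like a weighted power mean with exponent 1.5 over local perceptibility scores. A hypothetical sketch of just that step, with the per-pixel scores and weights taken as given:

```python
import numpy as np

def weighted_power_mean(local_scores: np.ndarray, weights: np.ndarray, p: float = 1.5) -> float:
    # Hypothetical aggregation: a power factor above 1 lets localized heavy
    # signals weigh more than an equally "loud" but evenly spread signal.
    w = weights / weights.sum()
    return float(np.sum(w * local_scores ** p) ** (1.0 / p))
```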

Analysis

The library also includes an analysis module, which provides convenient summaries of the raw benchmark data. Three summaries are provided (a sketch of the aggregation follows the list):

  • Overall encoding and decoding scores by (watermark / dataset / evaluation).
  • Breakdown of decoding scores by (watermark / dataset / evaluation / edit type).
  • List of encoding and decoding scores by (watermark / dataset / evaluation / image).
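These summaries are straightforward aggregations of the raw per-image, per-edit results. A hedged sketch of the first two using pandas over a hypothetical results table; the library's analysis module has its own schema and entry points.

```python
import pandas as pd

# Hypothetical raw benchmark output: one row per (watermark, dataset, evaluation, image, edit).
results = pd.DataFrame({
    "watermark": ["wm_a", "wm_a", "wm_b", "wm_b"],
    "dataset": ["IMG_1"] * 4,
    "evaluation": ["durability"] * 4,
    "edit_type": ["jpeg_80", "resize", "jpeg_80", "resize"],
    "decode_success": [1, 1, 1, 0],
})

# Summary 1: overall decoding score by (watermark / dataset / evaluation).
overall = results.groupby(["watermark", "dataset", "evaluation"])["decode_success"].mean()

# Summary 2: decoding score broken down further by edit type.
by_edit = results.groupby(["watermark", "dataset", "evaluation", "edit_type"])["decode_success"].mean()
```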

We have already run the analysis on a number of watermarks, including our own. If you have watermarks you would like to analyze (or results you want to share), let us know at engineering@trufo.ai!
