How do Trufo’s Watermarks Stack Up?

An evaluation of Trufo’s watermarks.

TrufoAI · Trufo · 3 min read · Feb 20, 2024


Using the open-source watermark-benchmark library, we evaluated Trufo’s watermarking capabilities. For comparison, we also evaluated the popular invisible-watermark library, which has over 1,000 stars on GitHub and is officially used by established generative AI projects such as Stable Diffusion.

Methodology

We measured three primary metrics.

  • Durability: the proportion of edited images from which the watermark payload was recovered without errors. For each image in the dataset, 60 pseudo-random tests were conducted, ranging from a simple quality-95 JPEG export to compositions of resizing, overlays, compression, and more.
  • Invisibility: an even blend of PSNR, SSIM (scaled to 0–100), and a new perceptibility metric PCPA that addresses some of the weaknesses of the other two.
  • Scalability: the time required by the watermark encoding and decoding processes, measured in milliseconds and normalized by image size.
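As a concrete illustration of the invisibility blend, here is a NumPy sketch. The PSNR cap at 100 and the single-window SSIM are simplifications (standard SSIM averages local windows), and since PCPA has not been published, its score is passed in as an argument rather than computed:

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB, capped at 100 (identical images score 100)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 100.0 if mse == 0 else min(100.0, 10 * np.log10(max_val ** 2 / mse))

def ssim_global(a, b, max_val=255.0):
    """Simplified single-window SSIM; the standard metric averages local windows."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def invisibility(original, marked, pcpa_score):
    """Even blend of PSNR, SSIM scaled to 0-100, and PCPA (supplied externally)."""
    return (psnr(original, marked) + 100.0 * ssim_global(original, marked) + pcpa_score) / 3.0

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64), dtype=np.uint8)
noisy = np.clip(img.astype(np.int64) + rng.integers(-20, 21, img.shape), 0, 255).astype(np.uint8)

score_identical = invisibility(img, img, pcpa_score=100.0)
score_noisy = invisibility(img, noisy, pcpa_score=80.0)  # hypothetical PCPA value
```

An unedited image blends to a perfect score; any visible perturbation pulls the blend down through all three components.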

These metrics were developed from first principles, and have not been calibrated or tuned to favor Trufo’s watermarks.

The IMG_1 dataset (10 images) was treated as a training set. The IMG_VOC dataset (132 images) was treated as a blind test set (the evaluation was run once and only once).
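The durability procedure described earlier can be sketched as a generic harness. The `encode`/`decode` arguments stand in for any watermark codec, the edit list is a simplified stand-in for the benchmark's real pipeline (JPEG export, overlays, compression, and so on), and the always-correct stub codec at the end exists only to exercise the harness:

```python
import numpy as np

def durability(images, encode, decode, payload, n_tests=60, seed=0):
    """Fraction of (image, edit) trials in which the payload is recovered exactly."""
    rng = np.random.default_rng(seed)
    # Simplified stand-ins for the real edit pipeline.
    edits = [
        lambda im, r: im[::2, ::2],                      # 2x downscale
        lambda im, r: np.clip(im.astype(np.int64)        # additive noise
                              + r.integers(-5, 6, im.shape), 0, 255).astype(np.uint8),
        lambda im, r: np.rot90(im),                      # orientation change
    ]
    hits = 0
    for im in images:
        marked = encode(im, payload)
        for _ in range(n_tests):
            edit = edits[rng.integers(len(edits))]
            if decode(edit(marked, rng)) == payload:
                hits += 1
    return hits / (len(images) * n_tests)

# Trivial always-correct stub codec, just to show the harness mechanics.
images = [np.zeros((32, 32), dtype=np.uint8)]
score = durability(images, lambda im, p: im, lambda im: b'\x01\x02', b'\x01\x02')
```

A real codec plugged into `encode`/`decode` would score somewhere between 0 and 1 depending on how many edits it survives.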

Note that Trufo’s watermarks do come with a suite of useful functionalities, such as content provenance and fuzzy authentication, that are not tested in the evaluation.

Results

Trufo’s watermarks outperformed the reference implementation by a few orders of magnitude.

*This is our current estimate of the capabilities of other private watermarking services; it is not based on actual benchmark results. Services that are not optimized for durability and invisibility may see lower scores, while those that are not optimized for reliability and scalability may see higher scores.

For a deeper dive, here is the summary table of results.

There are a few analytical takeaways here:

  • The reference implementations slow down significantly for larger image sizes (IMG_1 contains two ~4K images; IMG_VOC contains none). Trufo’s watermarks do not; in particular, the heavy watermark that is substantially slower on small images becomes substantially faster on large images (by 2x and 8x versus DwtDct and DwtDctSvd, respectively; RivaGAN crashed the kernel).
  • The PCPA metric is very friendly towards DwtDct and very unfriendly towards DwtDctSvd. From a quick human inspection, DwtDct is indeed far less visible than DwtDctSvd. On the flip side, Trufo’s medium and heavy watermarks have similar invisibility scores, but in practice the heavy version does tend to be more noticeable.
  • The AI-only model (reference RivaGAN) requires a large amount of compute. With a hearty setup of H100 graphics cards and proper GPU optimization, the watermark decoding time can probably be reduced to a practical level, but it is tough. Trufo’s decoding time, in contrast, is consistently quick.
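For reference, the scalability metric (encode/decode time normalized by image size) can be measured with a small timing harness like the sketch below. The FFT-based `stub_encode` is a hypothetical stand-in with image-size-dependent cost; a real run would time the actual encoders and decoders:

```python
import time
import numpy as np

def ms_per_megapixel(fn, image, repeats=3):
    """Median wall-clock time of fn(image), in milliseconds per megapixel."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(image)
        times.append((time.perf_counter() - t0) * 1000.0)
    megapixels = image.shape[0] * image.shape[1] / 1e6
    return sorted(times)[len(times) // 2] / megapixels

def stub_encode(image):
    # Hypothetical stand-in for a watermark encoder.
    return np.fft.fft2(image.astype(np.float64))

img = np.random.default_rng(0).integers(0, 256, (1080, 1920), dtype=np.uint8)
cost = ms_per_megapixel(stub_encode, img)
```

Normalizing by megapixels is what exposes the slowdown on ~4K images: an encoder with superlinear cost gets worse per megapixel as images grow.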

There is still plenty of room for research and improvement in the space of digital watermarking. Stay tuned for updates from Trufo!
