Image Similarity: PDQ algorithm for real-time similarity comparison against image store
Image Fuzzy Matching: The summary
Darwinium risk assessments allow:
- Ingesting and comparing an image to all those seen previously
- Matching to all perceptually similar images, not just exact matches
- Adjusting similarity thresholds and lookback time periods
- Generating resulting features to use in signals and models
… to use for:
- Abuse detection: Repeated use of similar images
- Authenticity: Determine if images are likely genuine or from bad actors
- Safety: 3rd party lookups of known bad content
… with the benefit of being:
- Comprehensive
- Quick
- Adjustable and extensible
- Incorporated with overall digital risk assessment
- Cost effective
- Privacy preserving
Analyse and compare images for similarity, in real time
Wouldn’t it be great to analyse images as soon as they are uploaded? To compare them against every image you have previously seen, matching those that are similar as well as those that are identical? To compare them against 3rd party stores of abusive content? And to do all of this quickly and cheaply?
Why: Abuse prevention, authenticity and safety
Image screening can be made quicker and better integrated into the customer journey, while moving image risk assessment closer to the point of upload.
- Abuse detection: Repeated use of same or similar images
- Authenticity: Determine if images are likely from genuine users or provided by bad actors
- Safety: Match to 3rd party database lookup of known illicit content
- Liability: Don’t allow known bad content into your digital estate, often a legal requirement
Solution: Lightweight hashing, similarity and fuzzy search algorithms
Darwinium have ported an image transformation and similarity algorithm (‘PDQ’) to Rust. The resulting hashes are then passed to a generalised fuzzy storage and matching framework that retrieves all potentially similar matches in real time when required.
How is it done?
1. Ingestion at point of upload, as a stream or in browser
Darwinium’s decision engine architecture, which leverages Rust and WebAssembly, can ingest and act at the point of an image upload attempt, as a stream and even in the browser, prior to submission.
‘As a stream’ refers to applying transformations to subsets of the image during processing, which makes it quicker and more secure. The point-in-time memory required is also reduced, becoming fixed and independent of image size.
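To make the fixed-memory point concrete, here is a minimal sketch, not Darwinium’s implementation. It assumes pixel rows arrive one at a time as luminance values and averages them into a fixed 64x64 grid. The function names and the synthetic row source are hypothetical, and a plain box average stands in for the filtering a real PDQ pipeline applies.

```python
from typing import Iterable, Iterator, List

GRID = 64  # fixed working size; PDQ also reduces images to a 64x64 luminance grid


def stream_to_grid(rows: Iterable[List[float]], height: int, width: int) -> List[List[float]]:
    """Average an image into a GRID x GRID grid, consuming one pixel row at a time.

    Only the accumulators (2 * GRID * GRID numbers) are held in memory, so the
    footprint stays fixed regardless of the image dimensions.
    """
    sums = [[0.0] * GRID for _ in range(GRID)]
    counts = [[0] * GRID for _ in range(GRID)]
    for y, row in enumerate(rows):
        gy = y * GRID // height  # grid row that this pixel row falls into
        for x, value in enumerate(row):
            gx = x * GRID // width  # grid column that this pixel falls into
            sums[gy][gx] += value
            counts[gy][gx] += 1
    # Cells that received no pixels (images smaller than the grid) default to 0.0.
    return [
        [s / c if c else 0.0 for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]


def synthetic_rows(height: int, width: int) -> Iterator[List[float]]:
    """Hypothetical row source standing in for a decoder that yields rows lazily."""
    for y in range(height):
        yield [float((x + y) % 256) for x in range(width)]


grid = stream_to_grid(synthetic_rows(480, 640), height=480, width=640)
print(len(grid), len(grid[0]))  # 64 64
```

Because each row is discarded after it has been folded into the accumulators, a 640x480 image and a 6400x4800 image need the same working memory.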
The image, metadata and properties are consumed for risk assessment and analysis.
2. Algorithm choice for image representation and comparison
The PDQ algorithm was developed and open-sourced by Facebook (now Meta) in 2019. It specifies a transformation which converts images into a binary format (‘PDQ Hash’) whereby ‘perceptually similar’ images produce similar outputs. It was designed to offer an industry standard for representing images, so that organisations can collaborate on threat mitigation.
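The final stages of such a transformation can be sketched as follows. This is a simplified illustration, not the reference PDQ implementation and not Darwinium’s Rust port. It assumes the image has already been reduced to a 64x64 luminance grid, and it omits the luminance conversion, filtering and quality score. The function name `pdq_like_hash` and the bit ordering are assumptions, so the output is not bit-compatible with real PDQ hashes.

```python
import math
from statistics import median
from typing import List

N_IN, N_OUT = 64, 16  # 64x64 luminance grid in, 16x16 DCT coefficients out


def dct_matrix() -> List[List[float]]:
    """16x64 DCT-II basis that skips the constant (DC) row."""
    scale = math.sqrt(2.0 / N_IN)
    return [
        [scale * math.cos(math.pi / (2 * N_IN) * (i + 1) * (2 * j + 1)) for j in range(N_IN)]
        for i in range(N_OUT)
    ]


def pdq_like_hash(grid: List[List[float]]) -> int:
    """Return a 256-bit integer: a bit is 1 where its coefficient exceeds the median."""
    d = dct_matrix()
    # tmp = D * grid (16x64), then coeffs = tmp * D^T (16x16).
    tmp = [[sum(d[i][k] * grid[k][x] for k in range(N_IN)) for x in range(N_IN)]
           for i in range(N_OUT)]
    coeffs = [[sum(tmp[i][x] * d[j][x] for x in range(N_IN)) for j in range(N_OUT)]
              for i in range(N_OUT)]
    flat = [c for row in coeffs for c in row]
    med = median(flat)
    value = 0
    for c in flat:
        value = (value << 1) | (1 if c > med else 0)
    return value


# Usage: a synthetic 64x64 pattern and a darker, lower-contrast copy of it.
grid = [[128 + 60 * math.sin((x + 2 * y) / 7.0) + 40 * math.cos((3 * x - y) / 11.0)
         for x in range(N_IN)] for y in range(N_IN)]
darker = [[0.5 * v + 10 for v in row] for row in grid]
h1, h2 = pdq_like_hash(grid), pdq_like_hash(darker)
print(f"{h1:064x}")
print(bin(h1 ^ h2).count("1"))  # differing bits; expected to be very small
```

Thresholding against the median rather than a fixed value is what makes the bits insensitive to overall brightness and contrast changes: scaling or offsetting the pixels scales the coefficients and their median together.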
Comparing two images reduces to computing the distance (for example, Hamming distance) between their representations, which can also be expressed as % bit similarity.
Note: 16 bits are used here only for easier interpretation; actual PDQ hashes are 256 bits long.
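As a small worked example of that comparison, the sketch below treats hashes as integers. The helper names are illustrative, and the two 16-bit values are made up for readability.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def bit_similarity(a: int, b: int, bits: int = 256) -> float:
    """Percentage of bit positions in which the two hashes agree."""
    return 100.0 * (bits - hamming_distance(a, b)) / bits


# Two illustrative 16-bit hashes that differ in two positions.
a = 0b1011001110001101
b = 0b1011001010001111
print(hamming_distance(a, b))         # 2
print(bit_similarity(a, b, bits=16))  # 87.5

# Real PDQ hashes are 256 bits, commonly written as 64 hex characters.
h = int("f0" * 32, 16)
print(bit_similarity(h, h ^ 0b1111))  # 98.4375
```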
3. Consider additional image transformations
Additionally, PDQ hashes for rotations and mirrors of the original image can be inferred efficiently by manipulating the Discrete Cosine Transform computed in the later stages of processing.
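The reason this works is a symmetry of the cosine basis: mirroring the input multiplies the coefficient at frequency k by (-1)^k, and transposing the input transposes the coefficients. The sketch below checks this numerically. It assumes a 16x64 DCT-II basis that skips the DC row, and the helper names are hypothetical. The hash for each variant would then be obtained by thresholding the derived coefficients against their median again.

```python
import math
from typing import List

Matrix = List[List[float]]
N_IN, N_OUT = 64, 16


def dct_coeffs(grid: Matrix) -> Matrix:
    """16x16 low-frequency DCT-II coefficients of a 64x64 grid (DC row skipped)."""
    d = [[math.cos(math.pi / (2 * N_IN) * (i + 1) * (2 * j + 1)) for j in range(N_IN)]
         for i in range(N_OUT)]
    tmp = [[sum(d[i][k] * grid[k][x] for k in range(N_IN)) for x in range(N_IN)]
           for i in range(N_OUT)]
    return [[sum(tmp[i][x] * d[j][x] for x in range(N_IN)) for j in range(N_OUT)]
            for i in range(N_OUT)]


def mirrored_coeffs(coeffs: Matrix) -> Matrix:
    """Coefficients of the left-right mirrored image, without touching pixels."""
    # Column j holds horizontal frequency j + 1, which is multiplied by (-1)^(j + 1).
    return [[-c if j % 2 == 0 else c for j, c in enumerate(row)] for row in coeffs]


def rotated_180_coeffs(coeffs: Matrix) -> Matrix:
    """Coefficients of the image rotated by 180 degrees (both axes mirrored)."""
    return [[c if (i + j) % 2 == 0 else -c for j, c in enumerate(row)]
            for i, row in enumerate(coeffs)]


def transposed_coeffs(coeffs: Matrix) -> Matrix:
    """Coefficients of the transposed image; 90-degree turns combine this with a mirror."""
    return [list(col) for col in zip(*coeffs)]


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


grid = [[0.5 + 0.3 * math.sin((x + 2 * y) / 7.0) + 0.2 * math.cos((3 * x - y) / 11.0)
         for x in range(N_IN)] for y in range(N_IN)]
base = dct_coeffs(grid)
mirror_err = max_abs_diff(dct_coeffs([row[::-1] for row in grid]), mirrored_coeffs(base))
rot_err = max_abs_diff(dct_coeffs([row[::-1] for row in grid[::-1]]), rotated_180_coeffs(base))
print(mirror_err, rot_err)  # both should be at floating-point noise level
```

Each derived variant costs a pass over 256 coefficients, compared with the tens of thousands of multiplications needed to recompute the transform from the pixel grid.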
4. Offering similarity resilience
The resulting hashes are resilient to certain transformations, some more so than others, which helps detect attempted manipulation. Generally, hashes are more resilient to changes that retain the image’s overall structure than to changes that move pixel positions or alter large areas of pixels.
Transformations that result in similar hashes include: file format change, quality reduction, resizing, rotations and mirrors (when the additional hashes are compared), applied noise or filters, small crops and shifts, and light watermarks and logos.
5. Store to allow quick fuzzy search and retrieval
Hashes are then stored such that similar hashes are returned when the store is queried with a current hash of interest. This enables a low latency, real-time search for previous match candidates, reducing the time spent computing distances between the current and candidate images.
In fact, the fuzzy matching procedure generalises to any form of data in which exact matches on individual samples (sub-parts) indicate overall similarity.
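One well-known way to build such a store is banded (multi-index) hashing, sketched below. It is offered as an illustration of the idea that exact matches on sub-parts signal overall similarity, not as a description of Darwinium’s framework. The class and method names are hypothetical. The guarantee rests on the pigeonhole principle: if two hashes differ in fewer bits than there are bands, at least one band must be identical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

HASH_BITS = 256


class BandedHashIndex:
    """In-memory fuzzy store: hashes sharing any exact band become candidates."""

    def __init__(self, bands: int = 16) -> None:
        if HASH_BITS % bands:
            raise ValueError("bands must divide 256")
        self.bands = bands
        self.width = HASH_BITS // bands
        self.buckets: Dict[Tuple[int, int], Set[int]] = defaultdict(set)

    def _band_keys(self, h: int) -> List[Tuple[int, int]]:
        mask = (1 << self.width) - 1
        return [(b, (h >> (b * self.width)) & mask) for b in range(self.bands)]

    def add(self, h: int) -> None:
        for key in self._band_keys(h):
            self.buckets[key].add(h)

    def query(self, h: int, max_distance: int) -> List[Tuple[int, int]]:
        """Return (distance, hash) pairs within max_distance, nearest first.

        Recall is only guaranteed when max_distance < self.bands.
        """
        candidates: Set[int] = set()
        for key in self._band_keys(h):
            candidates |= self.buckets.get(key, set())
        # Exact distances are computed for the candidates only, not the whole store.
        scored = [(bin(h ^ c).count("1"), c) for c in candidates]
        return sorted(s for s in scored if s[0] <= max_distance)


stored = int("a5" * 32, 16)
far = stored ^ ((1 << HASH_BITS) - 1)  # every bit flipped
index = BandedHashIndex(bands=16)
index.add(stored)
index.add(far)

probe = stored ^ 0b111  # a new hash that differs from `stored` in 3 bits
print(index.query(probe, max_distance=10) == [(3, stored)])  # True
```

More bands tolerate larger distances but make each band shorter and less selective, so the band count trades recall against the number of candidates that need an exact distance check.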
The technique contributes positively to the principle of preserving data privacy; only the hash is stored and used to compare similarity.
6. Produce features for use in real-time decisions
When a new image is uploaded, the similarity search is performed according to one or more defined features, each with a configurable similarity %, lookback timeframe and, if needed, an optional 3rd party callout. The results form features (numbers) that feed into powerful models or standalone signals. One example is the number of similar image matches within a timeframe.
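A feature of this kind could be sketched as below. The record type, function names and default values are assumptions for illustration. In practice the history would come from the fuzzy store rather than a plain list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

HASH_BITS = 256


@dataclass(frozen=True)
class SeenImage:
    pdq_hash: int       # only the hash is kept, never the image itself
    seen_at: datetime


def similarity_pct(a: int, b: int) -> float:
    """Percentage of the 256 bit positions in which two hashes agree."""
    return 100.0 * (HASH_BITS - bin(a ^ b).count("1")) / HASH_BITS


def similar_image_count(
    current_hash: int,
    now: datetime,
    history: Iterable[SeenImage],
    min_similarity_pct: float = 90.0,
    lookback: timedelta = timedelta(days=7),
) -> int:
    """Count previously seen images that are similar enough and recent enough."""
    cutoff = now - lookback
    return sum(
        1
        for img in history
        if img.seen_at >= cutoff
        and similarity_pct(current_hash, img.pdq_hash) >= min_similarity_pct
    )


now = datetime(2024, 1, 31, 12, 0)
current = int("f0" * 32, 16)
history = [
    SeenImage(current ^ 0b111, now - timedelta(days=1)),              # similar and recent
    SeenImage(current ^ 0b111, now - timedelta(days=60)),             # similar but too old
    SeenImage(current ^ ((1 << 256) - 1), now - timedelta(hours=2)),  # recent but dissimilar
]
print(similar_image_count(current, now, history))  # 1
```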
The Result: Analyse and compare images in real-time
- Darwinium can screen images during real-time risk assessment
- Features can compute the number of similar image matches in a timeframe
- These features can be incorporated into powerful signals and models