Image Fuzzy Matching: Incorporating Images into Digital Risk Assessments

Mike Brooks

2 August 2022

Image Fuzzy Matching: The Summary

Darwinium risk assessments allow:

Ingesting and comparing an image to all those seen previously
Matching to all perceptually similar images, not just exact matches
Adjustable threshold of similarities and lookback time periods
Generating resulting features to use in signals and models

This approach can be used for:

Abuse detection: Repeated use of similar images
Authenticity: Determine if images are likely genuine or from bad actors
Safety: 3rd party lookups of known bad content

And delivers the benefits of being:

Comprehensive
Quick
Adjustable and extensible
Incorporated with overall digital risk assessment
Cost effective
Privacy preserving

Analyze and Compare Images for Similarity, in Real Time

Wouldn’t it be great to analyze images as soon as they are uploaded? To compare them against every similar image you have ever seen, matching on similarity rather than only on exact duplicates? To compare them against 3rd party stores of abusive content? And to do all of this quickly and cheaply?

Why: Abuse Prevention, Authenticity and Safety

Image screening can be made quicker and better integrated into the customer journey, while moving the assessment of the risk of images closer to the upload point.

Abuse detection: Repeated use of same or similar images
Authenticity: Determine if images are likely from genuine users or provided by bad actors
Safety: Match to 3rd party database lookup of known illicit content
Liability: Don’t allow known bad content into your digital estate, often a legal requirement

Solution: Lightweight Hashing, Similarity and Fuzzy Search Algorithms

Darwinium has ported an image transformation and similarity algorithm ('PDQ') into Rust. The resulting hashes are then passed to a generalized fuzzy storage and matching framework that retrieves all potentially similar matches on demand, in real time.

How is it Done?

1. Ingestion at point of upload, as a stream or in browser

Darwinium’s decision engine architecture, built on Rust and WebAssembly, can ingest an image and act at the point of the upload attempt, as a stream and even in the browser, before submission.

'As a stream' refers to applying transformations to subsets of the image as it is processed, making it quicker and more secure. The point-in-time memory required is also reduced: it becomes fixed and independent of image size.
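To illustrate why streaming keeps memory fixed, here is a minimal sketch. It assumes a grayscale image arrives one row at a time and is averaged into a small fixed-size grid. The names `GRID`, `stream_downsample` and `gradient_rows` are hypothetical, and the box averaging is a simplification for illustration, not the filter PDQ or Darwinium actually uses.

```python
from typing import Iterable, Iterator, List

GRID = 8  # hypothetical output size; fixed regardless of image dimensions


def stream_downsample(rows: Iterable[List[int]], width: int, height: int) -> List[List[float]]:
    """Average a grayscale image into a GRID x GRID buffer, one row at a time.

    Only the two GRID x GRID accumulators are held in memory, so the working
    set does not grow with the size of the image.
    """
    sums = [[0.0] * GRID for _ in range(GRID)]
    counts = [[0] * GRID for _ in range(GRID)]
    for y, row in enumerate(rows):
        gy = min(y * GRID // height, GRID - 1)
        for x, value in enumerate(row):
            gx = min(x * GRID // width, GRID - 1)
            sums[gy][gx] += value
            counts[gy][gx] += 1
    return [
        [sums[r][c] / counts[r][c] if counts[r][c] else 0.0 for c in range(GRID)]
        for r in range(GRID)
    ]


def gradient_rows(width: int, height: int) -> Iterator[List[int]]:
    """Stand-in for a decoder that yields pixel rows as they arrive."""
    for y in range(height):
        yield [(x + y) % 256 for x in range(width)]


# Both calls use the same amount of accumulator memory.
small = stream_downsample(gradient_rows(64, 48), 64, 48)
large = stream_downsample(gradient_rows(640, 480), 640, 480)
```

The accumulators are the same size for the 64x48 and the 640x480 input, which is the property the paragraph above describes.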

The image, metadata and properties are consumed for risk assessment and analysis.

2. Algorithm choice for image representation and comparison

The PDQ algorithm was developed and open-sourced by Facebook (now Meta) in 2019. It specifies a transformation which converts images into a binary format ('PDQ Hash') whereby 'perceptually similar' images produce similar outputs. It was designed to offer an industry standard for representing images, so that organizations can collaborate on threat mitigation.

Comparing two images reduces to computing the distance (for example, Hamming distance) between their representations, which can also be expressed as a percentage of matching bits.

(In the accompanying illustration, only 16 bits are shown for easier interpretation; actual PDQ hashes contain 256 bits.)
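The comparison can be sketched in a few lines. This is a minimal example that assumes hashes are held as Python integers; the function names and the sample bit patterns are made up for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def bit_similarity(a: int, b: int, bits: int) -> float:
    """Percentage of bit positions in which two hashes agree."""
    return 100.0 * (bits - hamming_distance(a, b)) / bits


# 16-bit toy hashes (illustrative values only)
h1 = 0b1011001110001101
h2 = 0b1011001010001111
toy_distance = hamming_distance(h1, h2)        # 2 differing bits
toy_similarity = bit_similarity(h1, h2, 16)    # 87.5

# The same functions apply to 256-bit hashes
full_a = int("f" * 64, 16)
full_b = full_a ^ 0b111                        # flip three bits
full_distance = hamming_distance(full_a, full_b)
full_similarity = bit_similarity(full_a, full_b, 256)
```

A similarity threshold of, say, 97% on a 256-bit hash therefore corresponds to allowing at most 7 differing bits.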

3. Consider additional image transformations

Additionally, PDQ hashes for rotations and mirrors of the original image can be inferred efficiently by manipulating the Discrete Cosine Transform computed in the later stages of processing.
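The property that makes this possible is that mirroring an image only flips the sign of the odd-frequency DCT-II coefficients. The sketch below checks this on a toy 8x8 image using a plain, unnormalised DCT written from scratch. The image size, helper names and test pattern are assumptions for illustration; PDQ itself works on a larger downsampled image and keeps only a subset of coefficients.

```python
import math
from typing import List

Matrix = List[List[float]]


def dct_1d(values: List[float]) -> List[float]:
    """Unnormalised DCT-II of a 1-D sequence."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(image: Matrix) -> Matrix:
    """Separable 2-D DCT-II: rows first, then columns. Result is indexed [ky][kx]."""
    row_pass = [dct_1d(row) for row in image]
    col_pass = [dct_1d(list(col)) for col in zip(*row_pass)]
    return [list(row) for row in zip(*col_pass)]


def mirror_horizontal(coeffs: Matrix) -> Matrix:
    """DCT of the left-right mirrored image: negate odd horizontal frequencies."""
    return [[c if kx % 2 == 0 else -c for kx, c in enumerate(row)] for row in coeffs]


def mirror_vertical(coeffs: Matrix) -> Matrix:
    """DCT of the top-bottom mirrored image: negate odd vertical frequencies."""
    return [[c if ky % 2 == 0 else -c for c in row] for ky, row in enumerate(coeffs)]


def rotate_180(coeffs: Matrix) -> Matrix:
    """A 180 degree rotation is both mirrors applied together."""
    return mirror_vertical(mirror_horizontal(coeffs))


def max_abs_diff(a: Matrix, b: Matrix) -> float:
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy 8x8 grayscale image (arbitrary deterministic pattern)
image = [[float((7 * x + 13 * y + x * y) % 32) for x in range(8)] for y in range(8)]
coeffs = dct_2d(image)

mirrored = [row[::-1] for row in image]
rotated = [row[::-1] for row in image[::-1]]

# Recomputing the DCT from pixels agrees with the cheap sign flips.
mirror_error = max_abs_diff(dct_2d(mirrored), mirror_horizontal(coeffs))
rotate_error = max_abs_diff(dct_2d(rotated), rotate_180(coeffs))
```

Because the sign flips cost one pass over the coefficients, the extra hashes for mirrored and rotated variants can be derived without reprocessing the pixels.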

4. Offering similarity resilience

The resulting hashes are resilient to certain transformations, some more than others, which helps detect attempts at manipulation. Generally, hashes are more resilient to changes that retain the overall structure of the image than to changes that move pixel positions or alter large areas of pixels.

Transformations that result in similar hashes include: File format change, Quality reduction, Resizing, Rotations and Mirrors (when additional hashes compared), Noise or Filter applied, Small Crops and Shifts, Light Watermarks and Logos.

5. Store to allow quick fuzzy search and retrieval

Hashes are then stored so that similar hashes are returned when the store is queried with a current hash of interest. This enables a low-latency, real-time search for previous match candidates, reducing the time needed to compute distances between the current image and candidates.

In fact, the fuzzy matching procedure generalises to any form of data in which exact matches on individual samples indicate overall similarity.
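One common way to build such a store, shown here purely as an assumption and not as a description of Darwinium's framework, is to split each hash into chunks and index every chunk exactly. Two hashes that share at least one chunk become candidates, and only candidates are compared in full. The class name `FuzzyIndex` and the choice of 16 chunks of 16 bits are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

HASH_BITS = 256
CHUNKS = 16                        # hypothetical banding choice
CHUNK_BITS = HASH_BITS // CHUNKS   # 16 bits per chunk


class FuzzyIndex:
    """Stores hashes so that near matches can be found via exact chunk lookups."""

    def __init__(self) -> None:
        self._buckets: Dict[Tuple[int, int], Set[int]] = defaultdict(set)

    @staticmethod
    def _chunks(value: int) -> List[Tuple[int, int]]:
        mask = (1 << CHUNK_BITS) - 1
        return [(i, (value >> (i * CHUNK_BITS)) & mask) for i in range(CHUNKS)]

    def add(self, value: int) -> None:
        for key in self._chunks(value):
            self._buckets[key].add(value)

    def query(self, value: int, max_distance: int) -> List[int]:
        # By the pigeonhole principle, any stored hash within
        # max_distance < CHUNKS bits shares at least one chunk exactly,
        # so no true match is missed in that range.
        candidates: Set[int] = set()
        for key in self._chunks(value):
            candidates |= self._buckets.get(key, set())
        return sorted(c for c in candidates if bin(c ^ value).count("1") <= max_distance)


base = int("0123456789abcdef" * 4, 16)        # 256-bit illustrative hash
near = base ^ 0b1011                          # 3 bits differ, all in one chunk
spread = base ^ (1 | 1 << 16 | 1 << 32)       # 3 bits differ, in three chunks
far = base ^ ((1 << HASH_BITS) - 1)           # every bit differs

index = FuzzyIndex()
for stored in (near, spread, far):
    index.add(stored)

matches = index.query(base, max_distance=10)
```

The expensive full-distance comparison runs only on hashes that already share a chunk with the query, which is what keeps lookups fast as the store grows.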

The technique contributes positively to the principle of preserving data privacy; only the hash is stored and used to compare similarity.

6. Produce features for use in real-time decisions

When a new image is uploaded, the similarity search is performed according to one or more defined features, each with a configurable similarity %, lookback timeframe and, optionally, a 3rd party callout. The results form features (numbers) that feed into powerful models or standalone signals. Some examples include:

Feature	Signal
Number of Exact Image Matches this Week	Image uploaded this week
Number of Images 97% Similar Today	Image is 97% similar to 3+ today
Number of Images 99% Similar from this Device Ever	Device uploaded very similar image before
Number of Images 95% Similar from this IP Address in last 10 Minutes	Potential spam from this IP address
Match Result to 3rd Party Database	Image matched to a Threat Exchange
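The features in the table can be sketched as a single counting function with a similarity threshold, an optional lookback window and an optional device filter. This is an illustrative sketch only: the `SeenImage` record, the `count_similar` function and the sample history are hypothetical, and a linear scan stands in for the fuzzy store described earlier.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional

HASH_BITS = 256
DAY = 86_400  # seconds


@dataclass(frozen=True)
class SeenImage:
    pdq_hash: int
    timestamp: float   # seconds since epoch
    device_id: str


def count_similar(
    current_hash: int,
    now: float,
    history: Iterable[SeenImage],
    min_similarity_pct: float,
    lookback_seconds: Optional[float] = None,   # None means "ever"
    device_id: Optional[str] = None,            # None means "any device"
) -> int:
    """Count previously seen images at least min_similarity_pct similar."""
    max_distance = math.floor(HASH_BITS * (100 - min_similarity_pct) / 100)
    total = 0
    for seen in history:
        if lookback_seconds is not None and now - seen.timestamp > lookback_seconds:
            continue
        if device_id is not None and seen.device_id != device_id:
            continue
        if bin(seen.pdq_hash ^ current_hash).count("1") <= max_distance:
            total += 1
    return total


now = 1_000_000.0
base = int("a5" * 32, 16)  # illustrative 256-bit hash
history = [
    SeenImage(base, now - 60, "dev-1"),               # exact match, 1 minute ago
    SeenImage(base ^ 0b11, now - 3_600, "dev-2"),     # 2 bits differ, 1 hour ago
    SeenImage(base ^ 0xFF, now - 8 * DAY, "dev-1"),   # 8 bits differ, 8 days ago
]

exact_this_week = count_similar(base, now, history, 100, 7 * DAY)
similar_97_today = count_similar(base, now, history, 97, DAY)
similar_95_device_ever = count_similar(base, now, history, 95, None, "dev-1")
```

A signal such as "Image is 97% similar to 3+ today" would then be a simple threshold on `similar_97_today`.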

The Result: Analyze and Compare Images in Real Time

Darwinium can screen images during real-time risk assessment
Features can compute number of similar image matches in a timeframe
These features can be incorporated into powerful signals and models