If you have a lot of image data to manage, then you know: identifying and avoiding duplicate images is the key to maintain the integrity of your image collection. Depending on which detection technique you choose, this can be error-prone or not applicable to large volumes of image data.
So, what is the best technique for detecting duplicate images? It always depends on your image collection and your requirements. How large is your collection? Do you want to detect exact duplicates only or also near-duplicates? Can the detection run in background or must it work in real-time?
Today, we’re going to show you five techniques to detect duplicate images, from simple to sophisticated. We hope this will help you to find the best approach for your image collection.
1. File Name
Only works if you have the naming scheme of the files under control.
The comparison of file names is obviously the easiest way to find duplicate images, but it can quickly become useless. Different images may have the same file name, and identical images in different folders may have different file names. Therefore, it’s important to control the naming scheme of your files if you want to use this simplest type of duplicate detection.
2. File Hash
Can handle file identities very well, but the files must be binary equal.
A file hash is a fingerprint to identify files that have the identical binary content. Obviously, the file hash is more reliable than just the file name to detect duplicates because it represents the binary content of a file. Creating and comparing file hashes is very fast, therefore this technique can be easily applied to large image collections.
However, it cannot deal with any file modifications. In fact only a single changed bit in a file results in a different file hash. Image files with same pixel data do not have the same binary content when they are encoded in JPG or PNG format. Additionally, any differences to embedded metadata like EXIF or IPTC leads to a different file hash.
3. Perceptual Hash
Good for finding exact duplicates or duplicates with tiny changes.
A perceptual hash tries to overcome the limitations of file hashes. Perceptual hashes are based on the pixel data and not their binary representation. While file hashing just can tell if files are identical or not, perceptual hashes can handle different file formats and file sizes. It’s fast to compute and lookup is as fast as with a file hash.
The possibility to calculate a distance between two perceptual hashes allows to detect not only identical images, but also close matches with tiny changes. Small differences in hashes reflect small differences in image content.
1. Microsoft Azure Machine Learning x Udacity — Lesson 4 Notes
2. Fundamentals of AI, ML and Deep Learning for Product Managers
3. Roadmap to Data Science
4. Work on Artificial Intelligence Projects
However, the problem with perceptual hashing is that it can produce many false positive hits (images falsely recognized as duplicates). Perceptual hashes take neither image details nor the semantic meaning of an image into account. This can lead to similar looking images with completely different content being evaluated as duplicates.
4. Image Embedding
Highly reliable in finding exact duplicates and near-duplicates with adjustable detection sensitivity.
Nowadays, deep learning techniques can produce an embedding from pixel data that can be used to identify duplicates just like a human being would look at images. An image can be detected as a duplicate even if it has another image size, file type or other modifications to its appearance (like brightness, gamma, saturation etc). Furthermore, the semantic content of the image can be considered to overcome the limitations of perceptual hashes.
For example, if you have an image of a red balloon and you search for duplicates using perceptual hashes all types of somehow red in the middle images (tomato, red ball, strawberry) may be detected as duplicates. The deep learning embedding will stay in the context of balloons.
Using embeddings as representation of images allows you to detect near-duplicate images and to control the detection sensitivity. This is why our duplicate detection uses this technique.
5. Interest Points
Excellent for finding near-duplicates and parts of images, but not suitable in real-time operation.
All detection techniques mentioned above calculate one fingerprint to represent the complete image. But if you want to find images where a part of an image is used, the interest point technique is what you need.
Instead of creating a single embedding representing the complete image, this technique identifies interest points (significant regions of an image like corners), and creates an embedding for each region. When searching for duplicates the embeddings of all interest points and their relative positions are compared to another image. If a certain amount of embeddings and their relative positions occur in two images, they are considered duplicates. This allows to identify near-duplicates as well as images that contain only some cropped part of the search image.
Since an image is represented by hundreds of embeddings and they all have to be compared, this technique is several orders of magnitude slower than using a single embedding to represent an image.
Therefore, this technique can be applied for background processing tasks that run overnight, rather than real-time use cases where you try to detect duplicates before they enter your system.
Regardless of the technique you use, it is necessary to take action after duplicate content is detected. Depending on the type of your image collection and the image management platform you use or provide, different strategies can be applied and may make sense for you or your users. Here are some examples from our practical experience:
- Reject: As soon as recognizable duplicates are to be included in the collection, prevent this by rejecting the upload.
- Associate: Automatic linking of an image duplicate with the other image version.
- Merge: Merging two duplicates into one can be useful to store all information from both images in one place.
- Delete: To clean up your image collection and keep it reliable, it may be necessary to simply delete found duplicates.