Sunday, March 7, 2021
NikolaNews

What Is The Best Technique to Detect Duplicate Images? | by pixolution | Sep, 2020

September 17, 2020
in Neural Networks
Photo by Malte Wingen on Unsplash

If you have a lot of image data to manage, then you know: identifying and avoiding duplicate images is key to maintaining the integrity of your image collection. Depending on which detection technique you choose, this can be error-prone or impractical for large volumes of image data.

So, what is the best technique for detecting duplicate images? It always depends on your image collection and your requirements. How large is your collection? Do you want to detect exact duplicates only, or near-duplicates as well? Can the detection run in the background, or must it work in real time?

Today, we’re going to show you five techniques to detect duplicate images, from simple to sophisticated. We hope this will help you to find the best approach for your image collection.

1. File Name

Only works if you have the naming scheme of the files under control.

The comparison of file names is obviously the easiest way to find duplicate images, but it can quickly become useless. Different images may have the same file name, and identical images in different folders may have different file names. Therefore, it’s important to control the naming scheme of your files if you want to use this simplest type of duplicate detection.
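As a minimal sketch of why file names alone are unreliable, the snippet below groups paths by base name. The function name `group_by_file_name` and the example paths are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_file_name(paths):
    """Group paths by lowercased base name and keep names that occur more than once."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).name.lower()].append(path)
    return {name: found for name, found in groups.items() if len(found) > 1}


# Hypothetical collection used only for illustration.
paths = [
    "holiday/IMG_0001.jpg",
    "work/IMG_0001.jpg",        # same name, but possibly a different picture
    "backup/beach_sunset.jpg",  # could be a renamed copy, which this check cannot see
]
candidates = group_by_file_name(paths)
```

The two `IMG_0001.jpg` files are reported as candidates even though nothing is known about their content, while a renamed copy goes unnoticed.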

2. File Hash

Identifies identical files very well, but the files must be binary identical.

A file hash is a fingerprint that identifies files with identical binary content. The file hash is more reliable than the file name for detecting duplicates because it represents the binary content of a file. Creating and comparing file hashes is very fast, so this technique can easily be applied to large image collections.

However, it cannot deal with any file modification. A single changed bit in a file results in a different file hash. Two image files showing the same picture do not have the same binary content when one is encoded as JPG and the other as PNG. Additionally, any difference in embedded metadata such as EXIF or IPTC leads to a different file hash.
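The file hash approach can be sketched with Python's standard library. SHA-256 is used here as an assumption, since the article does not name a hash function, and the helper names `file_digest` and `find_binary_duplicates` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_binary_duplicates(folder):
    """Map each digest to the files sharing it, keeping only digests with several files."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```

Calling `find_binary_duplicates("my_images")` on a folder of your own returns groups of byte-identical files. A file that differs by even one byte ends up in a different group.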

3. Perceptual Hash

Good for finding exact duplicates or duplicates with tiny changes.

A perceptual hash tries to overcome the limitations of file hashes. Perceptual hashes are based on the pixel data and not on its binary representation. While a file hash can only tell whether files are identical or not, perceptual hashes can handle different file formats and file sizes. They are fast to compute, and lookup is as fast as with a file hash.

Because a distance can be calculated between two perceptual hashes, it is possible to detect not only identical images but also close matches with tiny changes. Small differences between hashes reflect small differences in image content.
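To illustrate the distance idea, here is a minimal sketch of one simple perceptual hash, the average hash, compared by Hamming distance. It is not necessarily the variant any particular product uses. The input is assumed to be a small grayscale grid that an imaging library has already downscaled, and the three example grids are hypothetical.

```python
def average_hash(gray):
    """Compute an average hash: one bit per pixel, set when the pixel is brighter than the mean.

    `gray` is assumed to be a small grayscale grid (for example 8x8) with values 0-255.
    """
    pixels = [value for row in gray for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical 8x8 grids: a left-to-right gradient, a slightly brightened copy, and its mirror image.
original = [[col * 32 for col in range(8)] for _ in range(8)]
brighter = [[min(255, value + 10) for value in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

close = hamming_distance(average_hash(original), average_hash(brighter))
far = hamming_distance(average_hash(original), average_hash(mirrored))
```

A small distance (`close`) suggests a near-identical image, while a large one (`far`) suggests different content. Where to draw the line is a tuning decision for your collection.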

However, the problem with perceptual hashing is that it can produce many false positive hits (images falsely recognized as duplicates). Perceptual hashes take neither image details nor the semantic meaning of an image into account. This can lead to similar looking images with completely different content being evaluated as duplicates.

Similar images but no duplicates

4. Image Embedding

Highly reliable in finding exact duplicates and near-duplicates with adjustable detection sensitivity.

Nowadays, deep learning techniques can produce an embedding from pixel data that can be used to identify duplicates much as a human being would when looking at the images. An image can be detected as a duplicate even if it has a different image size or file type, or if its appearance has been modified (brightness, gamma, saturation, etc.). Furthermore, the semantic content of the image can be taken into account to overcome the limitations of perceptual hashes.

For example, if you have an image of a red balloon and you search for duplicates using perceptual hashes, all kinds of images that are roughly red in the middle (tomato, red ball, strawberry) may be detected as duplicates. The deep learning embedding will stay in the context of balloons.

Using embeddings as the representation of images allows you to detect near-duplicate images and to control the detection sensitivity. This is why our duplicate detection uses this technique.
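The adjustable sensitivity can be sketched as a similarity threshold on embeddings. This is an illustration under assumptions, not the authors' implementation: cosine similarity is one common choice, the embeddings are assumed to come from some deep learning model that is not shown, and the 4-dimensional vectors, image names and threshold values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(query, collection, threshold=0.95):
    """Return (image_id, similarity) pairs at or above the threshold, best match first."""
    hits = [(image_id, cosine_similarity(query, emb)) for image_id, emb in collection.items()]
    return sorted((hit for hit in hits if hit[1] >= threshold), key=lambda hit: hit[1], reverse=True)


# Hypothetical embeddings; real models typically produce hundreds of dimensions.
collection = {
    "balloon_resized": [0.90, 0.10, 0.05, 0.40],
    "balloon_other_angle": [0.70, 0.30, 0.20, 0.50],
    "tomato": [0.10, 0.90, 0.30, 0.05],
}
query = [0.88, 0.12, 0.06, 0.41]

strict = find_duplicates(query, collection, threshold=0.99)  # exact duplicates only
loose = find_duplicates(query, collection, threshold=0.90)   # near-duplicates as well
```

Raising the threshold narrows the result to exact duplicates, and lowering it admits near-duplicates. In this toy data the tomato stays out in both cases.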

5. Interest Points

Excellent for finding near-duplicates and parts of images, but not suitable for real-time operation.

All detection techniques mentioned above calculate one fingerprint to represent the complete image. But if you want to find images where a part of an image is used, the interest point technique is what you need.

Instead of creating a single embedding representing the complete image, this technique identifies interest points (significant regions of an image, such as corners) and creates an embedding for each region. When searching for duplicates, the embeddings of all interest points and their relative positions are compared with those of another image. If a certain number of embeddings and their relative positions occur in both images, the images are considered duplicates. This makes it possible to identify near-duplicates as well as images that contain only a cropped part of the search image.

Lukas Mach at Wikipedia

Since an image is represented by hundreds of embeddings and they all have to be compared, this technique is several orders of magnitude slower than using a single embedding to represent an image.

Therefore, this technique is better suited to background processing tasks that run overnight than to real-time use cases where you try to detect duplicates before they enter your system.
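The matching step described above can be sketched as follows. It assumes the per-region embeddings (descriptors) have already been extracted by some interest point detector that is not shown. For brevity it compares descriptors only and leaves out the check of relative positions. The function names, the thresholds `max_distance`, `ratio` and `min_matches`, and the tiny 2-dimensional descriptors are all hypothetical.

```python
import math


def match_descriptors(desc_a, desc_b, max_distance=1.0, ratio=0.75):
    """Match each descriptor of image A to its nearest neighbour in image B.

    A match is kept only if the nearest neighbour is close enough and clearly
    closer than the second nearest (ratio test), which filters ambiguous matches.
    Every descriptor in A is compared with every descriptor in B.
    """
    matches = []
    for i, a in enumerate(desc_a):
        ranked = sorted((math.dist(a, b), j) for j, b in enumerate(desc_b))
        if len(ranked) < 2:
            continue
        best, second = ranked[0], ranked[1]
        if best[0] <= max_distance and best[0] < ratio * second[0]:
            matches.append((i, best[1]))
    return matches


def is_duplicate(desc_a, desc_b, min_matches=3):
    """Treat two images as duplicates when enough descriptors match."""
    return len(match_descriptors(desc_a, desc_b)) >= min_matches


# Hypothetical descriptors; real ones have many more dimensions and number in the hundreds per image.
full_image = [[0, 0], [10, 0], [0, 10], [10, 10], [5, 5]]
cropped = [[0.1, 0], [10, 0.2], [5.1, 5]]      # part of full_image, slightly perturbed
unrelated = [[20, 20], [21, 20], [20, 21]]

crop_matches = match_descriptors(cropped, full_image)
```

The nested comparison of every descriptor in one image with every descriptor in the other is what makes this approach so much more expensive than comparing a single embedding per image.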

Regardless of the technique you use, it is necessary to take action after duplicate content is detected. Depending on the type of your image collection and the image management platform you use or provide, different strategies can be applied and may make sense for you or your users. Here are some examples from our practical experience:

  • Reject: When an upload is recognized as a duplicate of an image already in the collection, reject it.
  • Associate: Automatically link a duplicate with the other version of the image.
  • Merge: Merging two duplicates into one can be useful to store all information from both images in one place.
  • Delete: To clean up your image collection and keep it reliable, it may be necessary to simply delete the duplicates you find.

Credit: BecomingHuman By: pixolution

© 2019 NikolaNews.com - Global Tech Updates
