Use Perceptual Hashes / Fingerprinting to detect Similar Duplicate Files


raretrick

A perceptually similar file search would be a useful addition alongside highlighting files with the same or similar names. For videos in particular, a perceptual hash (phash) is the be-all and end-all for finding duplicates.

This means the files don’t need to be identical and will be identified even with different bitrates, resolutions, and introductions/credits.

To achieve this, UltraSearch (US) needs to generate what's called a phash, or perceptual hash. US will generate a set of 25* images from fixed points in the video file. These images will be stitched together and then hashed using the phash algorithm. The phash can then be used to find videos that are the same as, or similar to, others in the user's offline US phash database. Phash generation can be run during the first US scan or as a separate task. Generation can take a while because of the work involved in extracting screenshots. Depending on the number of video files found, the database could take up considerable space, or very little.
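The steps described above (stitch sampled frames into one image, hash it, compare hashes against a database) can be sketched roughly as follows. This is only a minimal pure-Python illustration under stated assumptions, not how UltraSearch or TreeSize work: frame extraction is left out entirely, frames are assumed to be already-decoded grayscale grids of equal size, the DCT-based hash is one common phash variant, and the names `stitch`, `shrink`, `phash`, `hamming`, `find_similar` and the `max_distance=8` threshold are all hypothetical.

```python
import math
from statistics import median


def stitch(frames, columns=5):
    """Tile equally sized grayscale frames into one sprite, row by row.

    Assumes len(frames) is a multiple of `columns` (e.g. 25 frames as 5 x 5).
    """
    sprite = []
    for start in range(0, len(frames), columns):
        band = frames[start:start + columns]
        for y in range(len(band[0])):
            sprite.append([px for frame in band for px in frame[y]])
    return sprite


def shrink(img, size=32):
    """Downscale to size x size by averaging the pixels inside each cell."""
    h, w = len(img), len(img[0])
    ys = [i * h // size for i in range(size + 1)]
    xs = [j * w // size for j in range(size + 1)]
    out = []
    for i in range(size):
        y0, y1 = ys[i], max(ys[i + 1], ys[i] + 1)
        row = []
        for j in range(size):
            x0, x1 = xs[j], max(xs[j + 1], xs[j] + 1)
            cell = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def phash(img, size=32, keep=8):
    """64-bit hash from the keep x keep lowest-frequency DCT coefficients."""
    small = shrink(img, size)
    cos = [[math.cos(math.pi * (2 * x + 1) * u / (2 * size))
            for x in range(size)] for u in range(keep)]
    coeffs = []
    for u in range(keep):
        for v in range(keep):
            total = 0.0
            for y in range(size):
                for x in range(size):
                    total += small[y][x] * cos[u][y] * cos[v][x]
            coeffs.append(total)
    # Threshold against the median, ignoring the DC (overall brightness) term.
    med = median(coeffs[1:])
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > med)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def find_similar(database, query_hash, max_distance=8):
    """Return (path, distance) pairs within max_distance, closest first."""
    hits = [(path, hamming(h, query_hash)) for path, h in database.items()]
    return sorted((hit for hit in hits if hit[1] <= max_distance),
                  key=lambda hit: hit[1])
```

In this sketch a small Hamming distance means "probably the same video"; the threshold would need tuning on real data, and a real tool would decode the frames with a video library and keep the hashes in an on-disk index rather than a dictionary.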

Team TreeSize

Merged with: Duplicate image search

D H

I'd like to see your implementation of a duplicate image search algorithm, i.e., one that finds images that have been resized, up/downscaled, rotated, cropped slightly, format converted, etc. I know of two tools: one hasn't been updated in many years (ImageDupeless) and the other is buggy (Ashisoft Duplicate Photos Finder).


Kevin Freels

With AI being tossed into everything except my clothes dryer (at least so far), it won't be long before search expectations ramp up to the next level.... That level will be the ability to find "similar" files.

For example, having spent the last 36 years as a photographer / PC & network tech / home flipper / mortgage banker / ISP operator / web designer / artist / ... OK, I'll stop. The number of duplicate, but not really duplicate, files I have strung out over multiple systems, devices, and clouds is beyond absurd. The problem is that they often have not just different filenames, but are versioned in various formats and with slight differences, so that to the standard "image search" algorithms they appear to be totally separate images. The same happens with documents, where a letter may be written and later simply edited for re-use, then edited again and again, with only minor changes.
Now, having bought my first Windows 3.1 machine in 1993, I have 30 YEARS of this stuff that's piled up and is just occupying space. It's impossible to clean it all up because it never fails that the 15-year-old file that was "lost" and never needed WILL be needed within 30 days of deletion. lol.

With photo files this is worse than the documents. Same with logos and other web image files for websites.

And AI is already being worked into some new tools on the horizon. I read an article with a list of them recently, but the only one that comes to mind at the moment is Rewind.AI.

I have a feeling that AI-enabled search is about to radically change the game in a number of ways and will hopefully be able to solve my issue, and I'm sure I'm not alone. And I'm certain that MS is in the process of working it into Explorer as well.

So to remain relevant in this new search world we're about to enter, it's going to require you to once again differentiate yourself.

I've been working with these AI tools for nearly two years. Taking what I know and applying it to the search market, I think it will be necessary to include, at the very least, this type of functionality just to remain relevant.

Of course, it can also be seen as an opportunity to dominate the market. To do that, you'll need to take it all a step further. I haven't put a whole lot of time and thought into it yet, but I imagine that the top dogs will have features that allow for continual refinement of searches, with the ability to control how strict the search is using natural language. The refinement process would allow for things like "Exclude from the current results any pics where the trees don't have leaves". The search wouldn't be a fixed set of results, but instead a flexible result that allows for adjusting the parameters slightly and watching in real time as the number of results fluctuates.

It would also likely call for a way to save various searches and reference them in full or in part later. Doing things like "Do this search and cross reference with last week's search for X".

Anyway, I'm sorry this was so long. I didn't plan it that way; it just happened. lol. I don't see any way that this isn't going to be the route that search takes in the future, and that presents a unique opportunity for you if you decide to take it. :-)


Team TreeSize

Merged with: Similar file-matching... not soon but something to work towards.

Team TreeSize

Thank you for submitting your feature request. There are currently no plans to add a Duplicate File search to UltraSearch, but we already have one in TreeSize, and there are already ideas of how to find similar files. AI is not necessarily the solution but just one option. I will therefore merge your request with a similar one in the feature voting board of TreeSize.


Team UltraSearch

Post moved to this board

Team UltraSearch

Thank you for submitting this feature request. We think it is a better fit for our products TreeSize and SpaceObServer, which both already contain a similar / duplicate file search. In the end, this would be one more option to detect duplicate files. I will therefore move this entry from UltraSearch to TreeSize.