[ 652713 ] group together same files in search
Now, that's a really useful feature, don't you think?
It would make things a LOT better, because you could easily see how many users have the same file (more chances of getting the file faster).
Grouped by file size, extension and maybe some name string comparison.
Like Kazaa basically.
-
- Posts: 147
- Joined: 2003-01-04 02:20
- Location: Canada http://hub-link.sf.net
- Contact:
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
Once again we come back to hashes
The proper way to group files together is to ensure that their hashes are the same. I think any grouping before that is just guesswork, and will lead to more confusion about "rollback inconsistency" errors - "They were grouped, why won't it resume to the partial I've already downloaded?!?!!"
Of course, there's no easy way to return hashes for a given file using the $SR in the current DC protocol.
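To make the protocol point concrete, here is a toy parse of an NMDC $SR result line. The field layout is my reading of the protocol and the sample values are invented, but the shape shows the problem: every field is spoken for, and there is no slot to carry a hash.

```python
# Illustrative sketch (not DC++ code): the NMDC $SR result line has fixed
# fields and nowhere to put a hash. Field layout per my understanding of
# the protocol; sample values are made up.
def parse_sr(line: str) -> dict:
    """Split a $SR search result into its parts."""
    assert line.startswith("$SR ")
    body = line[len("$SR "):]
    nick, rest = body.split(" ", 1)
    # \x05 separates the file path from "<size> <free>/<total>"
    path, rest = rest.split("\x05", 1)
    size_slots, hub = rest.split("\x05", 1)
    size, slots = size_slots.split(" ", 1)
    free, total = slots.split("/")
    return {"nick": nick, "path": path, "size": int(size),
            "free_slots": int(free), "total_slots": int(total),
            "hub": hub}

sample = "$SR SomeUser files\\movie.avi\x05734003200 3/5\x05SomeHub (10.0.0.1:411)"
result = parse_sr(sample)
# Every field is accounted for - extending $SR would be needed to add a hash.
```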
bueller?
Tr0n wrote: Hm, hash is hard to calculate? Why? I didn't check the source code, but it sounds weird.

That's not quite what I said, Tr0n. I said there's no way to return a hash value with search results, so that you can know for sure that two files are the same. DC++ doesn't have any hashing now, though BlackClaw is working on it. So in one sense, it is hard to calculate.
About hashing.
Hashing takes a LOT of CPU time on big shares.
Need proof? Get the XS client - it's a filesharing community that has hashing.
You have to wait forever for it to hash everything.
It has to read EVERY file you share to compute the hashes.
Yes, it is a great feature, but building the file list would take an eternity.
Also, the file list size would likely double if the hashes were stored there.
And the only time it is useful is on small files; on larger files it is pretty obvious which are which.
There should be a limit to how big a file can be to get hashed - maybe 20 MB?
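The cost complaint above is real in one sense: computing a file hash means reading every byte of every shared file. A minimal sketch of what that work looks like (my own illustrative code, not DC++'s; SHA-1 stands in for whatever digest a client would use, and the sleep-based throttle is a crude stand-in for thread priorities):

```python
import hashlib
import time

# Sketch of why hashing a share is I/O- and CPU-heavy: every chunk of every
# file must be read and fed to the digest. The optional sleep is a crude
# throttle so hashing doesn't starve the rest of the client; a real client
# would use thread priorities. All names here are invented for illustration.
def hash_share(paths, chunk_size=64 * 1024, throttle_s=0.0):
    digests = {}
    for path in paths:
        h = hashlib.sha1()              # stand-in digest
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)         # every chunk costs CPU time
                if throttle_s:
                    time.sleep(throttle_s)  # yield so the UI stays responsive
        digests[path] = h.hexdigest()
    return digests
```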
Melkor wrote: About hashing. Hashing takes a LOT of CPU time on big shares. Need proof? Get the XS client.

XS? Ah, http://xs.tech.nu/. Well, eMule, Shareaza, and Kazaa all use various types of hashes. Can you generalize hash performance on them too? (Of course it takes CPU time, but that can be handled by priorities and some throttling.)

Melkor wrote: It's a filesharing community that has hashing. You have to wait forever for it to hash everything. It has to read EVERY file you share to compute the hashes. Yes, it is a great feature, but building the file list would take an eternity. Also, the file list size would likely double if the hashes were stored there.

Well, DC++ does not have to block on completion of hashes. The DC protocol doesn't need them, so the extra functionality enabled by hashes can be disabled until they're available. If file hashes (root TTH) are stored in the file list, each adds another 24 bytes per file. The full tree takes more, but that only has to be exchanged with clients who are downloading the file from you.

Melkor wrote: And the only time it is useful is on small files; on larger files it is pretty obvious which are which. There should be a limit to how big a file can be to get hashed - maybe 20 MB?

Hashing is useful on files of all sizes. With larger files, incremental verification is even more useful, because you can figure out which segment of the file is corrupt, and which section to re-get.

I hope this helps clear up your misconceptions.
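The incremental-verification idea can be sketched: keep a hash per segment, then compare segment hashes of the received data to find exactly which piece to re-fetch. Code, names, and the tiny segment size are illustrative only; a real client would use TTH leaf hashes over much larger blocks.

```python
import hashlib

# Illustrative sketch of incremental verification: with a per-segment hash
# list you can pinpoint which segment of a download is corrupt and re-get
# only that piece. Segment size is tiny so the example is easy to follow.
SEGMENT = 4

def segment_hashes(data: bytes, seg=SEGMENT):
    return [hashlib.sha1(data[i:i + seg]).digest()
            for i in range(0, len(data), seg)]

def find_bad_segments(received: bytes, expected_hashes, seg=SEGMENT):
    got = segment_hashes(received, seg)
    return [i for i, (g, e) in enumerate(zip(got, expected_hashes)) if g != e]

original = b"AAAABBBBCCCCDDDD"
expected = segment_hashes(original)
corrupted = b"AAAABXBBCCCCDDDD"   # one byte flipped in segment 1
# find_bad_segments(corrupted, expected) → [1]
```

Only segment 1 needs to be downloaded again; the other three are known good.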
A hash is essentially an improved version of a CRC, so you can simply group by hash. Two different files should never produce the same hash (given a good hash function); that would be a collision, which a good hash function must avoid.
For clarification, I'm talking about running the contents of a file through a function to hash them (like Tiger), and arriving at a string/number that uniquely identifies the file. Tiger Tree Hash (TTH) and SHA1 are two that are good and would both serve their purposes in DC++.
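For illustration only, here is the Merkle-tree shape a Tiger Tree Hash uses, with SHA-1 standing in for Tiger (Python's hashlib has no Tiger). The 0x00/0x01 leaf/internal-node prefixes follow the THEX convention; everything else is a toy.

```python
import hashlib

# Toy TTH-style Merkle tree. SHA-1 stands in for Tiger; the 0x00 (leaf)
# and 0x01 (internal node) prefixes mimic the THEX convention. Two files
# with identical contents always produce identical roots, whatever their
# names - which is exactly what makes hash-based grouping reliable.
LEAF = b"\x00"
NODE = b"\x01"
BLOCK = 1024  # real TTH also hashes 1024-byte leaves

def tree_root(data: bytes) -> bytes:
    leaves = [hashlib.sha1(LEAF + data[i:i + BLOCK]).digest()
              for i in range(0, len(data), BLOCK)] or [hashlib.sha1(LEAF).digest()]
    level = leaves
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha1(NODE + level[i] + level[i + 1]).digest())
        if len(level) % 2:            # odd node is promoted unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]  # 20-byte root identifies the file's contents
```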
Re: Once again we come back to hashes
GargoyleMT wrote: The proper way to group files together is to ensure that their hashes are the same. I think any grouping before that is just guesswork, and will lead to more confusion about "rollback inconsistency" errors - "They were grouped, why won't it resume to the partial I've already downloaded?!?!!"

And what does "Search for Alternate Locations" do?
Re: Once again we come back to hashes
mai9 wrote: [snip] And what does "Search for Alternate Locations" do?

It searches for other files with a... similar size and with the same filename (well, almost) as the file searched for.
Sarf
---
Confidence: a feeling peculiar to the stage just before full comprehension of the problem.
-
- Forum Moderator
- Posts: 1420
- Joined: 2003-04-22 14:37
-
- Forum Moderator
- Posts: 587
- Joined: 2003-05-07 02:38
- Location: Sweden, Linkoping
Re: Once again we come back to hashes
mai9 wrote: And what does "Search for Alternate Locations" do?

You have a point, but I think that if files are grouped together, users will understand even less why one of them will not resume to an existing download, should they not be compatible.

ParanoidOne: the number in the topic is the RFE # on the SourceForge tracker. Witness the glory.
Re: Once again we come back to hashes
GargoyleMT wrote: ParanoidOne: the number in the topic is the RFE # on the SourceForge tracker. Witness the glory.

Glory witnessed. Duly noted.
Re: Once again we come back to hashes
sarf wrote: It searches for other files with a... similar size and with the same filename (well, almost) as the file searched for.

Similar? Why not exact size?

If we can't be sure that two files with exact filenames and sizes are the same file, how come DC++ decides to continue downloading from another file which is 'similar' but clearly not the same?

Now that I think of it, I believe I ran into this 'similarity' problem when downloading mp3s - half from one rip, half from another.

I understand that this is the reason this function is optional, but can I ask for an exact search for alternatives?
Re: Once again we come back to hashes
mai9 wrote: [snip] Similar? Why not exact size?

Because there is currently no way of searching for files with exactly the same size.

mai9 wrote: If we can't be sure that two files with exact filenames and sizes are the same file, how come DC++ decides to continue downloading from another file which is 'similar' but clearly not the same?

DC++ does the search like this: if the file is below a certain number of bytes, it searches for files with "at most" that many bytes; if it is beyond (or equal to) that number, it searches for files with "at least" that many bytes. Or somesuch. That is why I said "similar". DC++ only adds alternate sources whose file size is EXACTLY the same as the queued file's, though, so it should (theoretically) be "safe".

mai9 wrote: [snip] I understand that this is the reason this function is optional, but can I ask for an exact search for alternatives?

Sure. People have asked for the moon, too, but it's still puttering around old Tellus. Any change made to the way searches work would only be effective on clients which supported it - there have been several discussions about how the current search could be extended to include exact filesize searches (in fact, I believe some clients use this). For more information, consult your pineal gland (and the Search button).
Sarf
---
Oh, drat these computers. They are so naughty and complex. I could just pinch them.
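The "at most / at least" behaviour sarf describes can be sketched as two stages: what the search request can express on the wire, and the exactness check applied client-side when results come back. The threshold value and all names here are invented for illustration; the real cutoff lives in the client code.

```python
# Sketch of the two-stage filtering sarf describes. THRESHOLD is invented;
# the point is only that the wire protocol expresses >= or <=, never ==,
# so exactness has to be enforced client-side on the returned results.
THRESHOLD = 64 * 1024 * 1024  # made-up cutoff between the two modes

def size_filter(target_size: int):
    """What the search request can actually express on the wire."""
    if target_size < THRESHOLD:
        return lambda candidate: candidate <= target_size   # "at most"
    return lambda candidate: candidate >= target_size       # "at least"

def accept_as_alternate(queued_size: int, result_size: int) -> bool:
    """Client-side check: only exact size matches become alternate sources."""
    return result_size == queued_size
```

A file of exactly the target size passes either wire-level filter, which is why the client can still recover exact matches from the looser search.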
Once again we come back to sizes
sarf wrote: DC++ only adds alternate files that have a filesize that is EXACTLY the same as the file it searches for, though, so it should (theoretically) be "safe".

Oh, I understood that the option "Search for alternate locations" added files with the same filename but a similar size, but now I understand it only adds files with the same filename and size.

(I am posting again to make sure that this time I got it right.)
Re: Once again we come back to sizes
mai9 wrote: Oh, I understood that the option "Search for alternate locations" added files with the same filename but a similar size, but now I understand it only adds files with the same filename and size.

True, that's how I understand it. A manual "Search for Alternates" will search for files whose names contain the same substrings as the queued file's name, and that are at least the same size. Only files that match size-wise and contain the same substrings will be added as sources... So if someone renames a file so that its name has more words, like tagging an AVI with some group's name, it will still match in an alternate search. However, if they correct a misspelling or remove one of the words, it will no longer be found.
However, if we had searching by hashes, you could find the same file regardless of filename, and you could group them with absolute confidence.
(or something... I'm tired, and didn't verify the code just to make sure I'm right.)
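The matching behaviour described above (every word of the queued name must appear in the candidate, sizes exactly equal) can be sketched like so. The word-splitting details are my guess at the spirit of it, not DC++'s actual code; a hash comparison, if available, would replace both checks.

```python
# Illustrative sketch of substring-plus-size alternate matching. Splitting
# on dots/underscores and lowercasing is my own approximation, not DC++'s
# real matching code. A hash comparison would make both checks unnecessary.
def words(name: str) -> set:
    return {w for w in name.lower().replace(".", " ").replace("_", " ").split() if w}

def matches_alternate(queued_name: str, queued_size: int,
                      cand_name: str, cand_size: int) -> bool:
    # Exact size, and every word of the queued name present in the candidate.
    return cand_size == queued_size and words(queued_name) <= words(cand_name)

# Renaming that *adds* words still matches...
# matches_alternate("movie.avi", 700, "movie.GROUP.avi", 700) → True
# ...but fixing a misspelling (removing a word) breaks the match:
# matches_alternate("teh.movie.avi", 700, "the.movie.avi", 700) → False
```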