[ 652713 ] group together same files in search

Archived discussion about features (predating the use of Bugzilla as a bug and feature tracker)

Moderator: Moderators

Locked
Tr0n
Posts: 16
Joined: 2003-04-30 02:22

[ 652713 ] group together same files in search

Post by Tr0n » 2003-04-30 12:34

Now, that's a really useful feature, don't you think?

It would make things a LOT better, because you could easily see how many users have the same file (more chances of getting the file faster).

Grouped by file size, extension and maybe some name string comparison.

Like Kazaa basically.

HaArD
Posts: 147
Joined: 2003-01-04 02:20
Location: Canada http://hub-link.sf.net
Contact:

Post by HaArD » 2003-04-30 13:19

hmm....

Have you tried pressing the search button? It looks like a magnifying glass; you can also get to it by clicking FILE then SEARCH, or with CTRL-S.

Try that....

HaArD
Posts: 147
Joined: 2003-01-04 02:20
Location: Canada http://hub-link.sf.net
Contact:

Post by HaArD » 2003-04-30 13:22

Once you get that working, try clicking on the column headings...

User | File | Type | Size | etc etc

Tr0n
Posts: 16
Joined: 2003-04-30 02:22

Post by Tr0n » 2003-04-30 16:26

D0h! :roll:

Many of the files don't have the same name, though they are the same.

It would just make things much easier, and it ain't so hard to implement.

mai9
Posts: 111
Joined: 2003-04-16 23:02

Post by mai9 » 2003-04-30 17:33

I think this is a nice feature.

All files with exactly the same filename and size grouped with a '+' sign
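A minimal sketch of the grouping mai9 describes, assuming results arrive as (user, filename, size) records; the Result struct and function names here are purely illustrative, not DC++'s actual classes:

#include <algorithm>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified search result; not DC++'s real SearchResult class.
struct Result {
    std::string user;
    std::string filename;
    std::uint64_t size;
};

// Group results whose (lower-cased filename, size) match exactly, so a UI
// could show one row per group with a '+' that expands the individual users.
std::map<std::pair<std::string, std::uint64_t>, std::vector<Result>>
groupByNameAndSize(const std::vector<Result>& results) {
    std::map<std::pair<std::string, std::uint64_t>, std::vector<Result>> groups;
    for (const Result& r : results) {
        std::string key = r.filename;
        std::transform(key.begin(), key.end(), key.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        groups[{key, r.size}].push_back(r);
    }
    return groups;
}

Each entry in the returned map would be one visible row; the size of its vector is the number of users offering that exact filename/size combination.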

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Once again we come back to hashes

Post by GargoyleMT » 2003-05-01 08:32

The proper way to group files together is to ensure that their hashes are the same. I think any grouping before that is just guesswork, and will lead to more confusion about "rollback inconsistency" errors - "They were grouped, why won't it resume to the partial I've already downloaded?!?!!"

Of course, there's no easy way to return hashes for a given file using the $SR in the current DC protocol.
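For context, a search result in the NMDC protocol is a single $SR line along these lines (layout from the public protocol description; the values and the 0x05 separators, written here as <05>, are purely illustrative):

    $SR SomeUser share\video\trip.avi<05>734003200 2/4<05>SomeHub (192.168.0.1:411)|

There is no field in it for a content hash; the workaround clients later settled on was to put "TTH:<base32 root>" into the hub-name slot.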

Tr0n
Posts: 16
Joined: 2003-04-30 02:22

Post by Tr0n » 2003-05-02 03:04

Hm, hash is hard to calculate? Why?

I didn't check the source code, but it sounds weird.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

bueller?

Post by GargoyleMT » 2003-05-02 07:50

Tr0n wrote:Hm, hash is hard to calculate? Why?
I didn't check the source code, but it sounds weird.
That's not quite what I said, Tr0n. I said there's no way to return a hash value with search results, so that you can know for sure that two files are the same. DC++ doesn't have any hashing now, though BlackClaw is working on it. So in one sense, it is hard to calculate.

Tr0n
Posts: 16
Joined: 2003-04-30 02:22

Post by Tr0n » 2003-05-02 17:29

IC.

Thanks for the info and the hard work.

It's appreciated!

Melkor
Posts: 24
Joined: 2003-02-23 03:38
Contact:

Post by Melkor » 2003-05-02 21:11

About hashing:
Hashing takes a LOT of CPU time on big shares.
If you need proof, get the XS client.
It's a filesharing community that has hashing.
You have to wait forever for it to hash everything.
It has to read EVERY file you share to compute the hashes.
Yes, it is a great feature, but building the filelist would take an eternity.
Also, the filelist size would likely double if they were stored there.

And the only time it is useful is on small files.
On larger files it is pretty obvious which are which.

There should be a limit on how big a file can be and still get hashed.
Maybe 20 MB?

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-05-02 23:01

Melkor wrote:About hashing:
Hashing takes a LOT of CPU time on big shares.
If you need proof, get the XS client.
XS? Ah, http://xs.tech.nu/. Well, eMule, Shareaza, and Kazaa all use various types of hashes. Can you generalize hash performance on them too? ;-) (Of course it takes CPU time, but that can be handled by priorities and some throttling.)
Melkor wrote:It's a filesharing community that has hashing.
You have to wait forever for it to hash everything.
It has to read EVERY file you share to compute the hashes.
Yes, it is a great feature, but building the filelist would take an eternity.
Also, the filelist size would likely double if they were stored there.
Well, DC++ does not have to block on completion of hashes. The DC protocol doesn't need them, so the extra functionality enabled by hashes can be disabled until they're available. If file hashes (root TTH) are stored in the file list, it will add another 24 bytes per file. The full tree takes more, but that only has to be exchanged with clients who are downloading the file from you.

Melkor wrote:And the only time it is useful is on small files.
On larger files it is pretty obvious which are which.
There should be a limit on how big a file can be and still get hashed.
Maybe 20 MB?
Hashing is useful on files of all sizes. With larger files, incremental verification will be more useful, because you can figure out which segment of the file is corrupt and which section to re-get.

I hope this helps clear up your misconceptions.
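To illustrate the incremental-verification idea, here is a compact sketch of a hash tree in the spirit of TTH. std::hash stands in for the real Tiger function (which is not in the C++ standard library), so this is only the shape of the scheme, not a compatible implementation; real THEX/TTH also prefixes leaf and node input with distinct marker bytes.

#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Digest = std::size_t;   // stand-in for a 24-byte Tiger digest

// Hash one fixed-size file segment (a leaf of the tree).
Digest hashLeaf(const std::string& segment) {
    return std::hash<std::string>{}("leaf:" + segment);
}

// Combine two child digests into their parent.
Digest hashNode(Digest left, Digest right) {
    return std::hash<std::string>{}("node:" + std::to_string(left) + "," + std::to_string(right));
}

// Hash every segment once; this level is what lets a downloader verify
// segments individually as they arrive.
std::vector<Digest> leafLevel(const std::vector<std::string>& segments) {
    std::vector<Digest> leaves;
    leaves.reserve(segments.size());
    for (const std::string& s : segments)
        leaves.push_back(hashLeaf(s));
    return leaves;
}

// Reduce pairwise up to a single root. The root is the small value that would
// go into the file list; the full tree only goes to actual downloaders.
Digest rootOf(std::vector<Digest> level) {
    if (level.empty())
        return hashLeaf("");
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (std::size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(hashNode(level[i], level[i + 1]));
        if (level.size() % 2 != 0)
            next.push_back(level.back());   // an odd node is promoted unchanged
        level = std::move(next);
    }
    return level.front();
}

// Incremental verification: re-hash only the segment that just finished and
// compare it with the corresponding leaf from the uploader's tree.
bool segmentOk(std::size_t index, const std::string& data,
               const std::vector<Digest>& leaves) {
    return index < leaves.size() && hashLeaf(data) == leaves[index];
}

The point is that the per-file cost in the file list is just the root, while a downloader who has the leaf level can check each finished segment on its own instead of re-reading the whole file.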

Wisp
Posts: 218
Joined: 2003-04-01 10:58

Post by Wisp » 2003-05-14 10:36

maybe you could group files by filesize+crc

then you would be sure they're the same

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-05-14 23:17

A hash is an improved version of a CRC, so you can simply group by hash. Two different-sized files should never produce the same hash (with a good hash function); that would be a collision, which is not good for a hash function to have.

For clarification, I'm talking about running the contents of a file through a function to hash them (like Tiger), and arriving at a string/number that uniquely identifies the file. Tiger Tree Hash (TTH) and SHA-1 are two good ones, and either would serve the purpose in DC++.
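As a toy illustration of the difference, keying copies by a content digest groups renamed files that (filename, size) keys would miss; std::hash stands in for a real digest such as Tiger or SHA-1 and is not cryptographic, so treat this purely as a sketch:

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // filename -> contents; the first two "files" are byte-identical copies.
    std::vector<std::pair<std::string, std::string>> files = {
        {"trip.avi",         "identical bytes"},
        {"trip-renamed.avi", "identical bytes"},
        {"other.avi",        "different bytes"},
    };

    // The digest depends only on the contents, so filenames never enter into it.
    std::map<std::size_t, std::vector<std::string>> byDigest;
    for (const auto& f : files)
        byDigest[std::hash<std::string>{}(f.second)].push_back(f.first);

    for (const auto& group : byDigest)
        std::cout << group.first << " -> " << group.second.size() << " copies\n";
}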

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-05-15 04:56

It is theoretically possible for two different files to have the same hash (after all, hashes are limited in number, while files aren't), and I've actually had a few cases where two different files had the same hash with lMule (20,000 files, 350 GB total, and the probability becomes reality :mrgreen:)

mai9
Posts: 111
Joined: 2003-04-16 23:02

Re: Once again we come back to hashes

Post by mai9 » 2003-05-18 20:15

GargoyleMT wrote:The proper way to group files together is to ensure that their hashes are the same. I think any grouping before that is just guesswork, and will lead to more confusion about "rollback inconsistency" errors - "They were grouped, why won't it resume to the partial I've already downloaded?!?!!"
And what does "Search for Alternate Locations" do?

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Re: Once again we come back to hashes

Post by sarf » 2003-05-19 10:26

mai9 wrote:[snip]
And what does "Search for Alternate Locations" do?
It searches for other files with a... similar size and with the same filename (well, almost) as the file searched for.

Sarf
---
Confidence: a feeling peculiar to the stage just before full comprehension of the problem.

TheParanoidOne
Forum Moderator
Posts: 1420
Joined: 2003-04-22 14:37

Post by TheParanoidOne » 2003-05-19 11:24

Just to be totally random, does anyone know what the numbers stand for in the title of this thread?

joakim_tosteberg
Forum Moderator
Posts: 587
Joined: 2003-05-07 02:38
Location: Sweden, Linkoping

Post by joakim_tosteberg » 2003-05-19 11:51

TheParanoidOne wrote:Just to be totally random, does anyone know what the numbers stand for in the title of this thread?
Not a clue.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Once again we come back to hashes

Post by GargoyleMT » 2003-05-19 21:35

mai9 wrote:And what does "Search for Alternate Locations" do?
You have a point, but I think that if files are grouped together, users will understand even less why one of them will not resume to an existing download, should they not be compatible.

ParanoidOne: the number in the topic is the RFE # on the SourceForge tracker. Witness the glory.

TheParanoidOne
Forum Moderator
Posts: 1420
Joined: 2003-04-22 14:37

Re: Once again we come back to hashes

Post by TheParanoidOne » 2003-05-20 04:21

GargoyleMT wrote: ParanoidOne: the number in the topic is the RFE # on the SourceForge tracker. Witness the glory.
Glory witnessed. Duly noted. :)

mai9
Posts: 111
Joined: 2003-04-16 23:02

Re: Once again we come back to hashes

Post by mai9 » 2003-05-20 14:27

sarf wrote:
mai9 wrote:[snip]
And what does "Search for Alternate Locations" do?
It searches for other files with a... similar size and with the same filename (well, almost) as the file searched for.
similar? why not exact size? :shock:

if we can't be sure that two files with exactly the same filename and size are the same file, how come DC++ decides to continue downloading from another file which is 'similar' but clearly not the same?

Now that I think of it, I think I had this 'similarity' problem when downloading MP3s: half from one rip, half from another. :evil:

I understand that this is the reason this function is optional, but can I ask for an exact search for alternatives? :roll:

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Re: Once again we come back to hashes

Post by sarf » 2003-05-20 16:20

mai9 wrote:[snip]
similar? why not exact size? :shock:
Because there is currently no way of searching for files with exactly the same size.
mai9 wrote:if we can't be sure that two files with exact filenames and size are the same file, how come DC++ decides to continue downloading from another file which is 'similar' but clearly not the same?
DC++ does the search like this: if the file is below a certain number of bytes, it searches for files with "at most" that many bytes; if it is at or above that number, it searches for files with "at least" that many bytes. Or somesuch. That's why I said "similar".
DC++ only adds alternate files that have a filesize that is EXACTLY the same as the file it searches for, though, so it should (theoretically) be "safe".
mai9 wrote:[snip]
I understand that this is the reason this function is optional, but can I ask for an exact search for alternatives? :roll:
Sure. People have asked for the moon, too, but it's still puttering around old Tellus. :)

Any change made to the way the searches work would only be effective on clients which supported it - there have been several discussions about how the current search could be extended to include exact filesize searches (in fact, I believe some clients use this). For more information, consult your pineal gland (and the Search button).
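A rough sketch of the behaviour sarf describes above; the threshold constant and function names are made up for illustration, not taken from DC++'s source:

#include <cstdint>
#include <string>

enum class SizeMode { AtLeast, AtMost };

// Auto-search request: below some threshold, ask for files of "at most" the
// queued size; at or above it, ask for "at least" that size (hence "similar").
SizeMode pickSizeMode(std::uint64_t fileSize, std::uint64_t threshold) {
    return fileSize < threshold ? SizeMode::AtMost : SizeMode::AtLeast;
}

struct Result {
    std::string filename;
    std::uint64_t size;
};

// Even though the search itself is fuzzy, a result is only accepted as an
// alternate source if its size matches the queued file exactly.
bool acceptAsAlternate(const Result& r, std::uint64_t queuedSize) {
    return r.size == queuedSize;
}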

Sarf
---
Oh, drat these computers. They are so naughty and complex. I could just pinch them.

mai9
Posts: 111
Joined: 2003-04-16 23:02

Once again we come back to sizes

Post by mai9 » 2003-05-21 20:48

sarf wrote:DC++ only adds alternate files that have a filesize that is EXACTLY the same as the file it searches for, though, so it should (theoretically) be "safe".
mai9 wrote:[snip]
I understand that this is the reason this function is optional, but can I ask for an exact search for alternatives? :roll:
Sure. People have asked for the moon, too, but it's still puttering around old Tellus. :)
Oh, I understood that the option "Search for alternate locations" added files with the same filename but a similar size, but now I understand it only adds files with the same filename and size.

(I am posting again to make sure that this time I got it right)

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Once again we come back to sizes

Post by GargoyleMT » 2003-05-21 22:16

mai9 wrote:Oh, I understood that the option "Search for alternate locations" added files with the same filename but a similar size, but now I understand it only adds files with the same filename and size.
True, that's how I understand it. A manual "Search for Alternates" will search for files whose names contain substrings of the queued file's name and that are at least the same size. Only files that match size-wise and contain the same substrings will be added as sources for the queued file... So if someone renames it so that it has more words, like tagging an AVI with some group's name, it will still match in an alternate search. However, if they correct a misspelling or remove one of the words, it will no longer be found.

However, if we had searching by hashes, you could find the same file regardless of filename, and you could group them with absolute confidence.

(or something... I'm tired, and didn't verify the code just to make sure I'm right.)
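A sketch of the matching rule described above, with hypothetical helper names (exact size, plus every word of the queued name appearing somewhere in the result name), which is why adding a group tag still matches but fixing a misspelling breaks the match:

#include <algorithm>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>

// Lower-case a copy of a string for case-insensitive comparison.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

bool matchesQueuedFile(const std::string& resultName, std::uint64_t resultSize,
                       const std::string& queuedName, std::uint64_t queuedSize) {
    if (resultSize != queuedSize)
        return false;                                   // exact size required
    const std::string haystack = toLower(resultName);
    std::istringstream words(toLower(queuedName));
    std::string word;
    while (words >> word)
        if (haystack.find(word) == std::string::npos)   // every word must appear
            return false;
    return true;    // extra words in the result (e.g. a release tag) are fine
}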

Locked