Looking in Filelists\ for alternates

Archived discussion about features (predating the use of Bugzilla as a bug and feature tracker)


Locked
wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Looking in Filelists\ for alternates

Post by wickedsun » 2003-03-17 10:30

I know someone said that he would like to see DC++ look in file lists for alternates... I think that DC++ should check all the downloaded file lists for alternates. That way, even if the users aren't online, at least they're added to your download queue.


Just a thought ;)

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Offline Search

Post by GargoyleMT » 2003-03-17 11:37

wickedsun wrote:I know someone said that he would like to see DC++ look in file lists for alternates... I think that DC++ should check all the downloaded file lists for alternates. That way, even if the users aren't online, at least they're added to your download queue.
I agree. A couple of implementation questions came up the first time I thought of this. The primary one is that some file lists take a long time to open (on my system: Athlon 1900+, 768 MB)... Storing them all in memory would be a bad idea, but so would 3-7 second pauses while the lists are opened. How should this be dealt with?

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-03-17 12:33

It will be a wicked amount of string matching, an ivuly amount of them actually. Let's say I have 1000 files in my queue and I download a file list with 100,000 files... yikes, that could mean 1000 * 100,000 = 100,000,000 string matches have to be done ;)) Well, of course that is a stupid way of doing it; you should search on size first and only then match strings, to reduce the matching a lot with some binary scheme or something like it. But I think hashes will be implemented before this is done ;))
Everyone is supposed to download from the hubs - I don't know why, but I never do anymore.
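The size-first scheme fusbar describes can be sketched like this (a minimal illustration; the tuple layout, names, and sizes are made up, and a real file list would of course be far larger):

```python
from collections import defaultdict

def match_by_size(queue, filelist):
    """Return (queue_name, list_name) pairs whose sizes match and whose
    names compare equal, doing string comparisons only within a size bucket.

    queue and filelist are lists of (name, size) tuples.
    """
    # bucket the download queue by file size once, up front
    by_size = defaultdict(list)
    for name, size in queue:
        by_size[size].append(name)

    matches = []
    for name, size in filelist:
        # usually an empty or tiny bucket, so few string compares happen
        for queued in by_size.get(size, ()):
            if queued == name:
                matches.append((queued, name))
    return matches
```

With the queue bucketed by size, a 100,000-entry file list only triggers string comparisons for entries whose size collides with something in the queue, instead of comparing every queue/list pair.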

[KUN.NL]mepmuff
Posts: 73
Joined: 2003-01-06 09:32

Post by [KUN.NL]mepmuff » 2003-03-17 12:34

I don't quite see why those few seconds would be much of a problem. Can you explain?

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-03-17 14:15

[KUN.NL]mepmuff wrote:I don't quite see why those few seconds would be much of a problem. Can you explain?
My point is that it should be done in an efficient way, and it would be a lot easier if we had file hashes. I have nothing against this at all; I actually wrote a small script which did this offline, i.e. added entries to queue.xml from the .bz2 dclist files in the file list directory, mostly to learn how the XML was parsed. I'll see if I have saved it somewhere.
Everyone is supposed to download from the hubs - I don't know why, but I never do anymore.

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-17 16:03

Whilst waiting for the file hashes, we could just create a hash table where the key is the file size.

Sarf
---
"Observe, reason and experiment. (if you are too dumb, just pray)" - Daniel Rutter

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-03-17 16:29

Well, almost, at least; there exist many files with the same size ;))
Everyone is supposed to download from the hubs - I don't know why, but I never do anymore.

wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Re: Offline Search

Post by wickedsun » 2003-03-17 19:20

GargoyleMT wrote:
wickedsun wrote:I know someone said that he would like to see DC++ look in file lists for alternates... I think that DC++ should check all the downloaded file lists for alternates. That way, even if the users aren't online, at least they're added to your download queue.
I agree. A couple of implementation questions came up the first time I thought of this. The primary one is that some file lists take a long time to open (on my system: Athlon 1900+, 768 MB)... Storing them all in memory would be a bad idea, but so would 3-7 second pauses while the lists are opened. How should this be dealt with?
I totally agree; this is why I didn't post this a while ago. The drawback is that the lists are slowly parsed. Maybe saving the lists in another format once downloaded would fix this (i.e. DC++ saves the list in a more convenient format, such as XML, after you download it once).

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Offline Search

Post by GargoyleMT » 2003-03-17 20:13

wickedsun wrote:I totally agree; this is why I didn't post this a while ago. The drawback is that the lists are slowly parsed. Maybe saving the lists in another format once downloaded would fix this (i.e. DC++ saves the list in a more convenient format, such as XML, after you download it once).
Well, I'm not sure what that intermediate file would be. I'm fairly certain that it wouldn't be a straight XML conversion of the file list. Sarf's probably right about a hash keyed on the file size... Even if you just stored the file sizes (in something like a (person's name).filesize file) that DC++ opened up to see if there were any matches, that might get ugly very quickly. Or... not?

File hashes are something that's on my plate that I haven't made much progress on yet (other than deciding to just use the code from Bitcollider)... A similar offline file search by hash would also, likely, need a digest-type file created from the full bz2 list.

I dunno, just thinking out loud and exposing my fledgling ideas to criticism. ;-) (but I'm glad this has attracted such heavyweight posters as fusbar and sarf).

wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Post by wickedsun » 2003-03-18 01:48

The whole reason for not doing any sort of hashing is that it is not implemented in the original client. Now, I am not sure whether over 50% of people are using DC++ nowadays, but if they are, MD5 should be implemented.

Another problem would be that people who have NMDC would be skipped, which is, obviously, a drawback. Maybe the hash check should be used on an "if available" basis. But that doesn't really solve what I said, yet ;) We're still back to square one, as we need to find a way to implement this without a hash.

Or maybe the hash could be built using the filename+size. It would be a fairly good way to implement it. It would, however, be done on the downloader's side... That way, the protocol would not have to be changed. The hash would only be used to compare the filename/size, nothing more.

I'm just thinking out loud too, so if I don't make any sense, don't bash me :(


;)

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-18 11:52

wickedsun wrote:The whole reason for not doing any sort of hashing is that it is not implemented in the original client. Now, I am not sure whether over 50% of people are using DC++ nowadays, but if they are, MD5 should be implemented.
Well, going to file hashing is for the betterment of the DC++ community, as well as the clients derived from it (oDC++, BCDC, DC++k, ZPoc). The impact on other clients of adding the BZip2 lists wasn't really important; it didn't break compatibility.

Bitzi's Bitcollider source calculates five or so hashes of the file, which would be very useful if we flesh out sarf's integration of direct download URLs (sig2dat, edonkey, ???). MD5 is not encouraged for use in new software (by RSA). TigerTree, in a format compatible with Shareaza and Bitzi, is probably the best hash for the future... but storing all of them probably wouldn't be a problem.
Another problem would be that people who have NMDC would be skipped, which is, obviously, a drawback. Maybe the hash check should be used on an "if available" basis. But that doesn't really solve what I said, yet ;) We're still back to square one, as we need to find a way to implement this without a hash.
Well, with auto-search for alternates, accommodations need to be made until clients with hashing are widespread enough to ignore those not supporting the feature. You'll most likely have to do two searches for the files in your queue that have hashes (what to do with files without hashes is not so important initially, i.e. whose hash do you pick for the file if the original source doesn't support hashes?): one to search by hash, and one to search the "normal" way. You can blindly add everyone with the same size file, and use the hash to determine (at transfer time) whether they're valid sources or not, after transferring whatever the minimum segment size is.
Or maybe the hash could be built using the filename+size. It would be a fairly good way to implement it. It would, however, be done on the downloader's side... That way, the protocol would not have to be changed. The hash would only be used to compare the filename/size, nothing more.
Well, if you've got a structure that search results come into that only has file hashes in it, why do you need to add the size at all? The length of the file already plays in, because all the bytes inside are fed to the hash function. If you sent them as two separate values, you could use the file size to avoid looking "too much," but I'd imagine that the size would be in a hash table too. One lookup to see if we have files matching the size (then an additional search per hash), or simply one lookup to see if the hash matched or not?
I'm just thinking out loud too, so if I don't make any sense, don't bash me :(
:) We're all human, I'm not going to bash you when you'll have ample opportunities at one time or another to bash me. Plus, I try to phrase criticism in such a way that it doesn't imply "bashing"... or something.

Plus, conversation is a remarkable asset to problem solving... Ugh, that sounds like a buzz-phrase.
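The two lookup shapes being weighed here can be put side by side (a hypothetical sketch; the hash strings, filenames, and sizes are invented, and a real queue index would be built from the client's queue rather than hard-coded):

```python
# Hypothetical queue indexes: one keyed by content hash alone,
# one keyed by size with the hash checked afterwards.
queue_by_hash = {"deadbeef": "movie.avi", "cafebabe": "song.mp3"}
queue_by_size = {700_000_000: [("deadbeef", "movie.avi")],
                 5_000_000: [("cafebabe", "song.mp3")]}

def lookup_by_hash(file_hash):
    # single probe; the size never needs to be consulted, since the
    # hash already covers every byte of the file
    return queue_by_hash.get(file_hash)

def lookup_by_size_then_hash(size, file_hash):
    # size probe narrows the candidates, then hashes are compared
    for known_hash, name in queue_by_size.get(size, ()):
        if known_hash == file_hash:
            return name
    return None
```

Either way the work per search result is close to constant; the size-first variant mainly pays off when some results arrive without a hash and size is the only thing available to match on.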

wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Post by wickedsun » 2003-03-19 07:39

What I meant is that it could be "self-service" rather than depending on others. You don't send/receive hashes; you just build a database file with the name+filesize as a hash key. I'm saying name+size because, obviously, there is more than one file with the same size. Keep in mind that this kind of thing can be calculated without having to download a single byte of the files, so it is possible to build the database from a file list alone. Also, you *could* use a file format where it's written like "<filename> <size>", but that's less efficient, in my opinion, than combining the two of them in one string.

I think this would be the best idea to implement some kind of hash without touching the structure of the protocol.
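wickedsun's name+size key could be sketched like this (an illustrative fragment; the separator, the lower-casing, and the sample entries are arbitrary choices, not anything DC++ actually does):

```python
def make_key(name, size):
    # combine filename and size into one lookup string; "|" is an
    # arbitrary separator that can never appear inside a numeric size
    return f"{name.lower()}|{size}"

def build_index(filelist):
    # filelist is an iterable of (name, size) pairs taken straight
    # from a downloaded file list; no file contents are needed
    return {make_key(name, size) for name, size in filelist}

index = build_index([("track01.mp3", 4567), ("readme.txt", 120)])
```

A queued file is then one set-membership test away, rather than a separate name comparison plus size comparison per entry.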

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-19 07:55

You seem to be talking about hashes for looking up file list contents, not hashes of the contents (what I was referring to).

wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Post by wickedsun » 2003-03-20 07:37

Well, yes. I was trying to find a solution for the first problem (the speed of the file lists), not start another thread about hashes ;) A hash would greatly help (making this thread pretty much useless except for the first four posts ;))

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-20 08:35

Ah, then back to the point. If you hashed file sizes and filenames, it would be very useful for alternate searching, but only if you had the exact filename. If the remote person renamed the file or added a word, it would be missed as an alternate source. And it would not be so good for offline searching in general, which seems like another related and useful feature. Is a compromise possible that would take care of both cases (alternate searching + normal offline searching)? Hmmmm.

wickedsun
Posts: 11
Joined: 2003-02-27 13:52

Post by wickedsun » 2003-03-20 11:05

Erm, true. Forgot about that :/ Well now, it seems that without hashes, this wouldn't be a very good option :/

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-20 18:57

wickedsun wrote:Erm, true. Forgot about that :/ Well now, it seems that without hashes, this wouldn't be a very good option :/
Well, it'd just have to be a more robust feature, not a quick one. Personally, I think there's probably something to be learned from how search engines work: pre-tokenizing filenames, etc. (This is partly inspired by an idea to allow eventual searching on metadata, like ID3 tags.) And the answer of "we can deal with offline searching generally some other time" is okay too; I'm just all for an elegant two-birds-with-one-stone approach, if there is one.
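The search-engine idea here can be sketched as an inverted index over filename tokens (a hypothetical illustration; the tokenization rule and the sample filenames are made up):

```python
import re
from collections import defaultdict

def tokenize(name):
    # split a filename into lowercase alphanumeric tokens; the exact
    # rule is one arbitrary choice among many
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def build_inverted_index(filelist):
    # map each token to the set of (name, size) entries containing it,
    # so a renamed copy still matches on the tokens the names share
    index = defaultdict(set)
    for name, size in filelist:
        for token in tokenize(name):
            index[token].add((name, size))
    return index

def search(index, query):
    # entries containing every token of the query
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))
```

A rename from "My.Movie.2003.avi" to "My Movie (better rip).avi" still shares the tokens "my", "movie", and "avi", so both copies can be found, unlike with an exact name+size key.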

Locked