How to make hashing more practical

Archived discussion about features (predating the use of Bugzilla as a bug and feature tracker)

Moderator: Moderators

Locked
LordSqueak
Posts: 10
Joined: 2003-11-04 18:02
Location: Sweden
Contact:

How to make hashing more practical

Post by LordSqueak » 2005-01-20 05:55

TTH is a nice feature, but I'm sure many, like me, are getting tired of having their computer hammered by hashing every time they do something in DC++.

So this is an attempt at a serious thread about how to make TTH hashing a bit more practical in DC++.



A few things that aren't very practical with TTH:

    hashdata.dat
(a huge file that just keeps growing as it adds TTHs; /rebuild can remove old TTHs, shrinking it a bit)
Just what is this file good for? Does it actually have any practical use?

    renaming = rehashing
Renaming a file or a dir makes DC++ rehash.

    not being able to download to... if the other file has a different TTH

Something that used to work.
Yes, we DO know it's a different file, but we don't give a damn. For whatever reason, I think people should be able to download whatever they want; that's what filesharing is about.


Let's suggest practical solutions.

ullner
Forum Moderator
Posts: 333
Joined: 2004-09-10 11:00
Contact:

Re: How to make hashing more practical

Post by ullner » 2005-01-20 06:50

Have you actually looked in Hashdata.dat?
LordSqueak wrote:renaming = rehashing
Renaming a file or a dir makes DC++ rehash.
How else should DC++ know it's the same file?
LordSqueak wrote:not being able to download to... if the other file has a different TTH
Something that used to work.
Yes, we DO know it's a different file, but we don't give a damn. For whatever reason, I think people should be able to download whatever they want
Sure. Go ahead. Download the file separately. Don't destroy the community by adding yet another file that is different. It comes down to file integrity.

Naga
Posts: 45
Joined: 2003-12-02 11:24
Location: Sweden

Re: How to make hashing more practical

Post by Naga » 2005-01-20 06:53

LordSqueak wrote:
    hashdata.dat
(a huge file that just keeps growing as it adds TTHs; /rebuild can remove old TTHs, shrinking it a bit)
Just what is this file good for? Does it actually have any practical use?

It holds the TTH leaves (if I'm not mistaken).

LordSqueak wrote:Renaming a file or a dir makes DC++ rehash.

How can DC++ know it's the same file if it doesn't rehash it?

LordSqueak wrote:
    not being able to download to... if the other file has a different TTH
Something that used to work.
Yes, we DO know it's a different file, but we don't give a damn. For whatever reason, I think people should be able to download whatever they want; that's what filesharing is about.

Why on earth would you want to download two different files into one?

Edit: Damn Ullner, posting while I'm typing 8)
Thanks to all open source programmers!
They enable the rest of us to learn a lot!

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: How to make hashing more practical

Post by GargoyleMT » 2005-01-20 13:26

LordSqueak wrote:TTH is a nice feature, but I'm sure many, like me, are getting tired of having their computer hammered by hashing every time they do something in DC++.

If you're having performance/usability issues, try following the suggestions to improve your computer's I/O performance: http://www.dslreports.com/faq/9677

LordSqueak wrote:Just what is this file good for? Does it actually have any practical use?

It stores the leaves of the files in your share, and the leaves of the files you're downloading. The leaves are transferred between clients and allow sections of the file to be verified independent of the whole.
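The leaf-and-root structure described here can be sketched in miniature. This is a hedged illustration only: it uses SHA-256 in place of the Tiger hash and a fixed block size, whereas real TTH uses Tiger with a variable leaf granularity; the 0x00/0x01 leaf/node prefixes follow the THEX convention. The point is just that any single block can be verified against its stored leaf without the rest of the file.

```python
import hashlib

BLOCK_SIZE = 64 * 1024  # fixed size for illustration; real TTH leaf granularity varies

def leaf_hashes(data: bytes) -> list[bytes]:
    """Hash each block independently; these stand in for the TTH 'leaves'."""
    return [hashlib.sha256(b'\x00' + data[i:i + BLOCK_SIZE]).digest()
            for i in range(0, len(data), BLOCK_SIZE)]

def root_hash(leaves: list[bytes]) -> bytes:
    """Pair leaves up level by level to a single root (the value a client searches by)."""
    level = leaves
    while len(level) > 1:
        paired = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                paired.append(hashlib.sha256(b'\x01' + level[i] + level[i + 1]).digest())
            else:
                paired.append(level[i])  # odd leaf is promoted unchanged
        level = paired
    return level[0]

# A single downloaded block can be checked against its stored leaf,
# without needing any other part of the file:
data = bytes(200_000)
leaves = leaf_hashes(data)
block = data[BLOCK_SIZE:2 * BLOCK_SIZE]
assert hashlib.sha256(b'\x00' + block).digest() == leaves[1]
```

Corrupting one block changes only that block's leaf (and the root), which is what makes independent verification of file sections possible.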

LordSqueak wrote:Renaming a file or a dir makes DC++ rehash.

NTFS seems to have a persistent file ID, but it appears to be for internal use and is impossible to get at, so it cannot be used. Guessing by file size, name, and timestamp is not foolproof, and may lead to your DC++ thinking a file has a different hash than it actually does. When someone requests it remotely, they'll end up repeatedly requesting the same file from you, because it will fail its integrity check.
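To illustrate why size/name/timestamp matching is shaky, here is a hedged sketch (not DC++ code) of a rename detector built on a (size, mtime) fingerprint. The function names and the fingerprint choice are hypothetical. It can only guess when the fingerprint is unambiguous, and even then two genuinely different files with the same size and timestamp would be conflated, which is exactly the integrity risk described above.

```python
import os
import shutil
import tempfile
from collections import defaultdict

def index_by_fingerprint(paths):
    """Group files by the (size, mtime) 'fingerprint' a rename detector might use."""
    index = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        index[(st.st_size, int(st.st_mtime))].append(p)
    return index

def guess_renames(old_index, new_paths):
    """Map each new path to an old path with the same fingerprint, if unambiguous."""
    guesses = {}
    for p in new_paths:
        st = os.stat(p)
        candidates = old_index.get((st.st_size, int(st.st_mtime)), [])
        if len(candidates) == 1:
            guesses[p] = candidates[0]  # plausible rename, but could be a different file!
    return guesses

# Demo: index a file, rename it, and let the detector guess the old name.
workdir = tempfile.mkdtemp()
old_path = os.path.join(workdir, 'movie.avi')
with open(old_path, 'wb') as f:
    f.write(b'x' * 1000)
os.utime(old_path, (1000000, 1000000))   # pin mtime so the fingerprint is stable
index = index_by_fingerprint([old_path])
new_path = os.path.join(workdir, 'renamed.avi')
os.rename(old_path, new_path)            # a rename preserves size and mtime
guesses = guess_renames(index, [new_path])
shutil.rmtree(workdir)
```

The failure mode is the interesting part: any other file with the same size and (coarse) mtime would match too, so the "guess" can silently attach the wrong hash to a file.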

LordSqueak wrote:Yes, we DO know it's a different file, but we don't give a damn. For whatever reason, I think people should be able to download whatever they want; that's what filesharing is about.

Well, hashing is intended to preserve file integrity. Allowing you to download an incompatible file to an existing one may be supported at some point in the future, but it will end up throwing out the portions that aren't the same as the original - due to the aforementioned TTH leaf checking. This on-the-fly file repair is a ways off, since it's not a feature people will use every day - and it will require user interaction.

LordSqueak
Posts: 10
Joined: 2003-11-04 18:02
Location: Sweden
Contact:

Re: How to make hashing more practical

Post by LordSqueak » 2005-01-20 18:10

GargoyleMT wrote:
LordSqueak wrote:TTH is a nice feature, but I'm sure many, like me, are getting tired of having their computer hammered by hashing every time they do something in DC++.

If you're having performance/usability issues, try following the suggestions to improve your computer's I/O performance: http://www.dslreports.com/faq/9677

Not since I got a new computer, but the old one repeatedly got hammered by the TTH hashing =)
It seems to me that the TTH checking could be optimized.

GargoyleMT wrote:
LordSqueak wrote:Just what is this file good for? Does it actually have any practical use?

It stores the leaves of the files in your share, and the leaves of the files you're downloading. The leaves are transferred between clients and allow sections of the file to be verified independent of the whole.

A good answer, but does DC++ as it is now actually have any use for this?
Does DC++ really check the "portions" of the file it downloads?
This seems mostly like a feature intended for multisource downloads.

GargoyleMT wrote:
LordSqueak wrote:Renaming a file or a dir makes DC++ rehash.

NTFS seems to have a persistent file ID, but it appears to be for internal use and is impossible to get at, so it cannot be used. Guessing by file size, name, and timestamp is not foolproof, and may lead to your DC++ thinking a file has a different hash than it actually does. When someone requests it remotely, they'll end up repeatedly requesting the same file from you, because it will fail its integrity check.

Again, a good answer.
I would suggest a feature in DC++ to rename files/dirs; that way DC++ would know it's the same file without having to rehash.

Edit: fixed the lack of [quote] tags -GMT

LordSqueak
Posts: 10
Joined: 2003-11-04 18:02
Location: Sweden
Contact:

Post by LordSqueak » 2005-01-20 18:30

As for downloading 2 different sources to one file...

It's not like it's going to destroy the community if I went and downloaded, for example, 2 MP3s with different tags.
In fact, the "community" isn't bothered by this at all, at least not until I decide to share the file.

There are many examples of why you'd want to "download to file..." without the TTH checking.
Try searching the forums...

It used to be possible to do this; now it's being blocked because the other source has a different TTH.

How about a setting in configuration that lets you disable the TTH blocking?
Of course the blocking would be on by default.



Finally, I have been reading TTH threads, and basically they all boil down to...
Complainer: TTH sucks!
Fanboy: TTH is the greatest thing since sliced bread!!
Complainer: But I wanna slice my bread myself!!!!
Fanboy: NO!!! That would ruin the whole TTH concept!!!!!!

You get the idea, I hope.

This thread is intended to discuss the issues people have, and to come up with practical solutions.
(Suggesting that someone should read hashdata.dat is like suggesting that someone should read the phone catalog when they lament that they are calling a busy number.)
So please, if the fanboys don't have anything useful to say, don't say anything at all. Don't ruin this thread.

ullner
Forum Moderator
Posts: 333
Joined: 2004-09-10 11:00
Contact:

Post by ullner » 2005-01-20 19:02

LordSqueak wrote:It's not like it's going to destroy the community if I went and downloaded, for example, 2 MP3s with different tags.
Yes it will. Not if one person does it, but if a second, a third, a fourth and so on do it, it will destroy the community.

TheParanoidOne
Forum Moderator
Posts: 1420
Joined: 2003-04-22 14:37

Post by TheParanoidOne » 2005-01-20 19:11

LordSqueak wrote:So please, if the fanboys don't have anything useful to say, don't say anything at all. Don't ruin this thread.

Going slightly off topic, how are you differentiating between advocacy and "fanboyism"?
The world is coming to an end. Please log off.

DC++ Guide | Words

paka
Posts: 45
Joined: 2004-12-27 19:20

Re: How to make hashing more practical

Post by paka » 2005-01-20 19:12

LordSqueak wrote:
GargoyleMT wrote:The leaves are transferred between clients and allow sections of the file to be verified independent of the whole.

but does DC++ as it is now actually have any use for this?

Yes. The development version 0.669 makes use of segment checking:
arnetheduck wrote:* Added advanced resume that detects and tries to repair rollback inconsistencies using tiger trees

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: How to make hashing more practical

Post by GargoyleMT » 2005-01-21 12:18

LordSqueak wrote:It seems to me that the TTH checking could be optimized.

It's been adjusted a couple times, and I honestly don't see it on my slowest computer, a P3/866. I suspect that's rather low-end hardware compared to the computers of some of the people complaining, so I think that the aforementioned FAQ and computer optimization is probably good for any user to do.

LordSqueak wrote:A good answer, but does DC++ as it is now actually have any use for this?
Does DC++ really check the "portions" of the file it downloads?
This seems mostly like a feature intended for multisource downloads.

Yes the leaves and HashData.dat are used. DC++ exchanges the leaves between 0.402+ clients, and uses them to check the file. It also serves the leaves up to any client that asks, and that includes ReverseConnect and its multisource clients, which use them to make multisource downloading safe. The CVS version goes a bit further with its use of the leaves.

LordSqueak wrote:Again, a good answer.
I would suggest a feature in DC++ to rename files/dirs; that way DC++ would know it's the same file without having to rehash.

That's a fine suggestion. Once there is a "My library" window, that will certainly be an option, and will/should let DC++ avoid rehashing the files - assuming they weren't moved between drives (which is possible even if the drive letter doesn't change, thanks to NTFS junctions and mount points).

LordSqueak wrote:It's not like it's going to destroy the community if I went and downloaded, for example, 2 MP3s with different tags.
In fact, the "community" isn't bothered by this at all, at least not until I decide to share the file.

It was my impression that this problem was largely about video files. MP3 tags can drastically change the offsets of the data inside the file, especially when ID3v2 tags are used, since they sit at the beginning of the file and a lot of programs change the amount of padding.
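The offset problem can be demonstrated with a small, hedged sketch, where SHA-256 over fixed-size blocks stands in for TTH leaves: prepending even a tiny tag shifts every block boundary, so none of the leaf hashes line up between the tagged and untagged copies, and the two files look completely different to hash-based matching.

```python
import hashlib
import random

BLOCK = 1024

def leaves(data: bytes):
    """Fixed-size block hashes, standing in for TTH leaves."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

audio = random.Random(0).randbytes(10_000)   # stand-in for the actual MP3 audio frames
tagged = b'ID3' + bytes(97) + audio          # hypothetical 100-byte ID3v2 tag prepended

# Every block boundary has shifted by 100 bytes, so no leaf matches:
assert not set(leaves(audio)) & set(leaves(tagged))
```

The same audio data is present in both files, but because hashing is done over absolute block positions, a 100-byte prefix is enough to make every leaf differ.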

Video files, on the other hand, are nearly always identical lengths, but can be corrupted anywhere. If you look at the archives of the forum, there've been a number of threads about intelligent file repair - which is work, as I've said, but may eventually be implemented.

LordSqueak wrote:How about a setting in configuration that lets you disable the TTH blocking?

It is true that TTH matching "reduces" (for some clever redefinition) the number of sources you can have for a file, since they're not actually the same file, but disabling it doesn't help file integrity. File integrity is pretty important, and without quality files, what user wants to use any file sharing network (Kazaa)?


LordSqueak wrote:(Suggesting that someone should read hashdata.dat is like suggesting that someone should read the phone catalog when they lament that they are calling a busy number.)

I think Ullner suggested you look at it (and HashData.xml) so you'd have an idea of your own. He's curious, and likes trying to discover things on his own. It was a helpful suggestion, not an "OMG, here's the answer, STFU idiot" answer.

LordSqueak wrote:So please, if the fanboys don't have anything useful to say, don't say anything at all. Don't ruin this thread.

Er... yeah. Well, I don't think there has been any fanboyism in this thread. If you keep the discussion highbrow enough, they don't jump in, and the very same people post thoughtful responses. Trying to ward them off this way is just rude.

LordSqueak
Posts: 10
Joined: 2003-11-04 18:02
Location: Sweden
Contact:

Post by LordSqueak » 2005-01-23 07:48

ullner wrote:
LordSqueak wrote:It's not like it's going to destroy the community if I went and downloaded, for example, 2 MP3s with different tags.
Yes it will. Not if one person does it, but if a second, a third, a fourth and so on do it, it will destroy the community.

Not unless they decide to share the offending file.
[irony]Then the community is going to be destroyed, everyone is going to share their own TTH, and DC is going to die in favor of BT.[/irony]
Seriously!!! TTH is a new thing, and the "community" has done OK this far.

LordSqueak
Posts: 10
Joined: 2003-11-04 18:02
Location: Sweden
Contact:

Re: How to make hashing more practical

Post by LordSqueak » 2005-01-23 08:09

GargoyleMT wrote:
LordSqueak wrote:Again, a good answer.
I would suggest a feature in DC++ to rename files/dirs; that way DC++ would know it's the same file without having to rehash.

That's a fine suggestion. Once there is a "My library" window, that will certainly be an option, and will/should let DC++ avoid rehashing the files - assuming they weren't moved between drives (which is possible even if the drive letter doesn't change, thanks to NTFS junctions and mount points).

Perhaps it could even be taken one step further and include a tool for moving files, even between HDs (or any other storage medium, for that matter).


Another thing that has been slightly annoying me:
before TTH, you used to be able to see what hub you queued files from; this is no longer so.
Apart from no longer being able to "remember" what hub the file was queued from, it makes me wonder if it could confuse 2 users with the same nick on 2 different hubs.

cologic
Programmer
Posts: 337
Joined: 2003-01-06 13:32
Contact:

Post by cologic » 2005-01-23 13:14

LordSqueak wrote:Not unless they decide to share the offending file.
[irony]... snip silly, even in irony, strawman ...[/irony]
Seriously!!! TTH is a new thing, and the "community" has done OK this far.

I rue my deletion of those screenshots showing such gems as 5 TTHs across the six instances of a large file found on a hub, every time this comes up. I had thought people had realized TTH's value by now...

The 'community', loath as I am to use that word for a (largely) loosely affiliated group of filesharers, had already been destroyed: apart from proper RAR releases and the like, one had no assurance of an uncorrupted file, and because of this corrupted files did spread and mutate. I count that as destroyed in a filesharing (*gag*) community. TTH can help fix it.

Twink
Posts: 436
Joined: 2003-03-31 23:31
Location: New Zealand

Post by Twink » 2005-01-24 05:22

cologic wrote:
LordSqueak wrote:Not unless they decide to share the offending file.
[irony]... snip silly, even in irony, strawman ...[/irony]
Seriously!!! TTH is a new thing, and the "community" has done OK this far.

I rue my deletion of those screenshots showing such gems as 5 TTHs across the six instances of a large file found on a hub, every time this comes up. I had thought people had realized TTH's value by now...


Wouldn't most of the leaves of the TTH be the same, though? I thought that was part of the point of using that particular hashing scheme: if 90% of the file was the same, DC++ could use both as sources for the parts that were the same, and for the parts that are different it picks one of the sources (likely the one with the most matching TTHs). I presume this is a more advanced feature that hasn't made it into DC++ yet, but I think it is a fairly good solution to what most people seem to be complaining about.

Just because a file is partially corrupt does not mean it is totally useless.
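Twink's observation can be sketched: given per-block hashes for two copies of a file, the blocks outside the corrupted region still match positionally, so in principle either copy could serve those blocks. This is a hedged illustration with SHA-256 over fixed blocks standing in for TTH leaves, not DC++'s actual matching logic.

```python
import hashlib

BLOCK = 64 * 1024

def block_hashes(data: bytes):
    """Per-block hashes, standing in for TTH leaves."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def usable_blocks(mine, theirs):
    """Indices where the other copy's blocks match ours and could be fetched from it."""
    return [i for i, (a, b) in enumerate(zip(mine, theirs)) if a == b]

# Two copies of a 1,000,000-byte file differing only in one corrupted region:
good = bytes(10) * 100_000
bad = good[:200_000] + b'\xee' * 16 + good[200_016:]
matches = usable_blocks(block_hashes(good), block_hashes(bad))
# Every block matches except the single one containing the corruption.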

PseudonympH
Forum Moderator
Posts: 366
Joined: 2004-03-06 02:46

Post by PseudonympH » 2005-01-24 11:50

However, that would require writing code to download and check arbitrary parts of files, which is a decent percentage of the work needed for multisource. Because of this, it probably won't be implemented until multisource downloads are...

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: How to make hashing more practical

Post by GargoyleMT » 2005-01-24 12:26

LordSqueak wrote:Another thing that has been slightly annoying me:
before TTH, you used to be able to see what hub you queued files from; this is no longer so.
Apart from no longer being able to "remember" what hub the file was queued from, it makes me wonder if it could confuse 2 users with the same nick on 2 different hubs.

Yes, users can be confused between hubs - that's part of where the idea for the CID in ADC came from. DC++ has always lost the hub that a user was on between instances - if you'd like it changed, vote for:

[Bug 209] Remember and display hub each source was seen in

(It had nothing to do with TTH being added, it's just coincidence that you noticed it then.)


cologic wrote:I rue my deletion of those screenshots showing such gems as 5 TTHs across the six instances of a large file

I have that picture, I believe, just not on my laptop. I'll check my desktop when I am able.


Twink wrote:Wouldn't most of the leaves of the TTH be the same, though?

Yes, and your idea is a sound one - and has been talked about before. Adding such sources to a file must remain a manual thing (or completely discouraged, once DC++ has the ability to take advantage of such sources), because then we're back to the era of autosearches taking up more CPU time, which is already a problem for some.

-=Mr B=-
Posts: 3
Joined: 2004-08-18 19:43

Post by -=Mr B=- » 2005-01-26 12:02

I even tried suggesting an option for files with their own checksums already included, one that has been used longer than TTH, and I got pretty much shouted at for even suggesting it. I still believe the TTH checking needs to be reworked, specifically for files going to a dir with a *.sfv present. If a file is verified against the SFV and doesn't match, it is discarded; why on earth would you grab a second identical file to download? The TTH is apparently the TTH of a bad file, and will never validate against the SFV no matter how many times it's checked. Then, for heaven's sake, REMOVE the TTH, or better yet, add it to a list of which TTHs NOT to get for a specific file; once it's downloaded and passes the SFV check, trash the info and move on. But nooooo. "If you're so freaking stupid that you add a corrupt file to download, suit yourself." That was basically the response I got the last time I suggested it. Right, like you know it's a corrupted file: you grab a file list, you add the folder structure you want downloaded, you DON'T walk around comparing all the TTHs to different sources.

Yes, it agitates me.

A second thought: IF a file is queued and the source goes away, then after a specified time there should be an automated search, adding the file with the same name and the most common TTH instead of the old entry, and of course restarting the download.
The source of the first download might very well be a guy who removes his file after noticing it's corrupt; then you're darn sure out of luck. Anyhow, if a file has the same name, IDENTICAL, it's "pretty likely" that the user downloading won't ever notice there's been a switch, other than that he actually got it down.

I'm probably going to get shouted at again. Tough luck.
B!

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2005-01-26 13:17

-=Mr B=- wrote:I even tried suggesting an option for files with their own checksums already included, one that has been used longer than TTH

I'm not sure which you suggested; was it CRC32 or MD5? You can construct duplicate files with identical CRC32s easily, and from the math papers that cologic has mentioned, you can attack MD4 (and probably MD5) in the same way. Tiger hasn't had any such problems, but if it does, then the effort to move to another hash algorithm will be worth it (DC++'s hashing code is written in a changeable manner).
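The ease of CRC32 collisions can be shown directly. This is a hedged sketch: a brute-force birthday search over random 8-byte strings, which succeeds after on the order of 2**16 attempts because CRC32 only has 2**32 possible values. (Since CRC32 is linear, collisions can also be constructed analytically rather than searched for; the brute force is just the simplest demonstration.)

```python
import random
import zlib

def find_crc32_collision(seed: int = 0):
    """Birthday-search for two distinct 8-byte strings with the same CRC32."""
    rng = random.Random(seed)
    seen = {}  # crc value -> first message that produced it
    while True:
        msg = rng.randbytes(8)
        c = zlib.crc32(msg)
        if c in seen and seen[c] != msg:
            return seen[c], msg
        seen[c] = msg

a, b = find_crc32_collision()
assert a != b and zlib.crc32(a) == zlib.crc32(b)
```

The same search against a 160-bit-plus hash like Tiger would take on the order of 2**96 attempts, which is why tree hashes over a strong hash are a meaningful integrity upgrade over per-file CRC32.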

-=Mr B=- wrote:I still believe the TTH checking needs to be reworked, specifically for files going to a dir with a *.sfv present. If a file is verified against the SFV and doesn't match, it is discarded; why on earth would you grab a second identical file to download?

If someone shares files that don't check against the .SFV, you'll have problems whether you're using TTH or not. If you don't use the TTH for identifying files, then all your non-SFV downloads break...

-=Mr B=- wrote:Right, like you know it's a corrupted file: you grab a file list, you add the folder structure you want downloaded, you DON'T walk around comparing all the TTHs to different sources.

True, but you can make a fairly good check by searching for just one file of the set and sorting by the number of users. If a corrupt file ends up having 5 to 20 users, that bodes poorly for the quality of files on the DC network.

-=Mr B=- wrote:it's "pretty likely" that the user downloading won't ever notice there's been a switch, other than that he actually got it down.

Anything like this should really be done with the user's consent, so they know what's happened, and if something unexpected happens, they'll know why...

paka
Posts: 45
Joined: 2004-12-27 19:20

Post by paka » 2005-01-26 15:15

PseudonympH wrote:However, that would require writing code to download and check arbitrary parts of files, which is a decent percentage of the work needed for multisource. Because of this, it probably won't be implemented until multisource downloads are...

Please read my previous post in this topic. The development version uses TTH for checking parts of a file when it comes across rollback problems. So it's probably only a matter of porting this code for the purpose of searching and downloading partially damaged files.

PseudonympH
Forum Moderator
Posts: 366
Joined: 2004-03-06 02:46

Post by PseudonympH » 2005-01-26 22:19

Based on my reading of the code (DownloadManager::getResumePos() or similar), it only checks the most recently-downloaded chunk and only checks the ones before it if the check fails, continuing until it finds a valid chunk. Of course, at some point the code never actually called that function, but it's the intent that counts. :)
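That backward scan can be sketched in miniature. This is hedged: SHA-256 over fixed chunks stands in for TTH leaves, and the function below is an illustration of the described behavior, not DC++'s actual getResumePos.

```python
import hashlib

CHUNK = 64 * 1024

def chunk_hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def get_resume_pos(downloaded: bytes, expected: list[bytes]) -> int:
    """Walk backward from the last fully downloaded chunk until one verifies
    against its expected hash, then resume right after that chunk."""
    last_full = len(downloaded) // CHUNK        # number of complete chunks on disk
    for i in range(last_full - 1, -1, -1):
        chunk = downloaded[i * CHUNK:(i + 1) * CHUNK]
        if chunk_hash(chunk) == expected[i]:
            return (i + 1) * CHUNK              # first byte after the last good chunk
    return 0                                    # nothing verified; restart from scratch

# Simulate a rollback inconsistency: the tail of the partial file is garbage.
full = bytes(range(256)) * 1024                 # the "true" file, 262,144 bytes
expected = [chunk_hash(full[i:i + CHUNK]) for i in range(0, len(full), CHUNK)]
partial = full[:100_000] + b'\x00' * 70_000     # 170,000 bytes on disk, tail corrupted
resume_at = get_resume_pos(partial, expected)   # chunk 1 fails, chunk 0 verifies -> 65536
```

Note how the sequential assumption shows up: everything after the last verified chunk boundary is discarded, even the genuine bytes between that boundary and the corruption, which matches the point that this code is not cut out for tracking arbitrary good/bad segments.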

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2005-01-27 12:26

PseudonympH wrote:Based on my reading of the code (DownloadManager::getResumePos() or similar), it only checks the most recently downloaded chunk, and only checks the ones before it if the check fails, continuing until it finds a valid chunk.

Indeed, this feature is written with the assumption that the file's contents are downloaded in sequence. It could be adapted to create an array of good or bad segments of the file, but the QueueManager and DownloadManager aren't cut out to make use of such lists. It'd be more than just a little reworking.

paka
Posts: 45
Joined: 2004-12-27 19:20

Post by paka » 2005-01-27 17:06

Right, not that easy then. My mistake; I hadn't analysed how the new resume works... Thanks for checking it.

Locked