Adding file hashes to the protocol

Technical discussion about the NMDC and ADC (http://dcpp.net/ADC.html) protocols. The NMDC protocol is documented in the Wiki (http://dcpp.net/wiki/), so feel free to refer to it.

Moderator: Moderators

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-05 17:20

GargoyleMT wrote:[snip]
Hmm. Well, we definitely need to be able to return hashes with $Search results sometime, to get the hash to passive clients. Or do we? I suppose a passive user might be the only one of many sources of a specific file online at a given time, and you need the hash from him/her to add it to your queue. Is this a reasonable scenario?
What you mean is that we can get the hash from someone other than the passive user, right? Well, in that case, let's ignore the passive user as a source for hashes. Problem solved (and yes, I have an "active" connection myself ;))
GargoyleMT wrote:When searching by HASH>[value], do we need to return hashes? The client can just request the hash when we connect to that source and verify that it's the right file. And non-hash enabled clients shouldn't send back results at all.
If we search by hash we should never get a different hash back. That'd be like searching for "Britney Spears" and getting search results without "Britney Spears" in their filenames/path, an indication of a faulty client. Thus, we can just assume that the search results returned did, indeed, match the hash we specified, but as we'll check this when we connect and retrieve the file anyway it's no biggie.
GargoyleMT wrote:Do we need to overload $Search at all? If we just use the notation above in keywords...
We might just put this here thingy within the size field and put T and F in the file sizes (or was it F and T) fields... thus allowing other clients to just ignore it. I'd recommend using a new filetype (0xFF would be good) - this would make it really easy for clients just to ignore any filetypes they do not recognize.
GargoyleMT wrote:This is the problem with the one-true-hash idea, which of the web direct file links have a TT hash as their key?
None, as far as I've heard.
GargoyleMT wrote:[snip]True, but I think saving the "bad" sources is probably a good idea. It could be reused in at least another feature I've seen, to "switch" to another source if the speed is below a certain threshold. It also seems like the right thing to do the first time... If someone has a limited number of download slots, then they would probably not appreciate one of those being used to connect to a user that isn't a valid source anyway.
There has already been a DC++ clone/branch which implemented a "blacklist" (Opera's version, perhaps?) - a variant of that could be used.
GargoyleMT wrote:I was kinda hoping that someone else would jump into this conversation, but if nobody is, we must have our heads on straight. ;-)
Well, I'm jumping in now, but only since you asked so nicely. ;)

The reason I've not commented on this thread before is that there has not been anything for me to say - I have close to no knowledge about hashes, and since the collective "we" decided to dive headlong into TigerTree hashes, Merkle hashes, SHA1 and... well, you get the idea.
sandos wrote:I would much like to see metadata like this transferred over the hub, actually. I, as a passive mode user, might want to get the TTH for a file, or even SHA1, from another passive user, so that I can use that to search.
If metadata was passed over the hub... woweee! Has anyone done any calculations of how much this would increase the bandwidth usage on the hub? Since most hubs seem to bring broad-pipe connections to their knees with simple chat+commands, getting them to shuffle XML metadata would be a death stroke... to either them or the clients that try to get the hub to do things it doesn't want to.
sandos wrote:There's also the option to add metadata to the filelist. I heard Arne mention XML-based filelists, and that sounds very good to me; it should make it possible to add arbitrary metadata.
A much better way of handling things... this way one could even add alternates (based on the possible "hash" metadata) in an efficient way.
sandos wrote:[snip quote about segment hashing]
This also makes partial filesharing and "swarming" possible using hashes.
AFAIK swarming is about downloading segments from users... since users in our system are very likely to have a complete file (or at least very unlikely to just have loads of segments lying around their storage medium), we might as well search on complete files. Complete files have slightly greater "longevity", too, as unfinished/temporary files are deleted with impunity, while people are more likely to store the "MegaLegal MusicMovie from the Group With No Name.avi" which they spent a while collecting, but are not as likely to store the 33rd, 48th and 75th segments of it.

My main reason to support ED2K/sig2dat/mumbojumbo links is that they are useful in and of themselves - with such a link I could have DC++ in its current form search for alternates and download the latest releases automatically. Sure, it would be nice to support the hashes they use, but that is not necessary for the links to be useful. Merely helpful.

Hope this helps confuse the issue further.

Sarf
---
When you have had all that you can take, put the rest back.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-06 03:02

GargoyleMT wrote:This is the problem with the one-true-hash idea, which of the web direct file links have a TT hash as their key?
sandos wrote:I'm also for one primary hash, and I would then go for TTH. My second option would be SHA1+TTH.
Can't you create a Merkle hash tree with SHA1 as well? Just a random thought.
Well, you can certainly use any hash to make a tree, but using SHA1 won't buy us anything: the tree's root SHA1 hash will not be the same as an entire-file SHA1 hash. I only know of real-world examples of SHA1 for the entire file, and trees using the Tiger hash, plus the less-good variants: MD5, ed2k, FastTrack ID, all entire-file (not quite for some, don't remember which). I think the ed2k and FastTrack ones are pretty quick to compute, especially if you do it in "parallel" with other hashes so you don't read data from disk multiple times.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Updates on TTH

Post by GargoyleMT » 2003-03-06 20:33

sandos wrote:Shareaza 1.8.1.1 has been released, with a new TTH. It adds a 0x00 byte to every 1024 byte chunk of file data, and a 0x01 byte to every 24 byte hash before feeding into the hash again. The author, Mike, says that this is finalized and is what will be used by bitzi.com as well.
Bitzi hasn't updated the tigertree hash code on sourceforge.net. Have you seen diffs against that code yet? I also haven't seen much in the way of changelogs on Shareaza's 1.8.1.1 version... got a reference with more detail?
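
For anyone who wants to see the construction spelled out, here is a rough sketch of the tree hashing described in that quote. The 24-byte digest below is a made-up placeholder standing in for Tiger - the point is only the 0x00/0x01 prefixing and the pairwise combination of nodes, not the hash function itself.

Code: Select all

// Sketch of THEX-style tree hashing as described above.
// NOTE: fakeDigest() is a placeholder, NOT the Tiger algorithm - it exists
// only so the tree-building logic is complete and compiles.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

typedef std::vector<uint8_t> Bytes;

// Placeholder 24-byte digest (FNV-1a folded three times). Swap in Tiger here.
static Bytes fakeDigest(const Bytes& data) {
    Bytes out(24);
    for (int k = 0; k < 3; ++k) {
        uint64_t h = 1469598103934665603ULL + k;
        for (size_t i = 0; i < data.size(); ++i) {
            h ^= data[i];
            h *= 1099511628211ULL;
        }
        std::memcpy(&out[k * 8], &h, 8);
    }
    return out;
}

static Bytes leafHash(const uint8_t* chunk, size_t len) {
    Bytes buf(1, 0x00);                      // 0x00 prefix for every 1024-byte leaf
    if (len)
        buf.insert(buf.end(), chunk, chunk + len);
    return fakeDigest(buf);
}

static Bytes internalHash(const Bytes& left, const Bytes& right) {
    Bytes buf(1, 0x01);                      // 0x01 prefix for every internal node
    buf.insert(buf.end(), left.begin(), left.end());
    buf.insert(buf.end(), right.begin(), right.end());
    return fakeDigest(buf);
}

// Root of the tree built over 1024-byte chunks of 'file'.
Bytes treeRoot(const Bytes& file) {
    std::vector<Bytes> level;
    if (file.empty())
        level.push_back(leafHash(0, 0));     // empty file = hash of one empty leaf
    for (size_t off = 0; off < file.size(); off += 1024)
        level.push_back(leafHash(&file[off], std::min<size_t>(1024, file.size() - off)));
    while (level.size() > 1) {               // combine pairwise; an odd node is promoted as-is
        std::vector<Bytes> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(internalHash(level[i], level[i + 1]));
        if (level.size() % 2)
            next.push_back(level.back());
        level.swap(next);
    }
    return level[0];
}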

volkris
Posts: 121
Joined: 2003-02-02 18:07
Contact:

Post by volkris » 2003-03-06 20:37

GargoyleMT wrote:So how do we flag a $Search as being for a hash?
You don't have to, just specify the hash as a title search string.
I promise you not many people are going to have long filenames of random characters :)

We investigated this on lichlord and it would have worked well.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-06 21:06

volkris wrote:
GargoyleMT wrote:So how do we flag a $Search as being for a hash?
You don't have to, just specify the hash as a title search string.
I promise you not many people are going to have long filenames of random characters :)

We investigated this on lichlord and it would have worked well.
Well, in that case, it's time for some code instead of all these posts. :)

Well, some implementation ideas, a quasi-summary:
  • Expanding the structure for each file in the SearchManager to accept a hash
  • XML saving of the state of the share manager
  • XML file lists?
  • Identification of the HASH>[value] keyword
  • Some kind of structure to make searching by hash as non-cpu intensive as possible (stl gurus, want to suggest something?)
  • $Supports feature advertising
  • Mini-xml exchange for $MetaInfo <file> or $Get <file>.metainfo/.xml?
  • Queue item saving of hashes
  • Auto-search by hash support (is there a way to have only one search per alternative, instead of a normal and hash one?)
Aside, isn't there a not too difficult solution for returning the hash from a passive to passive client? I don't like putting that in a SEP field (somebody else's problem). And how about the difficulty of calculating more than one hash at the same time (searching by non-TTH hashes is something I'm comfortable putting in a SEP field)?


Well... w00t. Or something.

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-07 05:47

GargoyleMT wrote:Well, in that case, it's time for some code instead of all these posts. :)
Yay!
GargoyleMT wrote:Well, some implementation ideas, a quasi-summary:
And some of my notes on each... great initiative, though, GargoyleMT.
GargoyleMT wrote:Expanding the structure for each file in the SearchManager to accept a hash
Or maybe several hashes? Why not support more than one hash - some sort of future hash might be even better as well as having the possibility to support MD4 hashes so that ShareReactor links can be used... but I digress.
GargoyleMT wrote:XML saving of the state of the share manager
What state are you talking about?
GargoyleMT wrote:XML file lists?
What fields do we need? Size? Date created/changed/used/read by WinAMP? Name (duh!)? Compressed or not? Data streams associated with it (using this would help enormously for those of us who have NTFS/"modern" operating systems, so we could shuffle our files any way we liked)?
GargoyleMT wrote:Identification of the HASH>[value] keyword
Specification of HASH/"type of hash">[value] perhaps, so we could support TTH hashing by having the "HASH/TTH>[value]" search command, which would be totally disregarded by the clients supporting Sarf's Mega Hashing Thingy, which would use "HASH/SMHT>[value]". :)
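
Purely as an illustration of that keyword (nothing like this exists in DC++ today; the syntax is just the proposal above), a client could pull the hash type and value out of a search term roughly like this:

Code: Select all

// Sketch: parse "HASH/<type>>[value]" out of a search term, as proposed above.
// Returns false when the term is an ordinary keyword search.
#include <string>

bool parseHashTerm(const std::string& term, std::string& type, std::string& value) {
    const std::string prefix = "HASH/";
    if (term.compare(0, prefix.size(), prefix) != 0)
        return false;
    std::string::size_type gt = term.find('>', prefix.size());
    if (gt == std::string::npos)
        return false;
    type  = term.substr(prefix.size(), gt - prefix.size());   // e.g. "TTH" or "SMHT"
    value = term.substr(gt + 1);                               // the hash string itself
    return !type.empty() && !value.empty();
}
// parseHashTerm("HASH/TTH>PZ...", t, v) -> t == "TTH", v == "PZ..."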
GargoyleMT wrote:Some kind of structure to make searching by hash as non-cpu intensive as possible (stl gurus, want to suggest something?)
Use <snicker> hash tables. No, seriously. :)
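
To make the joke concrete, here is a minimal sketch of the idea (the types are invented for illustration, not DC++ code): keep one map from the encoded root hash to the shared files carrying it, so a hash search never has to walk the whole share.

Code: Select all

// Sketch: root-hash -> shared-file lookup, so a hash search is a map lookup
// rather than a scan over the whole share. Types here are invented.
#include <map>
#include <string>
#include <vector>

struct SharedFile {
    std::string path;
    long long   size;
};

class HashIndex {
public:
    void add(const std::string& rootHash, const SharedFile& f) {
        byHash[rootHash].push_back(f);
    }
    // All shared files whose root hash matches the searched-for value, or 0.
    const std::vector<SharedFile>* find(const std::string& rootHash) const {
        std::map<std::string, std::vector<SharedFile> >::const_iterator i = byHash.find(rootHash);
        return i == byHash.end() ? 0 : &i->second;
    }
private:
    std::map<std::string, std::vector<SharedFile> > byHash;   // or a real hash table
};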
GargoyleMT wrote:$Supports feature advertising
Define hash commands and specify what hashes the client supports. Add metadata supports string(s).
GargoyleMT wrote:Mini-xml exchange for $MetaInfo <file> or $Get <file>.metainfo/.xml?
Yes! Use $GetMetaInfo <file> - much more intuitive in my humble opinion.
$GetMetaInfo should not be necessary if you have downloaded the XML list, though, since that file should contain all the data necessary to decide whether you want to download a file or not.
GargoyleMT wrote:Queue item saving of hashes
Eh? Saving the hashes of queued files, you mean?
GargoyleMT wrote:Auto-search by hash support (is there a way to have only one search per alternative, instead of a normal and hash one?)
Yes, there is - simply use the fact that a file has a hash to mean that it should not be searched by the normal search for alternatives. :)
GargoyleMT wrote:Aside, isn't there a not too difficult solution for returning the hash from a passive to passive client? I don't like putting that in a SEP field (somebody else's problem).
If you can't use $GetMetaInfo, you will not get the hash. Why should we code this when DC++ has no support for it?
GargoyleMT wrote:And how about the difficulty of calculating more than one hash at the same time (searching by non-TTH hashes is something I'm comfortable putting in a SEP field)?
Not hard at all... simply allow the user to specify what hashes they wish to store locally, then calculate all the specified hashes whenever necessary (the TTH would simply store more data than the other hashes).
GargoyleMT wrote:Well... w00t. Or something.
Hmm... how about "woohooo! more work!" ?

Sarf
---
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-08 11:44

sarf wrote:Or maybe several hashes? Why not support more than one hash - some sort of future hash might be even better as well as having the possibility to support MD4 hashes so that ShareReactor links can be used... but I digress.
I wondered about this. I'm unfamiliar with the services that Bitzi has, but maybe we can use the hashes from the direct download links you're supporting to get back a bitprint (and thus TTH) from Bitzi? This (optional) integration might allow for some other nice features.
GargoyleMT wrote:XML saving of the state of the share manager
Saving of the list of shared files into an XML format. Strike that, that's the item below. This bullet must encapsulate the idea of caching the hashes... That's got to be the second thing implemented from a testing point of view, right after calculating the hashes. The average time I spend running a modded DC++ is on the order of minutes... I don't want to wait for hashes to complete so I can test another feature of this idea.
GargoyleMT wrote:XML file lists?
What fields do we need? Size? Date created/changed/used/read by WinAMP? Name (duh!)? Compressed or not? Data streams associated with it (using this would help enormously for those of us who have NTFS/"modern" operating systems, so we could shuffle our files any way we liked)?
Well, just a basic name/size/hash is probably good for now. But I was wondering if something like this would save time (vs. the way the queue is saved)

Code: Select all

<directory name="parent dir name">
  <file name="blah">
    <hash type="TTH" value="[value]"/>
  </file>
  <directory name="subdir">
     <file name="oh-so-creative">
     </file>
   </directory>
</directory>
Note, I don't even know if that's valid XML. Arne mentioned that XML file lists would be slower, but I don't know how he formatted his, or if he was just using his experience to estimate.

But what's that about being compressed (NTFS Compressed?) and about alternate data streams? I know about them in theory, but how're they useful to know about in DC++?
GargoyleMT wrote:Identification of the HASH>[value] keyword
Specification of HASH/"type of hash">[value] perhaps, so we could support TTH hashing by having the "HASH/TTH>[value]" search command, which would be totally disregarded by the clients supporting Sarf's Mega Hashing Thingy, which would use "HASH/SMHT>[value]". :)
Yes, that's a good and necessary extension. :)
GargoyleMT wrote:Some kind of structure to make searching by hash as non-cpu intensive as possible (stl gurus, want to suggest something?)
Use <snicker> hash tables. No, seriously. :)
Hahaha, I should've seen that one coming. I must've guessed that "hashes + hash tables = The End Of Teh[sic] Universe"
GargoyleMT wrote:$Supports feature advertising
Define hash commands and specify what hashes the client supports. Add metadata supports string(s).

GargoyleMT wrote:Mini-xml exchange for $MetaInfo <file> or $Get <file>.metainfo/.xml?
Yes! Use $GetMetaInfo <file> - much more intuitive in my humble opinion.
$GetMetaInfo should not be necessary if you have downloaded the XML list, though, since that file should contain all the data necessary to decide whether you want to download a file or not.
Ok, I imagine your first suggestion to be something like: "$Supports BZList TTH MD4 SHA1"? Or a separate command like "$Supports BZList SupportsHash" with $SupportsHash returning a list of hash types... but what does that gain us?

I think a $GetMetaInfo <file> should return a mini-xml centered around the <file> type above... probably with a few things removed, like the real path on disk.
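
To make the $Supports side of that a little more concrete (the token names are only the ones floated in this thread, nothing standardized), checking a peer's feature list could be as simple as:

Code: Select all

// Sketch: split a "$Supports ..." line into feature tokens and test for one.
// The token names ("BZList", "TTH", ...) are just the ones discussed above.
#include <set>
#include <sstream>
#include <string>

std::set<std::string> parseSupports(const std::string& line) {
    // line looks like: "$Supports BZList TTH MD4 SHA1"
    std::istringstream in(line);
    std::string tok;
    std::set<std::string> features;
    in >> tok;                          // skip the "$Supports" command word itself
    while (in >> tok)
        features.insert(tok);
    return features;
}
// Usage: if (parseSupports(line).count("TTH")) { /* peer can exchange TTH metadata */ }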
GargoyleMT wrote:Queue item saving of hashes
Eh? Saving the hashes of queued files, you mean?
No, just simply that the queue.xml needs to have a hash and hash type in it... :)
GargoyleMT wrote:Auto-search by hash support (is there a way to have only one search per alternative, instead of a normal and hash one?)
Yes, there is - simply use the fact that a file has a hash to mean that it should not be searched by the normal search for alternatives. :)

Initially, or if the patch never makes it to the main tree, this would limit the usefulness of the hash. If we can find an identically sized file with a similar name, it's probably a source, and we can verify that with the hash. :) Eventually, I'd say that searching by hash might be the only way it makes sense to search for alternatives, but that's after it's been widely accepted.
GargoyleMT wrote:Aside, isn't there a not too difficult solution for returning the hash from a passive to passive client? I don't like putting that in a SEP field (somebody else's problem).
If you can't use $GetMetaInfo, you will not get the hash. Why should we code this when DC++ has no support for it?
Hmmm... yeah. Too difficult for now. But it makes me think that those calling for extensions to the client-server protocol aren't very far off... If we had the hash and there was a standard way of returning it, then we could do so.
GargoyleMT wrote:And how about the difficulty of calculating more than one hash at the same time (searching by non-TTH hashes is something I'm comfortable putting in a SEP field)?
Not hard at all... simply allow the user to specify what hashes they wish to store locally, then calculate all the specified hashes whenever necessary (the TTH would simply store more data than the other hashes).
Now, too much choice here would be a problem. There really are only 4 or 6 hashes that might be supported, but since people are going to be unhappy with the performance hit of hashing, I'd like to only calculate one, if possible... ??
GargoyleMT wrote:Well... w00t. Or something.
Hmm... how about "woohooo! more work!" ?

---
"Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning." - Rich Cook
Yes, definitely. Or... Oooh, Ocarina of Time on the Gamecube! Wheee!

And the tagline seems to fit pretty well - unsurprising if you pick them by hand, and surprising if it's a randomly picked one.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-08 16:39

GargoyleMT wrote: I wondered about this. I'm unfamiliar with the services that Bitzi has, but maybe we can use the hashes from the direct download links you're supporting to get back a bitprint (and thus TTH) from Bitzi? This (optional) integration might allow for some other nice features.
You can only look up files at Bitzi by SHA1 or SHA1.TTH, AFAIK (not by TTH alone).

eHast
Posts: 18
Joined: 2003-01-09 02:36
Location: Lund, Sweden

Post by eHast » 2003-03-08 16:43

I know people must hate me for this by now ;-), but do check out what the OCN people have done with Tree Hashes and such (this is what Tiger Tree Hashes are based on): http://open-content.net/specs/

There's stuff for detecting partial files and downloading as well. Although that system is designed for HTTP.

And IIRC there was some talk a few months ago on the dev list for OCN that there were vulnerabilities in TTH. Not sure what came out of it in the end though.

Nice to see that some of the old crew are still active here with the ideas though. :-)

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-09 03:09

eHast wrote:And IIRC there was some talk a few months ago on the dev list for OCN that there were vulnerabilities in TTH. Not sure what came out of it in the end though.
There is a new draft out, 02, that fixes collisions between internal nodes and leaf nodes. That might be it?

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-09 10:35

Main points I want to take up:
I prefer to calculate as many hashes as necessary at the same time because the CPU hit is really irrelevant - it's the I/O that will take time.
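
To illustrate (this is not DC++ code, and the hasher interface here is invented): read each file once and feed every enabled digest from the same buffer, so a second or third hash costs CPU but no extra disk reads.

Code: Select all

// Sketch: one pass over a file feeding several hash contexts at once, so extra
// algorithms add CPU work but no extra I/O. The Hasher interface is invented.
#include <fstream>
#include <string>
#include <vector>

struct Hasher {
    virtual ~Hasher() {}
    virtual void update(const char* data, size_t len) = 0;
    virtual std::string finish() = 0;    // e.g. hex or Base32 digest
};

bool hashFileOnce(const std::string& path, std::vector<Hasher*>& hashers) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return false;
    std::vector<char> buf(64 * 1024);
    for (;;) {
        in.read(&buf[0], (std::streamsize)buf.size());
        std::streamsize got = in.gcount();
        if (got <= 0)
            break;
        for (size_t i = 0; i < hashers.size(); ++i)   // same buffer, every digest
            hashers[i]->update(&buf[0], (size_t)got);
    }
    return true;
}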
GargoyleMT wrote:[snip]
I wondered about this. I'm unfamiliar with the services that Bitzi has, but maybe we can use the hashes from the direct download links you're supporting to get back a bitprint (and thus TTH) from Bitzi? This (optional) integration might allow for some other nice features.
Hmmm... I don't think so, but someone (that nice person who is always selected to do the dishes and make dinner) should look into that.
GargoyleMT wrote:Saving of the list of shared files into an XML format. Strike that, that's the item below. This bullet must encapsulate the idea of caching the hashes... That's got to be the second thing implemented from a testing point of view, right after calculating the hashes. The average time I spend running a modded DC++ is on the order of minutes... I don't want to wait for hashes to complete so I can test another feature of this idea.
Heh. Caching hashes is a must, agreed. I hope that no one thought we'd recalculate them every time DC++ started?
GargoyleMT wrote:Well, just a basic name/size/hash is probably good for now.
Perhaps... but I would like to have the hashes put into file streams - those are associated with the file itself (and are independent of its location) and would make it unnecessary to recalculate the hash when a file is moved... this is not a big issue, but a thing that we could do with this is calculating hashes on downloading files and putting the data in a file stream, thus making it easy (= not frustrating for the user) to share downloaded files. This could be done with XML data too (and would have to be, since older file systems (FAT/FAT32, for example) do not support file streams). Like this:

Code: Select all

<directory name="parent dir name"> 
 <file name="blah"> 
  <hash type="MD4" value="[value]"/> 
  <hash type="MD5" value="[value]"/> 
  <hash type="SHA1-TTH" value="[value]"/> 
  <hash type="TTH" value="[value]"/> 
  <nodehash type="TTH-nodes">
    <node number="1" value="[value]"/>
    <node number="2" value="[value]"/>
    <node number="3" value="[value]"/>
    <node number="4" value="[value]"/>
  </nodehash>
 </file> 
 <directory name="subdir">
   <file name="oh-so-creative-file-with-filestreams">
      <filestream type="TTH" name="TTH-hash"/>
      <filestream type="MD4" name="MD4-hash"/>
      <filestream type="MD5" name="MD5-hash"/>
      <filestream type="TTH-nodehash" name="TTH-nodehash"/>
     <filestream type="SHA1-TTH" name="SHA1-TTH-hash"/> 
   </file> 
  </directory> 
</directory>
Hope this... umm... helps.
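
For the curious, storing a hash in a file stream looks roughly like this on NTFS (Win32 only; the ":TTH" stream name is just an example, not any agreed convention):

Code: Select all

// Sketch: write a hash into an NTFS alternate data stream ("file.ext:TTH") so it
// follows the file when it is moved or renamed on the same volume. Win32 only;
// the ":TTH" stream name is only an example.
#include <windows.h>
#include <string>

bool writeHashStream(const std::string& path, const std::string& hash) {
    std::string streamPath = path + ":TTH";
    HANDLE h = CreateFileA(streamPath.c_str(), GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return false;                    // FAT/FAT32 volumes will fail here
    DWORD written = 0;
    BOOL ok = WriteFile(h, hash.c_str(), (DWORD)hash.size(), &written, NULL);
    CloseHandle(h);
    return ok && written == (DWORD)hash.size();
}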
GargoyleMT wrote:But I was wondering if something like this would save time (vs. the way the queue is saved)
[snip xml data]
Yes... I kinda thought that this way would be obvious, though. :)
GargoyleMT wrote:Note, I don't even know if that's valid XML. Arne mentioned that XML file lists would be slower, but I don't know how he formatted his, or if he was just using his experience to estimate.
XML-stored data require more CPU time to interpret/retrieve than data that is stored according to a specific, fixed way.
GargoyleMT wrote:But what's that about being compressed (NTFS Compressed?) and about alternate data streams? I know about them in theory, but how're they useful to know about in DC++?
Well, the compression part is irrelevant, but DC++ could refuse to share NTFS encrypted files (which may be unusable to others)... and as said, store data about the file (such as metadata and hashes) in them to make it independent of where the file is stored.
GargoyleMT wrote:Ok, I imagine your first suggestion to be something like: "$Supports BZList TTH MD4 SHA1"? Or a separate command like "$Supports BZList SupportsHash" with $SupportsHash returning a list of hash types... but what does that gain us?
Other clients may support only TTH hashes. Or (if someone makes a good mldonkey client) only MD4 hashes.
GargoyleMT wrote:I think a $GetMetaInfo <file> should return a mini-xml centered around the <file> type above... probably with a few things removed, like the real path on disk.
Yes... though it would, of course, have to use the real information instead of referring to filestreams. Duh. <bonks own head>
GargoyleMT wrote:Initially, or if the patch never makes it to the main tree, this would limit the usefulness of the hash. If we can find an identically sized file with a similar name, it's probably a source, and we can verify that with the hash. :) Eventually, I'd say that searching by hash might be the only way it makes sense to search for alternatives, but that's after it's been widely accepted.
Also known as the "make it a user's choice feature and change the default value when we feel it's time". Sure, I'd go for that.
GargoyleMT wrote:Hmmm... yeah. Too difficult for now. But it makes me think that those calling for extensions to the client-server protocol aren't very far off... If we had the hash and there was a standard way of returning it, then we could do so.
But, as always, this would have to be done through a third-party - and since the hub would not like to be overloaded we'd have to specify a third-party protocol for an active client to act as a "gateway" between the two passive clients... I don't like it. Of course, if we want to be backwards-compatible, we can always support the sending of a private message containing a set string with the data. This would be evil, though, as it would, eventually, overload the hubs.
GargoyleMT wrote:Now, too much choice here would be a problem. There really are only 4 or 6 hashes that might be supported, but since people are going to be unhappy with the performance hit of hashing, I'd like to only calculate one, if possible... ??
As I said, calculating several hashes is not too hard. If we are going to support several hashes, we need to calculate them... just set the different hash methods to "on" by default, and people will never notice. :)
GargoyleMT wrote:Yes, definitely. Or... Oooh, Ocarina of Time on the Gamecube! Wheee!
Or rather, "Hmm... that online game looks really good... <chomp> Eeek! Where's my free time?!"
GargoyleMT wrote:And the tagline seems to fit pretty well - unsurprising if you pick them by hand, and surprising if it's a randomly picked one.
Heh, I always pick them by hand... I just let randomness control it every now and then.

Sarf
---
Pro'-gram 1) n. A magical spell cast over a computer which transforms user input into error messages. 2) vt. An activity similar to banging one's head against a wall, but with less opportunity for relief.

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-03-09 11:41

sarf wrote:I prefer to calculate as many hashes as necessary at the same time because the CPU hit is really irrelevant - it's the I/O that will take time.
I'm not so sure about this - it took about 10 hours for eDonkey to hash 280 GB of data on my VIA Eden 533 MHz, meaning a speed of about 8 MB/s... Then again, I could never get my server to read from those disks faster than ~15 MB/s (and the computer isn't really powerful either).
sarf wrote:Perhaps... but I would like to have the hashes put into file streams - those are associated with the file itself (and are independent of its location) and would make it unnecessary to recalculate the hash when a file is moved... this is not a big issue, but a thing that we could do with this is calculating hashes on downloading files and putting the data in a file stream, thus making it easy (= not frustrating for the user) to share downloaded files. This could be done with XML data too (and would have to be, since older file systems (FAT/FAT32, for example) do not support file streams). Like this:
IMO, it is enough to transmit only the master hash in the filelists, and to let the clients exchange partial hashes directly on connect. It would make the filelists smaller, and their processing would probably be faster.
sarf wrote:Well, the compression part is irrelevant, but DC++ could refuse to share NTFS encrypted files (which may be unusable to others)... and as said, store data about the file (such as metadata and hashes) in them to make it independent of where the file is stored.
The NTFS compression & encryption are transparent to programs - I have a number of compressed or encrypted directories on my disk, the programs work just fine with them. DC++ could refuse to share the files it doesn't have permission to read though (they don't have to be encrypted to be unreadable for a user).

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-10 10:30

ender wrote:I'm not so sure about this - it took about 10 hours for eDonkey to hash 280 GB of data on my VIA Eden 533 MHz, meaning a speed of about 8 MB/s... Then again, I could never get my server to read from those disks faster than ~15 MB/s (and the computer isn't really powerful either).
Well, I don't think calculating "yesterday's hashes" is too mean. The hashes should be computed once, and the hashing part may be done in the background (yes, we'll lose a few searches this way, but it will be more acceptable to the user).
Ehm... what I mean by this is that if DC++ supports the MD4/MD5/TTH/TTH-SHA1 hashes we're sure to get folks from other networks to go over to DC... as well as setting up a nice way to allow global searching.
ender wrote:IMO, it is enough to transmit only the master hash in the filelists, and to let the clients exchange partial hashes directly on connect. It would make the filelists smaller, and their processing would probably be faster.
Hmm... well, the reason I wanted all the data to be transferred in the filelist is that a) then people will start downloading the filelists of other people, which will hopefully lead to arne implementing a "search downloaded filelists for alternate sources", b) people would not waste bandwidth by sending $GetMetaInfo back and forth all the time, and c) with good compression, having a wee bit more data in the filelist will not matter... though when I think about it, the nodehashes do not have to be included in the filelist. Aw, what the heck, why not? It's not like you re-download the filelist every hour or anything.
ender wrote:The NTFS compression & encryption are transparent to programs - I have a number of compressed or encrypted directories on my disk, the programs work just fine with them. DC++ could refuse to share the files it doesn't have permission to read though (they don't have to be encrypted to be unreadable for a user).
Ummm... what's the reason to have encrypted files if any program can read them? Oh well, that's M$ thinking for you.

Sarf
---
HELP: The feature that assists in generating more questions. When the Help feature is used correctly, users are able to navigate through a series of Help screens and end up where they started from without learning a damn thing. - Computer Definitions

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-03-10 11:09

sarf wrote:Ummm... what's the reason to have encrypted files if any program can read them? Oh well, that's M$ thinking for you.
Well, if you have two or more administrative users, and want to prevent them from viewing your files...

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-10 13:32

ender wrote:
sarf wrote:Ummm... what's the reason to have encrypted files if any program can read them? Oh well, that's M$ thinking for you.
Well, if you have two or more administrative users, and want to prevent them from viewing your files...
Couldn't they just change each other's passwords and log in using the other account? Anyway, I think the biggest reason is to have the data encrypted on the disk, so that even if you boot a Linux floppy with NTFS reading abilities you won't be able to read the data, or if you steal the HDD. Also, reading encrypted files from a fileserver won't leak information to someone sniffing the network.

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-03-10 15:53

sandos wrote:Also, reading encrypted files from a fileserver won't leak information to someone sniffing the network.
Actually, it will, as the data is decrypted before being sent over the network (isn't Micro$oft great? :twisted:)

3ddA
Posts: 2
Joined: 2003-07-18 07:41

Hashing is great but it will not solve all the problems....

Post by 3ddA » 2003-07-19 03:30

Great that you have been discussing hashing and segmenting of files, which must be a part of DC soon, but there is one thing that still needs to be discussed in my opinion.

With hashing I can only verify that the individual file is correct/complete; the point of SFV is that I can verify that a group of files is correct/complete.

When you discuss hashing, I also think it should be taken into consideration that inventing or using any scheme other than CRC will lose one important point. The problem is when I do have the SFV file, but one file is missing. I want to take that CRC code and download the file. Using any other hash code will make this impossible and will force me to download a file with the same name (which maybe turns out to be the same song but not the same version that I wanted).

I hope you really can work this out; hashing would be a big step forward, and with some compatibility with SFV it could be great.

By the way, don't think so much about hashing performance - just make sure that the program can give a decent guesstimate of how long it will take and show it to the user with a progress bar. Later on, when we have hashing, we can discuss performance problems (any user with 100GB+ of files will understand that it takes time to hash it).

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Re: Hashing is great but it will not solve all the problems.

Post by sandos » 2003-07-19 06:46

3ddA wrote:When you discuss hashing, I also think it should be taken into consideration that inventing or using any scheme other than CRC will lose one important point. The problem is when I do have the SFV file, but one file is missing. I want to take that CRC code and download the file. Using any other hash code will make this impossible and will force me to download a file with the same name (which maybe turns out to be the same song but not the same version that I wanted).
The problem is basically that of keeping "collection" information. There are several ways to do this. 1) Archive all the files and send out the hash of that. You don't NEED to split files up into .rars anymore with hashes and multi-source parallel downloading. 2) Use collection files. The author of Shareaza said he would introduce collection files, basically XML files listing files and their hashes. This was some time ago, don't know what happened to it. 3) Use another collection format, Winamp playlists for example.

Da8add1e
Posts: 30
Joined: 2003-02-04 13:17
Location: Saddams Bunker :)

Post by Da8add1e » 2003-07-28 21:54

I think

1/ hashes should be on demand - lots of stuff I share never gets downloaded, due to the number of other people sharing the same file or other people's lack of good taste :P

2/ simple hashes should be good enough, as DC++ already has file size checking and files tend to come from the same origins; a good SHA1 system should be enough

3/ XML tables or good old txt files for the previously requested files would save them being done again, but that means the update anomaly has to be taken into consideration, and deliberate tampering with the XML/txt files and the client would be a problem; therefore plain txt hashes in an encrypted file or stored in memory would be the only way to secure it from the hax0r5

4/ just do it - the sooner it's done, the quicker we can have multiple sources. It's only a matter of time b4 the DC++ clones get multi-sourcing, and if they don't, someone will come up with another p2p that does. Yeh, DC++ is good, but that doesn't mean it's perfect - it WOULD be better with multi-sourcing

Done :)
Need NOT Greed (don't abuse poor countries)
Pay the Poor (increase minimum wage)
Tax the Rich (100% SuperTax rate)
(Do ya think thats maybe a little left-wing?)

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-07-29 19:28

Your items one and two don't really relate. Even on the fastest system, on-demand hashing isn't possible without lag (think CD-sized files on a non-Ultra-320/SATA drive). "On demand" doesn't make much sense if we work out how to return hashes with search results, as well.

The concept of "simple" hashes doesn't relate either - the implementation thus far uses a mathematically proven and pre-existing algorithm - the Tiger hash.

Be patient, please.

Multi-source is already in other DC clients (there are no DC++ clones as far as I've seen) - specifically the DCTC library under Linux (used in DCGUI-QT), and the DCPro client under Windows. Neither of those has as much "market share" (so to speak) as DC++, and neither has offered a solution for hashes and file integrity.

distiller
Posts: 66
Joined: 2003-01-05 18:05
Location: Sweden
Contact:

Post by distiller » 2003-08-02 21:49

Just wanted to tell you that I have started to read up on hashing and will do what I can to help you out with hashing and segmented downloads.

Most of the ideas/problems have been discussed in this thread already.

Let's make a summary and decide in what direction to go - if Arne doesn't want to make those decisions himself, of course. =)

cologic
Programmer
Posts: 337
Joined: 2003-01-06 13:32
Contact:

Post by cologic » 2003-08-03 00:57

I should probably mention now that I wrote tigertree hashing code for my DC++ mod several months ago and have been playing with it since. It correctly calculates hash trees to an arbitrary depth, stores them in a file for later retrieval, updates that file (albeit inefficiently), searches based on the resulting root hashes (and runs up against DC++'s incoming search result filtering: if the search term was a hash string, it will filter an apparently unrelated search result, even though it is in fact a correct search result. JT's id3 mod relies on a stupid parsing trick that shouldn't be relied upon in DC++, never mind other clients.), and can map hashes to filenames and vice versa using the /filetohash and /hashtofile commands.

The current problem I'm having is transferring hash trees: DC++ isn't equipped to use its file transfer mechanisms to send or receive anything except files, and using a set of plain files for a hash database would be silly, IMO.

This requires a different mechanism to transfer data, up to maybe 50kB practically. The most obvious choice is using the Base32 encoding that's apparently become standard for this sort of hash, bandwidth-inefficient as that is (the total amount of data is small, so a little inefficiency in the client-client connection doesn't particularly matter), and sending it as a standard client-client command, complete with the $ and | initiation and termination tokens.

The potential difficulty with this, which I haven't tested, is that different clients may not react to exceptionally large client commands gracefully.
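
For reference, the Base32 encoding mentioned above is small enough to sketch here (standard A-Z/2-7 alphabet; '=' padding is left out, since the hashes being discussed are a fixed size anyway):

Code: Select all

// Sketch: Base32-encode raw hash bytes (A-Z, 2-7 alphabet), without '=' padding,
// which is the usual way tree-hash roots are written out as text.
#include <cstdint>
#include <string>
#include <vector>

std::string base32Encode(const std::vector<uint8_t>& data) {
    static const char* alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";
    std::string out;
    uint32_t buffer = 0;
    int bits = 0;
    for (size_t i = 0; i < data.size(); ++i) {
        buffer = (buffer << 8) | data[i];     // push 8 bits in
        bits += 8;
        while (bits >= 5) {                   // pull 5-bit groups out
            bits -= 5;
            out += alphabet[(buffer >> bits) & 0x1F];
        }
    }
    if (bits > 0)                             // leftover bits, zero-filled on the right
        out += alphabet[(buffer << (5 - bits)) & 0x1F];
    return out;
}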

distiller
Posts: 66
Joined: 2003-01-05 18:05
Location: Sweden
Contact:

Post by distiller » 2003-08-15 04:30

I'd really love to take a look at your work so far, cologic - do you have a URL for the source?

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-08-17 08:48

http://utrum.dyndns.org:8000/

Look in related. I think this includes hashing stuff.

YaRi
Posts: 18
Joined: 2003-08-19 10:52

Post by YaRi » 2003-08-19 10:54

sandos wrote:http://utrum.dyndns.org:8000/

Look in related. I think this includes hashing stuff.
Too bad there isn't any documentation about the hash feature; it's just mentioned in the preferences.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-08-21 20:37

YaRi wrote:
sandos wrote:http://utrum.dyndns.org:8000/

Look in related. I think this includes hashing stuff.
Too bad there isn't any documentation about the hash feature; it's just mentioned in the preferences.
It doesn't, IMO, need documentation? Or do you mean the source code?

YaRi
Posts: 18
Joined: 2003-08-19 10:52

Post by YaRi » 2003-09-30 06:35

sandos wrote: It doesn't, IMO, need documentation? Or do you mean the source code?
So it has hashing, then what? It hashes all my shares?
Are these hashes used in any way? Where are they stored? Is multi-source downloading going to happen any time soon?
These are just some questions I'd like to have answered, please.

YaRi
Posts: 18
Joined: 2003-08-19 10:52

Post by YaRi » 2003-09-30 06:40

OK, I found some answers already and other topics about this subject.
Editing posts seems to be disabled :o

_Cennet_
Posts: 2
Joined: 2004-01-04 20:59

Hashing

Post by _Cennet_ » 2004-01-04 22:11

Ok, nice discussion you have going here! 8)

First of all, the discussion seems to fade when the amount of hashing is introduced. I think that it must be pointed out that there are two reasons for hashing the file. One has actually the same use as the CRC, and is applied to the whole file. This hash or CRC can then be used to locate a precise match (especially if it is paired with filesize).

The second reason for using hashing is to detect errors in the file transfer.

These two methods imply different approaches.

To locate the file through CRC or hash requires that all files are pre-hashed before a search arrives, and thus hashing needs to be performed at startup (or when files are added to the share).
It would be nice to have this "overall" hash in different formats, so DC++ users can benefit from sites that post hashes in different formats. I would call this a filehash because it is a hash for the entire file.

It is obvious that it is extremely annoying to download a large file, and then discover that the hash does not match. Thus the need for hash trees (or perhaps just a simple "partial-chunk-hash" list) comes into play. I would call this a detailhash.
The size of this detailhash should be very dependent on the filesize. IMO a depth of 8 seems fair enough, since a 700 MB file would consist of 2.7 MB segments. On the other hand, a 3 MB file should not be split up into 256 parts. So my idea is that the minimum size for a segment is 1 MB (i.e. fewer segments), and the maximum number of segments should be 256.
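
Restated as arithmetic (the constants here are only the suggestion above, not anything agreed upon):

Code: Select all

// Sketch of the segment-size rule suggested above: at most 256 segments,
// and never a segment smaller than 1 MB.
#include <cstdint>

uint64_t segmentSize(uint64_t fileSize) {
    const uint64_t MIN_SEGMENT  = 1 << 20;                         // 1 MB floor
    const uint64_t MAX_SEGMENTS = 256;
    uint64_t size = (fileSize + MAX_SEGMENTS - 1) / MAX_SEGMENTS;  // ceil(fileSize / 256)
    return size < MIN_SEGMENT ? MIN_SEGMENT : size;
}
// A 700 MB file -> ~2.7 MB segments (256 of them);
// a 3 MB file   -> 1 MB segments (3 of them), not 256 tiny parts.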

IMO it should NOT be possible to search for a partial chunk in DC++. I base this on the fact that slots are a limited resource, and if everyone opens connections for separate chunks, it will become very difficult to get a slot. Furthermore, AFAIK most users will have maxed out their line at 7-10 connections, and any further connections will only result in decreased performance.

Both hash types should be cached, to minimize load. I don't know exactly how you cache them if you want them to follow a moved/renamed file, but if you know the filehash for a file, another file with the same size and filehash will have the same detailhash.

From my experience, creating a MD5 digest from a 1GB file takes less than a second (on a 700 mhz laptop). I haven't tried anything with the trees.

So, basically I'm proposing:
Build (multiple-) filehashes on startup
Build detailhashes (i.e. trees) on demand


If it is certain that any client will ignore any command that it does not know, searching for a filehash is actually trivial, since a new command can be created.
Clients that do not support the command do not support the filehash either, so asking them for a filehash is a waste of time/bandwidth.
Also, getting the detailhash from a user can be implemented with a new command. Here it might come in handy to know which users responded to a search for a filehash, as these will know the detailhash also, and the user who has an open slot might not.


If a new command cannot be introduced, the following would be my opinion:

Searching for the filehash should be done by introducing a new file-type, as that is somewhat consistent with its current use (IMO), and compatible with the protocol (afaik). As mentioned earlier, I would not like to see segment downloading implemented as a feature in DC.

Retrieving the detailhash should be done like retrieving the file list (MyList.DcLst); a client would reply "file not found" if it does not support this feature.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Hashing

Post by GargoyleMT » 2004-01-05 09:31

_Cennet_ wrote:Ok, nice discussion you have going here! 8)
Thanks for reviving it. :)
_Cennet_ wrote:One has actually the same use as the CRC, and is applied to the whole file. This hash or CRC can then be used to locate a precise match (especially if it is paired with filesize).
The second reason for using hashing is to detect errors in the file transfer.
These two methods imply different approaches.
Perhaps. However, with a suitable algorithm, the root hash of the tree can be used as the unique file identifier.

A couple things have changed since my last post. Cologic has gone on to create a fairly complete hashing system in BCDC based on the Tiger (Tree) Hash algorithm. Reverse connect has started a system based on eDonkey2000 hashes. DC:Pro has adopted TTH hashing.
There have also been a few fairly complete specifications of possible successors to the DC protocol.
_Cennet_ wrote:To locate the file through CRC or hash requires that all files are pre-hashed before a search arrives, and thus hashing needs to be performed at startup (or when files are added to the share).
The current (or future) system need not block startup until file hashing is complete. Nothing (logically) should prevent a client from returning search results for non-hashed files (though logically it will also prefer search results with hashes over others).
_Cennet_ wrote:It would be nice to have this "overall" hash in different formats, so DC++ users can benefit from sites that post hashes in different formats. I would call this a filehash because it is a hash for the entire file.
This is certainly something I believed. You could either always search for the same hash (eg. sha1) that you have a direct link to, or you could do a search for the purpose of converting that into the native hash for the DC network (eg. tth). However, searches are not cheap for large hubs. Sticking to one format and making it part of the protocol will avoid traffic and make for a more pleasant entry point for new clients. (Even if it means not being able to use sites such as sharereactor.)
_Cennet_ wrote:IMO it should NOT be possible to search for a partial chunk in DC++. I base this on the fact that slots are a limited resource, and if everyone opens connections for separate chunks, it will become very difficult to get a slot.
Searching for a segment is a very specific case of Partial File Sharing (PFS as it's known in Gnutella). Agree upon a convention for indicating "partial file" in a search result, and agree on a client to client part exchange ( + perhaps same-hub source exchange), and you don't need the ugliness of searching for segments.

About slot usage: in my mind, I've always paired multisource downloading with a complex upload queue. I think the way the system currently works is quite archaic.
_Cennet_ wrote:Both hash types should be cached, to minimize load. I don't know exactly how you cache them if you want them to follow a moved/renamed file, but if you know the filehash for a file, another file with the same size and filehash will have the same detailhash.
True. BCDC uses QDBM as a database to store hashes, plus some necessary information for determining if the file has changed.
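
Not BCDC's actual schema - just a sketch of the idea: key the cached root hash on the path, and keep enough file information (size and last-write time) to notice when the entry has gone stale.

Code: Select all

// Sketch of a hash-cache entry and staleness test (not BCDC's real schema):
// the cached root hash is reused only while size and mtime still match.
#include <cstdint>
#include <ctime>
#include <map>
#include <string>

struct HashCacheEntry {
    std::string rootHash;    // e.g. Base32 TTH root
    int64_t     size;
    time_t      lastWrite;
};

// Returns the cached hash if the file looks unchanged, otherwise 0 (rehash needed).
const std::string* lookup(const std::map<std::string, HashCacheEntry>& cache,
                          const std::string& path, int64_t size, time_t lastWrite) {
    std::map<std::string, HashCacheEntry>::const_iterator i = cache.find(path);
    if (i == cache.end())
        return 0;
    if (i->second.size != size || i->second.lastWrite != lastWrite)
        return 0;                        // file changed on disk - recompute the hash
    return &i->second.rootHash;
}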
_Cennet_ wrote:From my experience, creating a MD5 digest from a 1GB file takes less than a second (on a 700 mhz laptop). I haven't tried anything with the trees.
That file must be cached in memory, otherwise you have a disk subsystem to be feared.
_Cennet_ wrote:If it is certain that any client will ignore any command that it does not know, searching for a filehash is actually trivial, since a new command can be created.
True, but hubs will not ignore any command they do not know, and they will not broadcast unknown commands. Searches are sent to the hubs, so the ability to create new client commands doesn't gain us anything.
_Cennet_ wrote:Clients that do not support the command do not support the filehash either, so asking them for a filehash is a waste of time/bandwidth.
Also, getting the detailhash from a user can be implemented with a new command. Here it might come in handy to know which users responded to a search for a filehash, as these will know the detailhash also, and the user who has an open slot might not.
BCDC advertises "TTH" in $Supports to indicate its hashing support. It returns root hashes in search results in the "Hub Name" field in the form: "TTH:<hash>". It supports exchange of the full tree in client to client with "$GetMeta TTH;<levels> <filename>"
_Cennet_ wrote:Searching for the filehash should be done by introducing a new file-type, as that is somewhat consistent with its current use (IMO), and compatible with the protocol (afaik). As mentioned earlier, I would not like to see segment downloading implemented as a feature in DC.
Searching by hash is enabled in BCDC with a data-type of 9 (folders is number 8).
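
Roughly, then, a hash search in that scheme is an ordinary $Search with data type 9 and "TTH:<root>" as the pattern. The exact field layout in the sketch below is from memory, so treat it as an approximation rather than a spec:

Code: Select all

// Sketch: build an active-mode NMDC hash search with data type 9 and a
// "TTH:<root>" pattern. The $Search field layout here is approximate.
#include <string>

std::string buildTthSearch(const std::string& hubAddr,    // searcher's "ip:port"
                           const std::string& tthRoot) {  // Base32-encoded root hash
    // F?F?0?  -> no size restriction; 9 -> hash search; pattern carries the hash
    return "$Search " + hubAddr + " F?F?0?9?TTH:" + tthRoot + "|";
}
// buildTthSearch("192.168.0.2:412", "PZ...") -> "$Search 192.168.0.2:412 F?F?0?9?TTH:PZ...|"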

Sidenote: eDonkey2000 hash support is probably not wise, because of its basis on the MD4 hash. Cologic has brought up numerous papers with attacks that can create a collision with that algorithm.

_Cennet_
Posts: 2
Joined: 2004-01-04 20:59

Post by _Cennet_ » 2004-01-07 18:34

Quotes are from GargoyleMT, can't get it to display properly :(
...with a suitable algorithm, the root hash of the tree can be used as the unique file identifier...
True. My suggestion was that these were unrelated, to avoid complex hashing at startup. However, if the main problem with hashing seems to be disk transfer speeds, that idea would actually worsen the situation, as files would potentially be read twice. Also, my idea only works if searching for segments is not a possibility.
The current (or future) system need not block startup until file hashing is complete. Nothing (logically) should prevent a client from returning search results for non-hashed files (though logically it will also prefer search results with hashes over others).
Of course. What I meant was that it would have to be hashed before any results with hash could be returned. My post was unclear, sorry.
I think the way the system currently works is quite archaic.
I can see your point. I just worry that downloading segments will cause the entire DC network to stall if people start downloading from many more sources than they do today. Today, a user downloading a file will use up one slot. If he can get a segment from four other users, he will use five in the proposed system. One could argue that he will finish his download faster, but I believe that many people will go a little over the edge and download from more sources than necessary. If you believe that hub operators will require more slots open, I see the system entering a state where transfer speeds are around the same as on eDonkey or Kazaa.
But then, that is a whole other discussion. I believe there was one about it on the old site?
hubs will not ignore any command they do not know
I see, I hadn't thought of that, but of course it does.
I guess that leaves only the option of filetype 9. Too bad, since this will probably increase hub load, when all users search twice, first for hash, then for file, in order to get the most results.
It supports exchange of the full tree
What algorithm does the BCDC use to build the tiger tree?

I suppose that one of eDonkey's hashes is the complete file hash? It would be nice to support just that one anyway. Do you have links to these papers? Any links to the alleged flaw in the sourceforge tiger tree would be nice also.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2004-01-07 22:28

_Cennet_ wrote:Quotes are from GargoyleMT, can't get it to display properly :(
Try Quote="username"
_Cennet_ wrote:However, if the main problem with hashing seems to be disk transfer speeds, that idea would actually worsen the situation, as files would potentially be read twice. Also, my idea only works if searching for segments is not a possibility.
Indeed, disk speeds are the real limitation currently. BCDC++ uses quite a bit of processing power during hashing, but it is still disk I/O bound.

Well, using two algorithms does make some kind of sense. In particular, if you want to use Magnet direct file links. There is a schema for SHA1 hashes, and a schema that uses bitprints - which contain both a SHA1 and TTH root hash. However, the SHA1 only one is far more common on the websites that offer such links.

I think that using the existing $Search facility, though novel (I hadn't thought about it), is particularly wasteful. I see you didn't respond to the alternative scheme, but it roughly parallels that in other P2P systems.
_Cennet_ wrote:Of course. What I meant was that it would have to be hashed before any results with hash could be returned. My post was unclear, sorry.
Oh, true. Some people seem to think hashing should/would have to be finished before search results could be returned (or before hubs could be joined). I just assumed you had the same thoughts.
_Cennet_ wrote:I can see your point. I just worry that downloading segments will cause the entire DC network to stall ... I see the system entering a state where transfer speeds are around the same as on eDonkey or Kazaa.

But then, that is a whole other discussion. I believe there was one about it on the old site?
Well, aren't transfer speeds on eDonkey and Kazaa both very good? Your concern about slots is... appreciated. I was never around for the old lichlord forums, but I think there's another thread in this one where Sarf had some good arguments about the usage of slots, and the potential benefit to the DC network of multi-source downloading. If you want to talk about that subject, let's find and revive that thread (as well). :mrgreen:
_Cennet_ wrote:I see, I hadn't thought of that, but of course it does.
I guess that leaves only the option of filetype 9. Too bad, since this will probably increase hub load, when all users search twice, first for hash, then for file, in order to get the most results.
I don't think that Ptokax (the strictest hub I know) parses $Searches for validity. I know DCH++ doesn't have a problem with Type 9 = Hash.

But until hashes are widespread, you do have the problem where you need to search for alternates the normal way (to get potentially compatible files) and also by hash (for definitely compatible files). After all, you cannot both include the Hash as the search term (and flag it with type 9) and include the normal search terms (plus the size and type constraints) and still get results back from legacy clients.
_Cennet_ wrote:What algorithm does the BCDC use to build the tiger tree?
Tiger is the name of the algorithm. One reference is in Bitzi.com's bitcollider project: tiger.c. The "Tree" part describes a Merkle Hash Tree. A little bit more information can be found in the THEX Draft (seemingly 404, hence my personal mirror).
_Cennet_ wrote:I suppose that one of eDonkey's hashes is the complete file hash? It would be nice to support just that one anyway.
No, it wouldn't be particularly nice. Come up with a hash exchange (in finding a tth root hash for a given ed2k hash) mechanism that isn't wasteful, and I'll reverse my position. The last thing the DC network needs is more hopeless searches, which are what supporting arbitrary (sig2dat, ed2k, md5, sha1) hashes would imply. There needs to be One True Hash for the whole network.
_Cennet_ wrote:Do you have links to these papers? Any links to the alleged flaw in the sourceforge tiger tree, would be nice also.
Here's a link to one of the papers on the MD4 algorithm collisions. There is no flaw in Tiger as of yet, though the construction of the original tiger hash trees was flawed by not properly weighting/shifting some of the leaves (more information on this can be found on the p2p-hackers mailing list from a while back - it's since been fixed in the THEX draft listed above.)

joakim_tosteberg
Forum Moderator
Posts: 587
Joined: 2003-05-07 02:38
Location: Sweden, Linkoping

Post by joakim_tosteberg » 2004-01-25 13:28

Do I get this wrong or has arne begun to implement hashing?
-- 0.307 --
* Fixed full name being displayed in user commands with submenus
* Bloom filters dramatically improve search efficiency (if you have a large share)
* Merkle trees and tiger hashing added
* Auto match results shown in status bar

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2004-01-25 15:43

joakim_tosteberg wrote:Am I getting this wrong, or has arne begun to implement hashing?
You read that correctly.

joakim_tosteberg
Forum Moderator
Posts: 587
Joined: 2003-05-07 02:38
Location: Sweden, Linkoping

Post by joakim_tosteberg » 2004-01-26 01:10

GargoyleMT wrote:
joakim_tosteberg wrote:Am I getting this wrong, or has arne begun to implement hashing?
You read that correctly.
:D

liny
Posts: 30
Joined: 2003-11-01 09:18

Post by liny » 2004-01-26 05:22

GargoyleMT wrote:
joakim_tosteberg wrote:Am I getting this wrong, or has arne begun to implement hashing?
You read that correctly.
Really good news.
Are there any tools that can generate a TTH for a file?

Todi
Forum Moderator
Posts: 699
Joined: 2003-03-04 12:16
Contact:

Post by Todi » 2004-01-26 21:14

liny wrote:Really good news.
Are there any tools that can generate a TTH for a file?
I went ahead and dug some out for you.

HashCalc seems to do the job nicely. Simple GUI and it works for a lot of different hashes.

There are also some command-line tools, which could be useful for making sites with hashes or for generating a hash-check file (for confirming file integrity, perhaps).

Fsum
ReHash

liny
Posts: 30
Joined: 2003-11-01 09:18

Post by liny » 2004-01-26 22:40

Thanks for your reply. These tools are very helpful.
They can compute a Tiger hash, but how do I get the Tiger tree hash?

IntraDream
Posts: 32
Joined: 2003-12-12 14:28
Location: FL,USA
Contact:

Re: Hashing

Post by IntraDream » 2004-01-27 07:31

I would like to know why Tiger was chosen? SHA1, from what I have read, can process hashes twice as fast as Tiger (on a 32-bit platform). THEX/SHA1 would be very secure for a hash tree and easier for me to add to my client :wink: . But as I'm sure has been pointed out, this tree structure won't stop "hollow eggs" (fake files), as it were, because the original source could have a corrupt hash (or a valid hash for a file of 3MB*chr(0) named The_File_You_Want.MP3). I think this will eventually become noticeable once multi-source becomes more prevalent and valid sources show a high number of matches for the same hash. Anyway, I think we should set a standard; I'm just not ready to vote for TTH just because someone has already added it to a client.

Tim-

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Re: Hashing

Post by sandos » 2004-01-29 11:14

IntraDream wrote:But as I'm sure has been pointed out, this tree structure won't stop "hollow eggs" (fake files), as it were, because the original source could have a corrupt hash (or a valid hash for a file of 3MB*chr(0) named The_File_You_Want.MP3)
Corrupt hashes don't happen in a good implementation; you check the hash as you download a new file. The problem, as I think you're saying, is that you need to somehow know which hashes to trust. I think that will mostly be done by looking at the number of sources (hopefully fake files will have fewer sources than the good ones), and secondly by clicking a link on a webpage that you trust.
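(A rough sketch of that check-as-you-download idea, assuming the downloader has already fetched the tree's leaf hashes from a hash-capable source. SHA1 again stands in for Tiger, and verify_segment/leaf_hashes are made-up names for illustration only.)

Code:

import hashlib

SEGMENT = 1024

def verify_segment(index, segment_bytes, leaf_hashes, algo="sha1"):
    # leaf_hashes would be fetched from a hash-capable source before the
    # download starts; a block is accepted only if its 0x00-prefixed hash
    # matches the corresponding leaf, otherwise the block is discarded and
    # the source can be marked bad.
    got = hashlib.new(algo, b"\x00" + segment_bytes).digest()
    return got == leaf_hashes[index]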

IntraDream
Posts: 32
Joined: 2003-12-12 14:28
Location: FL,USA
Contact:

Post by IntraDream » 2004-01-29 22:27

Yes, I agree that it likely won't be much of an issue after sources spread. But the question is still there: Tiger vs. SHA1. Is Tiger that much more secure? How much faster is Tiger on a 64-bit system, and how much slower on a 32-bit system?

joakim_tosteberg
Forum Moderator
Posts: 587
Joined: 2003-05-07 02:38
Location: Sweden, Linkoping

Post by joakim_tosteberg » 2004-01-30 10:50

So the next version will have TTH in $Supports?

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2004-01-31 17:40

liny wrote:Thanks for your reply. These tools are very helpful.
They can compute a Tiger hash, but how do I get the Tiger tree hash?
To test your implementation, you can check against the test vectors in the THEX draft.
Bitzi.com's bitcollider will generate TTHs of a file, in addition to other hashes.


IntraDream wrote:I would like to know why Tiger was chosen? SHA1 ...
One good reason, in my mind, is that the SHA1 hashes already out there are full-file hashes; making a SHA1 hash tree would muddle that namespace and perhaps cause confusion.
[10:28] <arnetheduck> hum...can't decide between sha1 trees and tiger trees...
[10:35] <GargoyleMT> sha1 trees are potentially confusing, since the sha1 hashes out there are full file, not merkle trees
[10:35] <GargoyleMT> the same cannot be said for tigers
[10:36] <cologic> Tiger tree roots can be searched for on bitzi/etc.
[10:36] <GargoyleMT> magnet links also have a namespace for tth hashes, but none for sha1 hash trees, as far as I know - http://magnet-uri.sourceforge.net/
If DC clients as a whole implement the same hashing scheme and include the ability to take magnet download links, sites will open to serve those links...
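(As a concrete illustration, such a link might be built as below. The urn:tree:tiger form is, as far as I know, the namespace referred to above for tiger tree roots; the hash, size and name here are placeholders.)

Code:

from urllib.parse import quote

def tth_magnet(tth_root, size_bytes, name):
    # xt = exact topic (the hash), xl = exact length, dn = display name.
    return (f"magnet:?xt=urn:tree:tiger:{tth_root}"
            f"&xl={size_bytes}&dn={quote(name)}")

print(tth_magnet("ABCDEF...", 12345678, "example file.mp3"))  # placeholders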
joakim_tosteberg wrote:So the next version will have TTH in $Supports?
Maybe. Right now, it hashes files, stores the hashes, and returns them in the hub name field of search results. The BCDC implementation does a bit more: it will pick up and download the root hash from a TTH-capable source and verify downloads against it.

I'm sure it will all be sorted out. It's nice to see liny interested, since his multi-source implementation seems to cause a bit of file corruption.

liny
Posts: 30
Joined: 2003-11-01 09:18

Post by liny » 2004-02-01 04:22

GargoyleMT wrote:The BCDC implementation does a bit more ... and verify downloads against it.

I'm sure it will all be sorted out. It's nice to see liny interested, since his multi-source implementation seems to cause a bit of file corruption.
Did you read the BCDC code? I've read it, and there is no code for verifying.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2004-02-01 11:59

liny wrote:Did you read the BCDC code? I've read it, and there is no code for verifying.
You branched off from BCDC too early to see that. There are only stubs in 0.306a as well. Sandos has been committing code, and verifying against hashes is in the current BCDC source.

IntraDream
Posts: 32
Joined: 2003-12-12 14:28
Location: FL,USA
Contact:

Post by IntraDream » 2004-02-01 18:29

GargoyleMT wrote:
IntraDream wrote:I would like to know why Tiger was chosen? SHA1 ...
One good reason, in my mind, is that the SHA1 hashes already out there are full-file hashes; making a SHA1 hash tree would muddle that namespace and perhaps cause confusion.
I don't see where the confusion is. You can also Tiger-hash an entire file. The Merkle tree spec says it can use any hash, and the explanation I read recommends SHA1 and uses it as an example. We don't have to say it's a SHA1 hash, just as we don't say TTH is Tiger; we could say DC uses STH.
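(A tiny sketch of that point: the tree construction does not care which hash is plugged in at the leaves and internal nodes. Tiger is not in Python's hashlib, so only algorithms hashlib knows about will run here.)

Code:

import hashlib

def leaf_hash(segment, algo):
    # The 0x00 leaf prefix is the same whichever hash is plugged in.
    return hashlib.new(algo, b"\x00" + segment).digest()

print(leaf_hash(b"same 1 KiB segment", "sha1").hex())
print(leaf_hash(b"same 1 KiB segment", "md5").hex())  # any hashlib algorithm works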

cologic
Programmer
Posts: 337
Joined: 2003-01-06 13:32
Contact:

Post by cologic » 2004-02-02 04:42

You can also Tiger-hash an entire file.
I'm curious whether you can find an example of this in the wild. SHA1, by contrast, is very common in this role.

IntraDream
Posts: 32
Joined: 2003-12-12 14:28
Location: FL,USA
Contact:

Post by IntraDream » 2004-02-02 21:12

cologic wrote:
You can also Tiger-hash an entire file.
I'm curious whether you can find an example of this in the wild. SHA1, by contrast, is very common in this role.
What do you mean, where can you find an example? Tiger was not specifically designed for tree hashes; the fact that it is used for THEX by various programs means nothing. http://www.cs.technion.ac.il/~biham/Reports/Tiger/ doesn't mention anything about a tree hash algorithm, and its comparisons to MD5 are, I assume, based on complete files. And yes, SHA1 is very common for hashing because it is secure and fast. So the question is: is Tiger's security/speed better than SHA1's? I believe the hash is more secure, but is it worth it if it's 30% slower, or even 10% slower? Whether or not it is slower I don't know; I have not run any tests. From what I've read it is slower, but I would like to see some comparisons.

I think it's foolish not to use SHA1 for THEX just because SHA1 is a hash known for full files.
