Re: [dcdev] Searching
Todd Pederzani
2004-01-16 4:21
Direct Connect developers

Carl-Adam Brengesjö wrote:

why lock it to a single hash algorithm?

Because searches are expensive.  Because hashing is expensive.  Because some algorithms (uuhash, crc32, compound md4) have weaknesses that fundamentally undermine their purpose (uniquely identifying files).  Because some algorithms don't lend themselves to incremental

you can't know the hash of the file you're looking for unless you have downloaded a *.sfv (what algorithm? is it called sfv, or?) or MD5 (or similar) file telling it.

MD5 and CRC32 are crude.  If we support a single hash, as three clients already do, sites such as ShareReactor, ShareLive, or FileNexus will pop up with direct download links for the DC-supported hash, allowing our users to start a download from a hash.

Here's a rough overview of how BCDC does hashing (some of these steps are only visible when it's advertising itself as BCDC):  It crawls your entire share in a low-priority thread (it runs while the client is running and doesn't block startup), hashing files and adding the full hash tree to a database.  When it returns a search result, it replaces the hub name field with TTH:<hash>.  When connecting to another client, it includes TTH in its $Supports list; when a fellow TTH-supporting source is found (when downloading files in the user's queue), it gets the full hash tree (once) using a new client-to-client command: $GetMeta.  This tree can be used to verify both segments and the whole file.

If I seem brief... it's because I am.  We've had plenty of excellent discussion of this on the DC Dev public and private hubs, as well as on the DC++ forum (http://dcplusplus.sourceforge.net/forum/viewtopic.php?t=277) over the course of the last year.  Tiger Tree Hashes were chosen for some unique properties of the root file hash and of the tree of segment hashes.  This is really a Good Way(tm) to do hashing... otherwise I wouldn't bother bringing it up.

- Todd
DC Developers mailinglist