Adding file hashes to the protocol

Technical discussion about the NMDC and ADC (http://dcpp.net/ADC.html) protocols. The NMDC protocol is documented in the wiki (http://dcpp.net/wiki/), so feel free to refer to it.


Gumboot
Posts: 22
Joined: 2003-01-15 20:08
Location: New Zealand

Adding file hashes to the protocol

Post by Gumboot » 2003-01-16 22:48

OK, I think most people here agree we should add file hashing to the DC protocol, especially given that the vast majority of p2p systems have it already.

Note that none of the proposed solutions require any changes to the hub. We could do a lot better if we mandated changes to the hub, but for practical reasons I think it is best to continue mangling the DC1 protocol until it does what we want. ;)

First of all, why is file hashing desirable?
* To determine whether different search results refer to the same file. No more adding sources manually, one by one.
* To search for more sources to a known file (even sources that have been renamed) with less bandwidth usage (no extraneous results) and less processor usage (easier for clients to search for a file given a hash rather than a search string).
* Enables validation of the file (or at least the part that was hashed). Clients could automatically redownload if verification fails. A tree hash is even better for this; clients could verify parts of the file and only redownload the parts that are corrupt.
* Enables the possibility of sharing incomplete downloads. The client pretends the file is complete (returns the hash and size of the complete file) but then indicates which parts are missing.

Here is what has been done already in dctc:

Code: Select all

DC protocol extension. 
---------------------- 

* Search by content 
  ----------------- 

To be more efficient while searching, the search by content has been added. 

2 DC commands have been modified: 
$Search (to start a search) 
and 
$SR (the search result) 

$Search change: 
=============== 

a normal $Search command is like this: 
$Search sender a?b?c?d?e 
where sender is "Hub:nick" for passive search or "ip:port" for active search. 
a is F if size doesn't matter, else T 
b is F if size is "at least", else T (at most) 
c is the size in bytes 
d is data type: 1=any,2=audio,3=compressed,4=document,5=exe,6=picture,7=videos, 
                8=folder 
and e is the pattern to find 

To remain compatible with DC, it is not possible to add a new field. I have 
modified the 'a' field (the wanted size) 

now, the command is like this: 
$Search sender a.md?b?c?d?e 

where md is a MD5SUM (see below). 

$SR change: 
=============== 

a normal $SR command is like this: 
$SR sender a\5b c\5d (e) 
(\5 is the character '\005') 
where sender is the nickname of the person sending this reply. 
a is the found filename. 
b is the filesize 
c is the slot ratio (freeslot/total slot) 
d is the hubname 
e is the hub address. 

To remain compatible with DC, it is not possible to add a new field. I have 
modified the 'b' field (the filesize) 

now, the command is like this: 
$SR sender a\5b.md c\5d (e) 

where md is a MD5SUM (see below). 

Note on compatibility: for some strange reason, there is no warning on the hub when 
the modified $Search is sent, but there is a warning when the $SR is sent (only 
in passive mode, because active mode doesn't use the hub). It is only a warning 
and the extension works fine. 

-------------------------------------------------------------------------------- 
MD5SUM computation. 

the MD5SUM is computed over the first 4 KBytes of the file using the standard 
MD5 algorithm (see the md.c file of DCTC). The algorithm produces a non-printable 
16-byte string. To be able to send it using the DC protocol, the string is 
rewritten: the 16-byte non-printable string is converted into a 48-byte 
printable string using the following very simple rule. Each byte of the MD5sum is 
written as a 3-character decimal string. Ex: 
254 is written "254", 136 becomes "136", 28 becomes "028". 

This is not the most efficient way of storing the value, but using this, the 
filesize (c) and the encoded MD5sum (md) look like a float (c.md). Ex: 
35345.111222333444555666777888999000111222333444555666 
35345 is the filesize and the part after the dot is the md5sum (a bad example, 
since it is not possible to encode 333 ... 999 into a byte :) ). Thus, even if the 
program expects a number, there will be no error.
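
For illustration, the encoding boils down to something like this (a rough sketch, not dctc's actual md.c code):

Code: Select all

#include <cstdio>
#include <string>

// Encode a 16-byte MD5 digest as 48 printable characters, three decimal
// digits per byte, zero-padded (e.g. 28 -> "028"), as described above.
std::string encodeDigest(const unsigned char digest[16]) {
    std::string out;
    char buf[4];
    for (int i = 0; i < 16; i++) {
        std::sprintf(buf, "%03u", (unsigned)digest[i]);
        out += buf;
    }
    return out;
}

int main() {
    unsigned char d[16] = { 254, 136, 28 };   // remaining bytes are zero
    // The field then becomes "<filesize>.<encoded digest>", which still
    // parses as a number: e.g. "35345." followed by the 48 digits.
    std::printf("35345.%s\n", encodeDigest(d).c_str());
    return 0;
}
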
Personally, I dislike this scheme for three reasons. First, MD5 is old and considered insecure; I would prefer SHA-1 instead. Second, the way the hash is encoded makes the search string very long. Since $Search contributes heavily to hub bandwidth usage, we should keep this command as terse as possible. Third, it doesn't allow us to encode any other information into the $Search and $SR commands (like meta-data, for instance). So, I propose the following syntax:

$Search sender a?b?c?d?z
$SR sender z\5b c\5d (e)

where z is a string like the following:

<sha1>base64 encoded hash<author>name of author<c:\file.txt

Note 1: To parse this string, just split at the less-than symbol '<'. Then you should get a set of properties like the following: "sha1>aabbcc", "author>John Doe", "c:\file.txt". (See the parsing sketch after these notes.)
Note 2: If a property doesn't have a greater-than symbol (>) then it has the same meaning as in the original spec, i.e. a search pattern for $Search and a file path for $SR.
Note 3: For the $SR command, the file path must be last, so that old clients can extract the file name successfully.
Note 4: The extra information will show up in the Path column for older clients.
Note 5: Older clients will try to download files using the above string (hopefully). New clients should be able to extract the path from this string.
Note 6: A SHA-1 hash is 20 bytes long, which is 27 base64-encoded characters.
Note 7: This scheme uses the following special symbols: '<', '>', '+' and '/'. Can someone confirm that these characters are safe to use for both $Search and $SR?
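
A rough sketch of how a client might parse that string (the helper name is made up; this isn't from any existing client):

Code: Select all

#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Split the proposed string at '<'. Tokens containing '>' are key/value
// properties; a token without '>' keeps its original meaning (search
// pattern for $Search, file path for $SR).
void parseProperties(const std::string& z,
                     std::map<std::string, std::string>& props,
                     std::string& plain) {
    std::stringstream ss(z);
    std::string token;
    while (std::getline(ss, token, '<')) {
        if (token.empty())
            continue;
        std::string::size_type gt = token.find('>');
        if (gt == std::string::npos)
            plain = token;                                 // pattern or path
        else
            props[token.substr(0, gt)] = token.substr(gt + 1);
    }
}

int main() {
    std::map<std::string, std::string> props;
    std::string path;
    parseProperties("<sha1>aabbcc<author>John Doe<c:\\file.txt", props, path);
    std::cout << props["sha1"] << " / " << props["author"] << " / " << path << "\n";
    return 0;
}
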

To calculate the hash, I propose the following. First, split the file to be hashed into 1MB blocks (THEX suggests 1kB, but I'm worried about the bandwidth/storage requirements of such a fine granularity) and hash them individually using SHA-1. Then combine the first and second hashes and hash the result; then combine the third and fourth and hash them; and so on. Once all the hashes have been combined and hashed, repeat the process on the resulting level, again and again, until only one hash remains. That is the hash for the entire file (see the sketch below).

See: http://open-content.net/specs/draft-jch ... ex-01.html

This tree hash can be used to verify any part of the file, with a granularity of 1MB (since that was the initial block size). We would need to define a client->client extension so that a client can download the entire tree hash.
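
In code, the tree computation above might look something like this (a sketch only, assuming OpenSSL's SHA-1 is available; it follows my description rather than THEX exactly, and assumes at least one block):

Code: Select all

#include <openssl/sha.h>
#include <cstring>
#include <vector>

typedef std::vector<unsigned char> Digest;    // 20 bytes for SHA-1

// 'level' starts out as the SHA-1 digest of each 1MB block, in file order.
// Pairs of digests are concatenated and hashed; an odd digest at the end
// of a level is carried up unchanged (one possible convention). Repeat
// until a single root digest - the hash for the whole file - remains.
Digest hashTree(std::vector<Digest> level) {
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2) {
            unsigned char buf[2 * SHA_DIGEST_LENGTH], out[SHA_DIGEST_LENGTH];
            std::memcpy(buf, &level[i][0], SHA_DIGEST_LENGTH);
            std::memcpy(buf + SHA_DIGEST_LENGTH, &level[i + 1][0], SHA_DIGEST_LENGTH);
            SHA1(buf, sizeof(buf), out);
            next.push_back(Digest(out, out + SHA_DIGEST_LENGTH));
        }
        if (level.size() % 2)
            next.push_back(level.back());     // odd one out, carried up
        level = next;
    }
    return level.front();
}

With the tree kept around, a client can verify any single 1MB block against the root without rehashing the whole file.
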

What do you guys think?

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-01-17 03:46

do-you-know-how-long-it-takes-to-get-a-hash-of-more-than-100000-files-ly'ers?
Everyone is supposed to download from the hubs, - I don´t know why, but I never do anymore.

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-01-17 04:06

ivulfusbar wrote:do-you-know-how-long-it-takes-to-get-a-hash-of-more-than-100000-files-ly'ers?
eDonkey on linux, VIA Eden 533 MHz machine, 240 GB of data shared, it takes a little less than two hours to generate the hashes for the first time. When it's run the next time, the files are checked in around 20 minutes (the first time not only MD4 hashes for the complete file are generated, but also hashes for each 9.4 MB block of the file; the next time just the complete-file hashes, which are then compared to cached hashes).
Anyway, IMO it would be best to generate the hashes once and store them somewhere, together with some other information about the file (size, modification date); the next time DC is run, only check whether the modification date and/or size have changed, and if they haven't, use the cached hash... This would make the system much faster...
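
Something like this, perhaps (a rough sketch using POSIX stat(); the cache structure itself is made up):

Code: Select all

#include <sys/types.h>
#include <sys/stat.h>
#include <map>
#include <string>

struct CachedHash {
    off_t  size;
    time_t mtime;
    std::string hash;     // whatever digest format the client settles on
};

// Return the cached hash if the file still has the same size and
// modification time; otherwise signal that it needs rehashing.
bool lookupHash(const std::map<std::string, CachedHash>& cache,
                const std::string& path, std::string& hashOut) {
    struct stat st;
    if (stat(path.c_str(), &st) != 0)
        return false;                                  // file is gone
    std::map<std::string, CachedHash>::const_iterator it = cache.find(path);
    if (it == cache.end())
        return false;                                  // never hashed
    if (it->second.size != st.st_size || it->second.mtime != st.st_mtime)
        return false;                                  // changed on disk, rehash
    hashOut = it->second.hash;
    return true;
}
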

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-01-17 04:21

ender: yes, the client needs to save hashes in some way.

next problem: then the client needs to save a map between a hash and what the file is called on DC. If you store this in memory I wouldn't be surprised if you doubled the amount of memory consumed by my DC++ client (if you save it in a similar way to how the sharelist is done now).

my point:
In mature and well-managed hubs, 90% of all shares are original releases with full .nfo and .sfv files and no renaming of releases. In such hubs you don't have any need for hashing.

don't-get-me-wrong-hashes-is-very-helpfull-but-will-it-be-worth-the-pain-of-all-the-drawbacks-or-can-we-impliment-it-nicely-so-that-it-wont-be-a-pain-ly'ers?
Everyone is supposed to download from the hubs, - I don´t know why, but I never do anymore.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-17 04:22

ender wrote:
ivulfusbar wrote:do-you-know-how-long-it-takes-to-get-a-hash-of-more-than-100000-files-ly'ers?
eDonkey on linux, VIA Eden 533 MHz machine, 240 GB of data shared, it takes a little less than two hours to generate the hashes for the first time. When it's run the next time, the files are checked in around 20 minutes (the first time not only MD4 hashes for the complete file are generated, but also hashes for each 9.4 MB block of the file; the next time just the complete-file hashes, which are then compared to cached hashes).
Anyway, IMO it would be best to generate the hashes once and store them somewhere, together with some other information about the file (size, modification date); the next time DC is run, only check whether the modification date and/or size have changed, and if they haven't, use the cached hash... This would make the system much faster...
I agree that it's unnecessary to check the hash every time we start up. The user will be able to fool the program anyway (edit the file after that startup check), and if we just implement double-checking at the other end, once the file has been downloaded, then erroneous hashes will not be long-lived, depending on how reports of bad hashes are handled.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-17 04:28

Also, IMHO, we should just make a new search command including both hashes and metadata, much like what's in Luke-Jr's proposal, whatever that is called, even though I think the command could be made simpler than his version. (I haven't checked back on his drafts though, just doing this from my very vague recollection of it.)

What purpose is there in letting the NMDC client parse/understand hash search results?

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-17 04:36

ivulfusbar wrote:ender: yes, the client needs to save hashes in some way.

next problem: then the client needs to save a map between a hash and what the file is called on DC. If you store this in memory I wouldn't be surprised if you doubled the amount of memory consumed by my DC++ client (if you save it in a similar way to how the sharelist is done now).

my point:
In mature and well-managed hubs, 90% of all shares are original releases with full .nfo and .sfv files and no renaming of releases. In such hubs you don't have any need for hashing.

don't-get-me-wrong-hashes-is-very-helpfull-but-will-it-be-worth-the-pain-of-all-the-drawbacks-or-can-we-impliment-it-nicely-so-that-it-wont-be-a-pain-ly'ers?
The problem with this approach is that it relies on people being non-lazy enough to retain the original release and not just share the unpacked .avi. This, as most of us know, does not work. People are lazy, bad, stupid, etc. Don't trust people.

Sure, in well-behaved hubs most may be kind enough to keep original releases. I know I don't.

A hashing system is not strictly necessary, but I think it will help a lot.


Overhead of hashing? I'm not sure about this. It sure will use up resources, but with a little smarts it should be very possible even for large shares. How? Don't ask me ;)

aDe
Forum Moderator
Posts: 138
Joined: 2003-01-07 09:14
Location: SE
Contact:

Post by aDe » 2003-01-17 04:59

I'm not very interested in hashing, either..
I've been using DC for a little over a year now, and sure, sometimes you get a bad download; so what, get it again (and perhaps send the user a PM).

Corrupted downloads are not a problem for me. Maybe if I was on a slower connection, I'd care about it.

Currently I have a 0.5 Mbit connection, which is fairly fast, but there are lots faster :)

If I'm downloading something important (something bigger than 100 MB) then I might put a little effort into making sure I get a reasonable user with decent speed so that I don't risk him taking off. But even when they do - I usually write down the byte size and just find a user with the same filesize, and there is no problem..

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-17 05:04

aDe wrote:I'm not very interested in hashing, either..
I've been using DC for a little over a year now, and sure, sometimes you get a bad download; so what, get it again (and perhaps send the user a PM).

Corrupted downloads are not a problem for me. Maybe if I was on a slower connection, I'd care about it.

Currently I have a 0.5 Mbit connection, which is fairly fast, but there are lots faster :)

If I'm downloading something important (something bigger than 100 MB) then I might put a little effort into making sure I get a reasonable user with decent speed so that I don't risk him taking off. But even when they do - I usually write down the byte size and just find a user with the same filesize, and there is no problem..
Write down? I thought DC++ did that for you (when using alternate source finding) :)

aDe
Forum Moderator
Posts: 138
Joined: 2003-01-07 09:14
Location: SE
Contact:

Post by aDe » 2003-01-17 05:08

Hehe.... maybe it does. Maybe i don't always use DC++ :)

Gumboot
Posts: 22
Joined: 2003-01-15 20:08
Location: New Zealand

Post by Gumboot » 2003-01-17 05:18

IMHO, if we ever want the ease-of-use features of, say, KaZaA in DC, we need to implement file hashes. The following features spring to mind:
* Automatic grouping of search results.
* Automatic finding of new sources.
* Working hyperlinks to DC files (rather than just hubs).
* Automatic redownloading of corrupt files.

Sure, you can do most of these now, but they either require manual steps, or don't always work. It all adds up to a nicer experience.

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-01-17 05:24

ivulfusbar wrote:In mature and well-managed hubs, 90% of all shares are original releases with full .nfo and .sfv files and no renaming of releases. In such hubs you don't have any need for hashing.
You've never visited an anime hub, have you? You'll never find SFV checks there, and with popular files you can expect as many as 20 variants of the file name...

0D0A
Posts: 2
Joined: 2003-01-14 20:36
Location: Alabama

What about a simple CRC

Post by 0D0A » 2003-01-17 12:07

I'm not sure what a "hash" is. It sounds like something based on a lossy compression scheme which would result in something about 1/10 the original size. Even if it's not, anything that takes a couple of hours to do is prohibitive.

But what about a simple 32-bit, 64-bit, or 128-bit CRC value? Then you could be relatively (99.999 percent) sure that "Something Cool.Zip" and "Something I made cool.zip" are the same file. It is also good for validating that you have the same file as the person you downloaded from (in other words, their copy is bad too if the CRCs match). Plus you don't HAVE to implement it on the server side; if calculation, storage, or bandwidth is a problem then only let the user request it in client-to-client mode and calculate it on the fly. Advanced clients will probably work out a way to store it for future use, but they would not be required to.

It might look like this:

$GetCRC <FileName>
$CRC <value> <FileName>

where value is a 32-, 64-, or 128-bit (or whatever) number converted to base 64 (0-9, A-Z, a-z, @, #) for transmission, and might look like this: AdH5@OIu99
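
For what it's worth, encoding a value with that alphabet only takes a few lines (a sketch; the exact character order is just my guess):

Code: Select all

#include <iostream>
#include <string>

// Encode an unsigned value, 6 bits at a time, using the 64-character
// alphabet suggested above (the ordering of the characters is arbitrary).
static const char ALPHABET[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@#";

std::string encodeValue(unsigned long long value) {
    std::string out;
    do {
        out.insert(out.begin(), ALPHABET[value & 63]);
        value >>= 6;
    } while (value != 0);
    return out;
}

int main() {
    // e.g. a 32-bit CRC sent back in the proposed $CRC command
    std::cout << "$CRC " << encodeValue(0xDEADBEEFUL) << " Something Cool.zip\n";
    return 0;
}
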

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Re: What about a simple CRC

Post by sandos » 2003-01-17 12:54

0D0A wrote:But what about a simple 32-bit, 64-bit, or 128-bit CRC value?
They are pretty much the same as hashes, differing mostly in the number of bits and in cryptographic strength, AFAIK. Hashes are generally slower, though not that slow. Disk I/O will still be a bit of a bottleneck, just as with CRCs; reading gigs of data does take time.

Advantages over crcs are probably
1) Impossible/hard to forge a file to have a particular hash
2) Less chance of collisions (???).
3) More commonly used in other p2p systems.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-19 10:17

Just wanted to bring this up; it might be good to know when using Tiger hashes for a tree hash:

http://www.shareaza.com/forum/viewthread.aspx?ID=4261

So care must be taken when implementing TigerTree, and the correct version must be chosen as well.

yilard
Posts: 66
Joined: 2003-01-11 06:04
Location: Slovakia

Post by yilard » 2003-01-21 06:35

Hmm, I don't want any program to go through my 130 GB share every 20 minutes :( It sounds like my hdd will be dead soon.

And please try to calculate how much time rehashing 100 GB would take.

In kazaa it is quite possible because 90% of users are just leechers who share nothing so they are not impacted. Other servers are rarely updated.
In the age of super-boredom/hype and mediocrity/celebrate relentlessness/menace to society --KMFDM

Gumboot
Posts: 22
Joined: 2003-01-15 20:08
Location: New Zealand

Post by Gumboot » 2003-01-21 07:09

Well, it would take a few hours, but since hashes would be stored (either in a database, or in an NTFS alternate data stream) they would only need to be calculated once. I don't see why you would ever need to rehash all your shared files.
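
For the alternate data stream route, the stream is simply opened as "filename:streamname" (a Win32 sketch; the ":DCHash" stream name is made up):

Code: Select all

#include <windows.h>
#include <cstring>
#include <string>

// Store the hash in an NTFS alternate data stream attached to the shared
// file, so it only has to be computed once and stays with the file.
bool storeHash(const char* path, const char* hash) {
    std::string streamPath = std::string(path) + ":DCHash";
    HANDLE h = CreateFileA(streamPath.c_str(), GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return false;
    DWORD written = 0;
    BOOL ok = WriteFile(h, hash, (DWORD)std::strlen(hash), &written, NULL);
    CloseHandle(h);
    return ok && written == std::strlen(hash);
}

One caveat: the stream only survives on NTFS volumes, so a database fallback would still be needed.
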

eHast
Posts: 18
Joined: 2003-01-09 02:36
Location: Lund, Sweden

Post by eHast » 2003-01-24 18:35

TigerTree is essentially a special case of tree hashes (which can use any hashing algorithm). So you could implement tree hashes and then just "throw away" some information and you've got a TT (if you use the Tiger hashing algo).

BTW the Tree Hash Exchange protocol is described here: http://open-content.net/specs/draft-jch ... ex-01.html

Neg
Posts: 20
Joined: 2003-01-19 07:05

Post by Neg » 2003-01-24 18:45

Is there really a need to provide the hash value in the search? Is it not easier to transfer the hash in the client-to-client protocol? Just before a $Send the serving client sends a $Hash or such, so the receiver knows it resumes from a good source.

This way no one needs to make a complete hash list of all the files in their share. It will be generated the first time someone wants the file, and can then be saved in memory. This way we limit CPU usage to the attractive files in the share. We also minimize the memory usage increase by storing only the attractive hashes. And we know that the hashes will always be up to date.

The user never needs to see the hash value; he or she knows the source is good if the download starts. If it's a bad source we just pop up an error message telling the user so.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-25 04:16

A TTH (TigerTreeHash) implementation is available at http://sourceforge.net/projects/tigertree/, but the problem is that it has a security issue, so we do not want to use it as it stands today. I was going to give a link to an explanation, but the server is down so I'll get back with that.

yilard
Posts: 66
Joined: 2003-01-11 06:04
Location: Slovakia

Post by yilard » 2003-01-26 17:41

Automatic source finding using hashes is nice.

But I can't help myself, I always think about how the system can be abused :evil: There is a problem associated with hashing: cheaters could forge their hashes intentionally to avoid being chosen as an alternate source, and thus reduce their outgoing traffic.
In the age of super-boredom/hype and mediocrity/celebrate relentlessness/menace to society --KMFDM

Gumboot
Posts: 22
Joined: 2003-01-15 20:08
Location: New Zealand

Post by Gumboot » 2003-01-26 18:35

Neg wrote:Is there really a need to provide the hash value in the search? Is it not easier to transfer the hash in the client-to-client protocol?
If clients don't send the hash value in search results then there is no (practical) way to automatically group search results, ala KaZaA, eDonkey, etc. This is a desirable feature, IMHO.
yilard wrote:But I can't help myself, I always think about how the system can be abused There is a problem associated with hashing: cheaters could forge their hashes intentionally to avoid being chosen as an alternate source, and thus reduce their outgoing traffic.
It's probably just as easy to alter DC++ so that it doesn't return any search results at all. What you describe is not a new problem. The *only* way to prevent leechers is to implement some sort of tit-for-tat/ratio system (like BitTorrent does).

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-01-27 01:48

Gumboot wrote:
yilard wrote:But I can't help myself, I always think about how the system can be abused There is a problem associated with hashing: cheaters could forge their hashes intentionally to avoid being chosen as an alternate source, and thus reduce their outgoing traffic.
It's probably just as easy to alter DC++ so that it doesn't return any search results at all. What you describe is not a new problem. The *only* way to prevent leechers is to implement some sort of tit-for-tat/ratio system (like BitTorrent does).
Besides, hashes are to be checked anyhow when downloading files. I think it will be quite obvious if someone forges their hashes, since every single file from them will have hash errors.

zmack
Posts: 6
Joined: 2003-01-29 03:17
Location: Bucharest, Romania
Contact:

Post by zmack » 2003-02-03 05:44

If hashes are used, it would probably be best if they were implemented for files larger than a certain size (let's say.. 100 megs) because... well, I don't care about the authenticity of some 4-meg MP3 or some doc or smth. However, the size of some shares (100 gigs+) would cause a bit of a slowdown in DC++ startup and... well, just about everything, if you want to rebuild the hash list every 20 minutes. It would be much easier to just store the hashes, filesizes and filenames and just check the filenames and filesizes from time to time. If they differ, rebuild the hash. This would indeed be easy to forge, but... what's the use of sharing a file that you can't even use (rarely have I seen .avi's of identical filesize and different content)? The thing that I haven't really thought through is how to store the hashes and filesizes so that they can't be modified by a 'malicious user' :P.. but I'm guessing that some XML and PGP or RSA would very likely do the trick.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-02-03 08:40

zmack wrote:If hashes are used, it would probably be best if they were implemented for files larger than a certain size (let's say.. 100 megs) because... well, I don't care about the authenticity of some 4-meg MP3 or some doc or smth. However, the size of some shares (100 gigs+) would cause a bit of a slowdown in DC++ startup and... well, just about everything, if you want to rebuild the hash list every 20 minutes. It would be much easier to just store the hashes, filesizes and filenames and just check the filenames and filesizes from time to time. If they differ, rebuild the hash. This would indeed be easy to forge, but... what's the use of sharing a file that you can't even use (rarely have I seen .avi's of identical filesize and different content)? The thing that I haven't really thought through is how to store the hashes and filesizes so that they can't be modified by a 'malicious user' :P.. but I'm guessing that some XML and PGP or RSA would very likely do the trick.
We can't ever protect others from malicious users, so we won't even try. We'll just save the timestamps and rehash modified/new files as needed.

volkris
Posts: 121
Joined: 2003-02-02 18:07
Contact:

Post by volkris » 2003-02-07 15:01

Have you looked into Tiger Tree (or something like that) hashes? From what I understand they have a built-in segmentation kind of thing.

Look it up. When we developed the hash idea fully last time, a long, long time ago, someone brought up the idea of the Tiger hash and it sounded great.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-02-07 16:13

volkris wrote:Have you looked into Tiger Tree (or something like that) hashes? From what I understand they have a built-in segmentation kind of thing.

Look it up. When we developed the hash idea fully last time, a long, long time ago, someone brought up the idea of the Tiger hash and it sounded great.
TTH is great, but the algorithm is in a state of flux right now. A new version to disambiguate between file hashes and leaves is coming up; the question is when:

http://bitzi.com/bboard/message?message ... um_id=4076

It is trivial to make up hash-collisions with the old algorithm.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-02-12 12:18

zmack wrote:if hashes are used, it would probably be best if they were implemented for files larger than a certain size ( let's say.. 100 megs ) because... well, i don't care about the authenticity of some 4-meg mp3 or some doc or smth.
I think that limiting hashing to files above a certain size doesn't really gain anything. People (such as myself) who do use DC++ primarily for small files - mp3s - do care about file integrity. LAME even inserts its own internal music CRC into an MPEG header, so MP3 integrity is important to some people at least. :-D

Is anyone a user of Shareaza? It doesn't seem to hash your file library when it starts, it might only start hashing files when a search result matches them.

And I agree with the earlier assertion that hashes should be part of the meta-data that could ideally be returned for search results (though it's a non-trivial extension to both DC++ and the DC protocol).

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-02-12 17:06

GargoyleMT wrote:And I agree with the earlier assertion that hashes should be part of the meta-data that could ideally be returned for search results (though it's a non-trivial extension to both DC++ and the DC protocol).
It is a rather trivial extension, as you could use the file type byte to do a hash-search request, then use the hash as a keyword. It seems, at least to me, to be a trivial extension of the protocol, if not of the application itself.

Sarf
---
Hey, that's just the way the cookie gets completely stomped on and obliterated.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-02-12 19:14

sarf wrote: It is a rather trivial extension, as you could use the file type byte to do a hash-search request, then use the hash as a keyword. It seems, at least to me, to be a trivial extension of the protocol, if not of the application itself.
Good call, I was thinking of the problem of returning meta-data with search results (including the hash, bitrate, codec, or something of the sort). The first concept of mine, a $Supports feature and modifying $SR or using a new command, didn't take into account that the (active) search results are returned via UDP and there's no client<->client handshake. Something like your client's tracking of DC++ versions seems like a potential solution to that problem... Or at least a direction.

You're right, that seems to be a nice extension of the protocol. I take back that generalization.

yilard
Posts: 66
Joined: 2003-01-11 06:04
Location: Slovakia

Post by yilard » 2003-02-12 19:24

sarf wrote:It is a rather trivial extension, as you could use the file type byte to do a hash-search request, then use the hash as a keyword. It seems, at least to me, to be a trivial extension of the protocol, if not of the application itself.
code excerpt from ShareManager.cpp:

Code: Select all

bool checkType(const string& aString, int aType) {
	if(aType == SearchManager::TYPE_ANY)
		return true;

	if(aString.length() < 5)
		return false;
	
	const char* c = aString.c_str() + aString.length() - 3;
	u_int32_t type = '.' | (Util::toLower(c[0]) << 8) | (Util::toLower(c[1]) << 16) | (((u_int32_t)Util::toLower(c[2])) << 24);

	switch(aType) {
	case SearchManager::TYPE_AUDIO:
		{
			for(int i = 0; i < (sizeof(typeAudio) / sizeof(typeAudio[0])); i++) {
				if(IS_TYPE(typeAudio[i])) {
					return true;
				}
			}
			if( IS_TYPE2(type2Audio[0]) || IS_TYPE2(type2Audio[1]) ) {
				return true;
			}
		}
		break;
	case SearchManager::TYPE_COMPRESSED:
		if( IS_TYPE(typeCompressed[0]) || IS_TYPE(typeCompressed[1]) || IS_TYPE(typeCompressed[2]) ) {
			return true;
		}
		break;
	case SearchManager::TYPE_DOCUMENT:
		if( IS_TYPE(typeDocument[0]) || IS_TYPE(typeDocument[1]) || 
			IS_TYPE(typeDocument[2]) || IS_TYPE(typeDocument[3]) ) {
			return true;
		}
		break;
	case SearchManager::TYPE_EXECUTABLE:
		if(IS_TYPE(typeExecutable[0]) ) {
			return true;
		}
		break;
	case SearchManager::TYPE_PICTURE:
		{
			for(int i = 0; i < (sizeof(typePicture) / sizeof(typePicture[0])); i++) {
				if(IS_TYPE(typePicture[i])) {
					return true;
				}
			}
			if( IS_TYPE2(type2Picture[0]) || IS_TYPE2(type2Picture[1]) ) {
				return true;
			}
		}
		break;
	case SearchManager::TYPE_VIDEO:
		{
			for(int i = 0; i < (sizeof(typeVideo) / sizeof(typeVideo[0])); i++) {
				if(IS_TYPE(typeVideo[i])) {
					return true;
				}
			}
			if( IS_TYPE2(type2Video[0]) || IS_TYPE2(type2Video[1]) ) {
				return true;
			}
		}
		break;
	default:
		dcasserta(0);	// an unrecognized search type (e.g. a new hash type) ends up here
		break;
	}
	return false;
} 
doesn't seem to be _that_ trivial :( if you don't want old clients to die on that assertion at the end. I hope there is some check somewhere that I have missed, but I don't think so.
In the age of super-boredom/hype and mediocrity/celebrate relentlessness/menace to society --KMFDM

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-02-13 10:36

yilard wrote:code excerpt from ShareManager.cpp: [snipped]

Doesn't seem to be _that_ trivial :( if you don't want old clients to die on that assertion at the end. I hope there is some check somewhere that I have missed, but I don't think so.
Asserts should prove to be no problem, as most old clients are not debug builds and thus do not die on assertions. If they do, the clients need to be fixed, because otherwise this is a very good DoS attack against DC++.

Sarf
---
Penguin lust!

yilard
Posts: 66
Joined: 2003-01-11 06:04
Location: Slovakia

Post by yilard » 2003-02-13 12:06

sarf wrote:If they do, the clients need to be fixed, because otherwise this is a very good DoS attack against DC++.
Exactly the same thing came to my mind :) But I'm currently too busy to make an exploit to prove it (a magic button to kill all DC++ users on the hub :twisted: )
In the age of super-boredom/hype and mediocrity/celebrate relentlessness/menace to society --KMFDM

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Updates on TTH

Post by sandos » 2003-03-01 10:09

Shareaza 1.8.1.1 has been released, with a new TTH. It adds a 0x00 byte to every 1024-byte chunk of file data, and a 0x01 byte to every 24-byte hash, before feeding them into the hash again. The author, Mike, says that this is finalized and is what will be used by bitzi.com as well.
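
In other words, roughly (a sketch; tiger() is a placeholder for a real Tiger implementation producing 24-byte digests, and I'm assuming the marker byte is prepended as in THEX):

Code: Select all

#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> Digest;   // 24 bytes for Tiger

// The caller supplies the actual Tiger implementation; it is not part of
// the common crypto libraries, so DC++ would have to pull one in.
typedef Digest (*HashFunc)(const unsigned char* data, size_t len);

// Leaf node: hash a 0x00 marker followed by up to 1024 bytes of file data.
Digest leafHash(HashFunc tiger, const unsigned char* block, size_t len) {
    std::vector<unsigned char> buf;
    buf.push_back(0x00);
    buf.insert(buf.end(), block, block + len);
    return tiger(&buf[0], buf.size());
}

// Internal node: hash a 0x01 marker followed by the two child digests.
// The markers keep leaves and internal nodes from being confused, which
// is the collision problem mentioned earlier in the thread.
Digest internalHash(HashFunc tiger, const Digest& left, const Digest& right) {
    std::vector<unsigned char> buf;
    buf.push_back(0x01);
    buf.insert(buf.end(), left.begin(), left.end());
    buf.insert(buf.end(), right.begin(), right.end());
    return tiger(&buf[0], buf.size());
}
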

mo
Forum Moderator
Posts: 81
Joined: 2003-02-06 11:20
Location: Ohio
Contact:

Post by mo » 2003-03-01 12:16

I think the best way to generate hashes is to do it on the fly when a search result is found or when an upload starts.

This way there is no one-time two-hour event (that needs to be constantly updated) and hashes are only created for files people care about (keeping the hash DB smaller).

eHast
Posts: 18
Joined: 2003-01-09 02:36
Location: Lund, Sweden

Post by eHast » 2003-03-01 15:57

And how do you propose to search for a hash if they are only generated when you search for them? If someone searches for a hash whose filename you haven't seen a search for, then the search will fail even if you have the file.

So you /need/ to have the hashes precalculated.

OTOH it would be clever to coordinate the effort with other projects like Bitzi etc. No need to have 5 billion hashes of the same file. And on "real" OSes (NT, Linux) you can keep the hash with the file in its metadata. Thus you don't need to worry about renaming or moving the file.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-03 20:41

mo wrote:I think the best way to generate hashes is to do it on the fly when a search result is found or when an upload starts.

This way there is no one-time two-hour event (that needs to be constantly updated) and hashes are only created for files people care about (keeping the hash DB smaller).
How about a low priority thread that will crawl your whole collection, but will give priority to those files that are actually downloaded, or those for which a search hit was returned? Hashing large files on the fly could result in a pretty big delay, especially if the file is on a network drive. That is, if you block the upload of that file until the hash completes. (cache hash, etc.)

mo
Forum Moderator
Posts: 81
Joined: 2003-02-06 11:20
Location: Ohio
Contact:

Post by mo » 2003-03-03 23:10

eHast wrote:And how do you propose to search for a hash if they are only generated when you search for them?
You search as always; the responding client will send an extra response that has the corresponding file hash, and the search results are limited on the searching client's side.

mo
Forum Moderator
Posts: 81
Joined: 2003-02-06 11:20
Location: Ohio
Contact:

Post by mo » 2003-03-03 23:13

GargoyleMT wrote:Hashing large files on the fly could result in a pretty big delay, especially if the file is on a network drive.
You might only use the first 1-10 kB of the file to create the hash that is returned.

ender
Posts: 224
Joined: 2003-01-03 17:47

Post by ender » 2003-03-04 03:45

Hashing 280 GB of data (12000 files) with eDonkey took about 10 hours on my VIA 533. Anyway, IMO the complete file must be hashed, since files are rarely corrupt in the first x kB. Also, beyond hashing the entire file, it would be better to also hash segments of the file (so you don't have to re-download the complete file if a part gets corrupted)...

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Post by sarf » 2003-03-04 05:08

While complete-file hashing has to be done before you start to share (at least in my opinion), file-segment hashing can (and should) be done on the fly, while the user is downloading the file.
My reason for this is that the amount of data that can be saved increases quite rapidly once we start doing file-segment hashing, and if this is done on the fly we can "afford" to use smaller file segments.

Sarf
---
Needling the innocent with dysfunctional logic.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-04 07:48

sarf wrote:While complete-file hashing has to be done before you start to share (at least in my opinion), file-segment hashing can (and should) be done on the fly, while the user is downloading the file.
My reason for this is that the amount of data that can be saved increases quite rapidly once we start doing file-segment hashing, and if this is done on the fly we can "afford" to use smaller file segments.
This also makes partial file sharing and "swarming" possible using hashes.

I don't know exactly how PFS is done in other systems, though; I guess it would be enough to search using the segment hash? What I don't know is whether the hash for the entire file is used in any way to ensure that I'm downloading from the right file. (In case there are segment-hash collisions. Yes, they should be uncommon, but there will be lots of segments around so it's probably not impossible.)

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-04 08:20

Reinventing the wheel?

IMHO these problems have all already been solved by the gnutella community. We should just pick one method to do this and stick to it, maybe even become "compatible" with another system. Shareaza works nicely for me; it checks segments on the fly when downloading and supports swarming/PFS.

It calculates MD5, SHA1, TTH and ED2K hashes (so it can and will grab gnutella, edonkey and magnet URIs). The TTH is calculated using 1024-byte segments, but only 9 levels of the tree are serialized and saved (this was 8 levels in the previous version). This means files will have a maximum of 256 segments, so the segment size depends on the file size. This is only a problem if a 256th of the total download time is a long time, since that is what one can lose when downloading corrupt data. Accessing the TTH is done using HTTP byte ranges; since the TTH format is known from the file size, this makes it possible to download any one part of the TTH tree.
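
To put numbers on that (simple arithmetic, not Shareaza's actual code):

Code: Select all

#include <cstdio>

int main() {
    // 9 serialized levels -> at most 2^8 = 256 leaf-level nodes, so the
    // data you can lose to one corrupt segment is roughly filesize / 256.
    unsigned long long fileSize = 700ULL * 1024 * 1024;   // a 700 MB file
    int segments = 1 << (9 - 1);                          // 256
    std::printf("about %llu KB per segment\n", fileSize / segments / 1024);
    return 0;
}
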

This thread is about the protocol, though. The most straightforward way I see of implementing this is to first implement a more robust way of downloading byte ranges (Get(Z)Block). Now, how do we share and search hashes?

Transferring hashes

Following the tradition of hard-coded filenames with special meanings, this could be done using <filename>.TTHvolume. Just download the file the normal way, or using GetBlock. This is not a very clean solution, though.

I would much like to see metadata like this transferred over the hub, actually. I, as a passive-mode user, might want to get the TTH for a file, or even the SHA1, from another passive user, so that I can use that to search.

There's also the option of adding metadata to the file list. I heard Arne mention XML-based file lists, and that sounds very good to me; it should make it possible to add arbitrary metadata.

Searching for a file with a particular hash

The searching shouldn't be very hard; $Search SHA1:<sha1> TTH:<tth> might work, I think these are ignored by any sane client, even NMDC.

I actually don't know how to return results for this. Any ideas? A modified $SR might work, but that might be pushing it. A new command?

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-04 08:25

sandos wrote:I actually don't know how to return results for this. Any ideas? A modified $SR might work, but that might be pushing it. A new command?
Or we could just return normal $SRs, but that's ambiguous: they could be from another, normal search. But if the client double-checks the hash (explicitly downloads the hash for the file(s) in question), this would work.

aDe
Forum Moderator
Posts: 138
Joined: 2003-01-07 09:14
Location: SE
Contact:

Post by aDe » 2003-03-04 08:41

Sending an invalid search command to an NMDC client makes it send back an ugly error report, like:
To: Vandel-Debug..
Error in.. bla bla..
2003-02-02 etc..

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-04 08:43

Thank you, Sandos, for bringing the conversation back to sanity. Hashing the first couple of kilobytes of a file defeats one of the primary points of hashing: being able to determine hashes mid-stream. And hashing segments, well, that's just making a Merkle hash tree / Tiger tree hash and keeping a specific depth or number of leaves (exactly what you wrote, but it bears repeating).
sandos wrote:Following the tradition of hardcoded filenames with special meanings, this could be done using <filename>.TTHvolume. Just download the file the normal way, or using GetBlock. This is not a very clean solution though.
I'm for some sort of XML exchange (mentioned in another thread), even formatted as a pseudo-filename... I agree with you that hashes are a form of meta-data, and having that extendable makes sense.
sandos wrote:The searching shouldnt be very hard, $Search SHA1:<sha1> TTH:<tth> might work, I think these are ignored by any sane client, even nmdc.
WinMX seems to use a HASH>[blahblah] keyword in a normal search to get search results. That might be a good idea instead, although "old" clients will try to search for those strings.
sandos wrote:I actually dont know how to return results on this. Any ideas? A modified $SR might work, but that might be pushing it. New command?
This is the problem I initially emailed Arne about, at least for MP3 information. The problem is that you never establish a real client to client connection, so you don't know the $Supports list of the remote client (if you're sending out hashes to regular search results, this is a problem) to see if they're capable of handling your overloaded $SR. If you had some sort of flag in the regular search that hash-capable clients would set, that would be one solution, but it has to be something that wouldn't impede getting search results back from non-hash capable clients. Taking a quick gander at the spec, maybe someone was right in considering the size field. If both bools are set to F, it might be ignored by most clients. We could put some type of magic number in there to signal that "this client is hash capable" or maybe a magic string, but that might break a couple clients.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-04 09:29

GargoyleMT wrote:I'm for some sort of XML exchange (mentioned in another thread), even formatted as a pseudo-filename... I agree with you that hashes are a form of meta-data, and having that extendable makes sense.
WinMX seems to use a HASH>[blahblah] keyword in a normal search to get search results. That might be a good idea instead, although "old" clients will try to search for those strings.
Sending an invalid search command to an NMDC client makes it send back an ugly error report, like:
To: Vandel-Debug..
Error in.. bla bla..
2003-02-02 etc..
Oh, ok. An invalid search is bad. I'd go for the HASH>[value] form or something like that, even though it will use cycles in other clients.

Search results seem to be a bit messy. I'm not suggesting returning hashes in normal results/searches, btw. I actually don't find it necessary to ever return hashes in search results. I might be missing something though? I think hash-search results might be best to return in a new command.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-04 11:59

sandos wrote:Oh, ok. An invalid search is bad. I'd go for the HASH>[value] form or something like that, even though it will use cycles in other clients.

Search results seem to be a bit messy. I'm not suggesting returning hashes in normal results/searches, btw. I actually don't find it necessary to ever return hashes in search results. I might be missing something though? I think hash-search results might be best to return in a new command.
That's pretty sensible to me. For an active client, we can just hook in and add another known command near where $SR is defined. A Search that's flagged as being by/for hashes can use that to report a hit instead of $SR. What about passive clients? In that case, we pretty much have to overload the command that sends search results back through the hub, as that seems to be the only way you could get a hash from another passive user...

So how do we flag a $Search as being for a hash?

It seems that downloading would go something like:
A normal $Search to find the file.
Connect to the other person (if it's not passive<->passive), see the $Supports flag (metadata? Filehash? Hash-SHA1? Hash-TTH?), request the file hash, then ask for the file.

I think that meta-data should probably use a mini-slot or have an ultra-high priority in the queue. That way you can get a hash tree from someone who has all their upload slots filled. Once you have the hash, a client can search, probably twice, for alternate sources; once by filename and size, for non-hash enabled sources; once by hash. For non-hash enabled sources, the hash can be used to verify whether they are or are not valid alternate sources, and they can be removed or kept as such. (do failed/removed connections get saved [for a mini-blacklist] in the Queue.xml?)

Also, if we're supporting sig2dat, edonkey, etc. hashes, where does that leave us? I think it'd be "more elegant" to only have one primary hash, for searching, etc.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-03-04 13:53

GargoyleMT wrote:
sandos wrote:Oh, ok. An invalid search is bad. I'd go for the HASH>[value] form or something like that, even though it will use cycles in other clients.

Search results seem to be a bit messy. I'm not suggesting returning hashes in normal results/searches, btw. I actually don't find it necessary to ever return hashes in search results. I might be missing something though? I think hash-search results might be best to return in a new command.
That's pretty sensible to me. For an active client, we can just hook in and add another known command near where $SR is defined. A Search that's flagged as being by/for hashes can use that to report a hit instead of $SR. What about passive clients? In that case, we pretty much have to overload the command that sends search results back through the hub, as that seems to be the only way you could get a hash from another passive user...
Oh yeah, forgot about that minor detail, passive results. But then I guess the same command could be used for active searches, no meaning in having two commands doing the same thing.
So how do we flag a $Search as being for a hash?

It seems that downloading would go something like:
A normal $Search to find the file.
Connect to the other person (if it's not passive<->passive), see the $Supports flag (metadata? Filehash? Hash-SHA1? Hash-TTH?), request the file hash, then ask for the file.
Yes, this is probably the most common scenario. An even easier one is using hashes from a website (sharelive, sharereactor) where we start by doing a hash-search and then just download one of the results.
I think that meta-data should probably use a mini-slot or have an ultra-high priority in the queue. That way you can get a hash tree from someone who has all their upload slots filled. Once you have the hash, a client can search, probably twice, for alternate sources; once by filename and size, for non-hash enabled sources; once by hash. For non-hash enabled sources, the hash can be used to verify whether they are or are not valid alternate sources, and they can be removed or kept as such. (do failed/removed connections get saved [for a mini-blacklist] in the Queue.xml?)

Also, if we're supporting sig2dat, edonkey, etc. hashes, where does that leave us? I think it'd be "more elegant" to only have one primary hash, for searching, etc.
I'm also for one primary hash, and I would then go for TTH. My second option would be SHA1+TTH.

About saving failed hosts, I dunno. It shouldn't be a big problem not doing it, if we have TTH. We only need to download 1/256th or 1 kB of the file to see that it's wrong.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Post by GargoyleMT » 2003-03-05 12:25

sandos wrote:Oh yeah, forgot about that minor detail, passive results. But then I guess the same command could be used for active searches, no meaning in having two commands doing the same thing.
Hmm. Well, we definitely need to be able to return hashes with $Search results sometime, to get the hash to passive clients. Or do we? I suppose a passive user might be the only one of many sources of a specific file online at a given time, and you need the hash from him/her to add it to your queue. Is this a reasonable scenario?

When searching by HASH>[value], do we need to return hashes? The client can just request the hash when we connect to that source and verify that it's the right file. And non-hash enabled clients shouldn't send back results at all.

Do we need to overload $Search at all? If we just use the notation above in keywords...
sandos wrote:Yes, this is probably the most common scenario. An even easier one is using hashes from a website (sharelive, sharereactor) where we start by doing a hash-search and then just download one of the results.
This is the problem with the one-true-hash idea, which of the web direct file links have a TT hash as their key?
sandos wrote:I'm also for one primary hash, and I would then go for TTH. My second option would be SHA1+TTH.
Can't you create a Merkle hash tree with SHA1 as well? Just a random thought.
sandos wrote:About saving failed hosts, I dunno. It shouldn't be a big problem not doing it, if we have TTH. We only need to download 1/256th or 1 kB of the file to see that it's wrong.
True, but I think saving the "bad" sources is probably a good idea. It could be reused in at least another feature I've seen, to "switch" to another source if the speed is below a certain threshold. It also seems like the right thing to do the first time... If someone has a limited number of download slots, then they would probably not appreciate one of those being used to connect to a user that isn't a valid source anyway.

I was kinda hoping that someone else would jump into this conversation, but if nobody is, we must have our heads on straight. ;-)

Locked