Note that none of the proposed solutions require any changes to the hub. We could do a lot better if we mandated changes to the hub, but for practical reasons I think it is best to continue mangling the DC1 protocol until it does what we want.
First of all, why is file hashing desirable?
* To determine whether different search results refer to the same file. No more adding sources manually, one by one.
* To search for more sources for a known file (even sources that have been renamed) with less bandwidth usage (no extraneous results) and less processor usage (it is easier for clients to match a hash than a search string).
* Enables validation of the file (or at least the part that was hashed). Clients could automatically redownload if verification fails. A tree hash is even better for this; clients could verify parts of the file and only redownload the parts that are corrupt.
* Enables the possibility of sharing incomplete downloads. The client pretends the file is complete (returns the hash and size of the complete file) but then indicates which parts are missing.
Here is what has been done already in dctc:
DC protocol extension.
----------------------
* Search by content
-----------------
To be more efficient while searching, the search by content has been added.
2 DC commands have been modified:
$Search (to start a search)
and
$SR (the search result)
$Search change:
===============
a normal $Search command is like this:
$Search sender a?b?c?d?e
where sender is "Hub:nick" for passive search or "ip:port" for active search.
a is F if size doesn't matter, else T
b is F if size is "at least", else T (at most)
c is the size in byte
d is data type: 1=any,2=audio,3=compressed,4=document,5=exe,6=picture,7=videos,
8=folder
and eeee is the pattern to find
To remain compatible with DC, it is not possible to add new field. I have
modified the 'a' field (the wanted size)
now, the command is like this:
$Search sender a.md?b?c?d?e
where md is a MD5SUM (see below).
$SR change:
===============
a normal $SR command is like this:
$SR sender a\5b c\5d (e)
(\5 is the character '\005')
where sender is the nickname of the person sending this reply.
a is the found filename.
b is the filesize
c is the slot ratio (freeslot/total slot)
d is the hubname
e is the hub address.
To remain compatible with DC, it is not possible to add new field. I have
modified the 'b' field (the filesize)
now, the command is like this:
$SR sender a\5b.md c\5d (e)
where md is a MD5SUM (see below).
Note on compatibility: For a strange reason, on the hub, there is no warning when
the modified $Search is sent but there is a warning when the $SR is sent (only
in passive mode because active mode doesn't use the hub). It is only a warning
and the extension works fine.
--------------------------------------------------------------------------------
MD5SUM computation.
the MD5SUM is computed on the first 4KBytes of the file using the standard
MD5 algorithm (see md.c file of DCTC). The algorithm produces a non printable
16 bytes string. To be able to send it using the DC protocol, the string is
rewritten. The 16 bytes non printable string is converted into a 48 bytes
printable string using the following very simple rule. Each byte of MD5sum is
written in a 3 decimal characters string. Ex:
254 is written "254", 136 becomes "136", 28 becomes "028".
This is not the most efficient way of storing the value but using this, the
filesize (c) and the encoded MD5sum (md) looks like a float (c.md). Ex:
35345.111222333444555666777888999000111222333444555666
35345 is the filesize and the part after the dot is the md5sum (bad example,
it is not possible to encode 333 ... 999 into a byte :) ). Thus, even if the
program expect a number, there will be no error.
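To make the dctc encoding quoted above concrete, here is a rough Python sketch of how I read it. The function names are mine, not dctc's, and I am assuming "4KBytes" means 4096 bytes; check md.c in DCTC before relying on any of this.

```python
import hashlib

MD5_PREFIX_LEN = 4096  # assumption: "4KBytes" means 4096 bytes


def encode_md5_field(data: bytes) -> str:
    """MD5 of the first 4 KB, written as 16 groups of 3 decimal digits."""
    digest = hashlib.md5(data[:MD5_PREFIX_LEN]).digest()
    return "".join(f"{byte:03d}" for byte in digest)


def decode_md5_field(field: str) -> bytes:
    """Turn the 48-digit string back into the 16 raw MD5 bytes."""
    if len(field) != 48 or not (field.isascii() and field.isdigit()):
        raise ValueError("expected exactly 48 decimal digits")
    values = [int(field[i:i + 3]) for i in range(0, 48, 3)]
    if any(value > 255 for value in values):
        raise ValueError("each 3-digit group must be in 0..255")
    return bytes(values)


def split_size_field(field: str):
    """Split 'size.md' into (size, md_bytes); md_bytes is None if there is no '.md' part."""
    size_text, dot, md_text = field.partition(".")
    return int(size_text), (decode_md5_field(md_text) if dot else None)
```

Note that two files sharing the same first 4 KB get the same value, so this can only ever vouch for the start of the file.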
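Here is my proposal. Instead of touching the size field, put the extra information in the search pattern field of $Search and the file name field of $SR:
$Search sender a?b?c?d?z
$SR sender z\5b c\5d (e)
where z is a string like the following: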
<sha1>base64 encoded hash<author>name of author<c:\file.txt
Note 1: To parse this string, just split at the less-than symbol '<' and ignore the empty piece before the first one. Then you should get a set of properties like the following: "sha1>aabbcc", "author>John Doe", "c:\file.txt".
Note 2: If a property doesn't have a greater-than symbol (>) then it has the same meaning as in the original spec, i.e. a search pattern for $Search and a file path for $SR.
Note 3: For the $SR command, the file path must be last, so that old clients can extract the file name successfully.
Note 4: The extra information will show up in the Path column for older clients.
Note 5: Older clients will try to download files using the above string (hopefully). New clients should be able to extract the path from this string.
Note 6: A SHA-1 hash is 20 bytes long. That is 27 base64 encoded characters once the trailing '=' padding is dropped (28 with padding).
Note 7: This scheme uses the following special symbols: '<', '>', '+' and '/'. Can someone confirm that these characters are safe to use for both $Search and $SR?
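The notes above can be sketched in Python roughly like this. The names parse_z, build_z and sha1_base64 are just ones I made up for illustration, and I am assuming that property values and paths never contain '<' or '>' themselves.

```python
import base64
import hashlib


def sha1_base64(data: bytes) -> str:
    """SHA-1 of data as base64 with the '=' padding dropped (27 characters)."""
    digest = hashlib.sha1(data).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def build_z(properties: dict, plain: str) -> str:
    """Join name>value properties, keeping the pattern or path last (Note 3)."""
    if not properties:
        return plain  # assumption: with no extras, send the old-style string
    tagged = "".join(f"<{name}>{value}" for name, value in properties.items())
    return f"{tagged}<{plain}"


def parse_z(z: str):
    """Split at '<'; pieces with '>' are properties, the other is the pattern/path."""
    properties = {}
    plain = ""
    for piece in z.split("<"):
        if not piece:
            continue  # the leading '<' produces an empty first piece
        name, marker, value = piece.partition(">")
        if marker:
            properties[name] = value
        else:
            plain = piece  # same meaning as in the original spec (Note 2)
    return properties, plain
```

A string from an old client has no '<' at all, so parse_z hands it back unchanged as the pattern or path, with an empty property set.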
To calculate the hash, I propose the following. First, split the file to be hashed into 1MB blocks (THEX suggests 1kB, but I'm worried about the bandwidth/storage requirements needed for such a fine granularity) and hash them individually using SHA-1. Then combine the first and second hashes, and hash the result. Then combine the third and fourth, and hash them. Once all the hashes have been combined and hashed, repeat the process again and again until only one hash remains. That is the hash for the entire file.
See: http://open-content.net/specs/draft-jch ... ex-01.html
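Here is a minimal Python sketch of the tree hash as I described it, to make the pairing step unambiguous. Several details are assumptions on my part that the proposal does not pin down: "combine" is taken to mean plain concatenation, a leftover hash at the end of an odd-sized level is carried up unchanged, and an empty file hashes as a single empty block. It also keeps the whole file in memory, which a real client would not do.

```python
import hashlib

BLOCK_SIZE = 1024 * 1024  # 1MB blocks, as proposed


def leaf_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """SHA-1 of each block; an empty file counts as one empty block (assumption)."""
    if not data:
        return [hashlib.sha1(b"").digest()]
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def next_level(hashes: list) -> list:
    """Hash neighbouring pairs: (1st, 2nd), (3rd, 4th), and so on."""
    combined = [hashlib.sha1(hashes[i] + hashes[i + 1]).digest()
                for i in range(0, len(hashes) - 1, 2)]
    if len(hashes) % 2:
        combined.append(hashes[-1])  # assumption: odd one out moves up as-is
    return combined


def tree_levels(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """All levels of the tree, leaves first, root last."""
    levels = [leaf_hashes(data, block_size)]
    while len(levels[-1]) > 1:
        levels.append(next_level(levels[-1]))
    return levels


def root_hash(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """The single hash that identifies the entire file."""
    return tree_levels(data, block_size)[-1][0]
```

For a file of one block or less, the root is simply the SHA-1 of the file, so small files cost nothing extra.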
This tree hash can be used to verify any part of the file, with a granularity of 1MB (since that was the initial block size). We would need to define a client->client extension so that a client can download the entire tree hash.
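As a sketch of how a client might use the downloaded tree, the Python below first checks that the leaf hashes a peer sent really add up to the root hash we already trust (for example from a search result), then lists which blocks of the local data are bad. The function names are hypothetical, the client->client extension for fetching the leaves is not shown because it does not exist yet, and the pairing rule (concatenate neighbours, carry an odd leftover up unchanged) is my assumption.

```python
import hashlib


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def root_from_leaves(leaves: list) -> bytes:
    """Rebuild the root by repeatedly hashing neighbouring pairs."""
    if not leaves:
        raise ValueError("need at least one leaf hash")
    level = list(leaves)
    while len(level) > 1:
        paired = [sha1(level[i] + level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # assumption: odd one out moves up as-is
        level = paired
    return level[0]


def corrupt_blocks(data: bytes, leaves: list, trusted_root: bytes,
                   block_size: int) -> list:
    """Return the indices of blocks whose SHA-1 does not match its leaf hash."""
    if root_from_leaves(leaves) != trusted_root:
        raise ValueError("leaf hashes do not match the trusted root")
    bad = []
    for index, expected in enumerate(leaves):
        block = data[index * block_size:(index + 1) * block_size]
        if sha1(block) != expected:
            bad.append(index)
    return bad
```

Only the block numbers returned need to be downloaded again, which is the whole point of hashing in 1MB pieces rather than hashing the file in one go.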
What do you guys think?