Propose $Supports SHA1

Technical discussion about the NMDC and ADC (http://dcpp.net/ADC.html) protocols. The NMDC protocol is documented in the Wiki (http://dcpp.net/wiki/), so feel free to refer to it.

Dj_Offset
Posts: 48
Joined: 2003-02-22 19:22
Location: Oslo, Norway
Contact:

Propose $Supports SHA1

Post by Dj_Offset » 2003-02-22 20:07

Hi,

I see dcgui supports downloading files in "chunks" to be able to do multiple downloads.
Multiple downloads are good, but how do we know that two files are equal?
1) File size doesn't mean anything
2) File name doesn't mean anything
3) File size + file name doesn't necessarily mean anything either (higher probability, though)

We need a good hashing algorithm like SHA-1 (or MD5, but SHA-1 is more secure) to make sure the files are equal. I think an extended protocol function could solve this:

For example:
RECV: $Supports [...] SHA1|
(login)
SEND: $SHA1 file|
RECV: $SHA1SUM 45e3302ac7174c5a9d94f938e2799c88d1168dc1|
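
To make it concrete, here's a rough, untested sketch of what the answering client could do for $SHA1: hash the whole file and send back the hex digest. I'm using OpenSSL's SHA-1 routines here; the command names are of course just my proposal.

Code:
// Sketch only: hash a file with OpenSSL's SHA-1 and return the hex
// digest for a hypothetical $SHA1SUM reply. Link with -lcrypto.
#include <openssl/sha.h>
#include <cstdio>
#include <string>

std::string sha1_file(const char* path) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return "";
    SHA_CTX ctx;
    SHA1_Init(&ctx);
    unsigned char buf[64 * 1024];
    size_t n;
    while ((n = std::fread(buf, 1, sizeof(buf), f)) > 0)
        SHA1_Update(&ctx, buf, n);
    std::fclose(f);
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1_Final(digest, &ctx);
    char hex[2 * SHA_DIGEST_LENGTH + 1];
    for (int i = 0; i < SHA_DIGEST_LENGTH; ++i)
        std::sprintf(hex + 2 * i, "%02x", digest[i]);
    return hex;
}

// e.g.: send("$SHA1SUM " + sha1_file(requestedPath) + "|");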

...But there are some problems with this approach as well.
1) Many people may have a file, but only part of it (half of a 700MB movie)

This can be solved by creating a checksum for each "chunk" of data, plus a checksum for the complete file - for example, an SHA-1 sum for each 1 MB, or each 10 MB.
A hex-encoded SHA-1 sum is 40 bytes long, so the complete set of sums for a 700 MB file in 1 MB chunks (700 chunk sums plus one whole-file sum) would be 701 x 40 = 28040 bytes to transfer in the $SHA1SUM reply (that's not much).

So the scenario is now:
RECV: $Supports [...] SHA1|
(login)
SEND: $SHA1 file|
RECV: $SHA1SUM [sum1] [sum2] [sum3] ... [sum n+1]|
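
Building that reply could look roughly like this (again OpenSSL, untested; I'm assuming 1 MB chunks with the whole-file sum appended last):

Code:
// Sketch only: one SHA-1 per 1 MB chunk, plus one SHA-1 for the
// whole file appended last, for a hypothetical $SHA1SUM reply.
#include <openssl/sha.h>
#include <cstdio>
#include <string>
#include <vector>

static std::string to_hex(const unsigned char* d) {
    char hex[2 * SHA_DIGEST_LENGTH + 1];
    for (int i = 0; i < SHA_DIGEST_LENGTH; ++i)
        std::sprintf(hex + 2 * i, "%02x", d[i]);
    return hex;
}

std::vector<std::string> chunk_sums(const char* path) {
    const size_t CHUNK = 1024 * 1024;           // 1 MB per chunk
    std::vector<std::string> sums;
    FILE* f = std::fopen(path, "rb");
    if (!f) return sums;
    SHA_CTX whole;
    SHA1_Init(&whole);                          // runs over the full file
    std::vector<unsigned char> buf(CHUNK);
    unsigned char digest[SHA_DIGEST_LENGTH];
    size_t n;
    while ((n = std::fread(&buf[0], 1, CHUNK, f)) > 0) {
        SHA1(&buf[0], n, digest);               // per-chunk sum
        sums.push_back(to_hex(digest));
        SHA1_Update(&whole, &buf[0], n);
    }
    std::fclose(f);
    SHA1_Final(digest, &whole);
    sums.push_back(to_hex(digest));             // sum n+1: the whole file
    return sums;
}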

What do you guys think about this?
I wrote QuickDC - A DC++ compatible client for Linux and FreeBSD.

sarf
Posts: 382
Joined: 2003-01-24 05:43
Location: Sweden
Contact:

Re: Propose $Supports SHA1

Post by sarf » 2003-02-23 09:22

This issue has been discussed before (and is being discussed on the board, I think).
Dj_Offset wrote:[snip]
Multiple downloads are good, but how do we know that two files are equal?
1) File size doesn't mean anything
2) File name doesn't mean anything
3) File size + file name doesn't necessarily mean anything either (higher probability, though)
Generally, you are correct. However, we are talking about local searches in a very small (relatively speaking) pool of data. If you search for "Once I had Gold (IRS-blues).mp3", 876213 bytes long (a fictional file, FYI), and you find eight people in a hub that have this thing, the probability is very high (in excess of 90%) that it is the same file. In fact, the "Search for alternates" method works, which proves this.

For larger hubs, or for searching across many hubs, hashes would need to be added, but they are good to have, not a "must-have-or-the-world-will-end" feature.
Dj_Offset wrote:We need a good hashing algorithm like SHA-1 (or MD5, but SHA-1 is more secure) to make sure the files are equal. I think an extended protocol function could solve this;
[snipped example]
How costly is SHA-1 in CPU terms? How long does it take to process 40 GB of data spread over 3 partitions and two hard drives? Could it be done at runtime (e.g. not at the start of DC++)?

Dj_Offset wrote:...But there are some problems with this approach as well.
1) Many people may have a file, but only part of it (half of a 700MB movie)
Yes. So what? While partial-file hashes are all nice and dandy, they are not, strictly speaking, necessary. Why should people share partial files in the first place? Don't tell me that they should so that people can find and download them faster, 'cause it would be better (in my opinion, at least) to share your "finished files" directory and update the file list whenever this directory changes.
Dj_Offset wrote:[snipped block checksum example]
What do you guys think about this?
Again, how much time will this take? When will it be done - at runtime or at start-up?

Before I sign up for SHA-1, I think we should consider how many files we will be searching in.

Even if you are in three very large hubs with, say, 1000 users each, the probability (due to how we file leechers work) is that a filename-and-size combination is enough to find 50% of all possible sources. If SHA-1 increases this to 99% but costs every user considerable time and hard drive abuse every time they add another file to their share, we will be killing the cockroach by shooting it through our own feet. I'd be the first to admit that I would not add files if doing so cost my computer three minutes of intensive hard drive use per gigabyte or so.
If Adler-32 is enough to identify 90% of the sources and takes half the time, I say let's go for Adler-32. We're here to improve the network, not to add perfect and unused functionality.

Hmmm... I sound too pessimistic - I personally would love to see hashes added, but there are other considerations than whether the hashes are "secure" enough - such as the fact that links from ShareReactor could not be used (if I am not mistaken, they use MD5) - a major flaw according to some people (not me).

Sarf
---
Time is a great teacher, but it kills all its pupils.

Dj_Offset
Posts: 48
Joined: 2003-02-22 19:22
Location: Oslo, Norway
Contact:

Post by Dj_Offset » 2003-02-23 09:34

According to benchmarks from the Botan project (previously known as OpenCL), SHA-1 (160-bit) runs at ~68 MB/s on a 1.4 GHz machine (the hard drive is really the bottleneck); your 40 GB would thus take roughly ten minutes of pure hashing time. MD5 has a throughput of ~125 MB/s.

The hashes should be created in the background when you first fire up the DC client, and stored in a local cache to prevent them from being recalculated on each startup. Each chunk of data should then be verified as it is successfully downloaded.
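
Something like this, roughly - the cache keys on path + size + mtime, so only new or changed files get rehashed at startup (the names and file format here are made up, just to show the idea):

Code:
// Sketch only: a persistent hash cache keyed on path|size|mtime.
// A miss means "hash this file in the background"; a hit reuses
// the stored sum from the previous run.
#include <sys/stat.h>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

struct HashCache {
    std::map<std::string, std::string> entries; // key -> hex SHA-1

    // Key changes whenever the file changes, invalidating the entry.
    static std::string keyFor(const std::string& path) {
        struct stat st;
        if (stat(path.c_str(), &st) != 0) return "";
        std::ostringstream k;
        k << path << '|' << st.st_size << '|' << st.st_mtime;
        return k.str();
    }

    void load(const char* cacheFile) {          // two lines per entry
        std::ifstream in(cacheFile);
        std::string key, sum;
        while (std::getline(in, key) && std::getline(in, sum))
            entries[key] = sum;
    }

    void save(const char* cacheFile) const {
        std::ofstream out(cacheFile);
        std::map<std::string, std::string>::const_iterator it;
        for (it = entries.begin(); it != entries.end(); ++it)
            out << it->first << '\n' << it->second << '\n';
    }

    // "" means: not cached yet, schedule background hashing.
    std::string lookup(const std::string& path) const {
        std::map<std::string, std::string>::const_iterator it =
            entries.find(keyFor(path));
        return it == entries.end() ? std::string() : it->second;
    }
};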

I'm NOT suggesting on-demand hashing, because that opens the door to potential DoS attacks. However, the method I described will mean a lot of metadata per shared file...
I wrote QuickDC - A DC++ compatible client for Linux and FreeBSD.

cologic
Programmer
Posts: 337
Joined: 2003-01-06 13:32
Contact:

Post by cologic » 2003-02-23 12:51

ShareReactor uses MD4 links. That hash algorithm has been broken, and it shouldn't be used in new designs. As for which hash to use, I've been leaning towards http://bitzi.com/bboard/message?message ... um_id=4076, though there's a design issue with how the Tiger hashes are combined in the tree that a new specification is supposed to fix soon; this is why I've made no effort to implement it so far.

GargoyleMT
DC++ Contributor
Posts: 3212
Joined: 2003-01-07 21:46
Location: .pa.us

Re: Propose $Supports SHA1

Post by GargoyleMT » 2003-02-23 16:25

Dj_Offset wrote:What do you guys think about this?
Google for Tiger Tree Hash or Merkle Hash Tree.

There's no sense reinventing the wheel when mathematicians with a head for this theoretical divination have already done it for you.
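
The combining step is simple enough to sketch. Here SHA-1 stands in for Tiger (OpenSSL doesn't ship Tiger); the 0x00/0x01 leaf/node prefixes and the promote-the-odd-node rule are from the THEX draft, if I remember it right:

Code:
// Sketch only: compute a Merkle tree root from leaf digests.
// Per the THEX draft: leaf = H(0x00 || data),
// inner = H(0x01 || left || right), and an unpaired digest at
// the end of a level is promoted to the next level as-is.
#include <openssl/sha.h>
#include <string>
#include <vector>

typedef std::string Digest; // 20 raw bytes for SHA-1

static Digest inner(const Digest& left, const Digest& right) {
    unsigned char prefix = 0x01;
    SHA_CTX ctx;
    SHA1_Init(&ctx);
    SHA1_Update(&ctx, &prefix, 1);
    SHA1_Update(&ctx, left.data(), left.size());
    SHA1_Update(&ctx, right.data(), right.size());
    unsigned char d[SHA_DIGEST_LENGTH];
    SHA1_Final(d, &ctx);
    return Digest(reinterpret_cast<const char*>(d), SHA_DIGEST_LENGTH);
}

Digest merkle_root(std::vector<Digest> level) {
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(inner(level[i], level[i + 1]));
        if (level.size() % 2)          // odd one out moves up a level
            next.push_back(level.back());
        level.swap(next);
    }
    return level.empty() ? Digest() : level[0];
}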

Cologic: that looks like a good post. Your code seemed to be leaning towards SHA1 hashes; why the switch to Tiger?

cologic
Programmer
Posts: 337
Joined: 2003-01-06 13:32
Contact:

Post by cologic » 2003-02-23 18:19

It appears I should proofread my posts more carefully.

In any event, I put some SHA1 code into my distribution of DC++ because it's better than MD4 (broken) and MD5 (shown to be weaker), and because I did not at that point know about either Tiger or Merkle hash trees. I've not included HashManager.cpp and HashManager.h in the client project, though, because I have no intention of using SHA1 at this point.

I prefer to use a Tiger Tree hash as opposed to, say, an SHA1 tree hash because the TTH (calculated with a 1 kB block size, even if not all leaves down to that size are stored) will allow compatibility with other users of Bitzi's code/algorithms.

sandos
Posts: 186
Joined: 2003-01-05 10:16
Contact:

Post by sandos » 2003-02-23 18:35

cologic wrote:I prefer to use a Tiger Tree hash as opposed to, say, an SHA1 tree hash because the TTH (calculated with a 1 kB block size, even if not all leaves down to that size are stored) will allow compatibility with other users of Bitzi's code/algorithms.
Too bad that you can't yet look up files in Bitzi using only TTH - you do it either with only SHA1, or with SHA1+TTH. This could be a Good Thing, though, considering that all the TTH hashes in Bitzi's database will have to be recalculated or thrown away.
