Multithreaded filehashing?
I'd like to propose a multithreaded file hashing method for DC++.
This way, multiple-core/CPU computers could hash files a lot faster than they currently do. I know most file hashing is done when one installs DC++ for the first time, but it's still annoying to wait while it happens.
I suppose it would be easiest, and perhaps best, to do it so that one thread hashes one file, with at most as many threads as there are CPUs, and/or a manually selectable count.
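The one-thread-per-file idea could be sketched roughly as below. This is only an illustration: `fake_digest` is a stand-in (FNV-1a), whereas DC++ really computes Tiger tree hashes, and the batching is the crudest possible way of capping the thread count at the number of CPUs.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Placeholder digest; DC++ actually computes a Tiger tree hash (TTH).
static uint64_t fake_digest(const std::string& data) {
    uint64_t h = 1469598103934665603ULL;            // FNV-1a, stand-in only
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Hash each "file" (name, contents) on its own thread, running at most
// hardware_concurrency() threads at a time.
std::map<std::string, uint64_t>
hash_files(const std::vector<std::pair<std::string, std::string>>& files) {
    std::map<std::string, uint64_t> results;
    std::mutex m;
    const size_t cap = std::max(1u, std::thread::hardware_concurrency());

    for (size_t i = 0; i < files.size(); i += cap) {
        std::vector<std::thread> batch;
        for (size_t j = i; j < files.size() && j < i + cap; ++j) {
            batch.emplace_back([&, j] {
                uint64_t d = fake_digest(files[j].second);
                std::lock_guard<std::mutex> lock(m);  // protect shared map
                results[files[j].first] = d;
            });
        }
        for (auto& t : batch) t.join();               // wait out the batch
    }
    return results;
}
```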
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
On my system, hashing is I/O bound. We had a similar request sometime not too long ago, and I think it was I/O bound on other people's systems as well. If a move was made in this direction, a "multi-disk" hashing approach, where a file was read from each physical drive, would make more sense. I'm not sure it would be practical, but I think it's a better tactic.
Having some experience with hashing on a machine with slow I/O, and now having the opportunity to compare it with a faster machine, I'm guessing I/O is the weakest link in this chain. To me, it seems like I/O is the biggest bottleneck.
"Nothing really happens fast. Everything happens at such a rate that by the time it happens, it all seems normal."
-
- Forum Moderator
- Posts: 1420
- Joined: 2003-04-22 14:37
Déjà vu. I swear we had this discussion fairly recently, but I can't seem to find it anywhere. However, I have been able to find the bugzilla request made by that person:
http://dcpp.net/bugzilla/show_bug.cgi?id=856
Depends on the situation.
For me, multithreaded file hashing would make absolute sense. But then again, I have 10 disks to hash over...
Most people just have one or two, and then it won't speed things up much at all.
But setting aside the I/O bound, which would be around 50 to 100 MB/s depending on whether there is a RAID array or not: can one CPU handle that amount of data flow?
For myself, I see one of the hyperthreaded CPUs (dual 3.0 GHz) being filled up almost completely while hashing, and the rest just sit there idle. Sometimes I get the feeling that if it also used CPU time on the other CPUs, it might be faster.
But I'd agree that this probably isn't the most needed feature for most users out there.
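Whether one CPU can keep up with 50-100 MB/s can be checked by timing the hash over an in-memory buffer, which takes disk I/O out of the measurement entirely. Again, FNV-1a stands in here for the Tiger tree hash DC++ really uses, so the absolute number is only illustrative.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Stand-in digest so the sketch is self-contained (not the real Tiger hash).
static uint64_t fnv1a(const std::vector<unsigned char>& buf) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : buf) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Single-threaded hashing throughput in MB/s over an in-memory buffer,
// so disk speed never enters the measurement.
double hash_throughput_mb_s(size_t megabytes) {
    std::vector<unsigned char> buf(megabytes * 1024 * 1024, 0xAB);
    auto t0 = std::chrono::steady_clock::now();
    volatile uint64_t sink = fnv1a(buf);     // volatile: keep the work alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return megabytes / secs;
}
```

If the reported figure comfortably exceeds the drive's sequential read speed, the disk is the bottleneck; if not, extra CPUs could in principle help.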
-
- Posts: 2
- Joined: 2006-04-28 14:37
- Location: Lithuania
- Contact:
IMHO, multithreaded file hashing would only make sense when reading from multiple drives, because reading two files from different parts of the same disk simultaneously would only make it a lot slower. On my 2 GHz Celeron with a Seagate 7200.9 SATA drive, hashing is done at about 27 MB/s, which is pretty sufficient in my case; since my drive can read up to 70 MB/s, that can only mean that the processor is the bottleneck in this case.
-
- Posts: 164
- Joined: 2005-01-06 08:39
- Location: HU
- Contact:
Pinchiukas wrote: that can only mean that the processor is the bottleneck in this case
Not necessarily. RAR unpacking goes pretty slowly for me, and my CPU usage is never at 100% while doing it. Maybe the algorithm itself cannot go faster. Oh, and one should not forget about the filesystem used, the fragmentation level, etc. This is rather multifactorial.
Hey you, / Don't help them to bury the light... / Don't give in / Without a fight. (Pink Floyd)
-
- Posts: 2
- Joined: 2006-04-28 14:37
- Location: Lithuania
- Contact:
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
Pinchiukas wrote: in raid arrays I think the whole array is thought of by the os as a single partition
Yes, that's part of the definition of a RAID array, and it's the same whether it's done in hardware (like, say, a 3Ware controller) or in software (with a product like SoftRAID).
RAID arrays suffer the same hit from concurrently reading files that single drives do, of course.
-
- Posts: 52
- Joined: 2003-03-12 00:06
- Location: Zinzinnati
You know, another solution for these super-large shares is an external hashing application. Sure, DC++ can do the hashing, but it's not particularly optimized. With an external app, the user can do whatever is best for their own system: if you have 5 disks, you can fire up 5 processes; or if your 3 IDE channels are the limiting factor, you can fire up 3 processes. The app would then write out the hash index and XML files. As an alternative to a multi-process approach, it could be multithreaded.
LA
You are an idiot! Repent!
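A driver for such an external hasher might just build one command line per disk, so the user decides how many processes run in parallel. Everything below (`exthash`, its `--out` flag, the index file names) is invented for illustration; no such tool exists.

```cpp
#include <string>
#include <vector>

// Hypothetical: one hasher command per disk root, each writing its own
// index file, so the user can launch as many processes as they have
// spindles (or IDE channels).
std::vector<std::string>
build_hash_commands(const std::vector<std::string>& disk_roots,
                    const std::string& index_dir) {
    std::vector<std::string> cmds;
    for (size_t i = 0; i < disk_roots.size(); ++i) {
        cmds.push_back("exthash \"" + disk_roots[i] + "\" --out \"" +
                       index_dir + "/hashindex" + std::to_string(i) +
                       ".xml\"");
    }
    return cmds;
}
```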
lordadmira wrote:U know, another solution for these super large shares is an external hashing application. Sure DC can do it but it's not optimized particularily. With an external app, the user can do whatever is best for their own system. If u have 5 disks, u can fire up 5 processes. Or if ur 3 IDE channels are the limiting factor, u can fire up 3 processes. The app will then write out the hash index and xml files. As an alternative to a multi process approach, it can be multi threaded.
LA
I can see this as a "hashing plugin": since the hash format is standardised, I'd propose moving all hashing code out of DC++ and designing an API for a hashing plugin. Then a separate project could concentrate its efforts on making hashing faster and more reliable, while the DC++ developers use the API and the latest/most stable/preferred version/mod of the hashing plugin.
Really, that would be best. Somebody might even do a full rewrite once the API is agreed upon. Can't be that hard?
I'd really like the idea of downloading a more recent hash plugin DLL and just replacing it in my DC++ directory to get additional speed. Hashing code bloats the poor DC++ source tree; DC++ deals with network connections and sharing, not with maths/hashes. At the least, we should follow the GNU philosophy of numerous separate tools/libs doing _ONE_ thing perfectly well, rather than having zillions of poor/incomplete/buggy "features".
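A hashing-plugin API along these lines could be a small C-style table of entry points resolved from the DLL. Every name below (`HashPlugin`, its members, the version check) is hypothetical; DC++ defines no such interface, and a real host would fill the table via `GetProcAddress`/`dlsym`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical plugin ABI: plain function pointers so the table can be
// populated from any DLL, regardless of compiler.
struct HashPlugin {
    const char* (*name)();                        // e.g. "tth-sse2"
    uint32_t    (*api_version)();                 // reject incompatible DLLs
    void*       (*create)();                      // allocate hash state
    void        (*update)(void*, const uint8_t*, size_t);
    void        (*finalize)(void*, uint8_t out[24]);  // a TTH root is 24 bytes
    void        (*destroy)(void*);
};

// The host only checks the version, then streams file data through the
// plugin without caring which algorithm or optimization level it uses.
bool host_accepts(const HashPlugin& p, uint32_t expected_version) {
    return p.api_version != nullptr && p.api_version() == expected_version;
}
```

The version check is the important design point: it lets users drop in a newer DLL for speed, while the host safely refuses plugins built against a different ABI.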
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
Hello, hashing is a pain in the ass for people using EXTERNAL USB HDDs.
It's the 4th time in 4 months that I'm hashing 20 GB over USB 1.1, after some other USB device (digital camera, virtual drive from Daemon Tools, etc.) occupied the drive letter my external drive was using.
The hashing system used now is stupid, to say the least! It detects nearly nothing by itself, like the same file tree structure having moved to another drive...
Please do something about it, soon! Or the fake file lists will begin to spread all over...