Multithreaded filehashing?
I'd like to propose a multithreaded file hashing method for DC++.
This way, multiple-core/CPU computers could hash files a lot faster than they currently do. I know most file hashing is done when one installs DC++ for the first time, but it's still annoying to wait while it happens.
I suppose it would be easiest, and perhaps best, to do it so that one thread hashes one file, with at most as many threads as there are CPUs, and/or a manually selectable count.
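The one-thread-per-file idea could be sketched roughly as below. This is only an illustration: `fake_digest` is a stand-in (FNV-1a), whereas DC++ really computes Tiger tree hashes, and the batching is the crudest possible way of capping the thread count at the number of CPUs.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Placeholder digest; DC++ actually computes a Tiger tree hash (TTH).
static uint64_t fake_digest(const std::string& data) {
    uint64_t h = 1469598103934665603ULL;            // FNV-1a, stand-in only
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Hash each "file" (name, contents) on its own thread, running at most
// hardware_concurrency() threads at a time.
std::map<std::string, uint64_t>
hash_files(const std::vector<std::pair<std::string, std::string>>& files) {
    std::map<std::string, uint64_t> results;
    std::mutex m;
    const size_t cap = std::max(1u, std::thread::hardware_concurrency());

    for (size_t i = 0; i < files.size(); i += cap) {
        std::vector<std::thread> batch;
        for (size_t j = i; j < files.size() && j < i + cap; ++j) {
            batch.emplace_back([&, j] {
                uint64_t d = fake_digest(files[j].second);
                std::lock_guard<std::mutex> lock(m);  // protect shared map
                results[files[j].first] = d;
            });
        }
        for (auto& t : batch) t.join();               // wait out the batch
    }
    return results;
}
```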
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
On my system, hashing is I/O bound. We had a similar request sometime not too long ago, and I think it was I/O bound on other people's systems as well. If a move was made in this direction, a "multi-disk" hashing approach, where a file was read from each physical drive, would make more sense. I'm not sure it would be practical, but I think it's a better tactic.
Having some experience with hashing on a machine with slow I/O, and now having the opportunity to compare it with a faster machine, I'm guessing I/O is the weakest link in this chain. To me, it seems like I/O is the biggest bottleneck.
"Nothing really happens fast. Everything happens at such a rate that by the time it happens, it all seems normal."
-
- Forum Moderator
- Posts: 1420
- Joined: 2003-04-22 14:37
Déjà vu. I swear we had this discussion fairly recently, but I can't seem to find it anywhere. However, I have been able to find the bugzilla request made by that person:
http://dcpp.net/bugzilla/show_bug.cgi?id=856
Depends on the situation.
For me, multithreaded file hashing would make absolute sense. But then again, I have 10 disks to hash over...
Most people just have one or two, and then it won't speed things up much at all.
But setting aside the I/O bound, which would be around 50 to 100 MB/s depending on whether there is a RAID array or not: can one CPU handle that amount of data flow?
For myself, I see one of the hyperthreaded CPUs (dual 3.0 GHz) being filled up almost completely while hashing, and the rest just sit there idle. Sometimes I get the feeling that if it also used CPU time on the other CPUs, it might be faster.
But I'd agree that this probably isn't the most needed feature for most users out there.
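Whether one CPU can keep up with 50-100 MB/s can be checked by timing the hash over an in-memory buffer, which takes disk I/O out of the measurement entirely. Again, FNV-1a stands in here for the Tiger tree hash DC++ really uses, so the absolute number is only illustrative.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Stand-in digest so the sketch is self-contained (not the real Tiger hash).
static uint64_t fnv1a(const std::vector<unsigned char>& buf) {
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : buf) { h ^= c; h *= 1099511628211ULL; }
    return h;
}

// Single-threaded hashing throughput in MB/s over an in-memory buffer,
// so disk speed never enters the measurement.
double hash_throughput_mb_s(size_t megabytes) {
    std::vector<unsigned char> buf(megabytes * 1024 * 1024, 0xAB);
    auto t0 = std::chrono::steady_clock::now();
    volatile uint64_t sink = fnv1a(buf);     // volatile: keep the work alive
    (void)sink;
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return megabytes / secs;
}
```

If the reported figure comfortably exceeds the drive's sequential read speed, the disk is the bottleneck; if not, extra CPUs could in principle help.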
-
- Posts: 2
- Joined: 2006-04-28 14:37
- Location: Lithuania
- Contact:
IMHO, multithreaded file hashing would only make sense when reading from multiple drives, because reading two files from different parts of the same disk simultaneously would only make it a lot slower. On my 2 GHz Celeron with a Seagate 7200.9 SATA drive, hashing is done at about 27 MB/s, which is pretty sufficient in my case; since my drive can read up to 70 MB/s, that can only mean that the processor is the bottleneck in this case.
-
- Posts: 164
- Joined: 2005-01-06 08:39
- Location: HU
- Contact:
Pinchiukas wrote: that can only mean that the processor is the bottleneck in this case
Not necessarily. RAR unpacking goes pretty slowly for me, and my CPU usage is never at 100% while doing it. Maybe the algorithm itself cannot go faster. Oh, and one should not forget about the filesystem used, the fragmentation level, etc. This is rather multifactorial.
Hey you, / Don't help them to bury the light... / Don't give in / Without a fight. (Pink Floyd)
-
- Posts: 2
- Joined: 2006-04-28 14:37
- Location: Lithuania
- Contact:
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
Pinchiukas wrote: in raid arrays I think the whole array is thought of by the os as a single partition
Yes, that's part of the definition of a RAID array, and it's the same whether it's done in hardware (like, say, a 3Ware controller) or in software (with a product like SoftRAID).
RAID arrays suffer the same hit from concurrently reading files that single drives do, of course.
-
- Posts: 52
- Joined: 2003-03-12 00:06
- Location: Zinzinnati
You know, another solution for these super-large shares is an external hashing application. Sure, DC++ can do the hashing, but it's not particularly optimized. With an external app, the user can do whatever is best for their own system: if you have 5 disks, you can fire up 5 processes; or if your 3 IDE channels are the limiting factor, you can fire up 3 processes. The app would then write out the hash index and XML files. As an alternative to a multi-process approach, it could be multithreaded.
LA
You are an idiot! Repent!
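A driver for such an external hasher might just build one command line per disk, so the user decides how many processes run in parallel. Everything below (`exthash`, its `--out` flag, the index file names) is invented for illustration; no such tool exists.

```cpp
#include <string>
#include <vector>

// Hypothetical: one hasher command per disk root, each writing its own
// index file, so the user can launch as many processes as they have
// spindles (or IDE channels).
std::vector<std::string>
build_hash_commands(const std::vector<std::string>& disk_roots,
                    const std::string& index_dir) {
    std::vector<std::string> cmds;
    for (size_t i = 0; i < disk_roots.size(); ++i) {
        cmds.push_back("exthash \"" + disk_roots[i] + "\" --out \"" +
                       index_dir + "/hashindex" + std::to_string(i) +
                       ".xml\"");
    }
    return cmds;
}
```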
lordadmira wrote:U know, another solution for these super large shares is an external hashing application. Sure DC can do it but it's not optimized particularily. With an external app, the user can do whatever is best for their own system. If u have 5 disks, u can fire up 5 processes. Or if ur 3 IDE channels are the limiting factor, u can fire up 3 processes. The app will then write out the hash index and xml files. As an alternative to a multi process approach, it can be multi threaded.
LA
I can see this as a "hashing plugin": since the hash format is standardised, I'd propose moving all hashing code out of DC++ and designing an API for a hashing plugin. Then a separate project could concentrate its efforts on making hashing faster and more reliable, while the DC++ developers use the API and the latest/most stable/preferred version/mod of the hashing plugin.
Really, that would be best. Somebody might even do a full rewrite once the API is agreed upon. Can't be that hard?
I'd really like the idea of downloading a more recent hash plugin DLL and just replacing it in my DC++ directory to get additional speed. Hashing code bloats the poor DC++ source tree; DC++ deals with network connections and sharing, not with maths/hashes. At the least, we should follow the GNU philosophy of numerous separate tools/libs doing _ONE_ thing perfectly well, rather than having zillions of poor/incomplete/buggy "features".
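A hashing-plugin API along these lines could be a small C-style table of entry points resolved from the DLL. Every name below (`HashPlugin`, its members, the version check) is hypothetical; DC++ defines no such interface, and a real host would fill the table via `GetProcAddress`/`dlsym`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical plugin ABI: plain function pointers so the table can be
// populated from any DLL, regardless of compiler.
struct HashPlugin {
    const char* (*name)();                        // e.g. "tth-sse2"
    uint32_t    (*api_version)();                 // reject incompatible DLLs
    void*       (*create)();                      // allocate hash state
    void        (*update)(void*, const uint8_t*, size_t);
    void        (*finalize)(void*, uint8_t out[24]);  // a TTH root is 24 bytes
    void        (*destroy)(void*);
};

// The host only checks the version, then streams file data through the
// plugin without caring which algorithm or optimization level it uses.
bool host_accepts(const HashPlugin& p, uint32_t expected_version) {
    return p.api_version != nullptr && p.api_version() == expected_version;
}
```

The version check is the important design point: it lets users drop in a newer DLL for speed, while the host safely refuses plugins built against a different ABI.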
-
- DC++ Contributor
- Posts: 3212
- Joined: 2003-01-07 21:46
- Location: .pa.us
Hello, hashing is a pain in the ass for people using EXTERNAL USB HDDs.
It's the 4th time in 4 months that I'm hashing 20 GB over USB 1.1, after some other USB device (digital camera, virtual drive from Daemon Tools, etc.) occupied the drive letter my external drive was using.
The hashing system used now is stupid, to say the least! It detects nearly nothing by itself, like the same file tree structure having moved to another drive...
Please do something about it, soon! Or the fake file lists will begin to spread all over...