Optimized search

Technical discussion about the NMDC and ADC (http://dcpp.net/ADC.html) protocols. The NMDC protocol is documented in the Wiki (http://dcpp.net/wiki/), so feel free to refer to it.


yilard
Posts: 66
Joined: 2003-01-11 06:04
Location: Slovakia

Optimized search

Post by yilard » 2003-01-20 16:46

Optimized Search Algorithm Design
by Vladimir 'Yilard' Marko

What is this about?
--------------------------
I consider the current file search algorithm to be one of the biggest weaknesses of DC++ (current version 0.22). This document describes a replacement/improvement for this algorithm. I have also appended my partial implementation.

Motivation
----------------------
The current file search algorithm is one of the most CPU-consuming aspects of DC++. The time complexity of the algorithm depends roughly on the number of search requests coming from users on the hubs I'm logged into, and on the number of shared files.

The dependency is clear: the more users on the hubs where I'm logged in, the more search requests per second. In fact the DC++ process tends to put almost 100% CPU load on my system when I'm logged into more than 5 hubs. My system is an Athlon 900 MHz with 512 MB RAM and Windows XP, with 160 GB+ of share.

With today's numbers of hub users, which exceed 1000 in some cases, this problem becomes a real pain. I have verified the assumption that searches make up the majority of this load by disabling search command processing: the DC++ CPU load quickly dropped from 100% to less than 5% (measured with Windows Task Manager). As I don't believe in disabling search completely, I started to analyze the problem more in depth.

What's wrong with the current search algorithm
---------------------------------------------------------------
First of all, the original algorithm is more or less a linear search and has been designed to return matching results as quickly as possible. But according to the Search Spy, less than 10% of search requests are hits. This forces the program to go through all file records in 90% of cases.

Smart algorithm (Reversing the priorities)
---------------------------------------------------------------
I have set my main goal not to return matches as quickly as possible, but to discover as quickly as possible that there are no matching file records. Apparently, it's not possible to apply some kind of search-tree-like technique (which would turn the linear search into something like O(log n) complexity), because we need to do substring matching.

I have already started with the Search Spy window, so I will continue looking at what people are searching for. When you look at it closely, you will realize that about 90% of requests contain file extensions, and these belong to a very limited set (mp3, bin, avi, mpeg, zip...).

My idea is to split the set of file records into several subsets according to whether they contain certain substrings. These substrings may be defined as the extensions from the limited set I have mentioned above. So we would have a subset of files containing mp3, a subset of files containing bin, etc. These subsets may have intersections, which isn't important now. We can then detect these significant strings in the search request string and limit the searching to the particular subset.
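Something along these lines is what I have in mind (only a rough sketch in C++; the names PIVOTS, pivotMask, etc. are made up for illustration and are not taken from the DC++ sources):

Code:

    #include <algorithm>
    #include <cctype>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Illustrative set of "pivot" substrings (common extensions).
    static const std::vector<std::string> PIVOTS = {
        "mp3", "avi", "mpeg", "mpg", "zip", "rar", "bin", "iso"
    };

    static std::string toLower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        return s;
    }

    // Bit i of the result is set if PIVOTS[i] occurs in the (lowercased) name.
    static uint32_t pivotMask(const std::string& name) {
        std::string lower = toLower(name);
        uint32_t mask = 0;
        for (size_t i = 0; i < PIVOTS.size(); ++i) {
            if (lower.find(PIVOTS[i]) != std::string::npos)
                mask |= (1u << i);
        }
        return mask;
    }

    int main() {
        // A file only needs to be examined for a request whose pivot bits
        // are all present in the file's mask.
        uint32_t file = pivotMask("Britney Spears - Toxic.mp3");
        uint32_t request = pivotMask("britney spears mp3");
        bool candidate = (file & request) == request;
        std::cout << (candidate ? "search this file" : "skip this file") << '\n';
    }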

It may also be smart to arrange the list of searched substrings so that the longest string is searched first, because the chance of a mismatch is greatest with long search strings. But I haven't thought this idea through yet.
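The reordering I mean is nothing more than sorting the remaining search terms by descending length before matching, roughly like this (sketch only):

Code:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Try the longest term first: a long substring is the least likely to
    // match, so a mismatch is usually detected after a single find().
    void sortLongestFirst(std::vector<std::string>& terms) {
        std::sort(terms.begin(), terms.end(),
                  [](const std::string& a, const std::string& b) {
                      return a.size() > b.size();
                  });
    }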

Maybe it would be interesting to conduct some kind of survey by observing which search strings people are looking for, but I think my point is obvious just from a look at the Search Spy.

Simplified algorithm
-------------------------------
I am not used to programming in C++ and I'm completely confused by things like the STL (as a Java and Delphi programmer I am not that interested anyway). Also, I didn't want to break things too much and make CVS imports of DC++ releases and synchronizing with my project too complicated, so I decided just to make smaller modifications to the original algorithm.

Initialization: I have set up several groups based on the most common file extensions; I call them pivots. Then I search for pivots as substrings in the name of each file, remembering which pivots were found (attribute mask). I remember the same thing for each directory (dirNameMask). Additionally, there is a record for each directory which carries the set of all pivot groups to which all subdirectories, and files in subdirectories, of that particular directory belong (subDirMask).
Actual searching: First extract the pivot groups to be searched, by looking for pivot substrings in the search request. The pivot substrings are then removed from the search request string list (if possible). Then traverse the directory structure recursively, but only enter subdirectories that contain files belonging to all the pivot groups to which the search request belongs. Additionally, each file and directory is compared first by pivot groups (please see the code) and only then by searching for substrings.
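To make the description above concrete, here is a rough, self-contained sketch of the traversal (the Directory/File structures and field names are simplified stand-ins for illustration, not the actual DC++ ShareManager code; case handling is omitted for brevity):

Code:

    #include <cstdint>
    #include <string>
    #include <vector>

    struct File {
        std::string name;
        uint32_t mask = 0;          // pivot groups found in the file name
    };

    struct Directory {
        std::string name;
        uint32_t dirNameMask = 0;   // pivot groups found in the directory name itself
        uint32_t subDirMask = 0;    // union of the masks of everything below this directory
        std::vector<Directory> dirs;
        std::vector<File> files;
    };

    static bool containsAll(const std::string& name, const std::vector<std::string>& terms) {
        for (const std::string& t : terms)
            if (name.find(t) == std::string::npos)
                return false;
        return true;
    }

    // requestMask: pivot groups detected in the search request.
    // terms: remaining search terms (pivot substrings removed where possible).
    void search(const Directory& d, uint32_t requestMask,
                const std::vector<std::string>& terms,
                std::vector<const File*>& results) {
        // Prune: nothing below this directory belongs to all requested pivot groups.
        if ((d.subDirMask & requestMask) != requestMask)
            return;

        for (const File& f : d.files) {
            // Cheap mask test first; substring matching only for surviving files.
            if ((f.mask & requestMask) == requestMask && containsAll(f.name, terms))
                results.push_back(&f);
        }
        for (const Directory& sub : d.dirs)
            search(sub, requestMask, terms, results);
    }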

I designed the algorithm with the idea in mind that people usually have files organized into directories according to content (Mp3s, Movies, Install, etc.). This algorithm then allows skipping whole directory subtrees which certainly don't contain the searched files.
I have implemented this simplified new algorithm in my modified DC++ client.

Results
-------------
Note that this is not yet thoroughly tested, but I have used it with my friends and it seems to work well. All tests were done with the simplified algorithm. I have tested with as many as 18 hubs, and the DC++ CPU load peaks don't exceed 50% while staying under 20% most of the time (I have about a 160 GB share with many mp3s, txt files and rar archives). I have measured over 40 search requests per second.

I did some artificial benchmarks by searching for selected strings and found that the reduction in search time varies from 50% to 98%, depending on the search request.

As you can see, the proposed algorithm is faster than the original when people search for strings containing pivots (something like "britney spears mp3", "matrix 2 avi", etc. :P ) and doesn't help at all when the request contains no pivots (random nonsense strings). In the latter (worst) case the algorithm may be a bit slower, but the slowdown doesn't exceed 2%. Please note that implementing the complete algorithm instead of the simplified one would allow much better performance.

I would like to hear your opinions on speeding up the search algorithm. Maybe it's time to put DC++ into feature freeze, as it is already pretty functional, and deal with those performance issues (e.g. search, overblown threads, etc.).

I can supply the source code in case of serious interest. But please take the attached sources only as an example; I know that what I have done is a mess (I haven't coded in C++ for some 5 years and I'm really not up to date with current libraries :P ). I don't want anyone to use clients modified with my search on public hubs (except me and friends I trust)!! This is due to the possibility of bugs that may harm other users.

I'm sure that an experienced Windows C++ coder would do a much better job implementing my design. Do you think any experienced developer could put some time into implementing the proposed algorithm properly (or an even better algorithm)? Feel free to ask if my explanations are unclear, or if you have any further questions.

You can contact me on the DC++ Chat Hub; my nick is [sk]yilard.
In the age of super-boredom/hype and mediocrity/celebrate relentlessness/menace to society --KMFDM

Gumboot
Posts: 22
Joined: 2003-01-15 20:08
Location: New Zealand

Post by Gumboot » 2003-01-20 20:47

Nice idea! I wonder, though, why so many search requests have extensions? I know I don't include the extension in searches I make... Could it be the "Automatically search for alternates" feature in DC++?

aDe
Forum Moderator
Posts: 138
Joined: 2003-01-07 09:14
Location: SE
Contact:

Post by aDe » 2003-01-21 02:12

Gumboot wrote:Nice idea! I wonder, though, why so many search requests have extensions? I know I don't include the extension in searches I make... Could it be the "Automatically search for alternates" feature in DC++?
Yes.

arnetheduck
The Creator Himself
Posts: 296
Joined: 2003-01-02 17:15

Post by arnetheduck » 2003-01-22 08:41

Heh, I got this as a mail as well a week ago, but I didn't have time to look into it. Anyway, now it's here in the forums and that's good... (I even answered the mail now =)... Anyway, this is the best-researched patch/addition I've received so far (well, there's one more of this caliber that will show up in DC++ soon...). Keep up the good work (I wish more people did it like this... not just a thoughtless "I want that / this / whatever").

Anyway, the idea is sound and I'll add it to (the next?) version, probably with some modifications of mine... For instance, could someone have a look at how many searches have the type set to something other than "all types" (i.e. movies, documents and so on)? It would probably also make sense to make the autosearch specify a type when it searches; this way a lot of searches could be filtered out at an early stage...

ivulfusbar
Posts: 506
Joined: 2003-01-03 07:33

Post by ivulfusbar » 2003-01-22 15:58

(best thread so far in this forum)

He has a good point. I have actually also thought a little bit about this, and this is what I had in mind.

My thought was to have a set for each data type used by the protocol: audio files, compressed files, documents, executables, pictures, video, folders, and one for any.
If someone searches for a compressed file, you would only need to search that set (and maybe the "any" set).

My second observation is this: when DC++ autosearches for files it doesn't use the ability to set the data type in $Search; it always sends 1, meaning any file. This, combined with the above, would certainly speed things up.
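Something like this is what I mean (only a rough sketch; the type numbering is how I understand the $Search data type field - 1 any, 2 audio, 3 compressed, 4 documents, 5 executables, 6 pictures, 7 video, 8 folders - and the extension lists are just examples, not taken from the DC++ sources):

Code:

    #include <map>
    #include <string>
    #include <vector>

    // One example extension list per $Search data type that maps to file names.
    static const std::map<int, std::vector<std::string> > TYPE_EXTENSIONS = {
        {2, {".mp3", ".ogg", ".wav", ".wma"}},   // audio
        {3, {".zip", ".rar", ".ace", ".gz"}},    // compressed
        {4, {".doc", ".txt", ".pdf", ".htm"}},   // documents
        {5, {".exe", ".com"}},                   // executables
        {6, {".jpg", ".gif", ".png", ".bmp"}},   // pictures
        {7, {".avi", ".mpg", ".mpeg", ".asf"}},  // video
    };

    // Decide quickly whether a file can possibly satisfy a typed search,
    // before doing any substring matching at all.
    static bool matchesType(const std::string& fileName, int dataType) {
        if (dataType == 1)                 // "any" - no early filtering possible
            return true;
        auto it = TYPE_EXTENSIONS.find(dataType);
        if (it == TYPE_EXTENSIONS.end())   // folders etc. handled elsewhere
            return true;
        for (const std::string& ext : it->second) {
            if (fileName.size() >= ext.size() &&
                fileName.compare(fileName.size() - ext.size(), ext.size(), ext) == 0)
                return true;
        }
        return false;
    }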

these-are-only-some-small-thoughts-i-have-had-about-search-and-i-agree-with-arne-that-this-how-a-feature-request-should-look-like-ly'ers ;))
Everyone is supposed to download from the hubs - I don't know why, but I never do anymore.

arnetheduck
The Creator Himself
Posts: 296
Joined: 2003-01-02 17:15

Post by arnetheduck » 2003-01-30 07:38

Alright, I added the search thing in a slightly different form... without the mask for each file, since that takes quite a bit of memory if there are a lot of files (the map<string, int64> would have to be replaced by a map<string, fileinfo>, fileinfo being a struct with more info...). I hope this is enough; could someone perform some profiling and see if searches are any faster?
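For anyone following along, the tradeoff is roughly this (FileInfo is just an illustrative name here, not the actual type in the sources):

Code:

    #include <cstdint>
    #include <map>
    #include <string>

    // Current layout: file name -> size only.
    typedef std::map<std::string, int64_t> SizeMap;

    // With a per-file pivot mask the value type grows; over a few hundred
    // thousand shared files the extra bytes per entry add up.
    struct FileInfo {
        int64_t size;
        uint32_t pivotMask;   // which pivot groups the file name belongs to
    };
    typedef std::map<std::string, FileInfo> FileInfoMap;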
