Re: [dcdev] Re: [dddev] Searching
2004-01-16 4:02
Direct Connect developers <[email protected]>, Carl-Adam Brengesjö <[email protected]>

---- HUGE (windows) ----

 >RegexTest.exe huge.txt ".*microsoft.*"

       file: huge.txt
    pattern: .*microsoft.*
Reading... 121028 lines read
OK! reading took 13,25 seconds.
Beginning regex test of against lines in memory (121028 lines to test)
Test completed. 76 matches where found.
The search took 12,484375 seconds!

Well, I should admit I have the fastest computer in the world (perhaps in the universe :) ). I did not use your program to run the test; I used a simple (but wonderful) shell command:
grep '.*microsoft.*' < huge.txt
and to be more exact:
time grep '.*microsoft.*' < huge.txt
to measure the run time. My computer is a P4C 2.8 GHz with 1 GB of RAM, running Linux 2.4.22. At the end of the run, I get 76 matches like you, but the search takes only 0.01 s. I see only the following possible reasons:
1) windoz sucks :) and linux rules, but even so, I don't think this explains the gap (12.48 s vs 0.01 s is roughly a factor of 1250)
2) I have a faster CPU (let's say 2 or even 3 times faster than yours).
3) the regex library you use is slow.
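Point 3 is easy to test: `.*microsoft.*` matches exactly the same lines as a plain substring search for `microsoft`, which is why grep can run it so fast. A quick sketch in Python (the corpus and timings here are made up for illustration):

```python
import re
import time

# Build a synthetic corpus: many lines, two of which contain "microsoft".
lines = ["nothing interesting here %d" % i for i in range(100000)]
lines[500] = "visit microsoft for details"
lines[70000] = "microsoft again"

pat = re.compile(r".*microsoft.*")

t0 = time.perf_counter()
regex_hits = sum(1 for ln in lines if pat.search(ln))
t1 = time.perf_counter()
substr_hits = sum(1 for ln in lines if "microsoft" in ln)
t2 = time.perf_counter()

print(regex_hits, substr_hits)        # both approaches find the same 2 lines
print("regex:  %.3f s" % (t1 - t0))
print("substr: %.3f s" % (t2 - t1))
```

The substring scan is typically several times faster than the naive regex loop, and a smart engine spots the equivalence and does the substring scan for you.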
I think it is even possible to go faster using file mapping, but that is already optimization :)
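For the curious, the file-mapping idea looks roughly like this. A self-contained sketch using Python's mmap module (the tiny stand-in file replaces huge.txt, which I obviously don't have here):

```python
import mmap
import os
import re
import tempfile

# Create a small stand-in for huge.txt so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "huge.txt")
with open(path, "wb") as f:
    f.write(b"first line\nmicrosoft here\nlast line\nmicrosoft again\n")

with open(path, "rb") as f:
    # Map the whole file read-only; the kernel pages it in on demand,
    # so there is no explicit read loop and no per-line copying.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    matches = len(re.findall(rb"microsoft", mm))
    mm.close()

print(matches)  # 2
```

The win over a read loop is mostly avoiding buffer copies; for a file read once front to back the difference is modest, which is why I call it optimization.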

To do the comparison, I have run:
time fgrep 'microsoft' < huge.txt
(fgrep only does simple string search, no regex).
and the result is good... it takes about the same time (0.01 s, or even 0.00 s because the execution is too fast to measure).

Finally, I ran a bigger test (3 lines of code :) ). I wrote a small Perl program that does the same search:
while (<>) {
  print if /.*microsoft.*/;
}
The run time (including Perl startup, and Perl is big :) ) is between 0.07 and 0.09 s. This clearly means Perl-style expressions are usable.

I think that instead of removing features from a powerful search model, there is a simpler solution: why couldn't a client just discard some search queries when it is overloaded?
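A minimal sketch of that idea (my own names, nothing from any real client): remember when recent queries arrived, and silently drop new ones once too many fall inside a short window.

```python
import collections
import time

class SearchThrottle:
    """Discard incoming search queries when more than `limit`
    arrived within the last `window` seconds."""

    def __init__(self, limit=20, window=1.0):
        self.limit = limit
        self.window = window
        self.arrivals = collections.deque()

    def allow(self, now=None):
        """Return True if this query should be processed, False to drop it."""
        now = time.monotonic() if now is None else now
        # Forget arrivals that have aged out of the window.
        while self.arrivals and now - self.arrivals[0] > self.window:
            self.arrivals.popleft()
        if len(self.arrivals) >= self.limit:
            return False          # overloaded: drop this query silently
        self.arrivals.append(now)
        return True

# Demo with explicit timestamps: 3 queries per second allowed.
throttle = SearchThrottle(limit=3, window=1.0)
results = [throttle.allow(now=t) for t in (0.0, 0.1, 0.2, 0.3, 1.5)]
print(results)  # [True, True, True, False, True]
```

Dropping is safe in DC because search has no delivery guarantee anyway; an overloaded client just looks like a quiet one.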


DC Developers mailinglist