Subject:
Re: [dcdev] Re: [dddev] Searching
From:
Fredrik Tolf
Date:
2004-01-16 4:07
To:
Direct Connect developers

Carl-Adam Brengesjö writes:
> Made a test for regex matching. Source code, filelists and binaries used
> are attached to this mail. If that doesn't work (don't know if attaching
> files on this mailing list works) they can be downloaded from
> <http://ptha.mine.nu/~ptha/regextest.tar.bz2>. Be nice on the server
> though, it's hosted on my personal home 0.5Mbit connection...
> [...]
> ---- HUGE (*nix) ----
> $ mono RegexTest.exe huge.bz2 ".*microsoft.*"
>        file: huge.bz2
>     pattern: .*microsoft.*
> Begin decompression... (`bzip2 -dc "huge.bz2"')OK!
> Reading...OK! reading took 36.705775 seconds.
> Beginning regex test of against lines in memory (121028 lines to test)
> Test completed. 76 matches where found.
> The search took 45.15112 seconds!
>
> ---- HUGE (windows) ----
>  >RegexTest.exe huge.txt ".*microsoft.*"
>        file: huge.txt
>     pattern: .*microsoft.*
> Reading... 121028 lines read
> OK! reading took 13,25 seconds.
> Beginning regex test of against lines in memory (121028 lines to test)
> Test completed. 76 matches where found.
> The search took 12,484375 seconds!

OK, I don't know what those Mono (or M$, for that matter) guys are
doing, but when I try egrepping through the 'net' subdir of my Linux
2.6.0 source tree, which is 404715 lines in total, egrep -i finishes
in 1.8 seconds of wall-clock time, using only 0.04 seconds of user
CPU time and 0.02 seconds of system time, so I'd say regexps are no
problem performance-wise:

$ find -type f -exec cat {} \; | wc
404715 1246842 14942600
$ find -type f -exec cat {} \; | time egrep -i '.*ipv4.*' | wc
0.04user 0.02system 0:01.76elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (399major+90minor)pagefaults 0swaps
  1408    8629  400709

I don't know about you, but I think that very much speaks for itself,
especially considering the overhead of testing it this way. It's
hardly going to be slower if you call the regex functions directly
instead of piping data back and forth, and especially if you feed
them the share cache from memory instead of going through tons of VFS
code (admittedly, I had pre-warmed the caches, but I think that's
only fair... :-) ).

If it takes 0.04 seconds of CPU time to regex-search 400000 lines, it
seems that not even the largest filelists should be a problem for a
properly optimized program.

Fredrik

-- 
DC Developers mailinglist
http://3jane.ashpool.org/cgi-bin/mailman/listinfo/dcdev