Ämne:
RE: [dcdev] adc
Från:
[email protected]
Datum:
2004-01-22 7:04
Till:
Direct Connect developers


At 01:01 AM 1/22/2004 +0100, you wrote:
I don't understand why it matters whether the majority of users will
be able to use regexes or not. Since regexes allow a set of achievable
search criteria that is a strict superset of that which substring
searches allows for, without sacrificing almost any speed (see below),
it is, quite simply, very, very good.

Whilst regexes are strictly more powerful than the multiple substring searches DC++ uses, unless one has a & operator (rare) or Perl5's positive lookahead, one cannot implement them without an increase in length proportional to the permutations of the terms in the query:

Multiple substring: a b c d

Extended POSIX-style: a.*b.*c.*d | a.*b.*d.*c | a.*c.*b.*d | a.*c.*d.*b | a.*d.*b.*c | a.*d.*c.*b | b.*a.*c.*d | b.*a.*d.*c | b.*c.*a.*d | b.*c.*d.*a | b.*d.*a.*c | b.*d.*c.*a | c.*a.*b.*d | c.*a.*d.*b | c.*b.*a.*d | c.*b.*d.*a | c.*d.*a.*b | c.*d.*b.*a | d.*a.*b.*c | d.*a.*c.*b | d.*b.*a.*c | d.*b.*c.*a | d.*c.*a.*b | d.*c.*b.*a

Perl5-compatible regex: (?=a)(?=b)(?=c)(?=d)

All of the above are untested.

As for more complex regexes, I egrepped through 400000 lines of the
Linux networking code, and even using backrefs, without a doubt the
most CPU-intensive part of regexes, I simply couldn't manage to push
it beyond 0.25 seconds, and that was on very many lines. It is also
very important to note that nothing prevents a client implementation
from dropping a search once it has reached a certain time threshold
of, for example, 0.05 seconds, which prevents people from submitting
complex expressions just for the DoS fun of it.

Such protection would be necessary, as http://www.cs.rice.edu/~scrosby/hash/slides/USENIX-RegexpWIP.2.ppt shows that even removing backrefs entirely doesn't immunize a RE engine from attack. 400,000 lines requiring 0.25 seconds would suggest that, for example, 100,000 lines requires 0.0625 seconds, which allows only 16 searches per second before one reaches 100% CPU usage. Even your example of 0.05 seconds allows only 20/second. I don't want a DC client I run monopolizing near that much CPU time.

Another factor you haven't mentioned is your CPU: what is it?

Since regex substring
searches are as fast (or almost as fast) as ordinary substring
searches, normal users, who don't use regexes, won't even be impaired
by such behavior.

As I demonstrated above, not without special support from the RE engine; that 24-way or'd expression is going to be noticeably slower than the semantically equivalent multiple substring search. The Perl5 method should be similar in speed, but requires an NFA RE implementation exhibiting unbounded-time behaviour and thus potentially vulnerable to the attacks described above.

Finally, using regular expressions inhibits one from applying many potential search optimizations, such as Bloom filters (used in Gnutella as the QRP, or Query Routing Protocol) and suffix array-assisted searches. A histogram of the speed results of the latter is at http://opennavel.dnsalias.org/~fusbar/cologic/bmratio2.png ; the y axis represents how many searches (out of approximately 20,000 total) were faster or slower under that scheme than DC++'s default search by the (inverse of natural log of) the x-axis value.

-cologic
-- 
DC Developers mailinglist
http://3jane.ashpool.org/cgi-bin/mailman/listinfo/dcdev