Re: [dcdev] adc
2004-01-22 7:15
Direct Connect developers <[email protected]>, [email protected]

Such protection would be necessary, as
http://www.cs.rice.edu/~scrosby/hash/slides/USENIX-RegexpWIP.2.ppt shows
that even removing backrefs entirely doesn't immunize a RE engine from
attack. If 400,000 lines require 0.25 seconds, then, for example, 100,000
lines require 0.0625 seconds, which allows only 16 searches per second
before reaching 100% CPU usage. Even your example of 0.05 seconds allows
only 20/second. I don't want a DC client I run monopolizing anywhere near
that much CPU time.
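The attack class referenced above can be illustrated with a toy backtracking matcher (a hypothetical sketch for illustration, not code from any DC client or RE library): matching the text "a"*n against the pattern a?{n} a{n} has no backreferences at all, yet forces a naive backtracking engine through an exponential number of states before the one successful path (every "a?" matching empty) is found.

```python
# Toy backtracking matcher showing why even backref-free patterns can be
# attacked: matching "a"*n against a?{n} a{n} explores an exponential
# number of dead ends before the all-skip path finally succeeds.

def backtrack_match(tokens, text, i=0, j=0, steps=None):
    """tokens: list of ('lit', c) for a literal c, or ('opt', c) for c?.
    Returns True iff tokens match text exactly; counts calls in steps[0]."""
    if steps is not None:
        steps[0] += 1
    if i == len(tokens):
        return j == len(text)
    kind, c = tokens[i]
    if kind == 'lit':
        return (j < len(text) and text[j] == c and
                backtrack_match(tokens, text, i + 1, j + 1, steps))
    # 'opt': greedily try to consume c first, then try skipping it --
    # the same order a naive NFA-backtracking engine uses.
    if (j < len(text) and text[j] == c and
            backtrack_match(tokens, text, i + 1, j + 1, steps)):
        return True
    return backtrack_match(tokens, text, i + 1, j, steps)

def pathological(n):
    """Pattern a?{n} a{n} versus text a{n}; returns (matched, call count)."""
    tokens = [('opt', 'a')] * n + [('lit', 'a')] * n
    steps = [0]
    return backtrack_match(tokens, 'a' * n, steps=steps), steps[0]

if __name__ == '__main__':
    for n in (4, 8, 12):
        matched, steps = pathological(n)
        print(n, matched, steps)  # step count grows exponentially with n
```

Each additional "a?" roughly doubles the work, which is exactly why per-query time limits or engine-level step bounds are needed rather than pattern screening alone.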

Another factor you haven't mentioned is your CPU: what is it?

a simple blacklist of users whose queries take too much time would be enough, no?

As I demonstrated above, not without special support from the RE engine;
that 24-way or'd expression is going to be noticeably slower than the
semantically equivalent multiple-substring search. The Perl5 method should
be similar in speed, but requires an NFA RE implementation, which exhibits
unbounded-time behaviour and is thus potentially vulnerable to the attacks
described above.
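The comparison above can be made concrete with a small sketch (hypothetical code, not any client's implementation; the "24-way or'd expression" for four terms is the 4! orderings joined with '|'): a plain substring test per term does the same job for typical word queries without invoking an RE engine at all.

```python
import re
from itertools import permutations

def substring_search(terms, name):
    """Multiple-substring search: every term must occur somewhere in name."""
    return all(t in name for t in terms)

def permutation_regex_search(terms, name):
    """The or'd RE equivalent: one branch per ordering of the terms
    (4 terms -> 4! = 24 branches), each of the form 'a.*b.*c.*d'."""
    pattern = '|'.join('.*'.join(map(re.escape, p))
                       for p in permutations(terms))
    return re.search(pattern, name) is not None

if __name__ == '__main__':
    terms = ['heaven', 'place', 'earth']
    for name in ['belinda carlisle - heaven is a place on earth.mp3',
                 'earth, wind and fire - heavenly place.mp3',
                 'some other file.mp3']:
        print(substring_search(terms, name),
              permutation_regex_search(terms, name))
```

For terms that can overlap inside the text the two are not strictly identical, but for whitespace-separated word queries against filenames they agree, and the substring version stays linear in the filename length per term.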

There is a little flaw in your demonstration. You have assumed that when a client searches for "a b c d", it means "a" & "b" & "c" & "d", but that is wrong. In terms of probability, "a.*b.*c.*d" is most likely what the user wants. For example, a user searching for an album won't bother writing the album title in reverse order. That's why, most of the time, "a b c d" means "a.*b.*c.*d" (when it is not exactly "a b c d"). Any user with enough experience of any search engine (even Google) knows not to use simple words like "a", "the", or "is"; to find "heaven is a place on earth", he will search for "heaven place earth" (which is even shorter to type :) ).
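The ordered interpretation argued for here is just the single-ordering case (a small sketch under that assumption, not anyone's actual client code): join the query terms with '.*' and search once, so terms must appear in the order typed.

```python
import re

def ordered_search(query, name):
    """Interpret the query "a b c d" as the RE a.*b.*c.*d:
    the terms must appear in name in the order they were typed."""
    pattern = '.*'.join(map(re.escape, query.split()))
    return re.search(pattern, name.lower()) is not None

if __name__ == '__main__':
    name = 'Belinda Carlisle - Heaven Is a Place on Earth.mp3'
    print(ordered_search('heaven place earth', name))  # True
    print(ordered_search('earth place heaven', name))  # False: wrong order
```

The second query fails precisely because the ordered reading rejects reversed term order, matching the album-title intuition described above.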

Finally, using regular expressions prevents one from applying many
potential search optimizations, such as Bloom filters (used in Gnutella as
the QRP, or Query Routing Protocol) and suffix array-assisted searches. A
histogram of the speed results of the latter is at
http://opennavel.dnsalias.org/~fusbar/cologic/bmratio2.png ; the y axis
represents how many searches (out of approximately 20,000 total) were
faster or slower under that scheme than under DC++'s default search, by
the (inverse natural log of the) x-axis value.
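A minimal sketch of the Bloom-filter idea mentioned above (hypothetical parameters, not the actual QRP wire format): hash each indexed word into a fixed bit array; a query word whose bits aren't all set is definitely absent, so whole file lists can be ruled out without ever running a search, let alone an RE engine. Regular expressions defeat this because an arbitrary pattern can't be reduced to a fixed set of word hashes.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, small false-positive rate."""
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions from one 16-byte MD5 digest.
        digest = hashlib.md5(word.encode()).digest()
        for k in range(self.num_hashes):
            chunk = digest[4 * k:4 * k + 4]
            yield int.from_bytes(chunk, 'big') % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # False => word was definitely never added; True => probably added.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

if __name__ == '__main__':
    bf = BloomFilter()
    for word in ['heaven', 'place', 'earth']:
        bf.add(word)
    print(bf.might_contain('heaven'))  # True: never a false negative
    print(bf.might_contain('zzzxq'))   # very likely False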

optimizing something is good but optimizing something which is limited is IMHO just a waste of time. Moreover, finding something is good, being able to send it to the client requesting it in a reasonable period of time is better. Not everybody has access to a 100Mb connection, most of the users have ADSL or cable connections and if they take 5-10 minutes to send/receive their reply, I think they won't use this P2P network.