RE: [dcdev] adc
Fredrik Tolf
2004-01-22 1:01
Direct Connect developers

Jacek Sieka writes:
> > On Tuesday 20 January 2004 20:00, Jacek Sieka wrote:
> > > A) because the vast majority of users don't have the slightest idea of what
> > > a regular expression is
> >
> > but the majority does not mean everybody. The majority of people have feet but
> > this does not have prevented automobile creation ;-)
> The majority of people do know how to drive though (in the western part of
> the world where the majority has a car anyway)...and as sandos noted; all
> the regex searches you provided in your tests were achievable with substring
> search as well...

I use regexes very frequently to filter search results, and I do that
because I can achieve things that I cannot achieve with substring
searches. I think it would be great to have regex searching in the
client: if the filtering happened on the remote side, fewer irrelevant
results would be sent, and so fewer relevant ones would be lost when my
modem discards UDP packets because my bandwidth is exceeded.

I don't understand why it matters whether the majority of users will
be able to use regexes or not. Since regexes allow a set of achievable
search criteria that is a strict superset of what substring searches
allow for, while sacrificing almost no speed (see below), regex
searching is, quite simply, very, very good.
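
To make the superset claim concrete, here is a minimal sketch in C,
assuming POSIX <regex.h> is available. The helper names are made up for
illustration and are not taken from any client. Any substring query can
be turned into an equivalent regex by escaping the metacharacters,
whereas the reverse is not possible in general.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: turn a literal substring query into an equivalent
 * POSIX extended regex by backslash-escaping the characters that POSIX
 * defines as special outside a bracket expression.
 * Returns 0 on success, -1 if `out` is too small. */
static int literal_to_ere(const char *literal, char *out, size_t outsz)
{
    static const char meta[] = ".[\\()*+?{|^$";
    size_t o = 0;

    if (outsz == 0)
        return -1;
    for (const char *p = literal; *p != '\0'; p++) {
        if (strchr(meta, *p) != NULL) {
            if (o + 1 >= outsz)
                return -1;
            out[o++] = '\\';
        }
        if (o + 1 >= outsz)
            return -1;
        out[o++] = *p;
    }
    out[o] = '\0';
    return 0;
}

/* Returns 1 if the extended regex `pattern` matches anywhere in `name`,
 * 0 if it does not, and -1 if the pattern fails to compile. */
static int ere_matches(const char *pattern, const char *name)
{
    regex_t re;
    int hit;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    hit = (regexec(&re, name, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}

/* A plain substring search, expressed through the regex engine. */
static int substring_via_regex(const char *needle, const char *name)
{
    char pattern[512];

    if (literal_to_ere(needle, pattern, sizeof pattern) != 0)
        return -1;
    return ere_matches(pattern, name);
}
```

With this, substring_via_regex("v1.0", name) should give the same answer
as strstr(name, "v1.0") != NULL, while a pattern such as
^linux-2\.[46]\..*\.tar\.(gz|bz2)$ expresses anchoring and alternatives
that no single substring can.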

> > > B) because it would require a 3rd party library (or very much work) to
> > > create a BASE compliant client on the biggest target platform, i e it's not
> > > part of the C standard library
> >
> > I am not sure about this. When I do a "man regex", it is classified in man(3)
> > and it is in POSIX.2. Moreover, there is no libregex anymore since ... a long
> > time :)
> Being in posix.2 means, among other things, that it's not covered by the
> standard c library...it also means that it's not available on windows, and
> although you might think that linux is the answer to everything, I doubt
> that your client comes even close to the >1000000 downloads that dc++ had in
> its second latest version (and that being from sf only, not counting the
> countless mirrors and freeware CDs it's being distributed through)...hence,
> the major dc platform does not support regexes without library support...

As someone mentioned earlier on this list, regexes _are_ available for
Windows from http://gnuwin32.sf.net/. I don't use Windows, so I can't
give any specifics, but I guess that it's a library that can be
directly included in the source code for a program, just like, for
example, bzip2 has been included in DC++.

> > > C) because they are slower (indexed searches become tricky for instance)
> >
> > It is not slower (see earlier mails in the mailing list for test results) and
> > indexed searches are mainly a matter of index organization.
> It is. Even some of your simple searches took ~0.1s, imagine what a more
> complex one would do that actually uses the enhanced expressiveness of
> regexes (start by introducing a few | and you'll see...), and if it takes
> 0.1s it means that the highest search throughput will be 10 searches/s with
> 100% cpu, which is not very impressive, not in my eyes anyway...

First of all, that was 0.01 seconds, and it was also on a file list
of around 150000 lines, which is _far_ more than the average user,
whom you seem so fond of, has.

As for more complex regexes, I egrepped through 400000 lines of the
Linux networking code, and even using backrefs, without a doubt the
most CPU-intensive part of regexes, I simply couldn't manage to push
it beyond 0.25 seconds, and that was on very many lines. It is also
very important to note that nothing prevents a client implementation
from dropping a search once it has reached a certain time threshold
of, for example, 0.05 seconds, which prevents people from submitting
complex expressions just for the DoS fun of it. Since regex substring
searches are as fast (or almost as fast) as ordinary substring
searches, normal users, who don't use regexes, won't even be impaired
by such behavior.
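
As a rough sketch of such a threshold, assuming a C client with POSIX
<regex.h>: the function name, the return codes, the 256-entry check
interval and the case-insensitive flag are all my own choices for
illustration. Note that the clock is only checked between file list
entries, so this bounds the scan as a whole rather than interrupting a
single regexec() call.

```c
#include <regex.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical search routine: counts the entries of `names` that match
 * `pattern`, but gives up once more than `budget_s` seconds of CPU time
 * have been spent on the scan.
 * Returns the number of matches, -1 if the pattern does not compile,
 * or -2 if the budget was exceeded and the search was dropped. */
static long search_with_budget(const char *pattern,
                               const char *const *names, size_t count,
                               double budget_s)
{
    regex_t re;
    clock_t start;
    long hits = 0;

    /* REG_ICASE is an assumption about how a client would want to match. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;
    start = clock();

    for (size_t i = 0; i < count; i++) {
        /* Look at the clock every 256 entries to keep the overhead low. */
        if ((i % 256) == 0) {
            double used = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (used > budget_s) {
                regfree(&re);
                return -2;
            }
        }
        if (regexec(&re, names[i], 0, NULL, 0) == 0)
            hits++;
    }
    regfree(&re);
    return hits;
}
```

A client could call this with budget_s set to 0.05 and simply send no
results when it returns -2.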

> > > D) because if there's demand for it, it's very, very easy to add to the
> > > protocol and mandate by the hub (using sup)
> >
> > then why not add it immediately if it is so easy :)
> Because it adds complexity to a BASE client...if anything should be added,
> it's hashes...

Considering how regexes are available as a separate library and all, I
don't see how using regexec() instead of strstr() makes it so much
more complex. In that case, I'd say that e.g. filelist compression adds
much more complexity, wouldn't you agree? It is nonetheless very
useful, just like regex searches are.
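
For comparison, this is roughly what the two variants look like side by
side in C with POSIX <regex.h>. The function names are hypothetical and
this is only a sketch, not code from any existing client. The regex
version needs one extra compile step per incoming search, after which
regexec() takes the place of strstr() for each file name.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Substring version: one strstr() call per file name. */
static int name_matches_substring(const char *name, const char *query)
{
    return strstr(name, query) != NULL;
}

/* Regex version, step 1: compile the query once per incoming search.
 * Returns 0 on success, -1 if the pattern is invalid. On success the
 * caller must release `re` with regfree() when the search is done. */
static int compile_query(regex_t *re, const char *query)
{
    return regcomp(re, query, REG_EXTENDED | REG_NOSUB) == 0 ? 0 : -1;
}

/* Regex version, step 2: one regexec() call per file name. */
static int name_matches_regex(const char *name, const regex_t *re)
{
    return regexec(re, name, 0, NULL, 0) == 0;
}
```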

Also, may I ask why you want such an over-simplified base

Fredrik Tolf
