Re: [dcdev] Searching

Carl-Adam Brengesjö writes:
> Hello!
> Alot of discussion has been focused on search commands and it's
> features. I don't think we need (nor should) do it too complex and
> with many features.
> [...]
> Here follows the syntax
> > "SEARCH <dest[:port]> <id> <mime> <size|#[type#]hash> :<pattern>\r\n"
> [...]
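
For reference, the quoted syntax could be parsed roughly as below. This
is only a sketch: the SearchRequest type, its field names and the
parse_search function are hypothetical, and it assumes the fields are
separated by single spaces and that the pattern starts after the first
" :".

```python
from typing import NamedTuple, Optional


class SearchRequest(NamedTuple):
    # Hypothetical container; the field names are not part of the proposal.
    host: str
    port: Optional[int]
    search_id: str
    mime: str
    size: Optional[int]        # set when the fourth field is a plain size
    hash_type: Optional[str]   # set for "#type#hash"
    hash_value: Optional[str]  # set for "#hash" and "#type#hash"
    pattern: str


def parse_search(line: str) -> SearchRequest:
    """Parse one line of the quoted SEARCH syntax (assumed layout)."""
    if not line.endswith("\r\n"):
        raise ValueError("line must end with CRLF")
    # Everything after the first " :" is the pattern and may contain spaces.
    head, sep, pattern = line[:-2].partition(" :")
    if not sep:
        raise ValueError("missing ' :' before the pattern")
    fields = head.split(" ")
    if len(fields) != 5 or fields[0] != "SEARCH":
        raise ValueError("expected SEARCH followed by four fields")
    _, dest, search_id, mime, size_or_hash = fields
    host, _, port = dest.partition(":")
    size = hash_type = hash_value = None
    if size_or_hash.startswith("#"):
        # "#hash" or "#type#hash"; the type part is optional.
        parts = size_or_hash[1:].split("#")
        if len(parts) == 2:
            hash_type, hash_value = parts
        elif len(parts) == 1:
            hash_value = parts[0]
        else:
            raise ValueError("malformed hash field")
    else:
        size = int(size_or_hash)
    return SearchRequest(host, int(port) if port else None, search_id,
                         mime, size, hash_type, hash_value, pattern)
```

The single fixed-position fields are what make this format hard to
extend: any new attribute changes the field count.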

In my opinion, there are two great disadvantages with doing this:

1. As eric stated in his mail, this approach is not easily
extensible. Say that you want to add an attribute later that could
be searched for (not that I can think of one right now - if I
could, I would have liked to add it, of course). That could not
easily be done with an approach such as this, but very easily with
the approach that eric and I have almost agreed upon.

2. There is a very important reason to be able to specify very strict
search criteria. Since search results are delivered over UDP, which
has no delivery acknowledgments, relevant results will be dropped if
you search for something popular and cannot narrow the query.

I'm currently working on a protocol draft which will incorporate a
search pattern syntax that differs slightly from the one I
presented earlier, along with an example algorithm showing how to
parse and use it in a very efficient way. I have a test tomorrow,
though, so expect me to publish it this weekend, when you can have a
look yourself.

> No need for fancy if-statements and such in the search query. A
> user mostly just searches for a name anyway and goes through the
> results manually, it's the quickest and most userfriendly. Ok -
> it's simple to write a if-statement yourself.. but imagine a gui to
> that.. to be able to make a complex if statement without actually
> write the if statement it would require alot of controls. As per my
> example you only need a single text field. Pherhaps you like to
> some checkboxes to tell if you want it as a regex, wildcard of
> plain to ensure syntax before sending the request to save traffic,
> but this is optional.

That's not necessarily a problem. Since, as you said yourself, the
average user will rarely (if ever) use this functionality, the default
UI can very well display just a standard text box and insert the
pattern that is entered into a more complex search expression. Those
who actually want to use the more complex options can then simply
click a button to have whatever they type sent verbatim.

Just because _most_ users don't use it is no reason in my mind to ruin
the fun for those who would use it. It's an extremely useful facility,
since it allows (especially for automated clients) the bandwidth usage
to drop significantly. I'm sure hub owners would very much appreciate
it if passive clients used it.

> I know we are to discuss the actual protocol - but what's the point of
> making a protocol that is not usable?

How exactly would this make the protocol itself unusable?

> Also, you quickly want to find something.. dont want to sit 2minutes
> configuring a criteria when it only takes 30seconds to scroll through
> 100+ results.

Again (just to reiterate), the default UI would display just a single
text box that really just inserts whatever you type in it directly
into (printf format) "( N ~= %s )".
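
As a rough sketch of what I mean (the function name and the verbatim
flag are made up for illustration, and no escaping is done since the
quoting rules of the pattern syntax aren't settled yet):

```python
def build_expression(text: str, verbatim: bool = False) -> str:
    """Turn the contents of the search box into a search expression.

    In the default mode the text is inserted directly into the name
    match "( N ~= %s )". In verbatim mode (the "advanced" button) the
    text is assumed to already be a full expression and is sent as-is.
    """
    if verbatim:
        return text
    # No escaping is applied here; that is an open question, not a feature.
    return "( N ~= %s )" % (text,)
```

So the ordinary user still sees a single text field, and only those
who ask for it get to type the full expression.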

> another thing, using a particular type of hashing. simply set the
> hash/size argument to "#MD5#615b3de4b7a572679b457de271305229" for using
> MD5 the algorithm (of course, this requires a shared database of names
> of hashing types, but they shouldn't be too many to choose from).

That is clearly not a good idea. Hashing takes long enough as it is
(hashing my share of 160 GB takes several hours), and thus you clearly
don't want to hash it several times using different algorithms. The
client would probably keep chewing for a day while bringing your load
to 10 and ruining your buffer cache. Think especially of the poor guys
who share 600 GB or more - they would probably be crunching numbers
for a week. I don't think they'll appreciate that, which is bad, since
everyone on the network appreciates them very much.

Instead, we absolutely need to standardize a specific hashing
algorithm. I suggest SHA1, but it doesn't really matter. It would be
good if everyone could agree soon, though.
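
To illustrate the single-algorithm case, here is a minimal sketch of
hashing one shared file with SHA1 using Python's hashlib. The function
name and the 1 MiB chunk size are arbitrary choices for the example,
not part of any proposal.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA1 digest of a file, reading it in chunks.

    Reading in fixed-size chunks keeps memory use constant no matter
    how large the shared file is.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Every file has to be read in full once per algorithm, so supporting
several algorithms multiplies the disk I/O for the whole share.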

Also, the question remains how to deal with MIME types. Finding them
probably isn't that big a problem. On UNIX, just call "file -i" on the
file, and on Windows (although I'm not a Windows programmer, so I
don't know this for sure), I believe the MIME type is stored in the
registry for each file extension, right?

However, the detected MIME types differ from system to system and
from computer to computer, essentially. Of course, there are the
standardized ones, but the */x-* types aren't really very reliable, to
say the least. For example, if you want to search for AVI files alone
(if you don't want any MPEG ones because of the bad quality), how
would you do that? I have seen at least three different reported MIME
types for those.
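
One conceivable workaround is for clients to fold known aliases into
one canonical type before comparing, roughly as sketched below. The
alias set is only my illustrative guess at the variants in
circulation, not a complete or standardized list, and the function
name is made up.

```python
# Illustrative alias set only; real systems may report other variants.
AVI_ALIASES = {"video/x-msvideo", "video/avi", "video/msvideo"}


def canonical_mime(reported: str) -> str:
    """Normalize a reported MIME type for comparison.

    Strips parameters such as "; charset=binary" (as printed by
    "file -i"), lowercases the type, and maps the AVI aliases above
    to a single name.
    """
    mime = reported.split(";", 1)[0].strip().lower()
    return "video/x-msvideo" if mime in AVI_ALIASES else mime
```

This only works as long as every client carries the same alias table,
which is really the same standardization problem again.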

Also, MIME types don't always properly convey the file type. For
example, when it comes to AVI files, you aren't actually interested in
knowing that it's AVI - most often you want to know the format of the
video stream. I guess this isn't really an issue that a DC
implementation should even be required to handle. However, it could be
a good thing to be able to recover meta-data from files (such as AVI
or MP3), to, for example, filter out those pesky dubbed movies or
search for an artist using the ID3 tag. On the other hand, it would
make DC clients much more complex, which isn't necessarily a good
thing.

I don't want you to think badly of me for disapproving of your
suggestion, but as eric said, the search function is the single most
important feature of Direct Connect, and thus needs to be done very
carefully.

This is my opinion. However, discussion is always good, so please try
to convince me.

Fredrik Tolf

--
DC Developers mailinglist
http://3jane.ashpool.org/cgi-bin/mailman/listinfo/dcdev