Re: [dcdev] Re: New Encoding Scheme First
Fredrik Tolf
2003-12-01 7:32
Direct Connect developers <[email protected]>

eric writes:
> I don't agree with you, even a semi-binary header can speed up
> stream processing for a hub (and also the client) but only if it is
> the very thing to come (before any fourcc). Currently, when data
> arrives, the only way to know if a full command line is received is
> to search for '|' again and again until the full line is
> received. If a binary header contains the block size, you just
> have to compare the value with the size of the buffer of incoming
> data.
> The size in the header can also be optimized to be stored in 1 byte
> (as small as the "|") and for bigger commands, we can use an escape
> code like 0 or 255 to tell the size is stored in 2 bytes (like in
> x86 instruction set) (it is

Or even better, use UTF-8 encoding straight-off. Or simply send
everything over a bzip2 compressed channel.

> probably better to keep 0 as connection pulse to speed up detection
> of broken connection).
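The size-prefix scheme quoted above (one byte normally, with an escape byte announcing a 2-byte size, and 0 reserved as a keepalive) could be sketched like this. This is purely illustrative; the function name, the choice of 255 as the escape, and big-endian order for the 2-byte form are my assumptions, not part of any proposal:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of decoding the proposed size prefix: one byte in the common
 * case, with 255 as an escape meaning "size follows in two bytes"
 * (big-endian, by assumption). Returns the number of prefix bytes
 * consumed, or 0 if more data must arrive first. */
size_t decode_size_prefix(const uint8_t *buf, size_t len, uint16_t *size)
{
    if (len < 1)
        return 0;
    if (buf[0] != 255) {        /* common case: size fits in one byte */
        *size = buf[0];
        return 1;
    }
    if (len < 3)                /* escape seen, 2-byte size not yet here */
        return 0;
    *size = (uint16_t)((buf[1] << 8) | buf[2]);
    return 3;
}
```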

A semi-binary protocol with only a line length won't make a very large
difference, since you will still have to search for word
separators. If you integrate the word splitter with the receiving
logic, you don't have to search through the entire receive buffer
every time anyway (just the newly received data), so a pre-sent
binary length will make only very little difference in that case.
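Such an integrated receiver can be sketched in a few lines (all names and the fixed buffer size are invented here): by remembering how far the buffer has already been scanned, each call only inspects newly arrived bytes when looking for '|'.

```c
#include <stddef.h>
#include <string.h>

/* Minimal sketch of an integrated receiver that never rescans old
 * data for the '|' terminator. */
struct cmdbuf {
    char   data[4096];
    size_t len;      /* bytes buffered so far */
    size_t scanned;  /* bytes already checked for '|' */
};

/* Append newly received bytes; return the length of a complete
 * command (excluding the '|') if one is now present, 0 otherwise.
 * A real implementation would grow the buffer and handle errors. */
size_t cmdbuf_feed(struct cmdbuf *b, const char *in, size_t n)
{
    if (b->len + n > sizeof(b->data))
        return 0;
    memcpy(b->data + b->len, in, n);
    b->len += n;
    while (b->scanned < b->len) {
        if (b->data[b->scanned] == '|')
            return b->scanned;  /* command is data[0..scanned) */
        b->scanned++;
    }
    return 0;
}
```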

If you want to be able to optimize the command parser to those
lengths, you will need a fully binary protocol, for example having
each command block begin with a word count, and then every word begin
with a character count.
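As a sketch of that fully binary layout (the exact field widths are my assumptions): a command block starts with a word count, and each word is a length byte followed by its characters, so the parser never searches at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative parser for a word-count/character-count binary layout:
 * [nwords][len0][chars...][len1][chars...]...
 * Fills word[]/wlen[] with pointers into buf; returns the number of
 * words, or -1 if buf is truncated or too many words arrive. */
int parse_binary_cmd(const uint8_t *buf, size_t buflen,
                     const uint8_t *word[], size_t wlen[], size_t maxwords)
{
    size_t pos = 0;
    if (buflen < 1)
        return -1;
    size_t nwords = buf[pos++];
    if (nwords > maxwords)
        return -1;
    for (size_t i = 0; i < nwords; i++) {
        if (pos >= buflen)
            return -1;
        size_t n = buf[pos++];
        if (pos + n > buflen)
            return -1;
        word[i] = buf + pos;
        wlen[i] = n;
        pos += n;
    }
    return (int)nwords;
}
```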

Admittedly, binary protocols do have advantages; it is simply a matter
of which way you want to go. That is a choice we will have to make
before designing a new protocol, and personally, I am in favor of a
text-based protocol.

> In your previous mail, you also speak about text based protocols being
> easier to debug because we can use telnet but there is some problem here:
> 1) you cannot use telnet on DC protocol because telnet sends the command
> after a "return" and because the end of command is "|" in DC, you will
> have a LF in the buffer at the beginning of the next command

That's not true. Windows 9x telnet (I haven't run Windows since then,
so I don't know about the new one) sends data as it comes in, and UNIX
telnet also sends data as it arrives on stdin. It is the kernel that
buffers data until it gets CR, and if you send it an EOF character
(^D), the tty driver will flush the line buffer so that telnet sends
it. You can also turn off icanon on the tty before debugging. I have
used that to debug DC sessions, though I've had to copy the key
from a calculation program.

In any case, I (and, as I understand it, I'm not the only one) was
hoping that the new protocol would have standard Internet CRLF command
terminators, which would solve the problem regardless.

> 2) the debug status is only temporary (or it is a very bad protocol
> with bad programmers).

That is a valid argument, though. However, it's not just for
debugging. I find that it's often useful to be able to interact with
software from a remote location for a number of reasons, like to check
status and whatever without having to install a client. It also makes
traffic monitoring easier (unless encryption is used). If used with
compression, it won't use up too much bandwidth either.

> 3) unlike us, clients and hubs always know the size of the data to
> send before sending them because they have built the command.

Which is precisely why we can't interact with nodes in that case. But
like I said before, the choice between a text based or binary protocol
is simply a stance we have to take. Shall we hold a poll?

> About fourcc usage, I am not sure we can gain a lot here because there
> is some problem to take into account. Comparing 4 characters is always
> faster than comparing string of unknown size but to be very fast, we
> must compare the 4 bytes simultaneously (one 32bits value). However, in
> this case, we must define the endianness of the protocol (big endian is
> better, like any portable network protocol (TCP, IP,...) else we will
> have PC clients incompatible with MAC one (ABCD on PC is displayed DCBA
> on mac). Only a onecc can solve the problem without conversion :)

Like Zdenek said in his first mail, endianness isn't a problem; after
all, that's what the htonX functions are for.
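To illustrate the point: by building the fourcc constant through htonl, the same single 32-bit compare works on both big- and little-endian hosts. The macro and function names here are made up for the example:

```c
#include <arpa/inet.h>  /* htonl */
#include <stdint.h>
#include <string.h>

/* Build a fourcc value in network byte order, so comparing it against
 * four raw bytes from the wire is endian-independent. */
#define FOURCC(a, b, c, d) \
    htonl(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
          ((uint32_t)(c) << 8)  |  (uint32_t)(d))

/* Compare the first four bytes of a received command in one shot;
 * memcpy avoids unaligned access on strict architectures. */
int is_cmd(const char *buf, uint32_t cc)
{
    uint32_t v;
    memcpy(&v, buf, 4);
    return v == cc;
}
```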

In any case, I would be pleased to see fourcc commands in a text
protocol since they're faster to type. I don't think that we have to
strictly confine ourselves to fourcc commands, on the other hand, but
rather use them by preference and if possible.

On the other hand, with a binary protocol, why use character based
commands at all? Just using numeric IDs would be more than enough. You
can also define an X11-like protocol, adding extensibility to the
protocol by using protocol extension modules.

As a last note, I haven't ever profiled a hub program, so I don't know
how much CPU time is spent in the command parser, but I would imagine
that command multiplexing and retransmission take far more CPU
time. I don't know how efficient the hubs' parser algorithms are
either, but I'd imagine that with a stateful receiver, the hub
would have far worse things to worry about. But then again, I haven't
actually profiled a hub, so I don't know for sure. Can anyone who has
profiled a hub give some input here?

There is a third option: to have protocol modules. It seems that the
protocol is to be word-based in any case, so it's just a matter of
writing the word splitters. While I wouldn't like to see this become
reality, it would be more generic in the sense that the client and hub
could negotiate parser stacks, i.e. I would use only textparser when
connecting with telnet, while a client might use
decrypt->decompress->binparser to receive data. You could also include
multibyte translators that way: decompress->binparser->utf8trans or
textparser->eucjptrans. Just run the stack backwards to send
data. Then again, though, like I said, I would not be pleased to see
this become reality.
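The negotiated parser-stack idea could be sketched as chained transform layers. Everything here (names, the fixed scratch buffer, the toy uppercasing layer) is illustrative, not a real DC implementation:

```c
#include <stddef.h>

/* One layer of a negotiated stack, e.g. decrypt, decompress, or a
 * parser front-end; each transforms a buffer and hands the result
 * to the next layer down. */
struct layer {
    size_t (*process)(const char *in, size_t n, char *out, size_t outsz);
    struct layer *next;
};

/* Run received data through the whole stack, top to bottom;
 * sending would run the layers in the reverse order. */
size_t stack_run(struct layer *top, const char *in, size_t n,
                 char *out, size_t outsz)
{
    char tmp[4096];
    if (!top)
        return 0;
    size_t m = top->process(in, n, top->next ? tmp : out,
                            top->next ? sizeof tmp : outsz);
    return top->next ? stack_run(top->next, tmp, m, out, outsz) : m;
}

/* Toy layer for demonstration: copy input, uppercasing ASCII letters. */
size_t upcase_layer(const char *in, size_t n, char *out, size_t outsz)
{
    if (n > outsz)
        n = outsz;
    for (size_t i = 0; i < n; i++)
        out[i] = (in[i] >= 'a' && in[i] <= 'z') ? in[i] - 32 : in[i];
    return n;
}
```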

In any case, we need to decide on either a fully binary or a fully text
protocol. There's no point at all in having some half-breed hybrid. Like
I've said, I believe in a text-based protocol, since I think that both
hubs and clients have far worse things than command parsing to care
about, but if the majority is against me, I will gracefully bow and
yield. Like I said, shall we hold a poll?

Fredrik Tolf

DC Developers mailinglist