RE: [dcdev] Re: New Encoding Scheme First

Zdenek Stangl writes:
> > Might I ask what you want the FF FF start word for?
>
> For the case of total data distortion. I think it's always better
> to have some 'anchor' for cases, when data goes inconsistent or
> mangled, than none.
>
> > I just find it a bit unnecessary that you are into it as far as
> > eliminating a single instruction for bit-shifting, when you still need
> > to broadcast packets to 1000+ users, going through the kernel's TCP
> > assembly code and everything. To me, that optimization seems far less
> > than necessary, since your program will be blocking most of the time
> > anyway, waiting for the kernel to flush network data.
> >
> > Especially so since you still have to word-split the data to process
> > it.
>
> Doing word-splitting for every incoming data on the hub-side isn't
> really a good idea. For the current DC protocol, I'm having a custom
> parsing routine for every single command, looking only for tokens or
> specific parts that I really need for the processing. I'm also
> avoiding data copying/moving as much as possible and believe me,
> all this has helped to gain performance of ptokax by approximately
> 15% in 0.330 version in comparison with currently available 0.326,
> which uses string classes and copies data a lot.

Well, I can understand that there was a huge performance gain if you
were using C++ string classes before. Those don't only move and copy
data, they even malloc during operation, so it's no wonder you
experienced a boost when abandoning them. However, it's fairly easy to
manipulate text data in place in C to do word splitting and
dequoting. The only thing that demands data moving is backslash
removal, but see below.

> And that's just the parsing mechanism. All sockets are in
> non-blocking mode - when the underlying TCP kernel would block, I'm
> buffering data by myself - no blocking, no threads.

Oh yes, sorry for being unclear. I hadn't expected you to use blocking
sockets. I was mainly thinking about the time you have to wait until
you can send again.

> There is still something missing in our discussion, to have it
> constructive. Try to come out with alternatives, if you can't agree
> with my draft, please.

OK, for some reason I thought my protocol idea was clear. Now that you
mention it, I don't know where I got that from... ;-)

I suggest using a line-based protocol, with CRLF line termination (the
CRLF can also be thought of as doing the same job as the FF FF that
you suggest, except that it can also be quoted). Within lines, words
are separated by whitespace (I was planning on simply using isspace(3)
for detecting whitespace, but I can agree to only allowing ASCII 32
spaces instead for efficiency). As for quoting, I suggest allowing
both double quotes and backslash escaping, for good reasons. Double
quote escaping is limited to whole words only - i.e. no sub-word
quoting like a"b c"d instead of "ab cd". Backslashes can quote
anything.

The only thing that would require data movement is backslash
removal. As for all else, just insert NULs where you want them. Since
double quotes can quote everything but themselves, and words are
unlikely to contain double quotes or backslashes very often,
backslash removal will probably be a rather rare procedure (yes, I
want pathnames to consist of slashes, not backslashes). Also, if you
want to optimize it on the hub side, you can simply choose not to
dequote words that you don't need to look at.
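
A minimal sketch of the in-place splitting described above might look
like the following. The function name split_line and its signature are
hypothetical, and a few details are assumptions rather than part of the
proposal: the CRLF has already been stripped, only ASCII 32 separates
words, backslashes are literal inside double quotes, and a bare double
quote inside an unquoted word is rejected.

```c
#include <stddef.h>

/*
 * Hypothetical helper: split `line` in place into at most `max_words`
 * NUL-terminated words. Returns the word count, or -1 on malformed
 * input or when there are too many words.
 */
int split_line(char *line, char **words, int max_words)
{
    int n = 0;
    char *p = line;

    for (;;) {
        while (*p == ' ')               /* skip separators */
            p++;
        if (*p == '\0')
            return n;
        if (n == max_words)
            return -1;

        if (*p == '"') {
            /* Quoted word: no data movement, just drop in a NUL. */
            char *start = ++p;
            while (*p != '\0' && *p != '"')
                p++;
            if (*p != '"')
                return -1;              /* unterminated quote */
            if (p[1] != '\0' && p[1] != ' ')
                return -1;              /* sub-word quoting not allowed */
            words[n++] = start;
            if (p[1] == '\0') {
                *p = '\0';
                return n;
            }
            *p = '\0';                  /* overwrite the closing quote */
            p++;                        /* now at the following space */
        } else {
            /* Bare word: bytes only move once a backslash was seen. */
            char *src = p, *dst = p;
            int at_end;

            words[n++] = p;
            while (*src != '\0' && *src != ' ') {
                if (*src == '"')
                    return -1;          /* assumption: must be escaped */
                if (*src == '\\') {
                    src++;
                    if (*src == '\0')
                        return -1;      /* dangling backslash */
                }
                *dst++ = *src++;
            }
            at_end = (*src == '\0');    /* check before dst overwrites it */
            *dst = '\0';
            if (at_end)
                return n;
            p = src + 1;
        }
    }
}
```

A hub that only needs the command name could stop after the first word
and leave the rest of the line untouched.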

Also, if you must insist on a binary protocol, I for one would like to
see that it handles all the words in a binary manner as well. Each
command would begin with (possibly a start word, then) a word count,
and each word would begin with a character count. Example (big
endian, 16-bit counters, no start word):

00 03 00 04 "Chat" 00 14 (20 character UUID) 00 0E "Hi everyone!\r\n"
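
As a rough illustration of that layout, here is a sketch of an encoder
and a zero-copy decoder. The names word_ref, encode_command and
decode_command are hypothetical, the counters are treated as byte
counts, and the sketch assumes no start word and non-NULL word data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one length-prefixed word inside a buffer. */
struct word_ref {
    const uint8_t *data;
    uint16_t len;
};

static void put_u16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);           /* big endian: high byte first */
    p[1] = (uint8_t)(v & 0xff);
}

static uint16_t get_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Returns the number of bytes written, or 0 if `out` is too small. */
size_t encode_command(uint8_t *out, size_t cap,
                      const struct word_ref *words, uint16_t count)
{
    size_t pos = 2;
    uint16_t i;

    if (cap < 2)
        return 0;
    put_u16(out, count);
    for (i = 0; i < count; i++) {
        if (cap - pos < 2 || cap - pos - 2 < words[i].len)
            return 0;
        put_u16(out + pos, words[i].len);
        memcpy(out + pos + 2, words[i].data, words[i].len);
        pos += 2 + (size_t)words[i].len;
    }
    return pos;
}

/*
 * Returns the number of bytes consumed, or 0 if `in` does not yet hold
 * a complete command or the command has more than `max_words` words.
 * The decoded words point into `in`; nothing is copied.
 */
size_t decode_command(const uint8_t *in, size_t len,
                      struct word_ref *words, uint16_t max_words,
                      uint16_t *count)
{
    size_t pos = 2;
    uint16_t i, n;

    if (len < 2)
        return 0;
    n = get_u16(in);
    if (n > max_words)
        return 0;
    for (i = 0; i < n; i++) {
        uint16_t wlen;

        if (len - pos < 2)
            return 0;
        wlen = get_u16(in + pos);
        if (len - pos - 2 < wlen)
            return 0;
        words[i].data = in + pos + 2;
        words[i].len = wlen;
        pos += 2 + (size_t)wlen;
    }
    *count = n;
    return pos;
}
```

With non-blocking sockets, a return of 0 from decode_command would
simply mean "keep buffering and try again after the next read".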

Of course, in both protocols, all human readable words should be
encoded in UTF-8, if that wasn't already obvious.

Fredrik Tolf

--
DC Developers mailinglist
http://3jane.ashpool.org/cgi-bin/mailman/listinfo/dcdev