Re: [dcdev] file list, regexp, and mailing list
2004-01-23 3:49
Direct Connect developers <[email protected]>, Fredrik Tolf <[email protected]>

I'm all for .xml.bz2. I don't even see a reason for a binary file
list. If it truly is smaller, it won't be more than a few bytes
considering the bzip2 compression of XML. Therefore, I agree with XML
since it's much more accepted everywhere.

I agree, XML is probably better because the share list is mainly composed of text, and converting sizes into text won't waste much memory (and moreover will spare us from dealing with CPU endianness).
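
To make the endianness point concrete, here is a minimal sketch in Python. The element and attribute names (FileListing, File, Name, Size) and the helper functions are made up for illustration and are not a proposed format. Sizes written as decimal text round-trip the same way on any CPU, while a raw 64-bit integer only reads back correctly if both sides agree on the byte order.

```python
import bz2
import struct
import xml.etree.ElementTree as ET

def build_filelist(files):
    """Serialize (name, size) pairs to bzip2-compressed XML (hypothetical layout)."""
    root = ET.Element("FileListing")
    for name, size in files:
        # The size is stored as decimal text, so byte order never comes up.
        ET.SubElement(root, "File", Name=name, Size=str(size))
    return bz2.compress(ET.tostring(root, encoding="utf-8"))

def read_filelist(blob):
    """Parse the compressed XML back into (name, size) pairs."""
    root = ET.fromstring(bz2.decompress(blob))
    return [(f.get("Name"), int(f.get("Size"))) for f in root.iter("File")]

files = [("song.ogg", 4200000), ("movie.avi", 734003200)]
assert read_filelist(build_filelist(files)) == files

# A binary list has to fix the byte order explicitly: the same size
# packed little-endian and big-endian gives different bytes.
little = struct.pack("<Q", 734003200)
big = struct.pack(">Q", 734003200)
assert little != big
```

Reading the big-endian bytes with the little-endian format string yields a different number, which is the class of bug a text size avoids.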

 > About regexp library choice, I'd say the support for wide charsets
 > should not only be considered, but required. Regex++ supports it, that's
 > all I know for now.

Indeed, it should be that way. However, it's not usually a
problem. I'm not sure how Windows works in this area, but on *ix
systems, filenames are still stored as 8-bit byte strings, encoded
using the character set of the current locale. Therefore, when a regex
comes in on the protocol with UTF-8, and it cannot be converted into a
multi-byte string in the current locale's charset, that would
constitute an automatic false expression, since if the regex contains
characters that aren't in the locale's charset, then no filenames can
exist which contain those characters anyway.

I don't 100% agree. On *nix, filenames are stored as byte strings, as you said, but using the current locale's charset is just a de facto standard because most programs work that way. AFAIK, nothing prevents a program from using UTF-8 encoded filenames.
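
Both positions can be sketched in a few lines of Python. The function name match_names and its parameters are hypothetical, and Python's re module on byte strings merely stands in for whichever regex library gets chosen. The pattern arrives as UTF-8, is converted to the locale's charset, and a failed conversion is treated as "no match", as in the quoted proposal.

```python
import re

def match_names(pattern_utf8, names, charset):
    """Match a UTF-8 regex against filenames stored as raw byte strings.

    charset is the assumed locale charset of the filenames (hypothetical
    parameter; a real client would take it from the locale settings).
    """
    pattern = pattern_utf8.decode("utf-8")
    try:
        local = pattern.encode(charset)
    except UnicodeEncodeError:
        # The pattern uses characters the locale charset cannot express,
        # so it is treated as matching nothing.
        return []
    rx = re.compile(local)
    return [n for n in names if rx.search(n)]

names = [b"caf\xe9.txt",      # "café.txt" encoded as Latin-1
         b"caf\xc3\xa9.txt"]  # "café.txt" encoded as UTF-8

pattern = "café".encode("utf-8")
latin1_hits = match_names(pattern, names, "latin-1")
utf8_hits = match_names(pattern, names, "utf-8")
```

Under a Latin-1 locale only the first name matches, and the UTF-8 encoded name is silently missed even though it spells the same word. That is the case I have in mind: the locale charset is only an assumption about how the bytes on disk were written.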


DC Developers mailinglist