Unicode

Technical discussion about the NMDC and ADC (http://dcpp.net/ADC.html) protocols. The NMDC protocol is documented in the Wiki (http://dcpp.net/wiki/), so feel free to refer to it.

Moderator: Moderators

Locked

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 08:24

To clarify: this was a request for comments. Please comment. :)

Sedulus
Forum Moderator
Posts: 687
Joined: 2003-01-04 09:32

Post by Sedulus » 2003-07-20 09:02

this would mean that nicknames and myinfo content would be forced utf8
(which I do not have a problem with, per se)
http://dc.selwerd.nl/hublist.xml.bz2
http://www.b.ali.btinternet.co.uk/DCPlusPlus/index.html (TheParanoidOne's DC++ Guide)
http://www.dslreports.com/faq/dc (BSOD2600's Direct Connect FAQ)

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 11:56

Well, it could be. I haven't discussed it or thought it through. It might be a good idea to require hub support for those things; that way the hub can tell every joining client that it should use utf8 in descr and nick.

If you want to make a client that doesn't need any hub support, it's conceivable (imo) to give the GUI some way of manually tagging a user as using utf8 or not. The user will probably know, after all, and if not she can just try, which should clear things up (hopefully; otherwise your nick/descr is seriously screwy).

Making a heuristic to detect utf8 might also be possible, but it will probably not be 100% reliable. Utf8 text with a lot of, for example, Asian characters will contain a very high proportion of bytes above 127.
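To make the heuristic idea concrete, here is a minimal sketch of one possible check: try a strict UTF-8 decode and treat failure as "some 8-bit codepage". The function name `guess_encoding` and the three labels are made up for illustration, and cp1252 is only an assumed example of a legacy codepage. The last case shows why this cannot be 100% reliable.

```python
def guess_encoding(raw: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'legacy' (some 8-bit codepage)."""
    if all(b < 0x80 for b in raw):
        return "ascii"  # pure 7-bit text reads the same either way
    try:
        raw.decode("utf-8")  # strict by default: raises on any malformed sequence
    except UnicodeDecodeError:
        return "legacy"
    return "utf-8"


# A nick typed in a single-byte codepage usually fails strict UTF-8 decoding.
print(guess_encoding("björn".encode("cp1252")))  # legacy
# The same nick sent as UTF-8 passes.
print(guess_encoding("björn".encode("utf-8")))   # utf-8
# False positive: these two cp1252 characters happen to form a valid UTF-8 sequence.
print(guess_encoding("Ã¶".encode("cp1252")))     # utf-8
```

A failed decode is strong evidence against utf8, but a successful one is only weak evidence for it, which is the guesswork part.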

Sedulus
Forum Moderator
Posts: 687
Joined: 2003-01-04 09:32

Post by Sedulus » 2003-07-20 12:31

uhm.. I don't like heuristics and guesswork, either a hub forces utf8 or it doesn't

it would mean that if someone tries to name himself björn (in an ibm charset) on a hub, he'd be rejected because of the illegal 0x94, and even worse.. those characters could get you booted when used in a MyINFO (the description, for instance), and this would confuse the hell out of people (read: arne wouldn't like it, for the same reason he rejected the original $UserIP)
...unless you were to break the utf8 spec and allow illegal sequences (correct me if I'm wrong on the specs, I merely glanced at the docs)
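The björn case can be reproduced in a few lines. This is only an illustration using Python's built-in cp437 codec as the "ibm charset"; the helper `is_valid_utf8` is a made-up name, and a hub that enforced utf8 would need some check of this kind.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


nick = "björn"

ibm = nick.encode("cp437")   # old IBM PC codepage; ö becomes the single byte 0x94
print(ibm)                   # b'bj\x94rn'

# 0x94 is a UTF-8 continuation byte (10xxxxxx) with no lead byte before it,
# so a strict decoder refuses the whole nick.
print(is_valid_utf8(ibm))                   # False
print(is_valid_utf8(nick.encode("utf-8")))  # True
```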

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 12:41

Sedulus wrote: (correct me if I'm wrong on the specs, I merely glanced at the docs)
The hub has no reason whatsoever to actually parse the utf8, afaik. To it, it's just a stream of bytes with no | or $ in it.
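The reason a hub can stay ignorant is that every byte of a multi-byte UTF-8 sequence has its high bit set, so encoding non-ASCII text never produces a stray | (0x7C) or $ (0x24). A small sketch of that property follows; `hub_safe` is a hypothetical helper, not anything from an existing hub.

```python
RESERVED = {ord("|"), ord("$")}  # 0x7C and 0x24, the bytes the hub cares about


def hub_safe(payload: bytes) -> bool:
    """True if the payload contains neither reserved byte."""
    return not any(b in RESERVED for b in payload)


# Non-ASCII characters encode to bytes that are all >= 0x80.
non_ascii = "öäå日本語€".encode("utf-8")
print(all(b >= 0x80 for b in non_ascii))          # True

# So a UTF-8 string is hub-safe exactly when its ASCII part is.
print(hub_safe("björn 日本語 €".encode("utf-8")))  # True
print(hub_safe("5$ | 10$".encode("utf-8")))       # False
```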

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 12:42

Oops, error on my part. The spec doesn't seem to specify what to do with illegal codes. Different implementations also seem to take different routes (dropping the entire string, or dropping only the invalid code).
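The different routes are easy to see side by side. As an illustration only, Python's decoder exposes three of them through its `errors` argument: reject the whole string, drop the invalid byte, or substitute U+FFFD. The wrapper `decode_with` is a made-up name.

```python
from typing import Optional


def decode_with(raw: bytes, policy: str) -> Optional[str]:
    """Decode raw as UTF-8 under the given error policy; None means rejected."""
    try:
        return raw.decode("utf-8", errors=policy)
    except UnicodeDecodeError:
        return None


raw = b"bj\x94rn"  # 'björn' in an IBM codepage, not valid UTF-8

for policy in ("strict", "ignore", "replace"):
    print(policy, "->", repr(decode_with(raw, policy)))
# strict  : the entire string is rejected (None)
# ignore  : only the invalid byte is dropped, leaving 'bjrn'
# replace : the invalid byte becomes U+FFFD, the replacement character
```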

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 13:08

Clients will have to deal with decoding errors; hubs won't. I was confused, I blame the heat.

Anyway, the problem is a client one. Clients can easily use utf8 without hub support, and without the hub even knowing. The problem is merely how to communicate the encoding in use to other clients; decoding problems arise when a client wrongly believes the encoding is utf8 when it's not. A decoding failure could itself serve as a flag that the other client isn't using utf8; it should be a sure sign. It would be nice to have a better method, though.

We could use a bit of the speedbyte in the descr for this; I believe there are a few bits left?
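A rough sketch of the "decoding failure as a flag" idea: assume each peer sends utf8 until one of its strings fails to decode, then remember that and fall back to a local codepage. The `Peer` class, its field names, and the cp1252 fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    nick_raw: bytes
    utf8: bool = True        # optimistic until a decode error proves otherwise
    legacy: str = "cp1252"   # assumed fallback codepage, chosen locally

    def decode(self, raw: bytes) -> str:
        if self.utf8:
            try:
                return raw.decode("utf-8")
            except UnicodeDecodeError:
                self.utf8 = False  # sure sign: this peer is not sending UTF-8
        # errors="replace" because some byte values are undefined in cp1252
        return raw.decode(self.legacy, errors="replace")


peer = Peer(nick_raw=b"bj\xf6rn")         # nick sent in a single-byte codepage
print(peer.decode(peer.nick_raw))         # björn
print(peer.utf8)                          # False, and it stays that way
```

The flag only works in one direction: a failure proves the peer is not using utf8, while a success proves nothing, which is why a better method would still be nice.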
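If a spare bit really is available, the signalling itself is trivial. In this sketch the value `0x10` is purely hypothetical (nothing in this thread says which bits are free), and the function names are invented for the example.

```python
UTF8_FLAG = 0x10  # hypothetical bit; assumes it is unused, not taken from any spec


def set_utf8(flags: int) -> int:
    """Return the flag byte with the UTF-8 bit switched on."""
    return flags | UTF8_FLAG


def uses_utf8(flags: int) -> bool:
    """True if the sender has marked its nick/descr as UTF-8."""
    return bool(flags & UTF8_FLAG)


flags = 0x01                    # some existing flag value
marked = set_utf8(flags)
print(uses_utf8(flags))         # False
print(uses_utf8(marked))        # True
print((marked & 0x01) == 0x01)  # True: the existing bit is left untouched
```

Old clients that ignore unknown bits would keep working, but a client that compares the whole byte against fixed values would not, so this would need checking against real clients.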

sandos
Posts: 186
Joined: 2003-01-05 10:16

Post by sandos » 2003-07-20 13:12

Or we could couple utf8 in chat with utf8 in nick and descr, forcing them to go together. This way, a client using utf8 will appear garbled until it chats, at which point the other clients catch the 0xA0 and flag that this client is using utf8, ungarbling the nick and descr.
