Monday, December 04, 2006

Unicode is a pain in the @$$

I was playing around with some code-y things, and wanted to find a simple, cross platform way to deal with Unicode strings. Good luck!

The closest thing I could find for C/C++ was the
ICU library from IBM. While its VERY complete and functionlly rich - oh my god is it huge and complicated!

In the past, I've had to write my own (simple) functions to deal with really exactly the cases I needed, but, this being 2006, I thought I could find some nice publicly available routines to deal with this.

Not so much.

There's definitely code out there, but either by license or implementation, its bound up in ways that make it not so useful... so...

...I think I'm going to have to roll my own on this (again), which I'll then release source code for publicly - something simple, but hopefully more useful than what I've seen. Anybody have/use anything they like better that's already out there, let me know please... save me the hassle :)

As far as I can tell, here's what it takes to decode a single UTF-8 character:

(
code snippet)


Ouy!! ...and its not even "fully" right for the invalid/stict case - there are invalid bit distributions within the byte group ranges.

And UTF-16 isn't much better... it's a little less code, but FAR more complicated in logic. Originally, it was intended to be 1 code point per unit (an unsigned short), but that become untenable over (a short amount of) time, and so they introduced support for multibyte surrogates, just like UTF-8. But because it had not orginally been designed for UTF-16, it used some odd ranges to support them (one of many reasons I don't like it). So THAT code, to decode a code point from a UTF-16 stream, looks like this:

(
code snippet)


And all this just to decode ONE codepoint... of course, if you've ever actually looked at the algorithmic complexity of converting a floating point number to a string (even most veteran code monkeys haven't)... well, you start to appreciate the value of a good code library.

Anyway, I want length, count, cat, copy, compare (case sensitive and not), at least, and probably one or two other classes of functions (conversions betwixt UTF types certainly, but that's reasonably there already, if a little "stricter" than necessary). Locale code pages I can do without .... I'm looking for "pure" Unicode functionality in a lightweight package.


Pointers welcome.

Labels: ,