Unicode is a pain in the @$$
I was playing around with some code-y things and wanted to find a simple, cross-platform way to deal with Unicode strings. Good luck!
The closest thing I could find for C/C++ was the ICU library from IBM. While it's VERY complete and functionally rich - oh my god is it huge and complicated!
In the past, I've had to write my own (simple) functions to deal with exactly the cases I needed, but, this being 2006, I thought I could find some nice publicly available routines to deal with this.
Not so much.
There's definitely code out there, but either by license or implementation, it's bound up in ways that make it not so useful... so...
...I think I'm going to have to roll my own on this (again), which I'll then release source code for publicly - something simple, but hopefully more useful than what I've seen. Anybody have/use anything they like better that's already out there, let me know please... save me the hassle :)
As far as I can tell, here's what it takes to decode a single UTF-8 character:
(code snippet)
Oy!! ...and it's not even "fully" right for the invalid/strict case - there are invalid bit patterns within the byte group ranges (overlong encodings and encoded surrogates, for example).
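For illustration only (this is NOT the snippet referred to above), here is a minimal sketch of what a strict single-character UTF-8 decode can look like in C. The function name utf8_decode and its "return bytes consumed, 0 on error" convention are my own assumptions, not anything from a real library. It tries to cover the strict cases too: overlong forms, encoded surrogates, and values past U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and return convention are assumptions):
   decode one code point from the bytes s[0..len).
   Returns the number of bytes consumed, or 0 if the input is invalid. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    unsigned char b0;
    size_t need, i;     /* need = number of continuation bytes expected */
    uint32_t cp, min;   /* min = smallest code point legal at this length */

    assert(s != NULL && out != NULL);
    if (len == 0) return 0;

    b0 = s[0];
    if (b0 < 0x80)      { *out = b0; return 1; }          /* plain ASCII */
    else if (b0 < 0xC0) return 0;                         /* stray continuation byte */
    else if (b0 < 0xE0) { need = 1; cp = b0 & 0x1F; min = 0x80; }
    else if (b0 < 0xF0) { need = 2; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 < 0xF8) { need = 3; cp = b0 & 0x07; min = 0x10000; }
    else return 0;                                        /* 0xF8..0xFF never valid */

    if (len < need + 1) return 0;                         /* truncated sequence */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;              /* must look like 10xxxxxx */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min) return 0;                               /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                          /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;           /* surrogates are not characters */

    *out = cp;
    return need + 1;
}
```

Call it in a loop, advancing the pointer by the return value, and bail out (or substitute U+FFFD) when it returns 0.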
And UTF-16 isn't much better... it's a little less code, but FAR more complicated in logic. Originally, it was intended to be 1 code point per unit (an unsigned short), but that became untenable over (a short amount of) time, and so they introduced surrogate pairs to cover code points that need more than one unit, much like UTF-8's multibyte sequences. But because the encoding had not originally been designed for them, it uses some odd ranges (0xD800-0xDFFF) to support them (one of many reasons I don't like it). So THAT code, to decode a code point from a UTF-16 stream, looks like this:
(code snippet)
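Again for illustration only (not the snippet referred to above), a minimal sketch of decoding one code point from UTF-16 in C. The name utf16_decode is hypothetical, and I'm assuming the units are already in native byte order, so there is no BOM or endianness handling here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and return convention are assumptions):
   decode one code point from the 16-bit units s[0..len).
   Returns the number of units consumed, or 0 if the input is invalid. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *out)
{
    uint16_t hi, lo;

    assert(s != NULL && out != NULL);
    if (len == 0) return 0;

    hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary BMP unit, stands alone */
        *out = hi;
        return 1;
    }
    if (hi >= 0xDC00) return 0;         /* low surrogate with no high before it */
    if (len < 2) return 0;              /* high surrogate at end of input */

    lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;   /* high not followed by low */

    /* 10 bits from each half, offset by 0x10000 */
    *out = 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

The "odd ranges" are the whole story: 0xD800-0xDBFF can only start a pair and 0xDC00-0xDFFF can only finish one, and everything else is a code point on its own.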
And all this just to decode ONE code point... of course, if you've ever actually looked at the algorithmic complexity of converting a floating point number to a string (even most veteran code monkeys haven't)... well, you start to appreciate the value of a good code library.
Anyway, I want length, count, cat, copy, compare (case sensitive and not), at least, and probably one or two other classes of functions (conversions betwixt UTF types certainly, but that's reasonably there already, if a little "stricter" than necessary). Locale code pages I can do without .... I'm looking for "pure" Unicode functionality in a lightweight package.
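To make the length vs. count distinction concrete, here's a rough sketch of the "count" half for UTF-8 in C. The name utf8_count is made up for this example, and it assumes the input is a NUL-terminated string that is already valid UTF-8 (it does no validation at all).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string.
   Assumes valid input; every byte that is not a continuation byte
   (10xxxxxx) starts a new code point. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string like "h\xC3\xA9llo", strlen() reports 6 bytes while this reports 5 code points. That's the gap the "length" and "count" functions are meant to cover.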
Pointers welcome.
5 Comments:
If all you want to do is encode/decode, Tcl has remarkably good support for it. :-P
Otherwise, you could/should check out libiconv.
libiconv is GI-NORMOUS.... (I looked at it already) and tcl... yoinks, seriously?
Agree it's a mess :)...Java seems to have some attractiveness here!
The only other cross platform mechanisms that come to mind, outside of libiconv, are libutf-8 and some mentioned here. Seems like libiconv is best suited.