is vim putting the right bytes for utf-8 ?

is vim putting the right bytes for utf-8 ?

Fernando Canizo
Hi all,

I have a problem with vim + mutt + UTF-8 (my emails get garbled when I
reply to UTF-8 mails) and I haven't found what is causing it.

However, I found this: if I create a file with vim containing only a
LATIN SMALL LETTER A WITH ACUTE, save it, and then reopen it with
hexedit, I see these hex bytes for the letter: C3 A1, which is not the
00 E1 given by the Unicode charts at http://www.unicode.org/charts/

Could it be that vim is linked against libncurses.so.5 instead of
libncursesw.so.5 ?

For more info, I'm using Gentoo, and
# emerge -pv vim
[ebuild   R   ] app-editors/vim-6.3.068  -acl +bash-completion -cscope
-debug +gpm -minimal +ncurses +nls +perl +python +ruby (-selinux)
-vim-with-x 0 kB

$ ldd `which vim` | grep curses
libncurses.so.5 => /lib/libncurses.so.5 (0xb7f98000)

$ vim --version
VIM - Vi IMproved 6.3 (2004 June 7, compiled Jul 26 2005 21:42:52)
Included patches: 1-68
Compiled by root@muriandre
Huge version without GUI.  Features included (+) or not (-):
+arabic +autocmd -balloon_eval -browse ++builtin_terms +byte_offset
+cindent -clientserver -clipboard +cmdline_compl +cmdline_hist
+cmdline_info +comments +cryptv -cscope +dialog_con +diff +digraphs
-dnd -ebcdic +emacs_tags +eval +ex_extra +extra_search +farsi
+file_in_path +find_in_path +folding -footer +fork() +gettext
-hangul_input +iconv +insert_expand +jumplist +keymap +langmap
+libcall +linebreak +lispindent +listcmds +localmap +menu +mksession
+modify_fname +mouse -mouseshape +mouse_dec +mouse_gpm -mouse_jsbterm
+mouse_netterm +mouse_xterm +multi_byte +multi_lang -netbeans_intg
-osfiletype +path_extra +perl +postscript +printer +python +quickfix
+rightleft +ruby +scrollbind +signs +smartindent -sniff +statusline
-sun_workshop +syntax +tag_binary +tag_old_static -tag_any_white -tcl
+terminfo +termresponse +textobjects +title -toolbar +user_commands
+vertsplit +virtualedit +visual +visualextra +viminfo +vreplace
+wildignore +wildmenu +windows +writebackup

-X11 -xfontset -xim -xsmp -xterm_clipboard -xterm_save
     system vimrc file: "/etc/vim/vimrc"
       user vimrc file: "$HOME/.vimrc"
        user exrc file: "$HOME/.exrc"
      location of $VIM: "/usr/share/vim"
Compilation: i686-pc-linux-gnu-gcc -c -I. -Iproto -DHAVE_CONFIG_H
-O2 -march=athlon-xp -fomit-frame-pointer   -pipe -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64  -I/usr/lib/perl5/5.8.6/i686-linux/CORE
-I/usr/include/python2.3 -pthread  -I/usr/lib/ruby/1.8/i686-linux
Linking: i686-pc-linux-gnu-gcc   -rdynamic -Wl,-export-dynamic
-rdynamic   -L/usr/local/lib -o vim       -lncurses -lgpm -rdynamic
-L/usr/local/lib
/usr/lib/perl5/5.8.6/i686-linux/auto/DynaLoader/DynaLoader.a
-L/usr/lib/perl5/5.8.6/i686-linux/CORE -lperl -lutil -lc
-L/usr/lib/python2.3/config -lpython2.3 -lutil -Xlinker
-export-dynamic  -Wl,-R -Wl,/usr/lib -L/usr/lib -L/usr/lib -lruby18
-lm  

--
Fernando Canizo - LUGMen: www.lugmen.org.ar - A8N: a8n.lugmen.org.ar
love, n.:
        When you don't want someone too close--because you're very sensitive
        to pleasure.

Re: is vim putting the right bytes for utf-8 ?

Peter Palm
On Wednesday 27 July 2005 13:31, Fernando Canizo wrote:
> Hi all,
>
> I have a problem with vim + mutt + UTF-8 (my emails get garbled when I
> reply to UTF-8 mails) and I haven't found what is causing it.
>
> However, I found this: if I create a file with vim containing only a
> LATIN SMALL LETTER A WITH ACUTE, save it, and then reopen it with
> hexedit, I see these hex bytes for the letter: C3 A1, which is not the
> 00 E1 given by the Unicode charts at http://www.unicode.org/charts/

It should be C3 A1; UTF-8 is an encoding of Unicode.

From man utf-8:

       * The first byte of a multi-byte sequence  which  represents  a  single
         non-ASCII UCS character is always in the range 0xc0 to 0xfd and indi‐
         cates how long this multi-byte sequence is. All further  bytes  in  a
         multi-byte  sequence  are in the range 0x80 to 0xbf. This allows easy
         resynchronization and makes the encoding stateless and robust against
         missing bytes.

Knowing this makes it easier to recognise utf-8 characters.
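
In code, that rule boils down to two range tests (a trivial sketch of
the ranges quoted above, not from any particular library):

    int is_utf8_lead(unsigned char c)  { return c >= 0xC0 && c <= 0xFD; }
    int is_utf8_trail(unsigned char c) { return c >= 0x80 && c <= 0xBF; }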

I haven't checked the mutt-vim-utf8 problem, but the problem probably isn't
vim.


Peter

Re: is vim putting the right bytes for utf-8 ?

Fernando Canizo
On Wed, Jul 27, 2005 at 01:51:34PM +0200, Peter Palm said to me:

> On Wednesday 27 July 2005 13:31, Fernando Canizo wrote:
> > Hi all,
> >
> > I have a problem with vim + mutt + UTF-8 (my emails get garbled when I
> > reply to UTF-8 mails) and I haven't found what is causing it.
> >
> > However, I found this: if I create a file with vim containing only a
> > LATIN SMALL LETTER A WITH ACUTE, save it, and then reopen it with
> > hexedit, I see these hex bytes for the letter: C3 A1, which is not the
> > 00 E1 given by the Unicode charts at http://www.unicode.org/charts/
>
> It should be C3 A1; UTF-8 is an encoding of Unicode.
>
> From man utf-8:
>
>        * The first byte of a multi-byte sequence  which  represents  a  single
>          non-ASCII UCS character is always in the range 0xc0 to 0xfd and indi‐
>          cates how long this multi-byte sequence is. All further  bytes  in  a
>          multi-byte  sequence  are in the range 0x80 to 0xbf. This allows easy
>          resynchronization and makes the encoding stateless and robust against
>          missing bytes.
>
> Knowing this makes it easier to recognise utf-8 characters.

Ah! So I have confused UTF-8 with Unicode. I thought they were the same.
But when I search for "utf-8 chart" I get directed to Unicode charts.

So you're saying that vim is writing the right thing, since I got C3 A1.
Can you tell me where you found that C3 A1 is the UTF-8 encoding of
LATIN SMALL LETTER A WITH ACUTE? I'd like to have a chart to test my
problem.

--
Fernando Canizo - LUGMen: www.lugmen.org.ar - A8N: a8n.lugmen.org.ar
Could this be the best day of my life?

        -- Homer Simpson
                   Homer the Heretic

Re: is vim putting the right bytes for utf-8 ?

Matthew Winn
On Wed, Jul 27, 2005 at 09:40:55AM -0300, Fernando Canizo wrote:
> So you're saying that vim is writing the right thing, since I got C3 A1.
> Can you tell me where you found that C3 A1 is the UTF-8 encoding of
> LATIN SMALL LETTER A WITH ACUTE? I'd like to have a chart to test my
> problem.

I'm not aware of any tables that list all characters in UTF-8, but it's
not difficult to decode the encoding by hand.

Write out the bytes in binary: C3 A1 = 11000011 10100001.  The first
byte of a UTF-8 encoded character starts with a string of "1" bits; the
number of these bits is the total number of bytes used to represent the
character, which in this case is 2.  The remaining bits of the first
byte and the lowest six bits of all subsequent bytes are the bits of the
character code.  In this case that's 000011 100001; redividing that into
groups of four you get 0000 1110 0001 or E1.  A three-byte character
such as E2 A8 B3 (11100010 10101000 10110011) would decode to 00010
101000 110011 = 0010 1010 0011 0011 = 2A33.
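
The same arithmetic is easy to do mechanically.  Here is a minimal C
sketch (my own illustration, not from any particular library; it assumes
well-formed input and does no validation):

    #include <stdio.h>

    /* Decode one UTF-8 sequence starting at s and return the codepoint. */
    unsigned long utf8_decode(const unsigned char *s)
    {
        if (s[0] < 0x80)                          /* 0xxxxxxx: plain ASCII */
            return s[0];
        int len = 2;                              /* count the leading 1 bits */
        while (s[0] & (0x80 >> len))
            ++len;
        unsigned long cp = s[0] & (0x7F >> len);  /* data bits of lead byte */
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (s[i] & 0x3F);       /* six bits per trail byte */
        return cp;
    }

    int main(void)
    {
        const unsigned char two[]   = { 0xC3, 0xA1 };
        const unsigned char three[] = { 0xE2, 0xA8, 0xB3 };
        printf("U+%04lX\n", utf8_decode(two));    /* prints U+00E1 */
        printf("U+%04lX\n", utf8_decode(three));  /* prints U+2A33 */
        return 0;
    }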

--
Matthew Winn ([hidden email])

Re: is vim putting the right bytes for utf-8 ?

A.J.Mechelynck
In reply to this post by Fernando Canizo
----- Original Message -----
From: "Fernando Canizo" <[hidden email]>
To: "vim list" <[hidden email]>
Sent: Wednesday, July 27, 2005 1:31 PM
Subject: is vim putting the right bytes for utf-8 ?


> Hi all,
>
> I have a problem with vim + mutt + UTF-8 (my emails get garbled when I
> reply to UTF-8 mails) and I haven't found what is causing it.
>
> However, I found this: if I create a file with vim containing only a
> LATIN SMALL LETTER A WITH ACUTE, save it, and then reopen it with
> hexedit, I see these hex bytes for the letter: C3 A1, which is not the
> 00 E1 given by the Unicode charts at http://www.unicode.org/charts/

No, it is due to how UTF-8 encodes data:

- U+0000 to U+007F: encoded as one byte, 0xxxxxxx
- U+0080 to U+07FF: encoded as two bytes, 110xxxxx 10xxxxxx
- U+0800 to U+FFFF : encoded as three bytes, 1110xxxx 10xxxxxx 10xxxxxx
- U+10000 to U+1FFFFF : encoded as four bytes, 11110xxx 10xxxxxx 10xxxxxx
10xxxxxx

The rest were originally foreseen, and are supported by Vim, but (IIUC) it
is now thought that codepoints above U+10FFFF will never be used:

- U+200000 to U+3FFFFFF: encoded as five bytes, 111110xx 10xxxxxx 10xxxxxx
10xxxxxx 10xxxxxx
- U+4000000 to U+7FFFFFFF : encoded as six bytes, 1111110x 10xxxxxx 10xxxxxx
10xxxxxx 10xxxxxx 10xxxxxx

(in all the above, each x is one data bit, always high bit first). This
means that byte values are strictly segregated, as follows:

0x00-0x7F: single byte
0x80-0xBF: trailing byte
0xC0-0xDF: leading byte for 2-byte sequence
0xE0-0xEF: leading byte for 3-byte sequence
0xF0-0xF7: leading byte for 4-byte sequence
0xF8-0xFB: leading byte for 5-byte sequence
0xFC-0xFD: leading byte for 6-byte sequence
0xFE-0xFF: illegal

...and there is no ambiguity (when looking, locally, at any part of a text)
regarding where the character boundaries are. Thus the _bytes_ 0xC3 0xA1, or
in binary 1100.0011-1010.0001, represent the _data_ 00011100001, i.e., the
codepoint U+00E1.
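
Those patterns are mechanical enough to write down directly.  A minimal
C sketch of the encoding direction (my own illustration, no range or
validity checks, covering only the four forms still in use):

    /* Encode codepoint cp (assumed < 0x110000) as UTF-8 into buf;
       returns the number of bytes written. */
    int utf8_encode(unsigned long cp, unsigned char *buf)
    {
        if (cp < 0x80) {                     /* 0xxxxxxx */
            buf[0] = (unsigned char)cp;
            return 1;
        }
        if (cp < 0x800) {                    /* 110xxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {                  /* 1110xxxx 10xxxxxx 10xxxxxx */
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));   /* 11110xxx ... */
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

Feeding it U+00E1 yields exactly the two bytes 0xC3 0xA1.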

In Vim, when 'encoding' is UTF-8, you get the "codepoint number" with ga and
the "data bytes" with g8 (in Normal mode, for the character(s) under the
cursor).
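
For example, with 'encoding' set to utf-8 and the cursor on the á, the
two commands report something like this (output format from memory; the
exact wording may vary between versions):

    ga   ->   <á> 225,  Hex e1,  Octal 341
    g8   ->   c3 a1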


Best regards,
Tony.



Re: is vim putting the right bytes for utf-8 ?

A.J.Mechelynck
In reply to this post by Fernando Canizo
----- Original Message -----
From: "Fernando Canizo" <[hidden email]>
To: "vim list" <[hidden email]>
Cc: "Peter Palm" <[hidden email]>
Sent: Wednesday, July 27, 2005 2:40 PM
Subject: Re: is vim putting the right bytes for utf-8 ?
[...]
> Ah! So I have confused UTF-8 with Unicode. I thought they were the same.
> But when I search for "utf-8 chart" I get directed to Unicode charts.
[...]

Unicode is a scheme to map all past, present and future writing systems
known to man (even some used only in fiction or fandom, like Klingon,
Cirth, and Fëanorian letters) into an "abstract space" with no more than
2^31 slots, each of which is called a "code point" and corresponds
roughly (in lay language) to a character. (Most of it is already defined
but some writing systems are still being added to it.)

UTF-8 is one of several possible ways to represent Unicode data (Unicode
codepoints) in memory, on magnetic media, or when transmitting it
electronically.

So the two are related but they are not identical.

For more details, browse to http://www.unicode.org/ and look for "What is
Unicode?"

Best regards,
Tony.



Re: is vim putting the right bytes for utf-8 ?

Mike Williams
In reply to this post by A.J.Mechelynck
Tony Mechelynck did utter on 27/07/2005 22:16:

> No, it is due to how UTF-8 encodes data:
>
> - U+0000 to U+007F: encoded as one byte, 0xxxxxxx
> - U+0080 to U+07FF: encoded as two bytes, 110xxxxx 10xxxxxx
> - U+0800 to U+FFFF : encoded as three bytes, 1110xxxx 10xxxxxx 10xxxxxx
> - U+10000 to U+1FFFFF : encoded as four bytes, 11110xxx 10xxxxxx
> 10xxxxxx 10xxxxxx
>
> The rest were originally foreseen, and are supported by Vim, but (IIUC)
> it is now thought that codepoints above U+10FFFF will never be used:
>
> - U+200000 to U+3FFFFFF: encoded as five bytes, 111110xx 10xxxxxx
> 10xxxxxx 10xxxxxx 10xxxxxx
> - U+4000000 to U+7FFFFFFF : encoded as six bytes, 1111110x 10xxxxxx
> 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The Unicode codespace is 21-bit only with a defined range up to and
including 0x10FFFF.  The five and six byte encodings are not valid
Unicode encoded codepoints.  ISO 10646 has a 32-bit codespace range, and
the five and six byte encodings are valid for that.  Ironically, both
standards use UTF-8 to represent the encoding but Unicode adds
constraints to the semantics.

Technically, if VIM supports the five and six byte encodings then it is
not Unicode conformant.  Then again, I don't know if Bram actually
claims any level of Unicode conformance.  Hmm, I note that in the doc it
also talks about UCS-2 and UCS-4 encodings - this actually refers to the
ISO 10646 definition - the Unicode names would be UTF-16 and UTF-32.  So I
guess VIM is not claiming Unicode conformance.

TTFN

Mike
--
SUSHI: Dead fish. CAVIAR: Unborn offspring of dead fish.

Re: is vim putting the right bytes for utf-8 ?

Bram Moolenaar

Mike Williams wrote:

> Tony Mechelynck did utter on 27/07/2005 22:16:
> > No, it is due to how UTF-8 encodes data:
> >
> > - U+0000 to U+007F: encoded as one byte, 0xxxxxxx
> > - U+0080 to U+07FF: encoded as two bytes, 110xxxxx 10xxxxxx
> > - U+0800 to U+FFFF : encoded as three bytes, 1110xxxx 10xxxxxx 10xxxxxx
> > - U+10000 to U+1FFFFF : encoded as four bytes, 11110xxx 10xxxxxx
> > 10xxxxxx 10xxxxxx
> >
> > The rest were originally foreseen, and are supported by Vim, but (IIUC)
> > it is now thought that codepoints above U+10FFFF will never be used:
> >
> > - U+200000 to U+3FFFFFF: encoded as five bytes, 111110xx 10xxxxxx
> > 10xxxxxx 10xxxxxx 10xxxxxx
> > - U+4000000 to U+7FFFFFFF : encoded as six bytes, 1111110x 10xxxxxx
> > 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>
> The Unicode codespace is 21-bit only with a defined range up to and
> including 0x10FFFF.  The five and six byte encodings are not valid
> Unicode encoded codepoints.  ISO 10646 has a 32-bit codespace range, and
> the five and six byte encodings are valid for that.  Ironically, both
> standards use UTF-8 to represent the encoding but Unicode adds
> constraints to the semantics.
>
> Technically, if VIM supports the five and six byte encodings then it is
> not Unicode conformant.  Then again, I don't know if Bram actually
> claims any level of Unicode conformance.  Hmm, I note that in the doc it
> also talks about UCS-2 and UCS-4 encodings - this actually refers to the
> ISO 10646 definition - the Unicode names would be UTF-16 and UTF-32.  So I
> guess VIM is not claiming Unicode conformance.

I can't claim I am very good at knowing exactly what the standard
specifies, but UTF-16 is one of the possible encodings where 16-bit
words are used to encode Unicode characters.  UCS-2 uses one 16-bit word
and is thus limited to the first 65536 characters.  UCS-4 is 32 bit, more
than enough.  Both UCS-2 and UCS-4 depend on byte order, thus are only
useful inside a program.  Unfortunately, some people do use them for
files, that's why the BOM (Byte Order Mark) was invented (helps, but
makes file manipulation complicated).

I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
of UTF-16.  As mentioned before UTF-16 is a bad hack for people who used
16 bit characters and discovered later that it's not sufficient for all
characters (most notably MS-Windows).  There's a lot of politics going
on for this standard...

As always, I recommend using UTF-8 for everything.  And use as many
characters as you like, up to 32 bits.

--
SECOND SOLDIER: It could be carried by an African swallow!
FIRST SOLDIER:  Oh  yes! An African swallow maybe ... but not a European
                swallow. that's my point.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: is vim putting the right bytes for utf-8 ?

Mike Williams
Bram Moolenaar did utter on 28/07/2005 12:16:

> Mike Williams wrote:
>
>
>>Tony Mechelynck did utter on 27/07/2005 22:16:
>>
>>>No, it is due to how UTF-8 encodes data:
>>>
>>>- U+0000 to U+007F: encoded as one byte, 0xxxxxxx
>>>- U+0080 to U+07FF: encoded as two bytes, 110xxxxx 10xxxxxx
>>>- U+0800 to U+FFFF : encoded as three bytes, 1110xxxx 10xxxxxx 10xxxxxx
>>>- U+10000 to U+1FFFFF : encoded as four bytes, 11110xxx 10xxxxxx
>>>10xxxxxx 10xxxxxx
>>>
>>>The rest were originally foreseen, and are supported by Vim, but (IIUC)
>>>it is now thought that codepoints above U+10FFFF will never be used:
>>>
>>>- U+200000 to U+3FFFFFF: encoded as five bytes, 111110xx 10xxxxxx
>>>10xxxxxx 10xxxxxx 10xxxxxx
>>>- U+4000000 to U+7FFFFFFF : encoded as six bytes, 1111110x 10xxxxxx
>>>10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
>>
>>The Unicode codespace is 21-bit only with a defined range up to and
>>including 0x10FFFF.  The five and six byte encodings are not valid
>>Unicode encoded codepoints.  ISO 10646 has a 32-bit codespace range, and
>>the five and six byte encodings are valid for that.  Ironically, both
>>standards use UTF-8 to represent the encoding but Unicode adds
>>constraints to the semantics.
>>
>>Technically, if VIM supports the five and six byte encodings then it is
>>not Unicode conformant.  Then again, I don't know if Bram actually
>>claims any level of Unicode conformance.  Hmm, I note that in the doc it
>>also talks about UCS-2 and UCS-4 encodings - this actually refers to the
>>ISO 10646 definition - the Unicode names would be UTF-16 and UTF-32.  So I
>>guess VIM is not claiming Unicode conformance.
>
>
> I can't claim I am very good at knowing exactly what the standard
> specifies, but UTF-16 is one of the possible encodings where 16-bit
> words are used to encode Unicode characters.  UCS-2 uses one 16-bit word
> and is thus limited to the first 65536 characters.  UCS-4 is 32 bit, more
> than enough.  Both UCS-2 and UCS-4 depend on byte order, thus are only
> useful inside a program.  Unfortunately, some people do use them for
> files, that's why the BOM (Byte Order Mark) was invented (helps, but
> makes file manipulation complicated).

Briefly, UCS-2 and UCS-4 encodings are from the earlier ISO 10646
standard. UTF-16 and UTF-32 encodings are defined by the Unicode
consortium and they are different.

UCS-2 encodes only the first 64K characters of the '646 codespace
whereas UTF-16 can encode the full 21 bit Unicode codespace, by using
two UTF-16 values in sequence to represent Unicode characters above the
BMP (Basic Multilingual Plane) - these are known as surrogate pairs.  The
result of this is that UCS-2 and UTF-16 are _not_ the same encoding.  A
quick Google should fill you in with more detail if you are interested.
And of course endianness comes into play as well, but that is just an
engineering problem ;-)

The important difference is how you detect and handle surrogate pairs in
the UTF-16 encoding.  If you don't then you are not treating the
encoding as UTF-16 but as UCS-2.  There are implications for this,
including possible corruption of otherwise valid Unicode encoded
characters which may cause problems with other Unicode conformant tools.
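
For the curious, the surrogate mechanism itself is only a few lines of
arithmetic (a sketch of the standard algorithm, not code from any
particular library):

    /* Split a codepoint above the BMP (assumed 0x10000..0x10FFFF)
       into a UTF-16 surrogate pair. */
    void utf16_surrogates(unsigned long cp, unsigned *hi, unsigned *lo)
    {
        cp -= 0x10000;                          /* 20 bits remain */
        *hi = 0xD800 | (unsigned)(cp >> 10);    /* high surrogate: top 10 bits */
        *lo = 0xDC00 | (unsigned)(cp & 0x3FF);  /* low surrogate: low 10 bits */
    }

E.g. U+10000 becomes D800 DC00, and U+10FFFF becomes DBFF DFFF.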

> I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
> of UTF-16.   As mentioned before UTF-16 is a bad hack for people who used
> 16 bit characters and discovered later that it's not sufficient for all
> characters (most notably MS-Windows).  There's a lot of politics going
> on for this standard...

The Unicode codespace is defined in terms of character planes (as in the
BMP) containing 64K codepoints.  There are 16 planes defined.  From this
you get the upper limit of 0x10ffff - the last codepoint of the last
defined plane, yes there are unused bits in the 21-bit codespace.  How
these codepoints are represented is defined by the encoding, of which
Unicode defines 3 - UTF-8, UTF-16 and UTF-32.  Each of them can represent
any valid codepoint, UTF-8 with up to 3 bytes, UTF-16 with one or two
characters, and a single UTF-32.

I recommend reading Chapter 2 from the Unicode spec (individual chapters
can be downloaded) available as PDF from the Unicode web site, in
particular section 2.8.

> As always, I recommend using UTF-8 for everything.  And use as many
> characters as you like, up to 32 bits.

That's as may be, but anything greater than 21 bits won't be valid
Unicode characters.  The usual mechanisms for handling invalid Unicode
characters are to remove them from the character sequence, replace them
with another character, or abort all processing.  The bottom line is
that the sequence must always be a valid Unicode character sequence.
Does VIM guarantee that it can only produce valid Unicode character
sequences?

There are pros and cons for UTF-8 v UTF-16.  UTF-8 works relatively
nicely with C string handling routines as NULs can still be used as
terminators.  But you can write more efficient string handling functions
using UTF-16 - indexing is simpler, and should you end up in a surrogate
pair it is a quick test to check if you need to back up one character or
not.

IIRC WinXP supports UTF-16 so should handle surrogate pairs.  Previous
versions of NT were UCS-2 so would ignore surrogate pairs.  If someone
has a font with characters outside the BMP it may be worth trying it on
an XP box.

TTFN

Mike
--
Blessings never come in pairs; misfortunes never come alone.

Re: is vim putting the right bytes for utf-8 ?

A.J.Mechelynck
In reply to this post by Bram Moolenaar
----- Original Message -----
From: "Bram Moolenaar" <[hidden email]>
To: <[hidden email]>
Cc: "vim list" <[hidden email]>
Sent: Thursday, July 28, 2005 1:16 PM
Subject: Re: is vim putting the right bytes for utf-8 ?
[...]

> I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
> of UTF-16.  As mentioned before UTF-16 is a bad hack for people who used
> 16 bit characters and discovered later that it's not sufficient for all
> characters (most notably MS-Windows).  There's a lot of politics going
> on for this standard...
[...]

Somewhere on the Unicode Consortium's site http://www.unicode.org/ (I'm not
sure exactly where but I remember seeing it: "What is Unicode?" or "The
Unicode Standard" are likely titles), it is said that no codepoints above
U+10FFFF shall be used ever (actually U+10FFFD, since no valid codepoint can
end in FFFE or FFFF); the same text also says that this is a "normative"
rule. According to its wording, it applies strictly to all varieties of
Unicode. (The fact that the limit is not a power of 2 seems to me to
indicate that the norm wasn't made by "hands-on" computer people; by
linguists, businessmen or politicians maybe [including computer
businessmen].) Somewhere near that, it is also said that overlong sequences
are "invalid" and that their use is forbidden. I saw all that when I first
tried to understand Unicode, in Vim 6.1 times IIRC.

IIUC, "Unicode" (as loosely used nowadays) came from the fusion of two
standards, namely Unicode (stricto sensu), as developed by the Unicode
Consortium, and ISO-10646, as developed by, well, ISO. IIUC, the ISO
standard initially defined a 31-bit data space but as a result of the
fusion, codepoints above U+10FFFD shan't be used unless or until the Unicode
standard's normative texts are changed to allow it (which won't happen in
any foreseeable future, but who knows what Unicode will look like 50 years
from now?). Except for UTF-8 (which is used in slightly different meanings
by both organizations), similar encodings are named differently,
UTF-something by the Unicode Consortium and UCS-something by ISO (IIUC).

Space for UTF-16 surrogates is clearly reserved (in two halves, "upper" and
"lower") in the Unicode codespace, but I haven't succeeded in finding exactly
how these surrogates are used, or what means what. (The only thing clear to
me is that they are "somehow" used in pairs, one from each half, to
represent codepoints above U+FFFF in 16-bit Unicode encodings.)


Best regards,
Tony.



Re: is vim putting the right bytes for utf-8 ?

A.J.Mechelynck
In reply to this post by Mike Williams
----- Original Message -----
From: "Mike Williams" <[hidden email]>
To: "Bram Moolenaar" <[hidden email]>
Cc: "vim list" <[hidden email]>
Sent: Thursday, July 28, 2005 2:09 PM
Subject: Re: is vim putting the right bytes for utf-8 ?
[...]

> Briefly, UCS-2 and UCS-4 encodings are from the earlier ISO 10646
> standard. UTF-16 and UTF-32 encodings are defined by the Unicode
> consortium and they are different.
>
> UCS-2 encodes only the first 64K characters of the '646 codespace whereas
> UTF-16 can encode the full 21 bit Unicode codespace, by using two UTF-16
> values in sequence to represent Unicode characters above the BMP (Basic
> Multilingual Plane) - these are known as surrogate pairs.  The result of
> this is that UCS-2 and UTF-16 are _not_ the same encoding.  A quick Google
> should fill you in with more detail if you are interested. And of course
> endianness comes into play as well, but that is just an engineering
> problem ;-)
>
> The important difference is how you detect and handle surrogate pairs in
> the UTF-16 encoding.  If you don't then you are not treating the encoding
> as UTF-16 but as UCS-2.  There are implications for this, including
> possible corruption of otherwise valid Unicode encoded characters which
> may cause problems with other Unicode conformant tools.
>
>> I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
>> of UTF-16.   As mentioned before UTF-16 is a bad hack for people who used
>> 16 bit characters and discovered later that it's not sufficient for all
>> characters (most notably MS-Windows).  There's a lot of politics going
>> on for this standard...
>
> The Unicode codespace is defined in terms of character planes (as in the
> BMP) containing 64K codepoints.  There are 16 planes defined.

Plus the BMP (plane zero), that makes 17 -- sixteen planes including the
BMP would only go up to 0x0FFFFF.

> From this you get the upper limit of 0x10ffff - the last codepoint of the
> last defined plane, yes there are unused bits in the 21-bit codespace.
> How these codepoints are represented is defined by the encoding, of which
> Unicode defines 3 - UTF-8, UTF-16 and UTF-32.  Each of them can represent
> any valid codepoint, UTF-8 with up to 3 bytes, UTF-16 with one or two
> characters, and a single UTF-32. [...]

UTF-8 with 4: to represent the highest valid codepoint (U+10FFFD) you need
the bytes F4 8F BF BD.

> IIRC WinXP supports UTF-16 so should handle surrogate pairs.  Previous
> versions of NT were UCS-2 so would ignore surrogate pairs.  If someone has
> a font with characters outside the BMP it may be worth trying it on an XP
> box.

I have the XP box and I'm willing to try. Anyone got a link to a font with
glyphs for the "high" CJK codepoints above the BMP? I even already have a
data file which contains those characters. My current CJK fonts are (as seen
by ":set guifont=*"):

@BatangChe
@DotumChe
@GulimChe
@GungsuhChe
@MingLiU
@MS Gothic
@MS Mincho
@NSimSun
BatangChe
DotumChe
GulimChe
GungsuhChe
MingLiU
MS Gothic
MS Mincho
NSimSun

Those whose names start with @ are rotated fonts to print CJK in columns on
rotated paper (i.e., to achieve column writing in portrait orientation you
print with an @-font in Landscape orientation). Obviously I haven't tried
all the above. IIUC, the --Che fonts are Korean, the MS- fonts Japanese,
NSimSun Simplified-Chinese and MingLiU Traditional-Chinese. Judging from
their icons, all of them are "TT", not the newer "O" kind of scalable fonts.
According to the font selector, some of them also have other charsets like
Cyrillic or Greek.

>
> TTFN
>
> Mike
> --
> Blessings never come in pairs; misfortunes never come alone.


Best regards,
Tony.



Re: is vim putting the right bytes for utf-8 ?

Bram Moolenaar
In reply to this post by Mike Williams

Mike Williams wrote:

> > I can't claim I am very good at knowing exactly what the standard
> > specifies, but UTF-16 is one of the possible encodings where 16-bit
> > words are used to encode Unicode characters.  UCS-2 uses one 16-bit word
> and is thus limited to the first 65536 characters.  UCS-4 is 32 bit, more
> > than enough.  Both UCS-2 and UCS-4 depend on byte order, thus are only
> > useful inside a program.  Unfortunately, some people do use them for
> > files, that's why the BOM (Byte Order Mark) was invented (helps, but
> > makes file manipulation complicated).
>
> Briefly, UCS-2 and UCS-4 encodings are from the earlier ISO 10646
> standard. UTF-16 and UTF-32 encodings are defined by the Unicode
> consortium and they are different.

Note that "the Unicode consortium" is different from "Unicode" in
general.  The latter is the generic term that covers all work done on
Unicode and related encodings.  The consortium is one organisation that
claims to be the authority on Unicode.  However, Unicode is undergoing
changes over time (the consortium is at 4.1.0 now) thus there is never a
fixed definition of what Unicode is.

> UCS-2 encodes only the first 64K characters of the '646 codespace
> whereas UTF-16 can encode the full 21 bit Unicode codespace, by using
> two UTF-16 values in sequence to represent Unicode characters above the
> BMP (Basic Multilingual Plane) - these are known as surrogate pairs.  The
> result of this is that UCS-2 and UTF-16 are _not_ the same encoding.  A
> quick Google should fill you in with more detail if you are interested.
>   And of course endianness comes into play as well, but that is just an
> engineering problem ;-)
>
> The important difference is how you detect and handle surrogate pairs in
> the UTF-16 encoding.  If you don't then you are not treating the
> encoding as UTF-16 but as UCS-2.  There are implications for this,
> including possible corruption of otherwise valid Unicode encoded
> characters which may cause problems with other Unicode conformant tools.
>
> > I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
> > of UTF-16.   As mentioned before UTF-16 is a bad hack for people who used
> > 16 bit characters and discovered later that it's not sufficient for all
> > characters (most notably MS-Windows).  There's a lot of politics going
> > on for this standard...
>
> The Unicode codespace is defined in terms of character planes (as in the
> BMP) containing 64K codepoints.  There are 16 planes defined.  From this
> you get the upper limit of 0x10ffff - the last codepoint of the last
> defined plane, yes there are unused bits in the 21-bit codespace.  How
> these codepoints are represented is defined by the encoding, of which
> Unicode defines 3 - UTF-8, UTF-16 and UTF-32.  Each of them can represent
> any valid codepoint, UTF-8 with up to 3 bytes, UTF-16 with one or two
> characters, and a single UTF-32.

All this stuff about "surrogate pairs" and "character planes" is the
terrible hack that was done to make UTF-16 possible.  It certainly
wasn't needed to give each character a number.  Unfortunately we'll have
to live with it.  But it remains an ugly solution that requires defining
all these terms.  Unicode should have been a nice clean homogeneous
sequence of characters...

> I recommend reading Chapter 2 from the Unicode spec (individual chapters
> can be downloaded) available as PDF from the Unicode web site, in
> particular section 2.8.

I recommend sticking to common sense.  Reading standards text can drive
you mad...

> > As always, I recommend using UTF-8 for everything.  And use as many
> > characters as you like, up to 32 bits.
>
> That's as may be, but anything greater than 21 bits won't be valid
> Unicode characters.

They are not valid according to the Unicode consortium in their current
standard.  In old standards only 16 bits were valid.  It might change
again in the future.  I stick to considering all 32-bit numbers
representing some character, that's a lot cleaner.

> The usual mechanisms for handling invalid Unicode characters are to
> remove them from the character sequence, replace them with another
> character, or abort all processing.

Halt the computer?

Seriously, this only applies to programs doing conversion.

> The bottom line is that the sequence must always be a valid Unicode
> character sequence.  Does VIM guarantee that it can only produce valid
> Unicode character sequences?

Vim guarantees that it won't change the bytes when you ":edit" a file
and ":write" it (well, so long as no forced conversions are taking
place).  And that's how it should be.  Vim is an editor, not a police
officer from the Unicode consortium.

Vim can also edit binary files and compressed files, those will
certainly not be valid Unicode.  And still 'encoding' can be "utf-8"
while editing these files.

If you want to filter out invalid characters, do you also want to drop
the ones reserved for future use?  That's similar to the ones above the
21 bits.  And the characters for the gaps that the surrogate pairs create
are certainly invalid.  All this is too specific and depends on the
version of the standard that is being used.  It doesn't sound like a
task for Vim.

> There are pros and cons for UTF-8 v UTF-16.  UTF-8 works relatively
> nicely with C string handling routines as NULs can still be used as
> terminators.  But you can write more efficient string handling functions
> using UTF-16 - indexing is simpler, and should you end up in a surrogate
> pair it is a quick test to check if you need to back up one character or
> not.

Not at all so.  UTF-8 is much simpler to handle, since all basic C
functions work on chars.  There isn't even a type for using a UTF-16
string!  "wchar_t" usually is 32 bit, I don't know of a C type that is
guaranteed to be 16 bits.  "short" often is, but it's not guaranteed.
Would need to use an array of two chars to be sure, but you run into
byte order problems then.  Not what I call simple...

On MS-Windows there are "W" versions of most library functions, but they
aren't even properly documented...  You have to look at the "A" version
and figure out what the arguments would be for the "W" version.  Or
better: look in the header file.

Somehow Microsoft assumed that we would write code that either runs on
Windows 2000/NT/XP OR on older systems.  But we want code that runs
everywhere, which is a hassle.  Fortunately it IS possible.

Outside of MS-Windows there is hardly any support for UTF-16 in system
functions.

> IIRC WinXP supports UTF-16 so should handle surrogate pairs.  Previous
> versions of NT were UCS-2 so would ignore surrogate pairs.  If someone
> has a font with characters outside the BMP it may be worth trying it on
> an XP box.

When writing the Vim code for MS-Windows we need to verify it works
properly with all versions of the runtime libraries, since the meaning
of wide characters changed over time.  Oh, and add Borland C, MinGW and
others to that (Borland is known to cause trouble on Windows 95 when
using wide functions, don't know what it does with UTF-16).  Just
changing something and hoping it will work for everybody is not a good
idea...

--
I used to wonder about the meaning of life.  But I looked it
up in the dictionary under "L" and there it was - the meaning
of life.  It was less than I expected.              - Dogbert

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: is vim putting the right bytes for utf-8 ?

John (Eljay) Love-Jensen
Hi Bram,

>In old standards only 16 bits were valid.

Correct, v1.x

>It might change again in the future.  I stick to considering all 32-bit
>numbers representing some character, that's a lot cleaner.

Unicode is guaranteed to never exceed 21 bits.

Code points above 21 bits may be used for other non-Unicode purposes
(including private use glyphs, or format CSI, or whatever).

CAREFULLY NOTE: UTF-8 can encode up to 31 bits, not 32 bits!

>Seriously, this [invalid character behavior] only applies to programs
>doing conversion.

I believe that is strictly in the context of UTF-8 or UTF-16 encoding for
the Unicode code points, and the translation to and from the 21 bit Unicode
code point space.

Non-Unicode private use code points above 21 bits should comply with UTF-8
or UTF-16 encoding when embedded in a UTF-8 or UTF-16 stream.

>If you want to filter out invalid characters, do you also want to drop
>the ones reserved for future use?

If UTF-8 or UTF-16 encoding is involved, invalid UTF-8 or UTF-16 bytes
should not appear in the UTF-8 or UTF-16 encoded octet stream.

I do not expect UTF-32 encoding to have that issue.  (But it will have
big-endian, little-endian, BOM issues.)

NOSTALGIA BREAK:  Remember the old Infocom Z-Machine engine, which used
5-bit encoding (q.v. http://www.gnelson.demon.co.uk/zspec/sect03.html).  I
wonder if anyone* uses a 21-bit stream for encoding Unicode...

* obviously, clinically insane

>UTF-8 is much simpler to handle, since all basic C functions work on
>chars.

UTF-80 is more compatible with basic C string functions.

UTF-80 is exactly like UTF-8 with one exception.  Code point U+0000 is
encoded as two octets: C0 80.

UTF-8 guarantees that \xFF and \xFE octets will never appear in the UTF-8
stream | string.

UTF-80 guarantees that \xFF, \xFE and \x00 will never appear in the UTF-80
stream | string.

Now *WHY* anyone would plop down U+0000 in a Unicode string, I have no idea.
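
This variant is easy to picture as a one-case wrapper (hypothetical
code, reusing the utf8_encode sketch from earlier in the thread):

    /* Encode cp as "UTF-80": identical to UTF-8 except that U+0000 is
       written as the overlong pair C0 80, so the output never contains
       a 0x00 byte and stays safe for C string functions. */
    int utf80_encode(unsigned long cp, unsigned char *buf)
    {
        if (cp == 0) {
            buf[0] = 0xC0;
            buf[1] = 0x80;
            return 2;
        }
        return utf8_encode(cp, buf);
    }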

Sincerely,
--Eljay

PS:  I'm a UTF-8 fan (and variant UTF-80, for ease-of-use with C string
routines); I'm not a UTF-16 fan.



Re: is vim putting the right bytes for utf-8 ?

Bram Moolenaar
In reply to this post by A.J.Mechelynck

Tony Mechelynck wrote:

> > I never heard about Unicode limiting to 0x10FFFF, I think that's a limit
> > of UTF-16.  As mentioned before UTF-16 is a bad hack for people who used
> > 16 bit characters and discovered later that it's not sufficient for all
> > characters (most notably MS-Windows).  There's a lot of politics going
> > on for this standard...
> [...]
>
> Somewhere on the Unicode Consortium's site http://www.unicode.org/ (I'm not
> sure exactly where but I remember seeing it: "What is Unicode?" or "The
> Unicode Standard" are likely titles), it is said that no codepoints above
> U+10FFFF shall be used ever

Bill Gates once said that nobody needs more than 640 Kbyte of memory and
based the architecture of MS-DOS on that.

MS-Windows used 16-bit wide characters because that would be sufficient
for everybody.

Can you see the pattern?

> Space for UTF-16 surrogates is clearly reserved (in two halves,
> "upper" and "lower") in the Unicode codespace, but I haven't succeeded
> to find exactly how these surrogates are used, or what means what.
> (The only thing clear to me is that they are "somehow" used in pairs,
> one from each half, to represent codepoints above U+FFFF in 16-bit
> Unicode encodings.)

That's standardization talk.  It's a complicated way to say that they
used a couple of holes in the character set to implement UTF-16.  And
yes, it requires some detailed reading to understand how it works.
It's similar to UTF-8, but not as "clean".

--
Why I like vim:
> I like VIM because, when I ask a question in this newsgroup, I get a
> one-line answer.  With xemacs, I get a 1Kb lisp script with bugs in it ;-)

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: is vim putting the right bytes for utf-8 ?

Mike Williams
In reply to this post by A.J.Mechelynck
Tony Mechelynck did utter on 28/07/2005 15:08:
>> The Unicode codespace is defined in terms of character planes (as in
>> the BMP) containing 64K codepoints.  There are 16 planes defined.
>
> Plus the BMP (plane zero), that makes 17 -- sixteen planes including the
> BMP would only go up to 0x0FFFFF.

You are of course right, I got confused with the numbering which goes
from 0 to 16. ;-)

>> From this you get the upper limit of 0x10ffff - the last codepoint of
>> the last defined plane, yes there are unused bits in the 21-bit
>> codespace. How these codepoints are represented is defined by the
>> encoding, of which Unicode defines 3 - UTF-8, UTF-16 and UTF-32.  Each
>> of them can represent any valid codepoint, UTF-8 with up to 3 bytes,
>> UTF-16 with one or two characters, and a single UTF-32. [...]
>
> UTF-8 with 4: to represent the highest valid codepoint (U+10FFFD) you
> need the bytes F4 8F BF BD.

Darn, caught out again.  The 3 byte sequence is sufficient for the BMP,
the 4 byte sequence is needed for the other 16 planes.

TTFN

Mike
--
What the public thinks depends on what the public hears.

Re: is vim putting the right bytes for utf-8 ?

Mike Williams
In reply to this post by Bram Moolenaar
Bram Moolenaar did utter on 28/07/2005 15:31:
> Note that "the Unicode consortium" is different from "Unicode" in
> general.  The latter is the generic term that covers all work done on
> Unicode and related encodings.  The consortium is one organisation that
> claims to be the authority on Unicode.  However, Unicode is undergoing
> changes over time (the consortium is at 4.1.0 now) thus there is never a
> fixed definition of what Unicode is.

Indeed, the term Unicode encompasses a wide range of interrelated
things.  However one of the aims is future proofing where possible, and
a number of aspects have been deemed as unchangeable.

> All this stuff about "surrogate pairs" and "character planes" is the
> terrible hack that was done to make UTF-16 possible.  It certainly
> wasn't needed to give each character a number.  Unfortunately we'll have
> to live with it.  But it remains an ugly solution that requires defining
> all these terms.  Unicode should have been a nice clean homogeneous
> sequence of characters...

It can be, just use UTF-32 ;-)  If you know you will only ever deal with
characters from the BMP then UTF-16 copes without having to know about
surrogate pairs.  Just using American ASCII?  UTF-8.  You picks your
encoding and pays your price.

>>I recommend reading Chapter 2 from the Unicode spec (individual chapters
>>can be downloaded) available as PDF from the Unicode web site, in
>>particular section 2.8.
>
> I recommend sticking to common sense.  Reading standards text can drive
> you mad...

As someone else said, common sense is the collection of prejudices
developed with age.  I think we went down this line of argument with C
and sequence points a couple of weeks ago ;-)

>>>As always, I recommend using UTF-8 for everything.  And use as many
>>>characters as you like, up to 32 bits.
>>
>>That's as may be, but anything greater than 21 bits wont be valid
>>Unicode characters.
>
> They are not valid according to the Unicode consortium in their current
> standard.  In old standards only 16 bits were valid.  It might change
> again in the future.  I stick to considering all 32-bit numbers
> representing some character, that's a lot cleaner.
>
>
>>The usual mechanisms for handling invalid Unicode characters is to
>>remove them from the character sequence, replace them with another
>>character, or abort all processing.
>
> Halt the computer?
>
> Seriously, this only applies to programs doing conversion.
>
>
>>The bottom line is that the sequence must always be a valid Unicode
>>character sequence.  Does VIM guarnatee that it can only produce valid
>>Unicode character sequences?
>
>
> Vim guarantees that it won't change the bytes when you ":edit" a file
> and ":write" it (well, so long as no forced conversions are taking
> place).  And that's how it should be.  Vim is an editor, not a police
> officer from the Unicode consortium.
>
> Vim can also edit binary files and compressed files, those will
> certainly not be valid Unicode.  And still 'encoding' can be "utf-8"
> while editing these files.
>
> If you want to filter out invalid characters, do you also want to drop
> the ones reserved for future use?  That's similar to the ones above the
> 21 bits.  And the characters for the gaps that the surrogate pairs create
> are certainly invalid.  All this is too specific and depends on the
> version of the standard that is being used.  It doesn't sound like a
> task for Vim.

Fine, I don't mind.  As long as people are aware that it is not a
Unicode capable editor.  It may work, it may not.  Any problems you have
are yours - next please.

>>There are pros and cons for UTF-8 v UTF-16.  UTF-8 works relatively
>>nicely with C string handling routines as NULs can still be used as
>>terminators.  But you can write more efficient string handling functions
>>using UTF-16 - indexing is simpler, and should you end up in a surrogate
>>pair it is a quick test to check if you need to back up one character or
>>not.
>
>
> Not at all so.  UTF-8 is much simpler to handle, since all basic C
> functions work on chars.  There isn't even a type for using a UTF-16
> string!  "wchar_t" usually is 32 bit, I don't know of a C type that is
> guaranteed to be 16 bits.  "short" often is, but it's not guaranteed.
> Would need to use an array of two chars to be sure, but you run into
> byte order problems then.  Not what I call simple...

There is no C type guaranteed to be "just" 8 bits, but that is an
argument of a different hue of fish.

> On MS-Windows there are "W" versions of most library functions, but they
> aren't even properly documented...  You have to look at the "A" version
> and figure out what the arguments would be for the "W" version.  Or
> better: look in the header file.
>
> Somehow Microsoft assumed that we would write code that either runs on
> Windows 2000/NT/XP OR on older systems.  But we want code that runs
> everywhere, which is a hassle.  Fortunately it IS possible.

Fine again, as long as everyone is aware that VIM does not claim to be a
Unicode editor - it may work, it may not, your problem.

> Outside of MS-Windows there is hardly any support for UTF-16 in system
> functions.
>
>
>>IIRC WinXP supports UTF-16 so should handle surrogate pairs.  Previous
>>versions of NT were UCS-2 so would ignore surrogate pairs.  If someone
>>has a font with characters outside the BMP it may be worth trying it on
>>an XP box.
>
>
> When writing the Vim code for MS-Windows we need to verify it works
> properly with all versions of the runtime libraries, since the meaning
> of wide characters changed over time.  Oh, and add Borland C, MinGW and
> others to that (Borland is known to cause trouble on Windows 95 when
> using wide functions, don't know what it does with UTF-16).  Just
> changing something and hoping it will work for everybody is not a good
> idea...
>

Mike
--
Lose weight - eat stuff you hate.

Re: is vim putting the right bytes for utf-8 ?

Bram Moolenaar

Mike Williams wrote:

> Fine again, as long as everyone is aware that VIM does not claim to be a
> Unicode editor - it may work, it may not, your problem.

That's right, claiming that Vim is a Unicode editor doesn't make much
sense, Vim is much more than that!  I suppose a Unicode editor wouldn't
be able to edit cp936 text, for a start.

--
If all you have is a hammer, everything looks like a nail.
When your hammer is C++, everything begins to look like a thumb.
                        -- Steve Hoflich, comp.lang.c++

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: is vim putting the right bytes for utf-8 ?

adah
In reply to this post by Fernando Canizo
> > If you want to filter out invalid characters, do you also want to
> > drop the ones reserved for future use?  That's similar to the ones
> > above the 21 bits.  And the characters for the gaps that the
> > surrogate pairs create are certainly invalid.  All this is too
> > specific and depends on the version of the standard that is being
> > used.  It doesn't sound like a task for Vim.
>
> Fine, I don't mind.  As long as people are aware that it is not a
> Unicode capable editor.  It may work, it may not.  Any problems you
> have are yours - next please.

Your statement is really provocative.  Please give the definition of
`Unicode capability'.

IMHO, Vim can handle `correct' Unicode input, and will write `correct'
Unicode output, even in your narrow sense of `correctness'.  (If the user
chooses to use extended characters, why should Vim prohibit it?)  The
point is that Vim will *NOT* generate characters incompatible with _any_
Unicode standards, if you use it in that way.  Vim never claims to be a
Unicode code point validity checker.  Period.

Regards,

Yongwei

RE: is vim putting the right bytes for utf-8 ?

Gareth Oakes
In reply to this post by Fernando Canizo
> Your statement is really provocative.  Please give the
> definition of `Unicode capability'.
>
> IMHO, Vim can handle `correct' Unicode input, and will write `correct'
> Unicode output, even in your narrow sense of `correctness'.  
> (If the user chooses to use extended characters, why should
> Vim prohibit it?)  The point is that Vim will *NOT* generate
> characters incompatible with _any_ Unicode standards, if you
> use it in that way.  Vim never claims to be a Unicode code
> point validity checker.  Period.

To be fair, I'm a bit of a pragmatist also.  I can see what Mike's
getting at but, to play devil's advocate, the world is never going to be
a perfect place :)

Vim is fine for Unicode editing, in my experience.  I've been working in
the field of automated publishing systems for a few years now and I've
successfully used vim to perform many edits on customer files (generally
XML or SGML).  These files have been from all corners of the globe and
vim hasn't choked yet :)

Cheers,
Gareth

PS. I've also been enjoying the pleasures of embedded vim on the Sharp
Zaurus SL-5500 PDA I recently picked up cheaply - open source rocks ;)

Re: is vim putting the right bytes for utf-8 ?

Mike Williams
In reply to this post by adah
[hidden email] did utter on 29/07/2005 03:29:

>>>If you want to filter out invalid characters, do you also want to
>>>drop the ones reserved for future use?  That's similar to the ones
>>>above the 21 bits.  And the characters for the gaps that the
>>>surrogate pairs create are certainly invalid.  All this is too
>>>specific and depends on the version of the standard that is being
>>>used.  It doesn't sound like a task for Vim.
>>
>>Fine, I don't mind.  As long as people are aware that it is not a
>>Unicode capable editor.  It may work, it may not.  Any problems you
>>have are yours - next please.
>
>
> Your statement is really provocative.  Please give the definition of
> `Unicode capability'.
>
> IMHO, Vim can handle `correct' Unicode input, and will write `correct'
> Unicode output, even in your narrow sense of `correctness'.  (If the user
> chooses to use extended characters, why should Vim prohibit it?)  The
> point is that Vim will *NOT* generate characters incompatible with _any_
> Unicode standards, if you use it in that way.  Vim never claims to be a
> Unicode code point validity checker.  Period.

Just taking your last statement, then VIM cannot be described as Unicode
conformant.  VIM supports various aspects of the Unicode standard, but
is not conformant.  As such no guarantees are made about the correctness
of processing files.  If someone requires conformance then they will
need 3rd party tools to validate VIM's output.

Consider VIM as a Swiss Army Editor - it can do many things but it will
also let you damage stuff without any warning.  Just as long as people
are aware, that's all.

TTFN

Mike
--
A clean desk requires a large wastebasket.