Multibyte bugs

Tony Mechelynck
Hi Bram,

1. (Minor bug): On this system (gvim 7.2.411, Huge version with
GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
doesn't give the expected result: instead of U+00A0 ("Alt-space", the
non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
correctly.

2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?

3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints. I
tried to find examples and counterexamples, as follows (in the comment
after the :echo statements, the UTF-8 expansion in hex):

        :echo "«\<Char-0x40>»" | " 40
«@»
        :echo "«\<Char-0x80>»" | " C2 80
«<80><fe>X»
        :echo "«\<Char-0x100>»" | " C4 80
«Ā<fe>X»
        :echo "«\<Char-0x101>»" | " C4 81
«ā»
        :echo "«\<Char-0x180>»" | " C6 80
«ƀ<fe>X»
        :echo "«\<Char-0x190>»" | " C6 90
«Ɛ»
        :echo "«\<Char-0x1A0>»" | " C6 A0
«Ơ»
        :echo "«\<Char-0x1C0>»" | " C7 80
«ǀ<fe>X»
        :echo "«\<Char-0x4E00>»" | " E4 B8 80
«一<fe>X»
        :echo "«\<Char-0x4E01>»" | " E4 B8 81
«丁»
        :echo "«\<Char-0x4E20>»" | " E4 B8 A0
«丠»
        :echo "«\<Char-0x4E40>»" | " E4 B9 80
«乀<fe>X»
        :echo "«\<Char-0xE000>»" | " EE 80 80
«<ee><80><fe>X<80><fe>X»
        :echo "«\<Char-57344>»" | " EE 80 80
«<ee><80><fe>X<80><fe>X»
        :echo "«\<Char-0xE001>»" | " EE 80 81
«<ee><80><fe>X<81>»
        :echo "«\<Char-0xE040>»" | " EE 81 80
«<fe>X»

This seems to indicate that the extra bytes 0xFE 0x58 appear after any
0x80 in the UTF-8 expansion of the character. (I added the « »
characters to "bound" the display so that any extra whitespace would be
visible, but they have no effect on the bug.)
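
The pattern can be checked independently of Vim. The following is a minimal Python sketch that assumes nothing beyond standard UTF-8 encoding; the names TESTED, utf8_bytes and contains_0x80 are illustrative and do not come from Vim. It lists the code points tried above and marks those whose UTF-8 form contains a 0x80 byte.

```python
# Code points tried in the :echo tests above.
TESTED = [0x40, 0x80, 0x100, 0x101, 0x180, 0x190, 0x1A0, 0x1C0,
          0x4E00, 0x4E01, 0x4E20, 0x4E40, 0xE000, 0xE001, 0xE040]


def utf8_bytes(codepoint):
    """Return the UTF-8 encoding of one code point as bytes."""
    return chr(codepoint).encode("utf-8")


def contains_0x80(codepoint):
    """True if any byte of the UTF-8 encoding equals 0x80."""
    return 0x80 in utf8_bytes(codepoint)


for cp in TESTED:
    hex_bytes = " ".join(f"{b:02X}" for b in utf8_bytes(cp))
    marker = "has 0x80" if contains_0x80(cp) else ""
    print(f"U+{cp:04X}  {hex_bytes:<9} {marker}")
```

The code points it marks are the same ones whose :echo output above shows the extra <fe>X.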

The bug does not occur after Ctrl-V u in Insert mode or when using
<Char-...> in an Insert-mode mapping. It does occur when "\<Char-...>"
is used in commands other than :echo. Note the following:

        :let j = "\<Char-0xE000>"
        :let j
j <ee><80><fe>X<80><fe>X
        i<Ctrl-R>=j<Enter>
î<t_þ>X<t_þ>X

(where <Ctrl-R> and <Enter> are one keystroke each, not counting
modifiers). Apparently gvim tries to interpret 0x80 0xFE as a "special
key", and "resolves" it (incorrectly) as <t_þ>.

Two very big files were loaded when I first noticed bug #3, but
restarting gvim without them reproduced the bug again with the same
spurious bytes.


Best regards,
Tony.
--
Alimony is a system by which, when two people make a mistake, one of
them keeps paying for it.
                -- Peggy Joyce

--
You received this message from the "vim_multibyte" maillist.
For more information, visit http://www.vim.org/maillist.php

To unsubscribe, reply using "remove me" as the subject.

Re: Multibyte bugs (Update)

Tony Mechelynck
On 03/04/10 06:36, Tony Mechelynck wrote:

> [...]

Update: There is a second case which triggers incorrect behaviour in
"\<Char-nnnn>" when 'encoding' is UTF-8:

- As noted above, after every 0x80 byte in the UTF-8 representation, the
bytes 0xFE 0x58 are spuriously added: after the UTF-8 string if the 0x80
is its last byte (giving two invalid bytes after the correct multibyte
glyph), and/or in the middle of it if there is a 0x80 byte other than
the last (making the whole multibyte sequence invalid; the 0x80 can
never be the first byte, because the first byte of a multibyte UTF-8
sequence is >= 0xC0 [0xC2 actually, except for "overlong" sequences
representing ASCII bytes], and it can not be an "only byte" because
single-byte sequences are <= 0x7F).

- In addition, after every 0x9B byte, the bytes 0xFD 0x4F are added,
also immediately after that byte, breaking the UTF-8 sequence if it
isn't the last byte.

- The above are repeatable "every time", even from one run of gvim to
the next, and I always get 0x80 0xFE 0x58 instead of 0x80, and 0x9B 0xFD
0x4F instead of 0x9B, in all the UTF-8 sequences generated by the
"\<Char-nnnn>" construct.

- Removing the spurious bytes (including those in the middle of a byte
sequence) makes the correct multibyte glyph appear immediately (I'm
assuming, of course, that 'encoding' is still set to UTF-8).

- The fact that those two byte values, 0x80 aka Alt-Null and 0x9B aka
Alt-Escape aka CSI, play special roles in gvim's representation of
special keys, might help to spot where the bug comes from. (Yes, did I
say it? I tested all this in GUI mode, in my usual "Huge" gvim with
GTK2/Gnome GUI, and, of course, with +multi_byte among others. Currently
at patchlevel 7.2.411)
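
The two rules above can be written down as a small model. This is a Python sketch of the observed behaviour only, built from the byte values reported in this message; it is not taken from Vim's source, and the names SPURIOUS, observed_expansion and strip_spurious are made up for the illustration.

```python
# Substitutions as reported above: byte -> the two bytes seen right after it.
SPURIOUS = {0x80: bytes([0xFE, 0x58]), 0x9B: bytes([0xFD, 0x4F])}


def observed_expansion(codepoint):
    """Model the bytes the <Char-nnnn> string form was seen to produce."""
    out = bytearray()
    for byte in chr(codepoint).encode("utf-8"):
        out.append(byte)
        out += SPURIOUS.get(byte, b"")
    return bytes(out)


def strip_spurious(data):
    """Undo the model: drop the extra pair that follows each 0x80 or 0x9B."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        out.append(byte)
        extra = SPURIOUS.get(byte)
        if extra is not None and data[i + 1:i + 3] == extra:
            i += 2  # skip the spurious pair
        i += 1
    return bytes(out)


expanded = observed_expansion(0xE000)
print(" ".join(f"{b:02x}" for b in expanded))
print(strip_spurious(expanded) == "\ue000".encode("utf-8"))
```

Since 0xFE and 0xFD never occur in valid UTF-8, stripping the pairs is unambiguous in this model, which matches the observation that removing them restores the correct glyph.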


I'm crossposting this update to vim_dev because my first post (in
vim_multibyte) got no reply whatsoever; but it was only four days ago,
and the Easter holiday is upon us; maybe I wasn't patient enough.


Have a nice holiday, and Happy Vimming!
Tony.
--
Immortality -- a fate worse than death.
                -- Edgar A. Shoaff


Re: Multibyte bugs

Bram Moolenaar
In reply to this post by Tony Mechelynck

Tony Mechelynck wrote:

> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
> correctly.

Why do you expect CTRL-K <space> <space> to produce 0xa0?  According to
http://www.faqs.org/rfcs/rfc1345.html  it's 0xe000.

> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?

Why would it be a double-width character?  In
http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
"private use".

> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints. I
> tried to find examples and counterexamples, as follows (in the comment
> after the :echo statements, the UTF-8 expansion in hex):
>
> :echo "«\<Char-0x40>»" | " 40
> «@»
> :echo "«\<Char-0x80>»" | " C2 80
> «<80><fe>X»
> :echo "«\<Char-0x100>»" | " C4 80
> «Ā<fe>X»
> :echo "«\<Char-0x101>»" | " C4 81
> «ā»
> :echo "«\<Char-0x180>»" | " C6 80
> «ƀ<fe>X»
> :echo "«\<Char-0x190>»" | " C6 90
> «Ɛ»
> :echo "«\<Char-0x1A0>»" | " C6 A0
> «Ơ»
> :echo "«\<Char-0x1C0>»" | " C7 80
> «ǀ<fe>X»
> :echo "«\<Char-0x4E00>»" | " E4 B8 80
> «一<fe>X»
> :echo "«\<Char-0x4E01>»" | " E4 B8 81
> «丁»
> :echo "«\<Char-0x4E20>»" | " E4 B8 A0
> «丠»
> :echo "«\<Char-0x4E40>»" | " E4 B9 80
> «乀<fe>X»
> :echo "«\<Char-0xE000>»" | " EE 80 80
> «<ee><80><fe>X<80><fe>X»
> :echo "«\<Char-57344>»" | " EE 80 80
> «<ee><80><fe>X<80><fe>X»
> :echo "«\<Char-0xE001>»" | " EE 80 81
> «<ee><80><fe>X<81>»
> :echo "«\<Char-0xE040>»" | " EE 81 80
> «<fe>X»
>
> This seems to indicate that the extra bytes 0xFE 0x58 appear after any
> 0x80 in the UTF-8 expansion of the character. (I added the « »
> characters to "bound" the display so that any extra whitespace would be
> visible, but they have no effect on the bug.)

The form "\<xxx>" is for special keys, not characters.  For the character
itself use \x or \u or \U.  See ":help expr-string".
The special keys are escaped for use in a mapping.

> The bug does not occur after Ctrl-V u in Insert mode or when using
> <Char-...> in an Insert-mode mapping. It does occur when "\<Char-...>"
> is used in commands other than :echo. Note the following:
>
> :let j = "\<Char-0xE000>"
> :let j
> j <ee><80><fe>X<80><fe>X
> i<Ctrl-R>=j<Enter>
> î<t_þ>X<t_þ>X
>
> (where <Ctrl-R> and <Enter> are one keystroke each, not counting
> modifiers). Apparently gvim tries to interpret 0x80 0xFE as a "special
> key", and "resolves" it (incorrectly) as <t_þ>.
>
> Two very big files were loaded when I first noticed bug #3, but
> restarting gvim without them reproduced the bug again with the same
> spurious bytes.

--
SUPERIMPOSE "England AD 787".  After a few more seconds we hear hoofbeats in
the distance.  They come slowly closer.  Then out of the mist comes KING
ARTHUR followed by a SERVANT who is banging two half coconuts together.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///


Re: Multibyte bugs

Tony Mechelynck
On 10/04/10 23:43, Bram Moolenaar wrote:

>
> Tony Mechelynck wrote:
>
>> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
>> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
>> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
>> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
>> correctly.
>
> Why do you expect CTRL-K <space> <space> to produce 0xa0?  According to
> http://www.faqs.org/rfcs/rfc1345.html  it's 0xe000.

Because of the following paragraph at lines 99-100 of digraph.txt:

----8<----
> For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
> {char} with the highest bit set.  You can use this to enter meta-characters.
---->8----

When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
<Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
the "meta-character" Meta-Space if I don't remember the NS digraph. If
U+E000 is a "private use" character, I don't see why it needs a digraph
of its own anyway.
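
The arithmetic behind that reading is a single bit operation. A minimal Python sketch, assuming "highest bit" means bit 7 of an 8-bit character as described in the quoted help text; with_high_bit is an illustrative name, not Vim's digraph code.

```python
def with_high_bit(char):
    """Return the character whose code is `char`'s code with bit 7 set."""
    return chr(ord(char) | 0x80)


result = with_high_bit(" ")  # <Space> is 0x20
print(f"U+{ord(result):04X}")
```

Under that reading <Space> (0x20) maps to 0xA0, the non-breaking space, rather than to U+E000.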

On reading that RFC, which states in its beginning paragraph that it has
no normative value whatsoever, I see (at the very end of section 3)
quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
in what Unicode calls a "private use area": see for instance the very
start of http://www.unicode.org/charts/pdf/UE000.pdf:

----8<----
Private Use Area
Range: E000–F8FF
The Private Use Area does not contain any character assignments,
consequently no character code charts or namelists are provided for this
area.
---->8----

At least some of the characters listed there in the RFC have a different
Unicode codepoint assigned to them, but maybe Unicode assigned them
after the RFC (dated June 1992) was published. Personally I have strong
doubts as to the usefulness of any Vim digraph for a "private use"
character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
not sure what that means, unless maybe that a blank space in a charset
chart (further down in the same RFC) indicates that the chart is unfinished?

>
>> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
>
> Why would it be a double-width character?  In
> http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
> "private use".

Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
fullwidth CJK glyph. OTOH the Unihan database does not mention it.
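
The same Unicode properties can be queried outside Vim. A small Python sketch using the standard unicodedata module; it reports whatever the Unicode data bundled with the interpreter says, so the exact output may depend on the Python version.

```python
import unicodedata

char = "\ue000"
print(unicodedata.category(char))          # general category
print(unicodedata.east_asian_width(char))  # East Asian Width property
```

A font remains free to draw any glyph it likes for a private-use codepoint, which would explain a fullwidth-looking glyph in one font and something else in another.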

>
>> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
[...]
>
> The form "\<xxx>" is for special keys, not characters.  For the character
> itself use \x or \u or \U.  See ":help expr-string".
> The special keys are escaped for use in a mapping.

The example given at |expr-string| is "\<C-W>" which is the "<control>"
character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
<PageDown>. I had always thought that _every_ <> name could be used in a
double-quoted string with a backslash prefix, and indeed I have verified
that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
_except_ those whose UTF-8 expansion includes either or both of the
bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
immediately after every occurrence of a 0x80 or 0x9B byte.

If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
the list under |expr-quote| that the \<xxx> form does not apply if xxx
is Char-nnnn or Char-0xnnnn.


Best regards,
Tony.
--
        "To whoever finds this note -
        I have been imprisoned by my father who wishes me to marry
        against my will.  Please please please please come and rescue me.
        I am in the tall tower of Swamp Castle."
    SIR LAUNCELOT's eyes light up with holy inspiration.
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD


Re: Multibyte bugs

Bram Moolenaar

Tony Mechelynck wrote:

> >> 1. (Minor bug): On this system (gvim 7.2.411, Huge version with
> >> GTK2-GNOME GUI), typing Ctrl-K in Insert mode followed by two spaces
> >> doesn't give the expected result: instead of U+00A0 ("Alt-space", the
> >> non-breaking space) I get U+E000, a CJK character. Ctrl-K NS works
> >> correctly.
> >
> > Why do you expect CTRL-K <space> <space> to produce 0xa0?  According to
> > http://www.faqs.org/rfcs/rfc1345.html  it's 0xe000.
>
> Because of the following paragraph at lines 99-100 of digraph.txt:
>
> ----8<----
> > For CTRL-K, there is one general digraph: CTRL-K <Space> {char} will enter
> > {char} with the highest bit set.  You can use this to enter meta-characters.
> ---->8----
>
> When {char} is 0x20 i.e. <Space>, the above tells me that CTRL-K <Space>
> <Space> gives 0xA0 i.e. the non-breaking space, which is useful to enter
> the "meta-character" Meta-Space if I don't remember the NS digraph. If
> U+E000 is a "private use" character, I don't see why it needs a digraph
> of its own anyway.

Ah, OK.

> On reading that RFC, which states in its beginning paragraph that it has
> no normative value whatsoever, I see (at the very end of section 3)
> quite a number of digraphs and trigraphs assigned to U+E000 to U+E028,
> in what Unicode calls a "private use area": see for instance the very
> start of http://www.unicode.org/charts/pdf/UE000.pdf:
>
> ----8<----
> Private Use Area
> Range: E000–F8FF
> The Private Use Area does not contain any character assignments,
> consequently no character code charts or namelists are provided for this
> area.
> ---->8----
>
> At least some of the characters listed there in the RFC have a different
> Unicode codepoint assigned to them, but maybe Unicode assigned them
> after the RFC (dated June 1992) was published. Personally I have strong
> doubts as to the usefulness of any Vim digraph for a "private use"
> character. U+E000 is listed as "indicates unfinished (Mnemonic)". I'm
> not sure what that means, unless maybe that a blank space in a charset
> chart (further down in the same RFC) indicates that the chart is unfinished?

It's weird that digraphs are defined for an area that doesn't have
characters assigned to it.  I wonder what happened here.  Perhaps this
changed at some point in time?  If we know the reason we may want to
drop all the digraphs for 0xexxx.


> >> 2. U+E000 is displayed in gvim as CJK halfwidth. Shouldn't it be fullwidth?
> >
> > Why would it be a double-width character?  In
> > http://unicode.org/Public/UNIDATA/EastAsianWidth.txt it's marked as
> > "private use".
>
> Ah, I see. FWIW my usual 'guifont' has a glyph for it, which AFAICT is a
> fullwidth CJK glyph. OTOH the Unihan database does not mention it.
>
> >
> >> 3. "\<Char-nnnn>" gives wrong results for some Unicode codepoints.
> [...]
> >
> > The form "\<xxx>" is for special keys, not characters.  For the character
> > itself use \x or \u or \U.  See ":help expr-string".
> > The special keys are escaped for use in a mapping.
>
> The example given at |expr-string| is "\<C-W>" which is the "<control>"
> character defined by ASCII as 0x17 ("\x17") and by Unicode as U+0017
> ("\u0017"), not a "special" non-ASCII key like <F8>, <Home> or
> <PageDown>. I had always thought that _every_ <> name could be used in a
> double-quoted string with a backslash prefix, and indeed I have verified
> that it works for all the <Char-nnnn> or <Char-0xnnnn> that I tested
> _except_ those whose UTF-8 expansion includes either or both of the
> bytes 0x80 and 0x9B, in which case two spurious bytes are inserted
> immediately after every occurrence of a 0x80 or 0x9B byte.
>
> If this bug is WONTFIX, I suggest mentioning explicitly at the bottom of
> the list under |expr-quote| that the \<xxx> form does not apply if xxx
> is Char-nnnn or Char-0xnnnn.

Yes.

--
SOLDIER: Where did you get the coconuts?
ARTHUR:  Through ... We found them.
SOLDIER: Found them?  In Mercea.  The coconut's tropical!
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD



Re: Multibyte bugs

Tony Mechelynck
On 11/04/10 16:33, Bram Moolenaar wrote:
[...]
> It's weird that digraphs are defined for an area that doesn't have
> characters assigned to it.  I wonder what happened here.  Perhaps this
> changed at some point in time?  If we know the reason we may want to
> drop all the digraphs for 0xexxx.
[...]

My guess is that when that RFC was drafted in 1992, some of the charsets
they wanted to list used a few characters which, at that time, weren't
clearly assigned to one Unicode codepoint, and that the RFC authors
arbitrarily (and maybe temporarily) placed these characters in a
"private use area", which is the only place where "characters not yet
assigned a Unicode codepoint" may go. This is only a guess, however. I'm
not sure how many people are reading this (extremely low-volume) ML, but
maybe someone knows the history of those mnemonics from RFC 1345 better
than you and I do? If someone with that knowledge is reading this,
please speak up.

IMHO it makes no sense to have digraphs in Vim for "private use"
characters. I propose to drop any of them that cannot be usefully
reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
means forty-one codepoints, it ought not to be a big problem.
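
The count is easy to confirm. A short Python sketch; the range E000 to E028 comes from the message above and the Private Use Area bounds E000 to F8FF from the Unicode chart quoted earlier in the thread, while the variable names are illustrative.

```python
FIRST, LAST = 0xE000, 0xE028        # digraph targets named above
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # Private Use Area, per the quoted chart

codepoints = range(FIRST, LAST + 1)
all_in_pua = all(PUA_FIRST <= cp <= PUA_LAST for cp in codepoints)

print(len(codepoints), all_in_pua)
```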


Best regards,
Tony.
--
LAUNCELOT: At last!   A call!  A cry of distress ...
            (he draws his sword, and turns to CONCORDE)
            Concorde!  Brave, Concorde ... you shall not have died in vain!
CONCORDE:  I'm not quite dead, sir ...
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD


Re: Multibyte bugs

Bram Moolenaar

Tony Mechelynck wrote:

> On 11/04/10 16:33, Bram Moolenaar wrote:
> [...]
> > It's weird that digraphs are defined for an area that doesn't have
> > characters assigned to it.  I wonder what happened here.  Perhaps this
> > changed at some point in time?  If we know the reason we may want to
> > drop all the digraphs for 0xexxx.
> [...]
>
> My guess is that when that RFC was drafted in 1992, some of the charsets
> they wanted to list used a few characters which, at that time, weren't
> clearly assigned to one Unicode codepoint, and that the RFC authors
> arbitrarily (and maybe temporarily) placed these characters in a
> "private use area", which is the only place where "characters not yet
> assigned a Unicode codepoint" may go. This is only a guess, however. I'm
> not sure how many people are reading this (extremely low-volume) ML, but
> maybe someone knows the history of those mnemonics from RFC 1345 better
> than you and I do? If someone with that knowledge is reading this,
> please speak up.
>
> IMHO it makes no sense to have digraphs in Vim for "private use"
> characters. I propose to drop any of them that cannot be usefully
> reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
> means forty-one codepoints, it ought not to be a big problem.

Searching revealed a few proposals for these character ranges.  And
this page has a confusing summary:
http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
"private use" but it does have a table with characters.

Let's remove these digraphs.  I can't imagine anyone is using them.

--
Clothes make the man.  Naked people have little or no influence on society.
                               -- Mark Twain (Samuel Clemens) (1835-1910)



Re: Multibyte bugs

Tony Mechelynck
On 11/04/10 17:33, Bram Moolenaar wrote:
>
> Tony Mechelynck wrote:
[...]
>> IMHO it makes no sense to have digraphs in Vim for "private use"
>> characters. I propose to drop any of them that cannot be usefully
>> reassigned to some "official" Unicode codepoint elsewhere. E000 to E028
>> means forty-one codepoints, it ought not to be a big problem.
>
> Searching revealed a few proposals for these character ranges.  And
> this page has a confusing summary:
> http://en.wikibooks.org/wiki/Unicode/Character_reference/E000-EFFF
> "private use" but it does have a table with characters.

Yes; in my browser and with my usual font most (but not all) of them are
CJK fullwidth ideograms and full-width counterparts of halfwidth math
symbols etc. A few are (halfwidth) Latin accented letters which even
exist in Latin1 i.e. below U+0100 !!! For instance (in my browser)
U+E023 to U+E081 look like duplicates of ASCII 0x21 to 0x7E in the same
order. Note however the last sentence immediately before the table:

«The repertoire seen with your computer's font will most likely not be
the same as with other computers or fonts.»

And indeed I see a different glyph for those codepoints in gvim with my
usual 'guifont', which is not the same as my browser's usual serif and
sans-serif fonts.

>
> Let's remove these digraphs.  I can't imagine anyone is using them.
>

Neither can I.


Best regards,
Tony.
--
    LAUNCELOT leaps into SHOT with a mighty cry and runs the GUARD through and
    hacks him to the floor.  Blood.  Swashbuckling music (perhaps).
    LAUNCELOT races through into the castle screaming.
SECOND SENTRY: Hey!
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD
