Handling Unicode codepoints outside the BMP

Handling Unicode codepoints outside the BMP

Tony Mechelynck

Bram Moolenaar wrote:

> Tony Mechelynck wrote:
>
>> Talking of todo lists, is the following item on it?
>>
>> - In the GUI, when using a multi-byte 'encoding' use glyphs for any codepoints
>> present in the 'guifont', even for Unicode codepoints above U+FFFF.
>>
>> Rationale: Many CJK codepoints are in plane 2 (U+20000 to U+2FFFF).
>> Representing all of them indiscriminately by double-wide question marks makes
>> editing East-Asian text files difficult.
>>
>> Hm... There seems to be something similar at todo.txt (2007-Jul-22) line 720:
>>
>> 8   When 'encoding' is "utf-8", should use 'guifont' for both normal and wide
>>      characters to make Asian languages work.  Win32 fonts contain both
>>      type of characters.
>>
>> This is already implemented (at least in the GTK2 GUI) for wide and narrow
>> characters in the BMP (Basic Multilingual Plane, U+0000 to U+FFFF). It remains
>> to be implemented for characters outside the BMP. Many TTF fonts installed on
>> my Linux system (and which I got from Novell/SuSE but also from various other
>> royalty-free sources) include glyphs both inside and outside the BMP, as can
>> be seen by displaying the page in Firefox with an appropriate CSS sheet. The
>> latest fonts which I installed are particularly complete if not extremely
>> pretty: "HAN NOM A" and "HAN NOM B", both from
>> http://sourceforge.net/project/showfiles.php?group_id=153105&package_id=172061 
>> -- I got the tip from http://en.wikipedia.org/wiki/GB18030#Glyphs
>
> It is very unlikely I will work on this myself.  Thus we will have to
> wait for someone to work on this.
>

OK. Does anyone want to take up the challenge? I have done some preliminary
detective work. The problem is in the function gui_mch_draw_string(), which is
implemented separately for each GUI flavour. The list below shows what each
implementation does, found by searching for a test on (c >= 0x10000). Line
numbers are for the latest (7.1.043) sources.

gui_gtk_x11.c:6064      if >= U+10000, replace by question mark
gui_mac.c               no special handling
gui_photon.c            no special handling
gui_riscos.c            no special handling
gui_w16.c               no special handling
gui_w32.c:2343          if >= U+10000, convert to UTF-16 surrogate pair
gui_x11.c:2565          if >= U+10000, replace by question mark (1)
gui_x11.c:2572          if >= U+10000, replace by question mark (2)

(1) with FEAT_XFONTSET, if current_fontset != NULL
(2) otherwise

It seems that the Win32 GUI converts UTF-8 to UTF-16 with surrogates (thus, I
suppose, displaying characters correctly even outside the BMP) and that X11
GUIs (GTK, and non-GTK-non-Photon) explicitly replace anything outside the BMP
by a wide question mark (which I call the "faulty" handling). For Mac, Photon,
RiscOS and W16, either the OS handles it transparently or the problem cannot
occur (+multi_byte not possible?), I don't know.
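
For reference, the UTF-16 conversion noted for gui_w32.c can be sketched as
follows. This is only an illustration of how a codepoint at or above U+10000
splits into a surrogate pair; it is not taken from the Vim sources, and the
function name utf16_encode() is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from the Vim sources: encode one Unicode
 * code point as UTF-16.  Writes one or two 16-bit units to out[] and
 * returns how many were written.
 */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    assert(cp <= 0x10FFFFUL);                /* Unicode range only */
    assert(cp < 0xD800UL || cp > 0xDFFFUL);  /* surrogates are not characters */

    if (cp < 0x10000UL)
    {
        out[0] = (unsigned short)cp;         /* BMP: one unit, unchanged */
        return 1;
    }

    cp -= 0x10000UL;                         /* 20 significant bits remain */
    /* High surrogate carries the top 10 bits, low surrogate the bottom 10. */
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL));
    return 2;
}
```

With this sketch, U+20000 (the first codepoint of plane 2) comes out as the
pair D840 DC00, while a BMP codepoint such as U+4E2D stays a single unit.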

I don't feel up to modifying this platform-dependent code, but I hope that
with the above, someone will be willing. Any takers?


Best regards,
Tony.
--
The temperature of Heaven can be rather accurately computed.  Our
authority is Isaiah 30:26, "Moreover, the light of the Moon shall be as
the light of the Sun and the light of the Sun shall be sevenfold, as
the light of seven days."  Thus Heaven receives from the Moon as much
radiation as we do from the Sun, and in addition 7*7 (49) times as much
as the Earth does from the Sun, or 50 times in all.  The light we
receive from the Moon is one 1/10,000 of the light we receive from the
Sun, so we can ignore that ... The radiation falling on Heaven will
heat it to the point where the heat lost by radiation is just equal to
the heat received by radiation, i.e., Heaven loses 50 times as much
heat as the Earth by radiation.  Using the Stefan-Boltzmann law for
radiation, (H/E)^4 = 50, where E is the absolute temperature of the
earth (300K), gives H as 798K (525C).  The exact temperature of Hell
cannot be computed ... [However] Revelations 21:8 says "But the
fearful, and unbelieving ... shall have their part in the lake which
burneth with fire and brimstone."  A lake of molten brimstone means
that its temperature must be at or below the boiling point, 444.6C.  We
have, then, that Heaven, at 525C is hotter than Hell at 445C.
                -- From "Applied Optics" vol. 11, A14, 1972

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: Handling Unicode codepoints outside the BMP

Edward L. Fox

On 7/29/07, Tony Mechelynck <[hidden email]> wrote:
> [...]
> I don't feel up to modifying this platform-dependent code, but I hope that
> with the above, someone will be willing. Any takers?

I think this work is worth doing. If Bram permits me to mess up the
svn trunk, maybe I could set up an SVK downstream repository and then
we could work on this issue together.

>
>
> Best regards,
> Tony.
> [...]

Re: Handling Unicode codepoints outside the BMP

Tony Mechelynck

Edward L. Fox wrote:

> On 7/29/07, Tony Mechelynck <[hidden email]> wrote:
>> [...]
>> I don't feel up to modifying this platform-dependent code, but I hope that
>> with the above, someone will be willing. Any takers?
>
> I think this work is worth doing. If Bram permits me to mess up the
> svn trunk, maybe I could set up an SVK downstream repository and then
> we could work on this issue together.
>
>>
>> Best regards,
>> Tony.
>> [...]

Great! What you could do even without asking Bram is set up a duplicate source
tree locally on your system and "mess" with that: for instance, if the
"official" source starts at ~/.build/vim/vim71/

        cd ~/.build
        mkdir -v vim-test
        cp -av vim/vim71 vim-test

Then modify vim-test/vim71/src/gui_*.c and use the usual "make" procedures
(except from vim-test/vim71 instead of vim/vim71). Don't "make install" but
run the generated binary (if any ;-) ) from vim-test/vim71/src instead.

Then after testing (I can help you test on GTK2) submit the diffs to Bram. :-)


Best regards,
Tony.
--
    [SIR LAUNCELOT runs back up the stairs, grabs a rope
    off the wall and swings out over the heads of the CROWD in a
    swashbuckling manner towards a large window.  He stops just short
    of the window and is left swinging pathetically back and forth.]
LAUNCELOT: Excuse me ... could somebody give me a push ...
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

Re: Handling Unicode codepoints outside the BMP

Nico Weber-3
In reply to this post by Tony Mechelynck

Sorry to revive this old thread... there was a newer thread requesting
correct display of Unicode chars >= 0x10000, but I can't find it, so I
am replying to this one. Since this thread is so old, I'm quoting the whole
thread below. Sorry if this offends you in some way ;-)

I can confirm that the Mac OS X versions of vim (carbon gvim and
console vim) display characters outside of the BMP correctly. (Carbon
vim has some drawing issues, but it has the same issues when it displays
BMP double-width characters. If we get this going, I could
look into that too.) I can also test on win32 and ubuntu (gtk version)
if there's interest.

Here's a diff (which is not supposed to become an official
patch! ;-) ) that enables displaying of characters outside of the BMP:

Index: screen.c
===================================================================
--- screen.c (revision 475)
+++ screen.c (working copy)
@@ -2305,9 +2305,9 @@
  prev_c = u8c;
 #endif
     /* Non-BMP character: display as ? or fullwidth ?. */
-    if (u8c >= 0x10000)
- ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
-    else
+    //if (u8c >= 0x10000)
+    //    ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
+    //else
  ScreenLinesUC[idx] = u8c;
     for (i = 0; i < Screen_mco; ++i)
     {
@@ -3678,25 +3678,25 @@
     if ((mb_l == 1 && c >= 0x80)
     || (mb_l >= 1 && mb_c == 0)
     || (mb_l > 1 && (!vim_isprintc(mb_c)
- || mb_c >= 0x10000)))
+ /* || mb_c >= 0x10000 */)))
     {
  /*
  * Illegal UTF-8 byte: display as <xx>.
  * Non-BMP character : display as ? or fullwidth ?.
  */
- if (mb_c < 0x10000)
- {
+ //if (mb_c < 0x10000)
+ //{
     transchar_hex(extra, mb_c);
 # ifdef FEAT_RIGHTLEFT
     if (wp->w_p_rl) /* reverse */
  rl_mirror(extra);
 # endif
- }
- else if (utf_char2cells(mb_c) != 2)
-    STRCPY(extra, "?");
- else
-    /* 0xff1f in UTF-8: full-width '?' */
-    STRCPY(extra, "\357\274\237");
+ //}
+ //else if (utf_char2cells(mb_c) != 2)
+ //    STRCPY(extra, "?");
+ //else
+ //    /* 0xff1f in UTF-8: full-width '?' */
+ //    STRCPY(extra, "\357\274\237");

  p_extra = extra;
  c = *p_extra;
@@ -6229,12 +6229,12 @@
     u8c = utfc_ptr2char(ptr, u8cc);
  mbyte_cells = utf_char2cells(u8c);
  /* Non-BMP character: display as ? or fullwidth ?. */
- if (u8c >= 0x10000)
- {
-    u8c = (mbyte_cells == 2) ? 0xff1f : (int)'?';
-    if (attr == 0)
- attr = hl_attr(HLF_8);
- }
+ //if (u8c >= 0x10000)
+ //{
+ //    u8c = (mbyte_cells == 2) ? 0xff1f : (int)'?';
+ //    if (attr == 0)
+ // attr = hl_attr(HLF_8);
+ //}
 # ifdef FEAT_ARABIC
  if (p_arshape && !p_tbidi && ARABIC_CHAR(u8c))
  {

On Jul 29, 4:03 pm, Tony Mechelynck <[hidden email]>
wrote:

> Bram Moolenaar wrote:
> > Tony Mechelynck wrote:
>
> >> Talking of todo lists, is the following item on it?
>
> >> - In the GUI, when using a multi-byte 'encoding' use glyphs for any codepoints
> >> present in the 'guifont', even for Unicode codepoints above U+FFFF.
>
> >> Rationale: Many CJK codepoints are in plane 2 (U+20000 to U+2FFFF).
> >> Representing all of them indiscriminately by double-wide question marks makes
> >> editing East-Asian text files difficult.
>
> >> Hm... There seems to be something similar at todo.txt (2007-Jul-22) line 720:
>
> >> 8   When 'encoding' is "utf-8", should use 'guifont' for both normal and wide
> >>      characters to make Asian languages work.  Win32 fonts contain both
> >>      type of characters.
>
> >> This is already implemented (at least in the GTK2 GUI) for wide and narrow
> >> characters in the BMP (Basic Multilingual Plane, U+0000 to U+FFFF). It remains
> >> to be implemented for characters outside the BMP. Many TTF fonts installed on
> >> my Linux system (and which I got from Novell/SuSE but also from various other
> >> royalty-free sources) include glyphs both inside and outside the BMP, as can
> >> be seen by displaying the page in Firefox with an appropriate CSS sheet. The
> >> latest fonts which I installed are particularly complete if not extremely
> >> pretty: "HAN NOM A" and "HAN NOM B", both from
> >> http://sourceforge.net/project/showfiles.php?group_id=153105&package_...
> >> -- I got the tip from http://en.wikipedia.org/wiki/GB18030#Glyphs
>
> > It is very unlikely I will work on this myself.  Thus we will have to
> > wait for someone to work on this.
>
> OK. Anyone wants to take up the challenge? I have done some preliminary
> detective work. The problem is in function gui_mch_draw_string() which is
> different for each GUI flavour, as follows, checking for a test on (c >=
> 0x10000). Line numbers below are for the latest (7.1.043) sources.
>
> gui_gtk_x11.c:6064      if >= U+10000, replace by question mark
> gui_mac.c               no special handling
> gui_photon.c            no special handling
> gui_riscos.c            no special handling
> gui_w16.c               no special handling
> gui_w32.c:2343          if >= U+10000, convert to UTF-16 surrogate pair
> gui_x11.c:2565          if >= U+10000, replace by question mark (1)
> gui_x11.c:2572          if >= U+10000, replace by question mark (2)
>
> (1) with FEAT_XFONTSET, if current_fontset != NULL
> (2) otherwise
>
> It seems that the Win32 GUI converts UTF-8 to UTF-16 with surrogates (thus, I
> suppose, displaying characters correctly even outside the BMP) and that X11
> GUIs (GTK, and non-GTK-non-Photon) explicitly replace anything outside the BMP
> by a wide question mark (which I call the "faulty" handling). For Mac, Photon,
> RiscOS and W16, either the OS handles it transparently or the problem cannot
> occur (+multi_byte not possible?), I don't know.
>
> I don't feel up to modifying this platform-dependent code, but I hope that
> with the above, someone will be willing. Any takers?
>
> Best regards,
> Tony.


Re: Handling Unicode codepoints outside the BMP

Bram Moolenaar


Nicolas Weber wrote:

> Sorry to awake this old thread...there was a newer thread requesting
> correct display of unicode chars >= 0x10000, but I can't find it, so I
> reply to this. Since this thread is so old, I'm quoting the whole
> thread below. Sorry if this offends you in some way ;-)
>
> I can confirm that the Mac OS X versions of vim (carbon gvim and
> console vim) display characters outside of the BMP correctly (carbon
> vim has some drawing issues, but it has those as well when it displays
> BMP double-width characters as well. If we get this going, I could
> look into that as well). I can test on win32 and ubuntu (gtk version)
> as well if there's interest.
>
> Here's a diff (which is not supposed to become an official
> patch! ;-) ) that enables displaying of characters outside of the BMP:

Thanks for looking into this.  Looks good, except that a check for
UNICODE16 is needed: if that is defined, then we really can use only 16
bits.

--
GUARD #2:  Wait a minute -- supposing two swallows carried it together?
GUARD #1:  No, they'd have to have it on a line.
GUARD #2:  Well, simple!  They'd just use a standard creeper!
GUARD #1:  What, held under the dorsal guiding feathers?
GUARD #2:  Well, why not?
                                  The Quest for the Holy Grail (Monty Python)

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

Re: Handling Unicode codepoints outside the BMP

Edward L. Fox

Hi Bram,

On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> [...]
> Thanks for looking into this.  Looks good.  Except that a check for
> UNICODE16 is needed, if that is defined then we really can use only 16
> bits.

Do you mean this?

Index: src/screen.c
===================================================================
--- src/screen.c        (revision 513)
+++ src/screen.c        (working copy)
@@ -2304,10 +2304,12 @@
                    else
                        prev_c = u8c;
 #endif
+#ifdef UNICODE16
                    /* Non-BMP character: display as ? or fullwidth ?. */
                    if (u8c >= 0x10000)
-                       ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
+                       ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
                    else
+#endif
                        ScreenLinesUC[idx] = u8c;
                    for (i = 0; i < Screen_mco; ++i)
                    {
@@ -3678,25 +3680,32 @@
                    if ((mb_l == 1 && c >= 0x80)
                            || (mb_l >= 1 && mb_c == 0)
                            || (mb_l > 1 && (!vim_isprintc(mb_c)
-                                                        || mb_c >= 0x10000)))
+#ifdef UNICODE16
+                                                || mb_c >= 0x10000
+#endif
+                                    )))
                    {
                        /*
                         * Illegal UTF-8 byte: display as <xx>.
                         * Non-BMP character : display as ? or fullwidth ?.
                         */
+#ifdef UNICODE16
                        if (mb_c < 0x10000)
                        {
+#endif
                            transchar_hex(extra, mb_c);
 # ifdef FEAT_RIGHTLEFT
                            if (wp->w_p_rl)             /* reverse */
                                rl_mirror(extra);
 # endif
+#ifdef UNICODE16
                        }
                        else if (utf_char2cells(mb_c) != 2)
                            STRCPY(extra, "?");
                        else
                            /* 0xff1f in UTF-8: full-width '?' */
                            STRCPY(extra, "\357\274\237");
+#endif

                        p_extra = extra;
                        c = *p_extra;
@@ -6246,12 +6255,14 @@
                    u8c = utfc_ptr2char(ptr, u8cc);
                mbyte_cells = utf_char2cells(u8c);
                /* Non-BMP character: display as ? or fullwidth ?. */
+#ifdef UNICODE16
                if (u8c >= 0x10000)
                {
                    u8c = (mbyte_cells == 2) ? 0xff1f : (int)'?';
                    if (attr == 0)
                        attr = hl_attr(HLF_8);
                }
+#endif
 # ifdef FEAT_ARABIC
                if (p_arshape && !p_tbidi && ARABIC_CHAR(u8c))
                {


Re: Handling Unicode codepoints outside the BMP

Edward L. Fox

On 9/11/07, Edward L. Fox <[hidden email]> wrote:
> Hi Bram,
>
> On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> > [...]
> > Thanks for looking into this.  Looks good.  Except that a check for
> > UNICODE16 is needed, if that is defined then we really can use only 16
> > bits.

Hmmm... It seems that more files need to be patched:

Index: gui_gtk_x11.c
===================================================================
--- gui_gtk_x11.c       (revision 513)
+++ gui_gtk_x11.c       (working copy)
@@ -6070,8 +6070,10 @@
            if (enc_utf8)
            {
                c = utf_ptr2char(p);
+#ifdef UNICODE16
                if (c >= 0x10000)       /* show chars > 0xffff as ? */
                    c = 0xbf;
+#endif
                buf[textlen].byte1 = c >> 8;
                buf[textlen].byte2 = c;
                p += utf_ptr2len(p);
Index: screen.c
===================================================================
--- screen.c    (revision 513)
+++ screen.c    (working copy)
@@ -2304,10 +2304,12 @@
                    else
                        prev_c = u8c;
 #endif
+#ifdef UNICODE16
                    /* Non-BMP character: display as ? or fullwidth ?. */
                    if (u8c >= 0x10000)
-                       ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
+                       ScreenLinesUC[idx] = (cells == 2) ? 0xff1f : (int)'?';
                    else
+#endif
                        ScreenLinesUC[idx] = u8c;
                    for (i = 0; i < Screen_mco; ++i)
                    {
@@ -3678,25 +3680,32 @@
                    if ((mb_l == 1 && c >= 0x80)
                            || (mb_l >= 1 && mb_c == 0)
                            || (mb_l > 1 && (!vim_isprintc(mb_c)
-                                                        || mb_c >= 0x10000)))
+#ifdef UNICODE16
+                                                || mb_c >= 0x10000
+#endif
+                                    )))
                    {
                        /*
                         * Illegal UTF-8 byte: display as <xx>.
                         * Non-BMP character : display as ? or fullwidth ?.
                         */
+#ifdef UNICODE16
                        if (mb_c < 0x10000)
                        {
+#endif
                            transchar_hex(extra, mb_c);
 # ifdef FEAT_RIGHTLEFT
                            if (wp->w_p_rl)             /* reverse */
                                rl_mirror(extra);
 # endif
+#ifdef UNICODE16
                        }
                        else if (utf_char2cells(mb_c) != 2)
                            STRCPY(extra, "?");
                        else
                            /* 0xff1f in UTF-8: full-width '?' */
                            STRCPY(extra, "\357\274\237");
+#endif

                        p_extra = extra;
                        c = *p_extra;
@@ -6246,12 +6255,14 @@
                    u8c = utfc_ptr2char(ptr, u8cc);
                mbyte_cells = utf_char2cells(u8c);
                /* Non-BMP character: display as ? or fullwidth ?. */
+#ifdef UNICODE16
                if (u8c >= 0x10000)
                {
                    u8c = (mbyte_cells == 2) ? 0xff1f : (int)'?';
                    if (attr == 0)
                        attr = hl_attr(HLF_8);
                }
+#endif
 # ifdef FEAT_ARABIC
                if (p_arshape && !p_tbidi && ARABIC_CHAR(u8c))
                {
Index: gui_x11.c
===================================================================
--- gui_x11.c   (revision 513)
+++ gui_x11.c   (working copy)
@@ -2562,15 +2562,19 @@
 # ifdef FEAT_XFONTSET
            if (current_fontset != NULL)
            {
-               if (c >= 0x10000 && sizeof(wchar_t) <= 2)
+#ifdef UNICODE16
+               if (c >= 0x10000)
                    c = 0xbf;           /* show chars > 0xffff as ? */
+#endif
                ((wchar_t *)buf)[wlen] = c;
            }
            else
 # endif
            {
+#ifdef UNICODE16
                if (c >= 0x10000)
                    c = 0xbf;           /* show chars > 0xffff as ? */
+#endif
                ((XChar2b *)buf)[wlen].byte1 = (unsigned)c >> 8;
                ((XChar2b *)buf)[wlen].byte2 = c;
            }


> [...]

Regards,

Edward L. Fox

Re: Handling Unicode codepoints outside the BMP

Bram Moolenaar
In reply to this post by Edward L. Fox


Edward L. Fox wrote:

> On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> > [...]
> > Thanks for looking into this.  Looks good.  Except that a check for
> > UNICODE16 is needed, if that is defined then we really can use only 16
> > bits.
>
> Do you mean this?

Yes.

--
MORTICIAN:    What?
CUSTOMER:     Nothing -- here's your nine pence.
DEAD PERSON:  I'm not dead!
MORTICIAN:    Here -- he says he's not dead!
CUSTOMER:     Yes, he is.
DEAD PERSON:  I'm not!
                                  The Quest for the Holy Grail (Monty Python)

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

Re: Handling Unicode codepoints outside the BMP

Bram Moolenaar
In reply to this post by Edward L. Fox


Edward L. Fox wrote:

> On 9/11/07, Edward L. Fox <[hidden email]> wrote:
> > Hi Bram,
> >
> > On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> > > [...]
> > > Thanks for looking into this.  Looks good.  Except that a check for
> > > UNICODE16 is needed, if that is defined then we really can use only 16
> > > bits.
>
> Hmmm... It seems that more files need to be patched:
>
> Index: gui_gtk_x11.c
> ===================================================================
> --- gui_gtk_x11.c       (revision 513)
> +++ gui_gtk_x11.c       (working copy)
> @@ -6070,8 +6070,10 @@
>             if (enc_utf8)
>             {
>                 c = utf_ptr2char(p);
> +#ifdef UNICODE16
>                 if (c >= 0x10000)       /* show chars > 0xffff as ? */
>                     c = 0xbf;
> +#endif
>                 buf[textlen].byte1 = c >> 8;
>                 buf[textlen].byte2 = c;
>                 p += utf_ptr2len(p);

Only two bytes are put in buf[], thus more needs to be changed here to
make it work.

> Index: gui_x11.c
> ===================================================================
> --- gui_x11.c   (revision 513)
> +++ gui_x11.c   (working copy)
> @@ -2562,15 +2562,19 @@
>  # ifdef FEAT_XFONTSET
>             if (current_fontset != NULL)
>             {
> -               if (c >= 0x10000 && sizeof(wchar_t) <= 2)
> +#ifdef UNICODE16
> +               if (c >= 0x10000)
>                     c = 0xbf;           /* show chars > 0xffff as ? */
> +#endif
>                 ((wchar_t *)buf)[wlen] = c;
>             }
>             else
>  # endif
>             {
> +#ifdef UNICODE16
>                 if (c >= 0x10000)
>                     c = 0xbf;           /* show chars > 0xffff as ? */
> +#endif
>                 ((XChar2b *)buf)[wlen].byte1 = (unsigned)c >> 8;
>                 ((XChar2b *)buf)[wlen].byte2 = c;
>             }

Here too.
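
To illustrate the two-byte limit: a value stored as byte1/byte2 keeps only
its low 16 bits, so a codepoint above U+FFFF silently turns into a different
character. The sketch below uses a local stand-in struct with the same
two-byte layout as Xlib's XChar2b, so it builds without the X11 headers;
pack_two_bytes() and fits_in_two_bytes() are made-up names for this example,
not Vim functions.

```c
/* Local stand-in with the same two-byte layout as Xlib's XChar2b. */
typedef struct
{
    unsigned char byte1;    /* high byte */
    unsigned char byte2;    /* low byte */
} TwoByteChar;

/* Hypothetical helper: store a code point the way the quoted hunks do. */
TwoByteChar pack_two_bytes(unsigned int c)
{
    TwoByteChar ch;

    ch.byte1 = (unsigned char)(c >> 8);  /* bits 8..15; higher bits are dropped */
    ch.byte2 = (unsigned char)c;         /* bits 0..7 */
    return ch;
}

/* Returns 1 when the code point survives the round trip, 0 when bits were lost. */
int fits_in_two_bytes(unsigned int c)
{
    TwoByteChar ch = pack_two_bytes(c);

    return (((unsigned int)ch.byte1 << 8) | ch.byte2) == c;
}
```

U+4E2D survives the round trip, but U+20000 is stored as 00 00, so removing
the (c >= 0x10000) test alone cannot be enough for these code paths.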

--
Shit makes the flowers grow and that's beautiful

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

Re: Handling Unicode codepoints outside the BMP

Edward L. Fox

Hi Bram,

On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> [...]
>
> Only two bytes are put in buf[], thus more needs to be changed here to
> make it work.
>
> [...]
>
> Here too.
> [...]

I don't know how to deal with characters outside the BMP for XChar2b.
I am just assuming that it is UTF-16. Please help me check this patch:

Index: gui_x11.c
===================================================================
--- gui_x11.c   (revision 513)
+++ gui_x11.c   (working copy)
@@ -2562,17 +2562,39 @@
 # ifdef FEAT_XFONTSET
            if (current_fontset != NULL)
            {
-               if (c >= 0x10000 && sizeof(wchar_t) <= 2)
+#ifdef UNICODE16
+               if (c >= 0x10000)
                    c = 0xbf;           /* show chars > 0xffff as ? */
+#endif
                ((wchar_t *)buf)[wlen] = c;
            }
            else
 # endif
            {
+#ifdef UNICODE16
                if (c >= 0x10000)
                    c = 0xbf;           /* show chars > 0xffff as ? */
                ((XChar2b *)buf)[wlen].byte1 = (unsigned)c >> 8;
                ((XChar2b *)buf)[wlen].byte2 = c;
+#else
+               if (c < 0x10000)
+                {
+                    ((XChar2b *)buf)[wlen].byte1 = (unsigned)c >> 8;
+                    ((XChar2b *)buf)[wlen].byte2 = c;
+                }
+                else
+                {
+                    unsigned short first, second;
+                    unsigned v = (unsigned)c - 0x10000;
+                    first = (v >> 10) + 0xD800;
+                    second = (v & 0x3FF) + 0xDC00;
+                    ((XChar2b *)buf)[wlen].byte1 = first >> 8;
+                    ((XChar2b *)buf)[wlen].byte2 = first;
+                    ++wlen;
+                    ((XChar2b *)buf)[wlen].byte1 = second >> 8;
+                    ((XChar2b *)buf)[wlen].byte2 = second;
+                }
+#endif
            }
            ++wlen;
            cells += utf_char2cells(c);
Index: gui_gtk_x11.c
===================================================================
--- gui_gtk_x11.c       (revision 513)
+++ gui_gtk_x11.c       (working copy)
@@ -6057,7 +6057,8 @@
        if (buflen < len)
        {
            XtFree((char *)buf);
-           buf = (XChar2b *)XtMalloc(len * sizeof(XChar2b));
+           buf = (XChar2b *)XtMalloc(len * (sizeof(XChar2b) < sizeof(wchar_t)
+                                       ? sizeof(wchar_t) : sizeof(XChar2b)));
            buflen = len;
        }

@@ -6070,10 +6071,30 @@
            if (enc_utf8)
            {
                c = utf_ptr2char(p);
+#ifdef UNICODE16
                if (c >= 0x10000)       /* show chars > 0xffff as ? */
                    c = 0xbf;
                buf[textlen].byte1 = c >> 8;
                buf[textlen].byte2 = c;
+#else
+               if (c < 0x10000)
+                {
+                    buf[textlen].byte1 = c >> 8;
+                    buf[textlen].byte2 = c;
+                }
+                else
+                {
+                    unsigned short first, second;
+                    unsigned cc = (unsigned)c - 0x10000; /* keep c for utf_char2cells() */
+                    first = (cc >> 10) + 0xD800;
+                    second = (cc & 0x3FF) + 0xDC00;
+                    buf[textlen].byte1 = first >> 8;
+                    buf[textlen].byte2 = first;
+                    ++textlen;
+                    buf[textlen].byte1 = second >> 8;
+                    buf[textlen].byte2 = second;
+                }
+#endif
                p += utf_ptr2len(p);
                width += utf_char2cells(c);
            }


Regards,


Edward L. Fox

--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_dev" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: Handling Unicode codepoints outside the BMP

Bram Moolenaar


Edward L. Fox wrote:

> On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> > [...]
> >
> > Only two bytes are put in buf[], thus more needs to be changed here to
> > make it work.
> >
> > [...]
> >
> > Here too.
> > [...]
>
> I don't know how to deal with the characters outside BMP for XChar2b.
> I just assume that it is UTF-16. Please help me check this patch:

Ehm, are you just guessing here?  You better find a way to test it.
I don't want to include this without testing.

I doubt X11 uses UTF-16, that is something that MS-Windows uses.  GTK
uses utf-8 in most places.  Other X11 toolkits probably differ, since
they are older.

--
DEAD PERSON:  I'm getting better!
CUSTOMER:     No, you're not -- you'll be stone dead in a moment.
MORTICIAN:    Oh, I can't take him like that -- it's against regulations.
                                  The Quest for the Holy Grail (Monty Python)

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\        download, build and distribute -- http://www.A-A-P.org        ///
 \\\            help me help AIDS victims -- http://ICCF-Holland.org    ///

Re: Handling Unicode codepoints outside the BMP

Tony Mechelynck

Bram Moolenaar wrote:
[...]
> I doubt X11 uses UTF-16, that is something that MS-Windows uses.  GTK
> uses utf-8 in most places.  Other X11 toolkits probably differ, since
> they are older.
>

Even with UTF-16, it is possible to display codepoints outside the BMP (up to
U+10FFFF IIRC, and the current Unicode guidelines say that there will "never"
be any valid codepoints higher than that) by means of surrogate pairs. IIRC,
that's what the W32 GUI already does.
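
For reference, the surrogate-pair arithmetic can be sketched as below. This is only an illustration, not code from Vim or from the W32 GUI: the helper name utf16_encode() is made up for this example. Note the parentheses around the shift and the mask; in C, + binds tighter than >> and &, so leaving them out changes the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Vim): encode one code point as UTF-16.
 * Returns the number of 16-bit units written to out (1 or 2), or 0 when
 * cp is not a valid Unicode scalar value. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                   /* above U+10FFFF, or a surrogate value */

    if (cp < 0x10000)
    {
        out[0] = (uint16_t)cp;      /* BMP: a single unit, same as UCS-2 */
        return 1;
    }

    cp -= 0x10000;                  /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

With this sketch, U+20000 (the first codepoint of plane 2) comes out as the pair 0xD840 0xDC00.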


Best regards,
Tony.
--
Dealing with failure is easy: work hard to improve.  Success is also
easy to handle: you've solved the wrong problem.  Work hard to
improve.

Re: Handling Unicode codepoints outside the BMP

Edward L. Fox

Hi Tony, thanks for your reply! But we are not talking about what
UTF-16 can cover; we just want to find out whether X11 uses UTF-16
or only UCS-2, which covers nothing outside the BMP. I'm afraid I
won't have time to test that until this Thursday, so I hope someone
can help. Best regards, Edward Fox

On 9/11/07, Tony Mechelynck <[hidden email]> wrote:

> Bram Moolenaar wrote:
> [...]
> > I doubt X11 uses UTF-16, that is something that MS-Windows uses.  GTK
> > uses utf-8 in most places.  Other X11 toolkits probably differ, since
> > they are older.
> >
>
> Even with UTF-16, it is possible to display codepoints outside the BMP
> (up to U+10FFFF IIRC, and the current Unicode guidelines say that there
> will "never" be any valid codepoints higher than that) by means of
> surrogate pairs. IIRC, that's what the W32 GUI already does.

Re: Handling Unicode codepoints outside the BMP

Tony Mechelynck

Edward L. Fox wrote:
> Hi Tony, thanks for your reply! But we are not talking about what
> UTF-16 can cover; we just want to find out whether X11 uses UTF-16
> or only UCS-2, which covers nothing outside the BMP. I'm afraid I
> won't have time to test that until this Thursday, so I hope someone
> can help. Best regards, Edward Fox

Yeah, well, Bram mentioned UTF-16 and this is one of the few details in which
UTF-16 differs from UCS-2.
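
To make that difference concrete, here is a rough sketch of how a UTF-16 reader treats the range 0xD800..0xDFFF, which a pure UCS-2 consumer would pass through as ordinary 16-bit characters. The function name utf16_decode() and its interface are assumptions made for this example; nothing here is taken from X11 or from Vim.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the code point at the start of s[0..len-1].
 * *used receives the number of 16-bit units consumed (1 or 2).
 * An unpaired surrogate yields U+FFFD (the replacement character). */
static uint32_t utf16_decode(const uint16_t *s, size_t len, size_t *used)
{
    uint16_t hi;

    assert(s != NULL && used != NULL && len > 0);

    hi = s[0];
    *used = 1;

    if (hi < 0xD800 || hi > 0xDFFF)
        return hi;          /* plain BMP unit: UCS-2 and UTF-16 agree here */

    if (hi <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
    {
        /* high surrogate followed by low surrogate: one codepoint >= U+10000 */
        *used = 2;
        return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                                        | (uint32_t)(s[1] - 0xDC00));
    }

    return 0xFFFD;          /* lone or misordered surrogate */
}
```

A consumer that only understands UCS-2 would instead take each of the two units as a character of its own, which is why the outcome depends on what the X11 text calls do with the XChar2b values.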


Best regards,
Tony.
--
Besides the device, the box should contain:

* Eight little rectangular snippets of paper that say "WARNING"

* A plastic packet containing four 5/17 inch pilfer grommets and two
   club-ended 6/93 inch boxcar prawns.

YOU WILL NEED TO SUPPLY: a matrix wrench and 60,000 feet of tram
cable.

IF ANYTHING IS DAMAGED OR MISSING: You IMMEDIATELY should turn to your
spouse and say: "Margaret, you know why this country can't make a car
that can get all the way through the drive-through at Burger King
without a major transmission overhaul?  Because nobody cares, that's
why."

WARNING: This is assuming your spouse's name is Margaret.
                -- Dave Barry, "Read This First!"

Re: Handling Unicode codepoints outside the BMP

Edward L. Fox
In reply to this post by Bram Moolenaar

Hi Bram,

On 9/11/07, Bram Moolenaar <[hidden email]> wrote:
> [...]
>
> Ehm, are you just guessing here?  You better find a way to test it.
> I don't want to include this without testing.
>
> I doubt X11 uses UTF-16, that is something that MS-Windows uses.  GTK
> uses utf-8 in most places.  Other X11 toolkits probably differ, since
> they are older.

I just tested the X11 and GTK1 front-ends on my Debian box. The X11
front-end failed to compile. The GTK1 front-end compiled OK, but
doesn't work properly. I'm afraid I won't be able to make these two
"so old" front-ends work properly.

Could you send out just the screen.c patch first? Then all the other
front-ends, including GTK2, GNOME2 and Windows, will work fine.

Regards,

Edward L. Fox

Re: Handling Unicode codepoints outside the BMP

Nico Weber-3
Hi,

> I just tested the X11 and GTK1 front-ends on my Debian box. The X11
> front-end failed to compile. The GTK1 front-end compiled OK, but
> doesn't work properly. I'm afraid I won't be able to make these two
> "so old" front-ends work properly.
>
> Could you send out just the screen.c patch first? Then all the other
> front-ends, including GTK2, GNOME2 and Windows, will work fine.

I tested the screen.c-only patch on Ubuntu (terminal and GUI), OS X
(terminal and GUI) and Windows (GUI only), with the default OS
settings. Encoding was set to utf-8, and everything was compiled with
--enable-multibyte. By default, neither Ubuntu nor Windows seems to
have Unicode fonts installed. Results look good on OS X and okayish in
the GTK and Windows GUIs (both have some redraw problems, but nothing
serious, and these problems also occur with some BMP characters that
are already supported at the moment). The GTK terminal (well, the
Ubuntu terminal :-P) however doesn't display characters >= 0x10000 at
all. `cat` doesn't display them either, so this might not be critical.

Screenshots are here:
http://amnoid.de/tmp/osxgui.png
http://amnoid.de/tmp/osxterm.png
http://amnoid.de/tmp/gtkgui.png
http://amnoid.de/tmp/gtkterm.png
http://amnoid.de/tmp/windraw.png
http://amnoid.de/tmp/winredraw.png

(the last screenshot shows that the "new" character is drawn twice:  
once at its original position, and once one to the right). The file I  
used for testing is attached.

I'd vote for using this patch and working on the display problems  
after that (except for the windows case, these display problems  
already exist with the current version, as the first line in all  
screenshots shows).

Nico




asdfsdfⓐasdf
asdfAasdf
asdfあasdfsdf
asdfボasdf
asdfAsdf
𝐴
asdf
Re: Handling Unicode codepoints outside the BMP

Nico Weber-3

> (the last screenshot shows that the "new" character is drawn twice:
> once at its original position, and once one to the right). The file I
> used for testing is attached.

FWIW, it's in utf8.
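
For anyone who wants to build a similar test file by hand, the UTF-8 bytes of a codepoint above U+FFFF can be produced with something like the sketch below. The helper utf8_encode() is made up for this example and is not Vim's own conversion routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Returns the number of bytes written to out (1 to 4), or 0 when cp is
 * a surrogate value or lies above U+10FFFF. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x80)
    {
        out[0] = (unsigned char)cp;                 /* ASCII */
        return 1;
    }
    if (cp < 0x800)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));    /* rest of the BMP */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));        /* outside the BMP */
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

For U+1D434 (mathematical italic capital A, which lies outside the BMP) this gives the four bytes F0 9D 90 B4.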

