Vim multibyte support for non-utf8 encodings

perky-2
Hi,

The current version of Vim doesn't handle non-UTF-8 multibyte encodings
such as EUC or GBK correctly on FreeBSD.  The cursor lands in odd
positions inside a character, and the last character on a line
sometimes disappears.

This problem is due to Vim's dependence on undefined behavior of
mblen(3).  Looking at Vim's source code at mbyte.c:653, the routine
assumes that mblen(3) isn't stateful.  In glibc and Solaris libc,
mblen(3) does not change its internal state when EILSEQ or EINVAL
occurs, but FreeBSD libc changes the internal state even when it hits
an error.  This behavior of mblen(3) is undefined in POSIX [1], so none
of the libc implementations is wrong.  I therefore think the multibyte
state needs to be reset before the mblen(3) call, so that the routine
works regardless of the implementation.
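
As a rough illustration of the idea (a minimal sketch, not the code in
mbyte.c): the helper name probe_char_width() is hypothetical, and the
NUL check is an assumption added so the helper is safe to call on an
empty string.  It resets mblen(3) and then probes a single byte.

```c
#include <stdlib.h>   /* mblen */

/*
 * Hypothetical helper (not Vim's actual code): guess whether the byte at
 * buf[0] is a complete single-byte character in the current locale.
 * Returns 1 if it is, 2 if mblen() rejects the lone byte.
 */
static int probe_char_width(const char *buf)
{
    /* Assumption for this sketch: treat NUL as a single-byte character
     * instead of asking mblen() about an empty string. */
    if (buf[0] == '\0')
        return 1;

    /* Put mblen() back into its initial state first, so that leftovers
     * from an earlier failed call cannot influence this one. */
    (void)mblen(NULL, 0);

    /* Some platforms return 0 instead of -1 for an invalid character,
     * so both results are treated as "not a single-byte character". */
    if (mblen(buf, (size_t)1) <= 0)
        return 2;
    return 1;
}
```

Without the reset, the result of the probe could depend on which bytes
were probed before it, which matches the symptoms described above.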

My patch is attached.

[1] http://www.opengroup.org/onlinepubs/009695399/functions/mblen.html

Hye-Shik

Attachment: patch-mbyte (321 bytes)

Re: Vim multibyte support for non-utf8 encodings

Bram Moolenaar

Hye-Shik Chang wrote:

> The current version of Vim doesn't handle non-UTF-8 multibyte encodings
> such as EUC or GBK correctly on FreeBSD.  The cursor lands in odd
> positions inside a character, and the last character on a line
> sometimes disappears.
>
> This problem is due to Vim's dependence on undefined behavior of
> mblen(3).  Looking at Vim's source code at mbyte.c:653, the routine
> assumes that mblen(3) isn't stateful.  In glibc and Solaris libc,
> mblen(3) does not change its internal state when EILSEQ or EINVAL
> occurs, but FreeBSD libc changes the internal state even when it hits
> an error.  This behavior of mblen(3) is undefined in POSIX [1], so none
> of the libc implementations is wrong.  I therefore think the multibyte
> state needs to be reset before the mblen(3) call, so that the routine
> works regardless of the implementation.
>
> My patch is attached.
>
> [1] http://www.opengroup.org/onlinepubs/009695399/functions/mblen.html

> --- mbyte.c.orig Fri Apr 23 17:44:36 2004
> +++ mbyte.c Thu May 12 08:48:35 2005
> @@ -650,6 +650,7 @@
>       * where mblen() returns 0 for invalid character.
>       * Therefore, following condition includes 0.
>       */
> +    (void)mblen(NULL, 0);
>      if (mblen(buf, (size_t)1) <= 0)
>          n = 2;
>      else

The behavior of mblen() on various systems has always been a bit unclear
to me.  Your remark makes a lot of sense, but I wonder why nobody had
this problem before.

I'll include this now in Vim 7 and await further comments.  Hopefully
there is no mblen() implementation that crashes when invoked with a NULL
pointer.


Re: Vim multibyte support for non-utf8 encodings

perky-2
On Sat, Jul 16, 2005 at 12:44:34PM +0200, Bram Moolenaar wrote:
>
> Hye-Shik Chang wrote:
>
> > The current version of Vim doesn't handle non-UTF-8 multibyte encodings
> > such as EUC or GBK correctly on FreeBSD.  The cursor lands in odd
> > positions inside a character, and the last character on a line
> > sometimes disappears.
[snip]
>
> The behavior of mblen() on various systems has always been a bit unclear
> to me.  Your remark makes a lot of sense, but I wonder why nobody had
> this problem before.
>

In fact, many Japanese FreeBSD users seem to have suffered from the
problem:

http://www.queen.ne.jp/iMA/showmdir.pl?ports-jp=Current&num=14694&link=20040430015955%2eGA52106%25st%40be%2eto
(Even if you can't read Japanese, you can still pick out some
alphabetic text on the page. :)

I wasn't aware of the problem because I use a UTF-8 locale, but a
few friends of mine asked me for help.

> I'll include this now in Vim 7 and await further comments.  Hopefully
> there is no mblen() implementation that crashes when invoked with a NULL
> pointer.

Thanks for applying the fix!  I don't think it will harm any
platform: POSIX clearly defines mblen(NULL, 0) as the way to return
mblen() to its initial state.
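
For reference, a small sketch of that reset call.  The wrapper name
reset_mblen_state() is hypothetical; the only library call used is
mblen(3) itself.  With a null pointer argument, mblen() goes back to
its initial state and its return value says whether the current
encoding is state-dependent.

```c
#include <stdlib.h>   /* mblen */

/*
 * Hypothetical wrapper: reset mblen() to its initial state.
 * Returns 1 if the current locale's encoding is state-dependent,
 * 0 if it is not.
 */
static int reset_mblen_state(void)
{
    /* A null pointer does not ask for a character length; it resets the
     * state and reports (non-zero or zero) whether the encoding has
     * state-dependent sequences. */
    return mblen(NULL, 0) != 0;
}
```

In the default "C" locale this should report a stateless encoding, so
calling the wrapper there is expected to return 0.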


Thanks,
Hye-Shik
