Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

adah
Hi Bram,

How is this going? I am now thinking you could execute a setlocale
on "set encoding=...". The language_country part can be omitted. My tests
show that setlocale(LC_ALL, ".65001") will return such a string:

LC_COLLATE=Chinese_People's Republic of
China.65001;LC_CTYPE=C;LC_MONETARY=Chinese_People's Republic of
China.65001;LC_NUMERIC=Chinese_People's Republic of
China.65001;LC_TIME=Chinese_People's Republic of China.65001

And setlocale(LC_ALL, ".936") will return:

Chinese_People's Republic of China.936

Regretfully this method does not work on Linux: setlocale(LC_ALL,
".UTF-8") will return NULL. And currently I do not know the implementation
details of a UNIX/Linux sprintf. So what I have suggested might be
Windows-only at present.

One last hack, and the simplest, is to call setlocale(LC_CTYPE, "C")
in main.c, after setlocale(LC_ALL, ""). I have verified that it works (at
least for my crash test case).
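
The hack described above could look roughly like the following. This is only a sketch: init_locale_bytewise is a hypothetical helper name, not a function in Vim's main.c, and the only calls assumed are the standard setlocale ones mentioned in the message.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical startup helper; not Vim's actual main.c code. */
const char *init_locale_bytewise(void)
{
    const char *ctype;

    /* Pick up the user's locale from the environment, as before. */
    setlocale(LC_ALL, "");

    /* Then switch only character classification back to "C", so that
     * byte-oriented library code sees no lead bytes.  The "C" locale
     * always exists, so this call is expected to succeed. */
    ctype = setlocale(LC_CTYPE, "C");
    assert(ctype != NULL && strcmp(ctype, "C") == 0);
    return ctype;
}
```

The other categories (LC_TIME, LC_NUMERIC and so on) keep whatever the environment selected; only LC_CTYPE is overridden.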

Best regards,

Yongwei





[hidden email]
2005-05-17 11:27

 
        To:     "A. J. Mechelynck" <[hidden email]>
        CC:     [hidden email], [hidden email]
        Subject:        Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by
"language messages zh_CN.UTF-8")

Some more comments now.

Now I am thinking the Microsoft approach is theoretically correct. Just as
I was taking a shower before going to bed, I realized Microsoft might
want to ensure that a '%' after a leading MBCS byte is not taken as a
format '%'. In practice, however, I do not know of a single Microsoft system
code page that includes the range 0x20-0x3F as the second byte. Even the
odious Chinese encoding GB18030 excludes 0x20-0x2F. So the current
Microsoft implementation really looks like overkill.

I think Method 1 is the easiest, and Method 3 is the most "correct"
(though changing {"<UTF-8 sequence> %..."} to {"%s %...",
<string-variable>} seems too much labour). Yet another way might be
calling setlocale with a string like "Chinese_China.65001" (Unix guys
should regard it as zh_CN.UTF-8). I have checked that isleadbyte will then
always return 0. I think Vim should do something like this when
encountering "set encoding=UTF-8".
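
Method 3 (keeping multibyte text out of the format string) might be sketched as below. The helper name format_translated and its parameters are made up for illustration; the idea is only that the format itself stays pure ASCII while the translated bytes travel through "%s".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: the translated text is an argument, so the
 * format string contains nothing but ASCII. */
int format_translated(char *out, size_t size,
                      const char *translated, long count)
{
    assert(translated != NULL);
    /* "%s" copies the bytes of `translated` up to its terminating NUL;
     * the format string is the only part scanned for '%' directives. */
    return snprintf(out, size, "%s %ld", translated, count);
}
```

As noted later in the thread, this does not help when the format string itself has to be translated, for example because the argument order changes between languages.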

Best regards,

Yongwei




[hidden email] wrote:
> A really, really disgusting thing seems to be in the MSVCRT code. After
> I built Vim with DEBUG=yes, I found strange crashes and assertions in
> _output, which is called by sprintf. Both are in MSVCRT. I can reproduce
> the (wrong) assertion with the following code:
>
> #include <stdio.h>   /* sprintf */
> #include <locale.h>  /* setlocale */
>
> int main()
> {
>         char buf[100];
>         setlocale(LC_ALL, "Chinese_China.936");
>         sprintf(buf, "\xe8\xa1\x8c");
>         return 0;
> }
>
> When linked with MSVCRTd.DLL (I use "cl /MDd test.c"), the sprintf call
> will make an assertion fail, because there are lines like the following in
> the M$ implementation:
>
>             if (isleadbyte((int)(unsigned char)ch)) {
>                 WRITE_CHAR(ch, &charsout);
>                 ch = *format++;
>                 /* UNDONE: don't fall off format string */
>                 _ASSERTE (ch != _T('\0'));
>             }
>             WRITE_CHAR(ch, &charsout);
>
> The isleadbyte call is sensitive to locales, and so is the Microsoft sprintf!
> I cannot say it is completely unreasonable, but ... it is short-sighted
> and makes completely valid code fail!
>
> I get exactly this assertion even when just typing after "lang messages
> zh_CN.UTF-8" is executed in Vim. I could ignore the error a few times,
> but then the code eventually crashes --- isleadbyte is used in
> several other places, which I have not investigated further yet.
>
> So it seems any MBCS locale can have problems with byte sequences that
> are invalid in it. The cures I can think of are:
>
> 1) Choose a more reasonable sprintf implementation;
> 2) Use only locale "C" in Vim, and do internationalization in some other
> way;
> 3) Avoid multibyte sequences in the sprintf format;
> 4) Avoid using messages other than default and pure ASCII ones (C, en,
> etc.) with MBCS locales.
>
> Sigh!
>
> Best regards,
>
> Yongwei
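
To see why the three UTF-8 bytes in the test case trip the assertion, the quoted loop can be modelled in portable C. This is a simplified model, not the MSVCRT source: it ignores '%' handling entirely, and it assumes for illustration that code page 936 treats 0x81..0xFE as lead bytes (cp936_is_lead stands in for isleadbyte).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for illustration: CP936 lead bytes are 0x81..0xFE. */
static int cp936_is_lead(unsigned char c)
{
    return c >= 0x81 && c <= 0xFE;
}

/* Model of the quoted loop.  Returns 1 if a lead byte is the last byte
 * of the format, i.e. the scanner would step onto the terminating NUL. */
int format_falls_off_end(const char *format)
{
    assert(format != NULL);
    while (*format != '\0') {
        unsigned char ch = (unsigned char)*format++;
        if (cp936_is_lead(ch)) {
            if (*format == '\0')
                return 1;   /* the point where _ASSERTE would fire */
            format++;       /* trail byte is copied without inspection */
        }
    }
    return 0;
}
```

With "\xe8\xa1\x8c", the byte 0xE8 is taken as a lead byte and 0xA1 as its trail byte; 0x8C is then taken as another lead byte with nothing after it, so the model reports a fall-off. An odd number of such bytes always ends this way.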



Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

Bram Moolenaar

Yongwei wrote:

> How is it going on this? I am now thinking you could execute a setlocale
> on "set encoding=...". The language_country part can be omitted. My tests
> show that setlocale(LC_ALL, ".65001") will return such a string:
>
> LC_COLLATE=Chinese_People's Republic of
> China.65001;LC_CTYPE=C;LC_MONETARY=Chinese_People's Republic of
> China.65001;LC_NUMERIC=Chinese_People's Republic of
> China.65001;LC_TIME=Chinese_People's Republic of China.65001
>
> And setlocale(LC_ALL, ".936") will return:
>
> Chinese_People's Republic of China.936

What would Vim do with this?  How can Vim guess what to search for?

The basic idea is that the locale comes from the environment and
everything works automatically.  If the user prefers he can set a
different locale with the ":language" command.

> Regretfully this method does not work on Linux: setlocale(LC_ALL,
> ".UTF-8") will return NULL. And currently I do not know the implementation
> details of a UNIX/Linux sprintf. So what I have suggested might be
> Windows-only at present.

This is very system dependent.  BSD is again different from Linux.  And
the locales present on a system may change over time.

> Yet a last hack to do it most simply is to call setlocale(LC_CTYPE, "C")
> in main.c, after setlocale(LC_ALL, ""). I have verified that it works (at
> least for my crash test case).

You could try doing ":language C" in your vimrc file.

--
(letter from Mark to Mike, about the film's probable certificate)
      I would like to get back to the Censor and agree to lose the shits, take
      the odd Jesus Christ out and lose Oh fuck off, but to retain 'fart in
      your general direction', 'castanets of your testicles' and 'oral sex'
      and ask him for an 'A' rating on that basis.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

adah
In reply to this post by adah
Hi Bram,

I do not mind using English messages (which is exactly what I am doing
now). However, one must confess it is not ideal that on a Far East Windows
system (with default locale settings), "set encoding=UTF-8" CANNOT
co-exist with localized messages. It limits the usability/accessibility of
Vim.

Setting LANG is useless, in that no LANG setting I have tried can achieve
the effect of (Chinese for example, but it is the same for Korean and
Japanese)

- setlocale(LC_ALL, "Chinese_China.65001")
- set encoding=UTF-8
- language messages zh_CN.UTF-8

One can easily achieve the latter two in _vimrc, but not the first. Not
even when I specify LANG=C.

I confess that platforms are different, but putting a setlocale call in
"#ifdef _WIN32" should not affect other platforms that may not display the
crash behaviours --- though in theory other platforms could display this
behaviour too.

":language C" (or ":language en") works for me. But many other people
might want the localized messages.

The point is that a Far East locale set by setlocale(LC_ALL/LC_CTYPE, "")
makes sprintf fail with format strings invalid in that locale. I have
proposed several workarounds. The suggestion in my last message is
that when an encoding is set, Vim should get the corresponding Windows code
page for that encoding and call setlocale(LC_CTYPE, ".<THIS CODE PAGE>").
(I wrote LC_ALL, but really LC_CTYPE is enough.)
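
A rough sketch of that suggestion follows. The lookup table and the name codepage_locale are hypothetical and only cover a few encodings for illustration; Vim's real encoding tables are not consulted here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative table only: a few encoding names and their Windows
 * code page numbers. */
struct enc_cp { const char *enc; unsigned cp; };
static const struct enc_cp enc_table[] = {
    { "utf-8", 65001 }, { "cp936", 936 }, { "cp932", 932 }, { "cp949", 949 },
};

/* Writes ".<code page>" into out.  Returns 1 on success, 0 if the
 * encoding is unknown or the buffer is too small. */
int codepage_locale(const char *enc, char *out, size_t size)
{
    size_t i;

    assert(enc != NULL && out != NULL);
    for (i = 0; i < sizeof enc_table / sizeof enc_table[0]; i++) {
        if (strcmp(enc, enc_table[i].enc) == 0) {
            int n = snprintf(out, size, ".%u", enc_table[i].cp);
            return n > 0 && (size_t)n < size;
        }
    }
    return 0;
}
```

On Windows the resulting string would then be handed to setlocale(LC_CTYPE, ...), and the return value checked for NULL, since the runtime may reject the code page.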

I hope I am clear enough this time.

Best regards,

Yongwei

P.S. Commenting out the line setlocale(LC_ALL, "") in main.c seems to work
quite well too, and all the Chinese menus and messages are still correct.
This should be identical to setlocale(LC_ALL, "C"). I do not know what
side-effects it could have. But it seems to work.






Bram Moolenaar <[hidden email]>
Sent by: [hidden email]
2005-05-23 17:38

 
        To:     [hidden email]
        CC:     [hidden email]
        Subject:        Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by
"language messages zh_CN.UTF-8")


Yongwei wrote:

> How is it going on this? I am now thinking you could execute a setlocale
> on "set encoding=...". The language_country part can be omitted. My tests
> show that setlocale(LC_ALL, ".65001") will return such a string:
>
> LC_COLLATE=Chinese_People's Republic of
> China.65001;LC_CTYPE=C;LC_MONETARY=Chinese_People's Republic of
> China.65001;LC_NUMERIC=Chinese_People's Republic of
> China.65001;LC_TIME=Chinese_People's Republic of China.65001
>
> And setlocale(LC_ALL, ".936") will return:
>
> Chinese_People's Republic of China.936

What would Vim do with this?  How can Vim guess what to search for?

The basic idea is that the locale comes from the environment and
everything works automatically.  If the user prefers he can set a
different locale with the ":language" command.

> Regretfully this method does not work on Linux: setlocale(LC_ALL,
> ".UTF-8") will return NULL. And currently I do not know the implementation
> details of a UNIX/Linux sprintf. So what I have suggested might be
> Windows-only at present.

This is very system dependent.  BSD is again different from Linux.  And
the locales present on a system may change over time.

> Yet a last hack to do it most simply is to call setlocale(LC_CTYPE, "C")
> in main.c, after setlocale(LC_ALL, ""). I have verified that it works (at
> least for my crash test case).

You could try doing ":language C" in your vimrc file.

--
(letter from Mark to Mike, about the film's probable certificate)
      I would like to get back to the Censor and agree to lose the shits, take
      the odd Jesus Christ out and lose Oh fuck off, but to retain 'fart in
      your general direction', 'castanets of your testicles' and 'oral sex'
      and ask him for an 'A' rating on that basis.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///




Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

Bram Moolenaar

Yongwei wrote:

> I do not mind using English messages (which is exactly what I am doing
> now). However, one must confess it is not ideal that on a Far East
> Windows system (with default locale settings), "set encoding=UTF-8"
> CANNOT co-exist with localized messages. It limits the
> usability/accessibility of Vim.

Using translated messages with utf-8 should work.  And as far as I know
it does work.

> Setting LANG is useless, in that no LANG setting I have tried can achieve
> the effect of (Chinese for example, but it is the same for Korean and
> Japanese)
>
> - setlocale(LC_ALL, "Chinese_China.65001")
> - set encoding=UTF-8
> - language messages zh_CN.UTF-8
>
> One can easily achieve the latter two in _vimrc, but not the first. Not
> even when I specify LANG=C.

I have no experience with an actual Asian locale on MS-Windows.  As far
as I know Vim picks it up from the environment and all you would need to
do is set 'encoding' to "utf-8" in your vimrc.  If this doesn't work
then we need to fix it.

> The point is that a Far East locale set by setlocale(LC_ALL/LC_CTYPE, "")
> makes sprintf fail with format strings invalid in that locale. I have
> proposed several ways of workarounds. The suggestion in my last message is
> that when a encoding is set, Vim should get the corresponding Windows code
> page for that encoding and call setlocale(LC_CTYPE, ".<THIS CODE PAGE>").
> (I wrote LC_ALL, but really LC_CTYPE is enough.)

I have no idea what side effects this has.  It's very well possible that
the locale with utf-8 isn't supported and the setlocale() call will
fail.  Thus it's not a generic solution.

If sprintf() works in a wrong way then it must be fixed or replaced.  I
see no reason why sprintf() needs to depend on the locale, it should
just manipulate bytes.  It can't know if the arguments are text or some
binary byte string.

> P.S. Commenting out the line setlocale(LC_ALL, "") in main.c seems to work
> quite well too, and all the Chinese menus and messages are still correct.
> This should be identical to setlocale(LC_ALL, "C"). I do not know what
> side-effects it could have. But it seems to work.

Yes, if you leave out the setlocale() call then the C locale is used by
default.  That's what I was suggesting before: use the C locale and set
the message language to Chinese.  Perhaps this should somehow be done by
Vim when 'encoding' is set to "utf-8".

--
To keep milk from turning sour: Keep it in the cow.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

adah
In reply to this post by adah
Bram wrote:

> Yongwei wrote:
>
> > I do not mind using English messages (which is exactly what I am
> > doing now). However, one must confess it is not ideal that on a Far
> > East Windows system (with default locale settings), "set
> > encoding=UTF-8" CANNOT co-exist with localized messages. It limits
> > the usability/accessibility of Vim.
>
> Using translate messages with utf-8 should work.  And as far as I know
> it does work.

No, it does not on Far East Windows systems with default locale
settings.  Nor does it on other versions of Windows if the user changes
all the locale settings to a Far East one.  Crashes and other
unpredictable results will occur.

> > Setting LANG is useless, in that no LANG setting I have tried can
> > achieve the effect of (Chinese for example, but it is the same for
> > Korean and Japanese)
> >
> > - setlocale(LC_ALL, "Chinese_China.65001")
> > - set encoding=UTF-8
> > - language messages zh_CN.UTF-8
> >
> > One can easily achieve the latter two in _vimrc, but not the first.
> > Not even when I specify LANG=C.
>
> I have no experience with an actual Asian locale on MS-Windows.  As
> far as I know Vim picks it up from the environment and all you would
> need to do is set 'encoding' to "utf-8" in your vimrc.  If this
> doesn't work then we need to fix it.

Right.  That's why I was proposing fixes.

> > The point is that a Far East locale set by
> > setlocale(LC_ALL/LC_CTYPE, "") makes sprintf fail with format
> > strings invalid in that locale. I have proposed several ways of
> > workarounds. The suggestion in my last message is that when a
> > encoding is set, Vim should get the corresponding Windows code page
> > for that encoding and call setlocale(LC_CTYPE, ".<THIS CODE PAGE>").
> > (I wrote LC_ALL, but really LC_CTYPE is enough.)
>
> I have no idea what side effects this has.  It's very well possible
> that the locale with utf-8 isn't supported and the setlocale() call
> will fail.  Thus it's not a generic solution.
>
> If sprintf() works in a wrong way then it must be fixed or replaced.
> I see no reason why sprintf() needs to depend on the locale, it should
> just manipulate bytes.  It can't know if the arguments are text or
> some binary byte string.

It can be argued whether sprintf is used in a wrong way, and whether
sprintf needs to depend on the locale.

- Pro.  Theoretically one should not presume what octet could occur as
  the second octet of a character in a multi-byte character encoding.
  So '%' could well appear as the second octet of such a multi-byte
  character.

- Con.  No Microsoft system code page I know of uses the range 0x00-0x3F
  as the second octet of a multi-byte character.  Of all the character
  encodings I know, only GB18030 uses the range 0x30-0x3F as well.  But
  notice that 0x25 ('%') is still excluded.

Apart from that, the locale also decides how the dot in 123.4 is output.

Whatever the arguments, the real-world case is that the sprintf in the
Microsoft runtime depends on locales.  I would not bet that no other C
runtime does something similar.

To make it work across platforms, I do not think localized strings could
be directly put into the format string of a *printf call.  "%s" *should*
be used in that case, IMHO.

> > P.S. Commenting out the line setlocale(LC_ALL, "") in main.c seems
> > to work quite well too, and all the Chinese menus and messages are
> > still correct.  This should be identical to setlocale(LC_ALL, "C").
> > I do not know what side-effects it could have. But it seems to work.
>
> Yes, if you leave out the setlocale() call then the C locale is used
> by default.  That's what I was suggesting before: use the C locale and
> set the message language to Chinese.  Perhaps this should somehow be
> done by Vim when 'encoding' is set to "utf-8".

The simplest solution to the current mess is a setlocale(LC_CTYPE, "C")
call or something identical (also note that the string returned by
setlocale(LC_ALL, ".65001") contains "LC_CTYPE=C" too: that's why it
works), if there are no side-effects.  Functions affected by LC_CTYPE
can be found here:

http://msdn.microsoft.com/library/en-us/vccore98/html/_crt_locale.asp

(But Microsoft is mistaken in declaring printf functions dependent on
LC_NUMERIC only: clearly they depend on LC_CTYPE since the isleadbyte
call they use depends on LC_CTYPE.)

I see that mblen is used in mbyte.c, and wcstombs is used in
gui_at_fs.c.  The former seems to have no problems, but I am worrying
about the latter.

So a reasonable solution really seems to be (in pseudo-code)

- in main.c:

string saved_locale = setlocale(LC_ALL, "");

- in places where 'encoding' is set:

string new_locale = saved_locale;
new_locale =~ s/\..*$/./;
new_locale += new_encoding_converted_to_native_format;
setlocale(LC_CTYPE, new_locale);
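
In C, the string manipulation in the pseudo-code above might be written as follows. The function name replace_codeset is made up for this sketch. One deliberate difference from the pseudo-code: when the saved locale has no '.', a dot is still inserted before the new code set instead of appending it directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "<language_country>.<codeset>" from a saved locale name by
 * replacing everything after the first '.'.  Returns 1 on success,
 * 0 if the result does not fit in out. */
int replace_codeset(const char *saved, const char *codeset,
                    char *out, size_t size)
{
    const char *dot;
    size_t prefix;
    int n;

    assert(saved != NULL && codeset != NULL && out != NULL);
    dot = strchr(saved, '.');
    prefix = dot ? (size_t)(dot - saved) : strlen(saved);
    n = snprintf(out, size, "%.*s.%s", (int)prefix, saved, codeset);
    return n > 0 && (size_t)n < size;
}
```

The result would be passed to setlocale(LC_CTYPE, ...); whether the runtime accepts the combination is system dependent, so the return value of setlocale still needs checking.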

Best regards,

Yongwei

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

Bram Moolenaar

Yongwei wrote:

> > If sprintf() works in a wrong way then it must be fixed or replaced.
> > I see no reason why sprintf() needs to depend on the locale, it should
> > just manipulate bytes.  It can't know if the arguments are text or
> > some binary byte string.
>
> It can be argued whether sprintf is used in a wrong way, and whether
> sprintf needs to depend on the locale.
>
> - Pro.  Theoretically one should not presume what octet could occur as
>   the second octet of a character in a multi-byte character encoding.
>   So '%' could well appear as the second octet of such a multi-byte
>   character.

This in fact changed between C standards.  Thus it's impossible to write
code that works with every compiler/library.  Same applies to the
backslash, which is an even bigger problem.  We ran into this problem
with message translation.  Fortunately, there it could be solved by
telling the message programs to use the old style strings.  I don't
think sprintf has such an option.

Fortunately, I just added the vim_snprintf() function.  Mainly to avoid
having to do size checks everywhere, now this function makes sure the
result fits in the available space.  We could replace all calls to
sprintf() by vim_snprintf().  This function works on bytes, it is not
multi-byte aware.  That means it works consistently for all locales.
A few extra backslashes may be required for translated messages with a %
in the second byte of a multi-byte char.
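
To illustrate what "works on bytes" means, here is a deliberately tiny formatter. It is not Vim's vim_snprintf(); the name bytewise_format is hypothetical, and it only understands %s, %d and %%. Every other byte, including one that a locale would call a lead byte, is copied through unchanged.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Copy n bytes if they fit, but always count them in *len. */
static void put_bytes(char *out, size_t size, size_t *len,
                      const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (*len + 1 < size)
            out[*len] = s[i];
        (*len)++;
    }
}

/* Returns the length the full result needs, like snprintf. */
int bytewise_format(char *out, size_t size, const char *fmt, ...)
{
    va_list ap;
    size_t len = 0;
    char num[32];

    assert(fmt != NULL);
    va_start(ap, fmt);
    while (*fmt != '\0') {
        if (fmt[0] == '%' && fmt[1] == 's') {
            const char *s = va_arg(ap, const char *);
            put_bytes(out, size, &len, s, strlen(s));
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] == 'd') {
            int n = snprintf(num, sizeof num, "%d", va_arg(ap, int));
            put_bytes(out, size, &len, num, (size_t)n);
            fmt += 2;
        } else if (fmt[0] == '%' && fmt[1] == '%') {
            put_bytes(out, size, &len, "%", 1);
            fmt += 2;
        } else {
            put_bytes(out, size, &len, fmt, 1);  /* no lead-byte check */
            fmt++;
        }
    }
    va_end(ap);
    if (size > 0)
        out[len < size ? len : size - 1] = '\0';
    return (int)len;
}
```

Because no locale function is consulted while scanning the format, the result is the same whatever setlocale() selected; the price is that a '%' byte inside a multi-byte character is still read as a directive.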

> - Con.  No microsoft system code pages I know uses range 0x00-0x3F as
>   the second octet of a multi-byte character.  Of all the character
>   encodings I know, only GB18030 uses the range 0x30-0x3F too.  But
>   notice that 0x25 ('%') is still excluded.

The problem with the backslash does exist though.  And it does occur as
a second byte in certain translated messages.  If this appears just
before a "%" character it will be skipped and the sprintf() arguments
get mixed up, that might cause a crash when a %s argument is a number.

> Apart from that, locales also decides how the dot in 123.4 is output.

Vim doesn't use floating point (except in the PostScript printing code,
where it shouldn't depend on the locale anyway).

> Whatever the arguments, the real-world case is that the sprintf in the
> Microsoft runtime depends on locales.  I would not bet that no other C
> runtimes do the similar.

I suspect the C99 standard specifies this.

> To make it work across platforms, I do not think localized strings could
> be directly put into the format string of a *printf call.  "%s" *should*
> be used in that case, IMHO.

The format string itself is often translated.  Avoiding that would be a
big hassle.  Also, when translating a message the position of arguments
often change, thus the % items must be in the translated message.

> > > P.S. Commenting out the line setlocale(LC_ALL, "") in main.c seems
> > > to work quite well too, and all the Chinese menus and messages are
> > > still correct.  This should be identical to setlocale(LC_ALL, "C").
> > > I do not know what side-effects it could have. But it seems to work.
> >
> > Yes, if you leave out the setlocale() call then the C locale is used
> > by default.  That's what I was suggesting before: use the C locale and
> > set the message language to Chinese.  Perhaps this should somehow be
> > done by Vim when 'encoding' is set to "utf-8".
>
> The simplest solution to the current mess is a setlocale(LC_CTYPE, "C")
> call or something identical (also note that the string returned by
> setlocale(LC_ALL, ".65001") contains "LC_CTYPE=C" too: that's why it
> works), if there are no side-effects.  Functions affected by LC_CTYPE
> can be found here:
>
> http://msdn.microsoft.com/library/en-us/vccore98/html/_crt_locale.asp
>
> (But Microsoft is mistaken in declaring printf functions dependent on
> LC_NUMERIC only: clearly they depend on LC_CTYPE since the isleadbyte
> call they use depends on LC_CTYPE.)

Microsoft documentation for library functions is often incomplete or
plain wrong.  Trying it out is often necessary to verify what actually
happens.

> I see that mblen is used in mbyte.c, and wcstombs is used in
> gui_at_fs.c.  The former seems to have no problems, but I am worrying
> about the latter.

We already take care of the situation that 'encoding' differs from the
system locale.  mblen() isn't used for Win32.  wcstombs() is only used
for Athena.

> So a reasonable solution really seems to be (in pseudo-code)
>
> - in main.c:
>
> string saved_locale = setlocale(LC_ALL, "");
>
> - in places where 'encoding' is set:
>
> string new_locale = saved_locale;
> new_locale ~= s/\..*$/./;
> new_locale += new_encoding_converted_to_native_format;
> setlocale(LC_CTYPE, new_locale);

I wonder if that works in general.  We might be better off setting the
LC_CTYPE locale to "C" always.  Then all library functions should work
on bytes, which is what Vim expects.  But I'm not sure what side effects
this has...

--
CART DRIVER: Bring out your dead!
   There are legs stick out of windows and doors.  Two MEN are fighting in the
   mud - covered from head to foot in it.  Another MAN is on his hands in
   knees shovelling mud into his mouth.  We just catch sight of a MAN falling
   into a well.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

adah
In reply to this post by adah
Bram wrote:

> Yongwei wrote:
>
> > - Pro.  Theoretically one should not presume what octet could occur
> > as the second octet of a character in a multi-byte character
> > encoding.  So '%' could well appear as the second octet of such a
> > multi-byte character.
>
> This in fact changed between C standards.  Thus it's impossible to
> write code that works with every compiler/library.  Same applies to
> the backslash, which is an even bigger problem.  We ran into this
> problem with message translation.  Fortunately, there it could be
> solved by telling the message programs to use the old style strings.

Or change the program appropriately.  Some time ago I changed something
like this

  if (str[len - 1] == '\\')

to

  if (_tcsrchr(str, '\\') == str + len - 1)

But _tcsrchr is not a cross-platform solution, I suppose.  Also it is
less efficient.  :-(

> I don't think sprintf has such an option.
>
> Fortunately, I just added the vim_snprintf() function.  Mainly to
> avoid having to do size checks everywhere, now this function makes
> sure the result fits in the available space.  We could replace all
> calls to sprintf() by vim_snprintf().  This function works on bytes,
> it is not multi-byte aware.  That means it works consistently for all
> locales.  A few extra backslashes may be required for translated
> messages with a % in the second byte of a multi-byte char.

Really good to know that a solution is not far ahead.

> > - Con.  No microsoft system code pages I know uses range 0x00-0x3F
> > as the second octet of a multi-byte character.  Of all the character
> > encodings I know, only GB18030 uses the range 0x30-0x3F too.  But
> > notice that 0x25 ('%') is still excluded.
>
> The problem with the backslash does exist though.  And it does occur
> as a second byte in certain translated messages.  If this appears just
> before a "%" character it will be skipped and the sprintf() arguments
> get mixed up, that might cause a crash when a %s argument is a number.

Yes.  Unlike '%', backslash is a valid second-octet in GBK (Windows code
page 936) and GB18030.

> > Apart from that, locales also decides how the dot in 123.4 is
> > output.
>
> Vim doesn't use floating point (except in the PostScript printing
> code, where it shouldn't depend on the locale anyway).

I am wondering whether this code works on German Windows, where 123.4
will be output as "123,4" after setlocale(LC_ALL, "")?

> > Whatever the arguments, the real-world case is that the sprintf in
> > the Microsoft runtime depends on locales.  I would not bet that no
> > other C runtimes do the similar.
>
> I suspect the C99 standard specifies this.
>
> > To make it work across platforms, I do not think localized strings
> > could be directly put into the format string of a *printf call.
> > "%s" *should* be used in that case, IMHO.
>
> The format string itself is often translated.  Avoiding that would be
> a big hassle.  Also, when translating a message the position of
> arguments often change, thus the % items must be in the translated
> message.

Er, I see....

> > The simplest solution to the current mess is a setlocale(LC_CTYPE,
> > "C") call or something identical (also note that the string returned
> > by setlocale(LC_ALL, ".65001") contains "LC_CTYPE=C" too: that's why
> > it works), if there are no side-effects.  Functions affected by
> > LC_CTYPE can be found here:
> >
> > http://msdn.microsoft.com/library/en-us/vccore98/html/_crt_locale.asp
> >
> > (But Microsoft is mistaken in declaring printf functions dependent
> > on LC_NUMERIC only: clearly they depend on LC_CTYPE since the
> > isleadbyte call they use depends on LC_CTYPE.)
>
> Microsoft documentation for library functions is often incomplete or
> plain wrong.  Trying it out is often necessary to verify what actually
> happens.
>
> > I see that mblen is used in mbyte.c, and wcstombs is used in
> > gui_at_fs.c.  The former seems to have no problems, but I am
> > worrying about the latter.
>
> We already take care of the situation that 'encoding' differs from the
> system locale.  mblen() isn't used for Win32.  wcstombs() is only used
> for Athena.

So setlocale(LC_CTYPE, "C") is safe?  Glad to know that.

> > So a reasonable solution really seems to be (in pseudo-code)
> >
> > - in main.c:
> >
> > string saved_locale = setlocale(LC_ALL, "");
> >
> > - in places where 'encoding' is set:
> >
> > string new_locale = saved_locale;
> > new_locale =~ s/\..*$/./;
> > new_locale += new_encoding_converted_to_native_format;
> > setlocale(LC_CTYPE, new_locale);
>
> I wonder if that works in general.  We might be better off setting the
> LC_CTYPE locale to "C" always.  Then all library functions should work
> on bytes, which is what Vim expects.  But I'm not sure what side
> effects this has...

So the conclusion is ...?  I would begin to use setlocale(LC_CTYPE, "C")
in main.c right away.

BTW, what is the `standard' way of building Vim on Windows?  Currently I
use

nmake -f Make_mvc.mak GUI=yes OLE=yes IME=yes

Best regards,

Yongwei

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

Bram Moolenaar

Yongwei wrote:

> > > - Pro.  Theoretically one should not presume what octet could occur
> > > as the second octet of a character in a multi-byte character
> > > encoding.  So '%' could well appear as the second octet of such a
> > > multi-byte character.
> >
> > This in fact changed between C standards.  Thus it's impossible to
> > write code that works with every compiler/library.  Same applies to
> > the backslash, which is an even bigger problem.  We ran into this
> > problem with message translation.  Fortunately, there it could be
> > solved by telling the message programs to use the old style strings.
>
> Or change the program appropriately.  Some time ago I changed something
> like this
>
>   if (str[len - 1] == '\\')
>
> to
>
>   if (_tcsrchr(str, '\\') == str + len - 1)
>
> But _tcsrchr is not a cross-platform solution, I suppose.  Also it is
> less efficient.  :-(

No, you cannot solve it that way.  The C89 compiler will recognize the
backslash in a trail byte as a backslash, the C99 compiler doesn't, it
sees one character (depending on the locale).  Thus for C89 you need to
double the backslash, for C99 you don't.  This is a clear
incompatibility that is impossible to solve without #ifdefs.

I would say this is a bug in the C99 standard, since you can break a
program by compiling it in another locale.  The only way out seems to be
using utf-8, which is locale-neutral.

> > Vim doesn't use floating point (except in the PostScript printing
> > code, where it shouldn't depend on the locale anyway).
>
> I am wondering whether this code works on German Windows, where 123.4
> will be output as "123,4" after setlocale(LC_ALL, "")?

We have the prt_write_real() function to work around this problem.
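
For readers wondering how such a workaround can avoid the locale's decimal separator, one possible approach is to format the integer and fractional parts separately with integer conversions. The sketch below is not Vim's prt_write_real(); format_real_c is a hypothetical name, and it only handles small precisions and values that fit in a long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes value with `precision` decimals and always a '.' separator.
 * Returns what snprintf returns. */
int format_real_c(char *out, size_t size, double value, int precision)
{
    long scale = 1;
    long scaled;
    int i;

    assert(precision >= 1 && precision <= 6);
    for (i = 0; i < precision; i++)
        scale *= 10;

    /* Round to the requested precision, then use only integer
     * conversions, which never print a decimal separator. */
    scaled = (long)((value < 0 ? -value : value) * (double)scale + 0.5);
    return snprintf(out, size, "%s%ld.%0*ld",
                    value < 0 ? "-" : "",
                    scaled / scale, precision, scaled % scale);
}
```

The '.' here is a literal character in the format string, so it does not change with LC_NUMERIC the way the separator produced by "%f" does.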

> > > I see that mblen is used in mbyte.c, and wcstombs is used in
> > > gui_at_fs.c.  The former seems to have no problems, but I am
> > > worrying about the latter.
> >
> > We already take care of the situation that 'encoding' differs from the
> > system locale.  mblen() isn't used for Win32.  wcstombs() is only used
> > for Athena.
>
> So setlocale(LC_CTYPE, "C") is safe?  Glad to know that.

I didn't say it was safe, just that we won't have problems with those
functions.  It's hard to know for sure it's safe for all functions.
I could introduce this and wait for things to go wrong...

Actually, I would expect X libraries to break when the locale isn't set
properly.  Perhaps we should do it for Win32 only.

> BTW, what is the `standard' way of building Vim on Windows?  Currently I
> use
>
> nmake -f Make_mvc.mak GUI=yes OLE=yes IME=yes

I mostly use the "bigvim.bat" script, but it requires installing Python,
Perl, etc.

- Bram

--
LARGE MAN:   Who's that then?
CART DRIVER: (Grudgingly) I dunno, Must be a king.
LARGE MAN:   Why?
CART DRIVER: He hasn't got shit all over him.
                 "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Very bad smell in MSVCRT (Was: Re: Details of the crash caused by "language messages zh_CN.UTF-8")

A.J.Mechelynck
In reply to this post by adah
[hidden email] wrote:
[...]
> BTW, what is the `standard' way of building Vim on Windows?  Currently I
> use
>
> nmake -f Make_mvc.mak GUI=yes OLE=yes IME=yes
>
> Best regards,
>
> Yongwei

I think I can answer that one. There is no single "exclusive" standard
way of building Vim on Windows. The following are "supported":

        MS Visual C/C++ with src/Make_mvc.mak
        Borland BCC32 5.5 with src/Make_bc5.mak
        Cygwin gcc with src/Make_cyg.mak
        MinGW gcc with src/Make_ming.mak

All of the above produce "native Windows" versions of Vim. In addition,

        Cygwin gcc with the top-level Makefile

will produce a "Unix-like" Vim for use with the Cygwin dll.

The procedures are slightly (and sometimes not so slightly) different
for each of the above methods. The second and third ones (using
Make_bc5.mak and Make_cyg.mak) are documented in, among others, my
"Compiling HowTo" for Vim on Windows,
http://users.skynet.be/vim/compile.htm . Each of the makefiles begins
with some documentation in the form of comments.


Best regards,
Tony.