Locale problems (Was: Re: Very bad smell in MSVCRT)

adah
Bram wrote:

> > > This in fact changed between C standards.  Thus it's impossible to
> > > write code that works with every compiler/library.  Same applies
> > > to the backslash, which is an even bigger problem.  We ran into
> > > this problem with message translation.  Fortunately, there it
> > > could be solved by telling the message programs to use the old
> > > style strings.
> >
> > Or change the program appropriately.  Some time ago I changed
> > something like this
> >
> >   if (str[len - 1] == '\\')
> >
> > to
> >
> >   if (_tcsrchr(str, '\\') == str + len - 1)
> >
> > But _tcsrchr is not a cross-platform solution, I suppose.  Also it
> > is less efficient.  :-(
>
> No, you cannot solve it that way.  The C89 compiler will recognize the
> backslash in a trail byte as a backslash, the C99 compiler doesn't, it
> sees one character (depending on the locale).  Thus for C89 you need
> to double the backslash, for C99 you don't.  This is a clear
> incompatibility that is impossible to solve without #ifdefs.
>
> I would say this is a bug in the C99 standard, since you can break a
> program by compiling it in another locale.  The only way out seems to
> be using utf-8, which is locale-neutral.

If I understand you correctly, you mean that an MBCS character can be
misparsed by a C89 compiler and/or under another locale.  I have not
read the C89 standard, but MSVC, a C89 compiler, clearly behaves more
like what you attribute to C99 compilers.  I have never needed to
"double the backslash" in MSVC when writing native strings.  In fact,
doing so would be very difficult, because I cannot know the second byte
of a character without using a hex editor or writing a program.
However, I can see that GCC may have problems with this.
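
To make the trail-byte issue concrete, here is a minimal C sketch.  It
assumes Shift-JIS as the double-byte encoding (my assumption for
illustration; the thread does not name one), where the two-byte
character 0x95 0x5C has a trail byte equal to the ASCII backslash.  The
function names and the hard-coded lead-byte ranges are hypothetical
stand-ins for whatever a real MBCS library provides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: Shift-JIS lead-byte ranges, hard-coded for illustration. */
int is_sjis_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive check: looks only at the last byte. */
int ends_with_backslash_naive(const char *s, size_t len)
{
    assert(s != NULL);
    return len > 0 && s[len - 1] == '\\';
}

/* MBCS-aware check: walk from the start so we know whether the last
 * byte is a character of its own or the trail byte of a pair. */
int ends_with_backslash_mbcs(const char *s, size_t len)
{
    size_t i = 0;
    int last_is_backslash = 0;

    assert(s != NULL);
    while (i < len) {
        if (is_sjis_lead_byte((unsigned char)s[i]) && i + 1 < len) {
            last_is_backslash = 0;   /* two-byte character */
            i += 2;
        } else {
            last_is_backslash = (s[i] == '\\');
            i += 1;
        }
    }
    return last_is_backslash;
}
```

For the five bytes "dir" followed by 0x95 0x5C, the naive check reports
a trailing backslash while the MBCS-aware one does not.  The price is a
scan from the start of the string, which matches the "less efficient"
remark quoted above.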

I do not see it as a bug in the standard that a program can break when
compiled in another locale.  javac has an -encoding option precisely to
specify the source encoding.  Of course, one *should* not put localized
strings directly into C source files in non-ASCII-clear ways.  I am
afraid UTF-8 strings (if not encoded in an ASCII-clear way) can have
problems too, because MSVC on Chinese Windows does not see them as valid
Chinese strings.  I once had to reboot into an English locale in order
to build GSView32 for Windows.
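
One way to read "ASCII-clear" is to spell the non-ASCII bytes with
escapes, so that the source file itself is pure ASCII and the
compiler's guess about the source encoding no longer matters.  A
minimal sketch, assuming the intended text is the two characters U+4E2D
U+6587 encoded as UTF-8; the identifiers are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* U+4E2D U+6587 as UTF-8, written with hex escapes only. */
const char zh_utf8[] = "\xe4\xb8\xad" "\xe6\x96\x87";

/* Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

The array holds six bytes plus the terminating NUL, and utf8_count()
sees two characters.  Splitting the literal into adjacent pieces is a
precaution: a hex escape in C has no length limit, so an escape followed
directly by a character such as 'a' through 'f' would absorb it.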

> > So setlocale(LC_CTYPE, "C") is safe?  Glad to know that.
>
> I didn't say it was safe, just that we won't have problems with those
> functions.  It's hard to know for sure it's safe for all functions.  I
> could introduce this and wait for things to go wrong...
>
> Actually, I would expect X libraries to break when the locale isn't
> set properly.  Perhaps we should do it for Win32 only.

What impact could it have?  I would love to see a solution that is not
Win32-only (but I would love even more to see it in mainstream :-)).

Best regards,

Yongwei

Locale problems (Was: Re: Very bad smell in MSVCRT)

Bram Moolenaar

Yongwei wrote:

> > No, you cannot solve it that way.  The C89 compiler will recognize the
> > backslash in a trail byte as a backslash, the C99 compiler doesn't, it
> > sees one character (depending on the locale).  Thus for C89 you need
> > to double the backslash, for C99 you don't.  This is a clear
> > incompatibility that is impossible to solve without #ifdefs.
> >
> > I would say this is a bug in the C99 standard, since you can break a
> > program by compiling it in another locale.  The only way out seems to
> > be using utf-8, which is locale-neutral.
>
> If I understand you correctly, you meant that a MBCS character can be
> wrongly recognized in C89 compilers and/or another locale.  I have not
> read the C89 standard, but MSVC, as a C89 compiler, obviously behaves
> more like what you attribute to C99 compilers.  I never need to "double
> the backslash" in MSVC when I was writing native strings.  In fact, it
> is very difficult to do that (because I never know the second byte of a
> character without using a hex editor or writing a program).  However, I
> see that GCC can have problems on this.
>
> I do not see it as a bug in the standard that a program can break when
> compiled in another locale.  Javac has an -encoding exactly to specify
> the source encoding.  Of course, one *should* not directly specify
> localized string in C source files in non-ASCII-clear ways.  I am afraid
> UTF-8 strings (if not encoded in ASCII-clear way) can have problems too,
> because they are not valid Chinese strings as seen by MSVC, when
> compiled on Chinese Windows.  I have one experience having to reboot in
> an English locale in order to build GSView32 for Windows.

In my opinion the right solution would have been to introduce a new
string type that depends on the locale, instead of changing the meaning
of the string that everybody uses and breaking old programs.  Something
like L"Chinese-chars".  Perhaps Microsoft pushed the standard for the
solution that they were already using (commercial interest often
overrules good choices).

> > > So setlocale(LC_CTYPE, "C") is safe?  Glad to know that.
> >
> > I didn't say it was safe, just that we won't have problems with those
> > functions.  It's hard to know for sure it's safe for all functions.  I
> > could introduce this and wait for things to go wrong...
> >
> > Actually, I would expect X libraries to break when the locale isn't
> > set properly.  Perhaps we should do it for Win32 only.
>
> What impact could it have?  I would love to see a non-Win32-only
> solution (but I would love more to see it in mainstream :-)).

The problem is that libraries are different on every system.  It's
nearly impossible to predict what breaks if you change something global
to the whole program, which is what the locale is.  We might end up
putting lots of library functions inside Vim (such as the snprintf()
that I now added).
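
Because the locale is process-wide state, one cautious pattern is to
change only LC_CTYPE, only around the calls that need it, and then to
restore the previous value.  The sketch below uses standard C only; the
helper names are hypothetical, and it does nothing about the X library
concern raised above.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Example callback (hypothetical): count bytes that isupper() accepts. */
int count_upper(void *arg)
{
    const unsigned char *p = arg;
    int n = 0;

    for (; *p != '\0'; ++p)
        if (isupper(*p))
            ++n;
    return n;
}

/* Run fn(arg) with LC_CTYPE set to "C", then restore the old value.
 * Returns -1 if the old locale cannot be saved or "C" cannot be set. */
int call_with_c_ctype(int (*fn)(void *), void *arg)
{
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved;
    int result;

    assert(fn != NULL);
    if (cur == NULL)
        return -1;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);   /* later setlocale() calls may overwrite cur */

    if (setlocale(LC_CTYPE, "C") == NULL) {
        free(saved);
        return -1;
    }
    result = fn(arg);
    setlocale(LC_CTYPE, saved);
    free(saved);
    return result;
}
```

A call such as call_with_c_ctype(count_upper, (void *)"AbC") leaves the
other locale categories untouched.  The change is still visible to the
whole process while fn runs, so this is not safe with multiple threads.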

--
Never under any circumstances take a sleeping pill
and a laxative on the same night.

 /// Bram Moolenaar -- [hidden email] -- http://www.Moolenaar.net   \\\
///        Sponsor Vim, vote for features -- http://www.Vim.org/sponsor/ \\\
\\\              Project leader for A-A-P -- http://www.A-A-P.org        ///
 \\\     Buy LOTR 3 and help AIDS victims -- http://ICCF.nl/lotr.html   ///

Re: Locale problems (Was: Re: Very bad smell in MSVCRT)

adah
In reply to this post by adah
Bram wrote:

> Yongwei wrote:
>
> > If I understand you correctly, you meant that a MBCS character can
> > be wrongly recognized in C89 compilers and/or another locale.  I
> > have not read the C89 standard, but MSVC, as a C89 compiler,
> > obviously behaves more like what you attribute to C99 compilers.  I
> > never need to "double the backslash" in MSVC when I was writing
> > native strings.  In fact, it is very difficult to do that (because I
> > never know the second byte of a character without using a hex editor
> > or writing a program).  However, I see that GCC can have problems on
> > this.
> >
> > I do not see it as a bug in the standard that a program can break when
> > compiled in another locale.  Javac has an -encoding exactly to
> > specify the source encoding.  Of course, one *should* not directly
> > specify localized string in C source files in non-ASCII-clear ways.
> > I am afraid UTF-8 strings (if not encoded in ASCII-clear way) can
> > have problems too, because they are not valid Chinese strings as
> > seen by MSVC, when compiled on Chinese Windows.  I have one
> > experience having to reboot in an English locale in order to build
> > GSView32 for Windows.
>
> In my opinion the right solution would have been to introduce a new
> string type that depends on the locale, instead of changing the
> meaning of the string that everybody uses and breaking old programs.
> Something like L"Chinese-chars".  Perhaps Microsoft pushed the
> standard for the solution that they were already using (commercial
> interest often overrules good choices).

A little off-topic, but I do not agree with you on this.  Few existing
programs, if any, will suffer from the "C99 change".  I really cannot
imagine people inserting backslashes among localized strings.  On the
other hand, there are already many existing internationalized programs
using Microsoft's solution, possibly developed on a Far East version of
Windows.  It is not only in Microsoft's interest to make them legal.  I
think the only thing missing is a Java-like -encoding option on the
compiler side.

Best regards,

Yongwei

Re: Locale problems (Was: Re: Very bad smell in MSVCRT)

Bram Moolenaar

Yongwei wrote:

> > > If I understand you correctly, you meant that a MBCS character can
> > > be wrongly recognized in C89 compilers and/or another locale.  I
> > > have not read the C89 standard, but MSVC, as a C89 compiler,
> > > obviously behaves more like what you attribute to C99 compilers.  I
> > > never need to "double the backslash" in MSVC when I was writing
> > > native strings.  In fact, it is very difficult to do that (because I
> > > never know the second byte of a character without using a hex editor
> > > or writing a program).  However, I see that GCC can have problems on
> > > this.
> > >
> > > I do not see it as a bug in the standard that a program can break when
> > > compiled in another locale.  Javac has an -encoding exactly to
> > > specify the source encoding.  Of course, one *should* not directly
> > > specify localized string in C source files in non-ASCII-clear ways.
> > > I am afraid UTF-8 strings (if not encoded in ASCII-clear way) can
> > > have problems too, because they are not valid Chinese strings as
> > > seen by MSVC, when compiled on Chinese Windows.  I have one
> > > experience having to reboot in an English locale in order to build
> > > GSView32 for Windows.
> >
> > In my opinion the right solution would have been to introduce a new
> > string type that depends on the locale, instead of changing the
> > meaning of the string that everybody uses and breaking old programs.
> > Something like L"Chinese-chars".  Perhaps Microsoft pushed the
> > standard for the solution that they were already using (commercial
> > interest often overrules good choices).
>
> A little off-topic.  But I do not agree with you on this.  Few existing
> programs will suffer from the "C99 change", if any.  I really cannot
> imagine people inserting backslashes among localized strings.

It did break the message translations for Vim.  Fortunately the people
making the GNU tools were willing to provide a fix; otherwise we would
have had to come up with a clumsy workaround.

> On the other hand, there are already many existing internationalized
> programs using Microsoft's solution, possibly developed on a Far East
> version of Windows.  It is not only in Microsoft's interest to make
> them legal.  I think the only thing wanting is a Java-like -encoding
> option on the side of the compilers.

There is always a good reason to make a bad choice.

--
Back off man, I'm a scientist.
              -- Peter, Ghostbusters
