enc,fenc (again!?)

Alessandro Antonello
Hi, all.

I know that this has been discussed for a long time but I still don't
understand why Vim behaves like that.

Sometimes I need to work with several buffers, some in the natural 'utf-8'
encoding. I say natural because it is the system default (I am on a Mac) and
is what Vim's 'encoding' option is set to. But some buffers need to be in
'latin1'. Opening the files in the right encoding is easy, since I use the
'++enc' option, so the files are read correctly. The problem begins when I use
'bnext' and 'bprevious' (or a similar command) to navigate between the
buffers. Since my encoding is 'utf-8', when I navigate from a buffer with this
encoding to a buffer that was opened with '++enc=latin1', that buffer gets its
encoding changed to 'utf-8'.

I have been trying to solve this problem for a long time without success.
What I can't understand is why Vim ignores the buffer's 'fileencoding' option.
It is set correctly when the buffer is opened. To be sure about that, I set
'fileencoding' manually when the buffer is opened.

Let me explain exactly what I do. I open a file with '++enc=latin1'. In the
'BufRead' autocommand I set 'fenc' like this: call setbufvar('%', '&fenc',
'latin1').

Then I navigate to another buffer that is in another encoding (like 'utf-8').
When I come back to the first buffer, its 'fenc' is 'utf-8' and I
don't understand why!

My current global configuration is as follows:

encoding=utf-8
fileencodings=ucs-bom,utf-8,default,latin1
fileencoding=utf-8

--
Alessandro Antonello

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

Re: enc,fenc (again!?)

Nikolay Aleksandrovich Pavlov
Reply to message «enc,fenc (again!?)»,
sent 20:57:23 24 October 2010, Sunday
by Alessandro Antonello:

> Sometimes I need to work with several buffers, some in the natural 'utf-8'
> encoding. I say natural because it is the system default (I am on a Mac)
> and is what Vim's 'encoding' option is set to. But some buffers need to be
> in 'latin1'. Opening the files in the right encoding is easy, since I use
> the '++enc' option, so the files are read correctly. The problem begins
> when I use 'bnext' and 'bprevious' (or a similar command) to navigate
> between the buffers. Since my encoding is 'utf-8', when I navigate from a
> buffer with this encoding to a buffer that was opened with '++enc=latin1',
> that buffer gets its encoding changed to 'utf-8'.
You should have written a script that shows your problem and uses `-u NONE'. I
managed to reproduce this behavior with the following script:
    touch a
    touch b
    vim -u NONE -c 'set nohidden' \
                -c 'e a' \
                -c 'echo &fenc' \
                -c 'e ++enc=latin1 b' \
                -c 'echo &fenc' \
                -c 'bn' \
                -c 'bn' \
                -c 'echo &fenc' \
                -c 'qa!'
Output:
    "a" lines: 0, characters: 0
    utf-8
    "b" [converted] lines: 0, characters: 0
    latin1
    "a" lines: 0, characters: 0
    "b" lines: 0, characters: 0
    utf-8
This is vim-7.3 from the Gentoo repos. I do not know whether this is the
intended behavior when a buffer is abandoned, but it does not happen if you
add `set hidden' somewhere in your .vimrc.


Re: enc,fenc (again!?)

Alessandro Antonello
Hi, ZyX.

Thanks for your reply.

Well, I read the help about 'hidden' and what you said makes a lot of sense.
Since my 'hidden' option is off, the buffers are unloaded and their
respective internal variables are unloaded with them.

I don't have 'nohidden' in my .vimrc. I partially solved the problem with a
modeline. But I know that this is not a real solution, since the modeline is
only read after the buffer has already been loaded, possibly in the wrong
character set. For now I am only mixing utf-8 with latin1 in Portuguese,
which does not cause many issues. I will try the 'hidden' setting.

Thanks again.

Alessandro Antonello


Re: enc,fenc (again!?)

Tony Mechelynck
In reply to this post by Alessandro Antonello
On 24/10/10 18:57, Alessandro Antonello wrote:

> Then I navigate to another buffer that is in another encoding (like 'utf-8').
> When I come back to the first buffer, its 'fenc' variable is 'utf-8' and I
> don't understand why!
>
> My current global configuration is as follows:
>
> encoding=utf-8
> fileencodings=ucs-bom,utf-8,default,latin1
> fileencoding=utf-8
>

When loading an existing file (or reloading one which you just edited)
which contains only bytes in the range 0-127, it will be detected as
UTF-8 without BOM, in preference to Latin1. This is not an error,
because these 128 characters are represented identically in US-ASCII,
Latin1 (and most other non-EBCDIC encodings) and UTF-8; and UTF-8 has to come
before Latin1 in 'fileencodings', or it would never be detected. As long
as you don't add any characters with the high bit set, the file will be
read or written exactly the same way if its 'fileencoding' is set to
utf-8, latin1, or even us-ascii.
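
As a small illustration of the paragraph above, here is a Python sketch. It uses Python's codec names as stand-ins for Vim's 'fileencoding' values, and the sample strings are made up for the example. It shows that a pure 7-bit text has the same bytes in all three encodings, while text with accented letters does not.

```python
# Illustration only: Python codecs stand in for Vim's 'fileencoding' values.
ascii_text = "set hidden\n"   # only characters in the range 0-127
accented = "ação"             # contains characters above 127

# A 7-bit text is byte-for-byte identical in all three encodings.
same = (
    ascii_text.encode("ascii")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("utf-8")
)

# With accented letters the Latin1 and UTF-8 byte sequences differ:
# Latin1 uses one byte per letter, UTF-8 uses two for each accented one.
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")

print(same)
print(len(latin1_bytes), len(utf8_bytes))
```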

See ":help 'fileencodings'" (or ":h 'fencs'" if you're a lazy typist ;-) )
for an explanation of how the charset, and BOM if any, of an existing
file are detected.

If you want to be sure that a given file will be loaded as Latin1
(assuming 'fileencodings' is set to "ucs-bom,utf-8,latin1", or maybe
"ucs-bom,utf-8,default,latin1": trying UTF-8 once as itself and once as
default entails only a negligible performance loss in most cases), make
sure that it contains one or more characters in the range 128-255. You
might use accented letters, possibly in a string literal or in a comment,
or underline the top title of a plaintext file with a line of
"divided by" signs rather than dashes or equals signs. Then ":setlocal
fenc=latin1" will have it detected as Latin1 once the file has next been
saved.


Best regards,
Tony.

Re: enc,fenc (again!?)

Tony Mechelynck
On 03/11/10 22:56, Alessandro Antonello wrote:

> Hi, Tony.
>
> The problem I was facing was, indeed, caused by the 'hidden' option. It
> was 'nohidden' (the default), so buffers were unloaded when *abandoned* and
> their local variables, like 'fenc', were lost. Using 'set hidden' in my
> *.vimrc* solves everything: now I can go from a 'utf-8' buffer to a 'latin1'
> one and back again, and everything works fine.

You'll still see the problem (but see below) if you edit a Latin1 file,
close Vim, and reopen that file later.

>
> My primary working machine is a PC with Windows XP, where 'latin1' is the
> default encoding, but right now I am working on a MacBook, where 'utf-8' is
> the default encoding. To be able to exchange files and projects without losing
> information I set 'fileencodings=ucs-bom,latin1,default,utf-8' in the *.vimrc*
> on the Mac so 'latin1' takes precedence over 'utf-8'. When needed, '++enc=utf-8'
> is used to load 'utf-8' encoded files and everything is working beautifully. Vim
> is just a superb editor!

Since latin1 is an 8-bit encoding, it cannot give a "fail" signal:
fencs=ucs-bom,latin1,default,utf-8 means the same as
fencs=ucs-bom,latin1 i.e. whenever there is no BOM, the file will be
detected as Latin1 because none of the 256 possible byte values, in any
sequence, is invalid for Latin1 -- and if it is actually UTF-8 without
BOM, anything above U+007F will appear as two or more characters of
gibberish.
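
The "gibberish" effect can be reproduced outside Vim. The following Python sketch is only an illustration (Python's latin-1 codec stands in for Vim's latin1, and the sample word is arbitrary): decoding as Latin1 never raises an error, but each non-ASCII letter of a UTF-8 text comes out as two unrelated characters.

```python
# Illustration only: BOM-less UTF-8 bytes taken for Latin1.
word = "coração"                    # Portuguese, two non-ASCII letters
utf8_bytes = word.encode("utf-8")

# Latin1 assigns a character to every byte value, so decoding cannot fail,
# not even for all 256 byte values in a row.
all_bytes_ok = len(bytes(range(256)).decode("latin-1")) == 256

# The UTF-8 bytes decode "successfully", but to the wrong characters.
misread = utf8_bytes.decode("latin-1")

print(all_bytes_ok)
print(misread)
```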

In the 'fileencodings' option, "ucs-bom", if present, should be first,
and an 8-bit encoding, if present, should be last (which means that at
most one 8-bit encoding should be used), because anything that comes
after the first 8-bit encoding will never be used.

Setting fencs=ucs-bom,utf-8,latin1 means the following:

1) Is there a BOM at the very start of the file? Then setlocal bomb,
"eat" the BOM, and setlocal the corresponding Unicode 'fileencoding',
otherwise setlocal nobomb and:

2) Are the full contents of the file valid for UTF-8? (and note: 7-bit
ASCII is valid for both UTF-8 and Latin1 and is displayed the same in
both) -- if yes, setlocal fenc=utf-8; otherwise

3) Unconditionally setlocal fenc=latin1
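
The three steps above can be sketched in Python. This is a rough model under stated assumptions, not Vim's actual implementation: the function name `detect` is made up, the returned names are Python codec names rather than Vim's, and only the UTF-8 and UTF-16 BOMs are checked (UTF-32 is left out to keep the sketch short).

```python
import codecs
from typing import Tuple

# Simplified model of fencs=ucs-bom,utf-8,latin1 (hypothetical helper).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect(data: bytes) -> Tuple[str, bool]:
    """Return (encoding, has_bom) following the three steps above."""
    # 1) ucs-bom: a BOM at the very start decides everything.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, True
    # 2) utf-8: accepted only if the whole content is valid UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8", False
    except UnicodeDecodeError:
        pass
    # 3) latin1: every byte sequence is valid, so this step cannot fail.
    return "latin-1", False

print(detect(b"plain ASCII"))              # step 2 succeeds
print(detect("ação".encode("latin-1")))    # step 2 fails, step 3 applies
print(detect(codecs.BOM_UTF8 + b"abc"))    # step 1 applies
```

Swapping steps 2 and 3 in this sketch would make the UTF-8 branch unreachable, which is why an 8-bit encoding belongs at the end of the list.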

>
> Thanks for your help.
>
> Alessandro Antonello
>

My pleasure.

Best regards,
Tony.

Re: enc,fenc (again!?)

Tony Mechelynck
On 05/11/10 03:59, Alessandro Antonello wrote:

> Hi!
>
> I see what you mean now.
>
> 1) Yes, I have a BOM at the start of every utf-8 file. You are saying that I
> should set 'fencs=ucs-bom,utf-8,latin1'. What would happen if I open a utf-8
> file from the command line using just 'gvim filename.ext', assuming that the
> file has a BOM? Would Vim recognize the BOM and set 'fenc=utf-8'? What if I
> use 'gvim filename.ext' for a file in latin1 encoding with no BOM? Would Vim
> recognize that it has no BOM and set 'fenc=latin1', assuming that I have
> 'enc=latin1' defined?

If a file has a BOM, it is recognised by the first heuristic (ucs-bom),
and nothing else comes into play.

A Latin1 file never has a BOM, so the "ucs-bom" heuristic will fail and
the next heuristics will be tried in turn. If that file contains
characters above 0x7F, it will be found "invalid for UTF-8" by the
second (utf-8) heuristic, which will also fail. The latin1 heuristic
cannot fail and the file will get the equivalent of ":setlocal nobomb
fenc=latin1". The 'bomb' option is immaterial when fenc=latin1, but it
is set to false by the failing ucs-bom heuristic, which cannot know
whether or not a Unicode encoding will later be detected for this file.

A file entirely in 7-bit ASCII is valid for both UTF-8 and Latin1, so
whichever of these is tried first will give a "success" signal, and that
encoding will be used as the file's 'fileencoding'. As long as the file
contains only 7-bit ASCII data, it makes no difference whether it is
read and written as UTF-8 (without BOM) or Latin1, since 7-bit ASCII is
represented identically in both.

If 'encoding' is set to latin1, however, you have a bigger problem: in
that case most Unicode codepoints (in fact, any codepoint above U+00FF)
cannot be represented in Vim's _internal memory_, and such
"unrepresentable" codepoints will be garbled, probably replaced by
inverted question marks or something like that. See the code snippet I
wrote in some earlier post in this thread, or the page
http://vim.wikia.com/Working_with_Unicode , about how to make sure at
startup (in your vimrc) that Vim can edit anything, and how to do it
cleanly (because if you change 'encoding' after Vim has started, once
some editfile(s) have been loaded in memory, there's a good chance the
data in memory for such file(s) will get hopelessly corrupt).

"Anything" includes, if necessary, a page like my own homepage,
http://users.skynet.be/antoine.mechelynck/index.htm , which contains
text not only in several "Western" languages but also in Esperanto
(which is Latin script but not Latin1-compatible; however, the
"incompatible" accented letters don't appear in that page) and in
Russian, Arabic, Chinese and Japanese; or a page like
http://users.skynet.be/antoine.mechelynck/other/imbecile.htm , which
contains a single sentence in many languages including Portuguese (I
don't know if from Portugal, Angola, Mozambique, Macau, Brazil, or which
combination of them).

>
> 2) No, not all files are valid for both UTF-8 and Latin1. Some files are in
> Portuguese (Brazilian) with accents, cedillas, etc.

Latin1 files with accents, cedillas, etc. will not be accepted by Vim as
UTF-8 files, see my second paragraph from top, above, and the discussion
about the difference between 7-bit ASCII (which is valid in both) and
Latin1 with "higher-ASCII" characters in the hex range 80-FF (which isn't).

>
> Right now I don't trust Vim's automatic behavior. Almost all the source
> files that I have are in latin1 encoding. This is why I put 'set enc=latin1'
> in my *.vimrc*, even on the Mac, which is UTF-8 by default. Just a few XML
> files are in 'utf-8'. For these I always use '++enc=utf-8' when opening or
> creating them.

That won't work. ++enc=utf-8 changes 'fileencoding', not 'encoding' (and
rightly so, because of the risk of corrupting other already loaded
files' data by changing 'encoding'); if 'encoding' is at Latin1 Vim
won't be able to represent in memory any codepoint above U+00FF. For
instance you won't be able to load correctly into Vim either of the two
webpages I mentioned above if 'encoding' is set to Latin1. Not only the
non-Latin1 letters won't be visible, but they will be garbled in Vim memory.
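
To see which characters are affected, here is a small Python check. It is only an illustration: the helper `fits_latin1` is hypothetical, Python's latin-1 codec stands in for 'encoding' set to latin1, and what Vim substitutes for an unrepresentable character is not modeled.

```python
def fits_latin1(ch: str) -> bool:
    """True if the character can be stored as a single Latin1 byte."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Sample letters chosen for the example: Portuguese, Esperanto, Russian.
samples = {"Portuguese": "ç", "Esperanto": "ĉ", "Russian": "я"}

for language, letter in samples.items():
    # Only codepoints up to U+00FF fit into Latin1.
    print(f"{language}: U+{ord(letter):04X} fits: {fits_latin1(letter)}")
```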

>
> Don't get me wrong. I love Vim. I don't trust its behavior because I had
> problems in the past with utf-16le encoded files. Maybe I just didn't get the
> configuration right at that time. Since then I have used the configuration I
> showed. But I am open to your advice, and I'll try it the way you said.
>
> Thanks again.
> Alessandro Antonello
>

If 'encoding' is set to UTF-8, Vim will correctly load and edit:
- any UTF-16 (be or le) file with BOM, if "ucs-bom" comes first in
'fileencodings';
- any UTF-16 (be or le) file without BOM, if read with the proper ++enc=
modifier. If the file is read as Latin1 by the "automatic" process,
about every second character will be shown as ^@ (a null); then reread
it with e.g. ":e ++enc=utf-16le" and all will be OK.
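
The "every second character is a null" symptom is easy to reproduce. In this Python sketch (an illustration only, with an arbitrary sample string and Python codecs standing in for Vim's encodings), BOM-less UTF-16LE bytes are first misread as Latin1 and then read again with the right encoding.

```python
# Illustration only: a BOM-less UTF-16LE file misread as Latin1.
raw = "fenc".encode("utf-16-le")   # this codec writes no BOM

misread = raw.decode("latin-1")    # what a Latin1 reading produces
reread = raw.decode("utf-16-le")   # comparable to rereading with ++enc=utf-16le

print(repr(misread))   # every second character is a NUL, which Vim shows as ^@
print(reread)
```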

Best regards,
Tony.