Automatically change encoding when opening file

BPJ
I regularly get files encoded in UTF16LE sent to me and want them
to be automatically converted to UTF8 when opening them.
I'm not sure what command to use/put in .vimrc for that though, so
I wonder if anybody is already doing this, and how?

/bpj

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php

Re: Automatically change encoding when opening file

Tony Mechelynck
On 15/11/12 09:18, BPJ wrote:
> I regularly get files encoded in UTF16LE sent to me and want them to be
> automatically converted to UTF8 when opening them.
> I'm not sure what command to use/put in .vimrc for that though, so
> I wonder if anybody is already doing this, and how?
>
> /bpj
>

If a file in UTF-16le has a BOM (the codepoint U+FEFF at the very
beginning of the file, which for UTF-16le means the bytes 0xFF 0xFE),
then if you have set Vim to use UTF-8 'encoding' in your vimrc that file
will usually be opened correctly (because the default 'fileencodings'
-plural- starts with "ucs-bom"). See
http://vim.wikia.com/wiki/Working_with_Unicode about how to set Vim up
like that.
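
To make the "ucs-bom" check more concrete, here is a minimal Python sketch of BOM sniffing on the first bytes of a file. It only illustrates the idea and is not Vim's actual implementation; the function name sniff_bom and the encoding labels it returns are made up for this example.

```python
import codecs

# Longer signatures are tested first: the UTF-32le BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16le BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32le"),
    (codecs.BOM_UTF32_BE, "utf-32be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM gives None here, which is the harder case described in the next paragraph.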

If the file has no BOM it is a little harder to detect the correct
'fileencoding'. I think there is a Chinese regular of this list who has
a plugin to do more detailed encoding matching than what Vim does out of
the box but I don't know the details.

Now, once the 'fileencoding' -singular- has been detected as UTF-16le it
is possible to convert it, even automatically, like this:

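" Convert UTF-16le / UCS-2le files to UTF-8 as soon as they are read.
if has('multi_byte') && &encoding ==# 'utf-8'
     augroup vimrc_utf8
         " Clear the group so re-sourcing the vimrc does not add duplicates.
         autocmd!
         " Register the BufReadPost handler at VimEnter so that it runs after
         " handlers defined by plugins. Buffers already loaded during startup
         " (such as a file named on the command line) are read before
         " VimEnter, so they are not converted.
         autocmd VimEnter * autocmd vimrc_utf8 BufReadPost *
             \ if &fenc ==? 'utf-16le' || &fenc ==? 'ucs-2le' |
                 \ setlocal fenc=utf-8 |
                 \ w! |
             \ endif
     augroup END
endif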

The exclamation mark is necessary if the file is marked readonly, but is
not on a readonly filesystem. Alternatively, if you don't want to
convert read-only files (including files opened by :view), replace the
"if" line inside the autocommand by

        \ if !&ro && (&fenc ==? 'utf-16le' || &fenc ==? 'ucs-2le') |

and then the exclamation mark may be left out on the "w[rite]" line.

It may look weird to define an autocommand inside an autocommand, but I
do it like this to ensure that this BufReadPost autocommand comes last,
and in particular, after anything defined by any additional
encoding-detection plugin you might install.
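
For comparison, the same conversion can be sketched outside Vim. The following Python function is a hypothetical helper (the name convert_utf16le_to_utf8 is invented here), assuming the file is already known to be UTF-16le. Unlike the autocommand above, this sketch deliberately drops a leading BOM rather than carrying it over; that is a choice made for the example, not a description of what Vim does.

```python
import codecs


def convert_utf16le_to_utf8(path):
    """Rewrite the file at `path` from UTF-16le to UTF-8, in place."""
    with open(path, "rb") as f:
        raw = f.read()
    # Drop a leading BOM (bytes FF FE) if present, then decode the rest.
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    text = raw.decode("utf-16-le")
    # Write the same text back as UTF-8, without a BOM.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
```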


Best regards,
Tony.
--
The mome rath isn't born that could outgrabe me.
                -- Nicol Williamson

Re: Automatically change encoding when opening file

Ven Tadipatri
On Thu, Nov 15, 2012 at 8:31 AM, Tony Mechelynck
<[hidden email]> wrote:

>
> If a file in UTF-16le has a BOM (the codepoint U+FEFF at the very beginning
> of the file, which for UTF-16le means the bytes 0xFF 0xFE), then if you have
> set Vim to use UTF-8 'encoding' in your vimrc that file will usually be
> opened correctly (because the default 'fileencodings' -plural- starts with
> "ucs-bom"). See http://vim.wikia.com/wiki/Working_with_Unicode about how to
> set Vim up like that.
>

Hi Antoine,

I'm not really that familiar with the different encoding types (UTF-8,
UTF-16, etc.), but I came across a strange <feff> character which
I think is related to what you're describing.
  I opened two files in gedit and they seem to contain exactly the same
line. But in Vim, there's a strange character at the beginning:
"<feff>". It's not a string, because if I go to the beginning of the
line and hit 'x', it deletes the entire <feff>, indicating it's some
sort of special hidden character.
  What is this strange character?  In Vim's hex mode (%!xxd), I can see
there is a sequence of bytes "efbbbf", and the rest of the file seems
to somehow be offset.

Thanks,
Ven



Re: Automatically change encoding when opening file

Benjamin Fritz
On Thursday, November 29, 2012 2:41:16 PM UTC-6, vtadipatri wrote:

> I'm not really that familiar with the different encoding types (UTF-8,
> UTF-16, etc.), but I came across a strange <feff> character which
> I think is related to what you're describing.
>   I opened two files in gedit and they seem to contain exactly the same
> line. But in Vim, there's a strange character at the beginning:
> "<feff>". It's not a string, because if I go to the beginning of the
> line and hit 'x', it deletes the entire <feff>, indicating it's some
> sort of special hidden character.
>   What is this strange character?  In Vim's hex mode (%!xxd), I can see
> there is a sequence of bytes "efbbbf", and the rest of the file seems
> to somehow be offset.

This strange character is the byte order mark (BOM; see http://en.wikipedia.org/wiki/Byte_order_mark). The byte sequence you see, ef bb bf, is the UTF-8 encoding of that character, so the file is in UTF-8 with a BOM. Vim probably did not detect the file as UTF-8.

Check that:
1. your Vim is compiled with multibyte support
2. your 'encoding' option is set AT THE VERY BEGINNING OF YOUR .VIMRC to utf-8
3. your 'fileencodings' option contains ucs-bom or utf-8 or both, before any 8-bit encodings.

If these are all the case your Vim should automatically detect the utf-8 fileencoding and the presence of a BOM, and set 'fenc' and 'bomb' appropriately.
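
As an analogy outside Vim, the following Python sketch shows the difference between a decoder that treats the BOM as ordinary text and one that recognizes it as a signature. It is only an illustration of why a stray <feff> can show up at the start of the first line, not a description of Vim's internals; the sample bytes are made up.

```python
raw = b"\xef\xbb\xbfhello"  # UTF-8 BOM followed by ASCII text

# Plain "utf-8" keeps the BOM as an ordinary first character, U+FEFF.
kept = raw.decode("utf-8")

# "utf-8-sig" recognizes the BOM as a signature and strips it while decoding.
stripped = raw.decode("utf-8-sig")

print(repr(kept))      # '\ufeffhello'
print(repr(stripped))  # 'hello'
```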

See the page http://vim.wikia.com/wiki/Working_with_Unicode linked by Tony.

Re: Automatically change encoding when opening file

Marcin Szamotulski
In reply to this post by Ven Tadipatri
On 15:41 Thu 29 Nov, Ven Tadipatri wrote:

> I'm not really that familiar with the different encoding types (UTF-8,
> UTF-16, etc.), but I came across a strange <feff> character which
> I think is related to what you're describing.
>   I opened two files in gedit and they seem to contain exactly the same
> line. But in Vim, there's a strange character at the beginning:
> "<feff>". It's not a string, because if I go to the beginning of the
> line and hit 'x', it deletes the entire <feff>, indicating it's some
> sort of special hidden character.
>   What is this strange character?  In Vim's hex mode (%!xxd), I can see
> there is a sequence of bytes "efbbbf", and the rest of the file seems
> to somehow be offset.

Hi,

This is the bom (byte order mark) character:
http://en.wikipedia.org/wiki/Byte_order_mark

<feff> is the BOM character for UTF-16 encoding.  UTF-16 stores text in
2-byte units, but the order of the two bytes might differ. The BOM
character tells which byte comes first.
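
A small Python sketch of the byte-order point, assuming nothing beyond the standard codecs: the same code point U+FEFF comes out as two different byte sequences depending on the byte order, and reading the bytes with the wrong order yields a different code point.

```python
bom = "\ufeff"

le = bom.encode("utf-16-le")  # least significant byte first
be = bom.encode("utf-16-be")  # most significant byte first

print(le.hex())  # fffe
print(be.hex())  # feff

# The little-endian bytes FF FE read as big-endian give U+FFFE,
# which is a noncharacter, so a reader can tell the order is wrong.
swapped = le.decode("utf-16-be")
print(hex(ord(swapped)))  # 0xfffe
```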

Best,
Marcin

Re: Automatically change encoding when opening file

Benjamin Fritz
On Thursday, November 29, 2012 3:16:33 PM UTC-6, coot_. wrote:
> <feff> is the BOM character for UTF-16 encoding.  UTF-16 stores text in
> 2-byte units, but the order of the two bytes might differ. The BOM
> character tells which byte comes first.

U+FEFF is the BOM character for UTF-8 as well, where it does not have any meaning in terms of byte ordering, but can be used to identify a file as UTF-8.

In UTF-8, the U+FEFF character is represented as ef bb bf (three bytes) due to the way UTF-8 encodes code points as variable-length byte sequences.

The interesting thing about UTF-8 is that often, even if an editor misidentifies a UTF-8 file as Latin1 or as windows-1252, for example, most of the file will remain readable, because UTF-8 encodes every ASCII character as the same single byte that Latin1 and windows-1252 use. Only the non-ASCII characters come out garbled.
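
Both points can be checked with a short Python sketch. The sample string is made up for the illustration.

```python
# U+FEFF takes three bytes in UTF-8.
print("\ufeff".encode("utf-8").hex())  # efbbbf

text = "naïve café"
utf8_bytes = text.encode("utf-8")

# Misreading the UTF-8 bytes as Latin1: the ASCII letters survive, while
# each non-ASCII character turns into two unrelated Latin1 characters.
misread = utf8_bytes.decode("latin-1")
print(misread)  # naÃ¯ve cafÃ©
```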
