Match the BOM

Match the BOM

zod-3

How do you match the BOM in a file you are editing in vim?

I've tried the following:

/\xfeff
/\o177377

I know you can just :set nobomb to get rid of it, but I'm
wondering how you would match it with a regex. Does the BOM fall
outside the Unicode range that Vim's regex engine uses, or do I just
have the syntax wrong?

Thanks,

zod
--~--~---------~--~----~------------~-------~--~----~
You received this message from the "vim_multibyte" maillist.
For more information, visit http://www.vim.org/maillist.php
-~----------~----~----~----~------~----~------~--~---

Re: Match the BOM

Tony Mechelynck

On 19/11/08 18:34, zod wrote:

> How do you match the BOM in a file you are editing in vim?
>
> I've tried the following:
>
> /\xfeff
> /\o177377
>
> I know you can just set nobomb to get rid of it, but I'm just
> wondering how you would regex it. Does the BOM fall outside of the
> unicode range that vim's regex engine uses or do I just have the
> syntax wrong?
>
> Thanks,
>
> zod

The BOM doesn't fall outside the Unicode range; rather, it is not
considered part of the text. When reading a file in ucs-bom mode, Vim
strips the BOM and sets the local 'bomb' option to remember that it must
be added back when writing the file.

To know if the current file has a BOM at its start, just use

        if &bomb

You can't match it with a regex because it is not stored in the buffer's
text.
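
As a minimal sketch of the check above, the option can be tested from a
script. The function name BomStatus is hypothetical, not something Vim
provides. Separately, a U+FEFF that sits inside the buffer text (for
instance in the middle of a file) is an ordinary character and can be
searched for with the \%u atom, which takes up to four hex digits. This
assumes 'encoding' is utf-8, and the leading BOM itself still won't be
found this way, since it was stripped on read:

```vim
" Hypothetical helper: describe the BOM state of the current buffer.
function! BomStatus() abort
  " 'bomb' is set when Vim found a BOM on read (or the user set it).
  if &bomb
    return 'BOM will be written'
  endif
  return 'no BOM'
endfunction

echo BomStatus()

" Line number of the next U+FEFF stored in the buffer text, or 0 if none.
" 'n' = do not move the cursor, 'w' = wrap around the end of the file.
echo search('\%ufeff', 'nw')
```

Note that \x and \o on their own are character classes in Vim patterns
(hex digit and octal digit), which is why /\xfeff and /\o177377 do not
name a code point.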


Best regards,
Tony.
--
What does it mean if there is no fortune for you?

Re: Match the BOM

zod-3

Thanks, Tony. That was driving me crazy.

Re: Match the BOM

François Pinard
In reply to this post by Tony Mechelynck

[Tony Mechelynck]

>On 19/11/08 18:34, zod wrote:

>> Does the BOM fall outside of the  unicode range that vim's regex
>> engine uses or do I just have the syntax wrong?

>The BOM doesn't fall outside the Unicode range, it is not considered
>part of the text [...]

Just for the record (and to be pedantic), while the BOM is a Unicode
character, the reversed BOM is not part of Unicode.  This is more
a philosophical detail than a technical issue. :-)

--
François Pinard   http://pinard.progiciels-bpi.ca

Re: Match the BOM

Tony Mechelynck

On 20/11/08 13:04, François Pinard wrote:

> [Tony Mechelynck]
>
>> On 19/11/08 18:34, zod wrote:
>
>>> Does the BOM fall outside of the  unicode range that vim's regex
>>> engine uses or do I just have the syntax wrong?
>
>> The BOM doesn't fall outside the Unicode range, it is not considered
>> part of the text [...]
>
> Just for the record (and to be pedantic), while the BOM is a Unicode
> character, the reversed BOM is not part of Unicode.  This is more
> a philosophical detail than a technical issue. :-)
>

To be still more pedantic, U+FFFE is part of the Unicode range, where it
is listed as "Not a character", i.e., it is one of the "forbidden"
codepoints which are "in range". (The "original" Unicode range, as still
supported by Vim for UTF-8, UTF-32be and UTF-32le, used to run from
U+0000 to U+7FFFFFFF. The Unicode Consortium later invalidated, among
others, (a) all planes above plane 0x10, and (b) the last two codepoints
U+xxFFFE and U+xxFFFF in every plane, which brings the highest "valid"
codepoint down to U+10FFFD at most.)
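
Rule (b) can be sketched as a small test. This is only an illustration:
IsPlaneEndNoncharacter is a hypothetical name, it covers just the
U+xxFFFE/U+xxFFFF rule within planes 0x00 to 0x10, and it ignores the
other invalidated codepoints alluded to by "among others".

```vim
" Hypothetical helper: 1 if cp is U+xxFFFE or U+xxFFFF in planes 0x00..0x10.
function! IsPlaneEndNoncharacter(cp) abort
  " Outside planes 0x00..0x10: not covered by this rule.
  if a:cp < 0 || a:cp > 0x10FFFF
    return 0
  endif
  " The low 16 bits are FFFE or FFFF exactly when (cp AND 0xFFFE) == 0xFFFE.
  return and(a:cp, 0xFFFE) == 0xFFFE
endfunction

" The BOM, U+FEFF
echo IsPlaneEndNoncharacter(0xFEFF)
" The reversed BOM, U+FFFE
echo IsPlaneEndNoncharacter(0xFFFE)
" The highest "valid" codepoint, U+10FFFD
echo IsPlaneEndNoncharacter(0x10FFFD)
```

The three calls print 0, 1 and 0: the BOM is an ordinary character, the
reversed BOM is not, and U+10FFFD sits just below the last two
codepoints of plane 0x10.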


Best regards,
Tony.
--
        His head smashed in,  and his heart cut out,
        And his liver removed, and his bowels unplugged,
        And his nostrils raped, and his bottom burned off,
        And his penis split ... and his ...
                  "Monty Python and the Holy Grail" PYTHON (MONTY) PICTURES LTD
