UTF8 In :s

Paul-433
I'm not sure if this is a bug or if I did something wrong.

I have a file in which every apostrophe is replaced with a blue "~R". My
default encoding is latin1, so I set encoding=utf8 and they changed to a blue
"<92>". I want to replace them all with apostrophes, so I do ":%s/<92>/'/g",
entering the "<92>" as "ctrl-v u 0092". They changed, but the character that
was after the "<92>" went missing, e.g. "don<92>t jump" changed to "don' jump".
It was as if the "<92>" in the substitute command also matched the following
character.

Incidentally, I changed encoding back to latin1 and all the "<92>"s changed
back to "~R"s. I then performed the substitute command again, only this time,
"~R" appeared when I typed "ctrl-v u 0092", and the effect was as originally
expected, i.e. "don~Rt jump" changed to "don't jump".

Maybe substitution is a bit broken with encoding=utf8?

Re: UTF8 In :s

A.J.Mechelynck
----- Original Message -----
From: "Vigil" <[hidden email]>
To: "Vim Mailing List" <[hidden email]>
Sent: Thursday, August 25, 2005 9:41 PM
Subject: UTF8 In :s


> I'm not sure if this is a bug or if I did something wrong.
>
> I have a file in which every apostrophe is replaced with a blue "~R". My
> default encoding is latin1, so I set encoding=utf8 and they changed to a
> blue "<92>". I want to replace them all with apostrophes, so I do
> ":%s/<92>/'/g", entering the "<92>" as "ctrl-v u 0092". They changed, but
> the character that was after the "<92>" went missing, eg "don<92>t jump"
> changed to "don' jump". It was as if the "<92>" in the substitute command
> was including proceeding character.
>
> Incidentally, I changed encoding back to latin1 and all the "<92>"s
> changed back to "~R"s. I then performed the substitute command again, only
> this time, "~R" appeared when I typed "ctrl-v u 0092", and the effect was
> as originally expected, ie. "don~Rt jump" changed to "don't jump".
>
> Maybe substitution is a bit broken with encoding=utf8?
>
> --

I believe it wasn't UTF-8. 0x92 cannot appear by itself in UTF-8 encoding:
it is a "trailing" (continuation) byte and can only appear in a sequence of
two or more bytes, the first of which is >= 0xC0 and all others of which are
in the range 0x80-0xBF. (The number of consecutive high-order "one" bits in
the leading byte is the total number of bytes in the sequence.)
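
To make the byte rule concrete, here is a small Python sketch (not Vim code). The helper name utf8_sequence_length is made up for this illustration, and it only inspects the high-order bits; it does not reject every invalid lead byte (0xC0 and 0xC1, for example).

```python
def utf8_sequence_length(byte: int) -> int:
    """Length announced by a UTF-8 lead byte; 0 if the byte cannot start a sequence.

    Hypothetical helper for illustration: it only counts the high-order
    "one" bits and does not perform full UTF-8 validation.
    """
    if byte < 0x80:
        return 1              # plain ASCII, a sequence of one byte
    ones = 0
    mask = 0x80
    while byte & mask:        # count consecutive high-order one bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        return 0              # 10xxxxxx is a trailing byte; more than 4 is not UTF-8
    return ones


# A lone 0x92 is a trailing byte, so strict UTF-8 decoding rejects it.
try:
    b"don\x92t jump".decode("utf-8")
    lone_0x92_is_valid = True
except UnicodeDecodeError:
    lone_0x92_is_valid = False

# The code point U+0092 itself is stored in UTF-8 as the two bytes C2 92.
u0092_as_utf8 = "\u0092".encode("utf-8")
```

Here utf8_sequence_length(0x92) gives 0 (not a valid start), while 0xC2 announces a two-byte sequence.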

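OTOH, 0x92 is sometimes used in files labelled Latin1 as a non-standard
representation of a typographical (left-concave) apostrophe; that usage comes
from Windows-1252, where 0x92 is U+2019 RIGHT SINGLE QUOTATION MARK. In
Latin1 proper, its official name is <control> = PRIVATE USE TWO, which could
be anything.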
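
The difference between the two readings of 0x92 can be checked outside Vim. This is a minimal Python sketch, assuming the sample bytes below stand in for a line of the file; "latin-1" and "cp1252" are Python's codec names for ISO-8859-1 and Windows-1252.

```python
import unicodedata

raw = b"don\x92t jump"               # sample bytes, assumed to mimic the file

as_latin1 = raw.decode("latin-1")    # strict ISO-8859-1 reading
as_cp1252 = raw.decode("cp1252")     # Windows-1252 reading

# In Latin1 the byte maps to U+0092, a C1 control character (category "Cc").
latin1_char = as_latin1[3]
latin1_category = unicodedata.category(latin1_char)

# In Windows-1252 the same byte maps to the typographical apostrophe U+2019.
cp1252_char = as_cp1252[3]
cp1252_name = unicodedata.name(cp1252_char)

print(hex(ord(latin1_char)), latin1_category)
print(hex(ord(cp1252_char)), cp1252_name)
```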

To see 0x92 in Windows latin1 as an apostrophe (not a vertical quote, which
is 0x27), be sure to include 146 (0x92 in decimal) in your 'isprint' option.
For instance,

    :set isprint=@,~-255

will do it. To _replace_ it by an apostrophe, set 'encoding' to latin1 and
_then_ do the substitute, for instance

    :%s/^Vx92/'/g

where ^Vx92 means "hit Ctrl-V, hit small-x-for-X-ray, hit digit-nine, hit
digit-two" and appears either as a blue ~R or as a left-concave apostrophe,
depending on your 'isprint' setting. (see ":help i_CTRL-V_digit" which
applies not only to Insert mode but also to Command-line mode). If you use
Ctrl-V to paste, then replace it here by Ctrl-Q.
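
If you would rather fix the file outside Vim, the same replacement can be done on the raw bytes. A minimal Python sketch, assuming the file is in a single-byte encoding (Latin1 or Windows-1252); replace_0x92 is a made-up helper name. Do not run it on a genuine UTF-8 file, where 0x92 can be a trailing byte of a multi-byte sequence and replacing it would corrupt the character.

```python
def replace_0x92(data: bytes) -> bytes:
    """Replace every 0x92 byte with a plain ASCII apostrophe (0x27).

    Hypothetical helper; assumes `data` is single-byte encoded text
    (Latin1 or Windows-1252), not UTF-8.
    """
    return data.replace(b"\x92", b"'")


sample = b"don\x92t jump"        # stand-in for the file contents
fixed = replace_0x92(sample)
print(fixed)
```

For a real file, read it with open(path, "rb"), pass the bytes through the helper, and write the result back with open(path, "wb"); the path is whatever your file is called.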


Best regards,
Tony.



Re: UTF8 In :s

Wes Potts
In reply to this post by Paul-433
Unfortunately, I don't have an answer for your problem.  But, I have a
similar problem (although it doesn't present itself in vim) and I hope
someone might have seen this before.  I have a SUSE Linux box for my
personal use.  The man pages on this machine are all displayed with
backticks ( '`' ) appearing as 'â'  (if it displays incorrectly, it is
an 'a' with a circumflex '^').  If anyone could point me toward the
problem, I would greatly appreciate it.

Wes
