bug with :s/a.\{-}$//


Milan Vancura-2
Hi all,

recently I was editing a file containing several columns and wanted to delete
just the last one. To my surprise, vim deleted all but the first column.
But the bug appears with any text, not just with special chars like <TAB>.

Description:

.\{-\} should match as few chars as possible. However, if used in a regexp just
before $, it acts greedily - it matches as many chars as possible.

Version of vim: 6.2.154
        (Of course I've checked ftp://ftp.vim.org/pub/vim/patches/*/README
        to see if it is a known bug, but I haven't found anything similar)

How to reproduce:

        vim -u NONE -U NONE

        insert the following text:

        the first line with abc and next abc with some text after that
        the second line with abc and next abc with some text after that
        the third line with abc and next abc with some text after that
        the fourth line with abc and next abc with some text after that


        Then type:

        :%s/abc.\{-}$//

What is expected:

        the first line with abc and next
        the second line with abc and next
        the third line with abc and next
        the fourth line with abc and next

What I get:

        the first line with
        the second line with
        the third line with
        the fourth line with


Milan Vancura
--
Milan Vancura, Prague, Czech Republic, Europe

Re: bug with :s/a.\{-}$//

Nikolai Weibull-11
On 3/23/06, Milan Vancura <[hidden email]> wrote:
>         :%s/abc.\{-}$//

Use :%s/^\(.*\)abc.*$/\1/

The matcher doesn't work the way you seem to think it does.  It's not
that "clever".

  nikolai

Re: bug with :s/a.\{-}$//

Matthew Winn
In reply to this post by Milan Vancura-2
On Thu, Mar 23, 2006 at 09:41:47AM +0100, Milan Vancura wrote:
> recently I was editing a file containing several columns and wanted to delete
> just the last one. To my surprise, vim deleted all but the first column.
> But the bug appears with any text, not just with special chars like <TAB>.
>
> Description:
>
> .\{-\} should match as few chars as possible. However, if used in a regexp just
> before $, it acts greedily - it matches as many chars as possible.

You misunderstand the meaning of "as few chars as possible".

The way regular expressions work is that they start from the left and
look for the first match they can find.  In the case of your example...

> the first line with abc and next abc with some text after that
> :%s/abc.\{-}$//

...the regular expression engine finds the first occurrence of "abc",
then checks that the rest of the expression matches (which it does), so
it stops.  It won't search onwards for another match to try to reduce
the length of the text matched by ".\{-}": all the ".\{-}" means is "as
few characters as possible from the point we've just reached".
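
The same behaviour can be reproduced outside Vim. Below is a minimal sketch that uses Python's `re` module as a stand-in for Vim's engine (an assumption made only so the example can be run anywhere); Python's `.*?` plays the role of Vim's `.\{-}`. It shows that the match begins at the first "abc" and that laziness does not move the start to the right.

```python
import re

# Sample line from the original report.
line = "the first line with abc and next abc with some text after that"

# Python's `.*?` is the lazy counterpart of Vim's `.\{-}`.
match = re.search(r"abc.*?$", line)

# The match starts at the FIRST "abc" (index 20), not the second one.
print(match.start())  # 20

# Deleting the match therefore removes everything from the first "abc" on.
result = re.sub(r"abc.*?$", "", line)
print(repr(result))  # 'the first line with '
```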

To get the effect you want you have to precede the expression with ".*"
to match the longest possible string before the "abc".  Either of the
following will work:

        :%s/\(.*\)abc.*/\1/
        :%s/.*\zsabc.*//

(Note that you can replace the trailing ".\{-}$" with ".*": the leading
".*" forces the match as far to the right as possible, so you just need
to gobble up all the remaining text.)
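
As a rough cross-check of the capture-group variant, here is a small sketch in Python's `re` module (used here as a stand-in for Vim, which is an assumption; the syntax differs slightly, with `(...)` instead of `\(...\)`). The greedy leading `.*` pushes the "abc" as far right as it can go, so only the last column is dropped. Note that the space before the last "abc" is kept.

```python
import re

# Sample line from the original report.
line = "the first line with abc and next abc with some text after that"

# Greedy (.*) first grabs the whole line, then backtracks to the LAST "abc".
result = re.sub(r"(.*)abc.*", r"\1", line)

print(repr(result))  # 'the first line with abc and next '
```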

The important thing to remember about regular expressions is that they
always start the match as early as possible, and will NEVER try to start
a match further to the right unless there is no way to match at an
earlier position.
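
To make that rule concrete, here is a simplified sketch of the outer loop of a backtracking matcher, written in Python (a stand-in language, and the helper name `leftmost_match` is hypothetical). It is not how Vim is actually implemented; it only illustrates the "earliest start wins" rule by trying each start position in turn and stopping at the first success.

```python
import re

def leftmost_match(pattern, text):
    """Try every start position from left to right; return the first match."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # match() only succeeds if the pattern matches exactly at `start`.
        m = compiled.match(text, start)
        if m is not None:
            return m  # later (possibly shorter) matches are never considered
    return None

line = "the first line with abc and next abc with some text after that"

manual = leftmost_match(r"abc.*?$", line)
builtin = re.search(r"abc.*?$", line)

# Both approaches agree: the match starts at the first "abc".
print(manual.span() == builtin.span())  # True
print(manual.start())  # 20
```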

--
Matthew Winn ([hidden email])

Re: bug with :s/a.\{-}$//

Milan Vancura-2
> You misunderstand the meaning of "as few chars as possible".
>
> The way regular expressions work is that they start from the left and
> look for the first match they can find.  In the case of your example...

Ouch!

Thank you for your explanation. I did expect that the answer from the list would
be "this is a feature, not a bug" but I couldn't find the right reason
before... :-)

In fact, I think this still is a bug, but a much harder and more complex one: I
think this way of dealing with regexps (check from the left and start at the
first match) is a relic from the times before "\{...}" existed - probably before
vim version 6 (?). It was a good optimization at that time. However, when the
regexp language expanded, this optimization stopped working, so it should be
rewritten.

However, neither Bram nor I have enough time to do this huge work (and testing!),
I know... What about vim 8? :-)

Milan Vancura

Re: bug with :s/a.\{-}$//

Stefan Walk
Milan Vancura wrote:
> In fact, I think this still is a bug, but a much harder and more complex one: I
> think this way of dealing with regexps (check from the left and start at the
> first match) is a relic from the times before "\{...}" existed - probably before
> vim version 6 (?). It was a good optimization at that time. However, when the
> regexp language expanded, this optimization stopped working, so it should be
> rewritten.

It's not a vim thing, it is that way in every regexp engine i know (i'm
100% sure it's that way in PCRE and Ruby's Regexp), so changing that
would create a huge WTF factor for those proficient with regex.

Regards,
Stefan

Re: bug with :s/a.\{-}$//

Nikolai Weibull-11
In reply to this post by Milan Vancura-2
On 3/23/06, Milan Vancura <[hidden email]> wrote:

> > You misunderstand the meaning of "as few chars as possible".
> >
> > The way regular expressions work is that they start from the left and
> > look for the first match they can find.  In the case of your example...
>
> Ouch!
>
> Thank you for your explanation. I did expect that the answer from the list would
> be "this is a feature, not a bug" but I couldn't find the right reason
> before... :-)
>
> In fact, I think this still is a bug, but a much harder and more complex one: I
> think this way of dealing with regexps (check from the left and start at the
> first match) is a relic from the times before "\{...}" existed - probably before
> vim version 6 (?). It was a good optimization at that time. However, when the
> regexp language expanded, this optimization stopped working, so it should be
> rewritten.

No, it's not a bug.  It's just how these things work.  It will try
every position in the line, look for an 'a', then look for a 'b', then
a 'c', then any number of characters (as few as possible) until the
end-of-line is found.  There is no reason for it to then go back and
try shorter overall matches.  \{-} and its cousins only guarantee as
few matches as possible of the preceding atom (in this case . or "any
character"), not a "shortest overall match".
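
A small sketch of that distinction, using Python's `re` module as a stand-in for Vim (an assumption; `.*?` corresponds to `.\{-}` and `.*` to itself). The pattern "abc, anything, then t" is chosen purely for illustration. Lazy and greedy versions start at the same position and differ only in where the match ends.

```python
import re

# Sample line from the original report.
line = "the first line with abc and next abc with some text after that"

lazy = re.search(r"abc.*?t", line)   # as few characters as possible before "t"
greedy = re.search(r"abc.*t", line)  # as many characters as possible before "t"

# Same starting point: the first "abc" in the line.
print(lazy.start(), greedy.start())  # 20 20

# Only the END of the match depends on lazy versus greedy.
print(lazy.group())    # abc and next
print(greedy.group())  # abc and next abc with some text after that
```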

What _is_ a bug is that depending on how you order your branches,
different things will be matched.  You would expect the left-most
longest match (as is what's happening with your example above) most of
the time, but there is no such guarantee for branches with the way the
backtracking engine used in Vim is designed.  And don't blame Bram,
it's all Henry Spencer's fault! ;-)

Try :s/^\(a\|aa\)/b/ on a line with two 'a's on it.  Then try
:s/^\(aa\|a\)/b/ on a line with two 'a's on it.  Now that's a bug.
But it's only a bug in the specification, not in the actual
implementation.
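
The branch-order effect can be seen in any backtracking engine. Here is a minimal sketch in Python's `re` module (a stand-in for Vim, which is an assumption; Python writes the group as `(a|aa)` rather than `\(a\|aa\)`). The first alternative that allows a match wins, so the two orderings give different results on the same input, whereas a leftmost-longest matcher in the POSIX sense would replace both 'a's either way.

```python
import re

text = "aa"

# The first branch "a" succeeds, so only one 'a' is replaced.
short_first = re.sub(r"^(a|aa)", "b", text)

# The first branch "aa" succeeds, so both 'a's are replaced.
long_first = re.sub(r"^(aa|a)", "b", text)

print(short_first)  # ba
print(long_first)   # b
```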

 nikolai

Re: bug with :s/a.\{-}$//

Milan Vancura-2
In reply to this post by Stefan Walk
> It's not a vim thing, it is that way in every regexp engine i know (i'm
> 100% sure it's that way in PCRE and Ruby's Regexp), so changing that
> would create a huge WTF factor for those proficient with regex.

Oh, really! Even perl does the same thing. Sorry for the confusion.
It still surprises me [ :-) ] but it really is the case.


Milan Vancura

Re: bug with :s/a.\{-}$//

John (Eljay) Love-Jensen
In reply to this post by Milan Vancura-2
Hi Milan,

Sidenote...

Here is *THE* definitive book on the topic of regular expressions:

Mastering Regular Expressions (2nd ed)
by Jeffrey Friedl
http://www.amazon.com/gp/product/0596002890/

Highly recommended, regardless of whether your RE-du-jour is grep, vim, perl,
or whatever.

HTH,
--Eljy


Re: bug with :s/a.\{-}$//

Benji Fisher
On Thu, Mar 23, 2006 at 06:21:38AM -0600, John Love-Jensen wrote:

> Hi Milan,
>
> Sidenote...
>
> Here is *THE* definitive book on the topic of regular expressions:
>
> Mastering Regular Expressions (2nd ed)
> by Jeffrey Friedl
> http://www.amazon.com/gp/product/0596002890/
>
> Highly recommended, regardless of whether your RE-du-jour is grep, vim, perl,
> or whatever.
>
> HTH,
> --Eljy

     Follow the link from

http://iccf-holland.org/click5.html

and a portion of the cost goes to Vim's charity.  (:help Uganda)

                                        --Benji Fisher

Re: bug with :s/a.\{-}$//

Milan Vancura-2
In reply to this post by John (Eljay) Love-Jensen
> Here is *THE* definitive book on the topic of regular expressions:

Yes, of course, I know about this book. This e-mail thread is a good "life
lesson" - even after 10 years of work with perl, vim, grep, sed etc. there is
something in regexps which can surprise me.

Good not to be too self-confident :-))

Milan Vancura

P.S.: This is already mentioned in vim's own help, no external book is
needed, I should just read more carefully next time:

:help \{

...
        If a "-" appears immediately after the "{", then a shortest match
        first algorithm is used (see example below).  In particular, "\{-}" is
        the same as "*" but uses the shortest match first algorithm.  BUT: A
        match that starts earlier is preferred over a shorter match: "a\{-}b"
        matches "aaab" in "xaaab".
...
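
The example from the help text can be checked with a short sketch in Python's `re` module (a stand-in for Vim, which is an assumption; Python's `a*?b` corresponds to Vim's `a\{-}b`). A shorter match, the lone "b" at the end, exists, but the match that starts earlier is the one returned.

```python
import re

# Lazy "a" repetitions followed by "b", searched in the help text's sample string.
match = re.search(r"a*?b", "xaaab")

print(match.group())  # aaab
print(match.start())  # 1
```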