regexp help needed

Alexei Alexandrov
Hi All,

I thought I knew regular expressions well, but today I stumbled over a problem that seems simple, yet I cannot find a regexp that solves it.

I need the following: find all multiline fragments of a 200M text file such that the first line of the fragment contains WORD1, the last line contains WORD2, and no WORD3 appears between WORD1 and WORD2.

So this should match:

asldkjfasfklasjd WORD1 asldkjf asldj adslkfj
sldkfja ;asdlkjf asdlkf lsakdjf a;lsfj
alsdkjfa dlkjasdf WORD3 laskdjfasldkfj
asdlfj a;sldjfoiertu;kj asdf ;askdjf
aslkdfj asd WORD2

But this must not:

asldkjfasfklasjd WORD1 asldkjf asldj adslkfj
sldkfja ;asdlkjf asdlkf lsakdjf a;lsfj
asdlfj a;sldjfoiertu;kj asdf ;askdjf
aslkdfj asd WORD2

Finally, I was able to solve this problem by using something like

:substitute/WORD1\_.\{-}WORD2/\=match(submatch(0),"WORD3") >= 0 ? submatch(0) : ">>>" . submatch(0) . "<<<"/g

and then examining all the text between the >>> and <<< markers. But I'm sure there must be a way to do this with a single regexp match. Can anyone point me to it?
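
For readers outside Vim, the same two-step idea (grab each lazy WORD1...WORD2 span, then discard the spans containing WORD3) can be sketched in Python. This is only an illustration: the function name `fragments_without` and the sample text are made up here, not taken from the post.

```python
import re

def fragments_without(text, start="WORD1", end="WORD2", banned="WORD3"):
    """Return lazy start..end spans that do not contain `banned`."""
    # re.DOTALL lets "." cross line breaks, like \_. in a Vim pattern.
    span = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    return [m.group(0) for m in span.finditer(text) if banned not in m.group(0)]

# Hypothetical sample: the first fragment contains WORD3, the second does not.
sample = "a WORD1 b\nc WORD3 d\ne WORD2\nf WORD1 g\nh WORD2 i\n"
hits = fragments_without(sample)
print(hits)
```

One caveat of filtering after the fact: a rejected span can swallow a later WORD1 that would have started a valid fragment, so this may report fewer fragments than a single-pattern search would.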

--
Alexei Alexandrov

Re: regexp help needed

Tim Chase-2
 > I need the following: find all multiline fragments of 200M
 > text file so that first line of the fragment contains
 > WORD1, and the last line contains WORD2, and no WORD3
 > appears between WORD1 and WORD2.
 >
 > So this should match:
 >
 > asldkjfasfklasjd WORD1 asldkjf asldj adslkfj
 > sldkfja ;asdlkjf asdlkf lsakdjf a;lsfj
 > alsdkjfa dlkjasdf WORD3 laskdjfasldkfj
 > asdlfj a;sldjfoiertu;kj asdf ;askdjf
 > aslkdfj asd WORD2
 >
 > But this must not:
 >
 > asldkjfasfklasjd WORD1 asldkjf asldj adslkfj
 > sldkfja ;asdlkjf asdlkf lsakdjf a;lsfj
 > asdlfj a;sldjfoiertu;kj asdf ;askdjf
 > aslkdfj asd WORD2

I'm not quite sure I follow what you're asking, though it
might just be that your description of what *should* match
("no WORD3 appears between WORD1 and WORD2") and the example
you label "this should match" don't jibe with each other :)

If you want what you describe first ("no WORD3...between
WORD1 and WORD2"), then this should do it for you:

     /WORD1\%(\%(WORD[23]\)\@!\_.\)\{-}WORD2
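
As a rough cross-check, here is what that pattern might look like in Python's `re` syntax. The translation is an assumption of this sketch, not something from the thread, and the two sample strings are invented for illustration.

```python
import re

# Assumed mapping from the Vim pattern to Python's re syntax:
#   \%( ... \)      -> (?: ... )    non-capturing group
#   \%(X\)\@!       -> (?!X)        negative lookahead
#   \_.             -> [\s\S]       any character, including newline
#   \{-}            -> *?           as few repetitions as possible
NO_WORD3 = re.compile(r"WORD1(?:(?!WORD[23])[\s\S])*?WORD2")

clean = "x WORD1 a\nb c\nd WORD2"        # no WORD3 in between
dirty = "x WORD1 a\nb WORD3 c\nd WORD2"  # WORD3 in between

clean_hit = NO_WORD3.search(clean)
dirty_hit = NO_WORD3.search(dirty)
print(clean_hit.group(0) if clean_hit else None)
print(dirty_hit.group(0) if dirty_hit else None)
```

The key step is that each consumed character is first checked by the lookahead, so the scan can never walk across a WORD2 or WORD3 on its way to the closing WORD2.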

If, however, you want to find those instances that *do* have
WORD3 between WORD1 and WORD2, then something like

     /WORD1\%(\%(WORD2\)\@!\_.\)\{-}WORD3\_.\{-}WORD2

or if you want *one and only one* instance of WORD3 between them:

     /WORD1\%(\%(WORD[23]\)\@!\_.\)\{-}WORD3\%(\%(WORD3\)\@!\_.\)\{-}WORD2
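
The "at least one" and "exactly one" variants can be compared side by side in Python's `re` syntax. As before, this is a hedged translation for illustration: the names `AT_LEAST_ONE` and `EXACTLY_ONE` and the three sample strings are invented here.

```python
import re

# At least one WORD3 before the first WORD2 that follows WORD1.
AT_LEAST_ONE = re.compile(r"WORD1(?:(?!WORD2)[\s\S])*?WORD3[\s\S]*?WORD2")

# Exactly one WORD3: no WORD2/WORD3 before it, no further WORD3 after it.
EXACTLY_ONE = re.compile(
    r"WORD1(?:(?!WORD[23])[\s\S])*?WORD3(?:(?!WORD3)[\s\S])*?WORD2"
)

zero = "WORD1 a\nb\nWORD2"
one = "WORD1 a\nWORD3 b\nWORD2"
two = "WORD1 a\nWORD3 b\nWORD3 c\nWORD2"

for name, text in (("zero", zero), ("one", one), ("two", two)):
    print(name, bool(AT_LEAST_ONE.search(text)), bool(EXACTLY_ONE.search(text)))
```

Under these assumptions, only the text with a single WORD3 satisfies both patterns, while the text with two WORD3s satisfies only the first.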

Hope this gives you some regexps to work with that do what
you want.

-tim

Re: regexp help needed

Alexei Alexandrov
Hi Tim Chase, you wrote:

>
> I'm not quite sure I follow what you're asking, though it
> might just be that your description of what *should* match
> ("no WORD3 appears between WORD1 and WORD2") and the example
> you label "this should match" don't jibe with each other :)
>
> If you want what you describe first ("no WORD3...between
> WORD1 and WORD2"), then this should do it for you:
>
>      /WORD1\%(\%(WORD[23]\)\@!\_.\)\{-}WORD2
>
> If, however, you want to find those instances that *do* have
> WORD3 between WORD1 and WORD2, then something like
>
>      /WORD1\%(\%(WORD2\)\@!\_.\)\{-}WORD3\_.\{-}WORD2
>
> or if you want *one and only one* instance of WORD3 between them:
>
>      /WORD1\%(\%(WORD[23]\)\@!\_.\)\{-}WORD3\%(\%(WORD3\)\@!\_.\)\{-}WORD2
>
> Hope this gives you some regexps to work with that do what
> you want.
>
> -tim

Tim, thanks a lot for such a full and helpful answer!

--
Alexei Alexandrov