How can I “grep” patterns across multiple lines?

It seems I am misusing grep/egrep.

I was trying to search for strings across multiple lines and could not find a match, although I know that what I’m looking for should match. At first I thought my regexes were wrong, but I eventually read that these tools operate per line (and my regexes were so trivial that they could not be the issue).

So which tool would one use to search patterns across multiple lines?

Answers:


Method 1

Here’s a sed one that will give you grep-like behavior across multiple lines:

sed -n '/foo/{:start /bar/!{N;b start};/your_regex/p}' your_file

How it works

  • -n suppresses the default behavior of printing every line
  • /foo/{} instructs it to match foo and do what comes inside the squigglies to the matching lines. Replace foo with the starting part of the pattern.
  • :start is a branching label to help us keep looping until we find the end to our regex.
  • /bar/!{} will execute what’s in the squigglies to the lines that don’t match bar. Replace bar with the ending part of the pattern.
  • N appends the next line to the active buffer (sed calls this the pattern space)
  • b start will unconditionally branch to the start label we created earlier so as to keep appending the next line as long as the pattern space doesn’t contain bar.
  • /your_regex/p prints the pattern space if it matches your_regex. You should replace your_regex by the whole expression you want to match across multiple lines.
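As a rough illustration of the loop described above, here is the same script split into one command per -e, which avoids relying on how a particular sed parses labels and semicolons on a single line. The sample input and the function names sample and match_block are made up for this sketch, and foo.*\n.*\nbar stands in for your_regex.

```shell
# Hypothetical sample input: a "foo" line, one filler line, then a "bar" line.
sample() {
  printf '%s\n' 'skip me' 'foo starts here' 'middle' 'bar ends here' 'skip me too'
}

# Same logic as the one-liner: once a line matches foo, keep appending
# lines (N) until the pattern space contains bar, then print the pattern
# space if the multi-line regex matches it.
match_block() {
  sed -n \
    -e '/foo/{' \
    -e ':start' \
    -e '/bar/!{' \
    -e 'N' \
    -e 'b start' \
    -e '}' \
    -e '/foo.*\n.*\nbar/p' \
    -e '}'
}

sample | match_block
```

Inside a sed regex, \n matches the newlines that N has embedded in the pattern space, which is what lets a single expression span several input lines.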

Method 2

I generally use a tool called pcregrep, which can be installed on most Linux distributions using yum or apt.

For example, suppose you have a file named testfile with the following content:

abc blah
blah blah
def blah
blah blah

You can run the following command:

$ pcregrep -M  'abc.*(\n|.)*def' testfile

to do pattern matching across multiple lines.

Moreover, you can do the same with sed as well.

$ sed -e '/abc/,/def/!d' testfile

Method 3

A grep that supports the -P (Perl-compatible regular expression) option, such as GNU grep, can do this job.

$ echo 'abc blah
blah blah
def blah
blah blah' | grep -oPz  '(?s)abc.*?def'
abc blah
blah blah
def

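(?s) is the DOTALL modifier, which makes the dot in your regex match line breaks as well as ordinary characters. The -z option makes grep treat the input as NUL-separated instead of newline-separated, so the whole input is handled as a single record, and -o prints only the matched part.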

Method 4

Here’s a simpler approach using Perl:

perl -e '$f=join("",<>); print $& if $f=~/foo\nbar.*\n/m' file

or (since JosephR took the sed route, I’ll shamelessly steal his suggestion)

perl -n000e 'print $& while /^foo.*\nbar.*\n/mg' file

Explanation

$f=join("",<>); : this reads the entire file and saves its contents (newlines and all) into the variable $f. We then attempt to match foo\nbar.*\n, and print it if it matches (the special variable $& holds the last match found). The /m modifier makes ^ and $ match at the start and end of every line inside the string, not only at the start and end of the whole string; it is the \n in the pattern that lets the match span lines.

The -0 sets the input record separator. Setting this to 00 activates ‘paragraph mode’ where Perl will use consecutive newlines (\n\n) as the record separator. In cases where there are no consecutive newlines, the entire file is read (slurped) at once.

Warning:

Do not do this for large files: it loads the entire file into memory, which may be a problem.

Method 5

Suppose we have the file test.txt containing:

blabla
blabla
foo
here
is the
text
to keep between the 2 patterns
bar
blabla
blabla

The following command can be used:

sed -n '/foo/,/bar/p' test.txt

It produces the following output:

foo
here
is the
text
to keep between the 2 patterns
bar
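One property of address ranges is worth seeing in action: a range restarts at every later line that matches the first pattern, and if the closing pattern never appears again it runs to the end of the input. The input below and the function name ranges are made up for this sketch.

```shell
# Print every foo..bar range, exactly as in the command above.
ranges() {
  sed -n '/foo/,/bar/p'
}

# Hypothetical input: one closed foo..bar block, then a "foo" with no "bar".
printf '%s\n' 'a' 'foo' '1' 'bar' 'b' 'foo' '2' | ranges
```

The first block is printed from foo to bar, the line b is skipped, and the second foo opens a range that is never closed, so foo and 2 are printed as well.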

Method 6

The grep alternative sift supports multiline matching (disclaimer: I am the author).

Suppose testfile contains:

<book>
  <title>Lorem Ipsum</title>
  <description>Lorem ipsum dolor sit amet, consectetur
  adipiscing elit, sed do eiusmod tempor incididunt ut
  labore  et dolore magna aliqua</description>
</book>

sift -m '<description>.*?</description>' (show the lines containing the description)

Result:

testfile:  <description>Lorem ipsum dolor sit amet, consectetur
testfile:  adipiscing elit, sed do eiusmod tempor incididunt ut
testfile:  labore  et dolore magna aliqua</description>

sift -m '<description>(.*?)</description>' --replace 'description="$1"' --no-filename (extract and reformat the description)

Result:

description="Lorem ipsum dolor sit amet, consectetur
  adipiscing elit, sed do eiusmod tempor incididunt ut
  labore  et dolore magna aliqua"

Method 7

I solved this for my case by piping grep with the -A option into another grep.

grep first_line_word -A 1 testfile | grep second_line_word

The -A 1 option prints one line after each matching line. Whether this works depends on your file and word combination, but for me it was the fastest and most reliable solution.
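The second grep cannot tell which of the two lines it is looking at, so a first line that happens to contain both words is also reported. If that matters, one possible alternative is to remember the previous line in awk and test the pair explicitly. This is only a sketch: the function name adjacent and the sample words are hypothetical, and the two arguments are treated as awk regular expressions.

```shell
# $1: regex for the first line, $2: regex for the line directly after it.
adjacent() {
  awk -v first="$1" -v second="$2" '
    # Print the pair only when the previous line matched the first regex
    # and the current line matches the second one.
    have_prev && prev ~ first && $0 ~ second { print prev; print }
    { prev = $0; have_prev = 1 }
  '
}

# Hypothetical input: the third line contains both words but is followed
# by a line without "beta", so it must not be reported.
printf '%s\n' 'alpha start' 'beta end' 'alpha and beta together' 'gamma' |
  adjacent alpha beta
```

Unlike the grep pipeline, this prints both lines of each matching pair. Note that awk processes backslash escapes in values passed with -v, so patterns containing backslashes may need extra care.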

Method 8

One way to do this is with Perl. e.g. here’s the contents of a file named foo:

foo line 1
bar line 2
foo
foo
foo line 5
foo
bar line 6

Now, here’s some Perl which will match against any line that begins with foo followed by any line that begins with bar:

cat foo | perl -e 'while(<>){$all .= $_}
  while($all =~ /^(foo[^\n]*\nbar[^\n]*\n)/m) {
  print $1; $all =~ s/^(foo[^\n]*\nbar[^\n]*\n)//m;
}'

The Perl, broken down:

  • while(<>){$all .= $_} This loads the entire standard input into the variable $all
  • while($all =~ Loop while the variable $all still matches the regular expression…
  • /^(foo[^\n]*\nbar[^\n]*\n)/m The regex: foo at the beginning of a line, followed by any number of non-newline chars, followed by a newline, followed immediately by “bar”, and the rest of the line with bar in it. /m at the end of the regex makes ^ match at the start of every line, not only at the start of the string
  • print $1 Print the part of the regex that was in parentheses (in this case, the entire regular expression)
  • s/^(foo[^\n]*\nbar[^\n]*\n)//m Erase the first match for the regex, so we can match multiple cases of the regex in the file in question

And the output:

foo line 1
bar line 2
foo
bar line 6

Method 9

To get the text between the two patterns, excluding the pattern lines themselves:

Suppose we have the file test.txt containing:

blabla
blabla
foo
here
is the
text
to keep between the 2 patterns
bar
blabla
blabla

The following code can be used:

 sed -n '/foo/{
 n
 b gotoloop
 :loop
 N
 :gotoloop
 /bar/!{
 h
 b loop
 }
 /bar/{
 g
 p
 }
 }' test.txt

It produces the following output:

here
is the
text
to keep between the 2 patterns

How it works, step by step:

  1. /foo/{ is triggered when a line contains “foo”
  2. n replaces the pattern space with the next line, i.e. the line “here”
  3. b gotoloop branches to the label “gotoloop”
  4. :gotoloop defines the label “gotoloop”
  5. /bar/!{ runs if the pattern space doesn’t contain “bar”
  6. h replaces the hold space with the pattern space, so “here” is saved in the hold space
  7. b loop branches to the label “loop”
  8. :loop defines the label “loop”
  9. N appends the next line to the pattern space.
    Now the pattern space contains:
    “here”
    “is the”
  10. :gotoloop We are back at step 4, and loop until the pattern space contains “bar”
  11. /bar/ the loop is finished: “bar” has been found and is the last line of the pattern space
  12. g replaces the pattern space with the hold space, which contains all the lines between “foo” and “bar” that were saved during the main loop
  13. p copies the pattern space to standard output

Done !
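For comparison, the same exclusive selection can be sketched in awk with a flag variable. This is an alternative technique, not what the sed script does internally, and the function name between is made up for the example.

```shell
# Print the lines strictly between a "foo" line and the next "bar" line.
between() {
  awk '
    /bar/  { inside = 0 }   # closing line: stop printing before it is shown
    inside { print }        # print while we are between the two patterns
    /foo/  { inside = 1 }   # opening line: start printing from the next line
  '
}

# Same content as test.txt above, supplied on stdin.
printf '%s\n' 'blabla' 'blabla' 'foo' 'here' 'is the' 'text' \
  'to keep between the 2 patterns' 'bar' 'blabla' 'blabla' | between
```

The order of the three rules is what keeps the foo and bar lines out of the output. One difference to be aware of: this version prints lines as it reads them, so if a foo is never followed by a bar, everything after that foo is printed.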

Method 10

cat file | egrep "<pattern1>|<pattern2>"

This lists all lines matching either <pattern1> or <pattern2>.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
