How can I get only the page numbers of a pattern in a PDF file, regardless of whether the pattern is multiline?

I can find the page numbers of a multiline pattern in a PDF file by following "How shall I grep a multi-line pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?":

$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf
49: image
   not
available
51: image
   not
available
53: image
   not
available
54: image
   not
available
55: image
   not
available

I would like to extract the page number only, but because the pattern is multiline, I get

$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | awk -F":" '{print $1}'
49
   not
available
51
   not
available
53
   not
available
54
   not
available
55
   not
available

instead of

49
51
53
54
55

How can I extract only the page numbers, regardless of whether the pattern is multiline? Thanks.

Answers:

Method 1

It’s a bit hacky, but since you are already using a Perl-compatible regular expression, you could use the \K “keep left” escape to match everything in your expression (and anything else up to the next line end) but exclude it from the output:

pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K' main_text.pdf

The output will still include the : separator however.
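
If the trailing separator is unwanted, it can be stripped in one more pipeline stage. The following is only a minimal sketch, assuming each remaining output line has the form "49:". The sample lines are made up for illustration, and sample_output and strip_separator are hypothetical helper names, not part of pdfgrep:

```shell
# Hypothetical sample of what Method 1 is assumed to print: page number plus ":"
sample_output() {
  printf '%s\n' '49:' '51:' '53:'
}

# Drop the first ":" and everything after it, then skip lines left empty
strip_separator() {
  sed -e 's/:.*$//' -e '/^$/d'
}

sample_output | strip_separator
```

In real use, strip_separator would take the place of sample_output's consumer at the end of the pdfgrep pipeline.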

Method 2

Add $0~":" as an awk pattern, so that the command becomes:

 .... | awk -F":" '$0~":"{print $1}'

That way, a line is printed only if it contains a “:”; all other lines are discarded.
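
One caveat with this filter is that a continuation line which itself contains a “:” would also be printed. A stricter sketch, assuming every match starts with the page number followed by “:” as in the output shown in the question, anchors the pattern to the start of the line. The sample_output and extract_pages names are hypothetical helpers, and the sample lines are adapted from the question with a colon added to one continuation line:

```shell
# Simulated pdfgrep -n output for a multi-line match (sample data only)
sample_output() {
  printf '%s\n' '49: image' '   not: yet' 'available' '51: image' '   not' 'available'
}

# Print the page number only for lines that begin with "<digits>:"
extract_pages() {
  awk -F':' '/^[0-9]+:/ {print $1}'
}

sample_output | extract_pages
```

Here the line "   not: yet" is skipped because it does not start with digits, whereas a plain test for “:” anywhere in the line would let it through.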

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
