I find the page numbers of a multiline pattern in a PDF file by following the answers to "How shall I grep a multi-line pattern in a pdf file and in a text file?" and "How can I search a string in a pdf file, and find the physical page number of each page where the string appears?":
$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf
49: image
not
available
51: image
not
available
53: image
not
available
54: image
not
available
55: image
not
available
I would like to extract the page number only, but because the pattern is multiline, I get
$ pdfgrep -Pn '(?s)image\s+?not\s+?available' main_text.pdf | awk -F":" '{print $1}'
49
not
available
51
not
available
53
not
available
54
not
available
55
not
available
instead of
49
51
53
54
55
How can I extract only the page numbers, regardless of whether the pattern is multiline? Thanks.
Answers:
Method 1
It’s a bit hacky, but since you are already using a Perl-compatible RE, you could use the \K (“keep left”) modifier to match everything in your expression (and anything else up to the next line end) but exclude it from the output:
pdfgrep -Pn '(?s)image\s+?not\s+?available.*?$\K' main_text.pdf
The output will still include the : separator however.
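If the leftover separator is a problem, one possible follow-up is to cut each line at the first colon. This is only a sketch: the `printf` stands in for the output of the Method 1 command, and the `NN:` line shape is an assumption based on the remark above, not something verified against a real PDF. The names `prefix_only` and `pages_only` are made up for illustration.

```shell
# Stand-in for the output of the Method 1 command; the "NN:" shape is an assumption.
prefix_only() {
  printf '%s\n' '49:' '51:' '53:'
}

# Keep only the text before the first colon, i.e. the page number.
pages_only=$(prefix_only | cut -d: -f1)
echo "$pages_only"
```

In real use, the `pdfgrep` command would replace `prefix_only` on the left side of the pipe.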
Method 2
Add $0~":" as an awk pattern (a condition placed before the action), so that the pipeline becomes:
.... | awk -F":" '$0~":"{print $1}'
That way, awk prints only when the input line contains a “:”, and all other lines are discarded.
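A slightly stricter variant of the same idea, offered as a sketch: the filter below accepts only lines that begin with digits followed by a colon, so a continuation line that happens to contain a colon is not mistaken for a page-number line. The `printf` stands in for the `pdfgrep -Pn` output from the question so the snippet runs without the PDF; the `not: here` line and the names `sample_output` and `pages` are invented for illustration.

```shell
# Simulated pdfgrep -Pn output for a multiline match (stand-in for the real command).
# The second match deliberately has a continuation line containing a colon.
sample_output() {
  printf '%s\n' '49: image' 'not' 'available' '51: image' 'not: here' 'available'
}

# Print the first field only for lines that start with "<digits>:".
pages=$(sample_output | awk -F':' '/^[0-9]+:/ {print $1}')
echo "$pages"
```

With the plain `$0~":"` test, the `not: here` line would also be printed as `not`; anchoring the pattern to a leading number avoids that. In real use, the `pdfgrep` command replaces `sample_output`.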
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.