Text between two tags

I want to retrieve whatever is between these two tags, <tr> and </tr>, from an HTML document.
I don’t have any specific HTML requirements that would warrant an HTML parser. I just need something that matches <tr> and </tr> and gets everything in between, and there could be multiple <tr> elements.
I tried awk, which works, but for some reason it ends up giving me duplicates of each row extracted.

awk '
/<TR/{p=1; s=$0}
p && /<\/TR>/{print $0 FS s; s=""; p=0}
p' htmlfile> newfile

How to go about this?

Answers:


Method 1

If you only want ... of all <tr>...</tr> do:

grep -o '<tr>.*</tr>' HTMLFILE | sed 's/\(<tr>\|<\/tr>\)//g' > NEWFILE

For multiline do:

First check whether HTMLFILE contains the character “|” (unusual, but possible). If it does, use a different character that does not occur in the file.
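
As an alternative that needs no placeholder character, here is a minimal sketch in Python. With re.DOTALL, "." also matches newlines, so rows that span several lines are matched directly. The name inner_rows and the sample input are made up for illustration, and the sketch assumes plain <tr> tags without attributes and no nested rows.

```python
import re

# Non-greedy match so each <tr>...</tr> pair is captured separately.
# DOTALL lets "." cross line breaks; IGNORECASE also accepts <TR>.
ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.DOTALL | re.IGNORECASE)


def inner_rows(html):
    """Return the text between each <tr> and its closing </tr>."""
    return ROW_RE.findall(html)


if __name__ == "__main__":
    sample = "<table>\n<tr><td>a</td>\n</tr>\n<tr><td>b</td></tr>\n</table>"
    for row in inner_rows(sample):
        print(row)
```

Rows written as <tr class="..."> will not match this pattern; that case is where a real parser starts to pay off.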

Method 2

You do have a requirement that warrants an HTML parser: you need to parse HTML. Perl’s HTML::TreeBuilder, Python’s BeautifulSoup and others are easy to use, easier than writing complex and brittle regular expressions.

perl -MHTML::TreeBuilder -le '
    $html = HTML::TreeBuilder->new_from_file($ARGV[0]) or die $!;
    foreach ($html->look_down(_tag => "tr")) {
        print map {$_->as_HTML()} $_->content_list();
    }
' input.html

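python3 -c 'if True:
    import sys
    from bs4 import BeautifulSoup  # BeautifulSoup 4 (package beautifulsoup4)
    with open(sys.argv[1]) as f:
        html = BeautifulSoup(f.read(), "html.parser")
    for tr in html.find_all("tr"):
        # decode_contents() renders the inner markup of the row as a string
        print(tr.decode_contents())
' input.html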
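
If installing a third-party module is not an option, the same idea can be sketched with html.parser from the Python standard library. This is only an illustration under some assumptions: RowCollector and rows_from_html are hypothetical names, rows are assumed not to be nested, and the inner markup is rebuilt from parser events rather than copied byte for byte.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the markup found inside each <tr> element."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, one string per <tr>
        self._parts = None  # pieces of the current row, or None outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._parts = []  # a new row starts, attributes are ignored
        elif self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as written.
        if self._parts is not None:
            self._parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "tr" and self._parts is not None:
            self.rows.append("".join(self._parts))
            self._parts = None
        elif self._parts is not None:
            self._parts.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references such as &amp; arrive here already decoded.
        if self._parts is not None:
            self._parts.append(data)


def rows_from_html(html):
    """Return the inner markup of every <tr> in the given HTML string."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Unlike the regex approaches, this also handles rows whose opening tag carries attributes, for example <tr class="odd">.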

Method 3

sed and awk are not well suited for this task; you should use a proper HTML parser instead. For example, hxselect from w3.org:

<htmlfile hxselect -s '\n' -c 'tr'

Method 4

If ruby is available you can do the following

ruby -e 'puts readlines.join[/(?<=<tr>).+(?=<\/tr>)/m].gsub(/<\/?tr>/, "")' file

where file is your input HTML file. The command executes a Ruby one-liner. First, it reads all lines from file and joins them into one string, readlines.join. Then, from the string it selects anything between (but not including) <tr> and </tr> that is one character or longer irrespective of newlines, [/(?<=<tr>).+(?=<\/tr>)/m]. Because .+ is greedy, the selection runs from the first <tr> to the last </tr>, so it removes any <tr> or </tr> left inside the string, gsub(/<\/?tr>/, "") (this is necessary to handle multiple or nested tr tags). Finally, it prints the string, puts.

You said that an HTML parser is not warranted for you, but Nokogiri is very easy to use with Ruby and it makes the command simpler.

ruby -rnokogiri -e 'puts Nokogiri::HTML(readlines.join).xpath("//tr").map { |e| e.content }' file

-rnokogiri loads Nokogiri. Nokogiri::HTML(readlines.join) reads all lines of file. xpath("//tr") picks out every tr element and map { |e| e.content } picks out the text content of each element, i.e. the text between <tr> and </tr> with the markup stripped. Use e.inner_html instead of e.content if you want to keep the inner tags.
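
To make the behaviour of the regex one-liner concrete, here is a rough Python sketch of the same two steps (greedy lookaround match, then stripping leftover tr tags). The function name between_first_and_last_tr is made up for this illustration; the regexes mirror the Ruby ones, with re.DOTALL playing the role of Ruby's /m flag.

```python
import re


def between_first_and_last_tr(html):
    """Mimic the regex one-liner: greedy match, then strip inner tr tags."""
    # Greedy match: everything after the first <tr> up to the last </tr>.
    match = re.search(r"(?<=<tr>).+(?=</tr>)", html, re.DOTALL)
    if match is None:
        return ""
    # Drop the <tr> and </tr> tags that remain inside the greedy match.
    return re.sub(r"</?tr>", "", match.group(0))


if __name__ == "__main__":
    print(between_first_and_last_tr("<table><tr>a</tr>\n<tr>b</tr></table>"))
```

Note that anything sitting between two rows (here just a newline) ends up in the output as well, which is one reason the parser-based command is the safer choice.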

Method 5

`grep`

To retrieve content within tr tag across multiple lines, pass it through xargs first, for example:

curl -sL https://www.iana.org/ | xargs | grep -Po "<tr>.*?</tr>"

To return only inner HTML, use:

curl -sL https://www.iana.org/ | xargs | grep -Po "<tr>\K(.*?)</tr>" | sed "s/..tr.//g"

Check the syntax for perlre extended patterns.

Note: For quicker performance, you may consider ripgrep, which has similar syntax.

Method 6

`pup`

Example using pup (which uses CSS selectors):

pup -f myfile.html tr

To print only text without tags, use: pup -f myfile.html tr text{}.

Here are a few examples with curl:

curl -sL https://www.iana.org/ | pup tr text{}
pup -f <(curl -sL https://www.iana.org/) tr text{}

`xpup`

Example using xpup for HTML/XML parsing (which supports XPath):

xpup -f myfile.html "//tr"

Method 7

If you just need a quick listing of the <tr> rows, this could help:

perl -ne 'print if /<tr>/../<\/tr>/' your.html > TRs.log
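
For readers unfamiliar with Perl's range operator, the logic of that one-liner can be sketched in Python roughly as follows. The function name lines_in_rows is hypothetical. Like the Perl version, it works on whole lines, so it prints the lines that contain the tags too, and it assumes the tags appear literally as <tr> and </tr>.

```python
def lines_in_rows(lines):
    """Yield every line from one containing <tr> through one containing </tr>."""
    inside = False
    for line in lines:
        if not inside and "<tr>" in line:
            inside = True  # range opens on this line
        if inside:
            yield line
            if "</tr>" in line:
                inside = False  # range closes, possibly on the same line


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(lines_in_rows(handle))
```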

cheers

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
