How to parse hundreds of HTML source files in the shell?

I have a couple of hundred HTML source files. I need to extract the contents of a particular <div> element from each of these files, so I'm going to write a script to loop through each file. The element structure is like this:

<div id='the_div_id'>
  <div id='some_other_div'>
  <h3>Some content</h3>
  </div>
</div>

Can anyone suggest a method for extracting the div the_div_id, along with all of its child elements and content, from a file using the Linux command line?

Answers:

Method 1

The html-xml-utils package, available in most major Linux distributions, has a number of tools that are useful when dealing with HTML and XML documents. Particularly useful for your case is hxselect which reads from standard input and extracts elements based on CSS selectors. Your use case would look like:

hxselect '#the_div_id' <file

You might get a complaint about the input not being well formed, depending on what you are feeding it. This complaint is written to standard error and can therefore be easily suppressed if needed (for example, by redirecting with 2>/dev/null). An alternative would be to use Perl's HTML::Parser package; however, I will leave that to someone with Perl skills less rusty than my own.

Method 2

Try pup, a command line tool for processing HTML. For example:

pup '#the_div_id' < file.html

Method 3

Here’s an untested Perl script that extracts <div id="the_div_id"> elements and their contents using HTML::TreeBuilder.

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TreeBuilder;
foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);
    for my $subtree ($tree->look_down(_tag => "div", id => "the_div_id")) {
        my $html = $subtree->as_HTML;
        # Append a trailing newline if the output does not already end with one.
        $html =~ s/(?<!\n)\z/\n/;
        print $html;
    }
    $tree = $tree->delete;
}

If you’re allergic to Perl, Python has HTMLParser.
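As a rough illustration of the Python route, here is a minimal sketch built on the standard library's html.parser.HTMLParser. The class name ElementExtractor and the helper extract_element are made up for this example, and the code assumes the target is a non-void element (such as a div) whose own start and end tags are balanced. It is a sketch under those assumptions, not a full HTML serializer.

```python
#!/usr/bin/env python3
import sys
from html.parser import HTMLParser


class ElementExtractor(HTMLParser):
    """Collects the markup of the first element whose id equals target_id.

    Hypothetical helper for illustration; only tags with the same name as
    the matched element are counted when tracking nesting depth.
    """

    def __init__(self, target_id):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.target_tag = None  # tag name of the matched element
        self.depth = 0          # nesting depth of target_tag; 0 means outside
        self.done = False
        self.parts = []

    def _capturing(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if dict(attrs).get("id") != self.target_id:
                return
            self.target_tag = tag
        if tag == self.target_tag:
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        self.parts.append(f"</{tag}>")
        if tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        if self._capturing():
            self.parts.append(f"<!--{data}-->")


def extract_element(html, target_id):
    """Return the markup of the element with the given id, or None."""
    parser = ElementExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.parts else None


if __name__ == "__main__":
    # Process every file named on the command line.
    for file_name in sys.argv[1:]:
        with open(file_name, encoding="utf-8", errors="replace") as handle:
            found = extract_element(handle.read(), "the_div_id")
        if found is not None:
            print(found)
```

Saved under a name of your choosing (say, extract_div.py), it could be run as python3 extract_div.py *.html. Tag names are lowercased in the closing tags it emits, because the parser reports them that way.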

P.S. Do not try using regular expressions.
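One likely reason for this advice is that a regular expression cannot count nested tags. The short Python sketch below uses the markup from the question plus an extra, made-up "unrelated" div to show both ways a naive pattern goes wrong. The patterns are illustrative only.

```python
import re

html = """<div id='the_div_id'>
  <div id='some_other_div'>
  <h3>Some content</h3>
  </div>
</div>
<div id='unrelated'>footer</div>"""

# Lazy match: stops at the first </div>, which closes the inner div,
# so the outer element is cut short.
lazy = re.search(r"<div id='the_div_id'>.*?</div>", html, re.DOTALL).group(0)

# Greedy match: runs to the last </div> in the input,
# so it swallows the unrelated div as well.
greedy = re.search(r"<div id='the_div_id'>.*</div>", html, re.DOTALL).group(0)

print(lazy.count("<div"), lazy.count("</div>"))  # opening vs. closing tags
print("unrelated" in greedy)
```

Neither match corresponds to the element itself, which is why a real parser is the safer choice here.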

Method 4

Here is an Ex one-liner to extract that part from each file:

ex -s +'bufdo!/<div.*id=.the_div_id/norm nvatdggdG"2p' +'bufdo!%p' -cqa! *.html

To save the result in place, change -cqa! to -cxa and remove the %p section. To process files recursively, consider using a recursive glob (**/*.html), if your shell supports it.

For each buffer/file (bufdo), it performs the following actions:

/pattern – find the pattern
norm – start simulating normal Vi keystrokes
- n – jump to the next match of the pattern (required in Ex mode)
- vatd – select the outer tag section and delete it (see: jumping between html tags)
- ggdG – remove the whole buffer (equivalent to :%d)
- "2p – re-paste the previously deleted text

Maybe not very efficient and not POSIX (:bufdo), but it should work.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
