I am trying to use grep and cut to extract URLs from an HTML file. The links look like:
<a href="http://examplewebsite.com/">
Other websites have .net, .gov, but I assume I could make the cut off point right before >. So I know I can use grep and cut somehow to cut off everything before http and after .com, but I have been stuck on it for a while.
Answers:
Method 1
Not sure if you are limited on tools. Regex might not be the best way to go, as mentioned, but here is an example that I put together:
cat urls.html | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" | sort -u
grep -E: is the same as egrep
grep -o: only outputs what has been grepped
(http|https): is an either / or
a-z: is all lower case
A-Z: is all upper case
.: is dot
/: is the slash
?: is ?
=: is equal sign
_: is underscore
%: is percentage sign
:: is colon
-: is dash
*: is repeat the […] group
sort -u: will sort & remove any duplicates
Output:
$ wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...
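You can also add in \d to catch other numeral types, though \d only has that meaning with grep -P; with grep -E, [[:digit:]] is the portable equivalent inside a bracket expression.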
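To try the pattern without fetching anything, here is a minimal sketch that wraps the same grep in a function and feeds it a few made-up lines. The function name extract_urls and the sample HTML are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper: reads HTML on stdin, prints unique http(s) URLs.
# Uses the same bracket expression as the command above.
extract_urls() {
    grep -Eo '(http|https)://[a-zA-Z0-9./?=_%:-]*' | sort -u
}

# Made-up sample input; the third line repeats the first link on purpose
# so that sort -u has a duplicate to drop.
extract_urls <<'EOF'
<a href="http://examplewebsite.com/">Example</a>
<a href="https://example.net/page?id=1">Net</a>
<a href="http://examplewebsite.com/">Duplicate</a>
EOF
```

Because grep reads stdin here, the leading cat from the original command is not needed; `extract_urls < urls.html` would do the same for a file.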
Method 2
As I said in my comment, it’s generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you’re parsing is well-behaved.
In order to only get URLs that are in the href attribute of <a> elements, I find it easiest to do it in multiple stages. From your comments, it looks like you only want the top level domain, not the full URL. In that case you can use something like this:
grep -Eoi '<a [^>]+>' source.html |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'
where source.html is the file containing the HTML code to parse.
This code will print all top-level URLs that occur as the href attribute of any <a> elements in each line. The -i option to the first grep command is to ensure that it will work on both <a> and <A> elements. I guess you could also give -i to the 2nd grep to capture upper case HREF attributes, OTOH, I’d prefer to ignore such broken HTML. 🙂
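The three stages can be tried offline on a few inline lines. This is only a sketch: the function name top_level_urls and the sample HTML are made up for illustration, and the bracket expression in the second stage is written as [^"]+ (anything up to the closing quote).

```shell
# Hypothetical wrapper around the three-stage pipeline; reads HTML on stdin.
top_level_urls() {
    grep -Eoi '<a [^>]+>' |             # stage 1: keep only <a ...> tags
    grep -Eo 'href="[^"]+"' |           # stage 2: keep only the href attribute
    grep -Eo '(http|https)://[^/"]+'    # stage 3: keep scheme and host
}

# Made-up input: a URL in plain text and an <img src> should both be ignored.
printf '%s\n' \
    '<p>See http://not-a-link.example/ in plain text</p>' \
    '<a href="http://examplewebsite.com/path/page.html">one</a>' \
    '<A href="https://sub.example.gov/">two</A>' \
    '<img src="http://cdn.example.net/x.png">' |
    top_level_urls
# prints:
# http://examplewebsite.com
# https://sub.example.gov
```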
To process the contents of http://google.com/
wget -qO- http://google.com/ | grep -Eoi '<a [^>]+>' | grep -Eo 'href="[^"]+"' | grep -Eo '(http|https)://[^/"]+'
output
http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au
My output is a little different from the other examples as I get redirected to the Australian Google page.
Method 3
If your grep supports Perl regexes:
grep -Po '(?<=href=")[^"]*(?=")'
(?<=href=") and (?=") are lookaround expressions for the href attribute. This needs the -P option. -o prints the matching text.
For example:
$ curl -sL https://www.google.com | grep -Po '(?<=href=")[^"]*(?=")'
/search?
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
...
As usual, there’s no guarantee that these are valid URIs, or that the HTML you’re parsing will be valid.
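If your grep does not support -P, roughly the same result can be had without lookarounds by matching the whole attribute and then trimming it with sed. This is a swapped-in technique, not the method above, and the function name href_values is a made-up label.

```shell
# Hypothetical fallback for greps without PCRE support; reads HTML on stdin.
# Step 1 matches the whole href="..." attribute, step 2 strips the wrapper.
href_values() {
    grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//'
}

# Made-up sample line with one relative and one absolute link.
printf '%s\n' '<a href="/search?">a</a><a class="x" href="https://www.google.co.in/imghp?hl=en&tab=wi">b</a>' |
    href_values
# prints:
# /search?
# https://www.google.co.in/imghp?hl=en&tab=wi
```

As with the lookaround version, relative links such as /search? come through unchanged, so filter afterwards if you only want absolute URLs.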
Method 4
As a non-regex alternative, use pup:
pup 'a[href] attr{href}' < yourfile.html
This will find all a elements that have an href attribute, then display the value of the href attribute.
You can get it from the Releases page on GitHub, or compile it yourself, in which case you’ll need Go (a programming language).
The advantage of this solution is that it doesn’t rely on the HTML being properly formatted.
Method 5
I have found a solution that is, IMHO, much simpler and potentially faster than what was proposed here. I have adjusted it a little bit to support https. The TL;DR version is …
PS: You can replace the site URL with a path to a file and it will work the same way.
lynx -dump -listonly -nonumbers "http://www.google.com" > links.txt
lynx -dump -listonly -nonumbers "some-file.html" > links.txt
If you just want to see the links instead of placing them in a file, then try this instead …
lynx -dump -listonly -nonumbers "http://www.google.com"
lynx -dump -listonly -nonumbers "some-file.html"
The result will look similar to the following …
http://www.google.ca/imghp?hl=en&tab=wi
http://maps.google.ca/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?gl=CA&tab=w1
http://news.google.ca/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.ca/intl/en/options/
http://www.google.ca/history/optout?hl=en
... etc.
For my use case, this worked just fine. But beware of the fact that nowadays, people add links like src="//blah.tld" for the CDN URIs of libraries. I didn’t want to see those in the retrieved links.
No need to try to check for href or other sources for links because “lynx -dump” will by default extract all the clickable links from a given page. So the only thing you need to do after that is to parse the result of “lynx -dump” using grep to get a cleaner raw version of the same result.
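As a sketch of that last grep step, the filter below keeps only lines that start with http:// or https:// and drops duplicates. It assumes the link list arrives one URL per line on stdin; the function name keep_http_links and the sample lines are made up for illustration, so no lynx is needed to try it.

```shell
# Hypothetical post-filter for a one-URL-per-line link list on stdin.
# LC_ALL=C keeps the sort order independent of the current locale.
keep_http_links() {
    grep -E '^https?://' | LC_ALL=C sort -u
}

# Made-up stand-in for a link list; the file: and mailto: entries are dropped.
printf '%s\n' \
    'http://www.google.ca/imghp?hl=en&tab=wi' \
    'file:///tmp/some-file.html#top' \
    'mailto:someone@example.com' \
    'https://play.google.com/?hl=en&tab=w8' \
    'http://www.google.ca/imghp?hl=en&tab=wi' |
    keep_http_links

# Intended real-world use (not run here):
#   lynx -dump -listonly -nonumbers "some-file.html" | keep_http_links
```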
Method 6
wget -qO- google.com | tr \" \\n | grep https\*://
…would probably do pretty well. As written, it prints:
http://schema.org/WebPage
http://www.google.com/imghp?hl=en&tab=wi
http://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
http://www.youtube.com/?tab=w1
http://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
http://www.google.com/intl/en/options/
http://www.google.com/history/optout?hl=en
https://accounts.google.com/ServiceLogin?hl=en&continue=http://www.google.com/
https://www.google.com/culturalinstitute/project/the-holocaust?utm_source=google&utm_medium=hppromo&utm_campaign=auschwitz_q1&utm_content=desktop
https://plus.google.com/116899029375914044550
If it is important that you match only links and, from those, only the top-level domains, you can do:
wget -qO- google.com | sed '/\n/P;//!s|<a[^>]*\(https*://[^/"]*\)|\n\1\n|;D'
…or something like it – though for some seds you may need to substitute a literal newline character for each of the last two \n escapes.
As written, the above command prints:
http://www.google.com
http://maps.google.com
https://play.google.com
http://www.youtube.com
http://news.google.com
https://mail.google.com
https://drive.google.com
http://www.google.com
http://www.google.com
http://www.google.com
https://www.google.com
https://plus.google.com
…and for either case (but probably most usefully with the latter) you can tack on a |sort -u filter to the end to get the list sorted and to drop duplicates.
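The quote-splitting idea from the first command can be tried offline with a small sketch: tr puts every double-quoted string on its own line, and grep keeps the lines that contain a URL. The function name quoted_urls and the sample HTML are made up for illustration.

```shell
# Hypothetical wrapper: turn every double quote into a newline, then keep
# lines containing http:// or https:// (s* means "zero or more s").
quoted_urls() {
    tr '"' '\n' | grep 'https*://'
}

# Made-up single-line input with two quoted links.
printf '%s\n' '<a href="http://maps.google.com/maps?hl=en">Maps</a><a href="https://play.google.com/">Play</a>' |
    quoted_urls
# prints:
# http://maps.google.com/maps?hl=en
# https://play.google.com/
```

Note that this catches any quoted URL, not only href values, which is why http://schema.org/WebPage shows up in the real output above.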
Method 7
Shortest
grep -r http . --color
Method 8
echo '<a href="http://examplewebsite.com/">' | sed -r 's:<.*href="::' | sed 's:/">$::'
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.