I am extracting URLs from a website using cURL as below.
curl www.somesite.com | grep "<a href=.*title=" > new.txt
My new.txt file is as below.
<a href="http://website1.com" rel="nofollow noreferrer noopener" title="something">
<a href="http://website1.com" rel="nofollow noreferrer noopener" information="something" title="something">
<a href="http://website2.com" rel="nofollow noreferrer noopener" title="some_other_thing">
<a href="http://website2.com" rel="nofollow noreferrer noopener" information="something" title="something">
<a href="http://websitenotneeded.com" rel="nofollow noreferrer noopener" title="something NOTNEEDED">
However, I need to extract only the below information.
<a href="http://website1.com" rel="nofollow noreferrer noopener" title="something">
<a href="http://website2.com" rel="nofollow noreferrer noopener" information="something" title="something">
I am trying to ignore the <a href entries that have information in them and those whose title ends with NOTNEEDED.
How can I modify my grep statement?
Answers:
Method 1
I’m not fully following your example and the description, but it sounds like what you want is this:
$ grep -v "<a href=.*title=.*NOTNEEDED" sample.txt
<a href="http://website1.com" rel="nofollow noreferrer noopener" title="something">
<a href="http://website1.com" rel="nofollow noreferrer noopener" information="something" title="something">
<a href="http://website2.com" rel="nofollow noreferrer noopener" title="some_other_thing">
<a href="http://website2.com" rel="nofollow noreferrer noopener" information="something" title="something">
So for your example:
$ curl www.example.com | grep "<a href=.*title=" | grep -v NOTNEEDED > new.txt
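The question also asks to skip anchors that carry an information attribute, which can be handled with a second exclusion pattern. The following is a minimal sketch, assuming one anchor per line; the `filter_links` function name and the simplified inline sample (standing in for the curl output) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep anchor lines that have a title, then drop the
# ones with an information attribute or a title ending in NOTNEEDED.
filter_links() {
    grep '<a href=.*title=' |
        grep -v -e 'information=' -e 'title="[^"]*NOTNEEDED"'
}

# Simplified sample input standing in for: curl www.somesite.com
sample='<a href="http://website1.com" title="something">
<a href="http://website1.com" information="something" title="something">
<a href="http://website2.com" title="some_other_thing">
<a href="http://website2.com" information="something" title="something">
<a href="http://websitenotneeded.com" title="something NOTNEEDED">'

result=$(printf '%s\n' "$sample" | filter_links)
printf '%s\n' "$result"
```

With real input, the pipeline would read `curl www.somesite.com | filter_links > new.txt`. Passing several `-e` patterns to one `grep -v` excludes a line if any of them matches, so no chain of separate grep calls is needed.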
Method 2
The grep man page says:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is specified by POSIX.)
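You can use an extended regular expression (the -E option) to combine multiple inversions in one pattern. Without -E, grep treats | as a literal character:
grep -Ev 'red|green|blue'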
or
grep -v red | grep -v green | grep -v blue
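As a quick check that these forms select the same lines, here is a small sketch run against made-up sample input. It assumes a grep that supports `-E` and multiple `-e` patterns, both of which are specified by POSIX; the variable names are illustrative only.

```shell
#!/bin/sh
# Made-up sample input, one word per line.
colors='red
yellow
green
purple
blue'

# One extended-regex pattern with alternation.
single=$(printf '%s\n' "$colors" | grep -Ev 'red|green|blue')

# The same filter written as a chain of separate inversions.
chained=$(printf '%s\n' "$colors" | grep -v red | grep -v green | grep -v blue)

# Several -e patterns, which works with basic regular expressions too.
multi=$(printf '%s\n' "$colors" | grep -v -e red -e green -e blue)

printf '%s\n' "$single"
```

All three variables end up holding the two lines that match none of the patterns (yellow and purple). The single-pattern forms start one grep process instead of three.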
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.