Download recursively with wget

I have a problem with the following wget command:

wget -nd -r -l 10 http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

It should download recursively all of the linked documents on the original web but it downloads only two files (index.html and robots.txt).

How can I achieve recursive download of this web?

Contents hide

Answers:

Method 1

Method 2

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

wget by default honours the robots.txt standard for crawling pages, just like search engines do, and for archive.org, it disallows the entire /web/ subdirectory.
To override, use -e robots=off,

wget -nd -r -l 10 -e robots=off http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

Method 2

$ wget --random-wait -r -p -e robots=off -U Mozilla 
    http://web.archive.org/web/20110726051510/http://feedparser.org/docs/

Downloads recursively the content of the url.

--random-wait - wait between 0.5 to 1.5 seconds between requests.
-r - turn on recursive retrieving.
-e robots=off - ignore robots.txt.
-U Mozilla - set the "User-Agent" header to "Mozilla". Though a better choice is a real User-Agent like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)".

Some other useful options are:

--limit-rate=20k - limits download speed to 20kbps.
-o logfile.txt - log the downloads.
-l 0 - remove recursion depth (which is 5 by default).
--wait=1h - be sneaky, download one file every hour.

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating