I am using beautiful soup and I am writing a crawler and have the following code in it:
print soup.originalEncoding
#self.addtoindex(page, soup)
links=soup('a')
for link in links:
if('href' in dict(link.attrs)):
link['href'].replace('..', '')
url=urljoin(page, link['href'])
if url.find("'") != -1:
continue
url = url.split('?')[0]
url = url.split('#')[0]
if url[0:4] == 'http':
newpages.add(url)
pages = newpages
The link['href'].replace('..', '') is supposed to fix links that come out as ../contact/orderform.aspx, ../contact/requestconsult.aspx, etc. However, it is not working. Links still have the leading “..” Is there something I am missing?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
string.replace() returns the string with the replaced values. It doesn’t modify the original so do something like this:
link['href'] = link['href'].replace("..", "")
Method 2
string.replace() returns a copy of the string with characters replaced, as strings in Python are immutable. Try
s = link['href'].replace("..", '')
url=urljoin(page, s)
Method 3
It is not an inplace replacement. You need to do:
link['href'] = link['href'].replace('..', '')
Example:
a = "abc.."
print a.replace("..","")
'abc'
print a
'abc..'
a = a.replace("..","")
print a
'abc'
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0