I’m crawling through some directories with ASP.NET programming via Scrapy.
The pages to crawl through are encoded as such:
javascript:__doPostBack('ctl00$MainContent$List','Page$X')
where X is an int between 1 and 180. The MainContent argument is always the same. I have no idea how to crawl into these. I would love to add something to the SLE rules as simple as allow=('Page$') or attrs='__doPostBack', but my guess is that I have to be trickier in order to pull the info from the javascript “link.”
If it’s easier to “unmask” each of the absolute links from the javascript code and save those to a csv, then use that csv to load requests into a new scraper, that’s okay, too.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
This kind of pagination is not that trivial as it may seem. It was an interesting challenge to solve it. There are several important notes about the solution provided below:
- the idea here is to follow the pagination page by page passing around the current page in the
Request.metadictionary - using a regular
BaseSpidersince there is some logic involved in the pagination - it is important to provide
headerspretending to be a real browser - it is important to yield
FormRequests withdont_filter=Truesince we are basically making aPOSTrequest to the same URL but with different parameters
The code:
import re
from scrapy.http import FormRequest
from scrapy.spider import BaseSpider
HEADERS = {
'X-MicrosoftAjax': 'Delta=true',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
}
URL = 'http://exitrealty.com/agent_list.aspx?firstName=&lastName=&country=USA&state=NY'
class ExitRealtySpider(BaseSpider):
name = "exit_realty"
allowed_domains = ["exitrealty.com"]
start_urls = [URL]
def parse(self, response):
# submit a form (first page)
self.data = {}
for form_input in response.css('form#aspnetForm input'):
name = form_input.xpath('@name').extract()[0]
try:
value = form_input.xpath('@value').extract()[0]
except IndexError:
value = ""
self.data[name] = value
self.data['ctl00$MainContent$ScriptManager1'] = 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList'
self.data['__EVENTTARGET'] = 'ctl00$MainContent$List'
self.data['__EVENTARGUMENT'] = 'Page$1'
return FormRequest(url=URL,
method='POST',
callback=self.parse_page,
formdata=self.data,
meta={'page': 1},
dont_filter=True,
headers=HEADERS)
def parse_page(self, response):
current_page = response.meta['page'] + 1
# parse agents (TODO: yield items instead of printing)
for agent in response.xpath('//a[@class="regtext"]/text()'):
print agent.extract()
print "------"
# request the next page
data = {
'__EVENTARGUMENT': 'Page$%d' % current_page,
'__EVENTVALIDATION': re.search(r"__EVENTVALIDATION|(.*?)|", response.body, re.MULTILINE).group(1),
'__VIEWSTATE': re.search(r"__VIEWSTATE|(.*?)|", response.body, re.MULTILINE).group(1),
'__ASYNCPOST': 'true',
'__EVENTTARGET': 'ctl00$MainContent$agentList',
'ctl00$MainContent$ScriptManager1': 'ctl00$MainContent$UpdatePanel1|ctl00$MainContent$agentList',
'': ''
}
return FormRequest(url=URL,
method='POST',
formdata=data,
callback=self.parse_page,
meta={'page': current_page},
dont_filter=True,
headers=HEADERS)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0