I’m trying to scrape data from a multi-page table that is returned after filling out a form.
The URL of the original form in question is https://ndber.seai.ie/Pass/assessors/search.aspx
From https://kaijento.github.io/2017/05/04/web-scraping-requests-eventtarget-viewstate/ I got the code below, which extracts the hidden variables from the blank form; these are then sent with the POST request to get the data:
import requests
from bs4 import BeautifulSoup

url = 'https://ndber.seai.ie/PASS/Assessors/Search.aspx'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    target = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'

    # unsupported CSS Selector 'input[name^=ctl00][value]'
    data = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=ctl00]') if tag.get('value')
    }
    state = {
        tag['name']: tag['value']
        for tag in soup.select('input[name^=__]')
    }
    data.update(state)
    data['__EVENTTARGET'] = ''
    data['__EVENTARGUMENT'] = ''
    print(data)

    r = s.post(url, data=data)
    new_soup = BeautifulSoup(r.content, 'html5lib')
    print(new_soup)
The initial .get goes fine: I get the HTML for the blank form, and I can extract the parameters into data.
However, the .post returns an HTML page indicating that an error has occurred, with no useful data.
Note that the results are split over multiple pages, and when you go from page to page the following parameters are given values:
data['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
data['__EVENTARGUMENT'] = '1$n'  # where n is the number of the page to retrieve
In the code above I’m initially just trying to get the first page of results and then once that’s working I’ll work out the loop to go through all the results and join them.
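For the later paging step, the two pager fields described above could be built by a small helper like the one sketched here. The helper name pager_fields and the placeholder form data are illustrative assumptions and are not taken from the site; only the target string and the '1$n' pattern come from the observations above.

```python
# Pager control name as observed when moving between result pages.
PAGER_TARGET = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'


def pager_fields(page_number):
    """Return the two postback fields that ask for a given results page.

    Hypothetical helper: it only formats the values, it sends nothing.
    """
    return {
        '__EVENTTARGET': PAGER_TARGET,
        '__EVENTARGUMENT': f'1${page_number}',
    }


# Placeholder form data standing in for the real hidden fields.
form_data = {'__VIEWSTATE': 'placeholder'}

# Merge into a new dict so the original form data is left untouched.
page_3 = {**form_data, **pager_fields(3)}
print(page_3['__EVENTARGUMENT'])  # prints 1$3
```

Keeping the formatting in one place means the eventual loop only has to change the page number.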
Does anyone have an idea of how to handle such a case?
Thanks / Colm
Answers:
Method 1
You can get the tabular content across multiple pages of that website using the requests module. To do so, you have to send multiple POST requests with the appropriate parameters.
Unlike the other parameters, there is one key, ctl00$DefaultContent$AssessorSearch$captcha, whose value is generated dynamically and is not present in the page source.
However, you can still fetch the value of that key using the requests_html library. FYI, requests and requests_html are by the same author. You only need to call the function get_captcha_value() once to get the captcha value, and you can then reuse the same value until the end.
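As a rough illustration of the "fetch once, then reuse" idea, the slow lookup can be wrapped in a cache so that repeated calls do not render the page again. This is only a sketch: render_and_read_captcha is a hypothetical stand-in that returns a dummy string, and whether the site keeps accepting one value for a whole run is the assumption stated above, not something this snippet checks.

```python
from functools import lru_cache

fetch_count = 0  # counts how often the expensive lookup actually runs


def render_and_read_captcha():
    """Hypothetical stand-in for the slow, browser-rendered lookup."""
    global fetch_count
    fetch_count += 1
    return 'dummy-captcha-value'  # placeholder, not a real token


@lru_cache(maxsize=1)
def captcha_value():
    """Run the lookup on the first call; later calls return the cached result."""
    return render_and_read_captcha()


first = captcha_value()
second = captcha_value()
print(first == second, fetch_count)  # prints True 1
```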
The script below currently fetches all the names from all the pages. You can modify the selector to get other fields of interest.
This is how you can go about it:
import requests
from bs4 import BeautifulSoup
from requests_html import HTMLSession

link = 'https://ndber.seai.ie/Pass/assessors/search.aspx'

# Search form fields for the first request (empty filters, domestic assessors).
payload = {
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Name': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$CompanyName': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$County': '',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$searchType': 'rbnDomestic',
    'ctl00$DefaultContent$AssessorSearch$dfSearch$Bottomsearch': 'Search'
}

page = 1

def get_captcha_value():
    # Render the page with requests_html so the dynamically generated
    # captcha value is present, then read it from the input field.
    with HTMLSession() as session:
        r = session.get(link)
        r.html.render(sleep=5)
        captcha_value = r.html.find("input[name$='$AssessorSearch$captcha']", first=True).attrs['value']
        return captcha_value

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")

    # Copy the hidden state fields from the blank form into the payload.
    payload['__VIEWSTATE'] = soup.select_one("#__VIEWSTATE")['value']
    payload['__VIEWSTATEGENERATOR'] = soup.select_one("#__VIEWSTATEGENERATOR")['value']
    payload['__EVENTVALIDATION'] = soup.select_one("#__EVENTVALIDATION")['value']
    payload['ctl00$forgeryToken'] = soup.select_one("#ctl00_forgeryToken")['value']
    payload['ctl00$DefaultContent$AssessorSearch$captcha'] = get_captcha_value()

    while True:
        res = s.post(link, data=payload)
        soup = BeautifulSoup(res.text, "lxml")

        # Stop once a response contains no result rows.
        if not soup.select_one("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"):
            break

        for items in soup.select("table[id$='gridAssessors_gridview'] tr[class$='RowStyle']"):
            _name = items.select_one("td > span").get_text(strip=True)
            print(_name)

        page += 1

        # Rebuild the payload from the inputs of the page just received.
        payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
        # Drop the two button fields; the default avoids a KeyError if one is absent.
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Feedback', None)
        payload.pop('ctl00$DefaultContent$AssessorSearch$dfSearchAgain$Search', None)
        # Ask the pager control for the next page.
        payload['__EVENTTARGET'] = 'ctl00$DefaultContent$AssessorSearch$gridAssessors$grid_pager'
        payload['__EVENTARGUMENT'] = f'1${page}'
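If you want to join the results into one list instead of printing them, the stop-when-empty loop can be separated from the network code. The sketch below is a minimal illustration under stated assumptions: collect_all_pages, its max_pages safety cap, and the fake_site data are hypothetical names made up for this example, and fetch_page stands for whatever function posts the payload and returns the rows of one page.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until a page has no rows.

    max_pages is a safety cap added for this sketch so that a page which
    never comes back empty cannot loop forever.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_page(page)
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


# Fake data standing in for the real site: two pages, then nothing.
fake_site = {1: ['Alice', 'Bob'], 2: ['Carol']}

names = collect_all_pages(lambda page: fake_site.get(page, []))
print(names)  # prints ['Alice', 'Bob', 'Carol']
```

Because the loop only depends on the callable it is given, it can be tried with fake data like this before it is pointed at the real requests.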
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.