I'm trying to scrape a site with Scrapy, but I get a "duplicates" warning and the spider can't move on to the next page.
How can I scrape all pages with pagination?
example site: teknosa.com
scraping url: https://www.teknosa.com/bilgisayar-tablet-c-116
pagination structure: ?s=%3Arelevance&page=0 (then page=1, 2, 3, 4, 5, and so on)
My pagination code:
# The button title 'Daha Fazla Ürün Gör' is Turkish for "See more products".
next_page = soup.find('button', {'title': 'Daha Fazla Ürün Gör'})['data-gotohref']
if next_page is not None:
    next_page = response.urljoin(next_page)
    yield scrapy.Request(next_page, callback=self.parse)
Answers:
Method 1
You can put the pagination into start_urls by generating one URL per page, then widen or narrow the range of page numbers as needed.
import scrapy
from scrapy.crawler import CrawlerProcess

class CarsSpider(scrapy.Spider):
    name = 'car'
    # One start URL per listing page; change the range to cover more or fewer pages.
    start_urls = [
        'https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=' + str(x)
        for x in range(1, 11)
    ]

    def parse(self, response):
        print(response.url)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(CarsSpider)  # crawl() needs the spider class as its argument
    process.start()
Output:
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=1
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10
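The start_urls list above hard-codes the encoded query string. As a rough sketch, the same URLs could also be built with the standard library's urllib.parse.urlencode, which percent-encodes the ":" in ":relevance" for you. The helper name build_page_urls and its parameters are made up for illustration; they are not part of Scrapy or of the original answer.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.teknosa.com/bilgisayar-tablet-c-116"


def build_page_urls(base_url, first_page, last_page, sort=":relevance"):
    """Return one listing URL per page number, first_page..last_page inclusive.

    Hypothetical helper: the 's' and 'page' parameter names come from the
    question's URL; everything else is an assumption for this sketch.
    """
    return [
        f"{base_url}?{urlencode({'s': sort, 'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    for url in build_page_urls(BASE_URL, 1, 3):
        print(url)
```

Inside the spider, the result could be assigned directly, for example start_urls = build_page_urls(BASE_URL, 1, 10). Because every URL is distinct, Scrapy's duplicate filter should have no reason to drop any of these requests.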
Multiple URLs, pagination using a for loop:
import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'

    def start_requests(self):
        urls = ['url_1', 'url_2', 'url_3', ...]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        ...
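If you would rather follow the pages one at a time instead of listing them all up front, one option is to derive the next URL from the current one by incrementing the page parameter. The sketch below uses only the standard library; next_page_url and max_page are hypothetical names, and the page limit is an assumption you would have to set yourself, since this code cannot tell when the site runs out of pages.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, max_page):
    """Return current_url with 'page' increased by one, or None past max_page.

    Hypothetical helper for illustration; a URL without a 'page' parameter
    is treated as page 0, matching the question's pagination structure.
    """
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    page = int(query.get("page", ["0"])[0]) + 1
    if page > max_page:
        return None
    query["page"] = [str(page)]
    # doseq=True writes each list value back as a plain key=value pair.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


if __name__ == "__main__":
    url = "https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=1"
    print(next_page_url(url, max_page=10))
```

In parse, this could be used as next_url = next_page_url(response.url, max_page=10), followed by yielding scrapy.Request(next_url, callback=self.parse) only when next_url is not None. Each request then targets a URL that has not been seen before, which is what Scrapy's duplicate filter checks.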
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.