Web scraping pagination with scrapy

I’m trying web scraping with scrapy. But I got “duplicates” warning. Can’t jump next page.

How can I scrape all pages with pagination?

example site: teknosa.com

scraping url: https://www.teknosa.com/bilgisayar-tablet-c-116

pagination structure: ?s=%3Arelevance&page=0 (1,2,3,4,5, and more..)

My pagination code:

    next_page = soup.find('button', {'title': 'Daha Fazla Ürün Gör'})['data-gotohref']
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can make the pagination in start_urls and increase or decrease range of page numbers.

import scrapy
from scrapy.crawler import CrawlerProcess

class CarsSpider(scrapy.Spider):
    name = 'car'
    start_urls=['https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page='+str(x)+'' for x in range(1,11)]
        
    def parse(self, response):
       print(response.url)

if __name__ == "__main__":
    process =CrawlerProcess()
    process.crawl()
    process.start()

Output:

https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=1
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=2
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=5
2022-05-01 08:55:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=6
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=7
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=3
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=4
2022-05-01 08:55:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=8
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=9
2022-05-01 08:55:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10> (referer: None)
https://www.teknosa.com/bilgisayar-tablet-c-116?s=%3Arelevance&page=10

Multiple urls, pagination using for loop

import scrapy

class CarsSpider(scrapy.Spider):
    name = 'car'
    
    def start_requests(self):
        urls=['url_1', 'url_2', 'url_3', ...]
        for url in urls:
            yield scrapy.Request(url=url,callback=self.parse)
           
    def parse(self, response):
        ...


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x