Scraping dynamic content using python-Scrapy

Disclaimer: I’ve seen numerous similar posts on StackOverflow and tried to do it the same way, but those approaches don’t seem to work on this website.

I’m using Python-Scrapy for getting data from koovs.com.

However, I’m not able to get the product size, which is dynamically generated. Specifically, if someone could guide me a little on getting the ‘Not available’ size tag from the drop-down menu on this link, I’d be grateful.

I am able to get the size list statically, but that gives me only the list of sizes, not which of them are available.

Answers:

Method 1

You can also solve it with ScrapyJS (no need for selenium and a real browser):

This library provides Scrapy+JavaScript integration using Splash.

Follow the installation instructions for Splash and ScrapyJS, then start the Splash Docker container:

$ docker run -p 8050:8050 scrapinghub/splash
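Once the container is up, one way to check it before wiring it into Scrapy is to ask Splash's render.html HTTP endpoint for a page directly. A minimal sketch, assuming Splash is reachable on localhost:8050 (substitute your Docker host's address); the helper name splash_render_url is made up for illustration:

```python
from urllib.parse import urlencode


def splash_render_url(splash_url, page_url, wait=0.5):
    """Build a GET URL for Splash's render.html endpoint.

    splash_url is an assumption about where the container listens,
    e.g. 'http://localhost:8050'.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_url.rstrip("/"), query)


if __name__ == "__main__":
    # Opening the printed URL in a browser (or with curl) should return
    # the page's HTML after its JavaScript has run.
    print(splash_render_url("http://localhost:8050", "http://www.koovs.com/"))
```

If that URL returns rendered HTML, the container is working and any remaining problem is on the Scrapy side.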

Put the following settings into settings.py:

SPLASH_URL = 'http://192.168.59.103:8050'  # replace with your Docker host's address

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

And here is your sample spider that is able to see the size availability information:

# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["koovs.com"]
    start_urls = (
        'http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        # skip the first <option>, then print each size label
        for option in response.css("div.select-size select.sizeOptions option")[1:]:
            print(option.xpath("text()").extract())

Here is what is printed on the console:

['S / 34 -- Not Available']
['L / 40 -- Not Available']
['L / 42']
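
The labels above carry the availability as a text suffix. If structured data is needed, each label can be split into a size and a flag. A small sketch, assuming the "-- Not Available" suffix seen in the output is the only marker the site uses; parse_size_label is a hypothetical helper, not part of Scrapy:

```python
NOT_AVAILABLE_SUFFIX = "-- Not Available"


def parse_size_label(label):
    """Split an option label into (size, available)."""
    label = label.strip()
    if label.endswith(NOT_AVAILABLE_SUFFIX):
        # drop the suffix and the whitespace that preceded it
        size = label[: -len(NOT_AVAILABLE_SUFFIX)].strip()
        return size, False
    return label, True


if __name__ == "__main__":
    for text in ["S / 34 -- Not Available", "L / 40 -- Not Available", "L / 42"]:
        print(parse_size_label(text))
```

Inside the spider, the same function could be applied to each extracted text before yielding an item.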

Method 2

From what I understand, the size availability is determined dynamically by JavaScript executed in the browser. Scrapy is not a browser and cannot execute JavaScript.
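
To see the difference in practice, compare the option labels in the HTML that a plain download returns with those in the browser-rendered DOM. The sketch below uses only the standard library; both HTML strings are made-up stand-ins for the real page (the rendered one mirrors the labels reported in this thread), and option_texts is a hypothetical helper:

```python
from html.parser import HTMLParser


class OptionTextParser(HTMLParser):
    """Collect the text of every <option> element."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option:
            self.options[-1] += data


def option_texts(html):
    """Return the stripped text of each <option> in the given HTML."""
    parser = OptionTextParser()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.options]


# Made-up markup: what a plain HTTP download might contain ...
RAW_HTML = (
    '<select class="sizeOptions">'
    "<option>Select Size</option><option>S / 34</option><option>L / 42</option>"
    "</select>"
)
# ... and what the DOM might look like after the page's JavaScript ran.
RENDERED_HTML = (
    '<select class="sizeOptions">'
    "<option>Select Size</option>"
    "<option>S / 34 -- Not Available</option><option>L / 42</option>"
    "</select>"
)

if __name__ == "__main__":
    print(option_texts(RAW_HTML))
    print(option_texts(RENDERED_HTML))
```

If the availability text shows up only in the second list, the data is being added client-side and a JavaScript-capable tool is needed.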

If you are okay with switching to the Selenium browser automation tool, here is some sample code:

from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()  # can be webdriver.PhantomJS()
browser.get('http://www.koovs.com/only-onlall-stripe-ls-shirt-59554.html?from=category-651&skuid=236376')

# wait for the select element to become visible
select_element = WebDriverWait(browser, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.select-size select.sizeOptions")))

select = Select(select_element)
for option in select.options[1:]:
    print(option.text)

browser.quit()

It prints:

S / 34 -- Not Available
L / 40 -- Not Available
L / 42

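Note that in place of Firefox you can use other webdrivers like Chrome or Safari. There is also an option to use a headless browser: PhantomJS at the time of writing, although Selenium 4 dropped PhantomJS support, so headless Firefox or Chrome is the replacement there.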

You can also combine Scrapy with Selenium if needed.

Method 3

I faced that problem and solved it easily by following these steps:

pip install splash
pip install scrapy-splash
pip install scrapyjs

Download and install Docker Toolbox.

Open the Docker Quickstart Terminal and enter

$ docker run -p 8050:8050 scrapinghub/splash

To set SPLASH_URL, check the default IP configured in the Docker machine by entering

$ docker-machine ip default

(My IP was 192.168.99.100.) Then put the following into settings.py:

SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'

That’s it!
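
Note that scrapyjs is the former name of the scrapy-splash package installed above. If the scrapyjs import paths fail, the settings can be written against scrapy_splash instead. The fragment below follows the scrapy-splash README as best I recall it, so treat the middleware list and priorities as assumptions to check against the version you install; the IP is the one from this answer:

```python
# settings.py fragment for the scrapy-splash package (assumed layout,
# verify against the README of the installed version)
SPLASH_URL = 'http://192.168.99.100:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

A spider that passes the 'splash' key in request meta, as in Method 1, should not need changes for this.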

Method 4

You have to parse the JSON that the website serves instead of the HTML. For examples, see
scrapy.readthedocs and
testingcan.github.io

import json

import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    allowed_domains = ['quotes.toscrape.com']
    page = 1
    start_urls = ['http://quotes.toscrape.com/api/quotes?page=1']

    def parse(self, response):
        # the API returns JSON, so decode the body instead of using selectors
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {"quote": quote["text"]}
        # keep requesting pages until the API reports there are no more
        if data["has_next"]:
            self.page += 1
            url = "http://quotes.toscrape.com/api/quotes?page={}".format(self.page)
            yield scrapy.Request(url=url, callback=self.parse)
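
The parsing logic in the spider can be tried without running a crawl. A minimal sketch, assuming a payload with the same "quotes" and "has_next" keys the spider reads; extract_quotes is a hypothetical helper and the sample payload is invented:

```python
import json

API_URL = "http://quotes.toscrape.com/api/quotes?page={}"


def extract_quotes(body, current_page):
    """Return (items, next_url) for one JSON response body.

    next_url is None when the payload reports no further pages.
    """
    data = json.loads(body)
    items = [{"quote": quote["text"]} for quote in data["quotes"]]
    next_url = API_URL.format(current_page + 1) if data["has_next"] else None
    return items, next_url


if __name__ == "__main__":
    # Invented sample payload shaped like the fields the spider reads.
    sample = json.dumps({"has_next": True, "quotes": [{"text": "Hello"}]})
    print(extract_quotes(sample, current_page=1))
```

For the original question, the same approach applies if the browser's network tab shows a JSON request that carries the size availability.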

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
