Access django models inside of Scrapy

Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?

I’ve seen this, but I don’t really get how to set it up?

Answers:


Method 1

If anyone else is having the same problem, this is how I solved it.

I added this to my scrapy settings.py file:

def setup_django_env(path):
    import imp
    from django.core.management import setup_environ

    # Locate and load the Django project's settings module from `path`.
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure the Django environment from the loaded settings module.
    setup_environ(project)

setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, not the settings.py file.

Now you will have full access to your django models inside of your scrapy project.
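Once the Django environment is configured, a pipeline can write each scraped item through a model. The following is only a minimal sketch, not code from the answer: the class name DjangoModelPipeline and the model attribute are hypothetical, and it assumes the item's field names match the model's field names.

```python
class DjangoModelPipeline:
    """Scrapy item pipeline that saves each scraped item through a Django model."""

    # Hypothetical hook: a subclass sets this to a Django model class,
    # e.g. `model = MyModel` after `from my_app.models import MyModel`.
    model = None

    def process_item(self, item, spider):
        # Assumption: the item's field names match the model's field names.
        self.model.objects.create(**dict(item))
        # Return the item so any later pipelines still receive it.
        return item
```

Scrapy only runs pipelines that are listed in its ITEM_PIPELINES setting, so a subclass with model set would also need an entry there.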

Method 2

The opposite solution (set up Scrapy in a Django management command):

# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py 

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

and in django’s settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_project.settings'

Then, instead of running scrapy foo, run ./manage.py scrapy foo.
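The slice in handle() is what makes this work: Django passes the full command line to run_from_argv, and dropping the first entry leaves a list that looks like a direct scrapy invocation. A small illustration of just that step (the helper name to_scrapy_argv is hypothetical and not part of Django or Scrapy):

```python
def to_scrapy_argv(django_argv):
    """Drop the manage.py entry so the list reads like a plain `scrapy ...` call."""
    return list(django_argv[1:])


# What Django hands to run_from_argv for `./manage.py scrapy foo`:
print(to_scrapy_argv(["./manage.py", "scrapy", "foo"]))  # ['scrapy', 'foo']
```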

Update: fixed the code to bypass Django’s option parsing.

Method 3

Set the DJANGO_SETTINGS_MODULE environment variable in your Scrapy project’s settings.py:

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'your_django_project.settings'

Now you can use DjangoItem in your scrapy project.

Edit:
You have to make sure that the your_django_project package (the one containing settings.py) is importable, i.e. that its parent directory is on PYTHONPATH.
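If you would rather not change PYTHONPATH in the shell, the same effect can be had from Python. This is a sketch under assumptions: the helper name configure_django_env and the directory /path/to/django are placeholders, and the directory passed in must be the one that contains the your_django_project package.

```python
import os
import sys


def configure_django_env(project_dir, settings_module):
    """Make a Django settings module importable and select it.

    Both arguments are placeholders to be replaced with real values.
    """
    project_dir = os.path.abspath(project_dir)
    # Put the directory that contains the Django package on the import path.
    if project_dir not in sys.path:
        sys.path.insert(0, project_dir)
    # Respect a value already set in the environment, if there is one.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)
    return project_dir


configure_django_env("/path/to/django", "your_django_project.settings")
```

On Django 1.7 and later you would then call django.setup() before importing any models.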

Method 4

For Django 1.4, the project layout has changed. Instead of /myproject/settings.py, the settings module is in /myproject/myproject/settings.py.

I also added path’s parent directory (/myproject) to sys.path to make it work correctly.

def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)       

    setup_environ(project)

    # Add path's parent directory to sys.path
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
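To see what the sys.path line above actually appends, here is the same expression on its own (the function name parent_dir is just for illustration):

```python
import os


def parent_dir(path):
    """Return the absolute parent directory of `path`."""
    return os.path.abspath(os.path.join(path, os.path.pardir))


# A trailing slash does not change the result.
print(parent_dir("/path/to/django/myproject/myproject/"))
print(parent_dir("/path/to/django/myproject/myproject"))
```

Both calls give the outer myproject directory. os.path.dirname would not be a drop-in replacement here, because with a trailing slash it only strips the slash.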

Method 5

Check out django-dynamic-scraper, it integrates a Scrapy spider manager into a Django site.

https://github.com/holgerd77/django-dynamic-scraper

Method 6

Why not create an __init__.py file in the Scrapy project folder and hook it up in INSTALLED_APPS? Worked for me. I was able to simply use:

pipelines.py

from my_app.models import MyModel

Hope that helps.

Method 7

setup_environ was deprecated in Django 1.4 and removed in 1.6. On newer versions of Django (1.7+, where django.setup() is available) you may need to do the following in Scrapy’s settings file:

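def setup_django_env():
    import sys, os, django

    # Make the Django project importable, then point Django at its settings.
    sys.path.append('/path/to/django/myapp')
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myapp.settings'

    # Load the settings and populate the app registry so models can be imported.
    django.setup()

setup_django_env()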

Method 8

A minor update to the management command from Method 2 to solve a KeyError, for Python 3 / Django 1.10 / Scrapy 1.2.0:

from django.core.management.base import BaseCommand

class Command(BaseCommand):    
    help = 'Scrapy commands. Accessible from: "Django manage.py". '

    def __init__(self, stdout=None, stderr=None, no_color=False):
        # Pass the arguments through so BaseCommand wraps the real streams.
        super().__init__(stdout=stdout, stderr=stderr, no_color=no_color)

        # Raw command-line arguments; filled in by run_from_argv().
        self._argv = None

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute(stdout=None, stderr=None, no_color=False)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

The SCRAPY_SETTINGS_MODULE declaration is still required.

os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
