How to crawl a web page with Scrapy and Python
Scrapy is a Python framework for large-scale web scraping. It lets you build web crawlers that extract, process, and store web data. This tutorial walks through crawling and scraping web pages with Scrapy and Python, and saving the results through a Django model.
1) Create app
Also read : How to Install Django | Basic Configuration
2) Define the data structure as a Django model
# devnote/student/models.py
from django.db import models


class BlogPost(models.Model):
    name = models.TextField(null=True)
    image = models.TextField(null=True)
    category = models.TextField(null=True)
    created_at = models.DateTimeField(auto_now_add=True, null=True)

    class Meta:
        db_table = 'posts'
Also read : How to display scraping data in django admin panel
3) Install Scrapy
pip install scrapy
Scrapy may take a few minutes to pull down its dependencies, compile, and install. To test that Scrapy is installed correctly, run the commands below:
$ python
>>> import scrapy
>>>
If you get an import error, it is likely that Scrapy was not linked against a particular dependency correctly.
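To check several packages at once without triggering an import error, a small helper along these lines can report which ones are missing. This is only a sketch: the function name missing_packages is made up for illustration, and it relies on importlib.util.find_spec from the standard library. The package list reflects the modules imported in the later steps of this tutorial.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Modules imported later in this tutorial
    required = ["scrapy", "scrapy_djangoitem", "django"]
    missing = missing_packages(required)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result means every listed module can be located; it does not prove that each one imports cleanly, so the interactive import above is still the definitive test.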
4) Creating the Scrapy project
To create the Scrapy project, run the following command:
root@devnote:/devnote-scrapy# scrapy startproject scraper
The Scrapy project structure looks like this:
|-- scraper
|   |-- scrapy.cfg
|   |-- scraper
|   |   |-- __init__.py
|   |   |-- items.py
|   |   |-- middlewares.py
|   |   |-- pipelines.py
|   |   |-- settings.py
|   |   |-- spiders
|   |   |   |-- __init__.py
|   |   |   |-- run.py
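5) Connect using DjangoItem
DjangoItem is provided by the separate scrapy-djangoitem package, so install it first with pip install scrapy-djangoitem.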
# devnote-scrapy/scraper/scraper/items.py
import scrapy
from scrapy_djangoitem import DjangoItem
from student.models import BlogPost


class ScraperItem(scrapy.Item):
    pass


class BlogPostItem(DjangoItem):
    # Item fields are generated from the Django model
    django_model = BlogPost
6) Create the run.py file
Create a new file in the spiders directory and name it run.py.
# devnote-scrapy/scraper/scraper/spiders/run.py
import scrapy

from scraper.items import BlogPostItem


class DevnoteSpider(scrapy.Spider):
    name = 'devnote'
    allowed_domains = ["devnote.in"]
    start_urls = ['https://devnote.in/']

    def parse(self, response):
        # Each home-page column holds one blog post
        for values in response.css("#main_post_display .home-page-column"):
            post = BlogPostItem()
            title = values.css("h4 a::text").extract_first()
            image = values.css(".featured-image img, .language-image-home img").xpath("@src").extract_first()
            category_p_a = values.css(".entry-categories p a::text").extract()

            # Build the category string; every name gets a leading space,
            # so the result looks like " django, python"
            categories_join = ""
            for data_p in category_p_a:
                categories_join += " " + data_p + ","
            categories = categories_join.rstrip(" ,")

            # Fill the item that the pipeline saves to the database
            post['name'] = title
            post['image'] = image
            post['category'] = categories
            yield post
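The category-building step in the spider leaves a leading space in the stored value (for example ' django, python'). If a cleaner string is preferred, the same step can be sketched with str.join. The helper name join_categories is hypothetical and not part of the original spider.

```python
def join_categories(names):
    """Join category names into one comma-separated string.

    Surrounding whitespace is stripped and empty entries are skipped.
    """
    cleaned = (name.strip() for name in names)
    return ", ".join(name for name in cleaned if name)


print(join_categories(["django", " python "]))  # django, python
```

In the spider, this would amount to writing categories = join_categories(category_p_a) in place of the loop.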
8) Pipeline
The pipeline saves each scraped item to the database.
# devnote-scrapy/scraper/scraper/pipelines.py
class ScraperPipeline:
    def process_item(self, item, spider):
        item.save()  # added: store the item through the Django model
        return item
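Scrapy calls process_item once for every item the spider yields, and the returned item is handed on to the next pipeline. The sketch below imitates that flow without Scrapy or Django so it can be run on its own. FakeItem and run_pipeline are stand-ins invented for this illustration, not real Scrapy or DjangoItem classes.

```python
class FakeItem(dict):
    """Stand-in for a DjangoItem: a dict with a save() method (hypothetical)."""

    saved = []  # records every item that was "saved"

    def save(self):
        FakeItem.saved.append(dict(self))


class ScraperPipeline:
    def process_item(self, item, spider):
        item.save()   # persist the item
        return item   # hand the item on to the next pipeline


def run_pipeline(pipeline, items, spider=None):
    """Feed each item through the pipeline, one call per item."""
    return [pipeline.process_item(item, spider) for item in items]


items = [FakeItem(name="First post"), FakeItem(name="Second post")]
processed = run_pipeline(ScraperPipeline(), items)
print(len(FakeItem.saved))  # 2
```

Returning the item matters: a pipeline that forgets the return statement passes None to any pipeline that runs after it.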
9) Configure item pipelines
Next, update the settings.py file, which needs only two quick changes.
# devnote-scrapy/scraper/scraper/settings.py
import os
import sys

# Make the Django project importable from Scrapy
PROJECT_DIR = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
)
sys.path.append(os.path.join(PROJECT_DIR, 'devnote'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'devnote.settings'

import django
django.setup()

# Configure item pipelines
ITEM_PIPELINES = {
    'scraper.pipelines.ScraperPipeline': 300,
}
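To see what the three nested os.path.dirname calls resolve to, the same computation can be traced with pathlib. The sketch assumes settings.py lives at /devnote-scrapy/scraper/scraper/settings.py, which is inferred from the file paths and shell prompts in this tutorial, and it uses PurePosixPath so no real files are needed.

```python
from pathlib import PurePosixPath

# Assumed location of the Scrapy settings file (inferred from this tutorial's paths)
settings_file = PurePosixPath("/devnote-scrapy/scraper/scraper/settings.py")

# parents[0] is the inner "scraper" package, parents[1] the Scrapy project,
# parents[2] the top-level folder: the same result as three dirname() calls.
project_dir = settings_file.parents[2]
django_dir = project_dir / "devnote"

print(project_dir)  # /devnote-scrapy
print(django_dir)   # /devnote-scrapy/devnote
```

Under that assumption, the directory appended to sys.path is the Django project folder, which is what allows devnote.settings and student.models to be imported from inside Scrapy.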
10) Run the spider
root@devnote:/devnote-scrapy/scraper# scrapy crawl devnote

# You can also store the scraped data in a JSON file
root@devnote:/devnote-scrapy/scraper# scrapy crawl devnote -o output.json
This writes the scraped items to output.json.

You can inspect the JSON file from the Python shell. Run the commands below:
$ python
>>> import json
>>> data = open("output.json").read()
>>> response = json.loads(data)
>>> len(response)
12
Check the first item:
>>> response[0]
{'name': 'Django admin panel app model list_display not showing', 'image': 'https://devnote.in/wp-content/uploads/2020/04/Django-admin-panel-app-model-list_display-not-showing.png', 'category': ' django, python'}
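Once the data is loaded, ordinary Python tools can summarise it. As a rough sketch, the helper below counts how often each category appears. The name count_categories is hypothetical, and the sample list only mimics the shape of the items in output.json with illustrative values.

```python
import json
from collections import Counter


def count_categories(items):
    """Count category names across items whose 'category' is a comma-separated string."""
    counts = Counter()
    for item in items:
        # 'category' may be missing or null, so fall back to an empty string
        for name in (item.get("category") or "").split(","):
            name = name.strip()
            if name:
                counts[name] += 1
    return counts


# Sample input in the same shape as output.json (values are illustrative only)
sample = json.loads(
    '[{"name": "Post A", "category": " django, python"},'
    ' {"name": "Post B", "category": " python"},'
    ' {"name": "Post C", "category": null}]'
)
print(count_categories(sample))
```

With the real file, pass the list loaded from output.json instead of sample.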
Download here : devnote-scrapy