How to crawl a web page with Scrapy and Python
Scrapy is a Python framework for large-scale web scraping. It gives you the tools to build web crawlers that extract, process, and store web data. This tutorial covers crawling and scraping web pages with Scrapy and Python, and saving the results through a Django model.
1) Create the Django app
Also read : How to Install Django | Basic Configuration
2) Define the data structure: our Django model
# devnote/student/models.py
from django.db import models


class BlogPost(models.Model):
    name = models.TextField(null=True)
    image = models.TextField(null=True)
    category = models.TextField(null=True)
    created_at = models.DateTimeField(auto_now_add=True, null=True)

    class Meta:
        db_table = 'posts'
Also read : How to display scraping data in django admin panel
3) Install Scrapy
pip install scrapy
pip install scrapy-djangoitem  # provides DjangoItem, used in step 5
Scrapy may take a few minutes to download its dependencies, compile, and install. To test that Scrapy is installed correctly, run the following commands:
$ python
>>> import scrapy
>>>
If you get an import error, it is likely that Scrapy was not linked against a particular dependency correctly.
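The same check can be scripted, which is handy when several packages are needed. The following is a minimal sketch; the helper name is_importable is hypothetical and not part of Scrapy. It only confirms that a package can be found, so running import scrapy in the interpreter is still the real test for a broken dependency.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the import system can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Report whether each package this tutorial relies on is installed
    for name in ("scrapy", "django"):
        status = "OK" if is_importable(name) else "missing"
        print(f"{name}: {status}")
```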
4) Creating the Scrapy project
To create the Scrapy project, run the following command:
root@devnote:/devnote-scrapy# scrapy startproject scraper
The project structure looks like this (run.py is added in step 6):
scraper
|-- scrapy.cfg
|-- scraper
|   |-- __init__.py
|   |-- items.py
|   |-- middlewares.py
|   |-- pipelines.py
|   |-- settings.py
|   |-- spiders
|   |   |-- __init__.py
|   |   |-- run.py
5) Connect using DjangoItem
# devnote-scrapy/scraper/scraper/items.py
import scrapy
from scrapy_djangoitem import DjangoItem
from student.models import BlogPost


class ScraperItem(scrapy.Item):
    pass


class BlogPostItem(DjangoItem):
    # Item fields are generated from the BlogPost model
    django_model = BlogPost
6) Create the run.py file
Create a new file in the spiders directory and name it run.py.
# devnote-scrapy/scraper/scraper/spiders/run.py
import scrapy
from scraper.items import BlogPostItem


class DevnoteSpider(scrapy.Spider):
    name = 'devnote'
    allowed_domains = ["devnote.in"]
    start_urls = ('https://devnote.in/',)

    def parse(self, response):
        # Each .home-page-column block holds one post on the home page
        for values in response.css("#main_post_display .home-page-column"):
            post = BlogPostItem()
            title = values.css("h4 a::text").extract_first()
            image = values.css(
                ".featured-image img, .language-image-home img"
            ).xpath("@src").extract_first()
            category_p_a = values.css(".entry-categories p a::text").extract()

            # Build one string from the category names, e.g. " django, python"
            # (the leading space is kept, as in the sample output below)
            categories_join = ""
            for data_p in category_p_a:
                categories_join += " " + data_p + ","
            categories = categories_join.rstrip(" ,")

            # Fill the item; the pipeline saves it to the database
            post['name'] = title
            post['image'] = image
            post['category'] = categories
            yield post
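The category string built in the spider can be tried outside Scrapy. The sketch below reproduces the spider's result for a made-up list of category names and, for comparison, shows an alternative that avoids the leading space. Both function names are hypothetical and used only for illustration.

```python
def join_categories_like_spider(names):
    """Reproduce the spider's result: each name is added as " name,"."""
    joined = ""
    for name in names:
        joined += " " + name + ","
    # Strip trailing commas and spaces; the leading space remains
    return joined.rstrip(" ,")


def join_categories_clean(names):
    """Alternative: trim each name and join with ", " (no leading space)."""
    return ", ".join(name.strip() for name in names)


if __name__ == "__main__":
    sample = ["django", "python"]  # made-up input for illustration
    print(repr(join_categories_like_spider(sample)))  # ' django, python'
    print(repr(join_categories_clean(sample)))        # 'django, python'
```

Switching the spider to the second form would change the stored category values, so any code that reads them should be checked first.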
8) Pipeline
Use the item pipeline to save the scraped items to the database.
# devnote-scrapy/scraper/scraper/pipelines.py
class ScraperPipeline(object):
    def process_item(self, item, spider):
        item.save()  # added: saves the item through its Django model
        return item
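Scrapy calls process_item once for every item the spider yields and passes the returned item on to the next pipeline. The pipeline can be exercised without Scrapy or a database by handing it a stand-in item. In this minimal sketch, FakeItem is a hypothetical class that only records whether save() was called; it is not part of Scrapy or scrapy-djangoitem.

```python
class ScraperPipeline(object):
    def process_item(self, item, spider):
        item.save()
        return item


class FakeItem(dict):
    """Hypothetical stand-in for BlogPostItem; records save() calls."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.saved = False

    def save(self):
        # A real DjangoItem would write a row through the Django model here
        self.saved = True


if __name__ == "__main__":
    item = FakeItem(name="Example post", image=None, category="python")
    result = ScraperPipeline().process_item(item, spider=None)
    print(result.saved, result is item)
```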
9) Configure item pipelines
The settings.py file only requires two quick updates: set up Django so the models can be imported, and register the pipeline.
# devnote-scrapy/scraper/scraper/settings.py
import os
import sys

# Make the Django project importable from the Scrapy project
PROJECT_DIR = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
)
sys.path.append(os.path.join(PROJECT_DIR, 'devnote'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'devnote.settings'

import django
django.setup()

# Configure item pipelines
ITEM_PIPELINES = {
    'scraper.pipelines.ScraperPipeline': 300,
}
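The three nested dirname calls walk up from settings.py to the folder that holds both projects. The sketch below shows the result for an assumed file location, taken from the paths used in this tutorial. The helper project_dir_for is hypothetical, and posixpath is used so the example behaves the same on every operating system.

```python
import posixpath


def project_dir_for(settings_path):
    """Apply dirname three times, as settings.py does with its own path."""
    path = settings_path
    for _ in range(3):
        path = posixpath.dirname(path)
    return path


if __name__ == "__main__":
    # Assumed location of settings.py, based on the paths in this tutorial
    settings_path = "/devnote-scrapy/scraper/scraper/settings.py"
    project_dir = project_dir_for(settings_path)
    print(project_dir)                              # /devnote-scrapy
    print(posixpath.join(project_dir, "devnote"))   # /devnote-scrapy/devnote
```

Under that assumption, the Django project named devnote is expected to sit next to the Scrapy project inside devnote-scrapy. If the folders are laid out differently, the sys.path entry has to be adjusted.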
10) Run the spider
root@devnote:/devnote-scrapy/scraper# scrapy crawl devnote
# You can also store the scraped data in a JSON file
root@devnote:/devnote-scrapy/scraper# scrapy crawl devnote -o output.json
The scraped items are written to output.json.
[Screenshot: scraping process running]
You can open the JSON file on its own. Run the following commands:
$ python
>>> import json
>>> data = open("output.json").read()
>>> response = json.loads(data)
>>> len(response)
12
Check the first row:
>>> response[0]
{'name': 'Django admin panel app model list_display not showing', 'image': 'https://devnote.in/wp-content/uploads/2020/04/Django-admin-panel-app-model-list_display-not-showing.png', 'category': ' django, python'}
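Once the rows are loaded, they can be summarized with a few lines of Python. The following is a minimal sketch that counts how often each category appears. The function count_categories is hypothetical, and the one-row sample only mimics the shape of output.json with made-up values.

```python
import json
from collections import Counter


def count_categories(items):
    """Count how often each category appears across scraped items."""
    counts = Counter()
    for item in items:
        # 'category' is a comma-separated string; it may be empty or None
        for name in (item.get("category") or "").split(","):
            name = name.strip()
            if name:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # One-row sample in the same shape as output.json (made-up values)
    sample = json.loads(
        '[{"name": "Example post", "image": null, "category": " django, python"}]'
    )
    print(count_categories(sample))
```

To run it on the real data, pass the list returned by json.loads(data) from the session above.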
Download here : devnote-scrapy