Scrapy Framework: Getting Started and Practical Use
Published: 2019-05-27



I. Preface

  • Operating system: Windows 10 Pro
  • Virtual environment: Anaconda
  • Python version: 3.7
  • XPath tool: xpath-helper
  • IDE: PyCharm 2020.1
  • References

Scrapy official site:

Scrapy tutorial:

Scrapy architecture:

Scrapy settings reference:

XPath:

Anaconda tutorial:

PyMySQL:

II. Main Content

  • Scrapy is an open-source, collaborative framework for extracting the data you need from websites.

1. Architecture Overview

[Figure: Scrapy architecture diagram]

Component overview:

  • engine: handles communication and data transfer between all the other components
  • spiders: crawl entry points and page parsing
  • scheduler: the request queue
  • downloader: downloads pages
  • item pipelines: data processing and storage
  • downloader middlewares: hook between the engine and the downloader to process requests and responses; extensible, e.g. for wrapping a proxy or modifying HTTP headers
  • spider middlewares: process the spiders' input (responses) and output (requests and items); extensible, letting you adjust both directions

Data flow:

(1) (2) The engine forwards requests from the spiders to the scheduler for queuing.

(3) (4) Once queued, the engine forwards the requests to the downloader, which downloads the pages.

(5) (6) The engine forwards the pages the downloader fetched to the spiders for parsing.

(7) (8) The engine forwards the data the spiders parsed to the item pipelines for processing and storage.

More precisely, the spiders' parsed output is split in two: items, which the engine forwards to the item pipelines for processing and storage, and follow-up requests, which the engine forwards to the scheduler for queuing, repeating steps (1) (2).

If a page download fails, the engine re-submits the request to the scheduler for queuing, repeating steps (3) (4). A minimal spider sketch of this cycle follows.
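To make the cycle concrete, here is a minimal sketch (not part of this project; the URL and selectors are hypothetical) of where a spider feeds requests and items into that flow:

import scrapy

class MinimalSpider(scrapy.Spider):
    name = 'minimal'
    start_urls = ['https://example.com']  # hypothetical entry URL

    def parse(self, response):
        # A yielded item travels engine -> item pipelines (steps 7-8)
        yield {'title': response.xpath('//title/text()').extract_first()}
        # A yielded request travels engine -> scheduler (steps 1-2 again)
        next_page = response.xpath("//a[@rel='next']/@href").extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)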

2. Practical Usage

Scenario: scrape the Douban Movie Top 250.

1) Create the Project

For Anaconda installation and usage, see the reference links in the Preface.

  • Create and activate a scrapy environment, then create a template project (the generated layout is shown after this list):

# Create the scrapy environment
> conda create -n scrapy_env python=3.7 scrapy
# Activate the scrapy environment
> activate scrapy_env
# Create a template project: scrapy startproject [project name]
> scrapy startproject scrapy_douban
  • Import the project into PyCharm and switch the Python interpreter to the scrapy environment.
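For reference, scrapy startproject generates the standard Scrapy layout below (the directory names match this project; the tree itself is not from the original post):

scrapy_douban/
    scrapy.cfg            # deploy configuration
    scrapy_douban/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules
            __init__.py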

2) Configuration File

  • Edit the scrapy_douban/settings.py file:
# Scrapy settings for scrapy_douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_douban'

SPIDER_MODULES = ['scrapy_douban.spiders']
NEWSPIDER_MODULE = 'scrapy_douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'scrapy_douban.middlewares.ScrapyDoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy_douban.middlewares.ScrapyDoubanDownloaderMiddleware': 543,
    # 'scrapy_douban.middlewares.proxy_ip': 544,  # proxy IP, priority 544
    'scrapy_douban.middlewares.random_user_agent': 545,  # random User-Agent
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_douban.pipelines.ScrapyDoubanPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MySQL connection settings (imported by pipelines.py)
mysql_host = '127.0.0.1'
mysql_port = 3306
mysql_dbname = 'python-db'
mysql_username = 'root'
mysql_pwd = '123456'
Parameter | Description
USER_AGENT | client (User-Agent) identification string
ROBOTSTXT_OBEY | whether to obey the robots.txt protocol
CONCURRENT_REQUESTS | maximum concurrent requests
DOWNLOAD_DELAY | delay between downloads
CONCURRENT_REQUESTS_PER_DOMAIN | concurrent requests per domain
CONCURRENT_REQUESTS_PER_IP | concurrent requests per IP
COOKIES_ENABLED | whether cookies are enabled; needed for login flows
DEFAULT_REQUEST_HEADERS | default request headers
SPIDER_MIDDLEWARES | spider middlewares
DOWNLOADER_MIDDLEWARES | downloader middlewares
EXTENSIONS | extensions
ITEM_PIPELINES | item pipelines component
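These values apply project-wide; Scrapy also lets a single spider override them via the custom_settings class attribute, which takes precedence over settings.py for that spider only. A minimal sketch (the spider name and values are illustrative, not from this project):

import scrapy

class PoliteSpider(scrapy.Spider):
    name = 'polite_spider'  # hypothetical spider
    # Per-spider overrides; apply to this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS': 8,
    }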

3) Requests and Parsing

  • Go to the project's spiders directory and generate the entry file douban_spider.py:
# scrapy genspider [filename] [entry domain]
> scrapy genspider douban_spider movie.douban.com
  • Create main.py in the project root as the program's entry point:
from scrapy import cmdline

# Run the cmd command that starts the crawler
cmdline.execute('scrapy crawl douban_spider'.split())

# Export a CSV file instead (use Notepad++ to change the encoding to UTF-8 BOM)
# cmdline.execute('scrapy crawl douban_spider -o douban.csv'.split())
  • Edit items.py (the data model class):
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyDoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()  # ranking
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # synopsis
    star = scrapy.Field()           # rating
    evaluate = scrapy.Field()       # number of reviews
    slogan = scrapy.Field()         # tagline
  • Edit douban_spider.py (implements crawling and page parsing):
import scrapy
from scrapy_douban.items import ScrapyDoubanItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'  # spider name; must not be the same as the project name
    allowed_domains = ['movie.douban.com']  # only follow links under this domain
    start_urls = ['https://movie.douban.com/top250']  # entry URL

    # Parse the response returned by the downloader component
    def parse(self, response):
        # print(response.text)
        # Get the list of li tags
        movie_list = response.xpath("//div[@class='article']//ol[@class='grid_view']//li")
        # Parse each li tag
        for item in movie_list:
            douban_item = ScrapyDoubanItem()
            # Extract each field via XPath
            douban_item["serial_number"] = item.xpath(".//div[@class='item']//em//text()").extract_first()
            douban_item["movie_name"] = item.xpath(".//div[@class='info']//div[@class='hd']//a//span[1]//text()").extract_first()
            # The synopsis spans several text nodes: strip the whitespace
            # from each fragment and join them all
            introduces = item.xpath(".//div[@class='info']//div[@class='bd']//p[1]//text()").extract()
            douban_item["introduce"] = ";".join("".join(i.split()) for i in introduces)
            douban_item["star"] = item.xpath(".//span[@class='rating_num']//text()").extract_first()
            douban_item["evaluate"] = item.xpath(".//div[@class='star']//span[4]//text()").extract_first()
            douban_item["slogan"] = item.xpath(".//p[@class='quote']//span//text()").extract_first()
            # Hand the item to pipelines.py (requires ITEM_PIPELINES in settings.py)
            yield douban_item
        # Get the "next page" link
        next_link = response.xpath("//span[@class='next']//link//@href").extract()
        # Stop if this is the last page
        if next_link:
            next_link = next_link[0]
            # Submit the request to the scheduler; the response calls back into parse
            yield scrapy.Request(self.start_urls[0] + next_link, callback=self.parse)
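When writing these XPath expressions, scrapy shell is handy for interactive testing. A sketch (the USER_AGENT override is an assumption: Douban appears to reject Scrapy's default agent, so a browser-like string is passed in):

> scrapy shell -s USER_AGENT="Mozilla/5.0" https://movie.douban.com/top250
>>> response.xpath("//span[@class='rating_num']//text()").extract_first()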

4) Data Storage

  • Install PyMySQL:
> conda install pymysql
  • Edit pipelines.py (data processing and storage):
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql

from scrapy_douban.settings import mysql_dbname, mysql_host, mysql_port, mysql_pwd, mysql_username


class ScrapyDoubanPipeline:
    # Process each item
    def process_item(self, item, spider):
        # Insert the item into the database
        self.insert(item)
        return item

    # Insert one row
    def insert(self, item):
        # Pack the SQL parameter values
        values = (int(item["serial_number"]), item["movie_name"], item["introduce"],
                  float(item["star"]), item["evaluate"], item["slogan"])
        # Connect to the database
        conn = pymysql.connect(host=mysql_host, user=mysql_username, password=mysql_pwd,
                               port=mysql_port, db=mysql_dbname)
        # Get a cursor
        cursor = conn.cursor()
        # Insert statement
        sql = 'INSERT INTO douban(serial_number, movie_name, introduce, star, evaluate, slogan) VALUES (%s, %s, %s, %s, %s, %s)'
        try:
            cursor.execute(sql, values)
            conn.commit()
            print("Insert succeeded: " + str(values))
        except Exception as ex:
            print("Exception occurred: %s" % ex)
            conn.rollback()
            print("Rolled back: " + str(values))
        # Close the database connection
        finally:
            conn.close()
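The INSERT above assumes a douban table already exists in python-db. A minimal one-off script to create it (the column types are assumptions inferred from the values the pipeline packs, not from the original post):

import pymysql

# Connection values mirror the settings.py entries above
conn = pymysql.connect(host='127.0.0.1', user='root', password='123456',
                       port=3306, db='python-db')
ddl = """
CREATE TABLE IF NOT EXISTS douban (
    serial_number INT,            -- ranking
    movie_name    VARCHAR(100),   -- movie title
    introduce     VARCHAR(500),   -- synopsis
    star          FLOAT,          -- rating
    evaluate      VARCHAR(50),    -- number of reviews
    slogan        VARCHAR(200)    -- tagline
)
"""
with conn.cursor() as cursor:
    cursor.execute(ddl)
conn.commit()
conn.close()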

5) Hiding Your Identity

  • Proxy IPs (Abuyun HTTP tunnel) and random User-Agent strings
  • Edit middlewares.py, then register the middlewares under DOWNLOADER_MIDDLEWARES in settings.py:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import base64
import random

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class ScrapyDoubanSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class ScrapyDoubanDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


# Proxy IP middleware
class proxy_ip(object):
    def process_request(self, request, spider):
        # Route the request through the proxy tunnel (host:port masked here)
        request.meta['proxy'] = 'aaaaaaaaaa:1234'
        # Base64-encode the proxy credentials (also masked) for Basic auth
        proxy_name_pwd = b'pppppppppp:xxxxxxxxx'
        encode_name_pwd = base64.b64encode(proxy_name_pwd)
        request.headers['Proxy-Authorization'] = 'Basic ' + encode_name_pwd.decode()


# Random User-Agent middleware
class random_user_agent(object):
    def process_request(self, request, spider):
        USER_AGENT_LIST = [
            'MSIE (MSIE 6.0; X11; Linux; i686) Opera 7.23',
            'Opera/9.20 (Macintosh; Intel Mac OS X; U; en)',
            'Opera/9.0 (Macintosh; PPC Mac OS X; U; en)',
            'iTunes/9.0.3 (Macintosh; U; Intel Mac OS X 10_6_2; en-ca)',
            'Mozilla/4.76 [en_jp] (X11; U; SunOS 5.8 sun4u)',
            'iTunes/4.2 (Macintosh; U; PPC Mac OS X 10.2)',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:5.0) Gecko/20100101 Firefox/5.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0) Gecko/20100101 Firefox/9.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20120813 Firefox/16.0',
            'Mozilla/4.77 [en] (X11; I; IRIX;64 6.5 IP30)',
            'Mozilla/4.8 [en] (X11; U; SunOS; 5.7 sun4u)'
        ]
        # Pick a random User-Agent for each request
        user_agent = random.choice(USER_AGENT_LIST)
        request.headers['User-Agent'] = user_agent
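A quick way to confirm the random User-Agent middleware is active (a sketch; httpbin.org simply echoes back the request headers, and this spider is not part of the original project):

import scrapy

class UACheckSpider(scrapy.Spider):
    name = 'ua_check'  # hypothetical verification spider
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # The echoed "User-Agent" value should differ across runs
        self.logger.info(response.text)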

III. Miscellaneous

1. Using the XPath Tool

Method 1:

  • Click the extension icon in the Chrome toolbar (above the bookmarks bar)

  • Enter the XPath expression in the QUERY input box

Method 2:

  • Press F12 to open Chrome DevTools
  • Locate the element node
  • Right-click the element node → Copy → Copy XPath
    [Figure: Copy XPath in the DevTools context menu]
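For reference, these are XPath expressions used earlier in this post. The absolute ones can be pasted directly into xpath-helper's QUERY box on the Top 250 page; the ones starting with ".//" are relative to a single li node in the spider:

//div[@class='article']//ol[@class='grid_view']//li — all movie li entries
.//div[@class='item']//em//text() — ranking number
.//span[@class='rating_num']//text() — rating
//span[@class='next']//link//@href — "next page" link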

Reprinted from: http://sznws.baihongyu.com/

Articles you may find interesting
Zhejiang University PhD student Liu Hantang: a review of classic image segmentation algorithms | talk summary
Detailed VMware Ubuntu installation walkthrough (new)
ITK 4.12 + VS2015 configuration in detail
Python image processing: the Image module
Implementing LAN file upload and download in Python
Feature 1: face detection (poor results, frequently misclassifies)
Feature 2: playing video files + camera video
Feature 3: reading camera video with face detection
Demo 1: pedestrian detection in video
Demo 2: pedestrian detection in images (HOG parameters; decent results)
Demo 3: pedestrian detection in video
Implementing a convolutional neural network (CNN) in TensorFlow
CT values
TensorFlow basics: classifying dog breeds
Multi-target tracking with YOLOv3 + Kalman filter
OpenCV optimization: four ways to traverse an image
A collection of image-processing interview questions
License plate recognition on ARM-Linux (1): plate extraction
License plate recognition on ARM-Linux (2): plate recognition
Reading the detectormorph.cpp source