Python Scraping: Scrapy Middleware and Pipeline

Scrapy provides two customizable kinds of middleware and one item processor:

Name                     Handles                       User settings
Item-Pipeline            item                          overridden
Downloader-Middleware    request / response            merged
Spider-Middleware        item / response / request     merged

Note:
"User settings" refers to the spider's custom_settings.

Yet, surprisingly, the parent class they all inherit from is just object..., so you end up checking the documentation every time.

Normally you would expect abstract methods to be provided as an interface for users to implement their own behavior; it is not clear why it was designed this way.
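As a rough illustration of what such an interface could look like, here is a minimal sketch. The class AbstractItemPipeline and its layout are hypothetical and not part of Scrapy, which simply looks these methods up by name:

from abc import ABC, abstractmethod


class AbstractItemPipeline(ABC):
    # Hypothetical base class, shown only to illustrate the idea of an explicit interface.

    @abstractmethod
    def process_item(self, item, spider):
        """Must return the item, or raise scrapy.exceptions.DropItem."""

    def open_spider(self, spider):
        """Optional hook, called when the spider is opened."""

    def close_spider(self, spider):
        """Optional hook, called when the spider is closed."""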

The following code snippets and comments briefly illustrate what the three components do.

1. Spider

baidu_spider.py

from scrapy import Spider, cmdline


class BaiduSpider(Spider):
    name = "baidu_spider"
    start_urls = [
        "https://www.baidu.com/"
    ]
    custom_settings = {
        "SPIDER_DATA": "this is spider data",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapys.mymiddleware.MyMiddleware": 100,
        },
        "ITEM_PIPELINES": {
            "scrapys.mypipeline.MyPipeline": 100,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapys.myspidermiddleware.MySpiderMiddleware": 100,
        },
    }

    def parse(self, response):
        pass


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu_spider".split())

2. Pipeline

mypipeline.py

class MyPipeline(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings from the crawler and return a Pipeline instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### pipeline get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_item(self, item, spider):
        """
        return item    -> pass the item on for further processing
        raise DropItem -> discard the item
        """
        print("### call process_item")
        return item

    def open_spider(self, spider):
        """
        Called when the spider is opened.
        """
        print("### spider open {}".format(spider.name))

    def close_spider(self, spider):
        """
        Called when the spider is closed.
        """
        print("### spider close {}".format(spider.name))
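As the docstring says, process_item must either return the item (to pass it on) or raise DropItem (to discard it). Below is a minimal, hypothetical sketch of a pipeline that drops incomplete items; the pipeline name and the "title" field are invented for illustration:

from scrapy.exceptions import DropItem


class RequireTitlePipeline(object):
    # Hypothetical pipeline: discards items that are missing a "title" field.

    def process_item(self, item, spider):
        if not item.get("title"):
            # Dropped items are not passed to any remaining pipelines.
            raise DropItem("missing title in {!r}".format(item))
        return item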

3. Downloader-Middleware

mymiddleware.py

class MyMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings from the crawler and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_request(self, request, spider):
        """
        return
            None:     continue processing the request
            Response: return this response instead of downloading
            Request:  reschedule the new request
        raise IgnoreRequest: process_exception -> Request.errback
        """
        print("### call process_request")

    def process_response(self, request, response, spider):
        """
        return
            Response: continue processing the response
            Request:  reschedule
        raise IgnoreRequest: Request.errback
        """
        print("### call process_response")
        return response

    def process_exception(self, request, exception, spider):
        """
        return
            None:     continue handling the exception
            Response: return this response
            Request:  reschedule
        """
        pass
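To illustrate the possible return values of process_request: it can short-circuit the download by returning a Response directly, or discard the request by raising IgnoreRequest. The sketch below is hypothetical; the class name, the blocked domain and the canned body are invented for the example:

from scrapy.exceptions import IgnoreRequest
from scrapy.http import HtmlResponse


class BlockAndCacheMiddleware(object):
    # Hypothetical downloader middleware, shown only to illustrate the return values.

    def process_request(self, request, spider):
        # Drop requests to a (made-up) blocked domain: process_exception / errback take over.
        if "blocked.example.com" in request.url:
            raise IgnoreRequest("domain is blocked")
        # Serve a canned response for one URL: the downloader is skipped entirely.
        if request.url == "https://www.baidu.com/robots.txt":
            return HtmlResponse(url=request.url, body=b"User-agent: *",
                                encoding="utf-8", request=request)
        # Returning None lets the request continue through the remaining middlewares.
        return None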

4. Spider-Middleware

myspidermiddleware.py

class MySpiderMiddleware(object):
    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """
        Read the spider's settings from the crawler and return a middleware instance.
        """
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### spider middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_spider_input(self, response, spider):
        """
        Called after a URL has been downloaded, before the response is handed to parse.
        return None -> continue processing the response
        raise Exception
        """
        print("### call process_spider_input")

    def process_spider_output(self, response, result, spider):
        """
        Called with the result the spider returns for a response; result must be an
        iterable containing items or Request objects (yield item / yield Request(url)).
        return: iterable of Request, dict or Item objects
        """
        print("### call process_spider_output")
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """
        return None, or an iterable of Response, dict, or Item objects
        """
        pass

    def process_start_requests(self, start_requests, spider):
        """
        Called when the spider starts, before its start requests are processed.
        return: iterable of Request objects
        """
        return start_requests
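As an example of what process_spider_output is good for, a spider middleware can filter what the spider yields before it reaches the scheduler and the pipelines. The following sketch is hypothetical; the class name and the domain rule are invented:

from scrapy import Request


class DropOffsiteLinksMiddleware(object):
    # Hypothetical spider middleware: drops yielded Requests that leave baidu.com.

    def process_spider_output(self, response, result, spider):
        for entry in result:
            # Let items (dicts / Item objects) pass through untouched.
            if not isinstance(entry, Request):
                yield entry
            # Keep only requests that stay on the allowed domain.
            elif "baidu.com" in entry.url:
                yield entry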

Run the spider and check the log:

### middleware get spider_data: this is spider data
### spider middleware get spider_data: this is spider data
### pipeline get spider_data: this is spider data
### spider open baidu_spider
### call process_request
### call process_response
### call process_spider_input
### call process_spider_output
### spider close baidu_spider
Middleware start-up order:
  1. downloader middleware
  2. spider middleware
  3. pipeline

Handler call order:
  1. spider open
  2. process_request
  3. process_response
  4. process_spider_input
  5. process_spider_output
  6. spider close

References

  1. Item Pipeline
  2. Downloader Middleware
  3. Spider Middleware
  4. Scrapy 1.5 documentation