Scrapy provides two kinds of customizable middleware plus one data processor:
Name | Role | User settings |
---|---|---|
Item Pipeline | processes items | overridden |
Downloader Middleware | processes requests/responses | merged |
Spider Middleware | processes items/responses/requests | merged |
Explanation:
"User settings" refers to the spider's `custom_settings`: an `ITEM_PIPELINES` dict defined there effectively replaces the project-level one, while the two middleware dicts are merged with Scrapy's built-in defaults.
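As a rough sketch of that merge behavior (the comments describe the effect; the built-in entries themselves live in Scrapy's `DOWNLOADER_MIDDLEWARES_BASE` / `ITEM_PIPELINES_BASE` settings):

```python
custom_settings = {
    # Merged: MyMiddleware joins the built-in downloader middleware chain
    # from DOWNLOADER_MIDDLEWARES_BASE, sorted by the integer values.
    "DOWNLOADER_MIDDLEWARES": {
        "scrapys.mymiddleware.MyMiddleware": 100,
    },
    # Overridden in effect: ITEM_PIPELINES_BASE is empty by default,
    # so this dict is the entire pipeline chain.
    "ITEM_PIPELINES": {
        "scrapys.mypipeline.MyPipeline": 100,
    },
}
```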
Strangely, though, the base class these components inherit from is just `object`…, so you have to check the documentation every time.
Normally you would expect the framework to expose abstract methods as an interface for users to implement their own behavior; I'm not sure why it was designed this way.
The code snippets and comments below briefly illustrate what the three components do.
baidu_spider.py
```python
from scrapy import Spider, cmdline


class BaiduSpider(Spider):
    name = "baidu_spider"
    start_urls = [
        "https://www.baidu.com/"
    ]
    custom_settings = {
        "SPIDER_DATA": "this is spider data",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapys.mymiddleware.MyMiddleware": 100,
        },
        "ITEM_PIPELINES": {
            "scrapys.mypipeline.MyPipeline": 100,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapys.myspidermiddleware.MySpiderMiddleware": 100,
        }
    }

    def parse(self, response):
        pass


if __name__ == '__main__':
    cmdline.execute("scrapy crawl baidu_spider".split())
```
mypipeline.py
```python
class MyPipeline(object):

    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """Read the spider's settings and return a pipeline instance."""
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### pipeline get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_item(self, item, spider):
        """
        return item: continue processing
        raise DropItem: discard the item
        """
        print("### call process_item")
        return item

    def open_spider(self, spider):
        """Called when the spider is opened."""
        print("### spider open {}".format(spider.name))

    def close_spider(self, spider):
        """Called when the spider is closed."""
        print("### spider close {}".format(spider.name))
```
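To make the `raise DropItem` branch concrete, here is a minimal sketch of a validating pipeline; the class name and the `price` field are hypothetical, not part of the example project above:

```python
from scrapy.exceptions import DropItem


class ValidatePricePipeline(object):
    """Hypothetical pipeline: discard items that lack a 'price' field."""

    def process_item(self, item, spider):
        if not item.get("price"):
            # Raising DropItem stops this item from reaching later pipelines.
            raise DropItem("missing price in {!r}".format(item))
        return item
```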
mymiddleware.py
```python
class MyMiddleware(object):

    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """Read the spider's settings and return a middleware instance."""
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_request(self, request, spider):
        """
        return None: continue processing the request
               Response: return it to the engine directly
               Request: reschedule the new request
        raise IgnoreRequest: process_exception -> Request.errback
        """
        print("### call process_request")

    def process_response(self, request, response, spider):
        """
        return Response: continue processing the response
               Request: reschedule the new request
        raise IgnoreRequest: Request.errback
        """
        print("### call process_response")
        return response

    def process_exception(self, request, exception, spider):
        """
        return None: continue handling the exception
               Response: return it as the result
               Request: reschedule the request
        """
        pass
```
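As a concrete use of `process_request`, here is a minimal sketch of a downloader middleware that rotates the User-Agent header; the class name and the agent strings are hypothetical:

```python
import random


class RandomUserAgentMiddleware(object):
    """Hypothetical middleware: pick a User-Agent for every request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Mutate the request in place; returning None lets it continue
        # through the remaining downloader middlewares.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None
```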
myspidermiddleware.py
```python
class MySpiderMiddleware(object):

    def __init__(self, spider_data):
        self.spider_data = spider_data

    @classmethod
    def from_crawler(cls, crawler):
        """Read the spider's settings and return a middleware instance."""
        spider_data = crawler.settings.get("SPIDER_DATA")
        print("### spider middleware get spider_data: {}".format(spider_data))
        return cls(spider_data)

    def process_spider_input(self, response, spider):
        """
        Runs after the download finishes, before the response is handed
        to the spider's parse callback.
        return None: continue processing the response
        raise Exception
        """
        print("### call process_spider_input")

    def process_spider_output(self, response, result, spider):
        """
        Called with the result the callback returns (result must be an
        iterable of items or Requests, i.e. whatever parse() yields).
        return: iterable of Request, dict or Item objects
        """
        print("### call process_spider_output")
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """
        return None
               or an iterable of Response, dict, or Item objects
        """
        pass

    def process_start_requests(self, start_requests, spider):
        """
        Called with the spider's start requests when the spider starts up.
        return: iterable of Request objects
        """
        return start_requests
```
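For a less pass-through `process_spider_output`, here is a minimal sketch that filters what the callback yields; the class name and the `title` field are hypothetical:

```python
class FilterShortTitleMiddleware(object):
    """Hypothetical spider middleware: drop items with a too-short title."""

    def process_spider_output(self, response, result, spider):
        for r in result:
            # Requests pass through untouched; dict items are inspected.
            if isinstance(r, dict) and len(r.get("title", "")) < 3:
                continue  # silently drop the item
            yield r
```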
Run the spider and check the log:
```
### middleware get spider_data: this is spider data
### spider middleware get spider_data: this is spider data
### pipeline get spider_data: this is spider data
### spider open baidu_spider
### call process_request
### call process_response
### call process_spider_input
### call process_spider_output
### spider close baidu_spider
```
Component startup order:

1. downloader middleware
2. spider middleware
3. item pipeline

Hook call order:

1. spider open
2. process_request
3. process_response
4. process_spider_input
5. process_spider_output
6. spider close
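One detail worth noting about the `100` values used when registering the components: in the middleware dicts, the integer decides a middleware's position in the chain. For downloader middlewares, lower numbers sit closer to the engine, so `process_request` runs in ascending order and `process_response` in descending order. A hypothetical two-entry sketch (class paths are made up):

```python
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "scrapys.mymiddleware.FirstMiddleware": 100,
        "scrapys.mymiddleware.SecondMiddleware": 200,
    },
}
# process_request:  FirstMiddleware -> SecondMiddleware -> downloader
# process_response: downloader -> SecondMiddleware -> FirstMiddleware
```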