python – How can I catch errors with scrapy so that I can do something when I get a User Timeout error?

ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.

I run into this error occasionally when using my scraper. Is there a way to catch it and run a function when it happens? I couldn't find anywhere online how to do this.

Solution:

What you can do is define an errback in your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter.

Here is some sample code you can use (for scrapy 1.0):

# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        #if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        #elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        #elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
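Since the question is about running some code when the timeout happens, the errback is also the place to react rather than just log. The sketch below is not part of the original answer: it re-issues a timed-out request one extra time, using a made-up timeout_retried meta key as a marker to avoid retrying forever, and it relies on requests yielded from an errback being scheduled just like those yielded from a callback. The method would live inside the same spider class as above:

def errback_with_retry(self, failure):
    # sketch only: 'timeout_retried' is a hypothetical meta key, not a Scrapy setting
    if failure.check(TimeoutError):
        request = failure.request
        if not request.meta.get('timeout_retried'):
            self.logger.warning('Timeout on %s, re-issuing once from errback', request.url)
            # copy the request, mark it, and bypass the duplicates filter
            yield request.replace(dont_filter=True,
                                  meta=dict(request.meta, timeout_retried=True))
        else:
            self.logger.error('Giving up on %s after one extra attempt', request.url)
    else:
        # fall back to plain logging for every other failure type
        self.logger.error(repr(failure))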

And the console output (with only 1 retry and a 5-second download timeout):

$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines:
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1748,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 12506,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
 'log_count/DEBUG': 10,
 'log_count/ERROR': 9,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)

Notice how scrapy records the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
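If you want to act on those counters programmatically, for example to warn when a run saw any timeouts, one option is to read them from the stats collector when the spider closes. A minimal sketch, assuming the standard crawler.stats API; the closed() method below would live in the same spider class, and the warning message is just an illustration:

def closed(self, reason):
    # called when the spider finishes; self.crawler.stats is the stats collector
    stats = self.crawler.stats
    timeouts = stats.get_value(
        'downloader/exception_type_count/twisted.internet.error.TimeoutError', 0)
    dns_errors = stats.get_value(
        'downloader/exception_type_count/twisted.internet.error.DNSLookupError', 0)
    if timeouts or dns_errors:
        self.logger.warning('Finished with %d timeouts and %d DNS errors',
                            timeouts, dns_errors)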
Source: https://www.icode9.com/content-1-485301.html