ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.
I occasionally run into this problem when using my scraper. Is there a way to handle it and run a function when it happens? I couldn't find anywhere online how to do this.
Solution:
What you can do is define an errback in your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as first parameter.
Here is some sample code you can use (for scrapy 1.0):
# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures;
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        # if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        # elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        # elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
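A note on the failure.check(...) calls in the errback: a Twisted Failure wraps the exception that was raised (available as failure.value), and check() returns the first matching class from its arguments, or None. As a rough plain-Python analogy of that dispatch logic (illustrative only, FakeFailure is my own stand-in, not Twisted's actual implementation):

```python
class FakeFailure:
    """Minimal stand-in for twisted.python.failure.Failure (illustration only)."""

    def __init__(self, exc):
        self.value = exc  # the wrapped exception instance

    def check(self, *exc_classes):
        # return the first class the wrapped exception is an instance of, else None
        for cls in exc_classes:
            if isinstance(self.value, cls):
                return cls
        return None


f = FakeFailure(TimeoutError("timed out"))
print(f.check(ValueError, TimeoutError))  # <class 'TimeoutError'>
print(f.check(KeyError))                  # None
```

Since check() returns the matched class (truthy) or None (falsy), it slots naturally into the if/elif chain in the errback above.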
And the console output (with only 1 retry and a 5-second download timeout):
$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines:
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1748,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 12506,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
 'log_count/DEBUG': 10,
 'log_count/ERROR': 9,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)
Notice how scrapy records the exceptions in its stats:
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
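These stats are plain key/value counters, so they are easy to post-process, for example to group exception counts by exception type. A small sketch using the values from the run above (the variable names and the prefix-stripping are my own, not part of the original answer):

```python
# a subset of the stats dict dumped at the end of the crawl above
stats = {
    'downloader/exception_count': 4,
    'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
    'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
    'downloader/response_status_count/200': 1,
    'downloader/response_status_count/404': 1,
    'downloader/response_status_count/500': 2,
}

# collect per-exception-type counters by stripping the common key prefix
prefix = 'downloader/exception_type_count/'
exceptions = {k[len(prefix):]: v for k, v in stats.items() if k.startswith(prefix)}
print(exceptions)
# {'twisted.internet.error.DNSLookupError': 2, 'twisted.internet.error.TimeoutError': 2}
```

Inside a running spider you could read the same counters through the stats collector (self.crawler.stats), e.g. in the spider's closed() method.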
Source: https://www.icode9.com/content-1-485301.html