2015-06-30 7 views

Answer

20

What you can do is define an errback on your Request instances:

errback (callable) - a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Twisted Failure instance as its first parameter.

Here is some sample code (for Scrapy 1.0) that you can use:

# -*- coding: utf-8 -*- 
# errbacks.py 
import scrapy 

# from scrapy.contrib.spidermiddleware.httperror import HttpError 
from scrapy.spidermiddlewares.httperror import HttpError 
from twisted.internet.error import DNSLookupError 
from twisted.internet.error import TimeoutError 


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        #if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        #elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        #elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And the output from running the spider in a shell (with only 1 retry and a 5s download timeout):

$ scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1 
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot) 
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11 
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'} 
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 23:45:56 [scrapy] INFO: Spider opened 
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httphttpbinbin.org/> (failed 1 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname. 
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www.httphttpbinbin.org/> (failed 2 times): DNS lookup failed: address 'www.httphttpbinbin.org' not found: [Errno -5] No address associated with hostname. 
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>> 
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www.httphttpbinbin.org/ 
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None) 
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None) 
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/ 
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>> 
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404 
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error 
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error 
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None) 
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>> 
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500 
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure. 
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure. 
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>> 
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/ 
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished) 
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 4, 
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2, 
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2, 
'downloader/request_bytes': 1748, 
'downloader/request_count': 8, 
'downloader/request_method_count/GET': 8, 
'downloader/response_bytes': 12506, 
'downloader/response_count': 4, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/404': 1, 
'downloader/response_status_count/500': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191), 
'log_count/DEBUG': 10, 
'log_count/ERROR': 9, 
'log_count/INFO': 7, 
'response_received_count': 3, 
'scheduler/dequeued': 8, 
'scheduler/dequeued/memory': 8, 
'scheduler/enqueued': 8, 
'scheduler/enqueued/memory': 8, 
'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)} 
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished) 

Note how Scrapy records the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2, 
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2, 
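
If you want to work with these counters rather than just read them in the log, one possible approach is to filter the stats dictionary by key prefix. The following is a minimal sketch using only the standard library; the helper name `exception_counts` and the `sample_stats` dictionary are made up for illustration, and the sample values are copied from the stats dump above.

```python
# Minimal sketch: tally per-type downloader exceptions from a Scrapy-style stats dict.
# `exception_counts` and `sample_stats` are illustrative names, not part of Scrapy.
PREFIX = "downloader/exception_type_count/"


def exception_counts(stats):
    """Return {exception class path: count} for every per-type exception stat."""
    return {
        key[len(PREFIX):]: value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }


# Sample values copied from the stats dump above.
sample_stats = {
    "downloader/exception_count": 4,
    "downloader/exception_type_count/twisted.internet.error.DNSLookupError": 2,
    "downloader/exception_type_count/twisted.internet.error.TimeoutError": 2,
    "downloader/response_status_count/404": 1,
}

counts = exception_counts(sample_stats)
for name, count in sorted(counts.items()):
    print(name, count)
```

Inside a running spider, the same kind of dictionary should be obtainable from `self.crawler.stats.get_stats()`, for example in the spider's `closed()` method.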

Answer

0

I prefer a custom retry middleware like this:

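# from scrapy.contrib.downloadermiddleware.retry import RetryMiddleware  # Scrapy < 1.0
from scrapy.downloadermiddlewares.retry import RetryMiddleware

from fake_useragent import FakeUserAgentError


class FakeUserAgentErrorRetryMiddleware(RetryMiddleware):

    def process_exception(self, request, exception, spider):
        # Retry only when fake_useragent failed; any other exception
        # falls through (returns None) to the remaining middlewares.
        if isinstance(exception, FakeUserAgentError):
            return self._retry(request, exception, spider)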
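
A middleware like this only takes effect once it is registered in the project settings. The fragment below is a sketch of what that might look like: the module path `myproject.middlewares` is hypothetical, and the order value `551` is an assumption rather than a recommendation, so check it against the middleware order used by your Scrapy version.

```python
# settings.py (fragment)
# "myproject.middlewares" is a hypothetical module path; use the module
# where FakeUserAgentErrorRetryMiddleware actually lives in your project.
# The order value 551 is an assumption; adjust it to fit your middleware stack.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.FakeUserAgentErrorRetryMiddleware": 551,
}
```

Because the class inherits from `RetryMiddleware`, it should honour the usual retry settings such as `RETRY_TIMES`.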