
How do I prevent a twisted.internet.error.ConnectionLost error when using Scrapy?

I am scraping some pages with Scrapy and get the following error:

twisted.internet.error.ConnectionLost

My command-line output:

2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened 
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>] 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished) 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats: 
{'downloader/exception_count': 36, 
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36, 
'downloader/request_bytes': 8121, 
'downloader/request_count': 36, 
'downloader/request_method_count/GET': 36, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377), 
'log_count/DEBUG': 38, 
'log_count/ERROR': 12, 
'log_count/INFO': 7, 
'scheduler/dequeued': 36, 
'scheduler/dequeued/memory': 36, 
'scheduler/enqueued': 36, 
'scheduler/enqueued/memory': 36, 
'start_time': datetime.datetime(2015, 5, 4, 10, 40, 32, 624695)} 
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished) 

My settings.py:

SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'

DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30

ITEM_PIPELINES = {
    'proxy.pipelines.ProxyPipeline': 100,
}

CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128

LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
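As an aside, the three-attempts-per-URL pattern visible in the log ("failed 3 times" before giving up) comes from Scrapy's retry middleware. A minimal sketch of the standard settings that govern it; the values below are illustrative defaults, not taken from the original post:

```python
# Illustrative Scrapy settings controlling the retry behaviour seen in the log.
# RETRY_ENABLED, RETRY_TIMES and DOWNLOAD_TIMEOUT are standard Scrapy settings.
RETRY_ENABLED = True
RETRY_TIMES = 2          # 2 retries + the initial attempt = "failed 3 times"
DOWNLOAD_TIMEOUT = 30    # seconds before a request is considered dead
```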

My spider, proxy_spider.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re

class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']
    indexes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    start_urls = []
    for i in indexes:
        url = 'http://www.cnproxy.com/proxy%s.html' % i
        start_urls.append(url)
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

    def parse_ip(self, response):
        sel = HtmlXPathSelector(response)
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)</td>')
        locations = sel.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)</td>')
        ports_re = re.compile(r'write\(":"(.*)\)')
        raw_ports = ports_re.findall(response.body)
        port_map = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
                    'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}
        ports = []
        for port in raw_ports:
            tmp = port
            for key in port_map:
                tmp = tmp.replace(key, port_map[key])
            ports.append(tmp)
        items = []
        for i in range(len(addresses)):
            item = ProxyItem()
            item['address'] = addresses[i]
            item['protocol'] = protocols[i]
            item['location'] = locations[i]
            item['port'] = ports[i]
            items.append(item)
        return items
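For context, the inner loop above decodes the obfuscated port strings that the site emits via JavaScript `document.write` calls: each letter stands for a digit, and `+` is a separator to be dropped. A minimal standalone sketch of that substitution, using a made-up encoded string (the real pages may use different inputs):

```python
# Decode an obfuscated port string with the same letter-to-digit map
# used in parse_ip above. 'mzkl' is a hypothetical example input.
port_map = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
            'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}

def decode_port(raw):
    # Replacement order does not matter: keys and digit values are disjoint.
    for key, digit in port_map.items():
        raw = raw.replace(key, digit)
    return raw

print(decode_port('mzkl'))      # → 4329
print(decode_port('+m+z+k+l'))  # → 4329 ('+' separators are stripped)
```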

Is anything wrong with my pipelines or settings? If not, how can I prevent the twisted.internet.error.ConnectionLost error?

I tried the scrapy shell:

$scrapy shell http://www.cnproxy.com/proxy1.html 

and got the same error as in the title. But I can visit the page with my Chrome browser. I also tried other sites, such as

$scrapy shell http://stackoverflow.com 

and they all work fine.

+1

You could try that, since this looks more Twisted-related than Scrapy-related. – eLRuLL

+0

Thanks, but what kind of problem could it be with Twisted then? I am totally new to Twisted and have no idea what to do. Any help would be appreciated! – April

Answer

5

You need to set a user-agent string. It seems that some websites don't like it, and block you, when your user agent is not a browser. You can find examples of user agent strings online.

This article identifies best practices for preventing your spider from getting blocked.

Open settings.py and add the following user agent:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'

You can also use a user-agent randomiser.
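A user-agent randomiser can be sketched as a small downloader middleware; the module path and the user-agent list below are hypothetical placeholders, and you would enable the class via `DOWNLOADER_MIDDLEWARES` in settings.py:

```python
# Sketch of a user-agent randomising downloader middleware.
# The USER_AGENTS list and module path are illustrative assumptions.
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 '
    '(KHTML, like Gecko) Version/7.0.3 Safari/537.75.14',
]

class RandomUserAgentMiddleware(object):
    """Set a random User-Agent header on every outgoing request."""

    def process_request(self, request, spider):
        # Only set the header if the request has not already chosen one.
        request.headers.setdefault('User-Agent', random.choice(USER_AGENTS))

# In settings.py (hypothetical module path):
# DOWNLOADER_MIDDLEWARES = {
#     'proxy.middlewares.RandomUserAgentMiddleware': 400,
# }
```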