How do I prevent a twisted.internet.error.ConnectionLost error when using Scrapy?
I am scraping some pages with Scrapy and get the following error:
twisted.internet.error.ConnectionLost
My command-line output:
2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished)
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 36,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36,
'downloader/request_bytes': 8121,
'downloader/request_count': 36,
'downloader/request_method_count/GET': 36,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377),
'log_count/DEBUG': 38,
'log_count/ERROR': 12,
'log_count/INFO': 7,
'scheduler/dequeued': 36,
'scheduler/dequeued/memory': 36,
'scheduler/enqueued': 36,
'scheduler/enqueued/memory': 36,
'start_time': datetime.datetime(2015, 5, 4, 10, 40, 32, 624695)}
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished)
My settings.py:
SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULE = 'proxy.spiders'
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
ITEM_PIPELINES = {
'proxy.pipelines.ProxyPipeline':100,
}
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
My spider proxy_spider.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re

class ProxycrawlerSpider(CrawlSpider):
    name = 'cnproxy'
    allowed_domains = ['www.cnproxy.com']
    indexes = [1,2,3,4,5,6,7,8,9,10]
    start_urls = []
    for i in indexes:
        url = 'http://www.cnproxy.com/proxy%s.html' % i
        start_urls.append(url)
    start_urls.append('http://www.cnproxy.com/proxyedu1.html')
    start_urls.append('http://www.cnproxy.com/proxyedu2.html')

    def parse_ip(self,response):
        sel = HtmlXPathSelector(response)
        addresses = sel.select('//tr[position()>1]/td[position()=1]').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
        protocols = sel.select('//tr[position()>1]/td[position()=2]').re(r'<td>(.*)<\/td>')
        locations = sel.select('//tr[position()>1]/td[position()=4]').re(r'<td>(.*)<\/td>')
        ports_re = re.compile(r'write\(":"(.*)\)')
        raw_ports = ports_re.findall(response.body)
        port_map = {'z':'3','m':'4','k':'2','l':'9','d':'0','b':'5','i':'7','w':'6','r':'8','c':'1','+':''}
        ports = []
        for port in raw_ports:
            tmp = port
            for key in port_map:
                tmp = tmp.replace(key,port_map[key])
            ports.append(tmp)
        items = []
        for i in range(len(addresses)):
            item = ProxyItem()
            item['address'] = addresses[i]
            item['protocol'] = protocols[i]
            item['location'] = locations[i]
            item['port'] = ports[i]
            items.append(item)
        return items
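The port decoding in parse_ip can be checked on its own, without any network access. The following is a minimal Python 3 sketch under a few assumptions: the sample markup is made up for illustration, the helper names decode_port and extract_ports are hypothetical, the regular expression is a non-greedy variant of the one in the spider, and the page body is treated as text rather than bytes.

```python
import re

# Letter-to-digit table copied from the spider; '+' is the JavaScript
# concatenation operator on the page and is simply dropped.
PORT_MAP = {'z': '3', 'm': '4', 'k': '2', 'l': '9', 'd': '0',
            'b': '5', 'i': '7', 'w': '6', 'r': '8', 'c': '1', '+': ''}

# Non-greedy variant of the spider's pattern (an assumption), so that two
# write(...) calls on the same line are captured separately.
PORTS_RE = re.compile(r'write\(":"(.*?)\)')


def decode_port(raw):
    """Translate an obfuscated port such as '+r+d+r+d' into digits."""
    # Every key is a single character, so a per-character lookup gives the
    # same result as the chained str.replace calls in the spider.
    return ''.join(PORT_MAP.get(ch, ch) for ch in raw)


def extract_ports(body):
    """Return the decoded ports found in a page body (text, not bytes)."""
    return [decode_port(raw) for raw in PORTS_RE.findall(body)]


# Hypothetical markup, only for illustration; the real page may differ.
sample = '<td>1.2.3.4<script>document.write(":"+r+d+r+d)</script></td>'
print(extract_ports(sample))
```

Keeping the decoding in a small function like this makes it possible to test the parsing separately from the download problem.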
Is anything wrong with my pipelines or settings? If not, how can I prevent the twisted.internet.error.ConnectionLost error?
I tried the scrapy shell:
$ scrapy shell http://www.cnproxy.com/proxy1.html
and got the same error as in the title. But I can visit the page with Chrome. I also tried other pages, such as
$ scrapy shell http://stackoverflow.com
They all work fine.
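Since Chrome can load the page while scrapy shell cannot, even for a single request, one difference worth ruling out is the request headers, in particular User-Agent. The sketch below uses only the Python 3 standard library to compare the two cases. That the site treats clients differently by User-Agent is an assumption, not something established above, and the helper names and the browser-like header string are illustrative.

```python
import http.client
from urllib.request import Request, urlopen

# Illustrative browser-like header value (an assumption, not taken from the site).
BROWSER_UA = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/42.0 Safari/537.36')


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    return Request(url, headers=headers)


def try_fetch(url, user_agent=None, timeout=30):
    """Fetch the URL and return a short description of the outcome."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return 'HTTP %d' % resp.status
    except (OSError, http.client.HTTPException) as exc:
        # URLError, timeouts and dropped connections all end up here.
        return 'failed: %r' % (exc,)


# Example usage (performs real network requests, so it is left commented out):
# url = 'http://www.cnproxy.com/proxy1.html'
# print('default client header:', try_fetch(url))
# print('browser-like header:  ', try_fetch(url, BROWSER_UA))
```

If the browser-like header turns out to make the difference, the corresponding Scrapy setting would be USER_AGENT in settings.py. If both requests fail in the same way, the cause lies elsewhere.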
You could try this, since it looks more related to Twisted than to Scrapy. – eLRuLL
Thanks. What kind of problem could it be with Twisted, then? I am totally new to Twisted and have no idea what to do. Any help would be appreciated! – April