Error with links in Scrapy
I want to scrape some links from a news website and fetch the full articles. However, the links are relative.
The news website is http://www.puntal.com.ar/v2/
and the links look like this:
<div class="article-title">
<a href="/v2/article.php?id=187222">Barros Schelotto: "No somos River y vamos a tratar de pasar a la final"</a>
</div>
so the relative link is "/v2/article.php?id=187222".
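For reference, a relative href like this one is resolved against the URL of the page it was found on. A minimal sketch using the standard library's `urljoin` (the variable names `page_url`, `relative_href` and `absolute_url` are illustrative only):

```python
try:
    from urllib.parse import urljoin  # Python 3.x
except ImportError:
    from urlparse import urljoin  # Python 2.7

# URL of the page that contains the link
page_url = "http://www.puntal.com.ar/v2/"
# href value exactly as it appears in the HTML
relative_href = "/v2/article.php?id=187222"

# A leading "/" means the path is resolved from the site root,
# so only the scheme and host of page_url are reused.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

Inside a Scrapy callback, `response.urljoin(href)` does the same thing with `response.url` as the base, so the host does not have to be hard-coded.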
My spider is as follows (edit):
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
#from urlparse import urljoin
from scrapy.http.request import Request

try:
    from urllib.parse import urljoin  # Python 3.x
except ImportError:
    from urlparse import urljoin  # Python 2.7

from puntalcomar.items import PuntalcomarItem


class PuntalComArSpider(CrawlSpider):
    name = 'puntal.com.ar'
    allowed_domains = ['http://www.puntal.com.ar/v2/']
    start_urls = ['http://www.puntal.com.ar/v2/']

    rules = (
        Rule(LinkExtractor(allow=(''),), callback="parse", follow=True),
    )

    def parse_url(self, response):
        # Collect the relative article links from the front page
        hxs = Selector(response)
        urls = hxs.xpath('//div[@class="article-title"]/a/@href').extract()
        print 'enlace relativo ', urls
        for url in urls:
            # Turn each relative link into an absolute one
            urlfull = urljoin('http://www.puntal.com.ar', url)
            print 'enlace completo ', urlfull
            yield Request(urlfull, callback=self.parse_item)

    def parse_item(self, response):
        # Extract the fields of a single article page
        hxs = Selector(response)
        dates = hxs.xpath('//span[@class="date"]')
        title = hxs.xpath('//div[@class="title"]')
        subheader = hxs.xpath('//div[@class="subheader"]')
        body = hxs.xpath('//div[@class="body"]/p')
        items = []
        for date in dates:
            item = PuntalcomarItem()
            item["date"] = date.xpath('text()').extract()
            item["title"] = title.xpath("text()").extract()
            item["subheader"] = subheader.xpath('text()').extract()
            item["body"] = body.xpath("text()").extract()
            items.append(item)
        return items
But it doesn't work.
I'm using Linux Mint with Python 2.7.6.
Shell output:
$ scrapy crawl puntal.com.ar
2016-07-10 13:39:15 [scrapy] INFO: Scrapy 1.1.0 started (bot: puntalcomar)
2016-07-10 13:39:15 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'puntalcomar.spiders', 'SPIDER_MODULES': ['puntalcomar.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'puntalcomar'}
2016-07-10 13:39:15 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-07-10 13:39:15 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-07-10 13:39:15 [scrapy] INFO: Enabled item pipelines:
['puntalcomar.pipelines.XmlExportPipeline']
2016-07-10 13:39:15 [scrapy] INFO: Spider opened
2016-07-10 13:39:15 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (404) <GET http://www.puntal.com.ar/robots.txt> (referer: None)
2016-07-10 13:39:15 [scrapy] DEBUG: Redirecting (301) to <GET http://www.puntal.com.ar/v2/> from <GET http://www.puntal.com.ar/v2>
2016-07-10 13:39:15 [scrapy] DEBUG: Crawled (200) <GET http://www.puntal.com.ar/v2/> (referer: None)
enlace relativo [u'/v2/article.php?id=187334', u'/v2/article.php?id=187324', u'/v2/article.php?id=187321', u'/v2/article.php?id=187316', u'/v2/article.php?id=187335', u'/v2/article.php?id=187308', u'/v2/article.php?id=187314', u'/v2/article.php?id=187315', u'/v2/article.php?id=187317', u'/v2/article.php?id=187319', u'/v2/article.php?id=187310', u'/v2/article.php?id=187298', u'/v2/article.php?id=187300', u'/v2/article.php?id=187299', u'/v2/article.php?id=187306', u'/v2/article.php?id=187305']
enlace completo http://www.puntal.com.ar/v2/article.php?id=187334
2016-07-10 13:39:15 [scrapy] DEBUG: Filtered offsite request to 'www.puntal.com.ar': <GET http://www.puntal.com.ar/v2/article.php?id=187334>
enlace completo http://www.puntal.com.ar/v2/article.php?id=187324
enlace completo http://www.puntal.com.ar/v2/article.php?id=187321
enlace completo http://www.puntal.com.ar/v2/article.php?id=187316
enlace completo http://www.puntal.com.ar/v2/article.php?id=187335
enlace completo http://www.puntal.com.ar/v2/article.php?id=187308
enlace completo http://www.puntal.com.ar/v2/article.php?id=187314
enlace completo http://www.puntal.com.ar/v2/article.php?id=187315
enlace completo http://www.puntal.com.ar/v2/article.php?id=187317
enlace completo http://www.puntal.com.ar/v2/article.php?id=187319
enlace completo http://www.puntal.com.ar/v2/article.php?id=187310
enlace completo http://www.puntal.com.ar/v2/article.php?id=187298
enlace completo http://www.puntal.com.ar/v2/article.php?id=187300
enlace completo http://www.puntal.com.ar/v2/article.php?id=187299
enlace completo http://www.puntal.com.ar/v2/article.php?id=187306
enlace completo http://www.puntal.com.ar/v2/article.php?id=187305
2016-07-10 13:39:15 [scrapy] INFO: Closing spider (finished)
2016-07-10 13:39:15 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 50497,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 726952),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 16,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 7, 10, 16, 39, 15, 121104)}
2016-07-10 13:39:15 [scrapy] INFO: Spider closed (finished)
I tried the absolute links and they are correct. I really can't see what is wrong.
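The log line `Filtered offsite request to 'www.puntal.com.ar'` and the stat `'offsite/filtered': 16` suggest that the article requests are being dropped by the offsite filter before they are downloaded. `allowed_domains` is meant to hold bare domain names (for example `'puntal.com.ar'`), not full URLs. The sketch below is a simplified approximation of that kind of host check, not Scrapy's own code; `is_onsite` is a hypothetical helper written only for illustration:

```python
try:
    from urllib.parse import urlparse  # Python 3.x
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_onsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one (an approximation of an offsite check)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


article = "http://www.puntal.com.ar/v2/article.php?id=187334"

# A full URL in allowed_domains can never match a host name
with_url_entry = is_onsite(article, ["http://www.puntal.com.ar/v2/"])
# A bare domain matches the host and its subdomains
with_bare_domain = is_onsite(article, ["puntal.com.ar"])

print(with_url_entry)
print(with_bare_domain)
```

Under that assumption, changing the spider attribute to `allowed_domains = ['puntal.com.ar']` would be the first thing to try. The start URL is still fetched with the current setting, which would explain why only the front page shows up as crawled.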
Thanks, friend, that was a mistake. But I have tried it now and it still doesn't work – dedio
@dedio what are the symptoms, any errors? – alecxe
Nothing is returned – dedio