2016-03-29 7 views
1

Python: search an HTML file, grabbing <a> tags with their href and text content

I need help with a solution for searching an HTML file with Python 3 and retrieving all <a> links on the page, then adding each link's text to a dictionary together with its adjacent href (URL).

This is what I have tried so far:

import urllib3 
import re 

http = urllib3.PoolManager() 
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL" 
a = http.request("GET",my_url) 
html = a.data 

links = re.finditer(' href="?([^\s^"]+)', html) 

for link in links: 
    print(link) 

I get this error:

TypeError: can't use a string pattern on a bytes-like object 

Thank you for your help.

I have also tried lxml:

links = lxml.html.parse("http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws").xpath("//a/@href") 
for link in links: 
    print(link) 

The result does not show all the links, and I am not sure why.

UPDATE:

New code =>

def news_feed(self, stock): 
    http = urllib3.PoolManager() 
    my_url = "https://in.finance.yahoo.com/q/h?s="+stock 
    a = http.request("GET",my_url) 
    html = a.data.decode('utf-8') 
    xml = fromstring(html, HTMLParser()) 
    # Only the anchors inside the news table are needed. 
    a_tags = xml.xpath("//table[@id='yfncsumtab']//a") 
    self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags) 
    pp(self.paired) 
+0

'html = a.data.decode('utf-8')' –

+1

You can parse this manually (with James's suggestion), but I really think you should use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – Bahrom

+0

Awesome, that works. Should have known that! –

Answer

4

Use an HTML parser and decode the bytes, as suggested. BeautifulSoup makes the job very easy and is much more reliable than a regex for parsing HTML:

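import urllib3 
from bs4 import BeautifulSoup 

http = urllib3.PoolManager() 
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL" 
a = http.request("GET", my_url) 
# a.data is bytes; decode it to str before parsing. 
html = a.data.decode("utf-8") 

# Collect the href of every <a> tag that has one. 
print([a["href"] for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)]) 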
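The decode step matters because urllib3 returns the response body as bytes, and a str regex pattern cannot be applied to bytes. Below is a minimal offline sketch of that behaviour. The `raw` value is a made-up stand-in for `a.data`, and the pattern is a simplified version of the one in the question, so treat it as an illustration rather than a recommended way to extract links.

```python
import re

# Hypothetical stand-in for a.data: urllib3 returns the body as bytes.
raw = b'<a href="https://example.com/one">One</a> <a href="/two">Two</a>'

pattern = r'href="([^"]+)"'

error_message = None
try:
    re.findall(pattern, raw)  # str pattern applied to bytes raises TypeError
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode the bytes to str first, then match with the str pattern.
links_from_text = re.findall(pattern, raw.decode("utf-8"))

# Option 2: keep the data as bytes and use a bytes pattern instead.
links_from_bytes = re.findall(pattern.encode("ascii"), raw)

print(error_message)
print(links_from_text)
print(links_from_bytes)
```

Option 1 is what the answer does. Option 2 also avoids the error, but the matches are then bytes objects rather than strings.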

If you only want the links that start with http, you can use a CSS selector:

soup = BeautifulSoup(html, "html.parser") 

print([a["href"] for a in soup.select("a[href^=http]")]) 

Which will give you:

['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/'] 

To get the text and the href:

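from pprint import pprint as pp 

soup = BeautifulSoup(html, "html.parser") 

a_tags = soup.select("a[href^=http]") 
# Map each link's text to its href; a repeated link text overwrites the earlier entry. 
paired = dict((a.text, a["href"]) for a in a_tags) 

pp(paired) 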

Output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
u'Capital IQ': 'http://www.capitaliq.com', 
u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com', 
u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/', 
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/', 
u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 
u'Markets': 'https://in.finance.yahoo.com/investing/', 
u'Morningstar, Inc.': 'http://www.morningstar.com/', 
u'My Yahoo': 'http://in.my.yahoo.com', 
u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 
u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 
u'Yahoo': 'https://in.yahoo.com/', 
u'Yahoo India Finance': 'https://in.finance.yahoo.com', 
u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 
u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'} 

The a[href^=http] selector means: give me all the a tags that have an href whose value starts with http.
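To make the selector's meaning concrete without network access, here is a small sketch that applies the same rule (an <a> tag with an href starting with http) using only the standard library's html.parser instead of BeautifulSoup. The sample markup and the class name LinkCollector are invented for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect {link text: href} for <a> tags whose href starts with a prefix."""

    def __init__(self, prefix="http"):
        super().__init__()
        self.prefix = prefix
        self.paired = {}
        self._href = None  # href of the matching <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Same condition as a[href^=http]: href exists and has the prefix.
            if href and href.startswith(self.prefix):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.paired["".join(self._text).strip()] = self._href
            self._href = None


# Made-up markup: two matching links, one relative link, one <a> without href.
sample = """
<a href="https://example.com/news/1">First story</a>
<a href="/q/h?s=AAPL">Relative link</a>
<a name="top">No href at all</a>
<a href="http://example.org/">Plain <b>http</b> link</a>
"""

collector = LinkCollector()
collector.feed(sample)
print(collector.paired)
```

The relative link and the anchor without an href are skipped, which is the same filtering the CSS selector performs.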

Using lxml and the table id to get only the story links, which are probably what you are most interested in:

from pprint import pprint as pp 

from lxml.etree import fromstring, HTMLParser 

# html is the decoded page source from the first snippet. 
xml = fromstring(html, HTMLParser()) 

# Restrict the search to anchors inside the news table. 
a_tags = xml.xpath("//table[@id='yfncsumtab']//a") 

paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags) 
pp(paired) 

Which gives you:

{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30', 
'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'} 

We can do the same with select:

from pprint import pprint as pp 

# html is the decoded page source from the first snippet. 
soup = BeautifulSoup(html, "html.parser") 

a_tags = soup.select("#yfncsumtab a") 
paired = dict((a.text, a["href"]) for a in a_tags) 
pp(paired) 

which matches our lxml output:

{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30', 
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'} 
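Note that the 'Older Headlines' entry in the output above is a relative URL. If absolute URLs are wanted, one possible approach is urllib.parse.urljoin from the standard library. The sketch below assumes the page URL from the question as the base and uses a shortened copy of the dictionary, so it runs without fetching anything.

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from is used as the base URL.
base_url = "https://in.finance.yahoo.com/q/h?s=AAPL"

# Shortened copy of the output above, for illustration only.
paired = {
    "Older Headlines": "/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30",
    "Samsung wins appeal in patent dispute with Apple":
        "https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html",
}

# urljoin leaves absolute URLs untouched and resolves relative ones against base_url.
absolute = {text: urljoin(base_url, href) for text, href in paired.items()}

for text, href in absolute.items():
    print(text, "->", href)
```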

You could also use just //*[@id='yfncsumtab']//a, as ids should be unique.

To get the first six links from the table with an XPath, we can use the ul elements and extract the first six with ul[position() < 7]:

a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a") 

paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags) 
from pprint import pprint as pp 
pp(paired) 

which gives you:

{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'} 

For small tables, you could also just slice.
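As a sketch of the slicing idea: if the links are first collected into a list in document order, a plain slice keeps the first six before the dictionary is built. The `links` list below is hypothetical placeholder data standing in for the (text, href) pairs extracted from the table. Slicing the list of anchors is not necessarily identical to selecting the first six ul elements, since that depends on how many links each ul contains.

```python
# Hypothetical (text, href) pairs in document order, standing in for scraped links.
links = [("Story %d" % i, "https://example.com/news/%d" % i) for i in range(1, 11)]

# Keep only the first six pairs, then build the dictionary.
first_six = dict(links[:6])

for text, href in first_six.items():
    print(text, href)
```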

+1

I made the mistake of *not* using BeautifulSoup in one of my projects – Bahrom

+0

@BAH, it makes life very easy. Even if you knew almost nothing about HTML, with a quick read of the docs you would be up and running in no time –

+0

I tried to use BeautifulSoup but couldn't get it working in Python 3. Any suggestions? –