2016-08-03 21 views

I have a problem that is driving me absolutely crazy. I'm new to web scraping, and I'm practicing by trying to scrape the contents of a forum thread, namely the actual posts people have made. I've isolated what I think contains the post text, which is the div with id="post_message_2793649" (see Screenshot_1).

The example above is just one of many posts. Each post has its own unique identification number, but the rest is consistent, like div id="post_message_.

Here is where I'm currently stuck:

import requests 
from bs4 import BeautifulSoup 
import lxml 

r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one- billion-2016-a-120.html') 

soup = BeautifulSoup(r.content, 'lxml') 

data = soup.find_all("td", {"class": "alt1"}) 

for link in data: 
    print(link.find_all('div', {'id': 'post_message'})) 

The code above produces only a couple of empty lists as it goes down the page, which is so frustrating. (See Screenshot_2 for the code I ran with its output next to it.) What am I missing?

The end result I'm looking for is simply the entire content of what people have said, in one long string, without any HTML clutter.

I'm running Beautiful Soup 4 with the lxml parser.

Answer


You have a couple of problems, the first being multiple spaces in the URL, so you are not going to the page you think you are:

In [49]: import requests 

In [50]: r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one- billion-2016-a-120.html') 

In [51]: r.url # with spaces 
Out[51]: 'http://www.catforum.com/forum/' 

In [52]: r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html') 

In [53]: r.url # without spaces 
Out[53]: 'http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html' 

The next problem is that the ids only begin with post_message; none is equal to post_message exactly. You can use a CSS selector that matches ids starting with post_message, which will pull all the divs you want, then simply extract the text:

r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html') 

soup = BeautifulSoup(r.text, 'lxml') 


for div in soup.select('[id^=post_message]'): 
    print(div.get_text("\n", strip=True)) 

which gives you:

11301 
Did you get the cortisone shots? Will they have to remove it? 
My Dad and stepmom got a new Jack Russell! Her name's Daisy. She's 2 years old, and she's a rescue(d) dog. She was rescued from an abusive situation. She can't stand noise, and WILL NOT allow herself to be picked up. They're working on that. Add to that the high-strung, hyper nature of a Jack Russell... But they love her. When I called last night, Pat was trying to teach her 'sit'! 
11302 
Well, I tidied, cleaned, and shopped. Rest of the list isn't done and I'm too tired and way too hot to care right now. 
Miss Luna is howling outside the Space Kitten's room because I let her out and gave them their noms. SHE likes to gobble their food.....little oink. 
11303 
Daisy sounds like she has found a perfect new home and will realize it once she feels safe. 
11304 
No, Kurt, I haven't gotten the cortisone shot yet. They want me to rest it for three weeks first to see if that helps. Then they would try a shot and remove it if the shot doesn't work. It might feel a smidge better today but not much. 
So have you met Daisy in person yet? She sounds like a sweetie. 
And Carrie, Amelia is a piggie too. She eats the dog food if I don't watch her carefully! 
11305 
I had a sore neck yesterday morning after turning it too quickly. Applied heat....took an anti-inflammatory last night. Thought I'd wake up feeling better....nope....still hurts. Grrrrrrrr. 
11306 
MM- Thanks for your welcome to the COUNTING thread. Would have been better if I remembered to COUNT. I've been a long time lurker on the thread but happy now to get involved in the chat. 
Hope your neck is feeling better. Lily and Lola are reminding me to say 'hello' from them too. 
11307 
Welcome back anniegirl and Lily and Lola! We didn't scare you away! Yeah! 
Nightmare afternoon. My SIL was in a car accident and he car pools with my daughter. So, in rush hour, I have to drive an hour into Vancouver to get them (I hate rush hour traffic....really hate it). Then an hour back to their place.....then another half hour to get home. Not good for the neck or the nerves (I really hate toll bridges and driving in Vancouver and did I mention rush hour traffic). At least he is unharmed. Things we do for love of our children! 
11308. Hi annegirl! None of us can count either - you'll fit right in. 
MM, yikes how scary. Glad he's ok, but that can't have been fun having to do all that driving, especially with an achy neck. 
I note that it's the teachers on this thread whose bodies promptly went down...coincidentally once the school year was over... 
DebS, how on earth are you supposed to rest your foot for 3 weeks, short of lying in bed and not moving? 
MM, how is your shoulder doing? And I missed the whole goodbye to Pyro. 
Gah, I hope it slowly gets easier over time as you remember that they're going to families who will love them. 
I'm finally not constantly hungry, just nearly constantly. 
My weight had gone under 100 lbs 
so I have quite a bit of catching up to do. Because of the partial obstruction I had after the surgery, the doctor told me to try to stay on a full liquid diet for a week. I actually told him no, that I was hungry, lol. So he told me to just be careful. I have been, mostly (bacon has entered the picture 3 times in the last 3 days 
) and the week expired today, so I'm off to the races. 
11309 
Welcome to you, annegirl, along with Lily and Lola! We always love having new friends on our counting thread. 
And Spirite, good to hear from you and I'm glad you are onto solid foods. 
11310 
DebS and Spirite thank you too for the Welcome. Oh MM what an ordeal with your daughter but glad everyone us on. 
DevS - hope your foot is improving Its so horrible to be in pain. 
Spirite - go wild on the bacon and whatever else you fancy. I'm making a chocolate orange cheese cake to bring to a dinner party this afternoon. It has so much marscapone in it you put on weight just looking at it. 

If you wanted to use find_all, you would need to use a regular expression:

import re 
r = requests.get('http://www.catforum.com/forum/43-forum-fun/350938-count-one-billion-2016-a-120.html') 
soup = BeautifulSoup(r.text, 'lxml') 
for div in soup.find_all(id=re.compile("^post_message")): 
    print(div.get_text("\n", strip=True)) 

The result will be the same.
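The question asked for all the post content in one long string rather than printed post by post. A minimal sketch of joining the extracted text, using a small inline HTML sample in place of the live forum page (the post ids here are made up):

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the forum page; the ids are hypothetical.
html = """
<div id="post_message_1">First post</div>
<div id="post_message_2">Second <b>post</b></div>
<div id="sidebar">ignore me</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Join the text of every matching div into one long string.
all_text = "\n".join(
    div.get_text(" ", strip=True)
    for div in soup.select("[id^=post_message]")
)
print(all_text)
```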


How can you put each individual post on its own row in a dataframe? – OptimusPrime
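For the dataframe question in the comment, one possible approach (assuming pandas is available; the HTML sample and ids are made up) is to collect the post texts into a list and build a one-column frame from it:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline sample standing in for the forum page; the ids are hypothetical.
html = """
<div id="post_message_100">Hello</div>
<div id="post_message_101">World</div>
"""

soup = BeautifulSoup(html, "html.parser")

# One list entry per post, then one DataFrame row per post.
posts = [div.get_text(" ", strip=True)
         for div in soup.select("[id^=post_message]")]
df = pd.DataFrame({"post": posts})
print(df)
```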


There is nothing with the exact id post_message, so link.find_all returns an empty list. You want to first retrieve all the ids in all the divs, then filter that list of ids with a regex (for example) to keep only those that start with post_message_ followed by a number. Then you can do:

for message_id in message_ids: 
    print(soup.find_all('div', {'id': message_id}))
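A complete version of the approach this answer describes might look like the following sketch (the inline HTML sample and ids are made up):

```python
import re
from bs4 import BeautifulSoup

# Inline sample standing in for the forum page; the ids are hypothetical.
html = """
<div id="post_message_2793649">A post</div>
<div id="navbar">nav</div>
<div id="post_message_2793650">Another post</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every div id, then keep only post_message_<number> ids.
pattern = re.compile(r"^post_message_\d+$")
message_ids = [div["id"] for div in soup.find_all("div", id=True)
               if pattern.match(div["id"])]

# Look each filtered id back up and print its text.
for message_id in message_ids:
    for div in soup.find_all("div", {"id": message_id}):
        print(div.get_text(strip=True))
```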