2016-08-09 40 views
0

Trying to scrape text from HTML that has no distinctive tags except br (Python 3)

I have written a scraping program for my company's websites, but I have run into a problem: I need to scrape text out of the HTML below, and I am having trouble getting the data I need.

HTML CODE

<div> 
    <table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;"> 
     <tr> 
      <th scope="col">History</th> 
     </tr><tr> 
      <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td> 
     </tr><tr> 
      <td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:42:52 | By: jakubkwasny | Status: Resolved</td> 
     </tr><tr> 
      <td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Hardware Failure<br />Action Completed: Power supply/filter/cable swap<br /><br />Arrival Time: 02/01/2014 15:54:17<br />Leaving Time: 02/01/2014 16:27:44<br />Was the job successful: Yes<br /><br /><br />Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now.<br />Next Action required:none<br />Added by jakubkwasny at 02/01/2014 21:41:40<br /><br />Pinging 99.99.99.99 with 32 bytes of data:<br />Reply from 99.99.99.99: bytes=32 time=67ms TTL=240<br />Reply from 99.999.999.99: bytes=32 time=92ms TTL=240<br />Reply from 99.99.65.65: bytes=32 time=76ms TTL=240<br />Reply from 67.45.32.12: bytes=32 time=82ms TTL=240<br /><br />Ping statistics for 12.12.12.12:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 67ms, Maximum = 92ms, Average = 79ms</td> 
     </tr><tr> 
      <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td> 

I need to be able to scrape the data between the br tags, such as the data in the third td tag. I have managed to scrape all of the data from the table, but I can't figure out how to get specific rows and then the text between the br tags.

Code snippet

bsobjswap = BeautifulSoup(r2.content) 
print (bsobjswap.find('table',{'id':'ctl00_cpMainContent_gvNodes'}).find("style",{"color":"Black"})) 

This is my latest attempt, but it doesn't work. Any help is appreciated.

More data

<div id="ctl00_cpMainContent_upNodes"> 

    <div> 
    <table class="style3" cellspacing="0" rules="all" border="1" id="ctl00_cpMainContent_gvNodes" style="border-color:White;border-style:None;width:1090px;border-collapse:collapse;"> 
     <tr> 
      <th scope="col">History</th> 
     </tr><tr> 
      <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td> 
     </tr><tr> 
      <td style="color:White;background-color:Blue;border-color:Black;border-style:Inset;font-size:12pt;font-weight:normal;">date updated: 02/01/2014 21:21:16 | By: jakubkwasny | Status: Resolved</td> 
     </tr><tr> 
      <td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;"><br />Root Cause: Core/Authentication issue<br />Action Completed: No site visit required<br /><br />Hi Chris,<br /><br />There were no faults detected. As installation have been done recently, Lancom uses 2.05 configuration script. Our engineer was unable to see landing page, he was getting connected to the Internet with. I contacted Picopoint who informed me that this is due the fact that their system remembers MAC addresses of the devices that were logged into the system hence no landing page is needed. It have been confirmed by removing MAC addresses of the engineer's devices from the database. By doing so engineer was able to access the landing page again. Picopoint's engineer checked the configuration of the devices at both ends and haven't detected any problems. At the moment we are unable to state what are the issues with venue as we haven't experienced any. <br /><br />Arrival Time: 02/01/2014 16:19:23<br />Leaving Time: 02/01/2014 17:51:18<br />Was the job successful: Yes<br /><br /><br />Notes:Still physically missing lines 3 and 4. See screen shot.<br />Line 6 has a dial tone BUT no dsl is present on line.<br />Still getting some landing page errors.. 
My laptop now seems to work but my android phone justs connects to google with no landing page .<br /><br />Screen shots included but couldnt access youtube (was recieveing an block ID error)<br />ASDA resriction ?<br /><br />Picopoint still looking into problem according to Jakub<br /><br />Next Action required:Ask Jakub<br />Added by jakubkwasny at 02/01/2014 21:10:12<br /><br />Pinging 11.11.11.11 with 32 bytes of data:<br />Reply from 11.11.11.11: bytes=32 time=47ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=38ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=39ms TTL=50<br />Reply from 11.11.11.11: bytes=32 time=41ms TTL=50<br /><br />Ping statistics for 11.11.11.11:<br />Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),<br />Approximate round trip times in milli-seconds:<br />Minimum = 38ms, Maximum = 47ms, Average = 41ms</td> 
     </tr><tr> 
      <td style="color:White;background-color:White;font-size:11pt;font-weight:bold;"> </td> 

My code visits thousands of pages, and every table it finds follows the same pattern, so I suspect I will always need the data from the third td tag, but I'm not sure how to get it.

Cheers

+0

Can you provide a larger sample set so we can see the differences between pages? HTML parsing is pretty much pattern guessing, so more data makes it easier to nail down something that holds across all of them. –

+0

Added another snippet – ipmev12

+0

What exactly do you want to get? –

Answer

0

How about this:

from bs4 import BeautifulSoup 

html = """(your html from the example above)""" 

soup = BeautifulSoup(html, 'html.parser') 

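# Locate the cell by its exact inline style string (find returns the first match only)
row_data = soup.find('td', {'style':'color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;'}) 

# Strip the opening and closing td tags from the serialized cell
clean_data = str(row_data).replace('<td style="color:Black;background-color:LightSkyBlue;border-color:LightSkyBlue;font-size:12pt;font-weight:normal;">','')\ 
    .replace('</td>','') 

# BeautifulSoup serializes <br /> as <br/>; split on it and drop the empty pieces
print('\n'.join([x for x in clean_data.split('<br/>') if x != ''])) 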

""" 
Generated output: 

Root Cause: Hardware Failure 
Action Completed: Power supply/filter/cable swap 
Arrival Time: 02/01/2014 15:54:17 
Leaving Time: 02/01/2014 16:27:44 
Was the job successful: Yes 
Notes:replaced dsl cable and filter. Also rebooted all equipment. All working fine now. 
Next Action required:none 
Added by jakubkwasny at 02/01/2014 21:41:40 
Pinging 99.99.99.99 with 32 bytes of data: 
Reply from 99.99.99.99: bytes=32 time=67ms TTL=240 
Reply from 99.999.999.99: bytes=32 time=92ms TTL=240 
Reply from 99.99.65.65: bytes=32 time=76ms TTL=240 
Reply from 67.45.32.12: bytes=32 time=82ms TTL=240 
Ping statistics for 12.12.12.12: 
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), 
Approximate round trip times in milli-seconds: 
Minimum = 67ms, Maximum = 92ms, Average = 79ms 
""" 
+0

Sweet as a nut, cheers mate – ipmev12

+0

Great, glad I could help – vadimhmyrov