ANTLR4 Python parsing large files

I am trying to write a parser for Juniper/SRX router access control lists. Below is the grammar I am using:

grammar SRXBackend; 

acl: 
    'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}' 
      applications 
      addressBook 
; 

replaceStmt: 
    'replace:' IDENT 
| 'replace:' 'from-zone' IDENT 'to-zone' IDENT 
; 

policy: 
    'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}' 
; 

fromStmt: 
    'source-address' addrBlock      # sourceAddrStmt 
| 'destination-address' addrBlock    # destinationAddrStmt 
| 'application' (srxName ';' | '[' srxName+ ']') # applicationBlock 
; 

action: 
    'permit' ';' 
| 'deny' ';' 
| 'log { session-close; }' 
; 

addrBlock: 
    '[' srxName+ ']' 
| srxName ';' 
; 

applications: 
    'applications' '{' application* '}' 
| 'applications' '{' 'apply-groups' IDENT ';' '}' 'groups' '{' replaceStmt '{' 'applications' '{' application* '}' '}' '}' 
; 

addressBook: 
    'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}' 
| 'groups' '{' replaceStmt '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}' 'security' '{' 'apply-groups' IDENT ';' '}' 
; 

application: 
    'replace:'? 'application' srxName '{' applicationStmt+ '}' 
; 

applicationStmt: 
    'protocol' srxName ';'   #applicationProtocol 
| 'source-port' portRange ';'  #applicationSrcPort 
| 'destination-port' portRange ';' #applicationDstPort 
; 

portRange: 
    NUMBER    #portRangeOne 
| NUMBER '-' NUMBER #portRangeMinMax 
; 

addrEntry: 
    'address-set' IDENT '{' addrEntryStmt+ '}' #addrEntrySet 
| 'address' srxName cidr ';'     #addrEntrySingle 
; 

addrEntryStmt: 
    ('address-set' | 'address') srxName ';' 
; 

cidr: 
    NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)? 
; 

srxName: 
    NUMBER 
| IDENT 
| cidr 
; 

COMMENT : '/*' .*? '*/' ; 
NUMBER : [0-9]+ ; 
IDENT : [a-zA-Z][a-zA-Z0-9,\-_:\./]* ; 
WS  : [ \t\n]+ -> skip ;

When I try to parse an ACL with ~80,000 lines, it takes up to ~10 minutes to generate the parse tree. I am using the following code to create the parse tree:

from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys


def main(argv):
    # Load the whole ACL file into an ANTLR character stream
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    # Parse starting at the top-level 'acl' rule
    parser.acl()


if __name__ == '__main__':
    main(sys.argv)

I am using Python 2.7 as the target language. I also ran cProfile to find out which code takes the most time. Below are the first few records, sorted by time:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   608448   62.699    0.000  272.359    0.000 LexerATNSimulator.py:152(execATN)
  5007036   41.253    0.000   71.458    0.000 LexerATNSimulator.py:570(consume)
  5615722   32.048    0.000   70.416    0.000 DFAState.py:131(__eq__)
 11230968   24.709    0.000   24.709    0.000 InputStream.py:73(LA)
  5006814   21.881    0.000   31.058    0.000 LexerATNSimulator.py:486(captureSimState)
  5007274   20.497    0.000   29.349    0.000 ATNConfigSet.py:160(__eq__)
 10191162   18.313    0.000   18.313    0.000 {isinstance}
 10019610   16.588    0.000   16.588    0.000 {ord}
  5615484   13.331    0.000   13.331    0.000 LexerATNSimulator.py:221(getExistingTargetState)
  6832160   12.651    0.000   12.651    0.000 InputStream.py:52(index)
  5007036   10.593    0.000   10.593    0.000 InputStream.py:67(consume)
   449433    9.442    0.000  319.463    0.001 Lexer.py:125(nextToken)
        1    8.834    8.834   16.930   16.930 InputStream.py:47(_loadString)
   608448    8.220    0.000  285.163    0.000 LexerATNSimulator.py:108(match)
  1510237    6.841    0.000   10.895    0.000 CommonTokenStream.py:84(LT)
   449432    6.044    0.000  363.766    0.001 Parser.py:344(consume)
   449433    5.801    0.000    9.933    0.000 Token.py:105(__init__)

I can't really make much sense of it, except that InputStream.LA takes about half a minute. I guess this is because the entire text string is buffered/loaded at once. Is there an alternative/simpler way to parse or load data for the Python target? Is there any improvement I can make to the grammar to make parsing faster?

Thanks
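For anyone who wants to collect a comparable table, here is a minimal sketch of profiling a single call with the standard-library cProfile and pstats modules. The helper name profile_call and the workload busy_work are hypothetical stand-ins introduced only for illustration. In the setup from the question, the profiled callable would be the function that runs the parser.

```python
import cProfile
import pstats

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, report sorted by internal time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "time" is the pstats sort key for internal time (the tottime column)
    stats.sort_stats("time").print_stats(20)
    return result, buf.getvalue()


def busy_work(n):
    # Hypothetical stand-in workload, not part of the original post
    return sum(ord(c) for c in "x" * n)


if __name__ == "__main__":
    _, report = profile_call(busy_work, 100000)
    print(report)
```

Sorting by internal time puts functions that are expensive in their own right at the top, while the cumtime column shows how much time is spent in a function together with everything it calls.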

2016-03-10 prthrokz

This is not an answer, but have you tried using PyPy or something else? Just to find out how much of the load is due to Python? – Divisadero

I haven't used PyPy, but I have done some more research since yesterday. It seems that the ANTLR input stream class converts the entire text input, character by character, into a byte buffer. That takes up to a minute. Is there a faster way to do this? I'm sure I can override the input stream implementation as long as I can find a better way to do it. – prthrokz

@prthrokz, I would advise you to try the "old" ANTLR 3. ANTLR 4 tries to parse almost any grammar, but incurs excessive runtime overhead even for very simple grammars. ANTLR 3 is more restrictive, but fast. – kay
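To illustrate the kind of per-character conversion described in this comment, here is a minimal sketch that compares a character-by-character loop with a bulk conversion. Both functions are hypothetical stand-ins and are not the ANTLR runtime's actual code. The bulk variant assumes the input is pure ASCII, and whether an overridden input stream could use such pre-converted data is not shown here.

```python
import timeit


def load_per_char(text):
    """Convert text to integer code points one character at a time."""
    data = []
    for ch in text:
        data.append(ord(ch))
    return data


def load_bulk(text):
    """Convert ASCII-only text to a list of ints in one bulk step."""
    # Assumption: the input is pure ASCII; encode() raises an error otherwise
    return list(bytearray(text.encode("ascii")))


if __name__ == "__main__":
    # Hypothetical sample input, loosely shaped like one policy line
    sample = "policy p1 { match { source-address a1; } then permit; }\n" * 20000
    assert load_per_char(sample) == load_bulk(sample)
    for fn in (load_per_char, load_bulk):
        seconds = timeit.timeit(lambda: fn(sample), number=5)
        print("%s: %.3f s" % (fn.__name__, seconds))
```

The timings depend on the interpreter and the machine, so they are worth measuring rather than assuming. The equality check only confirms that both approaches yield the same list of integers for ASCII input.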

It is my understanding that your IDENT can be of size zero because of the * instead of +. This sends your parser into loops for every single character, generating zero-size IDENT nodes.

2017-07-22 21:44:46 user2722968
