Ausführen von Operationen an einem großen Datensatz

Ich muss einige Analysen an einem PSL-Datensatz durchführen, der Informationen zu DNA-Sequenzfragmenten enthält. Grundsätzlich muss ich Einträge finden, die im gleichen Contig derselben gelesen werden (dies sind beide Werte im PSL-Eintrag). Das Problem ist, dass die PSL-Datensätze groß sind (10-30 Mb Textdokumente). Ich habe ein Programm geschrieben, das auf kurzen Aufzeichnungen funktioniert und auf den langen Aufzeichnungen genügend Zeit gegeben hat, aber es hat viel länger gedauert als angegeben. Mir wurde gesagt, das Programm sollte nicht länger als ~ 15 Sekunden dauern. Meins dauerte über 15 Minuten.Ausführen von Operationen an einem großen Datensatz

PSL Einträge sehen wie folgt aus:

275 11 0 0 0 0 0 0 - M02034:35:000000000-A7UU0:1:1101:19443:1992/2 286 0 286 NODE_406138_length_13407_cov_13.425076 13465 408 694 1 286, 0, 408, 

171 5 0 0 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:13497:2001/2 294 0 176 NODE_500869_length_34598_cov_30.643419 34656 34334 34510 1 176, 0, 34334, 

188 14 0 10 0 0 0 0 + M02034:35:000000000-A7UU0:1:1101:18225:2002/1 257 45 257 NODE_455027_length_12018_cov_13.759444 12076 11322 11534 1 212, 45, 11322,

Mein Code sieht wie folgt aus:

import sys 
class PSLreader : 
    ''' 
    Class to provide reading of a file containing psl alignments 
    formatted sequences: 
    object instantiation: 
    myPSLreader = PSLreader(<file name>): 

    object attributes: 
    fname: the initial file name 

    methods: 
    readPSL() : reads psl file, yielding those alignments that are within the first or last 
       1000 nt 

    readPSLpairs() : yields psl pairs that support a circular hypothesis 

    Author: David Bernick 
    Date: May 12, 2013 
    ''' 

    def __init__ (self, fname=''): 
     '''contructor: saves attribute fname ''' 

     self.fname = fname 

    def doOpen (self): 
     if self.fname is '': 
      return sys.stdin 
     else: 
      return open(self.fname) 

    def readPSL (self): 
     ''' 
     using filename given in init, returns each filtered psl records 
     that contain alignments that are within the terminal 1000nt of 
     the target. Incomplete psl records are discarded. 
     If filename was not provided, stdin is used. 

     This method selects for alignments that could may be part of a 
     circle. 

     Illumina pairs aligned to the top strand would have read1(+) and read2(-). 
     For the bottoms trand, read1(-) and read2(+). 

     For potential circularity, 
     these are the conditions that can support circularity: 
     read1(+) near the 3' terminus 
     read1(-) near the 5' terminus 
     read2(-) near the 5' terminus 
     read2(+) near the 3' terminus 

     so... 
     any read(+) near the 3', or 
     any read(-) near the 5' 

     ''' 

     nearEnd = 1000 # this constant determines "near the end" 
     with self.doOpen() as fileH: 

      for line in fileH: 
       pslList = line.split() 
       if len(pslList) < 17: 
        continue 
       tSize = int(pslList[14]) 
       tStart = int(pslList[15]) 
       strand = str(pslList[8]) 

       if strand.startswith('+') and (tSize - tStart > nearEnd): 
        continue 
       elif strand.startswith('-') and (tStart > nearEnd): 
        continue 

       yield line 

    def readPSLpairs (self): 
     read1 = [] 
     read2 = [] 

     for psl in self.readPSL(): 
      parsed_psl = psl.split() 
      strand = parsed_psl[9][-1] 
      if strand == '1': 
       read1.append(parsed_psl) 
      elif strand == '2': 
       read2.append(parsed_psl) 

     output = {} 
     for psl1 in read1: 
      name1 = psl1[9][:-1] 
      contig1 = psl1[13] 
      for psl2 in read2: 
       name2 = psl2[9][:-1] 
       contig2 = psl2[13] 
       if name1 == name2 and contig1 == contig2: 
        try: 
         output[contig1] += 1 
         break 
        except: 
         output[contig1] = 1 
         break 

     print(output) 


PSL_obj = PSLreader('EEV14-Vf.filtered.psl') 
PSL_obj.readPSLpairs()

ich einige Beispiel-Code gegeben wurde, das wie folgt aussieht:

def doSomethingPairwise (a): 
    for leftItem in a[1]: 
     for rightItem in a[2]: 
      if leftItem[1] is rightItem[1]: 
       print (a) 
thisStream = [['David', 'guitar', 1], ['David', 'guitar', 2], 
['John', 'violin', 1], ['John', 'oboe', 2], 
['Patrick', 'theremin', 1], ['Patrick', 'lute',2] ] 
thisGroup = None 
thisGroupList = [ [], [], [] ] 

for name, instrument, num in thisStream: 
    if name != thisGroup: 

     doSomethingPairwise(thisGroupList) 

     thisGroup = name 
     thisGroupList = [ [], [], [] ] 

    thisGroupList[num].append([name, instrument, num]) 
doSomethingPairwise(thisGroupList)

Aber wenn Ich habe versucht, es umzusetzen, aber mein Programm hat lange gedauert. Denke ich darüber falsch? Ich weiß, dass die verschachtelte Schleife langsam ist, aber ich sehe keine Alternative.

Edit: Ich fand heraus, die Daten wurden vorsortiert, was meine Brute-Force-Lösung sehr unpraktisch und unnötig machte.

Quelle

2015-05-26 Sam

Ich hoffe, Ihnen helfen, da muss die Frage ein besten Eingabebeispiel

#is better create PSLRecord class 
class PSLRecord: 
    def __init__(self, line): 
    pslList = line.split() 
    properties = ("matches", "misMatches", "repMatches", "nCount", 
       "qNumInsert", "qBaseInsert", "tNumInsert", 
       "tBaseInsert", "strand", "qName", "qSize", "qStart", 
       "qEnd", "tName", "tSize", "tStart", "tEnd", "blockCount", 
       "blockSizes", "qStarts", "tStarts") 
    self.__dict__.update(dict(zip(properties, pslList))) 

class PSLreader : 
    def __init__ (self, fname=''): 
    self.fname = fname 

    def doOpen (self): 
    if self.fname is '': 
     return sys.stdin 
    else: 
     return open(self.fname) 

    def readPSL (self): 
    with self.doOpen() as fileH: 
     for line in fileH: 
     pslrc = PSLRecord(line) 
     yield pslrc 

    #return a dictionary with all psl records group by qName and tName 
    def readPSLpairs (self): 
    dictpsl = {} 
    for pslrc in self.readPSL(): 
     #OP requirement, remove '1' or '2' char, in pslrc.qName[:-1] 
     key = (pslrc.qName[:-1], pslrc.tName) 
     if not key in dictpsl: 
     dictpsl[key] = [] 
     dictpsl[key].append(pslrc) 
    return dictpsl 

#Function filter .... is better out and self-contained 
def f_filter(pslrec, nearEnd = 1000): 
    if (pslrec.strand.startswith('+') and 
    (int(pslrec.tSize) - int(pslrec.tStart) > nearEnd)): 
    return False 
    if (pslrec.strand.startswith('-') and 
    (int(pslrec.tStart) > nearEnd)): 
    return False 
    return True 

PSL_obj = PSLreader('EEV14-Vf.filtered.psl') 

#read dictionary of pairs 
dictpsl = PSL_obj.readPSLpairs() 

from itertools import product 
#product from itertools 
#(1) x (2,3) = (1,2),(1,3) 

output = {} 
for key, v in dictpsl.items(): 
    name, contig = key 
    #i get filters aligns in principal strand 
    strand_princ = [pslrec for pslrec in v if f_filter(pslrec) and 
       pslrec.qName[-1] == '1'] 
    #i get filters aligns in secondary strand 
    strand_sec = [pslrec for pslrec in v if f_filter(pslrec) and 
       pslrec.qName[-1] == '2'] 
    for pslrec_princ, pslrec_sec in product(strand_princ, strand_sec): 
    #This For has fewer comparisons, since I was grouped before 
    if not contig in output: 
     output[contig] = 1 
    output[contig] += 1

Hinweis Datei: 10-30 Mb ist nicht große Datei, wenn Sie mich fragen

Quelle

2015-05-26 20:58:08

Ausführen von Operationen an einem großen Datensatz

Antwort

Verwandte Themen