2016-04-10 18 views
1

R: Extract time from srt (subtitle) file

I need to calculate the speech rate of each subtitle line. The content of the srt (subtitle) file looks like this:

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, plus debate and analysis. 

3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect the pay of millions 

For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speech rate for these 10 words is 498.9 milliseconds per word.
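
As a quick check of that arithmetic in R (a minimal sketch; the variable names durationMs and nWords are only illustrative, and the figures are those of subtitle 3 above):

```r
# Figures taken from subtitle 3: 00:00:24,000 --> 00:00:28,989
durationMs <- 4 * 1000 + 989   # 4 seconds 989 milliseconds, in milliseconds
nWords <- 10                   # words in the subtitle text

msPerWord <- durationMs / nWords
print(msPerWord)               # 498.9
```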

How do I read the srt file so that I get a data frame with startTime, endTime, textString and wordCount as columns and the subtitle entries as rows, as below?

startTime<-c("00:00:19,000", "00:00:22,000", "00:00:24,000") 

endTime<-c("00:00:21,989", "00:00:23,989", "00:00:28,989") 

textString<-c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions") 

wordCount<-c(12,10,10) 

rate.df<-data.frame(startTime, endTime, textString, wordCount) 

How can I subtract startTime from endTime in R when the time is given in the form hour:minute:second,millisecond?

+0

I managed to do this in MS Excel, but I have too much data to use Excel for the task. – Ninjacat

Answer

2

Here is a possible solution (the code is fairly self-explanatory):

text=" 

1 
00:00:19,000 --> 00:00:21,989 
I'm Annita McVeigh and welcome to Election Today where we'll bring you 

2 
00:00:22,000 --> 00:00:23,989 
the latest from the campaign trail, 
plus debate 
and analysis. 



3 
00:00:24,000 --> 00:00:28,989 
The Liberal Democrats promise to protect 
the pay of millions" 

con <- textConnection(text)
lines <- readLines(con)
close(con)

# the previous lines of code just replicate your case;
# in the real case, replace them with the following single line:
# lines <- readLines(srtFileName)

listOfEntries <-
lapply(split(seq_along(lines),cumsum(grepl("^\\s*$",lines))),function(blockIdx){
    block <- lines[blockIdx]
    # drop the blank separator lines
    block <- block[!grepl("^\\s*$",block)]
    if(length(block) == 0){
        return(NULL)
    }
    if(length(block) < 3){
        warning("a block not respecting srt standards has been found")
    }
    # block[-(1:2)] is empty for short blocks, where block[3:length(block)] would index backwards
    return(data.frame(id=block[1],
                      times=block[2],
                      textString=paste0(block[-(1:2)],collapse="\n"),
                      stringsAsFactors = FALSE))
})
m <- do.call(rbind,listOfEntries) 


# split start and end times 
tmp <- do.call(rbind,strsplit(m[,'times'],' --> ')) 
m$startTime <- tmp[,1] 
m$endTime <- tmp[,2] 

# parse times 
tmp <- do.call(rbind,lapply(strsplit(m$startTime,':|,'),as.numeric)) 
m$fromSeconds <- tmp %*% c(60*60,60,1,1/1000) 

tmp <- do.call(rbind,lapply(strsplit(m$endTime,':|,'),as.numeric)) 
m$toSeconds <- tmp %*% c(60*60,60,1,1/1000) 

# compute time difference in seconds 
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds 

# word count: number of runs of non-word characters, plus one
m$wordCount <- vapply(gregexpr("\\W+",m$textString),length,0) + 1

# or, if you consider "I'm" a single word, you can remove the occurrences of ', e.g.:
#m$wordCount <- vapply(gregexpr("\\W+",gsub("'","",m$textString)),length,0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000/m$wordCount 

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
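
Note that the \W+ count treats the apostrophe in "I'm" and a trailing full stop as separators, which is why rows 1 and 2 show 14 and 11 words rather than the 12 and 10 given in the question. Below is a minimal sketch of an alternative, assuming a "word" is simply a whitespace-separated token. The helper names srtToSeconds and countWords are hypothetical and not part of the code above; only base R functions are used.

```r
# Hypothetical helper: convert "HH:MM:SS,mmm" strings to seconds.
# as.vector() drops the one-column matrix that %*% returns.
srtToSeconds <- function(x) {
  parts <- do.call(rbind, lapply(strsplit(trimws(x), "[:,]"), as.numeric))
  as.vector(parts %*% c(60 * 60, 60, 1, 1 / 1000))
}

# Hypothetical helper: count whitespace-separated tokens, so that
# "I'm" and "analysis." each count as exactly one word.
countWords <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, \nplus debate \nand analysis.",
  "The Liberal Democrats promise to protect \nthe pay of millions"
)

timeDiffInSecs <- srtToSeconds(endTime) - srtToSeconds(startTime)
wordCount <- countWords(textString)
millisecsPerWord <- timeDiffInSecs * 1000 / wordCount
```

With this definition wordCount comes out as 12, 10 and 10, matching the counts in the question. Whether a whitespace-based count is the right one depends on how you want to treat hyphenated words, numbers and similar cases.
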
+1

Oh. That is amazing! Thank you very much, Digemall! The code is simply beautiful! – Ninjacat

+0

Thank you very much, @digemall – Ninjacat