I just used OpenNLP for the same thing.
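import java.io.*;
import java.util.*;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException,
        InvalidFormatException {
    // try-with-resources closes the model stream even if loading or detection throws
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}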
Example
//Failed at Hi.
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));
It only failed where there was a human error in the input. For example, the abbreviation "Dr." should have a capital D, and at least one space is expected between two sentences.
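The "Door.Noone" case fails because the space between the two sentences is missing. One possible workaround is to normalise the text before handing it to the detector. This is only a sketch: the class and method names are made up for illustration, and the rule (a period directly between a lowercase and an uppercase letter) is a heuristic I am assuming fits your text. It leaves "2.2", "Jan.13", "U.S." and all-lowercase domains alone, but it would also touch something like a mixed-case domain name.

```java
import java.util.regex.Pattern;

public class SentenceSpacing {
    // A period glued between a lowercase and an uppercase letter, as in "Door.Noone".
    // Digits ("2.2"), all-caps abbreviations ("U.S.") and lowercase domains do not match.
    private static final Pattern MISSING_SPACE =
            Pattern.compile("(?<=\\p{Ll})\\.(?=\\p{Lu})");

    /** Hypothetical helper: inserts the space that the sentence detector expects. */
    public static String insertMissingSpaces(String paragraph) {
        return MISSING_SPACE.matcher(paragraph).replaceAll(". ");
    }

    public static void main(String[] args) {
        // prints "Close the Door. Noone is out there"
        System.out.println(insertMissingSpaces("Close the Door.Noone is out there"));
    }
}
```

The result can then be passed to any of the splitters in this answer unchanged.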
You can also do it with a regular expression (RE), like this:
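import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph) {
    List<String> sentences = new ArrayList<>();
    // A sentence runs up to a '.', '!' or '?' that is followed by whitespace or the end of the input
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;
}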
Example
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr., mrs.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at U.S.
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));
But the error rate is comparatively high. Another option is to use BreakIterator.
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public static List<String> breakIntoSentencesBreakIterator(String paragraph) {
    List<String> sentences = new ArrayList<>();
    BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceIterator.setText(paragraph);
    // Walk the boundaries front to back so the sentences keep their original order
    int start = sentenceIterator.first();
    for (int end = sentenceIterator.next();
            end != BreakIterator.DONE;
            start = end, end = sentenceIterator.next()) {
        sentences.add(paragraph.substring(start, end));
    }
    return sentences;
}
Example:
paragraph = "Hi. How are you? This is Mike.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Door.Noone
paragraph = "Close the Door.Noone is out there";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at Mr.
paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
//Failed at dr.
paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to [email protected]";
SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));
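The abbreviation failures noted in the examples (Mr., mrs., dr., U.S.) could probably be reduced with a post-processing pass that glues a fragment ending in a known abbreviation onto the fragment after it. The sketch below is not part of any of the three approaches: the class name, the method name and the tiny abbreviation list are my own assumptions, and the fragment list in main is a made-up example of what a splitter might return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AbbreviationMerger {
    // Illustrative list only (assumption); a real one would be much longer.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("mr.", "mrs.", "dr.", "u.s."));

    /** Hypothetical post-processing step: joins a fragment that ends in a
     *  known abbreviation with the fragment that follows it. */
    public static List<String> mergeAbbreviations(List<String> fragments) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String fragment : fragments) {
            String text = fragment.trim();
            if (text.isEmpty()) {
                continue;
            }
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(text);
            if (!endsWithAbbreviation(text)) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            merged.add(pending.toString()); // input ended on an abbreviation
        }
        return merged;
    }

    private static boolean endsWithAbbreviation(String text) {
        String lastWord = text.substring(text.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList("Really!!", "I cant believe.", "Mr.",
                "Wilson can come any moment to receive mrs.", "watson.");
        mergeAbbreviations(fragments).forEach(System.out::println);
    }
}
```

The obvious trade-off is that a sentence which really does end in "U.S." gets merged with the next one, so this only helps if such sentences are rare in your text.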
Benchmarking:
- custom RE: 7 ms
- BreakIterator: 143 ms
- openNlp: 255 ms
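The figures above do not say how many iterations were run, whether there was a warm-up, or whether loading the OpenNLP model is included, so treat them as rough. If you want to repeat the comparison on your own text, a minimal timing sketch could look like this. The class and helper names are hypothetical, and the BreakIterator splitter inside is only a stand-in so that the example runs on its own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

public class SplitterTiming {

    /** Hypothetical helper: average wall-clock milliseconds per call. */
    public static double averageMillis(Function<String, List<String>> splitter,
                                       String paragraph, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            splitter.apply(paragraph);        // give the JIT a chance to compile first
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            splitter.apply(paragraph);
        }
        return (System.nanoTime() - start) / 1_000_000.0 / timedRuns;
    }

    // Stand-in splitter (assumption); swap in whichever method you want to measure.
    public static List<String> splitWithBreakIterator(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(paragraph.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? This is Mike.";
        double ms = averageMillis(SplitterTiming::splitWithBreakIterator, paragraph, 1_000, 10_000);
        System.out.printf("BreakIterator: %.4f ms per call%n", ms);
    }
}
```

For numbers you intend to rely on, a dedicated harness such as JMH is a safer choice than a hand-written loop like this one.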
What exactly do you mean by "simple sentence"? Just a single sentence as opposed to a paragraph, in which case your question is about sentence boundary detection? Or a sentence that contains only one main predicate (as opposed to a complex sentence with subordinate clauses, etc.)? Or something else entirely? – jogojapan
Hi jogojapan, yes, that's right, I just meant a single sentence as opposed to a paragraph ... –
You have not properly defined what you mean by a simple sentence, so it is hard for anyone to answer your question. You might want to use something like the Stanford Parser to get the syntax tree for each sentence, and discard all sentences that are not of type 'NP VP', i.e. sentences that consist of a noun phrase followed by a verb phrase (e.g. '[John] [sat on a bench]', '[Mary and Jill] [ate their sandwiches]', etc.) –
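To illustrate the 'NP VP' idea, here is a rough sketch of the filtering step on its own. It does not call the Stanford Parser. It assumes each sentence's parse is already available as a Penn-Treebank-style bracketed string, and the class name, method name and sample parses are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NpVpFilter {

    /** Hypothetical check: true if the first S node has exactly the children NP and VP
     *  (punctuation tags such as "." are ignored). */
    public static boolean isNpVp(String parse) {
        List<String> children = new ArrayList<>();
        int depth = 0;
        int clauseDepth = -1;                  // depth of the first S node, once found
        for (int i = 0; i < parse.length(); i++) {
            char c = parse.charAt(i);
            if (c == '(') {
                depth++;
                int end = i + 1;               // read the node label that follows '('
                while (end < parse.length() && !Character.isWhitespace(parse.charAt(end))
                        && parse.charAt(end) != '(' && parse.charAt(end) != ')') {
                    end++;
                }
                String label = parse.substring(i + 1, end);
                if (clauseDepth < 0 && label.equals("S")) {
                    clauseDepth = depth;
                } else if (clauseDepth > 0 && depth == clauseDepth + 1
                        && !label.isEmpty() && Character.isLetter(label.charAt(0))) {
                    children.add(label);       // direct child of the S node
                }
            } else if (c == ')') {
                if (depth == clauseDepth) {
                    break;                     // the S node has been closed
                }
                depth--;
            }
        }
        return children.equals(Arrays.asList("NP", "VP"));
    }

    public static void main(String[] args) {
        String simple = "(ROOT (S (NP (NNP John)) (VP (VBD sat) (PP (IN on) (NP (DT a) (NN bench)))) (. .)))";
        String complex = "(ROOT (S (SBAR (IN If) (S (NP (PRP it)) (VP (VBZ rains)))) (NP (PRP we)) (VP (VBP stay))))";
        System.out.println(isNpVp(simple));   // true
        System.out.println(isNpVp(complex));  // false
    }
}
```

Sentences for which the check returns false would be the ones to discard.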