Many thanks to rendel, who helped me find the right solution!
The solution by Andrei Stefan is not optimal.
Why? First, the search analyzer has no lowercase filter, which makes searching inconvenient: the case has to match exactly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".
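To make the case problem concrete, here is a minimal Python sketch. It does not call Elasticsearch; the two analyzer functions and the small set of indexed tokens are hypothetical stand-ins, assuming the index side already lowercases its tokens (as the token dump further down suggests).

```python
def keyword_analyzer(text):
    # Stand-in for the built-in keyword analyzer:
    # the whole input becomes a single token, unchanged.
    return [text]


def lowercase_keyword_analyzer(text):
    # Hypothetical custom analyzer: same single token, but lowercased.
    return [text.lower()]


# Assumed index content: lowercased edge ngrams of the word "dementia".
indexed_tokens = {"de", "dem", "deme", "demen"}


def matches(search_analyzer, user_input):
    # A search term matches only if it equals an indexed token exactly.
    return any(tok in indexed_tokens for tok in search_analyzer(user_input))


case_sensitive_hit = matches(keyword_analyzer, "Dem")
case_insensitive_hit = matches(lowercase_keyword_analyzer, "Dem")
print(case_sensitive_hit, case_insensitive_hit)
```

With the plain keyword analyzer the capitalized input "Dem" finds nothing, while the lowercasing variant matches the indexed token "dem".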
Second, the analysis part is wrong! At index time, the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we get the following array of dictionaries as the analyzed string:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string is tokenized with token sizes from 2 to 25 characters. The string is tokenized linearly, spaces included, and the position is incremented by one for every new token.
There are several problems with this:
- The edge_ngram_analyzer produces useless tokens that will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w", etc.
- It also fails to produce many useful tokens, for example: "disease", "early onset", etc. You get 0 results if you search for any of these words.
- Notice that the last token is "s disease with early onse". Where is the final "t"? Because of "max_gram" : "25", we "lost" some text in all fields. You can no longer search for this text because there are no tokens for it.
- The trim filter only hides the problem by filtering out extra spaces, when a tokenizer could do this.
- The edge_ngram_analyzer increments the position of each token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the token when generating the ngrams.
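The missing "t" and the missing word tokens can be reproduced with a small Python sketch. The helper edge_ngrams below is a hypothetical stand-in that only generates prefixes of a string; it is not the Elasticsearch implementation. The chunk is the last run of text from the token dump above.

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    # All prefixes of `text` whose length lies between min_gram and max_gram.
    longest = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, longest + 1)]


# Last chunk from the token dump: 26 characters, one more than max_gram.
chunk = "s disease with early onset"
grams = edge_ngrams(chunk)

print(len(chunk))          # length of the chunk
print(repr(grams[-1]))     # longest prefix that was emitted
print("disease" in grams)  # is the plain word a token?
print(chunk in grams)      # is the full chunk a token?
```

Because only prefixes of the whole chunk are emitted, the word "disease" never appears as a token of its own, and anything past the 25th character is dropped.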
The optimal solution.
The mappings and settings to use:
...
"mappings": {
"Type": {
"_all":{
"analyzer": "edge_ngram_analyzer",
"search_analyzer": "keyword_analyzer"
},
"properties": {
"Field": {
"search_analyzer": "keyword_analyzer",
"type": "string",
"analyzer": "edge_ngram_analyzer"
},
...
...
"settings": {
"analysis": {
"filter": {
"english_poss_stemmer": {
"type": "stemmer",
"name": "possessive_english"
},
"edge_ngram": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "25",
"token_chars": ["letter", "digit"]
}
},
"analyzer": {
"edge_ngram_analyzer": {
"filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
"tokenizer": "standard"
},
"keyword_analyzer": {
"filter": ["lowercase", "english_poss_stemmer"],
"tokenizer": "standard"
}
}
}
}
...
Look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
At index time, the text is tokenized by the standard tokenizer, then the individual words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words. At search time, the text is tokenized by the standard tokenizer, then the individual words are filtered by lowercase and possessive_english. The searched words are matched against the tokens that were created at index time.
This is what makes incremental search possible!
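The two analysis chains can be approximated in plain Python. This is only a sketch: the regular expression is a rough stand-in for the standard tokenizer, strip_possessive imitates the possessive_english stemmer, and all function names are made up for illustration. Positions here are simply consecutive word indexes and may differ from what Elasticsearch reports.

```python
import re


def standard_tokenize(text):
    # Rough stand-in for the standard tokenizer: runs of letters/digits,
    # optionally joined by "." or "'" so "F00.0" and "Alzheimer's" stay whole.
    return re.findall(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*", text)


def strip_possessive(token):
    # Imitates the possessive_english stemmer: drop a trailing 's.
    return token[:-2] if token.endswith("'s") else token


def edge_ngrams(token, min_gram=2, max_gram=25):
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_analyze(text):
    # Index time: tokenize, lowercase, strip possessive, then edge ngrams.
    # All ngrams of one word share that word's position.
    result = []
    for position, word in enumerate(standard_tokenize(text)):
        word = strip_possessive(word.lower())
        result.extend((position, gram) for gram in edge_ngrams(word))
    return result


def search_analyze(text):
    # Search time: same chain, but without the edge ngram step.
    return [strip_possessive(w.lower()) for w in standard_tokenize(text)]


doc = "F00.0 - Dementia in Alzheimer's disease with early onset"
indexed = index_analyze(doc)
index_terms = {token for _, token in indexed}
search_terms = search_analyze("Dem ALZH")

print(search_terms)
print(all(term in index_terms for term in search_terms))
```

Each partially typed word is found because it equals one of the prefixes stored at index time, and the case of the input no longer matters.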
Now that we build ngrams on individual words, we can even run queries like
{
'query': {
'multi_match': {
'query': 'dem in alzh',
'type': 'phrase',
'fields': ['_all']
}
}
}
and get correct results.
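Why the phrase query works can be sketched with a toy positional index in Python. This is an illustration under simplifying assumptions, not how Elasticsearch is implemented: words, build_index and phrase_match are hypothetical helpers, and the phrase check assumes a slop of 0.

```python
import re
from collections import defaultdict


def words(text):
    # Approximate tokenizer + lowercase + possessive stripping.
    tokens = re.findall(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*", text.lower())
    return [t[:-2] if t.endswith("'s") else t for t in tokens]


def build_index(text, min_gram=2, max_gram=25):
    # Map each edge ngram to the set of word positions where it occurs.
    positions = defaultdict(set)
    for pos, word in enumerate(words(text)):
        for n in range(min_gram, min(len(word), max_gram) + 1):
            positions[word[:n]].add(pos)
    return positions


def phrase_match(index, query):
    # Phrase semantics with slop 0: term i must sit at position start + i.
    terms = words(query)
    if not terms:
        return False
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


index = build_index("F00.0 - Dementia in Alzheimer's disease with early onset")
in_order = phrase_match(index, "dem in alzh")
wrong_order = phrase_match(index, "alzh dem")
with_gap = phrase_match(index, "dem alzh")
print(in_order, wrong_order, with_gap)
```

Since every ngram keeps the position of the word it came from, consecutive prefixes line up like the original words. A query in the wrong order or with a skipped word does not match as a phrase.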
No text is "lost", everything is searchable, and there is no longer any need to deal with spaces via the trim filter.
Have you tried using query_string instead of multi_match? Let me know if it solves your problem. –
By default, 'query_string' searches the '_all' field, so it is the same as 'multi_match' with '"fields": ["_all"]' here. I tried it anyway, without success. I used the following query: '{'query': {'query_string': {'query': 'dementia in alzh', 'phrase_slop': 0}}}' – trex