Removing quasi-duplicates from an R data frame
I have a data frame with two columns. Column one is an identification number and column two is a compound. However, the compounds in column two are often repeated (different forms of the same compound). I want to remove every duplicate except the simplest form of the compound.
This is the data frame:
>NISTSpecR
NIST NAME
366620 Formic acid, TMS derivative
366765 2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative
342340 Acetic acid, TMS derivative
352374 Propanoic acid, TMS derivative
333858 Butyric Acid, TMS derivative
352377 Pentanoic acid, TMS derivative
24239 Hexanoic acid, TMS derivative
333733 Heptanoic acid, TMS derivative
352455 Oxalic acid, 2TMS derivative
414056 Succinic acid, monoethyl ester-, (TMS)
332809 Adipic acid, TMS derivative
30799 Pimelic acid, 2TMS derivative
292699 Suberic acid, 2TMS derivative
333874 Citric acid, 4TMS derivative
366657 Citric acid, 3TMS derivative
333513 (-)-Epinephrine, 3TMS derivative
16985 Epinephrine, (.beta.)-, 3TMS derivative
24795 Norepinephrine, (R)-, 5TMS derivative
332935 DL-Norepinephrine, 4TMS derivative
And here is its structure:
> str(NISTSpecR)
'data.frame': 154 obs. of 3 variables:
$ Spec: Factor w/ 239429 levels "1 0; 13 2; 14 27; 15 239; 16 3; 18 2; 26 3; 27 36; 28 32; 29 113; 30 9; 31 64; 32 9; 33 17; 34 17; 35 20; 36 1; 37 1; 41 8; 42 "| __truncated__,..: 23720 32791 3011 32175 12349 29069 193166 26108 28713 73845 ...
$ NIST: chr "366620" "366765" "342340" "352374" ...
$ NAME: Factor w/ 239430 levels "-4'-Dimethylamino-2'-(trimethylsilyl)acetanilide",..: 157152 39442 108436 210392 133148 199151 169386 168243 195800 229235 ...
I would like the final result to look something like this:
>NISTSpecR
NIST NAME
366620 Formic acid, TMS derivative
342340 Acetic acid, TMS derivative
352374 Propanoic acid, TMS derivative
333858 Butyric Acid, TMS derivative
352377 Pentanoic acid, TMS derivative
24239 Hexanoic acid, TMS derivative
333733 Heptanoic acid, TMS derivative
352455 Oxalic acid, 2TMS derivative
414056 Succinic acid, monoethyl ester-, (TMS)
332809 Adipic acid, TMS derivative
30799 Pimelic acid, 2TMS derivative
292699 Suberic acid, 2TMS derivative
366657 Citric acid, 3TMS derivative
333513 (-)-Epinephrine, 3TMS derivative
24795 Norepinephrine, (R)-, 5TMS derivative
There is only one of each parent compound (i.e. formic acid, ...), AND it must be the simplest version (the one with the fewest characters).
> dput(as.character(NISTSpecR$NAME))
c("Formic acid, TMS derivative", "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
"Acetic acid, TMS derivative", "Propanoic acid, TMS derivative",
"Butyric Acid, TMS derivative", "Pentanoic acid, TMS derivative",
"Hexanoic acid, TMS derivative", "Heptanoic acid, TMS derivative",
"Oxalic acid, 2TMS derivative", "Succinic acid, monoethyl ester-, (TMS)",
"Adipic acid, TMS derivative", "Pimelic acid, 2TMS derivative",
"Suberic acid, 2TMS derivative", "Citric acid, 4TMS derivative",
"Citric acid, 3TMS derivative", "Citric acid 3TMS", "Citric acid, ethyl ester, tri-TMS",
"Isocitric acid lactone, 2TMS derivative", "Glyoxylic acid, di-TMS",
"Pyruvic acid, TMS derivative", "Malic acid, 2TMS derivative",
"Malic acid 1-ethyl ester, 2TMS", "Malic acid, 4-ethyl ester, 2TMS",
"Malic acid, 3TMS derivative", "4-Hydroxybutanoic acid, 2TMS derivative",
"Prostaglandin A1, 2TMS derivative", "Prostaglandin A2, 2TMS derivative",
"Prostaglandin E2, 3TMS", "D-Arabinose, 4TMS derivative", "D-Xylose, 4TMS derivative",
"D-Lyxose, 4TMS derivative", "D-Ribose, 4TMS derivative", "D-Glucose, 5TMS derivative",
"D-Galactose, 5TMS derivative", "D-Mannose, 5TMS derivative",
"D-Allose, oxime (isomer 1), 6TMS derivative", "D-Allose, oxime (isomer 2), 6TMS derivative",
"D-Altrose, 5TMS derivative", "Dihydroxyacetone, 2TMS derivative",
"1,3-Dihydroxyacetone dimer, 4TMS derivative", "D-Fructose, 5TMS derivative",
"D-Psicose, 5TMS derivative", "Sedoheptulose, 6TMS derivative",
"D-2-Deoxyribose, 3TMS derivative", "2-Deoxyribose, 3TMS derivative",
"L-Fucose, 4TMS derivative", "L-Rhamnose, (R,R,S,S)-, 4TMS derivative",
"L-Rhamnose, 4TMS derivative", "N-Acetyl-D-glucosamine, 4TMS derivative",
"D-Gluconic acid, 6TMS derivative", "Glycerol monostearate, 2TMS derivative",
"Glycerol 2-laurate, 2TMS derivative", "Glycerol, 3TMS derivative",
"Xylitol, 5TMS derivative", "D-Sorbitol, 6TMS derivative",
"D-Mannitol, 6TMS derivative", "Sucrose, 8TMS derivative",
"D-Lactose, (isomer 1), 8TMS derivative", ".beta.-D-Lactose, (isomer 1), 8TMS derivative",
"D-Lactose, (isomer 2), 8TMS derivative", ".beta.-D-Lactose, (isomer 2), 8TMS derivative",
".alpha.-D-Lactose, 8TMS derivative", ".alpha.-D-Lactose, 8TMS derivative",
".beta.-Lactose, 8TMS derivative", "Lactose, 8TMS derivative",
"Maltose, 8TMS derivative, isomer 1", "Maltose, 8TMS derivative, isomer 2",
"Maltose, 8TMS derivative", "D-Trehalose, 7TMS derivative",
"Melibiose, 8TMS derivative", "L-Ornithine, 3TMS derivative",
"DL-Ornithine, 3TMS derivative", "DL-Ornithine, 4TMS derivative",
"L-Ornithine, 4TMS derivative", "L-Homoserine, 2TMS derivative",
"L-Citrulline, 3TMS derivative", "3-Iodo-L-tyrosine, 3TMS derivative",
"3-Aminoisobutyric acid, TMS derivative", "3-Aminoisobutyric acid, 3TMS derivative",
"3-Aminoisobutyric acid, 2TMS derivative", "D-Isoleucine, N-acetyl-, TMS derivative",
"L-Hydroxyproline, (E)-, 2TMS derivative", "L-Hydroxyproline, (E)-, 3TMS derivative",
"Hydroxyproline, 3TMS derivative", "3-Hydroxyproline, 3TMS derivative",
"L-Cystine, 4TMS derivative", "Ethanolamine, 3TMS derivative",
"Ethanolamine, 2TMS derivative", "3-Aminopropanol, TMS derivative",
"Putrescine, 4TMS derivative", "Histamine 2TMS derivative",
"Histamine, 3TMS derivative", "Dopamine, 4TMS derivative",
"Dopamine, 3TMS derivative", "Serotonin, 4TMS derivative",
"Tyramine, 3TMS derivative", "Tyramine, TMS derivative", "Tyramine, 2TMS derivative",
"Phenethylamine, 2TMS derivative", "1-Phenethylamine, TMS derivative",
"Phenethylamine, TMS derivative", "Biotin, 3TMS derivative",
"16.beta.,17.alpha.-Estriol, 3TMS derivative", "Estriol, 3TMS derivative",
"16.alpha.,17.alpha.-Estriol, 3TMS derivative", "16.beta.,17.beta.-Estriol, 3TMS derivative",
"Estrone, TMS derivative", "16-Estrone, TMS derivative",
"Estrone, O-methyloxime, TMS derivative", "Equilin, TMS derivative",
"Equilenin, (14.beta.)-, TMS derivative", "Equilenin, TMS derivative",
"2-Hydroxyestradiol, 3TMS derivative", "Androsterone, (E)-, TMS derivative",
"Dehydroepiandrosterone, (E)-, TMS derivative", "5.beta.-Dihydrotestosterone, TMS derivative",
"5.alpha.-Dihydrotestosterone, TMS derivative", "Testosterone O-methyloxime, TMS derivative",
"Testosterone, TMS derivative", "Pregnenolone, TMS derivative",
"Aldosterone, 2TMS derivative", "Aldosterone, N-methoxy-tri-TMS",
"Corticosterone, bis(O-methyloxime)", "Deoxycholic acid, 2TMS derivative",
"Desoxycol 3-MS-Derivat", "Lithocholic acid, 2TMS derivative",
"Cholesterol, TMS derivative", "Desmosterol, TMS derivative",
"Ergosterol, TMS derivative", "Campesterol, TMS derivative",
"Fucosterol, TMS derivative", "Stigmastanol, TMS derivative",
"Stigmasterol, TMS derivative", "11-Deoxycortisol, bis(O-methyloxime)",
"Melatonin, 2TMS derivative", "Adrenaline, 4TMS derivative",
"L-Adrenaline, 4TMS derivative", "Glycine, 3TMS derivative",
"Glycine, TMS derivative", "Glycine, 2TMS derivative",
"Aspartic acid, 3TMS derivative", "L-Aspartic acid, 3TMS derivative",
"L-Aspartic acid, 2TMS derivative", "L-Glutamic acid, 3TMS derivative",
"(-)-Epinephrine, 3TMS derivative", "Epinephrine, (.beta.)-, 3TMS derivative",
"(-)-Epinephrine, 4TMS derivative", "Norepinephrine, (R)-, 5TMS derivative",
"DL-Norepinephrine, 4TMS derivative", "Norepinephrine, (R)-, 4TMS derivative",
"Cycloserine, 3TMS derivative", "Cycloheximide, 2TMS derivative",
"Chloramphenicol, 2TMS derivative", "Chloramphenicol, 3TMS derivative"
)
Thanks.
That would work if they were all simple acids. The df has been updated with some other values. Also, I need to keep only the simplest version, not just any version.
Any other suggestions?
What other forms do the compounds take? Do you have a list? Matching against a known list will be easy; otherwise it will come down to ad hoc string splitting. – shayaa
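For what it's worth, the ad hoc string splitting route could look roughly like the base R sketch below. It is only a heuristic under stated assumptions, not a general chemistry-aware solution: `parent_key()` and `keep_simplest()` are hypothetical helper names, the parent compound is assumed to be the text before the first comma, a trailing "<word> acid" pair is assumed to be the parent of a substituted acid, and leading stereo descriptors such as "(-)-", "DL-", "D-" and "L-" are assumed not to matter. Within each parent, the row with the fewest characters is kept, and ties go to the earlier row.

```r
# Hypothetical helper: derive a crude "parent compound" key from a name.
parent_key <- function(name) {
  # Keep only the text before the first comma, e.g. "Citric acid"
  head <- trimws(sub(",.*$", "", as.character(name)))
  # Assumption: for substituted acids the parent is the last "<word> acid" pair,
  # so "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid" collapses to "Acetic acid"
  head <- sub("^.*\\s(\\S+\\s+acid)$", "\\1", head,
              ignore.case = TRUE, perl = TRUE)
  # Assumption: leading stereo descriptors are not part of the parent name
  head <- sub("^(\\(-\\)-|\\(\\+\\)-|DL-|D-|L-)", "", head, perl = TRUE)
  tolower(head)
}

# Hypothetical helper: keep the shortest name per parent key.
keep_simplest <- function(df, name_col = "NAME") {
  nm  <- as.character(df[[name_col]])
  key <- parent_key(nm)
  len <- nchar(nm)
  # Rank rows by name length within each key; ties keep the earliest row
  rank_in_group <- ave(len, key,
                       FUN = function(x) rank(x, ties.method = "first"))
  df[rank_in_group == 1, , drop = FALSE]
}

# A few rows taken from the question
demo <- data.frame(
  NIST = c("366620", "366765", "342340", "333874", "366657",
           "333513", "16985", "24795", "332935"),
  NAME = c("Formic acid, TMS derivative",
           "2-[2-(2-Butoxyethoxy)ethoxy] Acetic acid, TMS derivative",
           "Acetic acid, TMS derivative",
           "Citric acid, 4TMS derivative",
           "Citric acid, 3TMS derivative",
           "(-)-Epinephrine, 3TMS derivative",
           "Epinephrine, (.beta.)-, 3TMS derivative",
           "Norepinephrine, (R)-, 5TMS derivative",
           "DL-Norepinephrine, 4TMS derivative"),
  stringsAsFactors = FALSE
)

result <- keep_simplest(demo)
print(result)
```

Note that a strict fewest-characters rule does not reproduce the desired output exactly: the two citric acid rows have the same length, so the earlier one (4TMS) is kept, and "DL-Norepinephrine, 4TMS derivative" is shorter than the (R)-, 5TMS row. Names such as "Malic acid 1-ethyl ester, 2TMS" also end up with their own key, so the key function would need more rules, or a lookup against a known list of parent compounds, to cover the full data.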