2015-04-09

Count word frequencies in a list of lists of words

I have this large corpus in a data frame:

res (data frame)

text.1

1                                                                                                                               <NA> 
2 beren stuart vanuatu monday october  venkatesh ramesh sandeep talanki  nagaraj subject approve qlikview gpa access process form gpa access email requestor line manager access granted raj add user qlikview workgroup gpa access form requestors lim tek kon vanuatu address lini high port vila efate title relationship manager emerging corporates employee id lan id limtk bsbcc authorising manager beren stuart vanuatu  read gpa dashboard business technical reason na 
                                                                                                                                  text.2 
1                                                                                                                               <NA> 
2 kumar santhosh behalf relationshipbankingfinancesupport friday october  venkatesh cc global business reporting subject fw approve qlikview gpa access santhosh faunt daniel png wednesday october  relationshipbankingfinancesupport cc amet sova subject fw approve qlikview gpa access unable approve excel due macro issues process amet sova monday october  faunt daniel png subject approve qlikview gpa access  review attached form click line manager approval approve 
                                                                                                                                          text.3 
1                                                                                                                                       <NA> 
2 thomson owen tonga thursday october  venkatesh ramesh sandeep talanki  nagaraj subject approve qlikview gpa access process form gpa access email requestor line manager access granted raj add user qlikview workgroup gpa access form requestors hia viliami address head office fakafanua centre maufanga vuna road nukualofa tongatapu tonga nukualofa tongatapu title nfc amu manager employee id lan id hiav bsbcc authorising manager thomson owen tonga  read gpa dashboard business technical reason 
                                                                                                                                          text.4 

1                                                                                                                                    <NA> 
2 kumar rajesh fiji tuesday october  venkatesh ramesh sandeep talanki  nagaraj subject approve qlikview gpa access process form gpa access email requestor line manager access granted raj add user qlikview workgroup gpa access form requestors fong vincent address level anz house victoria parade suva suva viti levu title national manager commercial banking fiji employee id lan id fongv bsbcc authorising manager kumar rajesh fiji  read gpa dashboard business technical reason user 
                                               text.5 
1                                            <NA> 
2 dennis david timor thursday october  buchanan geoffrey solomon islands subject approve qlikview gpa access  review attached form click line manager approval approve 
                           text.6 
1                        <NA> 
2 matthey christopher wednesday october pm parrott louise subject document file documentzip 
                                                                                                                                                                                                                                                                                                                                                text.7 
1                                                                                                                                                                                                                                                            
2 tan jasmine thursday october pm global business reporting cc tan yong hoong rai dinkar subject rtc view report sep sensitivity confidential team don’ access sharepoint link arrange access jasmine ayyamperumal rajendran ramesh kumar behalf global business reporting tuesday october pm kumar gaurav hong kong tan jasmine seah linda shroff manish behan thibault hong kong clay iv william cc tan yong hoong rai dinkar tan matthew rb finance sim sui poh subramanian raghuveer murugeshaiah sunil subject rtc view report sep sensitivity confidential october dear attached sharepoint report rtc portfolio  client list august report discussed individual reviews  rtc financials  full client financials pivot table excel file metrics clients note report based rtc client list dinkar queries client list   review list december reporting excel file worksheets rtc summary default income measure product details pivot table product measures rtc data detail client level data grouping rtc rtc methodology explained   queries email global business reporting  issues accessing reports sharepoint sharepoint link ø gaurav kumar ø jasmine tan ø linda seah ø manish shroff ø thibault behan ø william clay global business reporting team 
                                                              text.8 
1                                                           <NA> 
2 deo ravinesh friday october  venkatesh global business reporting cc monteleone elif kabyanga isaac pinto rufus kiribati kumar santhosh subject approve qlikview gpa access team  assist rufus ceo kiribati gpa access ravi 
                                      text.9 
1                                   <NA> 
2 epoa regina thursday october  relationshipbankingfinancesupport subject gpa analysis filled form  reports assist cheers regina 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   text.10 
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 <NA> 
2 original message tseng rickson thursday october pm global business reporting cc kumar santhosh wong toto subject fw gpa importance high santhosh venkatesh quickly grant global iib gpa access mary macleod cheers rickson original message wong toto thursday october pm tseng rickson kumar santhosh subject gpa installed qlik marys desktop access account ready toto original message tseng rickson wednesday october  kumar santhosh cc wong toto subject gpa santhosh  email gbr mailbox requesting marys iib cfo access gpa  helping setup cheers rickson original message kumar santhosh wednesday october  tseng rickson cc wong toto subject gpa rickson continue email global business report mailbox venkatesh cover work find replacement sandeep software package windows package apcqliktechintabqvpluginsetupr santhosh original message tseng rickson tuesday october pm kumar santhosh cc wong toto subject gpa santhosh  sandeep left bank dont whats software package win gpa plugin dont grant access mary cheers rickson original message wong toto tuesday october pm tseng rickson subject fw gpa rickson advise software package upgrade marys desktop win week add package ready toto original message yip vivian tuesday october pm wong toto subject fw gpa toto  gpa installed mary macleods desktop computerbefore friday october rickson  computer lan id window version order installation advise vivian yip executive assistant mr gilles planté deputy ceo iib anz exchange square connaught place central hong kong phone original message broker ali tuesday october pm yip vivian tseng rickson li shirley cc macleod mary scott nicola subject gpa vivian gpa installed marys laptop installed rickson spend minutes mary   week mary hk ali 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 text.11 
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               <NA> 
2 ang vanessa tuesday october  global business reporting subject discontinued commercial fum performance report monday october team advise reach moving forward information required vanessa ayyamperumal rajendran ramesh kumar behalf global business reporting tuesday october  au yeung ivy bhuta chintan chang frank chok christopher chuang jacky china dyer andy goyal aseem gupta vivek jiang charles li shirley lim jasmine ec loh jonathan mcleod donnelle miller greg singapore praswitrianto rama roumier frederic hong kong leoni kelly hong kong runciman gary hong kong shankar vijay soh serene tong nelson tran dang cecilia tseng rickson yeh anita yeung jonathan hong kong tse ying tin yew lolita ang benedict hong kong lea danay lin gloria tong mike chuang jacky china chen carrie china poon yen chi anita qian jack chow frankie jiang helen china oum morokot dith sochal kheng sopheakchenda wong theodore foo chang horng bhattacharya arnab truong kent hong kong chan vincent cy hong kong skien craig hong kong lau vincent yeung jonathan hong kong sum selina chok christopher yau emily lee irene hong kong chung margaret lam betty turel kaiwan chan david hong kong chak katherine cheng wilson hong kong chiu polly dhupar karan chow ruskin hong kong wong sunny minam saud fiji damayanti meirina eka bahashwan rifai venkatesh shailesh sucianto lucy kartadinata paul tye alan ng wee lee diana ang sarup adesh lim jasmine ec yeoh hin ler adrain ang vanessa vu pham linh phuong tran thi sinh vietnam bui thanh van nadarajah lavanya vietnam lee john chu sally chou peter huang sophia tw tb lin lydia chang richard hsu ken huang michelle chow winnie tw tb cc mathad vijayakumar kumar santhosh subramanian raghuveer mohan durga subject discontinued commercial fum performance report monday october monday october commercial fum performance report  forward due change business structure back friday oct  global business reporting anz support services india manyata embassy business park bangalore email global 
business reporting 

From this data frame I extracted the words I need:

library(stringr)  # provides str_extract_all

# Same matches as the original single-character classes ([a][c][c]...)
pattern <- "(access|report|data)"

O <- lapply(res, function(x) str_extract_all(x, pattern))

Result (EDIT):

$text 
$text[[1]] 
[1] "access" "access" "access" "access" 

$text[[2]] 
[1] "report" "access" "access" "access" 

$text[[3]] 
[1] "access" "access" "access" "access" 

$text[[4]] 
[1] "access" "access" "access" "access" 

$text[[5]] 
[1] "report" "access" "access" "access" "access" "access" "access" "access" 

$text[[6]] 
[1] "report" "access" "access" "report" "access" "access" "access" "access" "access" "access" 

$text[[7]] 
[1] "report" "report" "access" "access" "report" "report" "report" "report" "report" "report" "data" "data" "report" 
[14] "access" "report" "report" 

$text[[8]] 
[1] "report" "access" "access" 

$text[[9]] 
[1] "report" "access" "access" "access" "report" 

$text[[10]] 
[1] "report" "access" "access" "access" "report" "access" 

Here I want to count the occurrences of each word.
I used str_count to achieve this, but it was not helpful. I also found many word-count questions on Stack Overflow, but none in R for list-type data.

dd<-lapply(O,function(x) c<-str_count(x)) 

Or can I compute the frequency of each word in each list? I tried termFrequency, but it is not supported in my R version, 3.1.0.

O <- structure(list(text= list(c("access", "access","access", "access"), 
           c("report","access", "access", "access"), 
           c("access","access", "access", "access"), 
           c("access","access", "access", "access"), 
           c("access"), 
           c(character(0)), 
           c("report", "report", "access", "access","report", "report", "report", "report", "report", "report", 
            "data", "data", "report", "access", "report", "report"), 
           c("report", "access","access"), 
           c("report"), c("report", "access", "access", "access", "report","access")))) 

I referred to this Stack Overflow question and tried frq1 <- findFreqTerms(O), but it did not work.
Maybe try 'table(unlist(O))'?
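
For reference, a minimal sketch of what that suggestion gives. The short list O below is a trimmed, made-up stand-in for the question's data (same shape, fewer entries). Note that pooling with unlist yields overall totals across all lists, not a count per list, so it may only be part of what the question asks for.

```r
# Trimmed stand-in for the question's O (assumption: same shape, fewer entries)
O <- list(text = list(c("access", "access", "access", "access"),
                      c("report", "access", "access", "access"),
                      character(0),
                      c("report", "data", "data")))

# unlist() pools every word from every list, so table() gives overall totals
totals <- table(unlist(O))
totals
```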

Since you are dealing with TermDocumentMatrix objects, have you considered starting the whole process with the package's 'termFreq' function and carrying on with the elimination from there? That's just an idea from someone who is not familiar with 'tm'.

Thanks, I am quite new to this too! – KRU

Answer

OK, tell me whether this works for you.

Using this data:

O <- structure(list(text.1 = list(character(0), c("access", "access", 
"access", "access")), text.2 = list(character(0), c("report", 
"access", "access", "access")), text.3 = list(character(0), c("access", 
"access", "access", "access")), text.4 = list(character(0), c("access", 
"access", "access", "access")), text.5 = list(character(0), "access"), 
    text.6 = list(character(0), character(0)), text.7 = list(
     character(0), c("report", "report", "access", "access", 
     "report", "report", "report", "report", "report", "report", 
     "data", "data", "report", "access", "report", "report" 
     )), text.8 = list(character(0), c("report", "access", 
    "access")), text.9 = list(character(0), "report"), text.10 = list(
     NULL, c("report", "access", "access", "access", "report", 
     "access"))), .Names = c("text.1", "text.2", "text.3", 
"text.4", "text.5", "text.6", "text.7", "text.8", "text.9", "text.10" 
)) 

Since the words always appear in the second element of the text.x lists, we take those words and put them into a new list, newlist. In addition, we convert them to factors with fixed levels, which we can later regroup into a data frame.

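newlist <- list()

# Take the second element of each text.x entry as a factor with fixed levels,
# so that words which do not occur still get a count of 0
for (item in O) {
    newlist[[length(newlist) + 1]] <- factor(item[[2]], levels = c("access", "data", "report"))
}

# Each table becomes a (Var1, Freq) column pair; keep only the Freq columns
dd <- data.frame(lapply(newlist, table))
dd <- t(as.matrix(dd[, c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)]))

rownames(dd) <- paste0("Text.", 1:10)
colnames(dd) <- c("access", "data", "report")

dd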

#         access data report
# Text.1       4    0      0
# Text.2       3    0      1
# Text.3       4    0      0
# Text.4       4    0      0
# Text.5       1    0      0
# Text.6       0    0      0
# Text.7       3    2     11
# Text.8       2    0      1
# Text.9       0    0      1
# Text.10      4    0      2
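
A more compact variant of the same idea, as a sketch only. Like the loop above, it assumes the words sit in the second element of each text.x entry and that only the three words are of interest. The helper name count_words and the small O below are made up for illustration and are not part of the original data.

```r
# Small stand-in with the same shape as the O above (assumption)
O <- list(text.1 = list(character(0), c("access", "access", "access", "access")),
          text.2 = list(character(0), c("report", "access", "access", "access")),
          text.3 = list(character(0), character(0)),
          text.4 = list(NULL, c("report", "data", "data", "access")))

words <- c("access", "data", "report")

# Hypothetical helper: count each word of interest in one text.x entry
count_words <- function(item) {
    # Fixed factor levels keep a zero count for words that do not occur
    table(factor(item[[2]], levels = words))
}

# sapply() returns a words-by-texts matrix; transpose to get one row per text
dd <- t(sapply(O, count_words))
dd
```

This avoids hard-coding the column indices and the row names, since sapply carries the names of O through to the result.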
Yup, but I want to count how many times each word is repeated in each list. I tried 'findassoc' for each word, but it does not work for list data. – KRU

How does it look now?

Exactly!! Thanks @Dominic – KRU