2016-06-24

DataFrame summary of individual pages

Here is my DataFrame:

import pandas as pd 
import re 

!wget https://s3.amazonaws.com/todel162/elastic.csv 

df=pd.read_csv('elastic.csv') 

def mysearch(mystring): 
    urls = re.findall(r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', mystring) 
    return urls 

df['mysearch']=df.Body.apply(mysearch) 

Each cell of the mysearch column can contain more than one URL. I need to associate every unique HTML page (not the full URL) with its corresponding ParentId values, and the output should look something like this:

query-dsl-term-query.html 35564374, 46568374 
query-dsl-bool-query.html 35594195, 75694493 
plugins-inputs-jdbc.html 34203007 
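
To make the difference between a URL and an HTML page concrete, here is a minimal sketch on a single made-up string. The sample text and the helper name `page_names` are invented for illustration and are not part of the real data. Note that the character class in the pattern does not include `#`, so a match stops before any fragment.

```python
import re

# Same pattern as in mysearch; the sample text below is invented for illustration.
PATTERN = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def page_names(text):
    """Return the unique .html page names found in text, in order of appearance."""
    pages = []
    for url in re.findall(PATTERN, text):
        # Everything after the last '/' is the page name.
        page = url.rsplit('/', 1)[-1]
        if page.endswith('.html') and page not in pages:
            pages.append(page)
    return pages

sample = (
    "See https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html "
    "and https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html#_example "
    "or https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html again"
)
print(page_names(sample))
```

The two links to query-dsl-term-query.html collapse into a single page name, which is the granularity the summary should be built on.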

Answer

You can use:

import pandas as pd 

#force column ParentId as string 
df=pd.read_csv('https://s3.amazonaws.com/todel162/elastic.csv', dtype={'ParentId':str}) 
#print (df) 

#find all patterns, create new dataframe 
pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+' 
df1 = pd.DataFrame([x for x in df.Body.str.findall(pat)]) 

#see http://stackoverflow.com/a/37592047/2901002 
df1 = df.drop('Body',axis=1).join(df1.stack().reset_index(drop=True, level=1).rename('Body')) 

#keep only rows whose URL contains the literal text .html (regex=False so '.' is not a wildcard) 
df1 = df1[df1.Body.str.contains('.html', regex=False, na=False)] 

#split at the last `/` and keep the page name (n must be passed by keyword in recent pandas) 
df1['url'] = df1.Body.str.rsplit('/', n=1, expand=False).str[1] 
#print (df1) 

#join by unique url 
df2 = df1.groupby('url')['ParentId'].apply(lambda x: ','.join(x.astype(str))).reset_index() 
print (df2) 

                url \ 
0         _add_an_index.html 
1         _add_failover.html 
2       _aggregation_test_drive.html 
3         _basic_concepts.html 
4        _batch_processing.html 
5         _best_fields.html 
6       _boosting_query_clauses.html 
7       _bucket_aggregations.html 
8       _buckets_inside_buckets.html 
9          _cat_api.html 
10        _closer_is_better.html 
11        _cluster_health.html 
12    _combining_queries_with_filters.html 
13        _community_dsls.html 
14      _community_integrations.html 
15         _configuration.html 
16       _controlling_analysis.html 
17       _coping_with_failure.html 
18       _cross_fields_queries.html 
19 _dealing_with_json_arrays_and_objects_in_php.html 
20      _dealing_with_null_values.html 
21        _delete_an_index.html 
22        _deleting_an_index.html 
23       _deleting_documents.html 
24    _deploying_in_jboss_eap6_module.html 
25   _developer_guide_adding_a_new_protocol.html 
26        _elasticsearch_net.html 
27         _empty_search.html 
28       _exact_value_fields.html 
29      _executing_aggregations.html 
..             ... 
923        suggester-context.html 
924      synonyms-analysis-chain.html 
925     synonyms-expand-or-contract.html 
926           tasks.html 
927       term-level-queries.html 
928         term-vector.html 
929        term-vs-full-text.html 
930      terms-list-query-usage.html 
931        testing-framework.html 
932         time-based.html 
933         time-units.html 
934         token-count.html 
935          top-hits.html 
936          translog.html 
937        transport-client.html 
938       unicode-normalization.html 
939         unit-tests.html 
940         update-doc.html 
941         user-based.html 
942    using-elasticsearch-test-classes.html 
943    using-kibana-for-the-first-time.html 
944      using-language-analyzers.html 
945        using-stopwords.html 
946        using-synonyms.html 
947    verbatim-and-strict-query-usage.html 
948          visualize.html 
949        watch-definition.html 
950        watch-log-data.html 
951       working-with-plugins.html 
952        writing-queries.html 

               ParentId 
0             nan 
1             nan 
2             nan 
3  35958492,nan,35374339,31180988,29818589,32869841 
4            34509058 
5            33398143 
6 33398143,31836937,34069554,31967672,34006197,3... 
7           nan,nan,nan 
8           nan,30063221 
9            29526147 
10     31311687,34323428,34255519,30517904 
11           36026339 
12     33395412,nan,28989479,36325156,nan 
13           34143066 
14           34143066 
15           30886182 
16    31591210,35914330,32246656,32463762,nan 
17          35078736,nan 
18       33398143,34631940,36569635 
19             nan 
20         nan,nan,nan,nan,nan 
21           32872677 
22          nan,22924300 
23             nan 
24            nan,nan 
25           34132278 
26          nan,30956854 
27         31027308,33658619 
28        29923047,33757901,nan 
29       nan,nan,30280206,nan,nan 
..             ... 
923 37189942,36802797,36802797,35683069,nan,362040... 
924           34358802 
925         33250379,34358802 
926           36508292 
927           34312196 
928        32269054,nan,34680820 
929    36414571,32264571,32075616,32619266 
930         36697563,36565189 
931           30755194 
932   28984723,33827559,32635456,32718927,nan 
933           36752424 
934  36025764,34148626,32059804,34882813,34171223 
935     nan,nan,nan,29896839,nan,31411664 
936       33110371,33110371,35465922 
937 nan,35064511,35876176,31453270,nan,27170739,25... 
938            nan 
939            nan 
940      nan,33218812,31424380,nan,nan 
941            nan 
942            nan 
943           33996619 
944         30195926,37218517 
945 31625943,33370591,36794324,30132959,32694958,3... 
946       29254643,34255519,nan,nan 
947         37697866,37697866 
948           35347332 
949           31831689 
950         33831247,31831689 
951         37007206,31809884 
952            nan 

[953 rows x 2 columns]
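
The result above still contains nan entries and repeated IDs (for example 37697866,37697866), while the question asks for unique pages with their IDs. The following is a minimal sketch of one way to tidy that up, not a replacement for the code above. It runs on made-up rows standing in for elastic.csv and assumes pandas 0.25 or later, where `DataFrame.explode` is available. Only the column names ParentId and Body come from the original data.

```python
import pandas as pd

# Made-up rows standing in for elastic.csv (assumption: only ParentId and Body matter here).
df = pd.DataFrame({
    'ParentId': ['35564374', None, '46568374', '35564374', '34203007'],
    'Body': [
        'see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html',
        'see https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html',
        'try https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html first',
        'again https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html',
        'use https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html here',
    ],
})

pat = r'elastic.co/guide(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# One row per (ParentId, URL): findall yields a list per row, explode unpacks it.
urls = df.assign(Body=df['Body'].str.findall(pat)).explode('Body')

# Drop rows without a ParentId or without any matched URL.
urls = urls.dropna(subset=['ParentId', 'Body'])

# Keep only links to .html pages and take the part after the last '/'.
urls = urls[urls['Body'].str.endswith('.html')]
urls = urls.assign(url=urls['Body'].str.rsplit('/', n=1).str[-1])

# Remove repeated (page, ParentId) pairs, then join the IDs per page.
summary = (urls.drop_duplicates(subset=['url', 'ParentId'])
               .groupby('url')['ParentId']
               .agg(', '.join)
               .reset_index())
print(summary)
```

Dropping the missing ParentId values before grouping keeps nan out of the joined strings, and `drop_duplicates` on the (url, ParentId) pair ensures each ID appears at most once per page.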