Grouping similar URLs / finding common URL patterns (Python)

I have about 100k URLs, each of which has been labeled as positive or negative. I want to see what kinds of URLs correspond to the positive class (and likewise for the negative class).

I started by grouping on subdomains and identified the most common positive and negative subdomains.
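For illustration, the subdomain grouping could look roughly like the sketch below. It assumes the labeled data is available as `(url, label)` pairs and that every URL carries a scheme such as `http://`; the function name `count_by_host` and the sample paths are made up for this example.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def count_by_host(labeled_urls):
    """Map each hostname to a Counter of class labels."""
    by_host = defaultdict(Counter)
    for url, label in labeled_urls:
        # hostname is None when the URL has no scheme/netloc part
        host = urlsplit(url).hostname or ""
        by_host[host][label] += 1
    return by_host


# Illustrative input only; these paths are not from the real data set.
sample = [
    ("http://www.clarin.com/a", "positive"),
    ("http://www.clarin.com/b", "negative"),
    ("http://www.clarin.com/c", "negative"),
    ("http://blogs.example.org/x", "positive"),
]
for host, labels in count_by_host(sample).items():
    print(host, labels["positive"], labels["negative"])
```

Hosts whose two counts come out close to each other are the ones that need the finer-grained pattern search.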

Now, for subdomains whose ratio of positive to negative URLs is roughly even, I want to dissect further and look for patterns. Example patterns:

The links are not restricted to clarin.com.

Any suggestions on how to uncover such patterns?


2016-07-14 The Wanderer

Solved this: I ended up treating it as a largest common substring problem.

The solution involves building a parse tree from every character of the URL. Each node in the tree stores positive, negative, and total counts. Finally, the tree is pruned to return the most common patterns.

Code:

def find_patterns(incoming_urls):
    urls = {}
    # Build the tree: every prefix of every URL is a node with its own counts.
    for line in incoming_urls:
        # Each entry is assumed to look like "<url>____<class>",
        # where <class> is 'positive' or 'negative'.
        url, atype = line.strip().split("____")
        # Use only the first 100 characters to avoid building a sparse tree.
        bound = min(len(url), 100) + 1
        for x in range(1, bound):
            prefix = url[:x].lower()
            if prefix not in urls:
                urls[prefix] = {'positive': 0, 'negative': 0, 'total': 0}
            urls[prefix][atype] += 1
            urls[prefix]['total'] += 1

    new_urls = {}
    # Prune the tree.
    for url in urls:
        # To count as a common pattern, a prefix needs at least 5 occurrences.
        if urls[url]['total'] < 5:
            continue
        urls[url]['negative_percentage'] = (float(urls[url]['negative']) * 100) / urls[url]['total']
        # Assuming I am interested in URL patterns for the negative class.
        if urls[url]['negative_percentage'] < 85.0:
            continue
        length = len(url)
        found = False
        # Check whether a longer prefix exists with the same total count.
        for second in urls:
            if len(second) <= length:
                continue
            if url == second[:length] and urls[url]['total'] == urls[second]['total']:
                found = True
                break
        # Keep only maximal prefixes longer than 20 characters.
        if not found and len(url) > 20:
            new_urls[url] = urls[url]

    print("URL Pattern; Positive; Negative; Total; Negative (%)")
    for url in new_urls:
        print("%s; %d; %d; %d; %.2f" % (
            url, new_urls[url]['positive'], new_urls[url]['negative'], new_urls[url]['total'],
            new_urls[url]['negative_percentage']))
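
The pruning loop compares every prefix with every other prefix, which may get slow for 100k URLs. Below is a minimal sketch of a linear-time variant, assuming the same `<url>____<class>` input format. The name `find_patterns_fast` and its keyword parameters are hypothetical and only mirror the thresholds used above. It relies on the fact that counts can only shrink as a prefix grows, so a prefix is redundant exactly when one of its one-character extensions has the same total.

```python
def find_patterns_fast(incoming_urls, min_total=5, min_negative_pct=85.0,
                       min_length=21, max_prefix=100):
    """Return {prefix: stats} for maximal prefixes dominated by the negative class."""
    counts = {}
    for line in incoming_urls:
        url, atype = line.strip().split("____")
        url = url[:max_prefix]
        for x in range(1, len(url) + 1):
            prefix = url[:x].lower()
            node = counts.setdefault(
                prefix, {'positive': 0, 'negative': 0, 'total': 0})
            node[atype] += 1
            node['total'] += 1

    # A prefix is redundant if a one-character extension has the same total.
    redundant = set()
    for prefix, node in counts.items():
        parent = prefix[:-1]
        if parent in counts and counts[parent]['total'] == node['total']:
            redundant.add(parent)

    patterns = {}
    for prefix, node in counts.items():
        if (node['total'] < min_total or len(prefix) < min_length
                or prefix in redundant):
            continue
        pct = 100.0 * node['negative'] / node['total']
        if pct >= min_negative_pct:
            patterns[prefix] = dict(node, negative_percentage=pct)
    return patterns


# Illustrative data only: five negative URLs sharing a prefix, one positive.
data = ["http://example.com/sports/a%d____negative" % i for i in range(1, 6)]
data.append("http://example.com/news/b1____positive")
for prefix, stats in find_patterns_fast(data).items():
    print(prefix, stats['total'], stats['negative_percentage'])
```

Returning a dictionary instead of printing makes it easier to sort the patterns or to run the same function with the roles of the two classes swapped.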


2016-07-16 04:38:52
