Sort DataFrame rows independently by values in another DataFrame

Given two DataFrames:

import pandas as pd 
import numpy as np 

d1 = {} 
d2 = {} 

np.random.seed(5) 
for col in list("ABCDEF"): 
    d1[col] = np.random.randn(12) 
    # randint excludes its upper bound, so 101 keeps the 0..100 range of the deprecated random_integers
    d2[col+'2'] = np.random.randint(0, 101, 12) 

t_index = pd.date_range(start = '2015-01-31', periods = 12, freq = "M") 

dat1 = pd.DataFrame(d1, index = t_index) 
dat2 = pd.DataFrame(d2, index = t_index)

I want to sort each row of dat1 by the values in the corresponding row of dat2 and then extract a subset of the ordered data from dat1. Below is an example in which the 5 entries of dat1 with the largest dat2 values are extracted per row, ordered from smallest to largest dat2 value. For example, with:

                   A         B         C         D         E         F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843  -0.25395
2015-02-28 -0.330870 -1.168279 -0.042419      -0.8 -0.042166   0.42985

            A2  B2  C2  D2  E2  F2
2015-01-31  47  47  82  66  64  40
2015-02-28  30  16  60  57  77  74

I would get:

            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E

                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870      -0.8 -0.042419  0.429850 -0.042166

Here is my solution. The biggest problem is that this code does not handle NA values in dat1 or dat2, which is a serious limitation that needs to be fixed.

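def sortByAnthr(X,Y): 
    # order the items of X by the matching values in Y (ascending) 
    return([x for (x,y) in sorted(zip(X,Y), key=lambda pair: pair[1])]) 

def r_selectr(dat2,dat1, n): 
    # per row: column labels of dat1 ordered by dat2, keeping the last n; 
    # result_type='expand' turns each returned list into columns (pandas >= 0.23) 
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x.index,dat2.loc[x.name,:]),axis=1,result_type='expand').iloc[:,-n:] 
    ordr_cols.columns = list(range(0,n)) #assign column names 

    # look up the dat1 values in that order (.loc replaces the removed .ix) 
    ordr_r = ordr_cols.apply(lambda x: dat1.loc[x.name,x.values].tolist(),axis=1,result_type='expand') 
    return([ordr_cols, ordr_r]) 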

ordr_cols,ordr_r = r_selectr(dat2,dat1,5) 

ordr_cols.iloc[:2,:] 
            0  1  2  3  4
2015-01-31  A  B  E  D  C
2015-02-28  A  D  C  F  E

ordr_r.iloc[:2,:] 
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28 -0.330870      -0.8 -0.042419  0.429850 -0.042166
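
For comparison, the same row-wise ordering can probably be done without apply by sorting positions with NumPy. The following is only a sketch: the function name top_n_by_other is made up for illustration, it assumes both frames have the same shape and matching column order, and it does not handle NA values. The two example rows are typed in by hand from the tables above.

```python
import numpy as np
import pandas as pd

def top_n_by_other(values, keys, n):
    """Order each row of `values` by the same row of `keys`; keep the last n."""
    # stable sort so that ties keep their original column order, like sorted()
    order = np.argsort(keys.to_numpy(), axis=1, kind="stable")[:, -n:]
    # column labels of `values` in sorted order, one row per original row
    labels = pd.DataFrame(values.columns.to_numpy()[order], index=values.index)
    # the matching values, gathered row by row
    picked = pd.DataFrame(
        np.take_along_axis(values.to_numpy(), order, axis=1), index=values.index
    )
    return labels, picked

idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
dat1 = pd.DataFrame(
    [[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.25395],
     [-0.330870, -1.168279, -0.042419, -0.8, -0.042166, 0.42985]],
    index=idx, columns=list("ABCDEF"))
dat2 = pd.DataFrame(
    [[47, 47, 82, 66, 64, 40],
     [30, 16, 60, 57, 77, 74]],
    index=idx, columns=[c + "2" for c in "ABCDEF"])

labels, picked = top_n_by_other(dat1, dat2, 5)
print(labels)
print(picked)
```

On these two rows the labels come out as A B E D C and A D C F E, matching the tables above.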

For example, with NAs like the following, the code above fails to sort correctly:

dat1.iloc[[1,2],[1,3,5]]=np.nan 
dat2.iloc[[1,4],[2,4,5]]=np.nan
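
A likely reason for the failure is that every ordering comparison against NaN is False, so sorted() has no consistent place to put a NaN key and the result depends on the input order. The small sketch below uses made-up pairs (not values from dat1 or dat2) to show the comparisons, together with one possible workaround: dropping the NaN keys before sorting.

```python
import numpy as np

nan = np.nan

# Every ordering comparison with NaN evaluates to False
lt = nan < 1.0
gt = nan > 1.0
eq = nan == nan

# Illustrative (label, key) pairs; "B" has a missing key
pairs = [("A", 3.0), ("B", nan), ("C", 1.0)]

# Sorting with the NaN key included: the position of the remaining
# items is not reliable because comparisons with NaN are always False
by_key = sorted(pairs, key=lambda pair: pair[1])

# Dropping NaN keys first gives a well-defined order
clean = sorted((p for p in pairs if not np.isnan(p[1])), key=lambda pair: pair[1])

print(lt, gt, eq)
print(by_key)
print(clean)
```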

Source

2016-04-04 Gene Burinsky

Here is my solution. It now handles NAs by intersecting, for each row, the positions of the non-NA values in dat1 and dat2. This creates a problem with apply, however, because apply needs an output of the same size for every row. The function that fills in the elements that cannot be sorted is fillVacuum.

def fillVacuum(toFill,MatchLengthOf): 
    #left-pad toFill in place with NaN until it is as long as MatchLengthOf 
    n_missing = len(MatchLengthOf)-len(toFill) 
    if n_missing > 0: 
        toFill[:0] = [np.nan]*n_missing 

def sortByAnthr(X,Y,Xindex): 
    #intersect the positions of non-NA columns between the two datasets 
    idx = np.intersect1d(np.flatnonzero(X.notnull().to_numpy()),np.flatnonzero(Y.notnull().to_numpy())) 

    #order the subset of X.index by Y (Y is indexed by position, hence .iloc) 
    ordrX = [x for (x,y) in sorted(zip(Xindex[idx],Y.iloc[idx]), key=lambda pair: pair[1])] 

    #apply needs rows of equal length, so pad the removed positions with NaN 
    fillVacuum(ordrX, Xindex) 

    return(ordrX) 

def OrderRow(row,df): 
    #look up the non-NA column labels of this row in df (.loc replaces the removed .ix) 
    ordrd_row = df.loc[row.name,row.dropna().values].tolist() 
    fillVacuum(ordrd_row, row) 
    return(ordrd_row) 

def r_selectr(dat2,dat1, n): 
    #result_type='expand' turns each returned list into columns (pandas >= 0.23) 
    ordr_cols = dat1.apply(lambda x: sortByAnthr(x,dat2.loc[x.name,:],x.index),axis=1,result_type='expand').iloc[:,-n:] 
    ordr_cols.columns = list(range(0,n)) #assign interpretable column names 

    ordr_r = ordr_cols.apply(lambda x: OrderRow(x,dat1),axis=1,result_type='expand') 
    return([ordr_cols, ordr_r]) 

ordr_cols,ordr_r = r_selectr(dat2,dat1,5)

These functions produce the following:

dat1.iloc[:2,:] 
                   A         B         C         D         E         F
2015-01-31  0.441227 -0.817548 -0.723062 -0.205149  0.230843 -0.253954
2015-02-28       NaN       NaN -0.042419      -0.8       NaN  0.429850

dat2.iloc[:2,:] 
             A2  B2  C2  D2  E2   F2
2015-01-31   47  47  82  66  64   40
2015-02-28  NaN  16  60  57  77  NaN

ordr_cols.iloc[:2,:] 
              0    1    2  3  4
2015-01-31    A    B    E  D  C
2015-02-28  NaN  NaN  NaN  D  C

ordr_r.iloc[:2,:] 
                   0         1         2         3         4
2015-01-31  0.441227 -0.817548  0.230843 -0.205149 -0.723062
2015-02-28       NaN       NaN       NaN      -0.8 -0.042419
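
The NA handling can probably also be written without apply. This is a sketch of a different technique (NumPy argsort with a mask), not the method used above. The name top_n_by_other_na is hypothetical. It assumes both frames are numeric with the same shape and column order, and that -inf never occurs as a real sort key. Cells that are NA in either frame get the lowest key, so they collect at the front of each row and are then blanked out. The two example rows are typed in from the output shown above.

```python
import numpy as np
import pandas as pd

def top_n_by_other_na(values, keys, n):
    """Row-wise ordering of `values` by `keys`, skipping cells that are NA in either."""
    v = values.to_numpy(dtype=float)
    k = keys.to_numpy(dtype=float)
    invalid = np.isnan(v) | np.isnan(k)
    # unusable cells get the lowest possible key so they sort to the front
    k_sort = np.where(invalid, -np.inf, k)
    order = np.argsort(k_sort, axis=1, kind="stable")[:, -n:]
    bad = np.take_along_axis(invalid, order, axis=1)

    # column labels in sorted order; unusable positions become NaN
    labels = values.columns.to_numpy().astype(object)[order]
    labels[bad] = np.nan
    # matching values; unusable positions become NaN
    picked = np.take_along_axis(v, order, axis=1)
    picked[bad] = np.nan
    return (pd.DataFrame(labels, index=values.index),
            pd.DataFrame(picked, index=values.index))

idx = pd.to_datetime(["2015-01-31", "2015-02-28"])
dat1 = pd.DataFrame(
    [[0.441227, -0.817548, -0.723062, -0.205149, 0.230843, -0.253954],
     [np.nan, np.nan, -0.042419, -0.8, np.nan, 0.429850]],
    index=idx, columns=list("ABCDEF"))
dat2 = pd.DataFrame(
    [[47, 47, 82, 66, 64, 40],
     [np.nan, 16, 60, 57, 77, np.nan]],
    index=idx, columns=[c + "2" for c in "ABCDEF"])

labels, picked = top_n_by_other_na(dat1, dat2, 5)
print(labels)
print(picked)
```

For the second row only C and D are usable in both frames, so the sketch yields NaN NaN NaN D C, the same layout as the output above.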

Source

2016-04-04 19:59:10
