You can use get_dummies with concat. If the values in the user or item columns are numeric, cast them to string with astype first:
import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
                   'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                   'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                  columns=['user', 'item', 'affinity'])
print(df)
user item affinity
0 1 13 0.1
1 2 11 0.4
2 3 14 0.9
3 4 12 1.0
df1 = df.user.astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
user1 user2 user3 user4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
df2 = df.item.astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
item11 item12 item13 item14
0 0 0 1 0
1 1 0 0 0
2 0 0 0 1
3 0 1 0 0
print(pd.concat([df1, df2, df.affinity], axis=1))
user1 user2 user3 user4 item11 item12 item13 item14 affinity
0 1 0 0 0 0 0 1 0 0.1
1 0 1 0 0 1 0 0 0 0.4
2 0 0 1 0 0 0 0 1 0.9
3 0 0 0 1 0 1 0 0 1.0
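As a side note, roughly the same frame can be built with a single get_dummies call over both columns; prefix and prefix_sep are standard pandas parameters, but take this as an untested sketch rather than the benchmarked code above:

# Sketch: encode both columns in one call; prefix/prefix_sep reproduce
# the user1..user4 / item11..item14 column names shown above.
pd.concat([pd.get_dummies(df[['user', 'item']].astype(str),
                          prefix=['user', 'item'], prefix_sep=''),
           df.affinity], axis=1)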
Timings:
len(df) = 4:
In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 690 µs per loop
len(df) = 40:
df = pd.concat([df]*10).reset_index(drop=True)
In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 719 µs per loop
len(df) = 400:
df = pd.concat([df]*100).reset_index(drop=True)
In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 748 µs per loop
len(df) = 4k:
df = pd.concat([df]*1000).reset_index(drop=True)
In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 761 µs per loop
len(df) = 40k:
df = pd.concat([df]*10000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop
len(df) = 400k:
df = pd.concat([df]*100000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
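If you want to reproduce the scaling test end to end, here is a minimal self-contained sketch (the loop, the sizes, and the 100-repetition average are my assumptions, not the exact session above); it rebuilds df1 and df2 at each size so the timed concat always sees inputs of matching length:

import timeit
import pandas as pd

base = pd.DataFrame({'user': [1, 2, 3, 4],
                     'item': [13, 11, 14, 12],
                     'affinity': [0.1, 0.4, 0.9, 1.0]},
                    columns=['user', 'item', 'affinity'])

for factor in (1, 10, 100, 1000, 10000):
    df = pd.concat([base] * factor).reset_index(drop=True)
    # Rebuild the dummy frames at each size so the timed concat
    # operates on frames of the same length as df.
    df1 = df.user.astype(str).str.get_dummies()
    df1.columns = ['user' + str(x) for x in df1.columns]
    df2 = df.item.astype(str).str.get_dummies()
    df2.columns = ['item' + str(x) for x in df2.columns]
    secs = timeit.timeit(lambda: pd.concat([df1, df2, df.affinity], axis=1),
                         number=100) / 100
    print('len(df) = %d: %.3f ms per concat' % (len(df), secs * 1000))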
Cool! Will this scale in terms of runtime for a large dataset? – Sangram
Hmm, dummies need a lot of memory in large DataFrames. Is that a problem? – jezrael
Memory isn't a problem at the moment, since I can encode features one at a time and save them to file. But I'm wondering how long it will take. I'll run a benchmark and post the results here either way. – Sangram
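On the memory point raised above: pd.get_dummies also accepts a sparse=True flag that backs the 0/1 columns with sparse storage, which can help when there are many distinct users or items. A minimal sketch, with the memory_usage comparison as illustration only:

import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]},
                  columns=['user', 'item', 'affinity'])

# sparse=True stores the indicator columns sparsely instead of as
# dense arrays of mostly zeros.
dummies = pd.get_dummies(df[['user', 'item']].astype(str),
                         prefix=['user', 'item'], prefix_sep='',
                         sparse=True)
out = pd.concat([dummies, df.affinity], axis=1)
print(out.memory_usage(deep=True))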