Keras Practice 2 -- Tianchi Newcomer Competition [Offline Round] (1)

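This time I use the offline round of the Tianchi newcomer competition for hands-on practice. Because of limited time and compute, and above all limited experience, this attempt still has many problems: I do not yet know the traditional machine learning algorithms, I am unfamiliar with much of deep learning, and I am not yet proficient at data preprocessing.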

This article is an original work by 拎著激光炮的野人. Reposting is welcome; please credit the author and link to the original.

http://m.itdecent.cn/p/ef1fc958e30f


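Problem analysis

After reading the problem statement carefully, I found that the task is an O2O-style prediction: the items mostly come from the service industry and are bought online but consumed offline, so the item's geographic location matters a great deal. To keep things simple, we assume that different item categories are neither substitutable nor strongly correlated, and that the user's geo location is strongly correlated with the item's geo location.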

1. Import the data

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import math
from sklearn.metrics import f1_score
idx = pd.IndexSlice
# read the items table
items = pd.read_csv("./tianchi_fresh_comp_train_item.csv")
print("read items", items.count()[0])
actions = pd.read_csv("./tianchi_fresh_comp_train_user.csv")
print("user action read, total:", actions.count()[0])
read items 620918
user action read, total: 23291027
# Read and convert the actions table (all user behavior)
# TODO: ignore all geo information for now

def prepare_data(actions, items):
    #convert time
    actions.time = pd.to_datetime(actions.time)

    #index user
    user_index = actions.user_id.drop_duplicates()
    user_index = user_index.reset_index(drop=True).reset_index().set_index("user_id")
    user_index.columns = ['user']
    actions = pd.merge(actions, user_index, left_on='user_id', right_index=True, how='left')

    #index item
    item_ids = actions.item_id.drop_duplicates()
    item_ids = item_ids.reset_index(drop=True).reset_index().set_index("item_id")
    item_ids.columns = ['item']
    actions = pd.merge(actions, item_ids, left_on='item_id', right_index=True, how='left')

    items = pd.merge(items, item_ids, left_on='item_id', right_index=True, how='left')

    # index category
    category = actions.item_category.drop_duplicates()
    category = category.reset_index(drop=True).reset_index().set_index("item_category")
    category.columns = ['category']
    actions = pd.merge(actions, category, left_on='item_category', right_index=True, how='left')

    #drop user_id, item_id
    actions = actions.drop(['user_id', 'item_id', 'item_category'], axis=1)
    items = items.drop(['item_id', 'item_category'], axis=1)

    # reorder columns
    actions = actions.loc[:, ['user', 'item', 'behavior_type', 'category', 'time', 'user_geohash']]
    
    #add date and hour
    actions['date'] = actions.time.dt.date
    actions['hour'] = actions.time.dt.hour
    return actions, items, user_index, item_ids, category

# actions, items = prepare_data(actions, items)
actions, items, user_index, item_ids, _ = prepare_data(actions, items)
actions.head()


items.head()
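
The index-building steps inside `prepare_data` (drop duplicates, reset the index, merge back) can also be expressed with `pd.factorize`, which assigns integer codes in order of first appearance. A minimal sketch on made-up data; the toy `user_id` values and the name `toy_user_index` are hypothetical:

```python
import pandas as pd

# Toy stand-in for the raw actions table (values are made up).
toy = pd.DataFrame({"user_id": [9001, 42, 9001, 7, 42]})

# factorize returns integer codes (in order of first appearance)
# and the array of unique original values.
codes, uniques = pd.factorize(toy["user_id"])
toy["user"] = codes

# Lookup table from the dense index back to the original id,
# playing the same role as user_index in prepare_data.
toy_user_index = pd.Series(uniques, name="user_id")
print(toy)
print(toy_user_index)
```

The lookup table is what later allows the predictions to be mapped back to the original `user_id` and `item_id` values.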

2. Explore the data

geo = pd.concat([items.item_geohash, actions.user_geohash]).drop_duplicates()
item_geo = items.item_geohash.drop_duplicates().dropna()
print("distinct item geohashes:", item_geo.count())
action_geo = actions.user_geohash.drop_duplicates().dropna()
print("distinct user-action geohashes:", action_geo.count())
print("overlap between distinct item and user-action geohashes:\n", 
      "intersection / user-action geohashes:",
      len(action_geo[action_geo.isin(item_geo)]) / len(action_geo),
      "\nintersection / item geohashes:",
      len(item_geo[item_geo.isin(action_geo)]) / len(item_geo)
     )
del item_geo
del action_geo
# The overlap is only partial: about 45% of item geohashes also appear among user geohashes,
# but only about 2.5% of user geohashes match an item geohash

distinct item geohashes: 57358
distinct user-action geohashes: 1018981
overlap between distinct item and user-action geohashes:
intersection / user-action geohashes: 0.025223237724746585
intersection / item geohashes: 0.44809791136371563

ag = actions.loc[:, ['user', 'user_geohash']].dropna()
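print("user actions with a geohash:", len(ag))
ag = ag.drop_duplicates()
print("user actions with a geohash (distinct user/geohash pairs):", len(ag))
ag['c'] = 1
ag = ag.loc[:, ['user', 'c']].groupby('user').sum()
print(ag.describe())
del ag
# Among users with geohash data, the median number of distinct geohashes is 68
# (25th percentile: 42), so most users' geohash changes frequently.
# A user is at many different geo locations at different times (so the geohash is fairly fine-grained and may be some distance away from an item's geohash).
# One thing to consider: do two geohashes that are close in time also tend to be close in distance?

user actions with a geohash: 7380017
user actions with a geohash (distinct user/geohash pairs): 1257674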
c
count 16240.000000
mean 77.442980
std 53.782759
min 1.000000
25% 42.000000
50% 68.000000
75% 103.000000
max 709.000000
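
The per-user count of distinct geohashes computed above can be written more directly with `groupby(...).nunique()`. A small sketch on made-up rows (the geohash strings are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "user": [0, 0, 0, 1, 1, 2],
    "user_geohash": ["9q8yy", "9q8yy", "9q8yz", "9q5ct", None, None],
})

# nunique ignores missing values by default, so a user without any
# geohash gets a count of 0 instead of being dropped.
per_user = toy.groupby("user")["user_geohash"].nunique()
print(per_user)
```

Unlike the `dropna()` version above, this keeps users that have no geohash at all, which changes the statistics reported by `describe()`.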

df = actions[actions.user_geohash.notna()]
print("actions with geo info:", len(df), "share whose geohash matches an item geohash:", len(df[df.user_geohash.isin(items.item_geohash)]) / len(df))
del df

actions with geo info: 7380017 share whose geohash matches an item geohash: 0.03044234179948366

3. Extract features

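First, decide which features to extract. They should capture the characteristics of users, items, item categories, and locations.

  1. User: total number of actions; also, how to express a user's purchase preferences, e.g. a preference for buying a certain category of items?
  2. Item/category: how many users bought it in total, and the total action count over all users
  3. Category: how many items it contains in total
  4. Time characteristics of the features above?
  5. Geographic characteristics of the items above
  6. Cross features of the items above, e.g. a user who is especially fond of buying a particular item
  7. Time-related features: a user's action counts on a given day, used to predict whether they buy the next day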
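
Item 6 in the list, cross features, could for instance count how often each user bought from each category. A minimal sketch, assuming `behavior_type == 4` means a purchase (as in the label code in section 3.7); the toy rows and the column name `uc_buy` are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "user":          [0, 0, 0, 0, 1, 1],
    "category":      [5, 5, 5, 7, 7, 7],
    "behavior_type": [1, 4, 4, 4, 1, 4],
})

# Keep purchases only, then count rows per (user, category) pair.
buys = toy[toy.behavior_type == 4]
user_cat_buys = (
    buys.groupby(["user", "category"]).size().rename("uc_buy").reset_index()
)
print(user_cat_buys)
```

Such a table could then be merged onto the training rows by `user` and `category`, in the same way the user, item, and category features are merged later.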

3.0 Save features

saved_actions = actions
print(len(actions))
actions.head()

23291027


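# actions = saved_actions  # restore actions
# user indices start at 0, so this corresponds to 20000 users
print("Largest user index: {}".format(actions.user.max()))
Largest user index: 19999
# Restrict the set of users to reduce the data volume during feature extraction; it was far too slow. To be removed later.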
# actions = actions.set_index("user").loc[:10000, :]
# actions = actions.reset_index()
# print(actions.user.max())
# actions.head()

3.1 Extract user features

# Total number of actions per user, by behavior type
user = actions.groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user.rename(columns={'item': 'c'}, level=0, inplace=True)
user.head()
# Number of distinct items per user, by behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['user', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
# Number of distinct item categories per user, by behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['user', 'behavior_type'])[['category']].count().unstack().fillna(0).astype(int)
user = user.merge(c, left_index=True, right_index=True, how='left')
user.head()
user = pd.DataFrame(user.values, index=user.index, columns=["u{}".format(i) for i in range(0, 12, 1)])
user.head()
user = user / (user.mean() + user.std() * 3)
user.head()
# user.to_csv("user.csv")
# del user

3.2 Item statistics

# Number of actions on each item, by behavior type
good = actions.groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good.rename(columns={'user': 'c'}, level=0, inplace=True)
good.head()
# Number of distinct users who acted on each item, by behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'item']) \
    .groupby(['item', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
good = good.merge(c, left_index=True, right_index=True, how='left')
good.head()
good = pd.DataFrame(good.values, index=good.index, columns=["g{}".format(i) for i in range(0, 8, 1)])
good.head()
good = good / (good.mean() + good.std() * 3)
good.head()

![image.png](https://upload-images.jianshu.io/upload_images/7100403-5cbe4cd6a773ce6b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

# good.to_csv("good.csv")
# del good

3.3 Category features

# Number of actions on each category, by behavior type
cat = actions.groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat.rename(columns={'user': 'c'}, level=0, inplace=True)
cat.head()
# Number of distinct users who acted on each category, by behavior type
c = actions.drop_duplicates(['user', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['user']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
# Number of distinct items in each category, by behavior type
c = actions.drop_duplicates(['item', 'behavior_type', 'category']) \
    .groupby(['category', 'behavior_type'])[['item']].count().unstack().fillna(0).astype(int)
cat = cat.merge(c, left_index=True, right_index=True, how='left')
cat.head()
cat = pd.DataFrame(cat.values, index=cat.index, columns=["c{}".format(i) for i in range(0, 12, 1)])
cat.head()
cat = cat / (cat.mean() + cat.std() * 3)
cat.head()
# cat.to_csv('cat.csv')
# del cat
del c

3.4 Time features

3.5 Geographic features

3.6 Cross features

3.7 Actions within 24 hours

def read_user():
    return pd.read_csv("user.csv", index_col=0)
def read_good():
    return pd.read_csv("good.csv", index_col=0)
def read_cat():
    return pd.read_csv('cat.csv', index_col=0)
def read_label():
    return pd.read_csv("label.csv", index_col=0)
# Label: whether the user buys on the following day
label = actions[actions.behavior_type == 4].copy()
label.date = (pd.to_datetime(label.date) - np.timedelta64(1, 'D'))
# label.date = label.date.dt.date
print(label.date.dtypes)
label['buy'] = 1
# label = label.loc[:, ['date', 'user','category','item','buy']].groupby(['date', 'user','category','item']).sum()
label = label.set_index(['date', 'user']).loc[:, ['item', 'category', 'buy']].drop_duplicates()
label.set_index(['category','item',], append=True, inplace=True)
label.head()
datetime64[ns]
# label.to_csv("label.csv")
# del label
# read_label().head()
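
The label construction above shifts each purchase date back by one day, so that a purchase on day D+1 becomes the target for the features observed on day D. A toy illustration (dates and ids are made up, and `toy_label` is a hypothetical name):

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-12-01", "2014-12-02", "2014-12-02"]),
    "user": [0, 0, 1],
    "item": [10, 11, 10],
    "behavior_type": [1, 4, 4],
})

# Purchases only; move the date one day earlier so the row lines up
# with the features observed on the previous day.
toy_label = toy[toy.behavior_type == 4].copy()
toy_label["date"] = toy_label["date"] - pd.Timedelta(days=1)
toy_label["buy"] = 1
print(toy_label[["date", "user", "item", "buy"]])
```

A consequence of this shift is that rows for the final day in the data have no label available, since the following day's purchases are not part of the data.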
# Count each user's actions per day
d_action = actions.copy()
d_action['d']  = 1
d_action.date = pd.to_datetime(d_action.date)
# select the counter column before summing, so non-numeric columns are not aggregated
d_action = d_action.groupby([ 'date', 'user', 'category', 'item', 'behavior_type'])[['d']].sum()
d_action = d_action / (d_action.mean() + d_action.std() * 3)
d_action = d_action.unstack().fillna(0).astype(np.float32)
d_action.columns = d_action.columns.droplevel(0)
d_action.columns = ['d_t{}'.format(i) for i in range(1, 5, 1)]
d_action.head()
# d_action.to_csv('d_action.csv')
# pd.read_csv('d_action.csv', index_col=0).dtypes
# A user's actions by hour, over 3 hours
x_action = actions.copy()
x_action['c']  = 1
x_action.date = pd.to_datetime(x_action.date).dt.date
# The data volume is too large, so only the last 3 hours of each day are considered
x_action = x_action.loc[x_action.hour.isin([23, 22, 21])]
x_action.date = pd.to_datetime(x_action.date)
# select the counter column before summing, so non-numeric columns are not aggregated
x_action = x_action.groupby([ 'date', 'user', 'category', 'item', 'hour', 'behavior_type'])[['c']].sum()
x_action = x_action.unstack()
x_action = x_action / (x_action.mean() + x_action.std() * 3)
x_action = x_action.stack().astype(np.float32)
x_action.head()
x_action = x_action.unstack(['hour', 'behavior_type'], fill_value=0).sort_index(axis=1)
x_action.columns = x_action.columns.droplevel(0)
# print(x_action.describe())
# Reindex so that the frame is always expanded to all 12 columns (3 hours x 4 behavior types),
# in the same (hour, behavior_type) order as the sorted columns above
x_action = x_action.reindex(columns=pd.MultiIndex.from_product([range(21, 24, 1), range(1, 5, 1)], names=['hour', 'behavior_type']))
x_action = x_action.fillna(0)
# x_action.info()
# x_action[:, :] = x_action[:, :].astype(np.int8)
# x_action.info()
# x_action = x_action.apply(lambda x: x.astype(np.int32))
x_action = pd.DataFrame(x_action.values, index=x_action.index, columns = ["h{}_{}".format(h, t) for h in range(21, 24, 1) for t in [1, 2, 3, 4]])
# print(x_action.describe())
x_action.head()
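
When some (hour, behavior_type) combination never occurs in the data, `unstack` simply omits that column. One way to guarantee a fixed column layout is to `reindex` against the full product of both levels. A minimal sketch on made-up counts:

```python
import pandas as pd

toy = pd.DataFrame({
    "user": [0, 0, 1],
    "hour": [21, 23, 22],
    "behavior_type": [1, 4, 1],
    "c": [2, 1, 3],
})

# Only 3 of the 12 possible (hour, behavior_type) columns appear here.
wide = toy.set_index(["user", "hour", "behavior_type"])["c"].unstack(
    ["hour", "behavior_type"]
)

# Full grid: 3 hours x 4 behavior types = 12 columns.
full = pd.MultiIndex.from_product(
    [range(21, 24), range(1, 5)], names=["hour", "behavior_type"]
)
wide = wide.reindex(columns=full).fillna(0)
wide.columns = ["h{}_{}".format(h, t) for h, t in wide.columns]
print(wide)
```

Because the columns are reindexed by label rather than relabeled by position, the names always match the values they describe.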
x_action = d_action.merge(x_action, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.head()
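# Merge the x and y data; how='left' filters out purchases that had no preceding actions
# Of course, this also filters out the case where I looked at n cheap items and ended up buying a best-seller in the same category
# TODO: deal with this later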
x_action = x_action.merge(label, left_index=True, right_index=True, how='left')
x_action.fillna(0, inplace=True)
x_action.buy = x_action.buy.astype(np.int8)
x_action.head()
# Attach the user, item, and category features to each action row
x_action.reset_index(inplace=True)
x_action = user.merge(x_action, right_on='user', left_index=True, how='right')
x_action = good.merge(x_action, right_on='item', left_index=True, how='right')
x_action = cat.merge(x_action, right_on='category', left_index=True, how='right')
x_action.set_index(['date', 'user', 'category', 'item'], inplace=True)
x_action.head()
#x_action.to_csv("x_action.csv") # The data volume is too large and writing is very slow; how can this be solved?
#pd.read_csv("x_action.csv").head()
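
On the question of slow CSV writes: one commonly used option is a binary format. `DataFrame.to_pickle` and `pd.read_pickle` skip text formatting and parsing and preserve dtypes and the index; whether this is fast enough for the full table here is untested. A small round-trip sketch, with a temporary file standing in for the real output path:

```python
import os
import tempfile

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    np.random.rand(1000, 4).astype(np.float32),
    columns=["a", "b", "c", "d"],
)

# Hypothetical file name inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x_action.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The round trip keeps values and dtypes unchanged.
print(restored.equals(toy), restored.dtypes.unique())
```

Storing the features as `float32` rather than `float64`, as the code above already does for `d_action`, also halves the size of the data to be written.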

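3.8 Directions for improvement

3.8.1 Later I may add a noise layer; otherwise a user who viewed an item once and bought it once might be memorized by the network as a sure buyer

3.8.2 How to train by group; after all, users usually browse one category of items and then pick one of them to buy

3.8.3 Is the "regularization" currently used reasonable? Is there a better or more general way to process the data, or would plain normalization work better?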
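
On 3.8.3: dividing by `mean + 3 * std` keeps values non-negative but leaves heavy-tailed counts heavy-tailed. A common alternative for count features is `log1p` followed by standardization. The sketch below compares the two on made-up counts; it is an illustration only, not a claim that either works better for this competition:

```python
import numpy as np
import pandas as pd

counts = pd.Series([0, 1, 2, 3, 5, 8, 500], dtype=float)

# Scaling used in this post: divide by mean + 3 * std.
scaled = counts / (counts.mean() + counts.std() * 3)

# Alternative: compress the tail with log1p, then standardize
# to zero mean and unit (sample) standard deviation.
logged = np.log1p(counts)
standardized = (logged - logged.mean()) / logged.std()

print(pd.DataFrame({"count": counts, "scaled": scaled, "log_std": standardized}))
```

With the first scaling, the single large count pushes every other value close to zero; the log transform spreads the small counts out more evenly.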

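from keras.layers import Dense, LSTM, Dropout, Input
from keras.models import Model
from sklearn.model_selection import train_test_split
# Take features and labels from the same slice so that their rows stay aligned.
# Note: the rows for 2014-12-18 are included although their next-day labels are unknown (they are all 0).
data = x_action.loc[:'2014-12-18'].values
x_train, x_test, y_train, y_test = train_test_split(data[:, :-1], data[:, -1], test_size=0.1)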
x_train, y_train
(array([[0.03628967, 0.03287605, 0.04344553, ..., 0.        , 0.        ,
         0.        ],
        [2.19627494, 2.48323743, 2.35556257, ..., 0.        , 0.        ,
         0.        ],
        [0.12262843, 0.05917688, 0.07467201, ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.08175772, 0.04602647, 0.16020541, ..., 0.        , 0.        ,
         0.        ],
        [9.42243503, 9.34921724, 9.17379612, ..., 0.        , 0.        ,
         0.        ],
        [2.09024259, 1.75996439, 2.44856316, ..., 0.        , 0.        ,
         0.        ]]), array([0., 0., 0., ..., 0., 0., 0.]))
x_train.shape # (17482, 128) for test, (9014805, 48) for all
(9014805, 48)
inputs = Input(shape=(x_train.shape[1], ))
# no activation is given, so the two hidden Dense layers are linear
x = Dense(256)(inputs)
x = Dropout(0.2)(x)
x = Dense(128)(x)
x = Dropout(0.2)(x)
outputs = Dense(1, activation='sigmoid')(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer='rmsprop',metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=64, epochs=50, validation_data=(x_test, y_test))
# Results with 10000 users:
# Epoch 11/50
# 4065324/4065324 [==============================] - 364s 90us/step - loss: 0.0219 - acc: 0.9963 - val_loss: 0.0196 - val_acc: 0.9968
Train on 9014805 samples, validate on 1001646 samples
Epoch 1/50
9014805/9014805 [==============================] - 796s 88us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0293 - val_acc: 0.9959
Epoch 2/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0215 - val_acc: 0.9968
Epoch 3/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0208 - val_acc: 0.9968
Epoch 4/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0253 - val_acc: 0.9966
Epoch 5/50
9014805/9014805 [==============================] - 787s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 6/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0258 - val_acc: 0.9966
Epoch 7/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0270 - acc: 0.9960 - val_loss: 0.0235 - val_acc: 0.9969
Epoch 8/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0216 - val_acc: 0.9968
Epoch 9/50
9014805/9014805 [==============================] - 785s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0242 - val_acc: 0.9964
Epoch 10/50
9014805/9014805 [==============================] - 786s 87us/step - loss: 0.0273 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9969
Epoch 11/50
9014805/9014805 [==============================] - 783s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0234 - val_acc: 0.9967
Epoch 12/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0230 - val_acc: 0.9968
Epoch 13/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0236 - val_acc: 0.9969
Epoch 14/50
9014805/9014805 [==============================] - 781s 87us/step - loss: 0.0272 - acc: 0.9960 - val_loss: 0.0229 - val_acc: 0.9967
Epoch 15/50
9014805/9014805 [==============================] - 782s 87us/step - loss: 0.0271 - acc: 0.9960 - val_loss: 0.0249 - val_acc: 0.9961
Epoch 16/50
5038976/9014805 [===============>..............] - ETA: 5:37 - loss: 0.0271 - acc: 0.9960

............................
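
An accuracy near 0.996 says little here, because purchases are rare: a model that always predicts "no purchase" reaches an accuracy of one minus the positive rate while finding no buyers at all. The sketch below illustrates this with an assumed positive rate of 0.4% (a made-up figure chosen for illustration, not measured from the data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
positive_rate = 0.004  # assumption for illustration only

y_true = (rng.random(n) < positive_rate).astype(int)
y_pred = np.zeros(n, dtype=int)  # always predict "no purchase"

accuracy = (y_true == y_pred).mean()
recall = y_pred[y_true == 1].mean()  # share of real buyers found

print("accuracy:", accuracy, "recall:", recall)
```

This is why the F1 score of the positive class, which the post evaluates next, is the more informative metric for this task.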

y_predict = model.predict(x_test)
y_predict
array([[1.0223342e-03],
       [5.4835586e-10],
       [1.0230833e-03],
       ...,
       [5.3039176e-04],
       [1.6522235e-03],
       [1.2059750e-03]], dtype=float32)
import matplotlib.pyplot as plt
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
# Find the threshold with the highest F1 score
def get_f1_by(true, predict, n):
    # use the argument, not the global y_predict
    predict = predict.reshape(-1)
    return f1_score(true, np.where(predict >= n, np.ones_like(predict), np.zeros_like(predict)))

def cal_f1_score(true, predict):
    # `area` is the module-level Series of candidate thresholds defined below
    result = area.apply(lambda i : get_f1_by(true, predict, i))
    return result

area = pd.Series(np.arange(1e-9, 0.9, 0.05))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
# (result from an earlier run: 0.043000000000000024 0.15744941753525443)
when n = 0.050000001 best result is 0.061109622085231845
area = pd.Series(np.arange(1e-9, 0.1, 0.001))
result = cal_f1_score(y_test.reshape(-1), y_predict.reshape(-1))
print('when n =', area[result.idxmax()], 'best result is', result.max())
plt.scatter(area, result)
plt.show()
# (result from an earlier run: 0.043000000000000024 0.15744941753525443)
when n = 0.016000001 best result is 0.19509536784741144
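
The threshold search above can also be written without relying on globals. The following is a self-contained sketch with a hand-rolled F1 on made-up scores; `f1_at` and `best_threshold` are hypothetical helpers, not part of the original notebook:

```python
import numpy as np

def f1_at(y_true, scores, threshold):
    """F1 of the positive class when predicting scores >= threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, scores, thresholds):
    """Return (threshold, f1) with the highest F1 among the candidates."""
    f1s = np.array([f1_at(y_true, scores, t) for t in thresholds])
    i = int(f1s.argmax())
    return thresholds[i], f1s[i]

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.01, 0.02, 0.03, 0.10, 0.20, 0.30, 0.60, 0.90])
best_t, best_f1 = best_threshold(y_true, scores, np.array([0.05, 0.15, 0.5, 0.95]))
print(best_t, best_f1)
```

Note that in this post the threshold is tuned and evaluated on the same held-out split, so the reported F1 is likely to be somewhat optimistic.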
# From this result, a threshold of about 0.016 gives the highest F1; we use it to predict the result for the last day

y_predict = model.predict(x_action.loc['2014-12-18'].iloc[:, :-1])

y_predict = np.where(y_predict > 0.016, np.ones_like(y_predict), np.zeros_like(y_predict))

len(y_predict[y_predict==1])
1154
result = x_action.loc['2014-12-18'].copy()
result.buy = y_predict.reshape(-1)  # flatten the (n, 1) prediction array before assigning it to a column
result = result.loc[result.buy > 0, 'buy'].reset_index().loc[:, ['user', 'item']]
result.head()
user_index.head()
item_ids.head()
result = result.merge(item_ids.reset_index(), left_on='item', right_on='item', how='left') \
    .merge(user_index.reset_index(), left_on='user', right_on='user', how='left').loc[:, ['user_id', 'item_id']]
result.head()
result.to_csv('tianchi_mobile_recommendation_predict.csv', index=False)  # index=False keeps only the user_id and item_id columns

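At this point we have a result. It has been submitted to Tianchi, but the score is not available yet; I will update this post later.

My original intention was to build the network with LSTM and embeddings, but I first built a network the "traditional" way to serve as a reference baseline for later networks. Even this took a very long time; my fundamentals are still lacking, so I plan to learn while practicing. If you know a better approach or spot things I did poorly, please do not hesitate to point them out!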
