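We continue the topic of clustering and unsupervised learning with mean shift, this time applying it to our Titanic dataset.
There is some randomness involved here, so your results may not match mine exactly. If your results are not similar, you can rerun the program until they are.
We are going to look at the Titanic dataset through mean shift clustering. What we want to know is whether mean shift can automatically separate the passengers into groups. If it can, it will be interesting to inspect the groups it creates. The first obvious point of interest is the survival rate of each group it finds, but we will also dig into the attributes of those groups to see whether we can understand why mean shift settled on them.
First, we use code we have already seen: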
import numpy as np
from sklearn.cluster import MeanShift, KMeans
from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt
'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''
# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
df = pd.read_excel('titanic.xls')
original_df = pd.DataFrame.copy(df)
df.drop(['body','name'], axis=1, inplace=True)
df.fillna(0,inplace=True)
def handle_non_numerical_data(df):
    # handling non-numerical data: must convert.
    columns = df.columns.values
    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]
        #print(column,df[column].dtype)
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            column_contents = df[column].values.tolist()
            #finding just the uniques
            unique_elements = set(column_contents)
            # great, found them.
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x+=1
            # now we map the new "id" value
            # to replace the string.
            df[column] = list(map(convert_to_int,df[column]))
    return df
df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], axis=1, inplace=True)
X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])
clf = MeanShift()
clf.fit(X)
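This is the same code as before, with two exceptions. One is original_df = pd.DataFrame.copy(df), placed right after we read the Excel file into the df object. The other is importing MeanShift from sklearn.cluster and using it as our clusterer. We make the copy so that we can later refer back to the data in its original, non-numeric form.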
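One plausible source of the run-to-run variation mentioned earlier is that handle_non_numerical_data iterates over a Python set of strings, and the iteration order of such a set can change between interpreter runs, so the same string may receive a different integer id each time. The sketch below shows one way to make the encoding repeatable by assigning ids in sorted order. The helper name encode_non_numeric_columns and the demo data are made up for illustration; this is an alternative, not the code used in this tutorial.

```python
import pandas as pd

def encode_non_numeric_columns(df):
    """Hypothetical helper: replace each non-numeric column with sorted integer codes."""
    df = df.copy()
    for column in df.columns:
        if not pd.api.types.is_numeric_dtype(df[column]):
            # Cast to str first so a column that mixes strings with the 0
            # written by fillna(0) can still be sorted.
            codes, _ = pd.factorize(df[column].astype(str), sort=True)
            df[column] = codes
    return df

# Made-up demo rows, not taken from titanic.xls.
demo = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'cabin': ['B58', 0, 'C23'],
    'age': [22.0, 38.0, 26.0],
})
encoded = encode_non_numeric_columns(demo)
print(encoded)
```

Because the ids follow sorted order rather than set order, rerunning the script yields the same numeric matrix, which should make the clustering easier to reproduce.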
Now that the model has been fitted, we can grab some attributes from the clf object.
labels = clf.labels_
cluster_centers = clf.cluster_centers_
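To get a feel for what these two attributes hold, here is a minimal sketch on synthetic data rather than the Titanic data. The two blobs, the bandwidth value, and the names toy_clf, toy_labels and toy_centers are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two well-separated synthetic blobs (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(30, 2))
points = np.vstack([blob_a, blob_b])

# The bandwidth is fixed here for the toy data; the tutorial code
# lets MeanShift estimate it instead.
toy_clf = MeanShift(bandwidth=2.0)
toy_clf.fit(points)

toy_labels = toy_clf.labels_            # one cluster id per input row
toy_centers = toy_clf.cluster_centers_  # one row of coordinates per cluster
print(toy_labels.shape, toy_centers.shape)
print(np.unique(toy_labels))
```

labels_ lines up row by row with the array passed to fit, which is what lets us attach it to the dataframe next. Note that in the tutorial X was scaled before fitting, so cluster_centers_ there is expressed in scaled feature units, not in the original columns' units.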
Next, we add a new column to our original dataframe.
original_df['cluster_group']=np.nan
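Now we can iterate over the labels and fill the empty column with them.
for i in range(len(X)):
    # .loc writes into original_df itself; chained ['col'].iloc[i] assignment may not
    original_df.loc[i, 'cluster_group'] = labels[i]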
Now we can check the survival rate of each group:
n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    #print(temp_df.head())
    survival_cluster = temp_df[ (temp_df['survived'] == 1) ]
    survival_rate = len(survival_cluster) / len(temp_df)
    #print(i,survival_rate)
    survival_rates[i] = survival_rate
print(survival_rates)
If we run this, we get:
{0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}
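As an aside, the same per-group rates can be computed without an explicit loop. The sketch below uses a small made-up dataframe, not the Titanic data, and the names toy, toy_labels and rates are hypothetical. It assigns the labels in one step and then takes the mean of survived within each group, which equals the survival rate because survived is 0 or 1.

```python
import numpy as np
import pandas as pd

# Made-up survival flags and cluster labels for six passengers.
toy = pd.DataFrame({'survived': [1, 0, 1, 1, 0, 1]})
toy_labels = np.array([0, 0, 1, 1, 0, 0])

# Assign all labels at once instead of row by row.
toy['cluster_group'] = toy_labels

# Mean of a 0/1 column per group is that group's survival rate.
rates = toy.groupby('cluster_group')['survived'].mean().to_dict()
print(rates)
```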
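Again, you may get more groups. I got three here, but I have gotten as many as six on this dataset. We can see that the survival rate is 38% for group 0, 91% for group 1, and 10% for group 2. That is somewhat curious, since we know the ship had three actual passenger classes. I wonder whether 0 is second class, 1 is first class, and 2 is third class. On the ship, third class was at the very bottom and first class at the top; the bottom flooded first, and the top was where the lifeboats were. I can take a closer look: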
print(original_df[ (original_df['cluster_group']==1) ])
This selects the rows of original_df where cluster_group is 1.
Printed out:
pclass survived name \
17 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput)
49 1 1 Cardeza, Mr. Thomas Drake Martinez
50 1 1 Cardeza, Mrs. James Warburton Martinez (Charlo...
66 1 1 Chaudanson, Miss. Victorine
97 1 1 Douglas, Mrs. Frederick Charles (Mary Helene B...
116 1 1 Fortune, Mrs. Mark (Mary McDougald)
183 1 1 Lesurer, Mr. Gustave J
251 1 1 Ryerson, Miss. Susan Parker "Suzette"
252 1 0 Ryerson, Mr. Arthur Larned
253 1 1 Ryerson, Mrs. Arthur Larned (Emily Maria Borie)
302 1 1 Ward, Miss. Anna
sex age sibsp parch ticket fare cabin embarked \
17 female 50.0 0 1 PC 17558 247.5208 B58 B60 C
49 male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
50 female 58.0 0 1 PC 17755 512.3292 B51 B53 B55 C
66 female 36.0 0 0 PC 17608 262.3750 B61 C
97 female 27.0 1 1 PC 17558 247.5208 B58 B60 C
116 female 60.0 1 4 19950 263.0000 C23 C25 C27 S
183 male 35.0 0 0 PC 17755 512.3292 B101 C
251 female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
252 male 61.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C
253 female 48.0 1 3 PC 17608 262.3750 B57 B59 B63 B66 C
302 female 35.0 0 0 PC 17755 512.3292 NaN C
boat body home.dest cluster_group
17 6 NaN Montreal, PQ 1.0
49 3 NaN Austria-Hungary / Germantown, Philadelphia, PA 1.0
50 3 NaN Germantown, Philadelphia, PA 1.0
66 4 NaN NaN 1.0
97 6 NaN Montreal, PQ 1.0
116 10 NaN Winnipeg, MB 1.0
183 3 NaN NaN 1.0
251 4 NaN Haverford, PA / Cooperstown, NY 1.0
252 NaN NaN Haverford, PA / Cooperstown, NY 1.0
253 4 NaN Haverford, PA / Cooperstown, NY 1.0
302 3 NaN NaN 1.0
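Sure enough, this entire group is first class. That said, there are only 11 people in it. Let's look at group 0, which seems a bit more varied. This time, we use the Pandas .describe() method.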
print(original_df[ (original_df['cluster_group']==0) ].describe())
pclass survived age sibsp parch \
count 1288.000000 1288.000000 1027.000000 1288.000000 1288.000000
mean 2.300466 0.379658 29.668614 0.496118 0.332298
std 0.833785 0.485490 14.395610 1.047430 0.686068
min 1.000000 0.000000 0.166700 0.000000 0.000000
25% 2.000000 0.000000 21.000000 0.000000 0.000000
50% 3.000000 0.000000 28.000000 0.000000 0.000000
75% 3.000000 1.000000 38.000000 1.000000 0.000000
max 3.000000 1.000000 80.000000 8.000000 4.000000
fare body cluster_group
count 1287.000000 119.000000 1288.0
mean 30.510172 159.571429 0.0
std 41.511032 97.302914 0.0
min 0.000000 1.000000 0.0
25% 7.895800 71.000000 0.0
50% 14.108300 155.000000 0.0
75% 30.070800 255.500000 0.0
max 263.000000 328.000000 0.0
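There are 1,288 people in this group, and we can see that the average class is roughly second class, although everything from first to third class is represented.
Let's check the last group, 2, which we expect to be all third class: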
print(original_df[ (original_df['cluster_group']==2) ].describe())
pclass survived age sibsp parch fare \
count 10.0 10.000000 8.000000 10.000000 10.000000 10.000000
mean 3.0 0.100000 39.875000 0.800000 6.000000 42.703750
std 0.0 0.316228 1.552648 0.421637 1.632993 15.590194
min 3.0 0.000000 38.000000 0.000000 5.000000 29.125000
25% 3.0 0.000000 39.000000 1.000000 5.000000 31.303125
50% 3.0 0.000000 39.500000 1.000000 5.000000 35.537500
75% 3.0 0.000000 40.250000 1.000000 6.000000 46.900000
max 3.0 1.000000 43.000000 1.000000 9.000000 69.550000
body cluster_group
count 2.000000 10.0
mean 234.500000 2.0
std 130.814755 0.0
min 142.000000 2.0
25% 188.250000 2.0
50% 234.500000 2.0
75% 280.750000 2.0
max 327.000000 2.0
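Sure enough, we were right: this group is entirely third class, and it has the worst survival rate.
Interestingly, when looking across all the groups, group 2 has the narrowest range of fares, from 29 to 69 pounds.
When we look at cluster 0, the fares go as high as 263 pounds. This is the largest group, with a survival rate of 38%.
When we revisit cluster 1, which is all first class, we see that the fares range from 247 to 512 pounds, with a mean of about 350. Although cluster 0 contains some first-class passengers, cluster 1 is the most elite group.
Out of curiosity, how does the survival rate of the first-class passengers in group 0 compare with the overall survival rate?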
>>> cluster_0 = (original_df[ (original_df['cluster_group']==0) ])
>>> cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])
>>> print(cluster_0_fc.describe())
pclass survived age sibsp parch fare \
count 312.0 312.000000 273.000000 312.000000 312.000000 312.000000
mean 1.0 0.608974 39.027167 0.432692 0.326923 78.232519
std 0.0 0.488764 14.589592 0.606997 0.653100 60.300654
min 1.0 0.000000 0.916700 0.000000 0.000000 0.000000
25% 1.0 0.000000 28.000000 0.000000 0.000000 30.500000
50% 1.0 1.000000 39.000000 0.000000 0.000000 58.689600
75% 1.0 1.000000 49.000000 1.000000 0.000000 91.079200
max 1.0 1.000000 80.000000 3.000000 4.000000 263.000000
body cluster_group
count 35.000000 312.0
mean 162.828571 0.0
std 82.652172 0.0
min 16.000000 0.0
25% 109.500000 0.0
50% 166.000000 0.0
75% 233.000000 0.0
max 307.000000 0.0
>>>
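Sure enough, their survival rate is higher, at about 61%, but still below the 91% of the elite group (elite in terms of both fare and survival rate). Spend some time digging deeper and see whether you can find anything else.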
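If you want a starting point for that digging, a cross-tabulation of cluster against passenger class, plus a per-cluster summary, condenses the checks we did by hand above. This is only a sketch on made-up rows; the dataframe toy and the names class_table and summary are hypothetical, and on the real data you would use original_df with its cluster_group column instead.

```python
import pandas as pd

# Made-up rows standing in for original_df.
toy = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1, 2, 2],
    'pclass':        [1, 2, 3, 1, 1, 3, 3],
    'survived':      [1, 0, 0, 1, 1, 0, 0],
    'fare':          [80.0, 20.0, 8.0, 250.0, 500.0, 30.0, 40.0],
})

# How many passengers of each class ended up in each cluster.
class_table = pd.crosstab(toy['cluster_group'], toy['pclass'])
print(class_table)

# Size, survival rate and mean fare per cluster in one table.
summary = toy.groupby('cluster_group').agg(
    passengers=('survived', 'size'),
    survival_rate=('survived', 'mean'),
    mean_fare=('fare', 'mean'),
)
print(summary)
```

A cluster whose row in class_table has a single non-zero entry is made up of one passenger class only, which is the pattern we saw for groups 1 and 2.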