2020-07-21

We continue the topic of clustering and unsupervised learning with Mean Shift, this time applying it to our Titanic dataset.

There is some randomness involved, so your results may not be identical to mine. If you don't get similar results, you can re-run the program.

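We're going to take a look at the Titanic dataset via Mean Shift clustering. What we're interested in is whether Mean Shift can automatically separate the passengers into groups. If it can, it will be interesting to inspect the groups it creates. The first obvious point of interest is the survival rate of the groups found, but we will also dig into the attributes of these groups to see whether we can understand why Mean Shift decided on them.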
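To make "automatically separating into groups" concrete, here is a minimal sketch of flat-kernel mean shift on made-up 2-D points. The function name mean_shift, the toy points, and the bandwidth of 2.0 are all assumptions for illustration; this is not how scikit-learn implements it internally, only the basic idea of sliding each point toward the mean of its neighbours.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50):
    """Flat-kernel mean shift: move each point to the mean of its neighbours."""
    points = np.asarray(points, dtype=float)
    shifted = points.copy()
    for _ in range(n_iter):
        for i in range(len(shifted)):
            # Neighbours are the ORIGINAL points within `bandwidth` of the
            # current (shifted) position.
            dist = np.linalg.norm(points - shifted[i], axis=1)
            shifted[i] = points[dist <= bandwidth].mean(axis=0)
    # Points that ended up in (almost) the same place share a cluster.
    centers, labels = [], []
    for p in shifted:
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < bandwidth / 2:
                labels.append(k)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return np.array(centers), np.array(labels)

# Made-up data: two obvious blobs; we never say how many clusters to find.
toy = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]]
centers, labels = mean_shift(toy, bandwidth=2.0)
print(len(centers))
print(labels)
```

The number of clusters falls out of the data and the bandwidth, which is why we can hand Mean Shift the Titanic passengers without telling it how many groups to expect.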

First, we start with code we've already seen:

import numpy as np
from sklearn.cluster import MeanShift, KMeans
# cross_validation was removed from scikit-learn and is not used below
from sklearn import preprocessing
import pandas as pd
import matplotlib.pyplot as plt


'''
Pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival Survival (0 = No; 1 = Yes)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare (British pound)
cabin Cabin
embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat Lifeboat
body Body Identification Number
home.dest Home/Destination
'''


# https://pythonprogramming.net/static/downloads/machine-learning-data/titanic.xls
df = pd.read_excel('titanic.xls')

original_df = pd.DataFrame.copy(df)
# axis must be passed by keyword in recent pandas versions
df.drop(['body','name'], axis=1, inplace=True)
df.fillna(0,inplace=True)

def handle_non_numerical_data(df):
    
    # handling non-numerical data: must convert.
    columns = df.columns.values

    for column in columns:
        text_digit_vals = {}
        def convert_to_int(val):
            return text_digit_vals[val]

        #print(column,df[column].dtype)
        if df[column].dtype != np.int64 and df[column].dtype != np.float64:
            
            column_contents = df[column].values.tolist()
            #finding just the uniques
            unique_elements = set(column_contents)
            # great, found them. 
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    # creating dict that contains new
                    # id per unique string
                    text_digit_vals[unique] = x
                    x+=1
            # now we map the new "id" value
            # to replace the string. 
            df[column] = list(map(convert_to_int,df[column]))

    return df

df = handle_non_numerical_data(df)
df.drop(['ticket','home.dest'], axis=1, inplace=True)

X = np.array(df.drop(['survived'], axis=1).astype(float))
X = preprocessing.scale(X)
y = np.array(df['survived'])

clf = MeanShift()
clf.fit(X)

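There are two differences from before. One is original_df = pd.DataFrame.copy(df), placed right after we read the Excel file into the df object. The other is importing MeanShift from sklearn.cluster and using it as our clusterer. We make the copy so that we can later refer back to the data in its original, non-numerical form.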
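One plausible source of the run-to-run differences mentioned at the start is handle_non_numerical_data: it numbers the unique values in whatever order a Python set yields them, and for strings that order can change between runs. The sketch below is an assumed alternative, not the original function; encode_column and the toy values are hypothetical names and data. It sorts the unique values first so the ids are reproducible.

```python
def encode_column(values):
    """Map each distinct value to an integer id in a reproducible order."""
    # Sorting by (type name, string form) keeps columns that mix the
    # fillna(0) placeholder with strings sortable.
    uniques = sorted(set(values), key=lambda v: (type(v).__name__, str(v)))
    mapping = {value: idx for idx, value in enumerate(uniques)}
    return [mapping[v] for v in values], mapping

# Made-up column: embarkation ports with one missing value filled as 0.
codes, mapping = encode_column(['S', 'C', 0, 'S', 'Q'])
print(codes)    # [3, 1, 0, 3, 2]
print(mapping)  # {0: 0, 'C': 1, 'Q': 2, 'S': 3}
```

Whether this makes the clusters themselves stable is not something the sketch proves; it only removes one source of variation in the inputs.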

Now that we've fitted the model, we can grab some attributes from the clf object.

labels = clf.labels_
cluster_centers = clf.cluster_centers_
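To see what these two attributes hold, here is a small sketch on made-up 2-D points. The toy data and the explicit bandwidth=2.0 are assumptions for illustration; the Titanic code above lets MeanShift estimate the bandwidth itself.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Made-up data: two well-separated blobs of three points each.
toy_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])

toy_clf = MeanShift(bandwidth=2.0)
toy_clf.fit(toy_X)

toy_labels = toy_clf.labels_            # one cluster id per input row
toy_centers = toy_clf.cluster_centers_  # one row of coordinates per cluster
print(toy_labels.shape, toy_centers.shape)
```

The labels come back in the same order as the rows of the input, which is what lets us attach them to the original dataframe row by row.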

Next, we add a new column to our original dataframe.

original_df['cluster_group']=np.nan

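Now we can iterate through the labels and populate the empty column.

for i in range(len(X)):
    # Assign through .loc; chained indexing such as df[col].iloc[i] = ...
    # may not write back to original_df in newer pandas versions.
    original_df.loc[original_df.index[i], 'cluster_group'] = labels[i]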
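Because the labels are in the same order as the rows, the column can also be filled with a single assignment instead of a loop. A minimal sketch on a made-up frame (toy_df and toy_labels are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for original_df and clf.labels_.
toy_df = pd.DataFrame({'name': ['a', 'b', 'c'], 'survived': [1, 0, 1]})
toy_labels = np.array([0, 1, 0])

# One assignment; this relies on the labels being in the same row order
# as the dataframe and having the same length.
toy_df['cluster_group'] = toy_labels
print(toy_df)
```

Note that this stores integer labels, whereas a column first filled with np.nan holds floats; comparisons such as == 1 or == float(i) behave the same either way.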

Now we can check the survival rate of each group:

n_clusters_ = len(np.unique(labels))
survival_rates = {}
for i in range(n_clusters_):
    temp_df = original_df[ (original_df['cluster_group']==float(i)) ]
    #print(temp_df.head())

    survival_cluster = temp_df[  (temp_df['survived'] == 1) ]

    survival_rate = len(survival_cluster) / len(temp_df)
    #print(i,survival_rate)
    survival_rates[i] = survival_rate
    
print(survival_rates)

If we run this, we get:

{0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}
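Since survived is coded 0/1, the mean of that column within each group is the group's survival rate, so the same dictionary can be sketched with a pandas groupby. The frame below is made-up toy data, not the Titanic results:

```python
import pandas as pd

# Made-up data: three passengers in group 0, two in group 1, one in group 2.
toy_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1, 2],
    'survived':      [1, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 column per group = fraction of survivors per group.
toy_rates = toy_df.groupby('cluster_group')['survived'].mean().to_dict()
print(toy_rates)
```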

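Again, you may get more groups. I got three here, but I have gotten as many as six on this dataset. We see that group 0 has a 38% survival rate, group 1 has 91%, and group 2 has 10%. This is somewhat curious, since we know there were three actual "passenger classes" on the ship. I wonder whether 0 is second class, 1 is first class, and 2 is third class. On the ship, third class was at the bottom and first class at the top; the bottom flooded first, and the top was where the lifeboats were. I can take a deeper look: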

print(original_df[ (original_df['cluster_group']==1) ])

This selects the rows of original_df where cluster_group is 1.

Printed out:

     pclass  survived                                               name  \
17        1         1    Baxter, Mrs. James (Helene DeLaudeniere Chaput)   
49        1         1                 Cardeza, Mr. Thomas Drake Martinez   
50        1         1  Cardeza, Mrs. James Warburton Martinez (Charlo...   
66        1         1                        Chaudanson, Miss. Victorine   
97        1         1  Douglas, Mrs. Frederick Charles (Mary Helene B...   
116       1         1                Fortune, Mrs. Mark (Mary McDougald)   
183       1         1                             Lesurer, Mr. Gustave J   
251       1         1              Ryerson, Miss. Susan Parker "Suzette"   
252       1         0                         Ryerson, Mr. Arthur Larned   
253       1         1    Ryerson, Mrs. Arthur Larned (Emily Maria Borie)   
302       1         1                                   Ward, Miss. Anna   

        sex   age  sibsp  parch    ticket      fare            cabin embarked  \
17   female  50.0      0      1  PC 17558  247.5208          B58 B60        C   
49     male  36.0      0      1  PC 17755  512.3292      B51 B53 B55        C   
50   female  58.0      0      1  PC 17755  512.3292      B51 B53 B55        C   
66   female  36.0      0      0  PC 17608  262.3750              B61        C   
97   female  27.0      1      1  PC 17558  247.5208          B58 B60        C   
116  female  60.0      1      4     19950  263.0000      C23 C25 C27        S   
183    male  35.0      0      0  PC 17755  512.3292             B101        C   
251  female  21.0      2      2  PC 17608  262.3750  B57 B59 B63 B66        C   
252    male  61.0      1      3  PC 17608  262.3750  B57 B59 B63 B66        C   
253  female  48.0      1      3  PC 17608  262.3750  B57 B59 B63 B66        C   
302  female  35.0      0      0  PC 17755  512.3292              NaN        C   

    boat  body                                       home.dest  cluster_group  
17     6   NaN                                    Montreal, PQ            1.0  
49     3   NaN  Austria-Hungary / Germantown, Philadelphia, PA            1.0  
50     3   NaN                    Germantown, Philadelphia, PA            1.0  
66     4   NaN                                             NaN            1.0  
97     6   NaN                                    Montreal, PQ            1.0  
116   10   NaN                                    Winnipeg, MB            1.0  
183    3   NaN                                             NaN            1.0  
251    4   NaN                 Haverford, PA / Cooperstown, NY            1.0  
252  NaN   NaN                 Haverford, PA / Cooperstown, NY            1.0  
253    4   NaN                 Haverford, PA / Cooperstown, NY            1.0  
302    3   NaN                                             NaN            1.0 

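Sure enough, this entire group is first class. That said, there are only 11 people in it. Let's look at group 0, which seemed a bit more diverse. This time, we'll use the Pandas .describe() method.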

print(original_df[ (original_df['cluster_group']==0) ].describe())
            pclass     survived          age        sibsp        parch  \
count  1288.000000  1288.000000  1027.000000  1288.000000  1288.000000   
mean      2.300466     0.379658    29.668614     0.496118     0.332298   
std       0.833785     0.485490    14.395610     1.047430     0.686068   
min       1.000000     0.000000     0.166700     0.000000     0.000000   
25%       2.000000     0.000000    21.000000     0.000000     0.000000   
50%       3.000000     0.000000    28.000000     0.000000     0.000000   
75%       3.000000     1.000000    38.000000     1.000000     0.000000   
max       3.000000     1.000000    80.000000     8.000000     4.000000   

              fare        body  cluster_group  
count  1287.000000  119.000000         1288.0  
mean     30.510172  159.571429            0.0  
std      41.511032   97.302914            0.0  
min       0.000000    1.000000            0.0  
25%       7.895800   71.000000            0.0  
50%      14.108300  155.000000            0.0  
75%      30.070800  255.500000            0.0  
max     263.000000  328.000000            0.0  

There are 1288 people here. The mean class is about 2.3, so roughly second class on average, but the group spans first through third class.

Let's check the last group, 2, which we expect to be all third class:

print(original_df[ (original_df['cluster_group']==2) ].describe())
       pclass   survived        age      sibsp      parch       fare  \
count    10.0  10.000000   8.000000  10.000000  10.000000  10.000000   
mean      3.0   0.100000  39.875000   0.800000   6.000000  42.703750   
std       0.0   0.316228   1.552648   0.421637   1.632993  15.590194   
min       3.0   0.000000  38.000000   0.000000   5.000000  29.125000   
25%       3.0   0.000000  39.000000   1.000000   5.000000  31.303125   
50%       3.0   0.000000  39.500000   1.000000   5.000000  35.537500   
75%       3.0   0.000000  40.250000   1.000000   6.000000  46.900000   
max       3.0   1.000000  43.000000   1.000000   9.000000  69.550000   

             body  cluster_group  
count    2.000000           10.0  
mean   234.500000            2.0  
std    130.814755            0.0  
min    142.000000            2.0  
25%    188.250000            2.0  
50%    234.500000            2.0  
75%    280.750000            2.0  
max    327.000000            2.0  

Sure enough, we were right: this group is entirely third class, and it has the worst survival rate.
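Rather than describing the groups one at a time, the class make-up of every group can be tabulated in one step with a cross-tabulation. A minimal sketch on made-up data (toy_df is hypothetical; on the real data the two columns would come from original_df):

```python
import pandas as pd

# Made-up data: group 0 is mixed, group 1 is all first class,
# group 2 is all third class.
toy_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1, 2],
    'pclass':        [1, 2, 3, 1, 1, 3],
})

# Rows are cluster ids, columns are passenger classes, cells are counts.
toy_table = pd.crosstab(toy_df['cluster_group'], toy_df['pclass'])
print(toy_table)
```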

Interestingly, looking across all of the groups, the ticket prices in group 2 span the narrowest range, from about 29 to 69 pounds.

When we look at cluster 0, the fares go up to 263 pounds. This is the largest group, with a 38% survival rate.

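When we revisit cluster 1, which is all first class, we see that the fares range from 247 to 512 pounds, with a mean of about 350. Although cluster 0 contains some first-class passengers, cluster 1 is the most elite group.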
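The fare comparison across clusters can also be pulled together in a single table. This is a sketch on made-up fares, assuming a frame with cluster_group and fare columns like original_df has:

```python
import pandas as pd

# Made-up fares for three small groups.
toy_df = pd.DataFrame({
    'cluster_group': [0, 0, 0, 1, 1, 2],
    'fare':          [0.0, 10.0, 260.0, 250.0, 510.0, 30.0],
})

# Minimum, maximum and mean fare for each cluster.
toy_fares = toy_df.groupby('cluster_group')['fare'].agg(['min', 'max', 'mean'])
print(toy_fares)
```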

Out of curiosity, how does the survival rate of the first-class passengers in group 0 compare with the overall survival rate?

>>> cluster_0 = (original_df[ (original_df['cluster_group']==0) ])
>>> cluster_0_fc = (cluster_0[ (cluster_0['pclass']==1) ])
>>> print(cluster_0_fc.describe())
       pclass    survived         age       sibsp       parch        fare  \
count   312.0  312.000000  273.000000  312.000000  312.000000  312.000000   
mean      1.0    0.608974   39.027167    0.432692    0.326923   78.232519   
std       0.0    0.488764   14.589592    0.606997    0.653100   60.300654   
min       1.0    0.000000    0.916700    0.000000    0.000000    0.000000   
25%       1.0    0.000000   28.000000    0.000000    0.000000   30.500000   
50%       1.0    1.000000   39.000000    0.000000    0.000000   58.689600   
75%       1.0    1.000000   49.000000    1.000000    0.000000   91.079200   
max       1.0    1.000000   80.000000    3.000000    4.000000  263.000000   

             body  cluster_group  
count   35.000000          312.0  
mean   162.828571            0.0  
std     82.652172            0.0  
min     16.000000            0.0  
25%    109.500000            0.0  
50%    166.000000            0.0  
75%    233.000000            0.0  
max    307.000000            0.0  
>>> 

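Sure enough, their survival rate is higher, at about 61%, but still well below the 91% of the elite group (elite by fare and by survival rate). Take some time to dig in and see what else you can find.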
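For reference, the overall rates can be worked out from the numbers already printed above. The sketch below only does arithmetic on those outputs (group sizes and survival rates for this particular run) and assumes, as the outputs show, that cluster 1 is entirely first class and cluster 2 entirely third class.

```python
# Group sizes and survival rates copied from the outputs of this run.
cluster_sizes = {0: 1288, 1: 11, 2: 10}
cluster_rates = {0: 0.3796583850931677, 1: 0.9090909090909091, 2: 0.1}

# Recover survivor counts, then the survival rate over all passengers.
survivors = {k: round(cluster_sizes[k] * cluster_rates[k]) for k in cluster_sizes}
overall_rate = sum(survivors.values()) / sum(cluster_sizes.values())

# First-class passengers inside cluster 0, from the describe() output.
fc0_count, fc0_rate = 312, 0.608974
fc0_survivors = round(fc0_count * fc0_rate)

# All first-class passengers = those in cluster 0 plus all of cluster 1.
first_class_rate = (fc0_survivors + survivors[1]) / (fc0_count + cluster_sizes[1])

print(round(overall_rate, 3), round(fc0_rate, 3), round(first_class_rate, 3))
# 0.382 0.609 0.619
```

So the first-class passengers in cluster 0 survived at about 61%, against roughly 38% for the ship as a whole and about 62% for first class overall.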
