Building the project
- Import the sklearn-related packages
import numpy as np
from sklearn import neighbors
- Define the readDataSet(fileName, isTest) function, which loads a data file and returns the feature data (dataSet) and the labels (labels)
# fileName: name of the data file
# isTest: True for unlabeled test data, which has only 54 columns; labels are then set to 0
def readDataSet(fileName, isTest):
    with open(fileName, encoding='utf-8') as fr:
        lines = fr.readlines()
    numLabels = len(lines)
    labels = np.zeros(numLabels)  # holds the labels
    dataSet = np.zeros([numLabels, 54], int)  # holds the feature data
    # Read the file line by line into dataSet and labels.
    # A labeled row has 55 columns: 54 sample features followed by the class label.
    for i in range(numLabels):
        fields = lines[i].split(' ')
        if isTest:  # True: the test set has only 54 columns, so the label is set to 0
            label = 0
        else:  # False: take the last column as the label
            label = fields[54]
        labels[i] = label
        dates = np.zeros(54)
        for j in range(54):  # copy all 54 feature columns (range(53) would skip the last one)
            dates[j] = fields[j]
        dataSet[i] = dates
    return dataSet, labels
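The isTest parameter is there because the training set (55 columns) and the test set (54 columns) have different column counts. Loading both through the same function makes it easy to count errors and compute the accuracy during cross-validation.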
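The column layout described above can also be handled without a per-element loop. The following is a minimal sketch, not the loader used in this post: `load_rows` is a hypothetical helper that assumes whitespace-separated integer values and relies on `np.loadtxt`. The demo uses 3 feature columns instead of 54 only to keep the sample data short.

```python
import io
import numpy as np

def load_rows(source, is_test, n_features=54):
    """Hypothetical helper: split whitespace-separated rows into features and labels."""
    # ndmin=2 keeps a 2-D shape even if the input has a single row.
    data = np.loadtxt(source, dtype=int, ndmin=2)
    expected = n_features if is_test else n_features + 1
    if data.shape[1] != expected:
        raise ValueError(f"expected {expected} columns, got {data.shape[1]}")
    features = data[:, :n_features]
    # Unlabeled test rows get a placeholder label of 0, as in readDataSet.
    if is_test:
        labels = np.zeros(len(data))
    else:
        labels = data[:, n_features].astype(float)
    return features, labels

# Tiny stand-in data: 3 features per row, plus a label column for training rows.
train_text = "1 2 3 7\n4 5 6 2\n"
test_text = "1 2 3\n4 5 6\n"
train_X, train_y = load_rows(io.StringIO(train_text), is_test=False, n_features=3)
test_X, test_y = load_rows(io.StringIO(test_text), is_test=True, n_features=3)
print(train_X.shape, train_y, test_X.shape, test_y)
```

Checking the column count up front turns a mismatched file into an immediate error instead of an IndexError deep inside the loop.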
- Load the training data
train_dataSet,train_labels = readDataSet('data_train.txt', False)
- Build the KNN classifier and call its fit() function
knn = neighbors.KNeighborsClassifier(algorithm='kd_tree', n_neighbors=3)
knn.fit(train_dataSet,train_labels)
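To make clear what the fitted classifier computes at prediction time, here is a brute-force sketch of the k-nearest-neighbor vote. It is only an illustration: `knn_predict` is a hypothetical helper, it scans every training point rather than using a kd-tree, and its tie-breaking rule is an assumption of this sketch rather than sklearn's behavior.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Brute-force k-nearest-neighbor vote (illustrative; not sklearn's kd_tree code)."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Majority vote; in this sketch a tie goes to the label of the nearer neighbor.
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small clusters with labels 1 and 2.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [1, 1, 1, 2, 2, 2]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # → 1
print(knn_predict(train_X, train_y, (5.5, 5.5)))  # → 2
```

The kd_tree option changes how the nearest neighbors are found, not which label the vote produces for a given set of neighbors.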
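- Load the test set and use the fitted KNN classifier to predict on it (isTest is False here, so test.txt must include the label column that the error count compares against)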
test_dataSet,test_labels = readDataSet('test.txt', False)
res = knn.predict(test_dataSet)
error = 0
for i in range(len(res)):
    if res[i] != test_labels[i]:
        error += 1
print('error:', error, 'accuracy:', (len(res) - error) / len(res))
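The same error count can be written as a single array comparison. This is a small sketch with made-up labels; `error_stats` is a hypothetical helper name, not part of the original code.

```python
import numpy as np

def error_stats(predicted, actual):
    """Return (number of mismatches, accuracy) for two equal-length label arrays."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    # Element-wise comparison yields a boolean array; summing it counts the mismatches.
    errors = int(np.sum(predicted != actual))
    return errors, (len(predicted) - errors) / len(predicted)

# Made-up labels: one of the four predictions is wrong.
errors, accuracy = error_stats([1, 2, 2, 3], [1, 2, 3, 3])
print('error:', errors, 'accuracy:', accuracy)  # error: 1 accuracy: 0.75
```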
Effect of the number of neighbors K
| n_neighbors | 1 | 3 | 5 | 7 |
|---|---|---|---|---|
| Errors | 1524 | 1446 | 1495 | 1621 |
| Accuracy | 0.96136 | 0.96333 | 0.96209 | 0.95889 |
Accuracy is highest at K = 3 and starts to fall once K > 3.
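A comparison like the one in the table could be produced with a loop over n_neighbors. The sketch below shows only the shape of that loop: it runs on synthetic stand-in data (two well-separated Gaussian clusters generated by the hypothetical `make_clusters` helper), so its numbers say nothing about the dataset used in this post.

```python
import numpy as np
from sklearn import neighbors

rng = np.random.default_rng(0)

def make_clusters(n_per_class):
    """Synthetic stand-in data: two well-separated Gaussian clusters with 4 features."""
    class1 = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, 4))
    class2 = rng.normal(loc=10.0, scale=1.0, size=(n_per_class, 4))
    X = np.vstack([class1, class2])
    y = np.array([1] * n_per_class + [2] * n_per_class)
    return X, y

X_train, y_train = make_clusters(40)
X_test, y_test = make_clusters(20)

results = {}
for k in (1, 3, 5, 7):
    # Same classifier settings as above; only n_neighbors changes.
    knn = neighbors.KNeighborsClassifier(algorithm='kd_tree', n_neighbors=k)
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    errors = int(np.sum(predicted != y_test))
    results[k] = (errors, (len(y_test) - errors) / len(y_test))
    print('n_neighbors:', k, 'error:', errors, 'accuracy:', results[k][1])
```

On the real data, the arrays returned by readDataSet would take the place of the synthetic ones.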