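continuous supervised learning: supervised learning where the output is a continuous variable
regression: the technique used for continuous supervised learning
continuous: the values have an order and can be compared by magnitude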
1. Concept
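slope: how much the predicted output changes per unit change in the input
intercept: the predicted output when the input is 0
coefficient: the multiplier applied to an input feature (with one feature, this is the slope)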
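As a small illustration of how these three terms fit together, a one-feature linear model predicts y = slope * x + intercept. The numbers below are made up for this sketch and are not taken from the course data.

```python
# Hypothetical values chosen only for illustration.
slope = 6.25        # change in predicted net worth per year of age
intercept = -10.0   # predicted net worth at age 0

def predict(age):
    """Prediction of a one-feature linear model: slope * x + intercept."""
    return slope * age + intercept

print(predict(27))  # 6.25 * 27 - 10 = 158.75
```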
2. Coding
import numpy
import matplotlib.pyplot as plt
from ages_net_worths import ageNetWorthData
ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)
### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = 1.0 ### fill in the line of code to get the right value
km_net_worth = reg.predict([[27]])[0][0]
### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = 0. ### fill in the line of code to get the right value
slope = reg.coef_[0][0]
### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = 0. ### fill in the line of code to get the right value
intercept = reg.intercept_[0]
### get the score on test data
test_score = 0. ### fill in the line of code to get the right value
test_score = reg.score(ages_test,net_worths_test)
### get the score on the training data
training_score = 0. ### fill in the line of code to get the right value
training_score = reg.score(ages_train,net_worths_train)
def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}
3. Linear regression error
The best linear regression is the one that minimizes the sum of squared errors.
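To make "sum of squared errors" concrete, here is a minimal sketch that scores two candidate lines against the same points. The data and the helper name `sum_squared_errors` are made up for the example.

```python
def sum_squared_errors(actual, predicted):
    """Sum of (actual - predicted)^2 over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Made-up data: three points and two candidate lines.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

line_a = [2.0 * x for x in xs]        # y = 2x, passes through every point
line_b = [1.0 * x + 1.0 for x in xs]  # y = x + 1

print(sum_squared_errors(ys, line_a))  # 0.0
print(sum_squared_errors(ys, line_b))  # errors 0, 1, 2 -> 0 + 1 + 4 = 5.0
```

By this criterion line_a is the better regression, since its SSE is smaller.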
4. Algorithms for minimizing the sum of squared errors
ordinary least squares (OLS)
gradient descent
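The two approaches can be compared on a toy problem. The sketch below is an illustration only, not how any particular library implements them: `fit_ols` uses the closed-form one-feature solution, and `fit_gradient_descent` takes small steps down the gradient of the mean squared error (which has the same minimizer as SSE). The data, learning rate, and step count are assumptions picked so that the loop converges.

```python
def fit_ols(xs, ys):
    """Closed-form least-squares fit for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Repeatedly move slope and intercept against the gradient of the MSE."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Made-up, roughly linear data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

print(fit_ols(xs, ys))               # slope near 1.96, intercept near 1.10
print(fit_gradient_descent(xs, ys))  # should agree closely with the OLS result
```

Both routes should land on essentially the same line; OLS gets there in one step, while gradient descent needs a suitable learning rate and enough iterations.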
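5. The problem with SSE
sum of squared errors (SSE): grows as data points are added, so SSE values from datasets of different sizes are not directly comparable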
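One known weakness of SSE is that it grows with the number of points even when the fit is no worse per point. A minimal sketch with made-up numbers, where every prediction is off by exactly 1:

```python
def sum_squared_errors(actual, predicted):
    """Sum of (actual - predicted)^2 over all points."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Made-up example: every prediction misses by exactly 1.
small_actual = [1.0, 2.0, 3.0]
small_predicted = [a + 1.0 for a in small_actual]

# The same data repeated 10 times: same quality per point, more points.
large_actual = small_actual * 10
large_predicted = small_predicted * 10

print(sum_squared_errors(small_actual, small_predicted))  # 3.0
print(sum_squared_errors(large_actual, large_predicted))  # 30.0
```

The larger dataset gets ten times the SSE although each individual prediction is equally good.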
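6. The R-squared metric for regression
R^2 is at most 1 and usually falls between 0 and 1; the closer to 1, the better the fit (it can go negative when the fit is worse than always predicting the mean)
Advantage: it does not depend on the number of training points, so it is more reliable than the sum of squared errors
In sklearn, reg.score returns R^2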
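As a sketch of what the score measures, R^2 can be computed by hand as 1 - SS_res / SS_tot, which matches the definition given in the scikit-learn documentation for the regressor `score` method. The helper name `r_squared` and the data are made up for this example.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]

perfect = [1.0, 2.0, 3.0, 4.0]          # matches every point
mean_only = [2.5, 2.5, 2.5, 2.5]        # always predicts the mean
worse_than_mean = [4.0, 3.0, 2.0, 1.0]  # trend is reversed

print(r_squared(actual, perfect))          # 1.0
print(r_squared(actual, mean_only))        # 0.0
print(r_squared(actual, worse_than_mean))  # -3.0

# Repeating the data 10 times leaves R^2 unchanged, unlike SSE.
print(r_squared(actual * 10, worse_than_mean * 10))  # -3.0
```

The last line shows the independence from dataset size: both sums scale by the same factor, so their ratio stays the same.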
7. Classification vs. regression

8. Multivariate regression

9. Mini-project
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
### pickle files must be opened in binary mode ("rb") under Python 3
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "rb") )
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
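### sklearn.cross_validation was removed in later sklearn releases;
### train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split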
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(feature_train, target_train))
print(reg.score(feature_test, target_test))
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color )
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color )
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()


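Regressing bonus against LTI
We have many financial features available, and some of them may be more powerful than others for predicting a person's bonus. For example, suppose you have thought about the data and guess that the "long_term_incentive" feature (meant to reward employees who contribute to the long-term health of the company) may be more closely related to bonus than salary is.
One way to confirm the hypothesis is to regress bonus against long-term incentive and see whether the regression score is significantly higher than when regressing bonus against salary. Regress bonus against long-term incentive: what is the score on the test data?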
features_list = ["bonus", "long_term_incentive"]


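Outliers break the regression
This is a preview of the next lesson, which covers identifying and removing outliers. Go back to the earlier setup, where salary is used to predict bonus, and rerun the code to look at the data again. You may notice that a few data points fall outside the main trend: someone with a high salary (over 1 million dollars!) but a relatively small bonus. This is an example of an outlier, and the next lesson focuses on them.
A point like this can have a large effect on a regression: if it falls in the training set, it can significantly change the slope/intercept. If it falls in the test set, it can make the score much lower than it would be otherwise. As things stand, this point falls in the test set (and very likely lowers the score).
Now we draw two regression lines, one fit on the test data (with the outlier) and one fit on the training data (without the outlier). Look at the plot now: a big difference, right? A single outlier causes a large change.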
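The effect of one outlier on the fitted line can be reproduced on toy data. The sketch below does not use the course dataset; the salary and bonus numbers, the outlier, and the helper `fit_ols` (a closed-form one-feature least-squares fit) are all made up for illustration.

```python
def fit_ols(xs, ys):
    """Closed-form least-squares fit for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up points lying exactly on bonus = 2 * salary.
salaries = [1.0, 2.0, 3.0, 4.0, 5.0]
bonuses = [2.0, 4.0, 6.0, 8.0, 10.0]

slope_clean, intercept_clean = fit_ols(salaries, bonuses)

# Add one hypothetical outlier: very high salary, very small bonus.
slope_outlier, intercept_outlier = fit_ols(salaries + [10.0], bonuses + [1.0])

print(slope_clean, intercept_clean)      # the clean fit recovers slope 2, intercept 0
print(slope_outlier, intercept_outlier)  # the slope even turns negative
```

With only six points, the single outlier is enough to flip the sign of the slope, which is the same kind of distortion the two plotted regression lines make visible.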
