Udacity - Linear Regression

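continuous supervised learning: supervised learning in which the output is a continuous variable

regression: predicting a continuous output

continuous: the values have an ordering and can be compared by size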

1. Concept

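slope: the change in the predicted value per unit increase in the input

intercept: the predicted value when the input is 0

coefficient: the multiplier applied to a feature; with a single feature it is the slope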

2. Coding

import numpy
import matplotlib.pyplot as plt

from ages_net_worths import ageNetWorthData

ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()



from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so index into the output
### to get the number itself; the [0][0] at the end does that for the 2-D
### output returned here. The argument to predict() must also be 2-D: a list
### of lists such as [[27]] is interpreted by sklearn as a 2-D array.
km_net_worth = reg.predict([[27]])[0][0]

### get the slope
### coef_ is a 2-D array here, so stick [0][0] on the end
slope = reg.coef_[0][0]

### get the intercept
### intercept_ is a 1-D array here, so stick [0] on the end
intercept = reg.intercept_[0]

### get the score on test data
test_score = reg.score(ages_test, net_worths_test)

### get the score on the training data
training_score = reg.score(ages_train, net_worths_train)


def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader.
    return {"networth":km_net_worth,
            "slope":slope,
            "intercept":intercept,
            "stats on test":test_score,
            "stats on training": training_score}

3. Linear regression error

The best linear regression is the one that minimizes the sum of squared errors.
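As a rough illustration of that criterion, the sketch below computes the sum of squared errors for two candidate lines on a tiny made-up dataset. The helper name `sum_of_squared_errors` and all of the numbers are assumptions chosen for the example, not part of the course code.

```python
def sum_of_squared_errors(xs, ys, slope, intercept):
    """Return the sum over all points of (actual - predicted) ** 2."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up illustration data (not the course dataset).
ages = [20.0, 30.0, 40.0, 50.0]
net_worths = [120.0, 190.0, 250.0, 310.0]

# Two hypothetical candidate lines: one close to the data, one far from it.
sse_good = sum_of_squared_errors(ages, net_worths, slope=6.3, intercept=-3.0)
sse_bad = sum_of_squared_errors(ages, net_worths, slope=3.0, intercept=50.0)

print("SSE of the closer line: ", sse_good)
print("SSE of the farther line:", sse_bad)
```

The line with the smaller sum is the better fit under this criterion.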

4. Algorithms for minimizing the sum of squared errors

ordinary least squares (OLS)
gradient descent
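The two approaches can be compared on a one-feature problem. The sketch below is only an illustration under stated assumptions: the function names, the data, the learning rate and the step count are all made up for the example. The gradient descent version minimizes the mean squared error, which has the same minimizer as the sum of squared errors.

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Iteratively step against the gradient of the mean squared error."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Made-up illustration data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

ols_slope, ols_intercept = fit_ols(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
print("OLS:             ", ols_slope, ols_intercept)
print("gradient descent:", gd_slope, gd_intercept)
```

On data this small both methods should land on essentially the same line; gradient descent needs a suitable learning rate and enough steps, while OLS computes the answer directly.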

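5. The problem with SSE

sum of squared errors (SSE): it grows as more data points are added, so SSE values computed on datasets of different sizes are not directly comparable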
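A small sketch of this effect, using made-up numbers: every point below sits exactly 0.5 away from the same line, so the fit is equally good in both cases, yet the SSE doubles when the number of points doubles. The helper `make_points` and the line y = 2x are assumptions for the example.

```python
def sse(actual, predicted):
    """Sum of squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def make_points(n):
    """Hypothetical data: y = 2x, shifted up or down by 0.5 on alternating points."""
    xs = [float(i) for i in range(1, n + 1)]
    ys = [2.0 * x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
    return xs, ys


xs_small, ys_small = make_points(4)
xs_large, ys_large = make_points(8)

# Predictions from the same line, y = 2x, for both datasets.
sse_small = sse(ys_small, [2.0 * x for x in xs_small])
sse_large = sse(ys_large, [2.0 * x for x in xs_large])

print("SSE with 4 points:", sse_small)
print("SSE with 8 points:", sse_large)
```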

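6. The R-squared metric for regression

R^2 is at most 1, and on the training data it lies between 0 and 1; the closer it is to 1, the better the fit. On test data it can be negative when the fit is poor.
Advantage: it does not depend on the number of training points, so it is somewhat more reliable than the sum of squared errors.
In sklearn, use reg.score to get R^2.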
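To make the metric concrete, the sketch below computes R^2 by hand as 1 - SS_res / SS_tot and compares it with `reg.score`. The data is made up for illustration. The last line scores the same model against deliberately mismatched targets (the training targets in reverse order, a contrived stand-in for data the line fits badly) to show that the score can drop below 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up illustration data (not the course dataset).
ages = np.array([[20.0], [30.0], [40.0], [50.0], [60.0]])
net_worths = np.array([110.0, 200.0, 240.0, 330.0, 370.0])

reg = LinearRegression()
reg.fit(ages, net_worths)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares).
predictions = reg.predict(ages)
ss_res = np.sum((net_worths - predictions) ** 2)
ss_tot = np.sum((net_worths - net_worths.mean()) ** 2)
r_squared_manual = 1.0 - ss_res / ss_tot

r_squared_sklearn = reg.score(ages, net_worths)
print("manual R^2: ", r_squared_manual)
print("reg.score:  ", r_squared_sklearn)

# Contrived mismatch: score the same line against reversed targets.
r_squared_bad = reg.score(ages, net_worths[::-1])
print("R^2 on mismatched targets:", r_squared_bad)
```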

7. Classification vs. regression

[image: comparison of classification and regression]

8. Multivariate regression

[image: multivariate regression]
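With more than one input feature, `LinearRegression` fits one coefficient per feature plus a single intercept. A minimal sketch, assuming a made-up target built exactly as 2 * x1 + 3 * x2 + 5 so that the recovered coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features per row.
features = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Target constructed for illustration as 2 * x1 + 3 * x2 + 5 (no noise).
target = 2.0 * features[:, 0] + 3.0 * features[:, 1] + 5.0

reg = LinearRegression()
reg.fit(features, target)

print("coefficients:", reg.coef_)      # one entry per feature
print("intercept:   ", reg.intercept_)
```

Because the target here is 1-D, `coef_` is a 1-D array and `intercept_` is a plain number, unlike the 2-D case in the coding exercise above.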

9. Mini-project

#!/usr/bin/python

"""
    Starter code for the regression mini-project.
    
    Loads up/formats a modified version of the dataset
    (why modified?  we've removed some trouble points
    that you'll find yourself in the outliers mini-project).

    Draws a little scatterplot of the training/testing data

    You fill in the regression code where indicated:
"""    


import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
### open the pickle in binary mode ("rb"), which Python 3 requires
with open("../final_project/final_project_dataset_modified.pkl", "rb") as f:
    dictionary = pickle.load(f)

### list the features you want to look at--first item in the 
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )

### training-testing split needed in regression, just like classification
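### train_test_split lives in sklearn.model_selection
### (the old sklearn.cross_validation module was removed in sklearn 0.20)
from sklearn.model_selection import train_test_split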
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"



### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and 
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(feature_train, target_train))
print(reg.score(feature_test, target_test))



### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter( feature, target, color=test_color ) 
for feature, target in zip(feature_train, target_train):
    plt.scatter( feature, target, color=train_color ) 

### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")




### draw the regression line, once it's coded
try:
    plt.plot( feature_test, reg.predict(feature_test) )
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()

Regressing bonus on LTI (long-term incentive)

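We have many financial features available, and some of them may be more powerful than others for predicting a person's bonus. For example, suppose you have thought about the data and guess that the "long_term_incentive" feature (a reward meant for employees who contribute to the long-term health of the company) is more closely related to bonus than salary is.

One way to test this hypothesis is to regress bonus on long-term incentive and see whether that regression scores significantly higher than regressing bonus on salary. Regress bonus on long-term incentive: what is the score on the test data?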

features_list = ["bonus", "long_term_incentive"]

Outliers break regressions

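This is a preview of the next lesson, which covers identifying and removing outliers. Go back to the earlier setup, where you used salary to predict bonus, and rerun the code to look at the data again. You may notice that a few data points fall outside the main trend: someone with a high salary (over 1 million dollars!) who received a relatively small bonus. This is an example of an outlier, and the next lesson focuses on them.

A point like this can have a large effect on a regression: if it falls in the training set, it can significantly change the slope/intercept. If it falls in the test set, it can make the score much lower than it would otherwise be. As things stand, this point falls in the test set (and most likely ends up lowering the score).

Now we draw two regression lines, one fit on the test data (with the outlier) and one fit on the training data (without the outlier). Look at the plot now: a big difference, right? A single outlier can make a big difference.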
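The same effect can be reproduced on synthetic data. In the sketch below, five made-up points lie exactly on the line bonus = 2 * salary (arbitrary units), and one added point with a high salary and a low bonus plays the role of the outlier. None of these numbers come from the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that follow bonus = 2 * salary exactly (arbitrary units).
salaries = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
bonuses = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# The same data plus one hypothetical outlier: high salary, small bonus.
salaries_with_outlier = np.vstack([salaries, [[10.0]]])
bonuses_with_outlier = np.append(bonuses, 1.0)

reg_clean = LinearRegression().fit(salaries, bonuses)
reg_outlier = LinearRegression().fit(salaries_with_outlier, bonuses_with_outlier)

print("slope without outlier:", reg_clean.coef_[0])
print("slope with outlier:   ", reg_outlier.coef_[0])
```

In this toy case the single extra point is enough to turn a clearly positive slope into a slightly negative one.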
