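Variable names
A name can be at most 32 bytes long. (A letter takes 1 byte; a Chinese character takes 2 bytes in a double-byte encoding such as GBK.)
A name must start with a letter or an underscore.
It can contain letters, digits, and underscores, but not special characters such as %$!*&#@.
Letters can be lowercase or uppercase; names are not case-sensitive.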
Missing numeric data are represented by a single period (.) and missing character data are represented by blanks.
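A minimal sketch of how missing values look in practice. The data set and variable names (demo, name, score, note) are made up for illustration.

```sas
data demo;
    input name $ score;
    /* a missing numeric value compares equal to a single period */
    if score = . then note = 'no score';
    datalines;
Ann 90
Bob .
;
run;

proc print data=demo; run;
```

In the printed output, Bob's score should appear as a period, and Ann's note, a character variable that was never assigned, should appear blank.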
library name
1-8 characters; starts with a letter or an underscore; the remaining characters are letters, digits, or underscores
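Comments
Statement comment: starts with an asterisk (*) and ends with a semicolon (;).
Block comment: starts with slash-asterisk (/*) and ends with asterisk-slash (*/).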
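The two comment styles can be sketched as follows; the DATA _NULL_ step is only there to give the comments something to sit next to.

```sas
* This statement comment ends at the semicolon;

/* This block comment can
   span several lines */
data _null_;
    x = 1;    /* a block comment can also follow a statement */
    put x=;   /* writes x=1 to the log */
run;
```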
Differences between DATA steps and PROC steps
The DATA statement does three things
- Tells SAS that a DATA step is starting.
- Names the SAS dataset being created.
- Sets variables used in the DATA step to missing values.
three default windows
1.program editor window
2.log window
3.output window
The basics of using SAS
- Prepare the SAS program
- Submit it for analysis
- Review the resulting log for errors
- Examine the output files to view the results of your analysis
Executing the program
- Pull down the Locals menu and select Submit.
- Click on the run icon on taskbar, which is a picture of a man running.
- Push F8.
- Highlight text and click on run symbol
- Note: A DATA or PROC step is not executed until SAS reaches the next DATA or PROC statement. Use a RUN; statement to force execution.
Reading a .dat file
DATA NAME;
INFILE 'E:\data\a.dat' FIRSTOBS=4 DLM=',';
INPUT V1 1-5 V2 6-10 V3 $ 15;
RUN;
PROC PRINT DATA=NAME; RUN;
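INFILE options
Syntax
INFILE 'AAAAA.DAT' XXX;
(XXX stands for one or more of the options below.)
FIRSTOBS=n  the line at which to start reading data
OBS=n  the last line to read
MISSOVER  when the INPUT statement reaches the end of a line before all variables have values, SAS sets the remaining variables to missing instead of reading from the next line; without it (the default, FLOWOVER), SAS continues reading from the next line
TRUNCOVER  for column input on lines of varying length: when a line ends before the declared columns do, SAS keeps whatever it has read for that field rather than moving to the next line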
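A minimal sketch combining these options. The file path and the column layout are hypothetical.

```sas
/* 'C:\data\scores.dat' is a hypothetical file with a one-line header */
data scores;
    /* skip the header, stop after line 11, tolerate short lines */
    infile 'C:\data\scores.dat' firstobs=2 obs=11 truncover;
    input name $ 1-10 score 11-15;
run;

proc print data=scores; run;
```

Because OBS= names the last line to read rather than a count, FIRSTOBS=2 with OBS=11 reads ten lines.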
INPUT Notes
(1) Duplicate formats can be used when variables have the same format. The examples below represent the same formats of variables x1-x5.
INPUT x1 4. x2 4. x3 4. x4 4. x5 4.;
INPUT (x1 x2 x3 x4 x5) (4. 4. 4. 4. 4.);
INPUT (x1-x5) (5*4.);
(2) @@ tells SAS to hold the line of raw data and use it when processing the next observation. The @@ must be the last entry in the INPUT statement.
(3) @ tells SAS to hold this line of data for possible use by INPUT statements later in the DATA step. The @ must be the last entry in the INPUT statement.
(4) / tells SAS to move to the next line of the raw dataset.
(5) #n tells SAS to skip to the nth line of the raw data for the observation.
(6) @n tells SAS to move to the nth column.
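As a small illustration of the line-pointer controls in notes (4) and (5), the sketch below reads one observation from every two lines of raw data. The data set and variable names are made up.

```sas
data people;
    input name $ age          /* line 1 of each observation */
        / height weight;      /* the slash moves to line 2 */
    datalines;
Ann 34
165 58
Bob 41
180 82
;
run;

proc print data=people; run;
```

The same layout could also be written with explicit line pointers: input #1 name $ age #2 height weight;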
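Special characters
@40 moves to column 40; @'aa' moves to just after the string aa
Slash (/) moves to the next line of raw data
#2 moves to the second line of raw data for the observation
When one line holds several observations, put @@ at the end of the INPUT statement
Put @ (trailing at) at the end of an INPUT statement to hold the line; this can be used to select part of the data (see the example)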
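A minimal sketch of both line-hold specifiers, using made-up data. The first step reads several observations per line with @@. The second uses a trailing @ to check a record type before reading the rest of the line.

```sas
/* @@ : several observations on one raw-data line */
data pairs;
    input id score @@;
    datalines;
1 90 2 85 3 77
4 60 5 95
;
run;

/* trailing @ : read the type first, keep only type A records */
data type_a;
    input type $ @;
    if type = 'A';        /* subsetting IF: other rows are dropped here */
    input age weight;
    datalines;
A 34 150
C 9 60
A 41 180
;
run;
```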
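Reading delimited files in a DATA step
DLM=',' sets a comma as the delimiter; DLM='09'x sets a tab as the delimiter
DSD  ignores delimiters inside quoted values. For example, in the observation Joseph,76,"Red Racers, Washington" the commas outside the quotes are treated as delimiters but the comma inside the quotes is not. DSD also strips the quotation marks from character values and treats two adjacent delimiters as a missing value.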
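The DSD behaviour can be tried with inline data. This is a sketch: the second data line and the variable names are invented, and the LENGTH statement is added so the team name is not truncated to the default 8 characters.

```sas
data racers;
    infile datalines dsd dlm=',';
    length name $ 10 team $ 30;
    input name $ age team $;
    datalines;
Joseph,76,"Red Racers, Washington"
Mitchel,,"Blue Bunnies, Richmond"
;
run;

proc print data=racers; run;
```

The two adjacent commas on the second line should give a missing age, and the quoted team names should be read whole, without the quotation marks.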
Reading Excel data
PROC IMPORT DATAFILE='D:\A.XLS' OUT=A REPLACE DBMS=XLS; GETNAMES=YES; SHEET="Sheet1"; RUN;
PROC PRINT DATA=A; RUN;
OUT= name of the output data set
DBMS= file type: XLS or XLSX
Reading a sas7bdat file (a file on the desktop)
data new; set 'C:\Users\sdkyc\Desktop\hsb2.sas7bdat'; run;
proc print data=new; run;
Temporary vs. permanent data sets
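A short sketch of the distinction, assuming the usual convention that data sets in the WORK library are deleted when the session ends, while data sets written to a library assigned with LIBNAME stay on disk. The libref mylib and the folder path are hypothetical.

```sas
/* mylib and 'C:\mydata' are hypothetical */
libname mylib 'C:\mydata';

data work.temp;      /* temporary: same as writing "data temp;" */
    x = 1;
run;

data mylib.saved;    /* permanent: stored as saved.sas7bdat in C:\mydata */
    x = 1;
run;
```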
Variable assignment and arithmetic
IF-THEN DO IF-ELSE
- DO and END form a pair; every action between them is executed
DATA A;
INFILE 'C:\A.DAT';
INPUT V1 $ V2 V3;
IF V2 = . THEN V4='MISSING';
ELSE IF V2<100 THEN V4='LOW';
ELSE IF V2<1000 THEN V4='MEDIUM';
ELSE V4 = 'HIGH';
RUN;
- Can be used to create subsets
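A minimal sketch of a DO/END group and of a subsetting IF, with invented data and variable names.

```sas
data passed;
    input name $ score;
    if score >= 90 then do;    /* both assignments run together */
        grade = 'A';
        bonus = 1;
    end;
    else do;
        grade = 'B';
        bonus = 0;
    end;
    if score >= 60;            /* subsetting IF: keep only these rows */
    datalines;
Ann 95
Bob 72
Cid 41
;
run;

proc print data=passed; run;
```

Only Ann and Bob should remain in the output data set, since Cid's score fails the subsetting IF.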
Simplifying programs with arrays: ARRAY
ARRAY array-name <{n}> <$> <length> <elements> <(initialvalues)>;
array-name - is the name of the array.
{n} - is either the dimension of the array, or an asterisk (*) to indicate that the dimension is determined from the number of array elements or initial values.
$ indicates that the array type is character.
length - is the maximum length of elements in the array. For character arrays, the maximum length cannot exceed 200 in SAS 6; later releases allow up to 32,767.
elements - are the variables that make up the array and they exist in a dataset or are created before the array definition.
initial-values - are the values to use to initialize some or all of the array elements. Separate these values with commas or blanks.
ARRAY rain {5} janr febr marr aprr mayr;
ARRAY days{7} d1-d7;
ARRAY month{*} jan feb jul oct nov;
ARRAY x{*} _NUMERIC_;
ARRAY qbx{10};
ARRAY meal{3};
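Arrays pay off when combined with a DO loop. The sketch below assumes, purely for illustration, that 999 is a code for "not recorded" in seven daily readings and recodes it to missing.

```sas
data temps;
    input d1-d7;
    array days{7} d1-d7;
    do i = 1 to 7;
        /* 999 is an assumed placeholder code, not a SAS convention */
        if days{i} = 999 then days{i} = .;
    end;
    drop i;
    datalines;
20 21 999 19 18 999 22
;
run;

proc print data=temps; run;
```

Without the array, the same recoding would need seven separate IF statements.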
Link to notes on each PROC
https://stats.idre.ucla.edu/other/annotatedoutput/
PROC CONTENTS gets the descriptor portion of a data set, not the data itself
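A one-line example, using the sashelp.class sample data set that ships with SAS:

```sas
proc contents data=sashelp.class; run;
```

The output lists the variables with their types and lengths, plus the number of observations, without printing any data rows.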
PROC MEANS
Prints descriptive statistics; overlaps with PROC UNIVARIATE
maxdec number of decimal places
proc means data=a N NMISS MEAN STD STDERR MAXDEC=4; run;
PROC UNIVARIATE t-test sample mean mu0
Test for location is a two-tailed t-test: look at the Student's t value; if p < α, the mean of write is not equal to 30.
proc univariate data = "D:\hsb2" plots normal mu0=30; var write; run;
Also used to test normality: draw the plots and find the Shapiro-Wilk p-value; if it is greater than α, the data are consistent with a normal distribution.
proc univariate data=a normal plot; var write; run;
1.These tests check the assumption that the data is distributed as a normal distribution.
2.Null hypothesis: data is normal vs Alternate hypothesis: data not normal.
3.A large p-value (e.g. > 0.05) indicates the data are consistent with a normal distribution (we fail to reject the null hypothesis).
4.If 6 < sample size < 2001 use Shapiro-Wilk.
5.Sample size > 2000 use Kolmogorov-Smirnov test.
6.Within the appropriate sample size range Shapiro-Wilk is more powerful than Kolmogorov-Smirnov test.
7.Any marked departure from skewness = 0 and kurtosis = 0 suggests non-normality.
PROC FREQ TABLES chisq
Tests whether two variables are associated or independent. Find the chi-square value in the output.
A large chi-square statistic will correspond to a small p-value. If the p-value is small enough (say < 0.05), then we reject the null hypothesis that the two variables are independent and conclude that there is an association between the row and the column variables.
PROC FREQ DATA=CLASSFIT2; TABLES SEX*HT/CHISQ; RUN;
PROC REG
Assumption
a.Normality of errors: The error distribution is normal.
b.Normality of errors is checked by residual analysis: first calculate the residuals ($r = y - \hat{y}$, the observed value minus the predicted value), then verify the normality of the residuals using proc univariate or Q-Q plots.
c.Independence: The errors or observations are independent of each other. Example: Apple stock prices recorded on 10 consecutive days; here the 10 observations are not independent.
d.Variables must be numeric.
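The residual check in (b) can be sketched as follows. The data set a and the variables y and x are hypothetical; resid_out, resid and pred are names chosen for this example.

```sas
/* a, y and x are hypothetical names */
proc reg data=a;
    model y = x;
    /* save residuals (r=) and predicted values (p=) to a new data set */
    output out=resid_out r=resid p=pred;
run;
quit;

/* normality tests and plots for the residuals */
proc univariate data=resid_out normal plot;
    var resid;
run;
```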
PROC ANOVA
Assumption: the sampled populations are normally distributed.
one-way ANOVA: only one factor (a single variable, which can have several levels)
See the PPT slides.
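A minimal one-way ANOVA sketch. The data set a, the factor group and the response score are hypothetical names; the MEANS statement with the TUKEY option is an optional extra that compares the group means pairwise.

```sas
/* a, group and score are hypothetical names */
proc anova data=a;
    class group;              /* the single factor */
    model score = group;
    means group / tukey;      /* pairwise comparisons of group means */
run;
quit;
```

PROC ANOVA is intended for balanced designs; PROC GLM accepts the same CLASS and MODEL statements for unbalanced data.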
PROC GLM contrast
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glm_toc.htm
1.Question: is the mean height the same at every age? μ1=μ2=μ3=μ4
proc glm data=a; class age; model height=age; run;
2.Question: does the mean height of 11- and 12-year-olds differ from the mean height of 13- to 16-year-olds?
proc glm data=a; class age;
model height=age;
contrast '11&12 vs. rest'
age 2 2 -1 -1 -1 -1; run; quit;
PROC CORR
Shows the Pearson correlation coefficients between variables: a negative value means negative correlation, a positive value means positive correlation.
nosimple suppresses the descriptive statistics
proc corr data = "D:\hsb2" pearson nosimple; var read write; run;
PROC TTEST t-test
Assumption: all variables are normally distributed.
- Single sample t-test. Example: test whether the mean of score equals 50; if p < α, the mean differs significantly from 50.
proc ttest data="D:\hsb2" H0=50; var score; run;
- Dependent group t-test (paired t-test). Example: a group of students each took two exams; test whether the mean write score equals the mean read score; if p < α, the means differ significantly.
proc ttest data="D:\hsb2"; paired write*read; run;
- Independent group t-test. Example: does sex affect the write score?
If the equality of variances Pr>F value is less than α, the two groups have different variances, so use the Satterthwaite (unequal) method and read its Pr>|t|; otherwise use the pooled method.
proc ttest data="D:\hsb2"; class sex; var write; run;
PROC NPAR1WAY
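Runs the Wilcoxon test. Example questions:
Are test scores different from 4th grade to 5th grade on the same students?
Does a particular diet drug have an effect on BMI when tested on the same individuals?
Assumptions of the test:
Data comes from two matched, or dependent, populations.
The data is continuous.
Because it is a non-parametric test, it does not require any particular distribution of the dependent variable in the analysis.
It is especially suitable for small sample sizes.
Matched-sample questions like these call for the Wilcoxon signed-rank test, which PROC UNIVARIATE reports for a difference variable; the WILCOXON option of PROC NPAR1WAY gives the rank-sum test for independent groups.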
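A sketch of both Wilcoxon variants, with hypothetical data sets and variables throughout (scores, grade4, grade5, trial, group, bmi). For matched data, the signed-rank test is obtained from PROC UNIVARIATE on the paired differences; PROC NPAR1WAY with the WILCOXON option gives the rank-sum test for two independent groups.

```sas
/* matched samples: signed-rank test on the paired differences */
data diffs;
    set scores;                 /* hypothetical: one row per student */
    diff = grade5 - grade4;
run;

proc univariate data=diffs;
    var diff;                   /* see "Tests for Location": Signed Rank */
run;

/* independent groups: Wilcoxon rank-sum test */
proc npar1way data=trial wilcoxon;
    class group;                /* hypothetical two-level grouping variable */
    var bmi;
run;
```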
one- and two-tail test
P value
If you test H0=0 and get p < α, reject H0: the mean is significantly different from 0.
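In PROC TTEST the choice between a one- and a two-tailed test can be made with the SIDES= option. A sketch, with a hypothetical data set a and variable x:

```sas
/* a and x are hypothetical names */

/* two-tailed (the default): H1 is mean ne 0 */
proc ttest data=a h0=0 sides=2; var x; run;

/* one-tailed, upper: H1 is mean > 0 */
proc ttest data=a h0=0 sides=u; var x; run;

/* one-tailed, lower: H1 is mean < 0 */
proc ttest data=a h0=0 sides=l; var x; run;
```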
Code template
proc print data= ; run;








