Proc Sort from Basics to Beyond:
data base_data;
input employee 1-2 sex $ 3-4 age 5-7 ps 8-9;
datalines;
1 F 45 0
4 M 63 2
3 M 57 1
5 F 72 3
2 F 39 0
3 M 57 1
4 M 63 0
;
run;
proc sort data=base_data out=base_data1;
by employee;
run;
Here is how the output data set BASE_DATA1 looks:
EMPLOYEE SEX AGE PS
1 F 45 0
2 F 39 0
3 M 57 1
3 M 57 1
4 M 63 2
4 M 63 0
5 F 72 3
- DESCENDING OPTION
proc sort data=base_data out=base_data2;
by descending employee;
run;
BASE_DATA2 data set:
EMPLOYEE SEX AGE PS
5 F 72 3
4 M 63 2
4 M 63 0
3 M 57 1
3 M 57 1
2 F 39 0
1 F 45 0
- DROP=, KEEP=, AND RENAME= OPTIONS
proc sort data=base_data out=base_data3(keep=employee age);
by employee;
run;
BASE_DATA3 data set :
EMPLOYEE AGE
1 45
2 39
3 57
3 57
4 63
4 63
5 72
proc sort data=base_data out=base_data4(rename=(employee=emp));
by employee;
run;
BASE_DATA4 data set:
EMP SEX AGE PS
1 F 45 0
2 F 39 0
3 M 57 1
3 M 57 1
4 M 63 2
4 M 63 0
5 F 72 3
SUGI 31 Coders’ Corner
There are a few of things to keep in mind when using these options.
First, if you use the RENAME= option, SAS changes the name of the variable in that procedure. Second, if you use RENAME= with either the DROP= or the
KEEP= options, the DROP= and the KEEP= options are applied before RENAME=.
Thus, use the “old name” in the DROP= and KEEP= options. The order in which you place the RENAME=, DROP=, and KEEP= options does not matter and does not change process order.
In regard to using parentheses, a list of multiplevariables to rename must be enclosed in parentheses, renaming just one variable does not, and the KEEP= variables should not be enclosed in parentheses.
Another thing to remember is that you cannot drop and rename the same variable in the same statement.
Here is an example of using both the KEEP= and RENAME= options:
proc sort data=base_data out=base_data5(rename=(employee=emp age=dxage) keep=employee age);
by employee;
run;
BASE_DATA5 data set:
EMP DXAGE
1 45
2 39
3 57
3 57
4 63
4 63
5 72
Notice that in this example you have to use the variable names EMPLOYEE and AGE in the KEEP= statement instead of EMP and DXAGE. Also notice that you have to use EMPLOYEE instead of EMP in the BY statement. You can actually rearrange this example in a couple of different ways and end up with the same result.
Below are two alternate ways:
proc sort data=base_data out=base_data5(keep=employee age rename=(employee=emp age=dxage));
by employee;
run;
proc sort data=base_data(keep=employee age rename=(employee=emp age=dxage)) out=base_data5;
by emp;
run;
Note that in the second rearrangement above that the variable name EMP must be used instead of EMPLOYEE in the BY statement because the KEEP= and RENAME= options are used before the OUT= option versus after it like in the original configuration.
- FORMAT AND LABEL STATEMENTS
Other statements that are the same in the SORT procedure as in a DATA step are the FORMAT and LABEL statements. You can apply a variable format or create variable labels within PROC SORT. First, I will show an
example using a format:
proc format;
value $SEX 'F'='Female'
'M'='Male';
run;
proc sort data=base_data out=base_data6;
format sex $SEX.;
by employee;
run;
BASE_DATA6 data set:
EMPLOYEE SEX AGE PS
1 Female 45 0
2 Female 39 0
3 Male 57 1
3 Male 57 1
4 Male 63 2
4 Male 63 0
5 Female 72 3
In this example, the format $SEX was created in the FORMAT procedure. This format was then applied to the variable SEX in a PROC SORT statement.
Now, in the output data set, instead of ‘F’ and ‘M’, you see the formatted values ‘Female’ and ‘Male’. Note that the FORMAT statement does not permanently alter the variables in the input data set.
Next I will show an example using labels:
proc sort data=base_data out=base_data7;
label ps='Performance Status'
age='Age at Diagnosis';
by employee;
run;
proc print data=base_data7 label;
run;
The output looks like:
Age at Performance
Obs employee sex Diagnosis Status
1 1 F 45 0
2 2 F 39 0
3 3 M 57 1
4 3 M 57 1
5 4 M 63 2
6 4 M 63 0
7 5 F 72 3
By using the PRINT procedure with the label option following the PROC SORT statement, you can see the labels created for the variables PS and AGE.
Like the FORMAT statement, the LABEL statement does not permanently alter the variables in the input data set.
- WHERE= OPTION OR WHERE STATEMENT
proc sort data=base_data(where=(age>50)) out=base_data8;
by employee;
run;
Here is an example using the WHERE statement:
proc sort data=base_data out=base_data8;
where age>50;
by employee;
run;
Both of these produce the same output data set:
EMPLOYEE SEX AGE PS
3 M 57 1
3 M 57 1
4 M 63 2
4 M 63 0
5 F 72 3
SUGI 31 Coders’ Corner
A nice feature of the WHERE= option and WHERE statement is being able to use some exclusive WHERE expressions such as: BETWEEN-AND, ? or CONTAINS, IS NULL or IS MISSING, LIKE (matches patterns), =*
(sounds like), and SAME-AND.
- FIRSTOBS= AND OBS= OPTIONS
There may be circumstances when you have a very large data set and you would like to split it up into smaller, more manageable data sets. This may be a time when it would be helpful to use the FIRSTOBS= and OBS= options within the SORT procedure.
The FIRSTOBS= option causes SAS to begin reading at a specified observation or record. The OBS= option specifies at which observation SAS processing ends. The two options are often used together to define a range of observations to be processed.
However, you do not have to use both options together. If you do not include the OBS= option with the FIRSTOBS= option, by default PROC SORT will stop at the last observation. If you do not include the FIRSTOBS= option with the OBS= option, by default PROC SORT will start at the first observation.
Here is an example of using both the FIRSTOBS= and OBS= options:
proc sort data=base_data(firstobs=3 obs=5) out=base_data9;
by employee;
run;
BASE_DATA9 data set:
EMPLOYEE SEX AGE PS
2 F 39 0
3 M 57 1
5 F 72 3
In this example, the procedure first took observations 3, 4, and 5 from the original, unsorted data set BASE_DATA and thensorted it by EMPLOYEE.
- NODUPRECS AND NODUPKEY OPTIONS
The NODUPRECS (or NODUP) and NODUPKEY options work similarly in that they both can eliminate unwanted observations, but NODUP compares all the variables in your data set while NODUPKEY compares just the BYvariables.
More specifically, if you use the NODUP option, PROC SORT compares all variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, the observation is not written to the output data set. If you specify the NODUPKEY option, PROC SORT compares all BY variable values for each observation to those for the previous observation written to the output data set.
If an exact match using the BY variable values is found, the observation is not written to the output data set.
First I will show an example of the NODUP option:
proc sort data=base_data nodup out=base_data10;
by employee;
run;
BASE_DATA10 data set:
EMPLOYEE SEX AGE PS
1 F 45 0
2 F 39 0
3 M 57 1
4 M 63 2
4 M 63 0
5 F 72 3
Notice in the output data set BASE_DATA10, the second observation for employee 03 is eliminated because all the variable values are the same. However, the second observation for employee 04 is not eliminated because performance score is different for these two observations.
Next I will show an example of the NODUPKEY option:
proc sort data=base_data nodupkey out=base_data11;
by employee;
run;
BASE_DATA11 data set:
EMPLOYEE SEX AGE PS
1 F 45 0
2 F 39 0
3 M 57 1
4 M 63 2
5 F 72 3
In this output data set you can see that the second observations for both employee 03 and 04 are eliminated because only the BY variable (in this case EMPLOYEE) had to be the same.
In this example, using NODUPKEY instead of NODUP may not be the better option to use because by using NODUPKEY, you eliminate the second observation for employee 04 even though there was a different performance score for this observation. This may be important information that needs to be kept in the data set.
No comments:
Post a Comment