Friday 2 December 2016

Proc SQL vs. Data Step in SAS


Proc SQL vs. Data Step in SAS - Part 1
1) Mr. X write uses Keep Col1 Col2  in data step while Mr.Y uses select Col1,Col2 statement in Proc SQL.

2) For a derived variable or flag creation, Mr. X writes uses If ... Else if statement,
Mr. Y would do the same thing with Case...when statements in Proc SQL.

3) Mr. X write uses Merge  in data step while Mr.Y uses Join statement in Proc SQL.



1.  If ... else if vs. case when statements

*_____________________________________________________________________;
Data check ;
do i = 1 to 10000000;
output;
end;
Run;

Data check_1;

set check ;
if mod(i,2) = 0 then flag = "Even";
else flag = "Odd";
Run;


Proc SQL;

create table check_2 as 
select i , (case when mod(i,2) = 0 then "Even" else "odd" end) as Flag
from check ;
Quit;

*_____________________________________________________________________;

Try the above code and check the real and CPU time in log, and also try to time it yourself. Check how much time it took SAS to process the codes. There is definitely a slight difference, Data step has proved itself to be more efficient.






Proc SQL vs. Data Step in SAS - Part 2

2.  Join (Proc SQL) vs. Merge (Data step)

This one is quite interesting, here efficiency as well as the result may vary ...

How ?  .............. wait don't get impatient ... for this first we need to deep dive a little into the background working algorithm of Join and Merge.





In Data step merge, data is merged point to point.

Suppose we merge following two datasets, without even a common field, check how it merges:

Data ABC;
input name_1 $ marks_1 ;
cards;
john 10
jack 20
jim  30
;
Run;


Data PQR;
input name_2 $ marks_2;
cards;
walker 100
daniel 200
beam  300
;
Run;


Data final;
Merge ABC PQR;
proc print;
Run;


 The result would be :



You can see that it merged the data point to point, based on position of each observation, however if we give matching field, SAS would ask you to first sort data sets based on the matching field and then again would merge data point to point position basis. Now we know how the data step merge works.

Let's see now, how SQL join would behave in the first example.



Data ABC;
input name_1 $ marks_1 ;
cards;
john 10
jack 20                                                
jim  30
;
Run;


Data PQR;
input name_2 $ marks_2 ;               
cards;
walker 100
daniel 200
beam  300
;
Run;


Proc SQL;
select * from ABC , PQR;
Quit;


The result would be :



What's that ... yes.... that's basic difference between Data step and a SQL join.

In SQL join, SAS first makes a Cartesian product (all possible combinations) as it made in above case. it then subsets this data based on the matching field condition given ... which was not present in the above code.

As in Proc SQL join, first SAS makes the Cartesian product and then subsets, suppose your datasets have 1 million (1000000) observations each. It would first make a data having :


1000000 X 1000000 = 1000000000000 observations



and then it would subset the data on the basis of matching field condition given. It is very time consuming and very much more memory consuming, which sometimes becomes a showstopper. This is the point number 1.

Let's see one more case by example.
*_____________________________________________________________________;
Data name_class;
input name : $10. Class $;
cards;
Ram 10
Shyam 12
Sita 10
Geeta 12
;
Run;


Data name_Age;
input name : $10.  Age $;
cards;
Ram 20
Shyam 22
Balram 25
Sita 18
Geeta 18
;
Run;

proc sort data = name_class;  by name; run;
proc sort data = name_age ;  by name; run;
Data Merged;
merge name_class name_age;
by name;
proc print;
Run;

Proc SQL;
create table joined as select * from name_class as a full join name_age as b
on a.name = b.name;
quit;
proc print data = joined;
run;
*_____________________________________________________________________;




Could you notice the contrast, the extra element in the second data appeared in the data, but the name was missing. So we have to be extra careful while using Proc SQL join. This is point number two.

Better Join code should be written this way :

*_____________________________________________________________________;
Proc SQL;
create table joined as select coalesce(a.name, b.name) as name, class, age from name_class as a full join name_age as b
on a.name = b.name;
quit;
proc print data = joined;run;
*_____________________________________________________________________;



Enjoy reading our other articles and stay tuned with ...



Ref: http://www.askanalytics.in/2015/09/case-when-and-if-else-if.html



No comments:

Post a Comment