Aggregation (creating one observation to summarize many) via DATA STEP programming

In this session we will be working with the records of patients enrolled in a LDL Cholesterol – reduction study.    Each patient’s LDL cholesterol (the bad stuff) reading was recorded each time they visited the test centre as illustrated below.    We have a different number of observations on the various patients as some came in for testing more frequently than did others.

data ldl ;
 input patient visit ldl ;
 datalines ;
 1 1 207
 1 2 198
 2 1 178
 2 2 176
 2 3 176
 3 1 227
 3 2 220
 3 3 210
 3 4 201
 3 5 199
run ;


We want to create one observation for each patient (i.e.  we want to aggregate the data).   What we want to know is the change in ldl cholesterol reading from the patient’s first visit to their last visit.

This free SAS book covers just about everything you’d ever need to do in the SAS  Data Step.  You will find most of what we’re doing today in Chapter 12, starting on page 187.

KEY POINT:  When you are doing aggregation in the SAS DATA STEP you use five SAS statements  to full advantage:


FIRST.   (pronounced “first dot” because there’s a dot at the end)
LAST.    (pronounced “last dot” because there’s a dot at the end)

The above statements may have funny names, but trust me, they’re your friends :>

BEFORE WE START:    We would use PROC MEANS for aggregation (i.e. to create one summary observation on ldl cholesterol for each patient), if what we wanted in that summary observation  was a statistic like the patient’s  mean, min, max, ldl cholesterol over the course of the study      HOWEVER, in this case we want the summary observation for each patient to be the difference between the patient’s  ldl cholesterol reading  recorded  at their first visit to the test centre, and their last visit’s ldl cholesterol reading, so we have to do that via DATA STEP programming – i.e. PROC MEANS can’t tell you the difference between two observations of many in a group (well, it actually can, but it’s not for the faint of heart <grin> ).  On page 196 in the SAS book referenced above we’re  “using a value from an earlier observation in a later observation”.

FYI:  here’s the SAS code to aggregate the records to create one record for each patient, with that record recording the range of LDL readings (i.e. the difference between the lowest and the highest LDL reading for each patient).

proc sort data=ldl ;
by patitent;
run ;

proc means data= ldl range;
by patient;
var ldl ;
output out=netChange range=change;
run ;



STEP 1:  Whether you’re aggregating with PROC MEANS or via DATA STEP programming the first step is ALWAYS to use PROC SORT to sort the data set by the group identification variable.    KEY POINT:  to aggregate in SAS the dataset must be sorted by the variable that identifies which observation is in what group – we call that variable the group identification variable.  In this case that variable is patient.


STEP 2:  Once the dataset is sorted by the GROUP IDENTIFICATION VARIABLE we’re all set to use our five friends:   BY  RETAIN   FIRST.    LAST.   and OUTPUT as below.

data summary ;
    set ldl_sorted;
    by patient;
    if first.patient then first_ldl = ldl ;
    if last.patient then do ;
        last_ldl = ldl ;
        difference_ldl = first_ldl - last_ldl ;
    retain first_ldl ;
    run ;


 by patient;
Tells SAS the dataset is sorted by the variable PATIENT. The dataset must be sorted by the GROUP IDENTIFICATION VARIABLE.

if first.patient then first_ldl = ldl ;
If SAS is reading the first observation for a patient, store the ldl reading in the variable first_ldl ;
if last.patient then do ;
        last_ldl = ldl ;
        difference_ldl = first_ldl – last_ldl ;
If SAS is reading the last observation for a patient, calculate the difference between the patient’s first and last ldl reading and
output the observation to the dataset “summary” that is being created.
retain first_ldl ;
A small statement but one that’s essential. Remember the value of the patient’s first ldl reading, so it is available for the calculation of the difference between the first and last readings for this patient.  KEY POINT:  If you want to access a variable’s value from an earlier observation you have to use the RETAIN statement.

PROC PRINT shows us what the dataset SUMMARY looks like:

proc print data=summary;
run ;

1 1 2 198 207 198 9
2 2 3 176 178 176 2
3 3 5 199 227 199 28

Screen Shot 2013-11-18 at 7.37.15 PM