# Calculating differences using DIF( ) and LAG( ) functions

This may be a short post but I think one that folks will find may come in handy.  I was recently asked how to calculate a difference of a variable listed in 2 rows.  As a made-up example.  I have 2 weight observations on some unit and I want to calculate the difference between the weights.  Old school me thought oh boys – I’ll have to transpose the data and then calculate the difference and then re-create the dataset in the same original format.  Way too much work!!

## DIF Function

Turns out I was missing this really cool DATA function called “diff”.  Who knew it could be that simple.  So let’s work through an example using the DIFF function and then we’ll work through it again using a different DATA function called “lag”.

The Proc sort is used to sort the data so that we have the 2 weight measures we want to calculate the difference between  beneath themselves.  In this case we want to calculate the difference between Time2 and Time1 for each ID Breed combination.

Proc sort data=wtdiff;
by id breed time;
Run;

Then we will create a new dataset called “test” and by using the BY statement in the DATA step we are telling SAS that we want to read the data to be read by the sorted ID Breed Time variables.  The first.time  tells SAS to look at the line of data of the first ID Breed Time combination then calculate the difference between the next time and this line – dif(varname) command.

Data test;
set wtdiff;
by id breed time;
if first.time then wtdiff = dif(weight);
Run;

Note that because the ID Breed Time changes every line that the first.time really isn’t doing anything.  You can remove this part of the statement and simply create the new variable  wtdiff = dif(weight); and that will work as well.  I’m providing this piece of code for situations when you will have more than 1 observation per grouping.

I then like to use Proc Print to see whether we were successful.

Proc print data=test;
Run;

You’ll notice that in the output dataset we now have a weight difference calculation for each set of lines except the first one – which makes sense – there is nothing for it to subtract from.  Now, our goal was to only have the difference for each ID Breed combination – so we need to delete those observations where the difference makes no sense.  If you look closely at the dataset, because of the sorting we did, we want to set the new variable wtdiff to a . whenever the time = 1.

Data test;
set wtdiff;
by id breed time;
if first.time then wtdiff = dif(weight);
if time = 1 then wtdiff = .;

Run;

Run a quick Proc Print to confirm the results.

Much, much quicker than transposing the data, calculating the difference and then reforming the data 🙂

## LAG Function

Another way to accomplish the same task is to use the LAG function available in the DATA step.  LAG(varname) will list the last observation for that variable.  Best way to explain this is to see it in action.

Data test;
set wtdiff;
by id breed time;
lastwt = lag(weight);
secondlastwt = lag2(weight);
Run;

lastwt = lag(weight) provides you with the last weight measure, whereas the secondlastwt = lag2(weight) goes back 2 datalines.  You can increase the lag# as much as you would like.

Proc print data=test;
Run;

So let’s drop the secondlastwt variable as this was more for demonstration purposes.  If you look at the new resulting dataset you can see that for every line where time = 2 – we want to calculate the difference between weight and lastwt to obtain the same wtdiff variable we did above using the DIF( ) function.

Data test;
set wtdiff;
by id breed time;
lastwt = lag(weight);
if time = 2 then do;
wtdiff = weight – lastwt;
drop lastwt;
end;
Run;

To accomplish this, I am using a DO Loop – if time =2 then I want you to do the following steps.  Create the variable wtdiff as the difference between weight and lastwt.  The second step is to drop or delete the variable called lastwt.  I then need to close the DO Loop and Run the Data step.

Of course, use the Proc Print to confirm the results.

## Conclusion

Two ways to accomplish the same task using 2 different DATA Step functions  DIF( ) and LAG( ).

Which way do you prefer???