Subsetting data with the IF and WHERE statement

sample_data

One of the strengths of SAS is the ability to work with data in a variety of ways.  Quite often as we’re working with our data we realize that we may need to look at a subset or a smaller portion of our data.  I have seen many students go back to their original Excel file and create a new worksheet with just the data they need, re-import into SAS and run their analysis.  YES this is a very viable way of doing it and if this is the best way for you, then please continue to do it this way.

In this post I’m going to show you 2 different ways to subset your data in SAS.  One method will create a new dataset and the second method will specify which data to use while you are in a PROC.  Both methods are more efficient than creating a new dataset outside of SAS and re-importing.

Using the IF statement

Our sample dataset contains 19 observations of first year students at University X.  We have females and males represented in this dataset.  Males have been coded as 1 and females coded as 2.  For our analysis we are interested in creating a new dataset with only the males.  Use the following syntax to accomplish this:

Data males;        <-  Name of the new subset data – with only males in it – think of this                                      as your File Save As option in Excel
    set sample;     <- Name of the complete data – remember we want to open the entire                                  dataset
    if sex = 1;        <-  We want only those individuals where sex has a value of 1
Run;

Once you run the above Data step you will have 2 datasets, our original dataset with all the observations (sample) and a new dataset called males with only those individuals in the sample dataset where the condition of sex = 1 was met, in other words only the males.

When you run an analysis where you only need the males remember that you need to specify the correct dataset males

Proc freq data=males;   <- using the dataset with males ONLY
    tables sex*course;
Run;

Will give you a table with only the males.

Proc freq data=sample;  <- using the entire dataset with males and females
    tables sex*course;
Run;

Will give you a table with males and females.

 Using the WHERE statement

This time we will not use the smaller dataset but keep the larger dataset intact and use a WHERE statement.  This is an extremely versatile statement that can be included in most if not all PROCs available in SAS.  To confirm whether it is available for the PROC you are working with, please check the SYNTAX section of the notes for the PROC.

For our example we want to reproduce the frequency table produced above for males only.

Proc freq data=sample;     <- Use the entire dataset with both males & females
    tables sex*course;                 
    where sex = 1;                <- specify that we only want observations where sex =                                                      1 or the males
Run;

What if my variable is a string variable?  Will these 2 statements still work?

YES!

In our dataset there is another variable called sex_alpha where the values are Male and Female.  Using the same coding from above we can replicate our results.

Using the IF statement

Data males;
   set sample;
    if sex_alpha = “Male”;
Run;

Using the WHERE Statement

Proc freq data=sample;
   tables sex_alpha*course;
   where sex_alpha = “Male”;
Run;

Things to be aware of if you are using a string variable with the IF and/or WHERE statements.  You have to ensure that the data has all been entered in an identical manner.  If in your dataset some of the Male entries were “male” or “MALE”, both IF and WHERE statements we used above will give you different results!  Remember you are looking for a string, a set of letters – SAS sees Male, male, and MALE as 3 separate values.  This is why coding string variables makes analysis a bit neater🙂

Adding a second variable to your WHERE statement

I would like to calculate the frequencies of individuals weighing more than 200lbs and who are male.

Proc print data=sample;
  var sex weight;
  where sex = 1 and weight > 100;
Run;

Screen Shot 2013-11-18 at 7.33.07 PM