Joining SAS datasets

Although we like to believe that our dataset is complete once we’ve finished entering the raw data, it is almost inevitable that we’ll need to add more data at some point.  Sometimes we may have more observations to add or at other times we may have more variables to add to a current set of observations.

We can easily perform either operation in SAS.

Adding observations or cases

Let’s start with a small dataset called SNH_WtHt that contains Weight (g) and Height (mm) of a new species of unicorns.  It just happens that these unicorns were placed onto 2 different treatments, labelled 1 and 2.

Data SNH_WtHt;
input ID Trmt Wt Ht;
datalines;
1 1 12 110
2 1 15 112
3 1 16 115
4 1 12 119
5 1 15 121
6 2 20 110
7 2 21 109
8 2 25 105
9 2 23 107
10 2 25 105
;
Run;

Remember you can use a Proc Print to ensure that the data has been entered correctly.  I strongly recommend you do this at every step.  Remember garbage in = garbage out!

Proc print data=SNH_WtHt;
Run;

Now, as luck would have it data on the last 4 unicorns in our herd was misplaced and found at a later date.  A labmate was very thoughtful and created a new dataset called SNH_NewObs with the 4 last unicorns.

Data SNH_NewObs;
input ID Trmt Wt Ht;
datalines;
15 1 12 110
18 1 15 115
12 2 15 118
29 2 19 102
;
Run;

Proc print data=SNH_NewObs;
Run;

We now want to add the 4 new observations to our current dataset so create a complete dataset with all 14 animals in it.  To perform this in SAS we will use a SET statement:

Data SNH_complete;
set SNH_WtHT SNH_NewObs;
Run;

Proc print data=SNH_complete;
Run;

The set statement will add the data from the SNH_NewObs dataset at the bottom of the SNH_WtHt dataset.  It is up to you to ensure that variables are listed in the same order in both datasets and that the data within each variable is the same type (numeric, string, date,e etc..).  SAS will NOT order the observations it simply adds them as they are.  If you look at the output from the Proc Print, you will see unicorns 1-10 at the top of the SNH_complete dataset with unicorns 15, 18, 12, and 22 at the bottom.

RECAP:

To add new observations or cases to your dataset:

  1. use the SET statement
  2. ensure that the variables are in the same order in all datasets
  3. ensure that variables are of the same type in all datasets
  4. can add several datasets in one statement – not just 2 as in this example.  For example, if you had data for Jan, Feb, Mar, Apr, May, Jun.  Your SAS program would be:

Data complete;
set Jan Feb Mar Apr May Jun;
Run;

Adding variables

Sometimes while working on a research experiment, you may discover that you need to collect additional information on your subjects, or maybe another research team has been collecting different data on your subjects, and you’d like to bring all that data together into one master file.  In other words, you’d like to add variables to your current dataset.

The important or key aspect to performing this function is to ensure that your subjects have unique ID numbers and that they match from one trial to the next.  You want to make sure that you’re adding the right information to the correct individual (or experimental unit) in your dataset.  For our unicorn example, we discovered that another research team measured arm length and we would like to add this to our current SNH_WtHt dataset,

Data SNH_ArmLength;
input ID Trt ArmLength;
datalines;
1 1 0.5
2 1 0.2
3 1 0.5
4 1 0.8
5 1 0.9
6 2 0.3
8 2 0.4
9 2 0.8
10 2 0.7
15 1 0.8
18 1 1
12 2 0.2
;
Run;

Please note that not all 14 unicorns have arm length measures, for circumstances beyond their control, the research team was unable to measure all of our unicorns.  Not to worry, we can still get SAS to add the data that we have on hand.

The first step is to ensure that both datasets are sorted by the “key” or unique identifier variable, in our case that would be the ID.

Proc sort data=SNH_complete;
by ID;
Run;

Proc sort data=SNH_ArmLength;
by ID;
Run;

If the datasets are not sorted then the merge in the next step will fail.  Don’t worry SAS will let you know🙂

Once the datasets are sorted or ordered then we can use the MERGE command to add the new variables to our current SNH_complete dataset.

Data SNH_Overall;
merge SNH_complete SNH_ArmLength;
by ID;
Run;

Proc print data=SNH_Overall;
Run;

Remember how we didn’t have arm length observations for all of our unicorns?  What do you suppose happened?  Take a look at the Proc Print output:

OBS ID TRMT WT HT ARMLENGTH
1 1 1 12 110 0.5
2 2 1 15 112 0.2
3 3 1 16 115 0.5
4 4 1 12 119 0.8
5 5 1 15 121 0.9
6 6 2 20 110 0.3
7 7 2 21 109 .
8 8 2 25 105 0.4
9 9 2 23 107 0.8
10 10 2 25 105 0.7
11 12 2 15 118 0.2
12 15 1 12 110 0.8
13 18 1 15 115 1.0
14 29 2 19 102 .

SAS included ALL 14 of our unicorns but added a “.” for missing data for any arm length measure that was not taken.

RECAP

To add new variables to your dataset:

  1. use the MERGE statement
  2. sort both datasets by the key or unique identifier
  3. data that is missing in one or the other dataset will be marked as missing with a “.”

Screen Shot 2013-11-18 at 7.33.07 PM