Imputing data and subsequent analysis

Let’s be honest now: how many times have you run a trial or an experiment and had to deal with missing data? The data could be missing for several reasons: missed measuring opportunities, individuals dropping out of a trial, experimental units dying or getting lost; the list goes on. In the end, our datasets may not be as complete as we wish them to be.

So, how do we deal with this? If you’ve run any sort of statistical analysis in the past, you are already all too aware that most procedures drop any observation with a missing value, which shrinks the sample and reduces the power of the analysis.

One way to deal with missing data is to NOT deal with it. Accept that you are missing the information, accept the implications of this, and move on. I will admit that most people fall into this category. “Yes! I sampled 105 units, but 23 of them dropped out of the study. Therefore, my results are based on the remaining 82 individuals.” But did you ever wonder if your results would be different if those 23 individuals were still there? As a side note, I always wonder WHY they left (but that’s another research question altogether 🙂).

A second way of working with missing data is to impute data or, in layman’s terms, to create some plausible and credible values for those that are missing.

How do we impute data?  There are several different methods of imputation and the route you take will depend on the structure of your missing values.

Single Imputation

A single imputation is where you replace each missing data point with one value. The most common choice is the mean of the observed (non-missing) values for that variable. As an example:

ID    Wt    Ht
1     14    39
2     15    43
3     14    37
4     12    32
5      .    40
6     10    31
7     15     .
8     16    42

Wt average = 96/7 = 13.7, so the new Wt value for ID#5 = 14
Ht average = 264/7 = 37.7, so the new Ht value for ID#7 = 38

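This is an easy and quick way to fill in missing values. However, every missing value of a variable receives the same number, so the filled-in data show less variation than the real data would, and the analysis has no way to account for the uncertainty about what the missing values really were.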
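For readers who want to check the arithmetic outside SAS, here is a minimal Python sketch of mean imputation on the toy table above. The function name mean_impute and the use of None for a missing value are choices made for this illustration only; the rounding to whole numbers simply follows the example.

```python
from statistics import mean, variance

# Toy data copied from the table above; None marks a missing value.
wt = [14, 15, 14, 12, None, 10, 15, 16]
ht = [39, 43, 37, 32, 40, 31, None, 42]


def mean_impute(values):
    """Replace each missing value with the rounded mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = round(mean(observed))
    return [fill if v is None else v for v in values]


wt_filled = mean_impute(wt)
ht_filled = mean_impute(ht)
print("Imputed Wt for ID 5:", wt_filled[4])
print("Imputed Ht for ID 7:", ht_filled[6])

# Compare the spread of Wt before and after filling in the mean.
wt_observed = [v for v in wt if v is not None]
print("Wt variance, observed only:", variance(wt_observed))
print("Wt variance, after imputing:", variance(wt_filled))
```

The second pair of printed numbers shows the drawback described above: the variance of the filled-in Wt column is smaller than the variance of the observed values alone.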

Multiple Imputation

Multiple imputation allows you to create a number of possible values for each missing data point. The general guideline is to create 3 to 5 imputed values. What this means is that rather than having 1 dataset to analyse, you will now have 3 to 5 datasets to analyse. There are also several different methodologies to use when imputing data. For more information on the different imputation methodologies, please review the SAS documentation.

Now that you have 3 to 5 datasets, but only want to come to one conclusion, what do you do next? You will conduct your statistical analysis on each dataset separately. Yes, that means you will have 3 to 5 sets of results (you’ll need to be very organized!). Then you will combine your results to draw one conclusion. In SAS, we will use Proc MI to create our imputed datasets, then we will run Proc MIANALYZE to combine our results. To summarize, remember these 3 steps:

Three steps to Multiple imputation inference:

  1. Create x number of imputed datasets
  2. Conduct your statistical analysis on the x number of imputed datasets
  3. Combine the results of the x number of imputed datasets
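The three steps can be sketched in a few lines of Python. This is a deliberately naive toy and not what Proc MI does: the missing value is filled with a random draw from a normal distribution fitted to the observed values, the "analysis" is just a mean, and the sample, the seed, and all names are made up for illustration.

```python
import random
from statistics import mean, stdev

# Hypothetical toy sample; None marks a missing value.
sample = [14, 15, 14, 12, None, 10, 15, 16]
observed = [v for v in sample if v is not None]

rng = random.Random(2013)  # fixed seed so the run is repeatable
m = 5                      # number of imputed datasets

# Step 1: create m completed datasets.
# Assumption: naive normal draws, used only to show the workflow.
completed = [
    [rng.gauss(mean(observed), stdev(observed)) if v is None else v
     for v in sample]
    for _ in range(m)
]

# Step 2: run the same analysis (here, just the mean) on each dataset.
estimates = [mean(dataset) for dataset in completed]

# Step 3: combine the m results into one pooled estimate.
pooled = mean(estimates)

print("Per-dataset estimates:", estimates)
print("Pooled estimate:", pooled)
```

Each completed dataset gets a different fill-in value, so the five estimates differ slightly; that disagreement is exactly the uncertainty a single imputation hides.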

Let’s take a look at an example to see how this all unfolds. For the purposes of this blog posting we will use the Getting Started example from the SAS documentation.

The data is the familiar Physical Fitness dataset. “These measurements were made on men involved in a physical fitness course at N.C. State University. Certain values have been set to missing and the resulting data set has an arbitrary missing pattern. Only selected variables of Oxygen (intake rate, ml per kg body weight per minute), RunTime (time to run 1.5 miles in minutes), RunPulse (heart rate while running) are used.”

data Fitness1;
    /* @@ holds the line so several observations can be read from each row */
    input Oxygen RunTime RunPulse @@;
    datalines;
44.609 11.37 178 45.313 10.07 185
54.297 8.65 156 59.571 . .
49.874 9.22 . 44.811 11.63 176
. 11.95 176 . 10.85 .
39.442 13.08 174 60.055 8.63 170
50.541 . . 37.388 14.03 186
44.754 11.12 176 47.273 . .
51.855 10.33 166 49.156 8.95 180
40.836 10.95 168 46.672 10.00 .
46.774 10.25 . 50.388 10.08 168
39.407 12.63 174 46.080 11.17 156
45.441 9.63 164 . 8.92 .
45.118 11.08 . 39.203 12.88 168
45.790 10.47 186 50.545 9.93 148
48.673 9.40 186 47.920 11.50 170
47.467 10.50 170
;
run;

To create the imputed dataset we will use the following code:

proc mi data=Fitness1 seed=501213 mu0=50 10 180 out=outmi;
      mcmc;
      var Oxygen RunTime RunPulse;
run;

proc mi  <- calls the multiple imputation procedure
seed=501213  <- specifies the “seed” for the random number generator. This can be any positive integer; fixing it makes the imputations reproducible.
mu0=50 10 180  <- the null-hypothesis means for the three variables, in the order they appear in the var statement. Proc MI uses them only for the t-tests in its Parameter Estimates table (is the mean 50 for Oxygen, 10 for RunTime, and 180 for RunPulse?). They do not steer the imputed values toward those means.
out=outmi  <- saves our imputed datasets in a new dataset called outmi
mcmc;  <- for this example we have an arbitrary missing pattern, where there is no structure to which values are missing, so we are choosing to use the Markov Chain Monte Carlo method
var  <- lists the variables that contain missing data and that we are imputing

The output can be found here

The Model Information table provides us with the nuts and bolts of the MCMC process used to create our imputations. Please take note that there were 5 imputations; therefore, we will see 5 different complete datasets in the output dataset.

The Missing Data Patterns table provides us with a visual of the different missing data patterns along with the basic descriptive stats for each pattern. This is a great way to reaffirm that this truly was an arbitrary missing pattern.
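To see what the Missing Data Patterns table is counting, the patterns can be tallied by hand. The Python sketch below uses the Fitness1 values exactly as listed in the data step; the X and . notation mimics the SAS table, while the helper name pattern and the use of None for missing are assumptions made for this illustration. It reproduces only the counts, not the group means SAS also reports.

```python
from collections import Counter

# Fitness1 values as listed earlier: (Oxygen, RunTime, RunPulse); None = missing.
rows = [
    (44.609, 11.37, 178), (45.313, 10.07, 185),
    (54.297, 8.65, 156), (59.571, None, None),
    (49.874, 9.22, None), (44.811, 11.63, 176),
    (None, 11.95, 176), (None, 10.85, None),
    (39.442, 13.08, 174), (60.055, 8.63, 170),
    (50.541, None, None), (37.388, 14.03, 186),
    (44.754, 11.12, 176), (47.273, None, None),
    (51.855, 10.33, 166), (49.156, 8.95, 180),
    (40.836, 10.95, 168), (46.672, 10.00, None),
    (46.774, 10.25, None), (50.388, 10.08, 168),
    (39.407, 12.63, 174), (46.080, 11.17, 156),
    (45.441, 9.63, 164), (None, 8.92, None),
    (45.118, 11.08, None), (39.203, 12.88, 168),
    (45.790, 10.47, 186), (50.545, 9.93, 148),
    (48.673, 9.40, 186), (47.920, 11.50, 170),
    (47.467, 10.50, 170),
]


def pattern(row):
    """Return a string such as 'XX.' (X = observed, . = missing)."""
    return "".join("." if value is None else "X" for value in row)


counts = Counter(pattern(row) for row in rows)
for pat, n in counts.most_common():
    print(pat, n)
```

Of the 31 men, 21 have all three measurements and 10 are missing at least one, spread over four different patterns with no variable that is always missing last, which is what "arbitrary" means here.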

The next 2 tables provide us with information of the Variance and Covariance estimates of the imputed datasets.

Lastly, we have a table of parameter estimates for the combined imputed datasets. Remember that mu0 set the null-hypothesis means for the three variables to 50, 10 and 180 respectively; the t-tests in this table compare each estimated mean against its mu0 value. After imputing the missing values we have means of 47 for Oxygen, 10.6 for RunTime, and 171.8 for RunPulse.

Let’s take a look at the structure of our dataset:

proc print data=outmi;
run;

You’ll notice we have 4 variables: _Imputation_, Oxygen, RunTime, and RunPulse. The _Imputation_ variable identifies which imputation each row belongs to. Scrolling through the data you will see that there are 5 imputations in total.

Now where do we go? We have completed Step 1 of the multiple imputation inference listed above. Now let’s move on to Step 2: Conduct your statistical analysis on the x number of imputed datasets.

proc reg data=outmi outest=outreg covout noprint;
     model Oxygen= RunTime RunPulse;
     by _Imputation_;
run;

We want to conduct a regression, predicting Oxygen intake rate from RunTime and RunPulse. We have seen this Proc Reg syntax before, but now we have a few new options:

outest=outreg  <- creates a new dataset called outreg that contains the parameter estimates for our model
covout  <- outputs the covariance matrix for the parameter estimates to the outreg dataset
noprint  <- suppresses the output. We’re saving the output in a dataset to process further, so we do not need to see all the output in the output window.
by _Imputation_  <- SAS will run the Proc Reg for each imputation or imputed dataset in our data.

When we run this code, there is no output. Remember we said noprint. I like to see what SAS has done, so I ran a Proc Print on the outreg dataset to see its contents.

proc print data=outreg;
run;

In this dataset we see the parameter estimates for the 5 imputed datasets.

Now onto Step 3:  Combine the results of the x number of imputed datasets.

We have 5 sets of regression results. We now need to combine them into one final set of results from which we will draw our conclusion. To accomplish this we will use Proc MIANALYZE.

proc mianalyze data=outreg;
    modeleffects Intercept RunTime RunPulse;
run;

modeleffects  <- lists the effects in the dataset that need to be analyzed.

The MIANALYZE output can be found here

The Model Information section provides us with the dataset name and the number of imputations.

The next output table displays the Variance Information. For each model effect (Intercept, RunTime, and RunPulse) it shows the between-imputation, within-imputation, and total variance of the estimate.

The last table displays the final parameter estimates for our model.
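The combining step follows what are usually called Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The Python sketch below shows that arithmetic for a single coefficient. The function name pool and the input numbers are made up for illustration; they are not taken from the SAS output, and the sketch leaves out the degrees of freedom and t-tests that Proc MIANALYZE also reports.

```python
from math import sqrt
from statistics import mean, variance


def pool(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, within, between, total, sqrt(total)


# Made-up values for one coefficient across three imputations (not SAS output).
q_bar, within, between, total, se = pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print("Pooled estimate:", q_bar)
print("Within, between, total variance:", within, between, total)
print("Pooled standard error:", se)
```

The between-imputation term is what a single imputation cannot supply: the more the imputed datasets disagree, the larger the pooled standard error becomes.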

Final Words

There are many different ways of dealing with missing data. Leaving your dataset as is, with the missing values, is one way and, as mentioned earlier, probably the one we tend to use the most. However, there are several different methodologies available today, in SAS and other programs, to impute the missing values. By imputing the values, or filling in the missing data, the dataset is complete, so the analysis can use every observation instead of dropping the incomplete ones, which gives a more powerful set of results.

For more information on imputing data in SAS, please check out the SAS documentation for Proc MI and for Proc MIANALYZE
