Proc GLIMMIX – working with Binomial outcome data

Most of us really enjoy the world of normal data, where we can stick to using Proc GLM and/or Proc MIXED and not worry about those pesky datasets with binomial-type data.  However, we cannot restrict our analysis based on the type of data we want to work with.  The following example comes directly from the SAS Support documentation for Proc GLIMMIX.  My goal in using the same example is to add a few more comments to an already great document and to lay it out in more layman's terms.

The sample data comes from Gotway and Stroup (1997) and represents an agronomic field trial.  I have to say that this example matches very closely many of the student projects I have seen come through my service over the years.  But 9 times out of 10, students will find a way around their data to avoid this analysis 😉

The researchers studied 16 varieties of wheat for their resistance to infestation with the Hessian fly.  The experiment was arranged in a randomized complete block design on an 8 x 8 grid, where each 4 x 4 quadrant constituted a block.

The outcome of interest was the number of damaged plants out of the total number of plants growing on the unit.  The lat and lng variables were used to identify the experimental unit on the 8 x 8 grid.

The goal of the trial was to evaluate the different varieties in terms of damaged or not damaged plants.

This post will provide you with the syntax and explanation of key points for each model.

data HessianFly;
    label Y = 'No. of damaged plants'
          n = 'No. of plants';
    input block variety lat lng n Y @@;
    datalines;
1 14 1 1   8 2 1 16 1 2 9 1
1   7 1 3 13 9 1 6 1 4 9 9
1 13 2 1  9 2 1 15 2 2 14 7
1  8 2 3 8 6 1 5 2 4 11 8
1 11 3 1 12 7 1 12 3 2 11 8
1   2 3 3 10 8 1 3 3 4 12 5
1 10 4 1 9 7 1 9 4 2 15 8
1   4 4 3 19 6 1 1 4 4 8 7
2 15 5 1 15 6 2 3 5 2 11 9
2 10 5 3 12 5 2 2 5 4 9 9
2 11 6 1 20 10 2 7 6 2 10 8
2 14 6 3 12 4 2 6 6 4 10 7
2   5 7 1 8 8 2 13 7 2 6 0
2 12 7 3 9 2 2 16 7 4 9 0
2   9 8 1 14 9 2 1 8 2 13 12
2   8 8 3 12 3 2 4 8 4 14 7
3   7 1 5 7 7 3 13 1 6 7 0
3   8 1 7 13 3 3 14 1 8 9 0
3   4 2 5 15 11 3 10 2 6 9 7
3   3 2 7 15 11 3 9 2 8 13 5
3   6 3 5 16 9 3 1 3 6 8 8
3 15 3 7 7 0 3 12 3 8 12 8
3 11 4 5 8 1 3 16 4 6 15 1
3   5 4 7 12 7 3 2 4 8 16 12
4   9 5 5 15 8 4 4 5 6 10 6
4 12 5 7 13 5 4 1 5 8 15 9
4 15 6 5 17 6 4 6 6 6 8 2
4 14 6 7 12 5 4 7 6 8 15 8
4 13 7 5 13 2 4 8 7 6 13 9
4   3 7 7 9 9 4 10 7 8 6 6
4   2 8 5 12 8 4 11 8 6 9 7
4   5 8 7 11 10 4 16 8 8 15 7
;
Run;
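
(Optional, and purely a suggestion on my part: because the data are read in with the trailing @@, it never hurts to run a quick check – a Proc MEANS, for example – to confirm that all 64 experimental units were read in before you start modeling.)

Proc means data=HessianFly n min max;
    /* quick check: we expect 64 observations and sensible ranges for n and Y */
    var n Y;
Run;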

The first thing you will notice is that the basic syntax for Proc GLIMMIX is very similar to both Proc GLM and Proc MIXED.

The first model we want to run is a very basic model to see whether there is an effect of block and variety on the outcome variable.  To start, we will include both variety and block as fixed effects, listing them in the Class statement and omitting a Random statement.

Let's think about our outcome variable.  In the initial description of the experiment it was stated that the outcome of interest was the number of damaged plants out of the total number of plants growing on the unit.  In our data we have a variable called Y = "number of damaged plants" and a variable called n = "number of plants in the unit".  We do not have a variable where the proportion of damaged to total plants was calculated.  In my experience, many researchers will calculate this proportion, add that variable to their dataset, test it for normality, transform it if required, and then run Proc GLM or Proc MIXED.  Let's avoid that extra step and use the data as we have it.  Our model statement will be:

model y/n = block variety;  <- dependent variable = No. damaged / Total No.
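
(Just for contrast, the extra step we are skipping would look something like the sketch below.  The proportion variable name is hypothetical and we will not use this dataset anywhere else – it is only here to show what we are avoiding.)

data HessianFlyProp;
    set HessianFly;
    /* the "extra step" we are avoiding: computing the proportion by hand */
    propDamaged = Y / n;
Run;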

Our complete syntax for the initial run, which will also include a solution option:

Proc glimmix data=HessianFly;
    class block variety;
    model y/n = block variety / solution;
Run;

While reviewing the output, you will notice a lot of similarities with the Proc MIXED output.  Model Description, Class Level Information, Number of Observations, Dimensions, Iteration History, Fit Statistics, etc…

Items I would like to highlight from the output:

  1. The Class Level Information – I know it seems silly to look at this because, by the time you reach this analysis, you are very intimate with your data.  However, it never hurts to quickly review this to ensure that your data was read in correctly.
  2. Dimensions – this may be a quick review or something new.  What exactly do these numbers represent?  They are the dimensions of the matrices used in your model.  Columns in X – 21:
      • 1 for the Intercept
      • 4 for the blocks
      • 16 for the varieties

    We have no random effects in this model – therefore we see 0 columns in Z.

  3. In the Fit Statistics section you will see a Pearson Chi-Square and a Pearson Chi-Square / DF.  What are these and why are they important to be aware of?
    • The ratio of the Pearson chi-square to its degrees of freedom is a measure of dispersion – in other words, a measure of how much variation there is in the data relative to what the model expects.  Ideally we want to see this ratio close to 1.0.  In this example the ratio is 2.37, which means we have overdispersion – more variation in the data than we would expect under a binomial distribution.  Another way of thinking about it: our data are not following a classic binomial distribution.

So this is something we need to address by modifying our model.  The goal is to produce a model with a dispersion value close to 1.
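
(An optional aside: if you want to dig into the overdispersion a little further, GLIMMIX can write the Pearson residuals out to a dataset with an OUTPUT statement.  The sketch below is just that – a sketch – and the dataset and variable names are placeholders of my own.)

Proc glimmix data=HessianFly;
    class block variety;
    model y/n = block variety / solution;
    /* write predicted values (logit scale by default) and Pearson residuals to a dataset */
    output out=FlyResid pred=linpred pearson=presid;
Run;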

Before we do that – there is another way of writing the syntax we used above for this model:

Proc glimmix data=HessianFly;
    class block variety;
    model y/n = block variety / dist = binomial link=logit solution;
Run;

Rather than letting SAS assume that the distribution of our data is binomial, we are stating it explicitly by adding the dist=binomial option to the model statement.  We have also stated that we want the link function – the function that links the response probabilities to the linear predictor – to be the log-odds function by adding the link=logit option.  These are the defaults that SAS uses here, but let's add the coding so that we know where we can make changes in future models.
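
(For example – and this is only a sketch of where such a change would be made, not a model we will use in this post – switching to a probit link would just mean changing the link= option:)

Proc glimmix data=HessianFly;
    class block variety;
    /* same model as above, but with a probit link instead of the logit link */
    model y/n = block variety / dist = binomial link=probit solution;
Run;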


Let’s see about changing our model to deal with the overdispersion problem we’ve encountered above.  We will now consider our block as a random effect.  Similar to Proc MIXED, we will add a random statement for block and remove the block effect from the model statement:

Proc glimmix data=HessianFly;
    class block variety;
    model y/n = variety / dist = binomial link=logit solution;
    random block;
Run;

Items to highlight from the output:

  1. Dimensions – the dimensions of the matrices used in your model have now changed.  Columns in X – 17:
      • 1 for the Intercept
      • 16 for the varieties

    Columns in Z – 4:

      • 4 for the blocks
  2. Note that because we did not specify a SUBJECT= option in the RANDOM statement, GLIMMIX treats all the data as if they were created from a single subject with 64 observations.  The SUBJECT= option identifies the experimental unit.
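
(As an aside – and just as a sketch, not a model we will run below – the same block effect can also be written with an explicit SUBJECT= option; to the best of my understanding the two forms specify the same G-side structure:)

Proc glimmix data=HessianFly;
    class block variety;
    model y/n = variety / dist = binomial link=logit solution;
    /* equivalent to "random block;", with the SUBJECT= option spelled out */
    random intercept / subject=block;
Run;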

Now let's remove the random block effect altogether:

Proc glimmix data=HessianFly;
    class block variety;
    model y/n =  variety / dist = binomial link=logit solution;
Run;

Items to highlight in the output:

  1. We still have a Pearson chi-square / DF value that is greater than 1 – so we still have overdispersion.

Let’s try to deal with this.  One option is to add a scale parameter to our model – the _residual_ random effect.

Proc glimmix data=HessianFly;
    class block variety;
    model y/n =  variety / dist = binomial link=logit solution;
    random _residual_;
Run;

Items to highlight from the output:

  1. The model solutions are the same!
  2. But…  the standard errors are now larger.  They are actually 1.507 times larger (the square root of 2.27, which is our overdispersion factor).
  3. As a result, our effects are not as significant as they appeared in the previous model.

Also, remember how the dataset contains two variables, lat and lng, that identify the individual experimental units and their location within the blocks?  Let's see how to incorporate these into our current Proc GLIMMIX syntax – can they help us model the overdispersion?

Proc glimmix data=HessianFly;
    class variety;
    model y/n = variety / dist = binomial link=logit solution ddfm=contain;
    random _residual_ / subject=intercept type=sp(exp) (lat lng);
Run;

With this RANDOM statement we are now modeling the residual (R-side) covariance and taking the spatial location of the experimental units into account.

subject=intercept – treats the whole dataset as coming from a single subject, so the correlation can be modeled across all of the experimental units
type= – specifies the covariance structure we want to use
sp(exp) – the exponential spatial covariance structure – the most appropriate choice given the coordinate information (lat and lng) we have for this data

Items to highlight from the output:

  1. Model solutions have changed
  2. Overdispersion has increased!
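
(If you wanted to experiment further – purely as a sketch on my part, not something covered in this post – the type= option is where you would swap in a different spatial covariance structure, for example a Gaussian one:)

Proc glimmix data=HessianFly;
    class variety;
    model y/n = variety / dist = binomial link=logit solution ddfm=contain;
    /* same model as above, but with a Gaussian spatial covariance structure */
    random _residual_ / subject=intercept type=sp(gau) (lat lng);
Run;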

When working with any modeling exercise – remember your experimental design!  This will guide you to the most appropriate model.
