# Principal Components Analysis

Principal Components Analysis, or PCA, is one of several statistical methods for data reduction.  The idea behind PCA is to take several variables and reduce them to two or three “components” or “factors” that describe your data.  For this example, we will use the results from the 2008 Olympic Decathlon.  The Decathlon data is available here

Import the data into SAS using the File-> Import option.  Be sure to import the Points\$ worksheet.  Run a Proc Print on the data to view the contents.  You should have 13 variables and 26 observations.  The variables include:

• Year
• Decathlete
• Run100_pts – Points awarded for the 100m sprint
• LJ_pts – Points awarded for Long Jump
• SP_pts – Points awarded for Shot Put
• HJ_pts – Points awarded for High Jump
• Run400_pts – Points awarded for the 400m run
• H_pts – Points awarded for Hurdles
• DT_pts – Points awarded for Discus Throw
• PV_pts – Points awarded for Pole Vault
• JT_pts – Points awarded for Javelin Throw
• Run1500_pts – Points awarded for the 1500m run
• Overall – Overall points awarded
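
The import and first look at the data can also be done in code rather than through the menus. The following is a minimal sketch, not the original workflow: it assumes the workbook is an .xlsx file saved at a hypothetical path, that the worksheet is named Points (the import wizard lists it as Points$), and that your SAS installation can read Excel files with dbms=xlsx.

```sas
/* Hypothetical path and file name: replace with the location of your download */
proc import datafile="C:\path\to\decathlon.xlsx"
    out=deca            /* dataset name used by the Proc princomp code */
    dbms=xlsx
    replace;
    sheet="Points";     /* assumed sheet name; the wizard shows it as Points$ */
    getnames=yes;       /* first row holds the variable names */
run;

/* Confirm the number of variables and observations */
proc contents data=deca;
run;

/* List the rows to view the contents */
proc print data=deca;
run;
```

If the import worked, Proc contents should report the 13 variables and 26 observations described above.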

The goal of the PCA will be to take the points from the 10 events for these decathletes and see if we can reduce them to 2 or 3 components or factors.

In SAS you will use the following code:

Proc princomp data=deca plots(ncomp=3)=all n=5;
var Run100_pts LJ_pts SP_pts HJ_pts Run400_pts H_pts DT_pts PV_pts JT_pts Run1500_pts;
Run;

Remember, if you want to save the output as a PDF file, wrap the procedure in ods statements:

ods pdf file="C:\Users\edwardsm\Documents\Michelle_Docs\SASsy_Fridays\PCA\output.pdf";

Proc princomp data=deca plots(ncomp=3)=all n=5;
var Run100_pts LJ_pts SP_pts HJ_pts Run400_pts H_pts DT_pts PV_pts JT_pts Run1500_pts;
Run;
ods pdf close;

The output file can be obtained here

Screen captures of the output are not provided here; please download the file and review it.  Key areas to highlight are:

## Correlation Matrix

Take a few moments to review and highlight any strong associations between the variables we are looking at. For example, 100m Run and Long jump have a correlation coefficient of 0.53. If you think about it – this makes sense on a practical level. I need the speed I can gather in a short distance to help me jump farther.
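
Proc princomp prints the correlation matrix as part of its output. If you want to examine the correlations on their own, Proc corr is one option; by default it also prints a p-value under each coefficient. This is a sketch that assumes the deca dataset and the variable names used in the Proc princomp code.

```sas
/* Pairwise Pearson correlations for the ten event scores */
proc corr data=deca;
    var Run100_pts LJ_pts SP_pts HJ_pts Run400_pts H_pts DT_pts PV_pts
        JT_pts Run1500_pts;
run;
```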

## Eigenvalues of the Correlation Matrix

The column labelled “Proportion” shows the share of the total variation in the data that is explained by each component.  The first component explains 25% in this example, and the second explains 17%, for a cumulative total of 42% for the two components.
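
The arithmetic behind these columns can be written out. Proc princomp analyses the correlation matrix by default, so each standardized variable contributes a variance of 1 and the eigenvalues sum to the number of variables $p$ (10 here). The symbol $\lambda_k$ for the $k$-th eigenvalue is notation introduced for this illustration:

$$
\text{Proportion}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j} = \frac{\lambda_k}{p},
\qquad
\text{Cumulative}_k = \sum_{i=1}^{k} \text{Proportion}_i
$$

With $p = 10$, a proportion of about 25% corresponds to a first eigenvalue of roughly 2.5, and the cumulative 42% is the sum of 25% and 17%.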

## Eigenvectors

This table gives the weight of each variable in each component.  These weights are what you will use to try to define what each component represents.

You are then presented with a number of plots.  The Scree Plot helps you determine how many components you should use for this dataset.  The idea is to look for the “elbow” in the plot, the point after which additional components explain little extra variation.  In this example the elbow is difficult to see clearly, but we will only use the first 2 components.
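
Once you have settled on 2 components, you may want a score for each athlete on each of them. The sketch below is one way to do this, assuming the deca dataset from earlier; the output dataset name deca_scores is a hypothetical name chosen for illustration. The out= option writes the original variables plus the new score variables Prin1 and Prin2.

```sas
/* Keep the first two components and save the scores */
proc princomp data=deca n=2 out=deca_scores;   /* deca_scores: hypothetical name */
    var Run100_pts LJ_pts SP_pts HJ_pts Run400_pts H_pts DT_pts PV_pts
        JT_pts Run1500_pts;
run;

/* Each athlete now has a Prin1 and a Prin2 score */
proc print data=deca_scores;
    var Decathlete Prin1 Prin2;
run;

/* Plot the athletes in the space of the two components */
proc sgplot data=deca_scores;
    scatter x=Prin1 y=Prin2 / datalabel=Decathlete;
run;
```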

The Component Pattern Profile plot can also be helpful in defining your components, as can the Component Pattern plots.

Let’s take a closer look at the Eigenvectors table and concentrate on Prin1 and Prin2.

If you scan the values of the eigenvectors for Prin1, you will notice that they all have similar weightings.  We could interpret this component as a measure of overall athleticism.
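
One informal way to check the "overall athleticism" reading is to see how strongly the Prin1 scores track the Overall points. This is a sketch under assumptions, not part of the original analysis: deca_scores is a hypothetical dataset name, and noprint is used only to suppress the Proc princomp tables you have already reviewed. The sign of an eigenvector is arbitrary, so judge the size of the correlation rather than its direction.

```sas
/* Save the first two component scores without reprinting the PCA tables */
proc princomp data=deca n=2 out=deca_scores noprint;   /* hypothetical name */
    var Run100_pts LJ_pts SP_pts HJ_pts Run400_pts H_pts DT_pts PV_pts
        JT_pts Run1500_pts;
run;

/* Correlate the overall points total with each component score */
proc corr data=deca_scores;
    var Prin1 Prin2;
    with Overall;
run;
```

A large correlation (in absolute value) between Prin1 and Overall would support this interpretation; a small one would suggest looking for a different label.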

If you scan the values of the eigenvectors of Prin2 – you will notice the following:

Negative values for:

• 100m Run
• Long jump
• 400m Run
• Hurdles
• 1500m Run

Positive values for:

• Shot put
• High jump
• Discus throw
• Pole vault
• Javelin Throw

This is a clear division between events that need speed and events that need strength, so you could interpret this component as a measure of strength relative to speed.

Interpretation of PCA components is very subjective.  You may look at these same results and define the 2 components highlighted here in a different way – and that’s OK!

Remember, PCA is one method of reducing your data.