Principal Components Analysis

Introduction

Principal Components Analysis (PCA) attempts to analyse the structure in a data set in order to define uncorrelated components that capture the variation in the data. The identification of components is often desirable as it is usually easier to consider a relatively small number of unrelated components which have been derived from the data than a larger group of related variables. PCA is particularly useful in management research, as it is often used as a first step in assigning meanings to the structure in the data (by attaching descriptions to the components) through the technique of factor analysis. PCA can also help in alleviating some of the problems with variable selection in regression models that are associated with multicollinearity, which is caused by correlations between the explanatory variables.

Key Features

• PCA attempts to represent a data frame containing correlated variables in terms of uncorrelated components.
• The principal components identified account for successively smaller amounts of the variability in the data frame.
• By selecting those components that account for relatively large amounts of variability, PCA can be used to reduce a large number of correlated variables to a smaller number of uncorrelated components.
• PCA can help to identify the underlying structure in the data and provide clues about causal connections.

Put very simply, principal components analysis converts correlated variables into uncorrelated components. It accomplishes this by identifying directions in the data (called components) along which the variation is at a maximum, and uses linear combinations of the observed variables to describe each component. Below is the general form of the formula used to compute scores on the first component extracted in a principal components analysis:

Principal Component 1 = β11 Variable 1 + β12 Variable 2 + ... + β1k Variable k

where β1k is the coefficient of Variable k for component 1.
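The coefficients β11 ... β1k in the formula above can be obtained from the eigenvector of the covariance matrix with the largest eigenvalue. The following is a minimal sketch of that calculation using NumPy; the small data set is hypothetical and for illustration only.

```python
import numpy as np

# Hypothetical data set: 6 subjects measured on 3 correlated variables.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.5],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.4],
])

# Centre each variable, then eigen-decompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

# The coefficients beta_11 ... beta_1k are the entries of the eigenvector
# associated with the largest eigenvalue.
beta_1 = eigvecs[:, -1]

# Scores on principal component 1: the linear combination
# beta_11 * Variable 1 + ... + beta_1k * Variable k for each subject.
pc1_scores = Xc @ beta_1
```

The variance of `pc1_scores` equals the largest eigenvalue, which is what makes this direction the one of maximum variation.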

Principal component 1 accounts for the largest amount of variation in the data that can be accounted for by a single linear model. Additional principal components can be derived by applying further linear models that identify sources of variance uncorrelated with the first principal component. Principal component 2 can therefore be computed using the linear model:

Principal Component 2 = β21 Variable 1 + β22 Variable 2 + ... + β2k Variable k

where β2k is the coefficient of Variable k for component 2.
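The claim that component 2 is uncorrelated with component 1 can be checked directly: the coefficients β21 ... β2k come from the eigenvector with the second-largest eigenvalue, and scores on the two components have zero covariance by construction. A sketch, again with hypothetical data:

```python
import numpy as np

# Hypothetical data set: 6 subjects, 3 correlated variables.
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.5],
    [1.9, 2.2, 1.1],
    [3.1, 3.0, 1.6],
    [2.3, 2.7, 1.4],
])
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Coefficients for components 1 and 2: eigenvectors with the largest
# and second-largest eigenvalues respectively.
beta_1 = eigvecs[:, -1]
beta_2 = eigvecs[:, -2]

pc1 = Xc @ beta_1
pc2 = Xc @ beta_2

# Covariance between the two sets of component scores is zero
# (up to floating-point error) because the eigenvectors are orthogonal.
cross_cov = np.cov(pc1, pc2)[0, 1]
```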

Principal component 2 accounts for the largest amount of variation in the data that can be accounted for by a single linear model after principal component 1 has been accounted for. This process continues, with additional components being computed that account for successively smaller amounts of variance in the data, until all variance in the data has been accounted for. This happens when the number of components equals the number of variables.

A PCA analysis converts the scores from each subject on each variable into loadings for each subject on each component (see Table 1). The component loadings indicate the strength of the relationship between the subject and the component and sum to 1.0 for each subject when k components are included (100% of the variance in the responses from each subject is accounted for). It should be noted that the components and the variables are not directly related to each other – Component 1 does not have a direct relationship to Variable 1.

Each component accounts for a different amount of variance in the total data set. For example, in a three-component solution, principal component 1 may account for 58% of the variance in the data, principal component 2 may account for 24% and principal component 3 may account for 18%. This is demonstrated in the example below.

Table 1: Recorded scores and principal components

Recorded Scores...
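The proportions of variance in the preceding paragraph come from the eigenvalues of the covariance matrix: each component's share is its eigenvalue divided by the sum of all eigenvalues, and with as many components as variables the shares total 100%. A minimal sketch, using randomly generated data with an assumed correlation structure (the exact percentages will differ from the 58%/24%/18% illustration above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 subjects on 3 correlated variables, drawn from a
# multivariate normal with an assumed covariance structure.
X = rng.multivariate_normal(
    mean=[0.0, 0.0, 0.0],
    cov=[[1.0, 0.8, 0.5],
         [0.8, 1.0, 0.6],
         [0.5, 0.6, 1.0]],
    size=50,
)
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted largest first.
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Proportion of total variance accounted for by each component.
proportions = eigvals / eigvals.sum()
```

Because the number of components equals the number of variables, `proportions` sums to 1.0, and successive entries shrink, matching the "successively smaller amounts of variance" behaviour described above.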