# Probability and Statistics Research Project

Topics: Factor analysis, Singular value decomposition, Principal component analysis Pages: 12 (3283 words) Published: April 18, 2007

Name: Lakeisha M. Henderson
ID: @02181956

Spring 2007

Abstract

Principal Component Analysis (PCA)
Definition
Uses of PCA
Illustrative Example of PCA
Method to Determine PCA

Basic Analysis of Variance (ANOVA)
Purpose and Definition of ANOVA
Illustrative Example of ANOVA

Risk Based Design Concepts
Definition
Predictions and Relation to Risk Based Designs

Principal Component Analysis (PCA)

Definition:

Principal Component Analysis is a method that reduces the dimensionality of data by analyzing the covariance between variables. As such, it is suited to data sets with many dimensions. It is a way of identifying patterns in data and of expressing the data so as to highlight their similarities and differences. Because patterns can be hard to find in high-dimensional data, where graphical representation is not available, PCA is a powerful tool for analysis. Its other main advantage is that once these patterns have been found, the data can be compressed by reducing the number of dimensions without much loss of information.

Technically speaking, PCA is an orthogonal linear transformation that maps the data to a new coordinate system such that the greatest variance of any projection of the data lies along the first coordinate (called the first principal component), the second greatest variance along the second coordinate, and so on. PCA can therefore be used for dimensionality reduction while retaining the characteristics of the data set that contribute most to its variance, by keeping the lower-order principal components and ignoring the higher-order ones. The low-order components often contain the "most important" aspects of the data, although this is not necessarily the case and depends on the application.
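The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the method used later in this project: it assumes the textbook approach of taking the eigenvectors of the sample covariance matrix, and the function name `pca_eig`, the `mixing` matrix, and the random data are all made up for the example.

```python
import numpy as np

def pca_eig(data, n_components):
    """PCA via eigendecomposition of the sample covariance matrix.

    data has shape (n_samples, n_features).
    Returns (components, variances, scores).
    """
    centered = data - data.mean(axis=0)        # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)       # (n_features, n_features) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components].T   # each row is one principal direction
    scores = centered @ components.T           # the data in the new coordinate system
    return components, eigvals[:n_components], scores

rng = np.random.default_rng(0)
# Synthetic, illustrative data: 200 samples of 3 correlated features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
data = rng.normal(size=(200, 3)) @ mixing
components, variances, scores = pca_eig(data, n_components=2)
```

Here `variances` holds the variance of the data along each retained component, in decreasing order, and `scores` is the reduced, two-dimensional version of the three-dimensional input.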

Figure 1. The blue lines represent two consecutive principal components. Note that they are at right angles to each other.

PCA is also called the (discrete) Karhunen-Loève transform (or KLT, named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). Unlike other linear transforms, PCA does not have a fixed set of basis vectors. Its basis vectors depend on the data set.
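A small sketch can make the point about data-dependent basis vectors concrete. It assumes two synthetic two-dimensional data sets, one stretched along the x-axis and one along the diagonal, and uses a hypothetical helper `first_component` that reads the first principal direction off the singular value decomposition of the centered data.

```python
import numpy as np

def first_component(data):
    """Return the first principal direction of data (n_samples, n_features)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

rng = np.random.default_rng(1)
t = rng.normal(size=500)
noise = 0.05 * rng.normal(size=(500, 2))

# Two made-up data sets with different orientations.
along_x = np.column_stack([t, np.zeros(500)]) + noise
along_diag = np.column_stack([t, t]) + noise

pc_x = first_component(along_x)        # close to the x-axis direction, up to sign
pc_diag = first_component(along_diag)  # close to the diagonal direction, up to sign
```

The same procedure returns a different first basis vector for each data set, whereas a transform with a fixed basis would use the same vectors for both. The sign of each vector is arbitrary.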

Uses of PCA:
Some of the direct uses of PCA are identifying groups of inter-related variables and reducing the number of variables. A major indirect use is as a method of transforming data: the data are rewritten in a form with properties the original data did not have, for example as new variables that are uncorrelated with one another.

Illustrative Example of PCA:

Given-
Suppose you measure 10,000 genes in 8 different patients. These values form a matrix of 8 x 10,000 measurements. Now imagine that each of the 10,000 genes is plotted as a point on a multi-dimensional scatter plot with 8 axes, one for each patient. The result is a cloud of values in multi-dimensional space.
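The data layout described above might look as follows in NumPy. The numbers are random placeholders standing in for real expression measurements, and the variable names are chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n_patients, n_genes = 8, 10_000

# Placeholder values only: random numbers in place of real measurements.
expression = rng.normal(size=(n_patients, n_genes))   # the 8 x 10,000 matrix

# Each gene becomes one point with 8 coordinates, one per patient.
points = expression.T                                  # shape (10000, 8)

# Center the cloud and summarize its shape with an 8 x 8 covariance matrix.
centered = points - points.mean(axis=0)
cov = np.cov(centered, rowvar=False)
```

Note that the covariance matrix is only 8 x 8, however many genes are measured, because the cloud lives in a space with one axis per patient.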

Solution-
To characterize the trends in these data, PCA extracts the directions along which the cloud is most extended. For instance, if the cloud is shaped like a football, the main direction of the data is the axis along the length of the football. This is called the first component, or the principal component. PCA then looks for the next direction, orthogonal to the first one; keeping just these two directions reduces the multi-dimensional cloud to a two-dimensional space. The second component is the axis along the width of the football (Fig. 2).
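The football picture can be imitated with a synthetic cloud. The sketch below assumes a three-dimensional Gaussian cloud whose axis lengths (5, 2 and 0.5) are arbitrary illustrative choices, and it finds the components from the singular value decomposition of the centered data.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic "football": long along one axis, narrower along the second,
# thin along the third. The scales 5, 2 and 0.5 are assumptions for illustration.
cloud = rng.normal(size=(2000, 3)) * np.array([5.0, 2.0, 0.5])

centered = cloud - cloud.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)     # share of the total variance per component
first, second = vt[0], vt[1]        # length and width directions, up to sign
projected = centered @ vt[:2].T     # the cloud reduced to two dimensions
```

In this setup `first` lies almost along the long axis and `second` almost along the width, and `explained[:2].sum()` gives the fraction of the total variance that survives the reduction to two dimensions.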

Figure 2. Football-shaped data set with two main components

In this particular example, these two components explain most of the cloud's trends. In a more complex data set, more components might add information about interesting trends in the...
