American Statistical Association
Exploratory analysis of gene expression and other high-dimensional data often begins with row and column clustering, which yields a partition of the data matrix into disjoint sample-variable blocks (submatrices). Of particular interest in practice are submatrices whose entries are large on average. In conjunction with clinical and functional annotation, large average submatrices are often the starting point for subsequent biological analyses, such as the identification of genetic pathways and new disease subtypes.
We describe a simple algorithm, belonging to the general category of biclustering methods, for identifying large average submatrices in high-dimensional data. Like other biclustering methods, the algorithm improves on independent clustering of samples and variables in several respects. First, the submatrices it identifies can overlap and need not cover the entire data matrix, features that better reflect the underlying biology. Second, the inclusion of samples and variables in a submatrix depends only on their expression values inside that submatrix. The algorithm seeks to maximize a simple measure of statistical significance and, through this measure, has close connections with the minimum description length principle. We will discuss the application of the algorithm to a recent breast cancer study and compare its performance with that of several other biclustering methods. If time permits, we will present some related theoretical results.
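As a rough illustration of the kind of search involved, the sketch below looks for a k-by-l submatrix with a large average by alternately re-selecting rows and columns. It is not the speaker's algorithm, whose details are not given in this abstract: it maximizes the plain average for a fixed submatrix size rather than a significance measure, and the names `alternating_search`, `k`, `l`, `n_iter`, and `seed` are hypothetical.

```python
import numpy as np


def alternating_search(X, k, l, n_iter=100, seed=0):
    """Greedy alternating search for a k-by-l submatrix of X with a large average.

    Illustrative sketch only; the result is generally a local optimum that
    depends on the random starting columns.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    # Start from a random set of l columns.
    cols = np.sort(rng.choice(n_cols, size=l, replace=False))
    rows = None
    for _ in range(n_iter):
        # Keep the k rows with the largest sums over the current columns.
        new_rows = np.sort(np.argsort(X[:, cols].sum(axis=1))[-k:])
        # Keep the l columns with the largest sums over those rows.
        new_cols = np.sort(np.argsort(X[new_rows, :].sum(axis=0))[-l:])
        # Stop once neither index set changes.
        if (rows is not None and np.array_equal(new_rows, rows)
                and np.array_equal(new_cols, cols)):
            break
        rows, cols = new_rows, new_cols
    return rows, cols, X[np.ix_(rows, cols)].mean()


if __name__ == "__main__":
    # Toy data: Gaussian noise with an elevated 5-by-4 block (made-up values).
    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 40))
    X[:5, :4] += 3.0
    rows, cols, avg = alternating_search(X, k=5, l=4)
    print(rows, cols, round(float(avg), 2))
```

Neither update can decrease the submatrix sum, so the search typically settles quickly, though possibly on a block other than the best one. Running it from several seeds and keeping the best result is a common remedy.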
The talk should be accessible to statisticians, computer scientists, and computational biologists.
Date: Thursday, October 25, 2007
Time: 4:00 - 5:00 P.M.
Mailman School of Public Health
Department of Biostatistics
722 West 168th Street
Judith Jansen Conference Room
4th Floor - Room 425
New York, New York