The dimension and complexity of raw gene expression data, whether obtained by oligonucleotide chips, spotted arrays, or another technology, create challenging data analysis and data management problems. Existing software systems and analysis methods in the hands of end users can meet these challenges only in a limited way, and we are convinced that a much more active scientific endeavor is called for. We anticipate that, broadly defined, bioinformatics will encompass statistical and biometrical questions of experimental design, data analysis, graphics and modeling, as well as computational questions concerning efficient algorithms for learning tasks such as classification and clustering.
Microarray data can be analysed using several approaches (Claverie, 1999). Clustering methods (i.e. unsupervised learning) are used widely and have the ability to uncover coordinated expression patterns from a collection of microarrays (e.g., Eisen et al. 1998; Getz et al. 2000; Tibshirani et al. 2000; Dudoit, Fridlyand et al. 2000; Kerr and Churchill 2000a). The use of standard clustering methods is most appropriate when the microarrays arise from some common source cell type, for example from a common tissue type from animals in some controlled cross. Refinements may be necessary when other sources of variation affect the microarrays (van der Laan and Bryan 2000). Classification methods (i.e. supervised learning) have proven very useful to identify patterns of gene expression that can be correlated with qualitative disease phenotypes (e.g. Golub et al. 1999) and for classifying genes according to their functional role (Brown et al. 2000). Related methods of multivariate statistical analysis, such as those using the singular value decomposition (Alter et al. 2000; West et al. 2000) or multidimensional scaling can be effective at reducing the dimension of the objects under study.
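As a rough illustration of the clustering idea, the sketch below groups genes by average-linkage agglomerative clustering with distance 1 minus the Pearson correlation of their expression profiles. It is a minimal sketch under stated assumptions, not the procedure of any cited paper: the gene names, the expression values, and the function names `pearson` and `average_linkage` are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes nonzero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_linkage(profiles, n_clusters):
    """Merge genes bottom-up until n_clusters remain, using 1 - correlation as distance."""
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(dist[a, b] for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical log-expression profiles: one row per gene, one value per array.
profiles = {
    "gene_a": [1.0, 2.0, 3.1, 4.0, 5.2],
    "gene_b": [0.5, 1.1, 1.4, 2.1, 2.4],
    "gene_c": [5.0, 4.1, 2.9, 2.0, 1.1],
    "gene_d": [3.0, 2.6, 2.1, 1.4, 1.0],
}
clusters = average_linkage(profiles, 2)
```

Correlation-based distance groups genes by the shape of their profile across arrays rather than by absolute level, which is why `gene_a` and `gene_b` fall together despite differing in magnitude.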
Statistical methods are emerging to account for multiple sources of variation when trying to pool information from many microarrays and to identify genes exhibiting significant differential expression between cell types. One approach is to decompose the appropriately transformed expression measurement as a linear combination of effects from different sources of variation (Kerr et al. 2000). This is basically ANOVA for microarrays. In the context of a two-group comparison with replication, Dudoit, Yang et al. (2000) have proposed the use of permutation testing and p-value adjustment to account for the multiple-testing problem. Lin et al. (2001) describe a nonparametric method suited to uncovering differential expression for low-abundance transcripts. Alternatively, the mixture-model approach can be used to directly assess the probability that a given gene is truly expressed (Lee et al. 2000) or the probability that a gene is truly differentially expressed between two conditions (Newton et al. 2001; Efron et al. 2001). The functional patterns of expression identified by such statistical calculations will be backed up by laboratory examination to verify findings (cf. Nadler et al. 2000).
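The combination of permutation testing and p-value adjustment in a replicated two-group comparison can be sketched as follows. This is an illustration only and is not the specific procedure of Dudoit, Yang et al. (2000): it assumes a difference in group means as the test statistic and substitutes the simpler Holm step-down adjustment. The gene names, the data, and the function names `permutation_pvalue` and `holm_adjust` are hypothetical.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Randomly reassign the group labels and recompute the statistic.
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    # The add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, controlling the family-wise error rate."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity: adjusted values never decrease down the ordering.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical log-expression values: five replicate arrays per condition.
genes = {
    "gene_1": ([8.1, 8.3, 7.9, 8.2, 8.0], [9.6, 9.8, 9.5, 9.9, 9.7]),
    "gene_2": ([5.0, 5.2, 4.9, 5.1, 5.0], [5.1, 4.8, 5.0, 5.3, 5.1]),
    "gene_3": ([6.4, 6.6, 6.5, 6.3, 6.7], [6.5, 6.4, 6.8, 6.6, 6.2]),
}
raw = {name: permutation_pvalue(a, b) for name, (a, b) in genes.items()}
adjusted = dict(zip(raw, holm_adjust(list(raw.values()))))
```

With only a handful of replicates per group the number of distinct relabelings is small, so the smallest attainable p-value is bounded below. This is one reason why replication matters when thousands of genes are tested at once.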
Although analysis methods have been a central concern in most bioinformatics research to date, the issue of experimental design is critical. The use of replication in controlled experiments, for example, can significantly improve power to uncover differentially expressed genes (Kerr and Churchill 2000b; Lee et al. 2000). Our internal review of requests for microarray support will include careful examination of experimental design considerations.
Microarray analysis typically uses background-adjusted expression intensities (PM-MM for Affymetrix chips). However, the adjusted values can be negative, which creates problems because the log-transform is often applied to them. This has prompted ad hoc procedures (cf. Roberts et al. 2000). However, arbitrary handling of low-expression genes is unsatisfactory, since these may be the most interesting, e.g. transcription factors and receptors. Instead, Lin et al. (2001) advocated an approximate normal scores transformation of background-adjusted expression, which allows the use of all data (see also Efron et al. 2001). These normal scores appear to have better properties for clustering, and are well behaved for inference on differential expression.
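To make the idea concrete, the sketch below applies a generic rank-based normal scores transformation to background-adjusted values, including negative ones for which the logarithm is undefined. It is a minimal sketch and may differ in detail from the approximation used by Lin et al. (2001): the plotting position (rank - 0.5)/n, the average-rank treatment of ties, the example values, and the function name `normal_scores` are all assumptions.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map each value to the standard normal quantile of (rank - 0.5) / n.

    Tied values share their average rank, so they receive the same score.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current run of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


# Hypothetical PM-MM values for one chip; two of them are negative.
adjusted = [-12.0, 3.5, 150.0, -0.5, 42.0]
scores = normal_scores(adjusted)
```

Because the transformation depends only on ranks, every spot contributes a finite score regardless of sign, and the ordering of the adjusted intensities is preserved.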
Uncovering patterns of gene expression through data analysis is only the beginning. In many cases, greater biological understanding can be attained by using expression data in conjunction with sequence data (Craven et al. 2000), pathway data (Zien et al. 2000), and biomedical text sources (Shatkay et al. 2000). This may also involve constructing predictive models from diverse data sources (Craven et al. 2000) and developing automated methods for exploiting text and Web data (Craven and Kumlien, 1999; Shavlik et al. 1999).
Brian Yandell (yandell@stat.wisc.edu)