S+ArrayAnalyzer web seminar (June 22, 2004).
Michael O'Connell and Richard Park gave a nice web seminar on
S+ArrayAnalyzer, a software program for analysis of microarray data that is
marketed by Insightful Corporation. This company makes a lot of very nice
software:
-
S+, an object-oriented language for statistical analysis,
-
S+SeqTrial, a system for designing and analyzing group sequential trials,
-
Insightful Miner, data mining software,
-
Infact, text mining software.
The S+ArrayAnalyzer software is built on the open source Bioconductor project. It remains
faithful to the Bioconductor implementation of expression sets, and code written for
Bioconductor will work in S+ArrayAnalyzer. S+ArrayAnalyzer adds additional slots, consistent
accessor methods, and a graphical user interface. It also offers Affymetrix API support, and
an SPXML library for graphics.
You can run S+ArrayAnalyzer algorithms within the Spotfire DecisionSite application.
Details are available at
Spotfire
S-PLUS Server Solution [pdf].
The speakers described two experiments. The first experiment looked at granulocyte
differentiation in a series of mice, with measurements at day 0, 1, 2, ..., 6 with four mice
evaluated at each day. The goal was to identify genes that are differentially expressed while
minimizing the number of false positives.
The second experiment compared young and old animals at 0, 0.5, 1, 2, and 4 hours
after surgically induced injury. There were 3 animals of each age at each time point. The
goal was to see the effect of age on recovery.
S+ArrayAnalyzer can read the CEL and CHP formats as well as AADM links used by Affymetrix
chips. It can also read a variety of formats for the two color spotted arrays.
Initial exploratory methods include MvA plots (Bland-Altman plots), box plots, image plots
of spatial expression, and RNA degradation plots. I had not heard about the RNA degradation
plot before. This plot aligns all the Affymetrix probes from the 5' end of the gene to the 3'
end. Since RNA degradation starts at the 5' end, any degradation would appear as a trend in
the plot with lower expression values on the 5' end. A brief description of this plot appears
on page 17 of the pdf handout, Introduction to Affymetrix GeneChip Data Analysis, by
Han-Ming Wu. The AffyRNAdeg function in Bioconductor's affy package computes the values
for this graph, and plotAffyRNAdeg draws it.
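To make the MvA plot concrete, here is a minimal Python sketch of the two quantities it displays. For each probe measured on two arrays, M is the difference of the log2 intensities and A is their average. The function name `mva_values` and the toy intensities are my own illustration and are not taken from S+ArrayAnalyzer or Bioconductor.

```python
import math


def mva_values(x, y):
    """Return (M, A) lists for two equal-length lists of positive intensities."""
    if len(x) != len(y):
        raise ValueError("both arrays need the same number of probes")
    m_values = []
    a_values = []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi), math.log2(yi)
        m_values.append(lx - ly)        # M: log ratio (vertical axis)
        a_values.append((lx + ly) / 2)  # A: average log intensity (horizontal axis)
    return m_values, a_values


# Hypothetical intensities for four probes on two chips.
chip1 = [64.0, 256.0, 1024.0, 32.0]
chip2 = [64.0, 128.0, 4096.0, 32.0]
m, a = mva_values(chip1, chip2)
print(m)  # [0.0, 1.0, -2.0, 0.0]
print(a)  # [6.0, 7.5, 11.0, 5.0]
```

Plotting M against A should give a cloud centered on M = 0 if the two arrays are comparable; a curve or offset suggests the arrays need normalization.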
Affymetrix chips have a set of Mismatch probes that attempt to adjust for background and
cross hybridization. There are several ways to incorporate the mismatch probes. The approach
used by Affymetrix is called MAS 5 and is described at
Alternative approaches for handling the mismatch probes appear in the following
references:
-
Li C, Wong W (2001). Model-based analysis of oligonucleotide arrays: Expression index
computation and outlier detection. Proceedings of the National Academy of Sciences
U S A 98:31-36.
-
Affymetrix MAS 5 method
-
Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP
(2003). Exploration, Normalization, and Summaries of High Density Oligonucleotide Array
Probe Level Data. Biostatistics.
-
Zhang PDNN
Differential expression is tricky because of the large number of genes tested. To minimize
the number of false positives, you can use an approach that controls the family-wise error
rate. The best known approach is the Bonferroni correction, but this is very conservative.
Alternatives to Bonferroni include:
-
Holm step-down procedure. Holm, S. (1979). A simple sequentially rejective multiple
test procedure. Scandinavian Journal of Statistics 6: 65-70.
-
Hochberg step-up procedure. Hochberg, Y. (1988). A sharper Bonferroni procedure for
multiple tests of significance. Biometrika 75: 800-802.
-
Westfall and Young resampling methods. Westfall, P. H. and Young, S. S. Resampling-based
multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons, 1993.
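To show how the Bonferroni, Holm, and Hochberg adjustments differ, here is a small Python sketch that computes adjusted p-values for each. It is my own illustration of the textbook definitions, not code from S+ArrayAnalyzer; the function names and the four example p-values are made up. The Westfall and Young resampling methods need the raw data and are not sketched here.

```python
def bonferroni(pvalues):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, m * p) for p in pvalues]


def holm(pvalues):
    """Holm adjustment: work up from the smallest p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def hochberg(pvalues):
    """Hochberg adjustment: work down from the largest p-value."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, idx in enumerate(order):  # rank 0 is the largest p-value
        candidate = min(1.0, (rank + 1) * pvalues[idx])
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
for name, method in [("Bonferroni", bonferroni), ("Holm", holm), ("Hochberg", hochberg)]:
    print(name, [round(p, 4) for p in method(raw)])
```

A gene is declared significant when its adjusted p-value falls below the chosen level. For any set of p-values the Hochberg values are never larger than the Holm values, which in turn are never larger than the Bonferroni values, which is why the later methods are less conservative.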
Alternatively, you can consider an approach that controls the false discovery rate. Some
references for this approach are:
-
Benjamini, Y., Yekutieli, D. (2001). The control of the false discovery rate in multiple
hypothesis testing under dependency. Annals of Statistics 29,4: 1165-1188.
-
Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false
discovery rate controlling procedures. Bioinformatics. 2003 Feb 12;19(3):368-75.
-
Benjamini Y, Hochberg Y (1995). Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B,
Methodological 57:289-300.
-
Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Stat
Med. 1990 Jul;9(7):811-8.
-
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci
U S A. 2003 Aug 5;100(16):9440-5. Epub 2003 Jul 25.
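As a rough illustration of false discovery rate control, the following Python sketch computes adjusted p-values with the step-up procedure from the Benjamini and Hochberg (1995) paper listed above. The function name and the example p-values are my own, and this is only the basic procedure: the Benjamini and Yekutieli (2001) variant for dependent tests and Storey's q-values are not shown.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # rank m is the largest p-value, rank 1 the smallest
        candidate = min(1.0, pvalues[idx] * m / rank)
        running_min = min(running_min, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 4) for p in benjamini_hochberg(raw)])
```

Calling a gene significant when its adjusted value is below 0.05 aims to keep the expected proportion of false positives among the selected genes at or below 5 percent, rather than guarding against even one false positive.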
Cluster analysis will sort the genes into groups that behave similarly. Methods include:
-
PAM
-
K-means
-
Hierarchical
-
Model Based
A heat map will allow you to see how similarly the genes within each cluster behave.
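To give a feel for what one of these clustering methods does, here is a bare-bones K-means sketch in plain Python. It is my own illustration, not the S+ArrayAnalyzer implementation. The starting centers are passed in by hand to keep the example deterministic, and the four short "expression profiles" are invented.

```python
def kmeans(profiles, centers, max_iter=100):
    """Cluster profiles with Lloyd's algorithm, starting from `centers`."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        new_labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((v - c) ** 2 for v, c in zip(p, centers[k])),
            )
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:  # an empty cluster keeps its old center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical log expression for four genes at three time points.
profiles = [
    [1.0, 2.0, 3.0],  # rising
    [1.1, 2.1, 2.9],  # rising
    [3.0, 2.0, 1.0],  # falling
    [2.9, 1.9, 1.2],  # falling
]
labels, centers = kmeans(profiles, centers=[profiles[0], profiles[2]])
print(labels)  # [0, 0, 1, 1]
```

The two rising genes land in one cluster and the two falling genes in the other. Sorting the rows of a heat map by these labels is what puts similar profiles next to each other.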
The final step is annotation, which tries to place the genes in context and link to freely
available web resources like
-
Locus Link
http://www.ncbi.nlm.nih.gov/LocusLink/ LocusLink provides a single query interface to
curated sequence and descriptive information about genetic loci. It presents information on
official nomenclature, aliases, sequence accessions, phenotypes, EC numbers, MIM numbers,
UniGene clusters, homology, map locations, and related web sites.
-
Unigene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene UniGene is an experimental
system for automatically partitioning GenBank sequences into a non-redundant set of
gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique
gene, as well as related information such as the tissue types in which the gene has been
expressed and map location.
-
Pubmed http://www.ncbi.nlm.nih.gov/PubMed/
PubMed, a service of the National Library of Medicine, includes over 14 million citations
for biomedical articles back to the 1950's. These citations are from MEDLINE and additional
life science journals. PubMed includes links to many sites providing full text articles and
other related resources.
-
GO http://www.geneontology.org/ The goal of the
Gene Ontology™ (GO) Consortium is to produce a controlled vocabulary that can be applied to
all organisms even as knowledge of gene and protein roles in cells is accumulating and
changing. GO provides three structured networks of defined terms to describe gene product
attributes. GO is one of the controlled vocabularies of the Open Biological Ontologies.
-
KEGG http://www.genome.ad.jp/kegg/ A grand
challenge in the post-genomic era is a complete computer representation of the cell and the
organism, which will enable computational prediction of higher-level complexity of cellular
processes and organism behaviors from genomic information. Towards this end we have been
developing a bioinformatics resource named KEGG, Kyoto Encyclopedia of Genes and Genomes, as
part of the research projects in the Kanehisa Laboratory of Kyoto University Bioinformatics
Center.
-
Affymetrix GO Browser
-
Onto-Express
http://vortex.cs.wayne.edu/projects.htm The typical result of a microarray experiment is
a list of tens or hundreds of genes found to be differentially regulated in the condition
under study. Independently of the methods used to select these genes, the common task faced
by any researcher is to translate these lists of genes into a better understanding of the
biological phenomena involved. Currently, this is done through a tedious combination of
searches through the literature and a number of public databases. We developed Onto-Express
(OE) as a novel tool able to automatically translate such lists of differentially regulated
genes into functional profiles characterizing the impact of the condition studied. OE
constructs functional profiles (using Gene Ontology terms) for the following categories:
biochemical function, biological process, cellular role, cellular component, molecular
function and chromosome location. Statistical significance values are calculated for each
category. We demonstrated the validity and the utility of this comprehensive global analysis
of gene function by analyzing two breast cancer data sets from two separate laboratories. OE
was able to identify correctly all biological processes postulated by the original authors,
as well as discover novel relevant mechanisms (Draghici et al., Genomics, 81(2), 2003). Other
results obtained with Onto-Express can be found in Ostermeier et al., Lancet, 360(9335),
2002.
-
DAVID/EASE
http://david.niaid.nih.gov/david/ease.htm EASE is a customizable, standalone software
application that facilitates the biological interpretation of gene lists derived from the
results of microarray, proteomic, and SAGE experiments. EASE provides statistical methods
for discovering enriched biological themes within gene lists, generates gene annotation
tables, and enables automated linking to online analysis tools.
-
Swiss-Prot http://us.expasy.org/sprot/ Swiss-Prot is
a curated protein sequence database which strives to provide a high level of annotation
(such as the description of the function of a protein, its domains structure,
post-translational modifications, variants, etc.), a minimal level of redundancy and high
level of integration with other databases.
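Several of the tools above report a significance value for each functional category. One common way to get such a value is a hypergeometric test: given how many genes on the chip belong to a category, how surprising is it to see this many of them in the list of selected genes? The Python sketch below shows that generic calculation. It is an assumption on my part that this resembles what any particular tool does; Onto-Express and EASE each have their own statistics, and the function name and counts here are invented.

```python
from math import comb


def enrichment_pvalue(total_genes, category_genes, selected_genes, selected_in_category):
    """Upper-tail hypergeometric probability P(X >= selected_in_category).

    total_genes          -- genes on the chip
    category_genes       -- genes on the chip annotated to the category
    selected_genes       -- genes in the differentially expressed list
    selected_in_category -- selected genes annotated to the category
    """
    upper = min(category_genes, selected_genes)
    favorable = sum(
        comb(category_genes, k)
        * comb(total_genes - category_genes, selected_genes - k)
        for k in range(selected_in_category, upper + 1)
    )
    return favorable / comb(total_genes, selected_genes)


# Hypothetical counts: 10 genes on the chip, 5 in the category,
# 2 genes selected, and both of them fall in the category.
print(round(enrichment_pvalue(10, 5, 2, 2), 4))  # 0.2222
```

Because one such test is run for every category, the resulting p-values face the same multiple testing problem as the gene-level tests and usually need one of the adjustments discussed earlier.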
For further details look at the
handout for this
web seminar [pdf].
07/08/2008.
Category: Data mining,
Category: Statistical computing