ProbCD: enrichment analysis accounting for
categorization uncertainty
Ricardo Z.N. V^
encio
1
, Ilya Shmulevich
1
July 5, 2007
Abstract
As in many other areas of science, systems biology makes extensive
use of statistical association and significance estimates in contingency
tables, a type of categorical data analysis known in this field as enrich-
ment (also over-representation or enhancement) analysis. In spite of
efforts to create probabilistic annotations, especially in the Gene On-
tology context, or to deal with uncertainty in high throughput-based
datasets, current enrichment methods largely ignore this probabilis-
tic information since they are mainly based on variants of the Fisher
Exact Test. We developed an open-source R package to deal with
probabilistic categorical data analysis, ProbCD, that does not require
a static contingency table. The contingency table for the enrichment
problem is built using the expectation of a Bernoulli Scheme stochas-
tic process given the categorization probabilities. An on-line interface
was created to allow usage by non-programmers and is available at:
http://xerad.systemsbiology.net/ProbCD/.
We present an analysis
framework and software tools to address the issue of uncertainty in
categorical data analysis. In particular, concerning the enrichment
analysis, ProbCD can accommodate: (i) the stochastic nature of the
high-throughput experimental techniques and (ii) probabilistic gene
annotation.
1
Institute for Systems Biology, 1441 North 34th street, Seattle, WA 98103-8904, USA
to whom correspondence should be addressed: rvencio@gmail.com
1
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
Background
The system-level approach to data analysis known as enrichment analysis
(also known as over-representation or enhancement analysis) is now com-
monplace. Moreover, the number of available software tools to perform such
analysis is large (see [10, 22] for comprehensive reviews). The preferred way
to formalize the enrichment problem is by means of a contingency table, often
2 × 2.
The mathematical problem is conceptually generic, being applied to di-
verse types of data, such as genomics, transcriptomis or proteomics datasets;
diverse types of analysis, including multiple and/or ordered outcomes; and
diverse types of gene classification schemes, such as Gene Ontology (GO),
KEGG or organism-specific ones. For a given ontology term t defining the set
of genes G
t
and its complementary set G
c
t
, the general enrichment analysis
contingency table is:
G
t
G
c
t
outcome
1
X
1,1
X
1,2
outcome
2
X
2,1
X
2,2
· · ·
· · ·
· · ·
outcome
k
X
k,1
X
k,2
Besides measuring the statistical significance of the null hypothesis that
the rows and columns are independent, as yielded by Fisher's Exact Test
[13] and Fisher-like methods [10, 22], it is also possible to measure statistical
association between a table's rows and columns [15] (a detailed discussion on
significance vs. association in the enrichment problem context can be found
in [25]).
Most of the attention in the enrichment analysis problem has focused on
issues such as the search for the best multiple-test correction or the imple-
mentation of better user-friendly software interfaces to facilitate biologist's
exploratory work [10]. However, one of the limitations that the available
approaches still share is that they assume, explicitly or implicitly, that one
is able to construct the contingency table exactly, without uncertainty in
populating its cells.
Recently, the computational biology community has been witnessing an
2
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
increasing interest in probabilistic approaches to gene annotation, particu-
larly in the Gene Ontology (GO) context, as a realization of the limitations
imposed by the traditional deterministic and context-independent gene an-
notation schemes [17, 18, 12, 21, 11, 9, 26, 16]. These efforts are motivated
by: the necessity to assess the error propagation in automatic gene anno-
tation [18, 16]; desire to include different types of evidence sources such as
protein-protein interaction [17, 9] or phylogenomics [12, 11] and annotation
extrapolation from model organisms to others [26, 21]. Meanwhile, the prob-
abilistic nature of data obtained by high-throughput measurement techniques
is well recognized and a number of attempts to model it were proposed over
the past decade in various experimental contexts [27, 28]. However, these
efforts are not integrally taken into account when usual enrichment analysis
is performed.
We describe a computational solution that is able to deal with the uncer-
tainty introduced in enrichment analysis due to: (i) the stochastic nature of
the results obtained with such high-throughput experimental techniques or
(ii) probabilistic gene annotation.
Implementation
ProbCD is an open-source software designed to perform probabilistic categor-
ical data analysis. ProbCD is written in R [6] with a level of modularity that
makes it suitable to be incorporated by existing development efforts of inte-
grative tools [24]. To facilitate the usage by researchers with no knowledge
of R, we implemented a user-friendly web-based interface for the software,
which is not limited to any particular organism. The on-line interface and
the source-code are available on the project's website [1].
The idea behind ProbCD's implementation is to formally represent the
intuitive process of building a contingency table in a probabilistic manner.
Informally speaking, each element to be placed in the contingency table is
not considered to be indivisible, but instead is "shared", according to prob-
abilistic rules, among the contingency table's cells in a manner that is con-
ceptually similar to fuzzy membership. The theoretical and computational
implementation aspects are described in detail below.
3
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
Without loss of generality, the following descriptions are applied consid-
ering one particular ontology term t that is associated with a set of genes,
named simply as G
t
. It should be noted that G
t
is not restricted to the Gene
Ontology categorization and can be any kind of classification or annotation.
The vector q contains a probabilistic annotation for all g of the organism's
genes: q
j
= P(gene
j
G
t
) for j {1, · · · , g}. This probabilistic annotation
is assumed to be given, typically obtained from some analysis process. The
deterministic scenario corresponds simply to P(gene
j
G
t
) {0, 1}, and
hence is a special case.
The matrix P contains a probabilistic description for all k possible out-
comes of the property being studied. Therefore, P is a k × g matrix with
elements P
i,j
= P(gene
j
outcome
i
) for j {1, · · · , g} and i {1, · · · , k}.
This probabilistic description of the data uncertainty is assumed to be given.
To motivate the general probabilistic model, it is useful to examine an
arbitrary 2 × 2 example in the deterministic scenario:
G
G
c
H
x
1,1
x
1,2
H
c
x
2,1
x
2,2
where all x's are the counts of a regular contingency table over the gene
sets G and H. In its matrix representation:
x
1,1
x
1,2
x
2,1
x
2,2
=
j
1
{gene
j
H}
1
{gene
j
G}
j
1
{gene
j
H}
1
{gene
j
G
c
}
j
1
{gene
j
H
c
}
1
{gene
j
G}
j
1
{gene
j
H
c
}
1
{gene
j
G
c
}
where 1
{}
is the indicator function.
Inspired by this representation, it is easy to see that the "hard" indica-
tor functions may be substituted by Bernoulli random variables in order to
account for the categorization uncertainty. Since all sets are finite, the indi-
cator functions can be represented as vectors in {0, 1}
g
and the sums over all
genes as dot products. In a generic scenario, with given non-deterministic
P and q, the contingency table represented by X|P, q is a random matrix
that is difficult to describe in closed form. It is also not compatible with
the statistical formalism supporting Fisher's Exact Test or other well-known
Fisher-like approaches, as these are not applicable to random tables.
4
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
The contingency table is defined in terms of Bernoulli Schemes [2] which
is the generalization of the Bernoulli Process to more than two possible out-
comes. The notation Z Be(p
1
, · · · , p
n
) represents the distribution:
z =
(1, 0, 0, · · · , 0)
with probability p
1
;
(0, 1, 0, · · · , 0)
with probability p
2
;
(0, 0, 1, · · · , 0)
with probability p
3
;
· · ·
(0, 0, 0, · · · , 1)
with probability p
n
.
p
1
+ · · · + p
n
= 1
The random variable X is a matrix representation of a k × 2 contingency
table:
X
1,1
X
1,2
· · ·
· · ·
X
k,1
X
k,2
=
d
1
· a
1
d
1
· a
2
· · ·
· · ·
d
k
· a
1
d
k
· a
2
where · is the usual dot-product, a
i
= (A
i,1
, · · · , A
i,g
) is a row-vector of
a 2 × g binary matrix A such that (A
1,j
, A
2,j
)|q
j
Be(q
j
, 1 - q
j
) and d
i
=
(D
i,1
, · · · , D
i,g
) is a row-vector of a k×g binary matrix D such that (D
1,j
, · · · , D
k,j
)|(P
1,j
, · · · , P
k,j
)
Be(P
1,j
, · · · , P
k,j
).
It is very easy to extend this framework for completely generic k×m tables
(m > 2), but this would be outside the scope of the ontology enrichment
problem.
To measure statistical association between rows and columns in an or-
dered contingency tables, as is analogously made when correlations are calcu-
lated for non-categorical data, Goodman-Kruskal's gamma (X) can be used
[15, 14, 25]. ProbCD calculates the statistical association accounting for the
stochastic nature of the table's categorization reporting = (E[X|P, q]),
where E is the expectation operator. This is a particular association mea-
surement that can be easily changed for other user-implemented options.
The dichotomous case, which is the simplest one, gives a more intuitive
illustration on how the association is calculated in practice for the particular
implementation: E[X
1,1
|P, q] = E
1,1
= P
1,1
q
1
+ · · · + P
1,g
q
g
, E[X
2,1
|P, q] =
E
2,1
= (1-P
1,1
)q
1
+· · ·+(1-P
1,g
)q
g
, E[X
1,2
|P, q] = E
1,2
= P
1,1
(1-q
1
)+· · ·+
5
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
P
1,g
(1 - q
g
), E[X
2,2
|P, q] = E
2,2
= (1 - P
1,1
)(1 - q
1
) + · · · + (1 - P
1,g
)(1 - q
g
)
and = (E
1,2
E
2,2
- E
1,2
E
2,1
)/(E
1,2
E
2,2
+ E
1,2
E
2,1
).
To measure the statistical significance of the estimated association, ProbCD
uses a randomization approach. The null distribution for the Goodman-
Kruskal's gamma,
, is proposed to be estimated from several permuta-
tion rounds.
In each round a gene j receives randomly its probabilities
(P
1,j
, · · · , P
k,j
) from one of the g possible columns of P and an association
value is calculated. The significance of the statistical association between
rows and columns in the contingency table is calculated as p = P(
).
A term t is significantly over-represented (or equivalently, the gene list is
enriched for t) depending on user-defined thresholds for significance and/or
association.
Results
The following examples illustrate the potential utility of considering prob-
abilistic annotations and/or data uncertainty assessment in the enrichment
analysis using ProbCD on artificial datasets and a published yeast dataset.
The point of the following illustration is to show that even ontology terms
annotated with modest probabilities can be considered to be over-represented
if the list of genes obtained behave in a supportive pattern. Consider a hypo-
thetical organism with 100 genes annotated in several GO terms, as described
in the Additional Files. The genes gene
1
to gene
20
are deterministically an-
notated to the ontology term t = a. In other words, assume that it is well
known that these 20 genes have some given functionality a. The experiment,
for example from a hypothetical proteomics dataset, yielded a deterministic
list of differentially expressed (DE) genes ranging from gene
1
to gene
10
. The
contingency table for this problem is, therefore:
G
a
G
c
a
DE
10
0
DE
c
10
80
In this case, the gene list is clearly enriched for a within any meaningful
significance cutoff. Consider now a second ontology term b obtained from
a probability-based source with P(gene
i
G
b
) = 40%, i {1, · · · , 20}. A
6
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
probability of only 40% generally would not be sufficient evidence to warrant
the inclusion of those 20 genes in G
b
considering a usual deterministic frame-
work and, therefore, would not be analyzed by deterministic-based methods,
such as the Fisher's Exact Test. However, ProbCD is able to incorporate
this information and yields: = 0.87 and p < 10
-4
in 10000 permutation
rounds, a significant enrichment for b. One can easily imagine, for example,
genes that have a main function a but also have a different function b in, say,
40% of documented conditions.
The point of the following illustration is to show that the incorporation
of probabilistic annotation information does not always translate to addition
of terms into the enrichment result, as in the example above, but it can also
mean the exclusion of non-relevant terms. Consider a hypothetical organism
with 1100 genes. Let the genes gene
1
to gene
100
be grouped together in a
cluster H after some genomic sequence analysis. Let the term a be annotated
deterministically (Additional Files) yielding the contingency table:
G
a
G
c
a
H
100
0
H
c
100
900
In this situation, H is clearly enriched for a within any meaningful sig-
nificance cutoff. Let now the same annotation incorporate some evidence
levels by defining: P(gene
i
G
a
) = 99% for i {1, · · · , 10} and P(gene
i
G
a
) = 1% for i {11, · · · , 100}. Intuitively, this means that only 10 out of
100 genes clustered in H are, in fact, confidently annotated with the ontol-
ogy term a. The incorporation of this information results in non-significant
enrichment of H for a since: = 0.0425 and p = 0.42 in 1000 permutation
rounds. Therefore, it can be useful to incorporate uncertainty information
into the enrichment analysis to also down-rank potentially spurious enrich-
ment results.
The purpose of the following illustration is to show the impact of consid-
ering the uncertainty in lists of genes, rather than in the annotations, on the
enrichment analysis. In this example, the aim is to find which GO terms, an-
notating the yeast Saccharomyces cerevisiae, are statistically associated with
periodic expression levels, measured by microarray technology [7]. Andersson
and colleagues [7] devised a Bayesian model that produces the probability
7
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
that a gene is periodically expressed during the cell-cycle. This simple pre-
sentation is sufficient for our objectives in this work, but the interested reader
can find more details (e.g. the definition of "periodic", etc.) in the original
work. In this example, the annotation is considered to be deterministic and
it was downloaded from the GO project page (August 2007) [4].
To perform the usual enrichment analysis one needs to define a probability
cutoff value in order to split the gene list in two: the periodic genes and
the non-periodic genes. Consider initially the reasonable cutoff P(gene
i
is
periodic) 70% and focus on a single GO term GO:0007090 (regulation of
S phase of mitotic cell cycle), defined as "a cell cycle process that modulates
the frequency, rate or extent of the progression through the S phase of mitotic
cell cycle". Although this GO term is clearly associated with periodic gene
expression, performing a usual enrichment analysis results in the conclusion
that the periodic genes are not significantly enriched for GO:0007090 within
usual significance cutoffs (p-value = 0.065).
Suspecting that this non-intuitive result could be due to the probability
threshold chosen to select periodic genes, illustrated in the Figure 1, one could
repeat the same analysis above building the contingency table considering
the cutoffs P(gene
i
is periodic) 50%, 95%, 99% or 99.99%. The result of
this repeated analysis is also non-intuitive since the p-values are: 0.12, 1.0,
1.0 and 1.0 for 50%, 95%, 99% and 99.99% cutoffs, respectively, meaning
that increasing the stringency to define a gene as periodic only decreases the
significance of the enrichment for GO:0007090.
8
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
Figure 1: Probability of being periodic.
The blue curve represents the
probability of a gene being periodic (Pr) according to the model of [7].
The genes are sorted by probability values (rank) on the horizontal axis
to facilitate the visualization. The red curve is the deterministic approx-
imation using a 70% probability cutoff to consider a gene as periodic:
P(gene
i
is periodic) 0.70 P(gene
i
is periodic) = 1 and P(gene
i
is
periodic) < 0.70 P(gene
i
is periodic) = 0. This approximation labels 15%
of the genes as periodic.
Using ProbCD one can consider the actual probability of being periodic
(blue curve in Figure 1) in the enrichment analysis instead of using the de-
terministic approximation (red curve in Figure 1). This result in a relatively
high statistical association between periodicity and the term "regulation of
S phase of mitotic cell cycle" ( = 0.78) also with high significance (p =
0.009 in 1000 simulation rounds). Judging subjectively by the definition of
GO:0007090, ProbCD returned a meaningful result.
Other similar cases can be easily identified. For example, the GO term
GO:0000083 (G1/S-specific transcription in mitotic cell cycle) exhibits er-
9
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
ratic behavior depending on the chosen cutoff for the probability of being
periodic: p-value of 0.15, 0.10, 0.01, 0.096 and 1.0 for 50%, 75%, 95%, 99%
and 99.99% cutoffs, respectively. The probability stringency used to build
the contingency table and the subsequent significance test are not necessarily
correlated. ProbCD yielded a significant (p = 0.006) moderate association
( = 0.48) for GO:0000083. Other examples include GO:0045787 (positive
regulation of progression through cell cycle), defined as "any process that
activates or increases the frequency, rate or extent of progression through the
cell cycle", which would be called significant using the regular enrichment
method only if the right probability cutoff P(gene
i
is periodic) 95% is
guessed initially: p-value of 0.047, 0.024, 0.0058, 0.086 and 0.024 for 50%,
75%, 95%, 99% and 99.99%, respectively.
The above analysis process is repeated for all GO terms, with the results
available as Additional Files and summarized in Figure 2.
10
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
Figure 2: Venn diagram of over-represented terms. The Venn diagram shows
the number of GO terms considered significantly over-represented (p-value
0.01) by the Fisher Exact Test using four different probability cutoffs P(gene
i
is periodic) A, B, C or D periodic: A = 0.70, B = 0.95, C = 0.99 and
D = 0.9999.
This figure suggests that there is a large variability in the possible final
outcome of an enrichment analysis depending on the probability cutoff used
to build the associated contingency table. This variability is avoided by
ProbCD because it directly takes into account the uncertainty in the data
instead of introducing a discretization step (Figure 1).
Figure 3 shows that ProbCD considers more terms (vertical axis in Figure
3) containing the word "cell cycle", likely associated to periodically expressed
genes, as significant if compared to the usual enrichment analysis in a wide
range of significance values (p in Figure 3). Although this is not a proof,
since one cannot be certain about which "cell cycle"-marked terms should be
enriched, we argue that this is a reasonable indication that one can, in fact,
11
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
avoid the discretization step when building the enrichment problem using
ProbCD and obtain meaningful results.
Figure 3: Fraction of "cell-cycle" GO terms selected as a function of the
p-value. The curves show the fraction of GO terms containing the word
"cell-cycle" in their definition that are considered significant as a function of
the significance cutoff (p). The red curve is obtained with ProbCD and all
others are obtained with one of the probability cutoffs: 50%, 70%, 95%, 99%
or 99.99%.
Discussion and Conclusions
The usual enrichment analysis is a particular case in this probabilistic frame-
work and can be obtained by ProbCD ignoring the difference between evi-
dence sources in gene annotation and defining fixed gene lists, which would
12
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
correspond to the deterministic setting: q
j
= P(gene
j
G
t
) = 1 or 0 and
P
i,j
= P(gene
j
outcome
i
) = 1 or 0.
Even if a probabilistic annotation is not readily available for a given or-
ganism, it could be interesting to perform enrichment analysis taking into
account some form of weighting on available annotations according to their
reliability. For a concrete example, the GO Consortium [3] provides annota-
tions accompanied with evidence codes related to the kind/level of evidence
available for a given GO annotation [5], such as IEA: Inferred from Electronic
Annotation, IMP: Inferred from Mutant Phenotype, RCA: inferred from Re-
viewed Computational Analysis or IDA: Inferred from Direct Assay. It is
known that some evidence sources are more reliable than others and this
knowledge can be used, in a Bayesian sense, as subjective probabilities.
Once an annotation is considered in a probabilistic framework, it could re-
flect a dependence on the context. One can consider cases in which P(gene
j
G
t
| disease )
P(gene
j
G
t
), defining context-dependent gene annotations
derived, for instance, from automatic literature mining [8].
Our intention is to complement existing approaches, rather than substi-
tute them. Toward this aim, we built ProbCD to be as modular as possible
in order to be incorporated into existent software or pipelines [24], composed
of ontology pre-processing [19] or powerful visualization capabilities [20, 23].
It is important to note that ProbCD is also applicable to other categorical
data analysis contexts in which the construction of contingency tables is
subject to uncertainty, a recurrent theme in science.
Authors contributions
RZNV implemented the project. IS supervised the project. All authors read
and approved the final manuscript.
13
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
Acknowledgments
We thank Drs. John Boyle, Vesteinn Thorsson, Nathan Price and other ISB
colleagues for insightful discussions. This work is partially supported by NIH
grants U54-AI54253, U19-AI057266 and P50-GMO-76547.
References
[1] Baygo2 home page.
[2] Bernoulli scheme -- wikipedia, the free encyclopedia.
[3] The gene ontology consortium.
[4] Gene ontology current annotations.
[5] Guide to go evidence codes.
[6] The r project for statistical computing.
[7] C.R. Andersson, A. Isaksson, and M.G. Gustafsson. Bayesian detection
of periodic mRNA time profiles without use of training examples. BMC
Bioinformatics, 7(1):63, 2006.
[8] M. Aubry, A. Monnier, C. Chicault, M. de Tayrac, M.D. Galibert,
A. Burgun, and J. Mosser. Combining evidence, biomedical literature
and statistical dependence: new insights for functional annotation of
gene sets. BMC Bioinformatics, 7(1):241, 2006.
[9] S. Carroll and V. Pavlovic.
Protein classification using probabilis-
tic chain graphs and the Gene Ontology structure.
Bioinformatics,
22(15):1871, 2006.
[10] J. Dopazo.
Functional Interpretation of Microarray Experiments.
OMICS: A Journal of Integrative Biology, 10(3), 2006.
[11] B.E. Engelhardt, M.I. Jordan, and S.E. Brenner. A graphical model for
predicting protein molecular function. Proceedings of the 23rd interna-
tional conference on Machine learning, pages 297304, 2006.
14
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
[12] B.E. Engelhardt, M.I. Jordan, K.E. Muratore, and S.E. Brenner. Protein
molecular function prediction by Bayesian phylogenomics. PLoS Comput
Biol, 1(5), 2005.
[13] R.A. Fisher. On the Interpretation of
2
from Contingency Tables, and
the Calculation of P. Journal of the Royal Statistical Society, 85(1):87
94, 1922.
[14] G.D. Garson. Political science methods. Holbrook Press, 1976.
[15] L.A. Goodman and W.H. Kruskal.
Measures of Association for
Cross Classifications. Journal of the American Statistical Association,
49(268):732764, 1954.
[16] C.E. Jones, A.L. Brown, and U. Baumann. Estimating the annotation
error rate of curated GO database sequence annotations. BMC Bioin-
formatics, 8(1):170, 2007.
[17] T. Joshi, Y. Chen, J.M. Becker, N. Alexandrov, and D. Xu. Genome-
Scale Gene Function Prediction Using Multiple Sources of High-
Throughput Data in Yeast Saccharomyces cerevisiae. Omics A Journal
of Integrative Biology, 8(4):322333, 2004.
[18] E.D. Levy, C.A. Ouzounis, W.R. Gilks, and B. Audit. Probabilistic
annotation of protein sequences based on functional classifications. BMC
Bioinformatics, 6(1):302, 2005.
[19] A. Lewin and I.C. Grieve. Grouping Gene Ontology terms to improve
the assessment of gene set enrichment in microarray data. BMC Bioin-
formatics, 7:426, 2006.
[20] S. Maere, K. Heymans, and M. Kuiper. BiNGO: a Cytoscape plugin
to assess overrepresentation of Gene Ontology categories in Biological
Networks. Bioinformatics, 21(16):34483449, 2005.
[21] D.M.A. Martin, M. Berriman, and G.J. Barton. GOtcha: a new method
for prediction of protein function assessed by the annotation of seven
genomes. BMC Bioinformatics, 5:178, 2004.
[22] I. Rivals, L. Personnaz, L. Taing, and M.C. Potier. Enrichment or de-
pletion of a GO category within a class of genes: which test? Bioinfor-
matics, 23(4):401407, 2007.
15
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007
[23] R.S.G. Sealfon, M.A. Hibbs, C. Huttenhower, C.L. Myers, and O.G.
Troyanskaya. GOLEM: an interactive graph-based gene-ontology navi-
gation and analysis tool. BMC Bioinformatics, 7:443, 2006.
[24] P.T. Shannon, D.J. Reiss, R. Bonneau, and N.S. Baliga. Gaggle: An
open-source software system for integrating bioinformatics software and
data sources. BMC Bioinformatics, 7(1):176, 2006.
[25] R. Vencio, T. Koide, S. Gomes, and CA Pereira. BayGO: Bayesian
analysis of ontology term enrichment in microarray data. BMC Bioin-
formatics, 7(1):86, 2006.
[26] A. Vinayagam, C. del Val, F. Schubert, R. Eils, K.H. Glatting, S. Suhai,
and R. K¨
onig. GOPET: A tool for automated predictions of Gene On-
tology terms. BMC Bioinformatics, 7(1):161, 2006.
[27] W. Zhang and I. Shmulevich. Computational and Statistical Approaches
to Genomics - 2nd edition. Springer, New York, NY, USA, 2006.
[28] W. Zhang, I. Shmulevich, and J. Astola. Microarray Quality Control.
Wiley-Liss, 2004.
16
Nature Precedings : hdl:10101/npre.2007.369.1 : Posted 6 Jul 2007