Annotation-Modules is a tool for ontological analysis which finds statistically significant combinations of annotations (the modules)
in a given gene list compared to a set of reference genes.
Short summary of the main features:
The underlying annotation database is substantially extended compared to those used by similar algorithms,
usually exploiting functional annotations like GO ontology, KEGG pathways or Swiss-Prot keywords. We have implemented
approximately 60 different annotation features which have a total of around 18,000 different values.
The annotated features are obtained from many different fields like gene regulation (TFBS, microRNA), conservation and
evolution (ultra-conserved elements, taxonomic depth) population dynamics (SNPs), functional annotations
(GO categories, Swiss-Prot keywords) and sequence properties.
The algorithm performs an enrichment/depletion analysis not just for all single items (annotations) but for all combinations of items up to a given size.
The user can chose between many different, pre-computed sets of reference genes, provide its own reference genes,
or even provide a pre-annotated set of reference genes.
Currently 12 different input IDs can be used (the input IDs can be even a mixture of different gene identifiers).
The Method:
The general functionality of the algorithm is basically the same as in all methods for ontological analysis (reviewed by: Khatri P, Draghici S (2005)
Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21:3587-95), except that it analyses not just
single annotations but also all combinations between them up to a given number as proposed by Carmona-Saez et al (GENECODIS: A web-based tool for finding
significant concurrent annotations in gene lists. Genome Biology 2007 8).
Briefly, the algorithm performs the following steps: 1) First, the annotations which have been chosen by the user are assigned to the gene list, 2)
for each (combination of) annotation(s), the number of genes which are assigned to it is determined, 3) The p-values are calculated by means of the
hypergeometric distribution, 4) The p-values are corrected applying the FDR method by Benjamini and Hochberg (J R Stat Soc 1995, 57:289?300)
Some important comments:
Mapping between different IDs
A crucial point in each annotation database is the mapping between different IDs. These mappings are necessary at two different levels.
First, our current knowledge is spread out over various databases and in order to incorporate all these data into the annotation database,
we need to map between different IDs. Second, to improve the versatility of the tools, a high number of accepted input IDs is desirable
which again will require the mapping between different types of IDs.
To cope with these tasks, we used several publically available mapping tables. By means of carefully cross-linking them we obtain all mapping
rules we needed for the current status of the database.
The following mapping tables have been used:
Database
Description
Reference
1) EBI
This mapping table links the following IDs: SwissProt, IPI, VEGA, UniGene, Trembel, RefSeq Protein, GeneBank GI and Ensemble
Protein IDs
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/ipi.SPECIE.xrefs.gz
SPECIE can be HUMAN, MOUSE or RAT
2) UCSC
This table links the UCSC known genes to Ensembl transcript IDs
ftp://hgdownload.cse.ucsc.edu/goldenPath/DB/database/knownToEnsembl.txt.gz
DB can be hg18 (human), mm8 (mouse) or rn4 (rat)
3) UCSC
This table links the different Ensembl IDs (Transcript, Gene and Protein IDs) between each other.
ftp://hgdownload.cse.ucsc.edu/goldenPath/DB/database/ensGtp.txt.gz
DB can be hg18 (human), mm8 (mouse) or rn4 (rat)
4) UCSC
This table links UCSC Known Gene IDs, mRNA IDs (Accession numbers), SWISS-PROT protein Accession number,
Gene Symbol, RefSeq ID and NCBI protein Accession number.
ftp://hgdownload.cse.ucsc.edu/goldenPath/DB/database/kgXref.txt.gz
DB can be hg18 (human), mm8 (mouse) or rn4 (rat)
5) Affymetrix
Links the Affymetrix IDs to several commonly used IDs like: RefSeq IDs, RefSeq Protein IDs, Gene Symbols, Ensembl Gene IDs,
Swiss-Prot protein Accession number, etc.
To obtain the mapping rules we need (e.g. all mappings towards the internal cross-link tables & all mappings towards the primary databases
from where we retrieve the data - like Gene Ontology, Swiss-Prot/UniProt KnowledgeBase etc.) we link the tables mentioned above using several
key IDs. We followed the steps described below:
Link the IDs on a transcript level (RefSeq IDs, Ensembl - ENST, UCSC Known Genes). This has been done by means of table 2)
and table 4) (see above). The UCSC Known Gene IDs are present in table 2) and table 4) and can be used to link RefSeq IDs and Ensembl transcript IDs
together.
Link the Ensembl Gene and Protein IDs to the RefSeq IDs using Ensembl Transcript (ENST) <--> RefSeq ID mappings as cross links.
Link the rest of the IDs available in table 1) to the RefSeq IDs. RefSeq Protein IDs are available in table 1) and in table 4) and
therefore can be used to link IPI, UniGene, GI and VEGA IDs to the RefSeq IDs.
Link IPI, VEGA, Swiss-Prot, UniGene and GI IDs to the Ensembl transcript IDs by means of cross linking them to the Ensembl Protein IDs
Link mRNA IDs and Gene Symbols (table 4) to Ensembl transcript IDs by means of cross-linking them to UCSC KnownGene IDs.
Link IPI, VEGA, Swiss-Prot, UniGene, GI, mRNA IDs and Gene Symbols to UCSC KnownGene IDs by means of cross-linking them to Ensembl
transcript IDs.
Link the Ensembl Protein and Gene IDs to the IPI and Swiss-Prot IDs by cross-linking them to Ensembl Protein IDs (present in table 1 and 3).
Link Gene Symbols and mRNA IDs to IPI and Swiss-Prot IDs by cross-linking them to RefSeq Protein IDs (present in table 1 and 4).
Internal cross-link gene table
Algorithms which accept different gene identifiers need to map these input IDs to an internal "working" table (for example Ensemble IDs).
In general, this is invisible to the user and the table cannot be chosen. The disadvantage is that when mapping between gene IDs (like NCBI gene symbols),
Protein IDs (like IPI, Swiss-Prot) or transcript IDs (Ensemble -ENST, or RefSeq), this will always lead to ambiguous decisions like the handling of
multiple mappings etc. We therefore decided to offer different "cross link tables" to avoid unnecessary mappings in some cases. For example, if the user
is just interested in GO categories (which are annotated on a protein level) and his input list consists of protein IDs, a protein table like IPI or
Swiss-Prot can be chosen to avoid unnecessary mappings.
The set of reference genes
A crucial issue in the assessment of statistically significance is the correct selection of the set of reference genes. The number of genes
assigned to an annotation and the corresponding number of genes which are not assigned are used to calculate p-value. The p-values are
calculated as probabilities that in the submitted gene list more genes (enrichment) or fewer genes (depletion) are assigned to the item than
expected by chance alone.
The reference set determines the expectation values assuming a hypergeometric null distribution. Therefore a wrong selection of the reference
genes will lead to biased p-values which in turn might lead to wrong biological interpretations of the statistical analysis. As a rule of thumb,
the pool of reference genes should contain all genes which might appear in the input list. For example, if the analysis is on differentially
expressed genes, the input gene list theoretically can be composed of all genes which are on the DNA microarray chip and consequently the
reference set must be made up of all genes on the chip.
Missing annotations & reference genes
Two different treatments can be chosen. 1) use just genes with all annotations assigned or 2) add a "unknown"
label to the gene if the annotation is not available.
just genes with all annotations assigned: When choosing this option, just those genes which have all chosen annotation features assigned will be in the final reference gene set.
The genes for which no information on a given annotation exists will be dropped. As an example, let's take an analysis run on two
annotation features, e.g. GO terms at level 3 (GO-level-3) and CpG islands (CGI-R1). For the RefSeq gene table, 14441 different genes
have a GO term annotated at this level while for 24929 genes an CpG island annotation exists. This means that the maximum number of
genes in the final reference set would be 14441. The final size in our current database would be 14315 as around 100 genes have a GO-term
annotated but no CpG island information is available. Therefore, the p-values of single annotations obtained from an individual experiment
(using just one annotation feature) will not coincide with those from a combination experiment (choosing more than one annotation feature).
The user should just choose those annotation features in which he is interested most.
add a "unknown" label if the annotation label is unknown: this option adds a label called "unknown" if the chosen annotation is not available for the gene.
In this way all genes from the input list can be used without deleting genes due to lack of information.
P-values and correction for multiple testing
Different significance levels can be chosen for the FDR correction. Note that, if the level is chosen to be one: 1) just single annotations
will be calculated, and 2) the top-ranked 20 annotation items will be reported, regardless of their significance.
Output
Three different output files are written. The first is in html format to provide the user with a brief overview of the results. In this file all
the modules which have a single annotation with a lower p-value are highlighted in red. This means we highlight those modules which are probably
less interesting, as their significance may be caused by a very significant single annotation and not by the interplay of different annotations
(see also next point).
The second file is in text mode and displays the results in a more detailed way. Finally, the third file gives a brief summary
of the input data used.
Red shading
We mark with a red shading those combinations for which one of the members has a better (smaller) p-value than the p-value for the annotation module (combination).
The argumentation is the following. If we have a very significant single annotation and we combine it with a randomly distributed annotation,
then the combination is very likely still significant, however with a higher p-value that those of the single annotation. In such a case,
the combination is most likely not interesting in a biological context as no "real" interplay between annotations can be inferred from it.
Update Frequency
The underlying database can be automatically updated. We are planning to perform an update every two month approximately.
Each update of the database or the tool will be published in the Annotation-Modules News page.
The following species are currently implemented:
human (Homo sapiens), mouse (Mus musculus), rat (Ratus norvegicus),
Accepted Input IDs:
human (Homo sapiens)
Reference Sequence Gene and Protein IDs (NM_, NP_) Ensemble Gene-Protein-Transcript IDs (ENSG,ENSP,ENST) UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2) UniProtKB/trEMBL IDs (A6NLT7) UniGene (Hs.) VEGA genes (OTTHUMP00000162794) Gene Symbols (ARNTL) Gene ID (10047090) International Protein Index (IPI00305185) Known Genes/UCSC genes (uc001kod) Affymetrix IDs (1556017_at)
mouse (Mus musculus)
Reference Sequence Gene and Protein IDs (NM_, NP_) Ensemble Gene-Protein-Transcript IDs (ENSG,ENSP,ENST) UniProtKB/trEMBL IDs (A6NLT7) UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2) UniGene (Mm.) VEGA genes (OTTMUSP00000029240) Gene Symbols (ARNTL) Gene ID (10048422) International Protein Index (IPI00228959) Known Genes/UCSC genes (AK169072) Affymetrix IDs (1415760_s_at)
rat (Ratus norvegicus)
Reference Sequence Gene and Protein IDs (NM_, NP_) Ensemble Gene-Protein-Transcript IDs (ENSG,ENSP,ENST) UniProtKB/trEMBL IDs (A6NLT7) UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2) UniGene (Rn.) Gene Symbols (ARNTL) Gene ID (10120494) International Protein Index (IPI00204201) Known Genes/UCSC genes (AB012231) Affymetrix IDs (1367478_at)
Annotations:
The availability of annotations depends on both the selected species and the selected gene table. Some annotations are not available for all species;
others may exist for all species but just for one specific gene table. The following table gives an overview of the features contained in our database.
Gene regulation and gene expression
TFBS
Detected by means of TransFac
PFM (Position Frequency Matrix) in the promoter region from TSS-1500 bp to TSS+500bp. To be included, the predicted
binding site must exist at the same site in a multiple sequence alignment in human, mouse and rat with a score higher than 90%.
Furthermore, the predicted TFBS have been binned to take into account a distance dependency from the TSS (4 different bins have been
used with widths of 200, 500, 1000 and 1500 bp). Finally, each of the 4 TFBS annotations sets (one for each bin width) exists also in
strand sensitive and insensitive variant which yields finally 8 different annotation sets derived from TFBS.
CpGislands
The prediction of CpG islands was done by means of the
CpGcluster algorithm running it with default parameters.
Furthermore, for the human genome we also incorporated the predictions by Bock et al. (2007) CpG island mapping by epigenome prediction,
PLoS computational biology, 3, e110, which takes into acount in silico the epigenetic states, histone modifications, and chromatin accessibility.
microRNA
Small noncoding RNAs called microRNA play important role in the post-transcriptional regulation of gene expression by either degrading mRNA or inhibiting translation.
Currently we include two different target site predictions
(PicTar and MirBase.
For each of the two we prepared two different annotations. In the first we assigned the name of the miRNA as the annotation label while
for the second we just distinguish between two groups: target and no-target genes.
Expression Breadth
To estimate the expression breadth (the percentage of tissues where the gene is expressed) we used the expression data from the
GeneAtlas.
As the expression breadth is continuous it needs to be binned or classified in order to obtain discrete labels (see below).
Functional categories and keywords
GO ontology
The
GO ontology
and
gene associations
were processed as described before. If a gene is annotated to a given level then we annotate it automatically to all parent levels as well.
The user has to choose a level for all three organizing principles (molecular function, biological process and cellular component).
Swiss-Prot/UniProt KnowledgeBase
We included several annotations which we extracted from the Swiss-Prot/UniProt KnowledgeBase. Beside the commonly used
keywords
, we assigned also some identifiers from
feature table
like post-transcriptional modifications (MOD_RES) or trans-membrane proteins (TRANS_MEM).
Finally we also extracted information from the
comment line
, like possible DISEASE associations of a gene.
Evolution and conservation
Taxonomic Depth
We define the taxonomic depth as the last common taxonomic level of the genes belonging to the same homologous gene cluster.
The gene clusters have been extracted from the
HomoloGene database
at NCBI. For example, if we have a cluster with three genes from Mouse: Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Mammalia;Eutheria;Euarchontoglires;Glires;Rodentia;Sciurognathi;Muroidea;Muridae;Murinae;Mus
Human: Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Mammalia;Eutheria;Euarchontoglires;Primates;Haplorrhini;Catarrhini;Hominidae;Homo
Dog: Eukaryota;Metazoa;Chordata;Craniata;Vertebrata;Euteleostomi;Mammalia;Eutheria;Laurasiatheria;Carnivora;Caniformia;Canidae;Canis
the assigned taxonomic depth would be Eutheria as this is the last common ancestor taxonomic level.
Overlap with Phylogenetically conserved elements (PhastCons)
PhastCons are known to be associated with 3' UTRs of regulatory genes and show also statistical evidence for enrichment for RNA secondary structure.
We associated them with several gene regions, like the promoter regions, the introns and 3'UTR.
Sequence Properties
Codon usage
The 'effective number of codons' used in a gene (Wrigth 1990) may reveal constraints on the evolution of codon usage as the Synonymous Codon Usage may be caused by
various forms of natural selection (to optimize the efficiency/accuracy of translation) and/or maintain structural features of the mRNA/DNA.
This value can vary between 21 (very biased codon usage) and 61 (random usage).
Base Composition
Several annotations exist related to the base composition of the CDS/mRNA like3 the G+C content of the mRNA, the C+G content at the third positions in codons
(GC3) or the composition of synonymous codons (GC3s).
Miscellaneous
Isochores
We used the
IsoFinder
algorithm to predict the isochores in the respective genomes. For the classification we used 6 classes L1, L2 (which are AT-rich)
and H1, H2, H3 and H4 (which are increasingly G+C-rich).
Finally we assigned each gene to a physical isochore and one of the 6 classes as annotation label.
SNPs
To assess the relation to SNPs, we split the genes into two groups: with and without a known
SNP
in the coding region.
Overlap with Repetitive Elements
The overlap with repetitive elements may be interesting in some cases (for example the exonization of Alu elements).
We quantified the overlap with these elements in several gene regions.
Simple classification of promoter regions
Bajic et al. (Mice and Men: Their Promoter Properties - 2006 Plos Genetics:e54) introduced a 4 family classification of human and mouse promoters
which can be linked to certain functional properties. Briefly, the GC-content upstream and downstream of the transcription start site is calculated
(in regions of 100 bp and 200 bp). A region is considered GC-rich if its GC-content is higher than 50% and GC-poor otherwise. 4 different combinations
are possible, poor-poor (upstream and downstream GC-poor), rich-poor (upstream GC-rich and downstream GC-poor), poor-rich (upstream GC-poor and downstream GC-rich)
and rich-rich (upstream and downstream GC-rich).
Co-localization with genomic elements or entities
The presence of certain genomic elements near or within genes has a clear biological meaning (TFBs in the promoter region or CpG islands overlapping the
TSS of most house-keeping genes). We annotate different gene regions and define a genomic element as present if it overlaps at least one base pair with a
given gene region. In this way we annotated the 'intrinsic', unambiguous gene regions like exons, intrones, 5'UTR and 3'UTR. More complicated is the definition
of the promoter regions and no consistent definition exists in the literature. The "real" borders of the promoter regions may vary widely between the different
genes and will also depend on the genomic element whose co-localization with the genes is going to be established. As a consequence, we defined 8 'arbitrary'
regions, of which 6 are definitions of the promoter region and 2 define regions at the 3' end of the gene. The classification can be seen in the following table:
Region
Region Borders [From;To]
R1
[TSS;TSS]
R2
[TSS-100 bp;TSS+100 bp]
R3
[TSS-200 bp;TSS+200 bp]
R4
[TSS-500 bp;TSS]
R5
[TSS-500 bp;TSS-200 bp]
R6
[TSS-1500 bp;TSS]
R7
[TES;TES]
R8
[TES;TES+500 bp]
Binning continuous data
Many of the features actually have a continuous distribution e.g. the codon usage, expression breadth, G+C contents etc. The method we use, however,
uses just discrete data (with labels). Therefore, the continuous data must be binned or classified in order to obtain discrete feature values. We do that
following a rather simple recipe. We introduce 3 classes, one for low values, one for intermediate values and one for high values. We assign the label "Low"
to the 20% of genes with the lowest values, "High" to the 20% of genes with the highest values, and "Intermediate" to the 60% of genes with intermediate values.