Test data for Annotation-Modules:

This data file holds all RefSeq genes on human chromosome 1 that have a CpG island overlapping their TSS (Transcription Start Site):
Gene List
To set an appropriate statistical background, only RefSeq genes from chromosome 1 should be used as reference genes; otherwise the p-value calculations would be biased:
Reference Genes

Guide for Annotation-Modules

In the following, we will briefly discuss the different steps in Annotation-Modules.

Species

After launching the Annotation-Modules PHP interface, the first page asks for the species to be analysed (see Figure 1).


Figure 1: The species currently implemented in Annotation-Modules.

Input & Parameters

The second page can be divided into two parts: the upper part shows the input options, and the bottom part deals with the different parameters.
Three different input options are implemented:
  1. Gene List & Fixed Reference List: this option should only be used if the user is absolutely sure that one of the offered lists of reference genes (see the documentation) fits their needs.
  2. Gene List & User Reference List: the user uploads both the list to be analysed and the list of reference genes (the statistical background).
  3. Gene List & User Pre-annotated Reference List: with this option, the user can upload their own annotations. The annotations must be placed in the file which holds the reference genes, not in the input gene list (the input gene list is then annotated automatically). The file must have two columns: the first contains the gene IDs and the second the comma-separated items (annotations).
ID Annotations (comma separated)
NM_152486 blue,yellow
NM_032129 blue
NM_198576 yellow,black
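The two-column layout above can be read with a few lines of Python. This is only a minimal sketch, not the parser used by Annotation-Modules: the function name parse_annotation_file is hypothetical, and it assumes that the columns are separated by whitespace (tab or space) and that the uploaded file has no header row.

```python
def parse_annotation_file(text):
    """Map each gene ID to the set of its annotation items.

    Assumptions (not confirmed by the documentation): columns are
    separated by whitespace and the file contains no header row.
    """
    annotations = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        parts = line.split(None, 1)  # gene ID, then the rest of the line
        gene_id = parts[0]
        raw_items = parts[1].split(",") if len(parts) > 1 else []
        items = {item.strip() for item in raw_items if item.strip()}
        # Merge items if the same gene ID appears on several lines.
        annotations.setdefault(gene_id, set()).update(items)
    return annotations


example = """NM_152486 blue,yellow
NM_032129 blue
NM_198576 yellow,black"""

table = parse_annotation_file(example)
for gene_id, items in table.items():
    print(gene_id, sorted(items))
```

Storing the items as a set per gene makes it easy to test later whether a gene carries every annotation of a given combination.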


Figure 2: The input options of Annotation-Modules. Note that on every page the top box summarizes the options which have already been chosen. In this case it is the species/database, which can also be changed in this box.

The second part of this page deals with some program parameters. Right now, 5 different options exist:

Choose a gene table: Algorithms which accept different gene identifiers need to map these input IDs to an internal "working" table (for example Ensembl IDs). In general, this is invisible to the user and the table cannot be chosen. The disadvantage is that mapping between gene IDs (like NCBI gene symbols), protein IDs (like IPI or Swiss-Prot) and transcript IDs (Ensembl ENST or RefSeq) always requires ambiguous decisions, such as how to handle multiple mappings. We therefore decided to offer different "cross-link tables" to avoid unnecessary mappings in some cases. For example, if the user is only interested in GO categories (which are annotated at the protein level) and the input list consists of protein IDs, a protein table like IPI or Swiss-Prot can be chosen.
The filter method: It may happen that several genes in a list start or end (or both) at the same position in the genome. The following options exist:
  no filtering: Do not perform any filtering, accept all genes
  TSS: Group transcripts which start at the same position in the genome and remove all except one
  TSSTES: Group transcripts which start and end at the same position in the genome and remove all except one
The combination depth: The maximal number of items (annotations) which can be combined in one module.
The p-value limit: Different p-value limits can be chosen by the user. If p-value = 1 is chosen, only the 20 Annotation-Modules with the lowest p-values are presented (whether or not they are significant).
The maximal number of combinations: The maximal number of combinations which are considered at each level. This parameter is important for lowering computational time. If you use a high combination depth (maximum number of items in a set of annotations) and a large number of different annotations, this parameter should not be set to a high value (otherwise, the computational time would increase considerably).
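To see why the combination depth and the maximal number of combinations matter for the running time, the following sketch counts how many combinations exist at each level. It assumes that a module is an unordered combination of distinct annotations, so that level k contains "n choose k" candidates, and it models the limit as a simple cap per level. How Annotation-Modules actually selects the combinations it keeps is not described here; the function names are hypothetical.

```python
from math import comb


def combinations_per_level(n_annotations, depth):
    """Number of possible combinations with 1, 2, ..., depth items."""
    return [comb(n_annotations, k) for k in range(1, depth + 1)]


def capped_total(n_annotations, depth, max_per_level):
    """Total number of combinations evaluated if every level is capped."""
    return sum(
        min(count, max_per_level)
        for count in combinations_per_level(n_annotations, depth)
    )


counts = combinations_per_level(50, 4)
print(counts)                     # [50, 1225, 19600, 230300]
print(sum(counts))                # 251175 without any cap
print(capped_total(50, 4, 1000))  # 3050 with at most 1000 per level
```

With 50 annotations, going from depth 2 to depth 4 multiplies the number of candidates by roughly 200, which is why a moderate cap per level keeps the job tractable.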


Figure 3: The different parameters implemented in Annotation-Modules. The input data (the test data at the top of the page) is composed of RefSeq identifiers, so we have chosen RefSeq genes as the cross-link table. The test data gene list consists of genes which have a CpG island in their promoter regions. Therefore, we group the genes by their TSS coordinates to eliminate the redundancy caused by genes that share the same TSS.
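The effect of the three filter methods can be illustrated as follows. This is a sketch under stated assumptions, not the tool's implementation: the record fields (id, chrom, strand, tss, tes) and the function name are hypothetical, the chromosome and strand are assumed to be part of "the same position", and the first transcript of each group is the one that is kept (the documentation does not say which one survives).

```python
def filter_transcripts(transcripts, method="none"):
    """Remove redundant transcripts according to the chosen filter method.

    method: "none" (keep all), "TSS" (same start) or "TSSTES" (same
    start and end). The first transcript of each group is kept, which
    is an assumption made for this sketch.
    """
    if method == "none":
        return list(transcripts)
    if method == "TSS":
        def key(t):
            return (t["chrom"], t["strand"], t["tss"])
    elif method == "TSSTES":
        def key(t):
            return (t["chrom"], t["strand"], t["tss"], t["tes"])
    else:
        raise ValueError(f"unknown filter method: {method}")

    seen = set()
    kept = []
    for transcript in transcripts:
        group = key(transcript)
        if group not in seen:  # first member of this group
            seen.add(group)
            kept.append(transcript)
    return kept


# Made-up coordinates, for illustration only.
transcripts = [
    {"id": "a", "chrom": "chr1", "strand": "+", "tss": 100, "tes": 500},
    {"id": "b", "chrom": "chr1", "strand": "+", "tss": 100, "tes": 800},
    {"id": "c", "chrom": "chr1", "strand": "+", "tss": 100, "tes": 500},
    {"id": "d", "chrom": "chr1", "strand": "+", "tss": 200, "tes": 900},
]

for method in ("none", "TSS", "TSSTES"):
    kept_ids = [t["id"] for t in filter_transcripts(transcripts, method)]
    print(method, kept_ids)
```

In this example the TSS filter is the stricter one: transcripts a, b and c share a start position and collapse into one entry, whereas TSSTES only merges a and c.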

The Annotations

After choosing the species, input options and program parameters, the last step is the selection of the annotations for the analysis. The available annotations vary depending on the chosen species and gene table. Figure 4 shows all annotations which are available for human RefSeq genes. The annotations are roughly divided into six classes: features related to Regulation/Expression, Evolution/Conservation, Functional Annotation, Population Genetics, Miscellaneous and Sequence Properties. Within these classes, some related features are grouped and placed in lists from which only one type of annotation can be chosen. Examples are the different predictions of microRNA binding sites or the different predictions of CpG islands.
Note that, in theory, the user can choose all annotations simultaneously, but please be aware that the computational time will increase sharply (especially when selecting a large number of combinations).
The "Send Features" button launches the application.



Figure 4: The annotations available for human RefSeq genes.

Running

Once the program has been launched, the PHP interface reports the current status of the job to the user (see Figure 5).


Figure 5: While the program is running, the PHP interface periodically checks whether the application has finished. The user can leave the browser window open, or bookmark the link and come back later. The results will be posted to this window.

The Output

The program writes 4 different output files (see Figure 6).

Figure 6: The output page presenting the 4 different output files.

The first is in HTML format and provides the user with a brief overview of the results. An important feature here is the red shading of those combinations for which one of the members has a better (smaller) p-value than the annotation module (combination) itself. The reasoning is as follows. If we have a very significant single annotation and we combine it with a randomly distributed one, then the combination is very likely still significant, but with a higher p-value than that of the single annotation. In such a case, the combination is most likely not interesting in a biological context, as it does not indicate any "real" interplay between the annotations. In the last file on the output page (Just "Better Combinations"), the red-shaded combinations are removed.
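The red-shading rule described above can be written down in a few lines. The sketch below assumes that the "members" of a module are its single annotations and that the p-values have already been computed. The function names, the data layout and all numbers are made up for illustration and are not taken from the program.

```python
def is_red_shaded(combination, combination_p, single_p):
    """True if any single member has a smaller p-value than the combination."""
    return any(single_p[item] < combination_p for item in combination)


def better_combinations(modules, single_p):
    """Keep only the modules that would not be shaded red."""
    return [
        (combination, p)
        for combination, p in modules
        if not is_red_shaded(combination, p, single_p)
    ]


# Hypothetical p-values of single annotations.
single_p = {"CpG island": 1e-8, "miRNA target": 0.4, "GO term": 1e-3}

# Hypothetical modules: (combination of annotations, p-value of the module).
modules = [
    (("CpG island", "miRNA target"), 1e-5),  # weaker than "CpG island" alone
    (("CpG island", "GO term"), 1e-10),      # stronger than both members
]

for combination, p in modules:
    print(combination, p, "red" if is_red_shaded(combination, p, single_p) else "ok")

print(better_combinations(modules, single_p))
```

The first module is flagged because "CpG island" on its own is more significant than the pair, so the pair adds no evidence of an interplay; the second module survives the filter.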


Figure 7: The output in HTML format with red shading (see the documentation).

The program also writes a more detailed output to a tab-separated text file which holds all frequencies and all genes.
Finally, a brief summary of the job is given showing mainly the number of mapped and annotated genes and the chosen parameters.


Figure 8: A short summary of the job.