
ContDist Documentation


1. Introduction

In general, tools for the ontological analysis of gene lists use functional annotations such as Gene Ontology terms, KEGG pathways or Swiss-Prot keywords. Recently, other annotations such as transcription factor binding sites (TFBSs), microRNA target sites and post-translational modifications of proteins have been added (as in Annotation-Modules). All these annotations are assigned to genes as labels, for example "Cellular metabolic process (GO:0044237)" for a GO term or "let-7a" for a microRNA. To establish whether such terms are enriched or depleted in an input list of genes, statistics for dichotomous data, such as tests based on the hypergeometric or binomial distribution, are used.

1.1 The importance of promoter region analysis

How promoter regions regulate gene expression is complicated and far from fully understood. It is known that histone regulation of DNA compactness, DNA methylation, transcription factor binding sites and CpG islands play a role in the transcriptional regulation of a gene. However, the precise mechanism of regulation in different tissues, and whether other factors are important, remain open questions. It is therefore important to explore the relation between features of promoter regions and observed gene expression patterns.

1.2 Analysis of quantitative gene properties

Many biological features cannot be assigned as labels because of their quantitative character, for example the number of protein-protein interactions (PPIs), codon usage, physical DNA properties (rotational and translational parameters of helical deformations) or methylation probabilities.
ContDist assigns quantitative (continuously distributed) gene properties to user-provided input lists and implements appropriate statistical tools to analyze the user input in several different ways. It has a strong focus on the promoter regions of genes, which have so far been disregarded in this kind of analysis (DNA properties, methylation and chromatin state, G+C content, dinucleotide densities, etc.).



2. Short summary of the main features

  • ContDist implements 3 different input options:
    1. input list vs. list of reference genes (the reference list can be provided by the user)
    2. comparison of two input lists
    3. comparison of an input list with the corresponding homologous genes
  • As a consequence of the different input options, ContDist implements different statistical tests:
    1. list vs. reference: Kolmogorov-Smirnov test and a random sampling test
    2. list vs. list: Kolmogorov-Smirnov test and randomization test of means
    3. list vs. homologous genes: paired t-test
  • The current version of the annotation database holds information on 7 species: human (Homo sapiens), mouse (Mus musculus), rat (Rattus norvegicus), fruit fly (Drosophila melanogaster), chimpanzee (Pan troglodytes), zebrafish (Danio rerio) and cow (Bos taurus).
  • The annotations are grouped in 6 sections: base composition, physical properties of DNA and chromatin properties, evolution, general gene/protein properties, overlap with genomic elements and gene expression. Depending on the species, between approx. 180 (in Drosophila) and approx. 220 (in human) different annotations are available.




3. Input format

ContDist accepts a wide range of gene identifier (ID) formats, such as those from Affymetrix, UCSC or UniGene. A separate set of formats has been implemented for each species. Table 1 lists all accepted formats for each species. An input gene list must contain a single gene identifier per line, and the file must be in plain text format.

human (Homo sapiens)
  • Affymetrix IDs (1556017_at)
  • Known Genes/UCSC genes (uc001kod)
  • International Protein Index (IPI00305185)
  • Gene ID (10047090)
  • Gene Symbols (ARNTL)
  • VEGA genes (OTTHUMP00000162794)
  • UniGene (Hs.)
  • UniProtKB/trEMBL IDs (A6NLT7)
  • UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2)
  • Ensembl Gene-Protein-Transcript IDs (ENSG,ENSP,ENST)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)

mouse (Mus musculus)
  • Affymetrix IDs (1415760_s_at)
  • Known Genes/UCSC genes (AK169072)
  • International Protein Index (IPI00228959)
  • Gene ID (10048422)
  • Gene Symbols (ARNTL)
  • VEGA genes (OTTMUSP00000029240)
  • UniGene (Mm.)
  • UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2)
  • UniProtKB/trEMBL IDs (A6NLT7)
  • Ensembl Gene-Protein-Transcript IDs (ENSG,ENSP,ENST)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)

rat (Rattus norvegicus)
  • Affymetrix IDs (1367478_at)
  • Known Genes/UCSC genes (AB012231)
  • International Protein Index (IPI00204201)
  • Gene ID (10120494)
  • Gene Symbols (ARNTL)
  • UniGene (Rn.)
  • UniProtKB/Swiss-Prot IDs (Q9UHA3,Q9UJA2)
  • UniProtKB/trEMBL IDs (A6NLT7)
  • Ensembl Gene-Protein-Transcript IDs (ENSG,ENSP,ENST)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)

fruit fly (Drosophila melanogaster)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)
  • FlyBase genes (CG41370-RA)

chimpanzee (Pan troglodytes)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)
  • Ensembl Gene-Protein-Transcript IDs (ENSG,ENSP,ENST)

zebrafish (Danio rerio)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)
  • Ensembl Gene-Protein-Transcript IDs (ENSG,ENSP,ENST)

cow (Bos taurus)
  • Reference Sequence Gene and Protein IDs (NM_, NP_)

Table 1. Available input formats for each species considered by ContDist
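
As an illustration of the expected input, the following minimal Python sketch reads a plain-text gene list with one identifier per line. The function name read_gene_list is hypothetical and not part of ContDist. Skipping blank lines, trimming whitespace and dropping duplicate identifiers are assumptions of this sketch, not documented behaviour of the tool. The gene symbols other than ARNTL are illustrative only.

```python
def read_gene_list(text):
    """Return the identifiers found in `text` (one per line), in input order.

    Assumptions of this sketch: surrounding whitespace is ignored,
    blank lines are skipped and repeated identifiers are kept only once.
    """
    seen = set()
    identifiers = []
    for line in text.splitlines():
        gene_id = line.strip()
        if not gene_id or gene_id in seen:
            continue
        seen.add(gene_id)
        identifiers.append(gene_id)
    return identifiers


# Illustrative input: gene symbols, one per line.
example = "ARNTL\nCLOCK\n\nARNTL\n  PER1  \n"
print(read_gene_list(example))
```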

Currently, the user can select between two different gene tables, RefSeq genes and Ensembl genes. The vast majority of the annotations depend on the genomic context, i.e. on the DNA sequence. Therefore, we opted to include only gene tables for which genomic coordinates are available.




4. Input options & statistical analysis

ContDist has been designed to let the user select one of three different input options. The statistical analysis that is performed depends on the selected input option. The input options are:

  • i) the comparison of an input gene list to a background or set of reference genes (these can also be provided by the user). Please note that the input list is a subset of the reference set, and that the input genes are not removed from the reference list in the analysis. To compare the input list against the "rest of the genome minus the input list", the user has to use input option ii).
    • Statistical test: Kolmogorov-Smirnov & Random sampling
  • ii) the comparison of two user-provided gene lists
    • Statistical test: Kolmogorov-Smirnov & Randomization test of means
  • iii) the comparison of an input gene list to the corresponding homologous genes in another species (the comparison can be done with any of the other 6 species in the database)
    • Statistical test: Paired t-test
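
For input option iii), each input gene is paired with its homolog, so the test works on the per-gene differences. The following is a minimal sketch of the paired t statistic using only the Python standard library. The function name and the example values are hypothetical, and the p-value (obtained from the Student t distribution with n - 1 degrees of freedom) is deliberately left out. This is not ContDist's own implementation.

```python
import math
import statistics


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired observations.

    values_a[i] and values_b[i] must belong to the same gene pair,
    e.g. a gene and its homolog in another species.
    """
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical G+C contents of four genes and of their homologs.
t, df = paired_t_statistic([0.52, 0.61, 0.47, 0.58], [0.50, 0.55, 0.49, 0.51])
print(t, df)
```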

Kolmogorov-Smirnov Test:
The Kolmogorov-Smirnov test is a non-parametric test which determines whether two datasets differ significantly. For a description of this test, see for example here.
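
To make the idea concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic D, the largest absolute difference between the two empirical cumulative distribution functions. The function name is hypothetical, and the conversion of D into a p-value is omitted. It is meant as an illustration only, not as the code used by ContDist.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Return the two-sample Kolmogorov-Smirnov statistic D."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


print(ks_statistic([1, 3], [2, 4]))
```

D is 0 when the two empirical distributions coincide and 1 when the two samples do not overlap at all.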
Random sampling:
Briefly, we generate a sampling distribution of the mean by randomly extracting N genes (N being the size of the input list) from the reference genes, repeated X times. For sufficiently large N, this distribution will be Gaussian, which allows us to apply the standard z-score to calculate the p-values.
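
The random sampling procedure can be sketched as follows. The function name, the default number of samples, the fixed seed, sampling without replacement and the two-sided p-value are all assumptions of this sketch; the text above does not specify them.

```python
import math
import random
import statistics


def random_sampling_test(input_values, reference_values, n_samples=1000, seed=0):
    """Return (z, two_sided_p) for the mean of the input list.

    The mean of the input list is compared with the distribution of means
    of random gene sets of the same size drawn from the reference.
    """
    rng = random.Random(seed)
    n = len(input_values)
    observed_mean = sum(input_values) / n
    sample_means = []
    for _ in range(n_samples):
        # Assumption: genes are drawn without replacement.
        sample = rng.sample(reference_values, n)
        sample_means.append(sum(sample) / n)
    mu = statistics.mean(sample_means)
    sigma = statistics.stdev(sample_means)
    if sigma == 0:
        raise ValueError("the sampled means do not vary; z is undefined")
    z = (observed_mean - mu) / sigma
    # Two-sided p-value under the Gaussian approximation: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical data: 1000 reference values, input list = the 30 largest ones.
reference = [float(v) for v in range(1000)]
z, p = random_sampling_test(reference[-30:], reference)
print(z, p)
```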
Randomization test of the means (comparison of two lists):
We randomly reassign the values to the two lists X times (normally 10000), calculating in each run the difference between the means of the two lists. Then we determine the number of random reassignments which lead to a larger distance (absolute value of the difference) between the means than the observed distance. Finally, by dividing this number by the number of random runs, we get the two-tailed p-value (the probability that such an extreme distance between the means can be reached by chance alone). For a more detailed description please see here.
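
The randomization test of the means described above can be sketched in a few lines. The function name and the fixed seed are hypothetical; the strict "larger than" comparison and the default of 10000 runs follow the description in the text.

```python
import random


def randomization_test_of_means(values_a, values_b, n_runs=10000, seed=0):
    """Return the two-tailed p-value for the difference between two means."""
    rng = random.Random(seed)
    n_a = len(values_a)
    pooled = list(values_a) + list(values_b)

    def distance(first, second):
        # Absolute difference between the means of two lists.
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = distance(values_a, values_b)
    larger = 0
    for _ in range(n_runs):
        # Randomly reassign the pooled values to two lists of the original sizes.
        rng.shuffle(pooled)
        if distance(pooled[:n_a], pooled[n_a:]) > observed:
            larger += 1
    return larger / n_runs


# Two clearly separated hypothetical lists: no reassignment can exceed
# the observed distance, so the estimated p-value is 0.0.
print(randomization_test_of_means([10, 11, 12, 13, 14, 15], [1, 2, 3, 4, 5, 6]))
```

With a finite number of runs the estimate can be exactly 0, which should be read as "smaller than 1 / number of runs" rather than as a true zero probability.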




5. Available features

Depending on the species, between 180 and 220 annotations are available. These annotations are grouped into six categories: i) physical properties of DNA and chromatin, ii) base composition, iii) evolution, iv) general gene/protein properties, v) overlap with genomic elements and vi) gene expression.

5.1 Genomic regions under consideration

These features pertain to different genomic regions such as the promoter region, exons, introns, the 3'UTR and the 5'UTR. However, many of the available annotations focus on the analysis of the promoter region. Unfortunately, the promoter region is not a very well defined entity, and it might depend on the context of the analysis which definition is more appropriate. We therefore define several different promoter regions to give the user the chance to select the one which is most appropriate for the analysis. The regions are shown in Table 2. TSS is the Transcription Start Site and TES is the Transcription End Site (the coordinate of the last base in the transcript).

Region   Region borders [From;To]
R1       [TSS;TSS]
R2       [TSS-100 bp;TSS+100 bp]
R3       [TSS-200 bp;TSS+200 bp]
R4       [TSS-500 bp;TSS]
R5       [TSS-500 bp;TSS-200 bp]
R6       [TSS-1500 bp;TSS]
R7       [TES;TES]
R8       [TES;TES+500 bp]

Table 2. Definition of the different promoter regions considered by ContDist.
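
The region definitions in Table 2 translate directly into coordinate offsets. The sketch below computes the borders of a region from the TSS and TES of a gene. The names REGIONS and region_borders are hypothetical, and the code assumes a gene on the forward strand; for a gene on the reverse strand the offsets would have to be mirrored, which is not shown here.

```python
# Offsets in bp relative to an anchor (TSS or TES), as listed in Table 2.
REGIONS = {
    "R1": (("TSS", 0), ("TSS", 0)),
    "R2": (("TSS", -100), ("TSS", 100)),
    "R3": (("TSS", -200), ("TSS", 200)),
    "R4": (("TSS", -500), ("TSS", 0)),
    "R5": (("TSS", -500), ("TSS", -200)),
    "R6": (("TSS", -1500), ("TSS", 0)),
    "R7": (("TES", 0), ("TES", 0)),
    "R8": (("TES", 0), ("TES", 500)),
}


def region_borders(region, tss, tes):
    """Return the (from, to) coordinates of `region` for a forward-strand gene."""
    anchors = {"TSS": tss, "TES": tes}
    (from_anchor, from_offset), (to_anchor, to_offset) = REGIONS[region]
    return anchors[from_anchor] + from_offset, anchors[to_anchor] + to_offset


# Hypothetical gene with TSS at position 5000 and TES at position 9000.
for name in REGIONS:
    print(name, region_borders(name, tss=5000, tes=9000))
```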

5.2 Annotation features

The availability of annotations depends on both the selected species and the selected gene table. Some annotations are not available for all species; others may exist for all species but just for one specific gene table. The following table gives an overview of the features contained in our database.

Base composition
  • G+C content: The G+C content is defined as the number of G plus the number of C divided by the length of the region ((#G + #C)/length). We calculated the G+C content in various promoter regions and in intrinsic gene regions such as exons, introns and UTRs.
  • Density of dinucleotides: The dinucleotide density is simply the number of dinucleotides of a given type divided by the length of the region (the density may vary between 0 and 0.5).

Evolution
  • Substitution rates and patterns in pairwise alignments: To access this information on pairwise alignments, which may give hints on the evolutionary history of the protein/gene, we used the homologene.xml file (can be found here). Many different statistics are available in pre-calculated form, such as: the ratio of nucleotide differences between the pair; the ratio of nucleotide differences between the pair corrected for back substitutions with the Jukes and Cantor formula; the ratio of amino acid differences between the pair; Ka for the pair (ratio of non-synonymous differences per non-synonymous site); Ks for the pair (ratio of synonymous differences per synonymous site); Knr for the pair (ratio of radical non-synonymous differences per radical non-synonymous site); and Knc for the pair (ratio of conservative non-synonymous differences per conservative non-synonymous site). Furthermore, we calculated the Ka/Ks ratio, which might shed light on the selective pressures acting on the gene. We extracted these parameters for all species pairs which are currently in the database (6 pairwise comparisons for each species).
  • SNP density: We calculated the SNP densities in several promoter regions and in the coding region. We used dbSNP version 126 for human and mouse and version 125 for rat. We downloaded the data from the UCSC table browser.

Physical DNA and chromatin properties
  • Helical deformation: We calculated the mean values for 6 helical deformations (Twist, Tilt, Roll, Shift, Slide and Rise) in different promoter regions using the stiffness constants given in J. Ramon Goñi et al., Determining promoter location based on DNA structure first-principles calculations, Genome Biology.
  • Methylation probability: Das et al. estimated a methylation probability for all CpGs on human autosomes. They used a support vector machine to learn a predictor from experimental methylation data, and they estimate the accuracy of their approach at 86%. We calculated the probability that a given region remains unmethylated as the mean probability over all CpGs within this region.
  • CpG island strength: Bock et al. calculate CpG islands and also assign them "epigenetic scores". These include the "epigenetic state" and the "chromatin state". We assigned the values of these scores to all genes which have a Bock CpG island in their promoter region (regions R1 and R3).

Gene expression
  • Gene Atlas: In 2004, Su et al. published a Gene Atlas with expression data for 79 human tissues and 61 mouse tissues. We first filtered out all probes with expression values lower than 200 units. Next we averaged over the different probes of the same gene. Finally, we calculated a peak expression (the maximal expression value over all tissues), an expression breadth (% of tissues in which the gene is expressed) and a mean expression (the mean expression of the gene over all tissues).

Overlap with genomic elements
  • Repetitive elements: We calculated the coverage of distinct repetitive elements (downloaded from the UCSC table browser) in different gene and promoter regions. The coverage is simply the fraction of base pairs in a given region which belong to a repetitive element (# repetitive bp/region length). We calculated the overlap first for all annotated repeats (RepeatMasker) and then separately for Alus, LINE1, LTR and DNA transposons.
  • PhastCons: We downloaded all phylogenetically conserved elements from the UCSC table browser and calculated, as above, the coverage of several gene and promoter regions.
  • CpG islands: We used CpGcluster predictions for CpG islands. We first determined all genes which have a CpG island in their promoter region (R1 and R3). Next we assigned to those genes the following values of the overlapping islands: observed/expected ratio of CpGs, CpG density of the island, length of the island and G+C content of the island.

General gene/protein properties
  • PPI: We downloaded the protein-protein interaction file from here. Next we calculated the number of interactions for each gene/protein (we did not distinguish between the different interaction types) and assigned these values.
  • Codon bias: We calculated the effective number of codons, Nc (Wright, 1990). This quantity might reveal constraints on the evolution of codon usage. Synonymous codon usage bias may be caused by various forms of natural selection, to optimize the efficiency and accuracy of translation or to maintain structural features of the mRNA or DNA. Nc can vary between 20 (very biased codon usage) and 61 (random usage).

Table 3. Categories of annotated features. Particular features within each category have been described.
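
Three of the sequence-based features in Table 3 follow directly from their definitions and can be sketched as below. All function names are hypothetical. Counting dinucleotides without overlap is an assumption chosen so that the density stays within the stated range of 0 to 0.5, and the use of inclusive coordinates and the merging of overlapping elements in the coverage calculation are likewise assumptions of this sketch.

```python
def gc_content(sequence):
    """(#G + #C) / length of the region."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def dinucleotide_density(sequence, dinucleotide):
    """Number of occurrences of `dinucleotide` divided by the region length.

    str.count counts non-overlapping occurrences, so the result
    cannot exceed 0.5.
    """
    return sequence.upper().count(dinucleotide.upper()) / len(sequence)


def coverage(region_start, region_end, elements):
    """Fraction of bases in [region_start, region_end] covered by elements.

    `elements` is a list of (start, end) pairs with inclusive coordinates;
    bases covered by several overlapping elements are counted once.
    """
    clipped = sorted(
        (max(start, region_start), min(end, region_end))
        for start, end in elements
        if end >= region_start and start <= region_end
    )
    covered = 0
    last_covered = region_start - 1
    for start, end in clipped:
        if end <= last_covered:
            continue  # fully contained in what was already counted
        covered += end - max(start, last_covered + 1) + 1
        last_covered = end
    return covered / (region_end - region_start + 1)


print(gc_content("GGCCAATT"))
print(dinucleotide_density("CGCGAT", "CG"))
print(coverage(1, 100, [(10, 20), (15, 30), (90, 120)]))
```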




6. Other relevant information

Missing annotations: Difference between original and effective sizes

Sometimes there may be a notable difference between the original and the effective input size. The original input size corresponds to the number of genes in the gene list(s) supplied by the user. The effective input size is the number of genes that i) could be mapped to the gene table (some of the input identifiers may be unknown or discarded), and ii) have a value for the annotation in question. Note that for some annotations, such as PPI (number of protein-protein interactions), information exists only for a reduced number of genes/proteins, and therefore many of the input identifiers cannot be assigned a value.
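
The relation between the two sizes can be illustrated with a short sketch. The function name and the data are hypothetical, and the sketch assumes that the input identifiers are already expressed in the identifier namespace of the gene table, which skips the actual identifier mapping step.

```python
def input_sizes(input_ids, gene_table_ids, annotation_values):
    """Return (original_size, mapped_size, effective_size).

    gene_table_ids: set of identifiers present in the selected gene table.
    annotation_values: dict mapping identifiers to the value of one annotation.
    """
    mapped = [gene for gene in input_ids if gene in gene_table_ids]
    annotated = [gene for gene in mapped if gene in annotation_values]
    return len(input_ids), len(mapped), len(annotated)


# Hypothetical example: 4 input genes, 3 found in the gene table,
# 2 of which have a PPI count.
ppi_counts = {"geneA": 3, "geneC": 12}
print(input_sizes(["geneA", "geneB", "geneC", "geneD"],
                  {"geneA", "geneB", "geneC"}, ppi_counts))
```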

Update Frequency

The underlying database can be updated automatically. We plan to perform an update approximately every two months. Each update of the database or the tool will be announced on the news page.

Data storage

Please be aware that user data is stored for one week only. After this period all the data is removed.



For questions or feedback please contact: Rune Matthiesen or Michael Hackenberg