|
ContDist Documentation
1. Introduction
In general, tools for the ontological analysis of gene lists
use functional annotations such as Gene Ontology terms, KEGG pathways or
Swiss-Prot keywords. Recently, other annotations such as transcription
factor binding sites (TFBSs), microRNA target sites and
posttranslational modifications of proteins have been
added (as in Annotation-Modules). All these annotations are
assigned to the genes as labels, for example "Cellular metabolic
process (GO:0044237)" (a GO term) or "let-7a" (a microRNA). To
establish whether those terms are enriched or depleted in an input list
of genes, statistics for dichotomous variables, such as tests based on
the hypergeometric or binomial distribution, are used.
1.1 The importance of promoter region analysis
How promoter regions regulate gene expression is complex and far from fully understood. It is known that histone regulation of DNA compactness, DNA methylation, transcription factor binding sites and CpG islands play a role in the transcriptional regulation of a gene. However, the precise mechanism of regulation in different tissues, and whether other factors are important, remain open questions. It is therefore important to explore the relation between features of promoter regions and observed gene expression patterns.
1.2 Analysis of quantitative gene properties
Many biological features cannot be assigned as labels
because of their quantitative character, for example the number of PPIs
(Protein-Protein Interactions), codon usage, physical DNA properties
(rotational and translational parameters of helical deformations),
methylation probabilities, etc.
ContDist assigns quantitative (continuously distributed) gene
properties to user-provided input lists and implements the appropriate
statistical tools to analyze the user input in several different ways.
It has a strong focus on the promoter regions of genes, which so far
have been disregarded in this kind of analysis (DNA properties,
methylation and chromatin state, G+C content, dinucleotide densities,
etc.).
2. Short summary of the main features
ContDist implements 3 different input options:
- input list vs. list of reference genes (the reference list can be provided by the user)
- comparison of two input lists
- comparison of an input list with the corresponding homologous genes
As a consequence of the different input options, ContDist implements different statistical tests:
- list vs. reference: Kolmogorov-Smirnov test and a randomization test (random sampling)
- list vs. list: Kolmogorov-Smirnov test and randomization test of means
- list vs. homologous genes: paired t-test
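As an illustration of the third option, the following is a minimal sketch of the paired t statistic, assuming each input gene has exactly one value and one homologous counterpart. The function name `paired_t_statistic` and the example values are hypothetical and are not taken from ContDist.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(values_a, values_b):
    """Return (t, degrees_of_freedom) for paired samples.

    values_a[i] and values_b[i] must belong to the same homologous gene pair.
    """
    if len(values_a) != len(values_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Illustrative G+C contents of a promoter region in five human genes and
# in their mouse homologues (made-up numbers, not ContDist data).
human_gc = [0.62, 0.55, 0.48, 0.70, 0.51]
mouse_gc = [0.58, 0.54, 0.45, 0.66, 0.50]
t, dof = paired_t_statistic(human_gc, mouse_gc)
```

The p-value would then be read from Student's t distribution with `dof` degrees of freedom; that step is left out here to keep the sketch free of external dependencies.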
The current version of the annotation database holds information on 7
species: human (Homo sapiens), mouse (Mus musculus), rat (Rattus
norvegicus), fruit fly (Drosophila melanogaster), chimpanzee (Pan
troglodytes), zebrafish (Danio rerio) and cow (Bos taurus).
The annotations are grouped in 6
sections: base composition, physical properties of DNA and chromatin,
evolution, general gene/protein properties, overlap with
genomic elements and gene expression. Depending on the species, between
approx. 180 (in Drosophila) and approx. 220 (in human) different
annotations are available.
3. Input format
ContDist accepts a wide range of gene identifier (ID) formats, such as those from Affymetrix, UCSC or UniGene. A separate set of formats has been implemented for each species. Table 1 lists all accepted formats for each species. The input gene list must contain a single gene identifier per line, and the file must be in plain text format.
| Species | Accepted identifier formats (example) |
| human (Homo sapiens) | Affymetrix IDs (1556017_at); Known Genes/UCSC genes (uc001kod); International Protein Index (IPI00305185); Gene ID (10047090); Gene Symbols (ARNTL); VEGA genes (OTTHUMP00000162794); UniGene (Hs.); UniProtKB/TrEMBL IDs (A6NLT7); UniProtKB/Swiss-Prot IDs (Q9UHA3, Q9UJA2); Ensembl Gene-Protein-Transcript IDs (ENSG, ENSP, ENST); Reference Sequence Gene and Protein IDs (NM_, NP_) |
| mouse (Mus musculus) | Affymetrix IDs (1415760_s_at); Known Genes/UCSC genes (AK169072); International Protein Index (IPI00228959); Gene ID (10048422); Gene Symbols (ARNTL); VEGA genes (OTTMUSP00000029240); UniGene (Mm.); UniProtKB/Swiss-Prot IDs (Q9UHA3, Q9UJA2); UniProtKB/TrEMBL IDs (A6NLT7); Ensembl Gene-Protein-Transcript IDs (ENSG, ENSP, ENST); Reference Sequence Gene and Protein IDs (NM_, NP_) |
| rat (Rattus norvegicus) | Affymetrix IDs (1367478_at); Known Genes/UCSC genes (AB012231); International Protein Index (IPI00204201); Gene ID (10120494); Gene Symbols (ARNTL); UniGene (Rn.); UniProtKB/Swiss-Prot IDs (Q9UHA3, Q9UJA2); UniProtKB/TrEMBL IDs (A6NLT7); Ensembl Gene-Protein-Transcript IDs (ENSG, ENSP, ENST); Reference Sequence Gene and Protein IDs (NM_, NP_) |
| fruit fly (Drosophila melanogaster) | Reference Sequence Gene and Protein IDs (NM_, NP_); FlyBase genes (CG41370-RA) |
| chimpanzee (Pan troglodytes) | Reference Sequence Gene and Protein IDs (NM_, NP_); Ensembl Gene-Protein-Transcript IDs (ENSG, ENSP, ENST) |
| zebrafish (Danio rerio) | Reference Sequence Gene and Protein IDs (NM_, NP_); Ensembl Gene-Protein-Transcript IDs (ENSG, ENSP, ENST) |
| cow (Bos taurus) | Reference Sequence Gene and Protein IDs (NM_, NP_) |
Table 1. Available input formats for each species considered by ContDist.
Currently, the user can select between two different gene tables, RefSeq genes and Ensembl genes. The vast majority of the annotations depend on the genomic context, i.e. on the DNA sequence. Therefore, we opted to include only gene tables for which genomic coordinates are available.
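The input format described above (plain text, one identifier per line) can be read with a few lines of code. The sketch below is only an illustration: the helper `parse_gene_list` is hypothetical, and dropping duplicates is an assumption, since this documentation does not say how ContDist treats repeated identifiers.

```python
def parse_gene_list(text):
    """Return the gene identifiers found in `text`, one per line.

    Blank lines are skipped and surrounding whitespace is stripped.
    Duplicate identifiers are dropped, keeping the first occurrence
    (an assumption, not documented ContDist behaviour).
    """
    identifiers = []
    seen = set()
    for line in text.splitlines():
        gene_id = line.strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            identifiers.append(gene_id)
    return identifiers


# Example file content using gene symbols; ARNTL appears in Table 1,
# the other symbols are illustrative.
example = "ARNTL\nCLOCK\n\nPER1\nARNTL\n"
genes = parse_gene_list(example)
```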
4. Input options & statistical analysis
ContDist has been designed to let the user select one of three different
input options. The statistical analysis performed depends on the input
option selected. The input options considered by ContDist are:
- i) the comparison of an input gene list to a background or set of reference genes
(these can also be provided by the user). Please note that the input list is a subset of the reference set.
The input genes are not removed from the reference list in the analysis. To compare the input
list against the "rest of the genome minus the input list", use input option ii).
- Statistical test: Kolmogorov-Smirnov & Random sampling
- ii) the comparison of two user-provided gene lists
- Statistical test: Kolmogorov-Smirnov & Randomization test of means
- iii) the comparison of an input gene list to the corresponding homologous genes in another species (the comparison can be done with any of the other 6 species in the database)
- Statistical test: paired t-test
- Kolmogorov-Smirnov Test:
- The Kolmogorov-Smirnov test is a non-parametric test which determines whether
two datasets differ significantly. For a description of this test, see
for example here.
- Random sampling:
- Briefly, we generate a sampling distribution of the mean by randomly
extracting, X times, N genes (the size of the input list) out of the
reference genes. For sufficiently large N, this distribution will be
approximately Gaussian, which allows us to apply the standard z-score
to calculate the p-values.
- Randomization test of the means (comparison of two lists):
- We randomly reassign the values to the two lists X times (normally 10000),
calculating in each run the difference of the means between the two
lists. Then we determine the number of random reassignments which lead
to a larger distance (absolute difference) of means than the observed
distance between the means. Finally, by dividing this number by the
number of random runs, we get the two-tailed p-value (the probability
that such an extreme distance between the means can be reached by
chance alone). For a more detailed description please see here.
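The random sampling and randomization procedures described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the ContDist implementation: the function names are hypothetical, sampling is done without replacement, and the Gaussian p-value is taken as two-tailed.

```python
import math
import random
from statistics import mean, pstdev


def sampling_z_test(input_values, reference_values, runs=10000, seed=None):
    """Random sampling test: input list vs. reference genes.

    Draws `runs` random samples of size N = len(input_values) from the
    reference values (without replacement, an assumption), builds the
    sampling distribution of the mean and returns (z, two_tailed_p).
    """
    rng = random.Random(seed)
    reference = list(reference_values)
    n = len(input_values)
    sample_means = [mean(rng.sample(reference, n)) for _ in range(runs)]
    z = (mean(input_values) - mean(sample_means)) / pstdev(sample_means)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed Gaussian p-value
    return z, p


def randomization_test_of_means(values_a, values_b, runs=10000, seed=None):
    """Randomization test: two-tailed p-value for the difference of means."""
    rng = random.Random(seed)
    observed = abs(mean(values_a) - mean(values_b))
    pooled = list(values_a) + list(values_b)
    n_a = len(values_a)
    larger = 0
    for _ in range(runs):
        rng.shuffle(pooled)  # random reassignment of values to the two lists
        distance = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if distance > observed:  # strictly larger, as in the description
            larger += 1
    return larger / runs


# Illustrative values only (e.g. G+C contents), not ContDist data.
reference = [0.35 + 0.005 * i for i in range(60)]
z, p_sampling = sampling_z_test(reference[40:50], reference, runs=2000, seed=1)
p_means = randomization_test_of_means(reference[:30], reference[30:],
                                      runs=2000, seed=1)
```

The `seed` parameter exists only to make runs reproducible. With a small number of runs, the randomization p-value can be exactly 0, which should be read as "smaller than 1/runs".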
5. Available features
Depending on the species, between 180 and 220 annotations are available. These annotations are grouped into six categories: i) physical properties of DNA and chromatin, ii) base composition, iii) evolution, iv) general gene/protein properties, v) overlap with genomic elements and vi) gene expression.
5.1 Genomic regions under consideration
These features pertain to different genomic regions such as the promoter region, exons, introns, the 3'UTR and the 5'UTR. However, many of the available annotations focus on the promoter region. Unfortunately, the promoter region is not a very well defined entity, and
which definition is more appropriate might depend on the context of the
analysis. We therefore define several different promoter regions to
give the user the chance to select the one which is most appropriate
for the analysis. Table 2 shows the regions. TSS is the
Transcription Start Site and TES is the Transcription End Site (the
coordinate of the last base in the transcript).
| Region | Region Borders [From;To] |
| R1 | [TSS;TSS] |
| R2 | [TSS-100 bp;TSS+100 bp] |
| R3 | [TSS-200 bp;TSS+200 bp] |
| R4 | [TSS-500 bp;TSS] |
| R5 | [TSS-500 bp;TSS-200 bp] |
| R6 | [TSS-1500 bp;TSS] |
| R7 | [TES;TES] |
| R8 | [TES;TES+500 bp] |
Table 2. Definition of the different promoter regions considered by ContDist.
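The region borders in Table 2 can be turned into genomic coordinates as sketched below. The names `REGIONS` and `region_borders` are hypothetical. The handling of the minus strand (mirroring the offsets so that negative offsets always point upstream) is an assumption, since this documentation does not describe how strand is treated.

```python
# (anchor, start offset, end offset) in bp, following Table 2.
REGIONS = {
    "R1": ("TSS", 0, 0),
    "R2": ("TSS", -100, 100),
    "R3": ("TSS", -200, 200),
    "R4": ("TSS", -500, 0),
    "R5": ("TSS", -500, -200),
    "R6": ("TSS", -1500, 0),
    "R7": ("TES", 0, 0),
    "R8": ("TES", 0, 500),
}


def region_borders(region, tss, tes, strand="+"):
    """Return (start, end) genomic coordinates with start <= end.

    Negative offsets point upstream. On the minus strand upstream means
    larger coordinates, so the offsets are mirrored (an assumption).
    Coordinates are not clipped at the chromosome ends.
    """
    anchor_name, start_offset, end_offset = REGIONS[region]
    anchor = tss if anchor_name == "TSS" else tes
    if strand == "+":
        return anchor + start_offset, anchor + end_offset
    if strand == "-":
        return anchor - end_offset, anchor - start_offset
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene on the plus strand with TSS at 10000 and TES at 25000.
r4 = region_borders("R4", tss=10000, tes=25000)
r8 = region_borders("R8", tss=10000, tes=25000)
```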
5.2 Annotation features
The availability of annotations depends on both the selected species
and the selected gene table. Some annotations are not available for all
species; others may exist for all species but just for one specific
gene table. The following table gives an overview of the features
contained in our database.
| Base composition |
G+C content:
The G+C content is defined as the number of G plus the number of
C divided by the length of the region ((#G + #C)/length). We calculated
the G+C content in various promoter regions and intrinsic gene regions
such as exons, introns and UTRs.
Density of dinucleotides:
The dinucleotide density is simply the number of dinucleotides of a
given type divided by the length of the region (the density may vary
between 0 and 0.5). |
| Evolution |
Substitution rates and patterns in pairwise alignments:
To access this information on pairwise alignments, which may give hints
on the evolutionary history of the protein/gene, we used the
homologene.xml file (can be found here).
Many statistics are available in pre-calculated form: the
ratio of nucleotide differences between the pair, the ratio of
nucleotide differences between the pair corrected for back substitutions
with the Jukes and Cantor formula, the ratio of amino acid differences
between the pair, Ka for the pair (ratio of non-synonymous differences
per non-synonymous site), Ks for the pair (ratio of synonymous
differences per synonymous site), Knr for the pair (ratio of radical
non-synonymous differences per radical non-synonymous site) and Knc for
the pair (ratio of conservative non-synonymous differences per
conservative non-synonymous site). Furthermore, we calculated the Ka/Ks
ratio, which might shed light on the selective pressures acting on the
gene. We extracted the mentioned parameters for all species pairs which
are currently in the database (6 pairwise comparisons for each
species).
SNP density:
We calculated the SNP densities in several promoter regions and in the coding region. We used dbSNP version 126 for human and mouse and version 125 for rat. We downloaded the data from the UCSC table browser.
|
| Physical DNA and chromatin properties |
Helical deformation:
We calculated the mean values for 6 helical deformations (Twist, Tilt, Roll, Shift, Slide and Rise) in different promoter regions using the stiffness constants given in J. Ramon Goñi et al., Determining promoter location based on DNA structure first-principles calculations, Genome Biology.
Methylation probability:
Das et al.
estimated a methylation probability for all CpGs on human autosomes.
They used a support vector machine to learn a predictor from
experimental methylation data. They estimate the accuracy of their
approach at 86%. We calculated the probability that a given region
remains unmethylated as the mean probability of all CpGs
within this region.
CpG island strength:
Bock et al.
calculate CpG islands, also assigning them "epigenetic scores". These
include "epigenetic state" and "chromatin state". We assigned the
values of these scores to all genes which have a Bock CpG island in their
promoter region (regions R1 and R3). |
| Gene Expression |
Gene Atlas:
Su et al. published
in 2004 a Gene Atlas with expression data for 79 human tissues and 61
mouse tissues. We first filtered out all probes with expression
values lower than 200 units. Next we averaged over the different probes of the
same gene. Finally, we calculated a peak expression (maximal expression
value over all tissues), an expression breadth (% of tissues where the
gene is expressed) and a mean expression (the mean expression of the
gene over all tissues). |
| Overlap with genomic elements |
Repetitive Elements:
We calculated the coverage of distinct repetitive elements (downloaded from the UCSC table browser)
in different gene and promoter regions. The coverage is simply the
fraction of base pairs in a given region which belong to a repetitive
element (# repetitive bp/region length). We calculated the overlap
first for all annotated repeats (RepeatMasker) and later separately for
Alus, LINE1, LTR and DNA transposons.
PhastCons:
We downloaded all phylogenetically conserved elements from the UCSC table browser and calculated, as above, the coverage of several gene and promoter regions.
CpG islands:
We used CpGcluster predictions for CpG islands.
We first determined all genes which have a CpG island in their promoter
region (R1 and R3). Next we assigned to those genes the following values
of the overlapping islands: Observed/Expected ratios of CpGs, CpG
density of the island, length of the island and G+C content of the
island. |
| General gene/protein properties |
PPI:
We downloaded the Protein-Protein interaction file from here.
Next we calculated the number of interactions (we did not distinguish
between the different interaction types) for each gene/protein and
assigned these values.
Codon bias:
We calculated the effective number of codons, Nc (Wright, 1990). This
quantity might reveal constraints on the evolution of codon usage.
Synonymous codon usage bias may be caused by various forms of natural
selection, to optimize the efficiency and accuracy of translation or to
maintain structural features of the mRNA or DNA. Nc can vary between 20
(very biased codon usage) and 61 (random usage). |
Table 3. Categories of annotated features. Particular features within each category have been described.
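Several of the features in Table 3 are simple ratios over a region. The sketch below illustrates the G+C content, the dinucleotide density and the coverage by genomic elements. The function names are hypothetical. Counting dinucleotides without overlap is an assumption chosen so that the density stays within the stated range of 0 to 0.5, and the half-open coordinate convention is also an assumption.

```python
def gc_content(sequence):
    """(#G + #C) / length of the region."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def dinucleotide_density(sequence, dinucleotide):
    """Number of dinucleotides of a given type divided by region length.

    str.count counts non-overlapping occurrences, so the result cannot
    exceed 0.5 (assumed counting rule, see lead-in).
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return seq.count(dinucleotide.upper()) / len(seq)


def coverage(region_start, region_end, elements):
    """Fraction of bases in [region_start, region_end) covered by elements.

    `elements` is an iterable of (start, end) half-open intervals;
    bases covered by several elements are counted once.
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    covered = 0
    current_end = region_start
    for start, end in sorted(elements):
        start = max(start, current_end)  # clip to region, skip counted bases
        end = min(end, region_end)
        if end > start:
            covered += end - start
            current_end = end
    return covered / length


# Illustrative inputs only.
gc = gc_content("GGCCAATT")
cpg = dinucleotide_density("CGCGCG", "CG")
repeat_cov = coverage(0, 100, [(10, 30), (20, 40), (90, 150)])
```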
6. Other relevant information
Missing annotations: Difference between original and effective sizes
Sometimes a notable difference between the original and the effective
input size may exist. The original input size corresponds to the number
of genes in the gene list(s) supplied by the user. The effective input
size is the number of genes that: i) could be mapped to the gene table
(e.g. some of the input identifiers may be unknown or discarded), and
ii) for which this annotation exists. Note that for some annotations,
such as PPI (number of protein-protein interactions), information
exists for only a reduced number of genes/proteins, and therefore many
of the input identifiers cannot be assigned a value.
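The difference between the two sizes can be illustrated as follows. This is a simplified sketch: the function `input_sizes` and its data structures are hypothetical, and the real identifier mapping in ContDist is likely more involved than a set lookup.

```python
def input_sizes(input_ids, gene_table_ids, annotation):
    """Return (original_size, mapped_size, effective_size).

    input_ids: identifiers supplied by the user
    gene_table_ids: set of identifiers known to the selected gene table
    annotation: dict mapping identifier -> value for one annotation
    """
    mapped = [g for g in input_ids if g in gene_table_ids]
    annotated = [g for g in mapped if g in annotation]
    return len(input_ids), len(mapped), len(annotated)


# Illustrative data: one identifier is unknown to the gene table, and one
# mapped gene has no PPI value.
user_list = ["ARNTL", "CLOCK", "PER1", "UNKNOWN_ID"]
gene_table = {"ARNTL", "CLOCK", "PER1"}
ppi_counts = {"ARNTL": 12, "CLOCK": 7}
sizes = input_sizes(user_list, gene_table, ppi_counts)
```

Here the original input size is 4, while the effective input size for the PPI annotation is 2.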
Update Frequency
The underlying database can be updated automatically. We plan to perform an update approximately every two months.
Each update of the database or the tool will be announced on the news page.
Data storage
Please be aware that user data is stored for one week only. After this period all the data is removed.
|