
Introduction to tutorial

This tutorial has been developed to give the user a basic understanding of our tool. The necessary test data and a step-by-step tutorial can be found below. However, the user is encouraged to read the ContDist manual for more detailed information about our software (e.g. the different input formats ContDist accepts and a full description of the available features).

Test data for ContDist tutorial

To show the usefulness of this tool, we used gene lists with hypomethylated and differentially methylated promoters. These gene lists were generated from the HEP data (http://www.epigenome.org/). In our case study, we define a promoter as unmethylated if its mean methylation is smaller than 20 in all tissues, and as differentially methylated if it is methylated in at least one tissue and unmethylated in another.
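The two promoter classes defined above can be sketched in a few lines of Python. This is only an illustration and not the code used to generate the test data. The function name, the per-tissue dictionary layout and in particular the `METHYLATED_MIN` cutoff are assumptions, since the text only states the threshold of 20 for the unmethylated state.

```python
from typing import Mapping

UNMETHYLATED_MAX = 20.0  # from the text: mean methylation smaller than 20
METHYLATED_MIN = 80.0    # ASSUMPTION: the text gives no cutoff for "methylated"


def classify_promoter(mean_methylation: Mapping[str, float]) -> str:
    """Classify one promoter from its mean methylation value per tissue."""
    values = list(mean_methylation.values())
    if not values:
        return "unclassified"
    # Unmethylated: below the threshold in all tissues.
    if all(v < UNMETHYLATED_MAX for v in values):
        return "unmethylated"
    # Differentially methylated: methylated in at least one tissue
    # and unmethylated in at least one other tissue.
    has_unmethylated = any(v < UNMETHYLATED_MAX for v in values)
    has_methylated = any(v >= METHYLATED_MIN for v in values)
    if has_unmethylated and has_methylated:
        return "differentially methylated"
    return "other"
```

Promoters that fall into neither class are returned as "other" and would not enter either of the two gene lists.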

Step by step tutorial for ContDist:

The PHP interface leads the user through the three steps that make up the input procedure. The three input pages are generated dynamically and depend partially on previous selections.

Step 1: Species selection

The first step consists of selecting the species the user wants to analyze (see figure 1). Currently 7 different species are implemented:
  • Human (Homo sapiens, UCSC assembly version: hg18)
  • Mouse (Mus musculus, UCSC assembly version: mm8)
  • Rat (Rattus norvegicus, UCSC assembly version: rn4)
  • Fruit fly (Drosophila melanogaster, UCSC assembly version: dm3)
  • Chimpanzee (Pan troglodytes, UCSC assembly version: panTro2)
  • Zebrafish (Danio rerio, UCSC assembly version: danRer5)
  • Cow (Bos taurus, UCSC assembly version: bosTau4)


Figure 1: The first page of ContDist (shows all implemented species).

Step 2: Input and gene table selection

This step requires the user to select an input option and a gene table to be matched (figure 2).
Input options
Currently three different input options have been implemented:
  • i) compare an input gene list to a list of reference genes (either using the entire genome of the corresponding species as reference or using a reference list uploaded by the user)
  • ii) compare two input lists of genes
  • iii) compare an input gene list to the corresponding homologous genes
In the first input option, the user has two possibilities: uploading a user-defined reference set or using the whole genome as the statistical background. If the latter is chosen, all genes in the corresponding genome that have the annotation are used (depending on the gene feature, values may be missing for genes for which no information is available). When the third input option is selected, the user is also required to select the species from which the homologous genes should be taken.
Depending on the selected input option, ContDist will perform the statistical test that best suits the input data (for a more detailed description please refer to the user manual).
Gene tables
Currently, the user can select between two different gene tables, RefSeq genes and Ensembl genes. The vast majority of the annotations depend on the genomic context, i.e. on the DNA sequence. Therefore, we opted to include only gene tables for which genomic coordinates are available.


Figure 2: The second page of ContDist. At the top it shows the options already chosen (the species) and gives the possibility to change them. In the middle part one of the input options can be chosen. Finally, one of the available gene tables must be chosen (by default RefSeq genes). The input options correlate directly with the analysis type and the statistical tests used (see the three different output pages below).

Step 3: Selection of available annotations for statistical comparison

After selecting the species, the input option and the gene table, the PHP interface looks up all available annotations for the selected combination. The third input page can be divided into three parts. At the top, the already selected parameters can be seen and changed if desired. The second section consists of a list of default annotation sets. The advantage of these default sets is that the user does not have to click through all annotations but can choose one of the default sets and launch it directly. For this purpose, several different default sets have been generated. For example, one set gives a cross-section of all available feature groups, containing the most important annotations of each group. In addition, a default set with the most important annotations exists for each feature group.


Figure 3: Overview on the annotation input page.

Finally, the third part of the page is made up of all existing annotations. Given that the annotation database comprises approximately 200 features, we opted to present the annotations in a compact way. By default, only the 6 feature groups can be seen: i) base composition, ii) physical properties of DNA and chromatin, iii) evolution, iv) general gene/protein properties, v) overlap with genomic elements and vi) gene expression (see the bottom of figure 3). Clicking on the name of a feature group makes all available annotations within this group visible (see figure 4). Each feature group may comprise several subgroups. Figure 4 shows the expansion of one particular feature group (Base composition), which consists of several subgroups, among them GC-content in different genomic gene regions, GC-content in the coding region and density of dinucleotides in different promoter regions. Each annotation appears with a short name (under which it is saved in the DB) and a short description of the feature. Moreover, when passing the mouse over the feature name, a pop-up window opens with a longer description. Each annotation can be selected by means of a check-box. After selecting all desired annotations, the program is launched (sent to a queuing system) by means of the Send Features button at the bottom of the page.


Figure 4: Pop-up view of annotations

After launching the program, an intermediate page is shown which displays the current status of the job (figure 5). The PHP interface reads the job information from the queuing system and shows the current status, such as pending (together with the position in the queue) or executing. Moreover, the job ID is shown and a link to the output page is given. When the program finishes, the output page is loaded automatically in the browser. The user can also bookmark the link and check the output later.


Figure 5: The intermediate "launch" page. It shows: i) the current status like pending (and the position in the queue) or executing, ii) the job ID and a link to the output page.

Figures 6-8 show the output pages for the three possible input options. Each of the three pages is made up of a common part (virtually identical in all three pages) and an individual part (which corresponds to the particular statistical analysis for the selected input option). The common part contains:

  • i) Header with the information on the selected analysis, size of input data, selected species and the job ID
  • ii) Annotated input data for download (first white table on the right)
  • iii) The effective number of genes (the number of genes to which an annotation could be assigned)
  • iv) The basic statistics of the data like mean, standard deviation, several percentiles, minimum and maximum
  • v) A graphical representation of the input data (raw and normalized histograms). The histograms are made with binned data (50 bins) using GnuPlot (http://www.gnuplot.info/).
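The summary statistics and the binned data behind the histograms can be approximated as follows. This is a minimal sketch using only the Python standard library, not the code behind ContDist. The function names are hypothetical, the choice of quartiles for "several percentiles" is an assumption, and "normalized" is assumed here to mean the fraction of genes per bin.

```python
import statistics


def summarize(values):
    """Basic statistics comparable to those listed on the output page."""
    ordered = sorted(values)
    # ASSUMPTION: quartiles stand in for the "several percentiles" of the text.
    p25, median, p75 = statistics.quantiles(ordered, n=4)
    return {
        "n": len(ordered),
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "min": ordered[0],
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": ordered[-1],
    }


def histogram(values, bins=50):
    """Return raw counts and normalized frequencies for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    normalized = [c / total for c in counts]
    return counts, normalized
```

The two lists returned by `histogram` could then be written to a text file and plotted with GnuPlot or any other plotting tool.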


Figure 6: Output page when comparing an input list to a reference set of genes.

Figure 6 shows the output for the analysis of an input gene list compared to a set of reference genes (the input list is a subset of the reference set). As mentioned above, the top part of the page displays a summary of the input data that is common to all three types of analysis. The bottom table shows the specific output data for this analysis. On the left, the result of the Kolmogorov-Smirnov test can be seen, and on the right the outcome of the randomization test is shown. For the KS-test, the p-value, the maximal difference and the cumulative fraction plot are given. For the random sampling method, the z-score, p-value, sampling mean and standard deviation are given. Moreover, the distribution of the sampling means is shown. Note that this distribution should be fairly Gaussian, otherwise this test will lose its validity.
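The quantities in this table can be illustrated with a short sketch. It is not the ContDist implementation: the function names, the number of random samples, sampling without replacement and the two-sided p-value are all assumptions made for the example. The first function returns the maximal difference between the two cumulative fraction curves (the KS statistic), the second draws random gene sets of the input size from the reference and converts the observed mean into a z-score.

```python
import bisect
import math
import random
import statistics


def ks_max_difference(a, b):
    """Maximal distance between the empirical cumulative fractions of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        fa = bisect.bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect.bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


def sampling_z_test(input_values, reference_values, n_samples=1000, seed=0):
    """Compare the input mean to the means of random samples of the reference."""
    rng = random.Random(seed)
    reference = list(reference_values)
    k = len(input_values)
    # ASSUMPTION: samples have the size of the input list, drawn without replacement.
    sample_means = [
        statistics.fmean(rng.sample(reference, k)) for _ in range(n_samples)
    ]
    sampling_mean = statistics.fmean(sample_means)
    sampling_sd = statistics.stdev(sample_means)
    z = (statistics.fmean(input_values) - sampling_mean) / sampling_sd
    # ASSUMPTION: two-sided p-value from the normal distribution, which is
    # only meaningful if the sampling means are roughly Gaussian.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p, sampling_mean, sampling_sd
```

Because the p-value is read from the normal distribution, the sketch shares the caveat stated above: it is only trustworthy when the distribution of `sample_means` looks Gaussian.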


Figure 7: Output page for the comparison of two input lists.

Figure 7 shows the output for a comparison of two gene lists. The top part of the output page is virtually identical to the output explained above (figure 6), and the left part of the bottom table (the KS-test) is identical as well. The difference lies in the other statistical test, displayed on the right of the bottom table (randomization test of the means). The table shows the observed distance between the means of the two lists, the mean distance of the random reassignments (note that this should be virtually 0), the standard deviation and the p-value. The p-value is the fraction of the random reassignments that show a larger distance (Abs(difference)) than the observed distance.
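A randomization test of the means as described here can be sketched as follows. The function name, the number of reassignments and the fixed seed are assumptions for the example, and the sketch is not taken from the ContDist source. The values of both lists are pooled, randomly reassigned to two groups of the original sizes, and the difference of the means is recorded for each reassignment.

```python
import random
import statistics


def permutation_test_means(list_a, list_b, n_permutations=10000, seed=0):
    """Randomization test for the difference between the means of two lists."""
    rng = random.Random(seed)
    observed = statistics.fmean(list_a) - statistics.fmean(list_b)
    pooled = list(list_a) + list(list_b)
    n_a = len(list_a)
    diffs = []
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to two groups of the original sizes.
        rng.shuffle(pooled)
        diffs.append(
            statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        )
    # p-value: fraction of reassignments with a larger absolute difference.
    p_value = sum(abs(d) > abs(observed) for d in diffs) / n_permutations
    return observed, statistics.fmean(diffs), statistics.stdev(diffs), p_value
```

The second return value is the mean distance of the random reassignments, which should be close to 0 as noted above.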


Figure 8: Output page for the comparison of an input gene list to the corresponding homologous gene in a different species.

Finally, figure 8 shows the results for the comparison of an input gene list with the corresponding homologous genes in a different species. Again, the top part is common as explained above. For this analysis type only one statistical test has been implemented, the paired t-test (the two gene lists are not independent, as they hold homologues of the same genes). The output table (bottom) shows the paired mean difference (the mean value of the differences between each pair of homologous genes), its standard deviation, the Student t and the corresponding p-value in the normal approximation (i.e. t is treated as if it were z, which is a very good approximation for sample sizes of about 30 or more). Moreover, on the right side of the table, the distribution of the differences between the gene pairs is shown.
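The paired test with the normal approximation can be written down in a few lines. This is a sketch under stated assumptions rather than the ContDist code: the function name is hypothetical, the two lists are assumed to be ordered so that position i holds a gene and its homologue, and the p-value is assumed to be two-sided.

```python
import math
import statistics


def paired_t_normal_approx(values, homolog_values):
    """Paired t statistic with the p-value taken from the normal distribution."""
    # One difference per pair of homologous genes.
    diffs = [a - b for a, b in zip(values, homolog_values)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    # Normal approximation: t is treated as z.
    # ASSUMPTION: two-sided p-value.
    p_value = math.erfc(abs(t) / math.sqrt(2))
    return mean_diff, sd_diff, t, p_value
```

For small numbers of gene pairs the normal approximation underestimates the p-value, so in that case a proper Student t distribution would be the safer choice.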