Overview:
miRanalyser is a free web-server tool for the processing of small-RNAs data obtained from next generation sequencing techniques such as the Genome Analyzer of Illumina Inc. or Genome SequencerTM FLX (454 Life ScienceTM and Roche Applied Science). The input data are grouped sequence reads (sequence tags, unique reads) that are typically 16 to 26 bp long (depending on the sequencing protocol), and their expression values (number of times the unique read has been found to be expressed in the experiment). Then, the tool performs several analysis steps and outputs detailed results. The analysis steps are the following:
- Alignment of all reads against the libraries of known mature microRNAs (including also the mature-star libraries – the sequences which pair with the mature microRNAs in the secondary structure of the pre-microRNAs). Additionally, the tool outputs the predicted target genes together with direct links to ontological analyses of the predictions.
- Mapping against all theoretically possible mature-star microRNAs including those which are not annotated in miRBase. This allows to detect the expression of previously undetected mature-star microRNAs.
- Alignment against other libraries of transcribed sequences. For example, the number of matches found in the transcriptome will be inversely proportional to the RNA sample quality (the fewer degradation products, the fewer the matches). Furthermore, matches found in RFam will indicate expression of other small ncRNA molecules.
- Prediction of previously unknown microRNAs. This is important as 1) experiments can be mined for the detection of previously unknown microRNAs and 2) for many species none or just a few microRNAs are known and therefore the detection of microRNAs relies almost entirely on the computational prediction of new microRNAs
How to use this web server tool
In order to run a microRNA analysis with miRanalyser, the user needs just an appropriate input file; then he can select the optional parameters required.
Input file format
The only input data needed to run the web-server tool are the sequences of the unique reads and their expression values.The application accepts two different input formats:
1) A tab or space separated file as in the following example:
| GAGGTAGTAGGTTGTA | 49862 |
| ACCCGTAGAACCGACC | 15490 |
| GGAGCATCTCTCGGTC | 13762 |
| ... | ... |
2) a multifasta file, see below:
| >ID 49862 |
| GAGGTAGTAGGTTGTA |
| >ID 15490 |
| ACCCGTAGAACCGACC |
| >ID 13762 |
| GGAGCATCTCTCGGTC |
We supply a Perl script to generate the tab-separated input format out of files containing Genome Analyzer (Illumina Inc.) raw data (see description at the end of this tutorial download the script).
Figure 1: A screenshot of the input page.
The user can also run the program automatically with the test data set (rat hepatocytes microRNA deep sequencing data) by activating the checkbox as shown in figure 1:
Species and genome assembly:
In the current version one of the following 7 species genome assemblies can be chosen:
- Homo sapiens (hg18, NCBI 36.1)
- Mus musculus (mm8, NCBI 36)
- Rattus norvegicus (rn4, version 3.4)
- Drosophila melanogaster (dm3, BDGP Release 5)
- Caenorhabditis elegans (ce6, WUSTL School of Medicine GSC and Sanger Institute version WS190)
- Canis familiaris (canFam2, v2.0)
- Danio rerio (danRer5)
Number of mismatches
Next generation sequencing data are normally characterized by a higher sequencing error than the Sanger sequencing, but this error rate is balanced by a much higher redundancy. Therefore, the user should carefully choose the number of allowed mismatches (0, 1 or 2) to assign a sequence-read to a microRNA (default value is 1). Note that for the detection of new microRNAs and the overlapping with repetitive sequences, only perfect matches to the genome are considered.
Target Gene table:
The program makes available the putative target genes for each detected microRNA ..and direct links to their ontological analysis. Thus, the user must select a set of predicted target genes. We offer 2 different prediction methods:
- Predictions from miRBase hosted at the Sanger Institute
- Predictions from TargetScan (conserved family – conserved target)
Posterior probability threshold
The posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence is taken into account ( wiki). One of the available values should be selected (default is 0.9).
Do not consider adapter sequences
By default, the tool tries to detect if adapter sequences exist by searching for the perfect matches in subsequences of the input reads. If this option is selected, the tool will only look for the perfect matches of the input reads complete sequences.
Detect just new microRNAs
This option will skip the detection of known microRNAs.
Remove reads detected in mRNA database
All reads detected in a RefSeq transcript are removed from the input. Note that reads which map perfectly to a mRNA sequence may correspond to degradation products. If this option is not chosen, the tool automatically eliminates all reads which map to more than 5 mRNA sequences
Remove reads detected in RFam
This option removes reads which correspond to other known RNAs sequences as these might be easily confused with microRNAs (in the prediction of new microRNAs).
Removes reads detected in RepBase
This option removes all reads found in a RepBase sequence (transposons)
Predict only conserved microRNAs
All read clusters which do not overlap with a Phylogenetically Conserved Element (PhastCons) are removed.
Description of output
The output can be divided into a first summary page and several detail output pages.
Figure 2: The summary output page
The summary page:
The results overview page is made of the following 5 summary boxes:
-
Queuing and Execution
Each requested analysis is sent to a queuing system and the current status is indicated within this box (see Figure 3). When the program has terminated all the analysis processes, a link to download the results as plain text emerges (see Figure 4). - Known microRNAs
Values are given for the mature, ambiguous mature, mature-star, ambiguous mature-star, unknown mature-star (mature-star sequences which are not in miRBase ), ambiguous unknown mature-star, hairpin and ambiguous hairpin categories (see Figure 5): - total number: total number of microRNAs detected in the input.
- fraction (number) of known microRNAs: total number and percentage of expressed sequence tags mapping to known microRNAs. This shows the percentage of total expression which corresponds to known microRNA.
- number of unique reads: total number of microRNAs that have been detected by means of unique sequence reads
- fraction of unique reads: proportion of microRNAs that have been detected by means of unique sequence reads
- read count: number of different reads that mapped to the known microRNAs
- fraction of read count: percentage of different reads that mapped to the known microRNAs
- links to detail pages: links to detailed pages for each microRNA category
- Alignment to transcriptome
Number of sequence reads that align to sequences in mRNA databases, total number of copies per mRNA and the corresponding percentages are shown (see Figure 6). - Predicted candidate microRNAs
The number of predicted putative-new microRNAs and the expression data of the corresponding read sequences are given. - Unmatched reads
A summary of all reads that either were filtered out or could not be mapped to any of the integrated libraries (mature microRNA, mature-star, mRNA, whole genome). Again, the number of reads, the counts and the normalized expression values are given.
Figure 3: The Queuing and Execution box while the program is running.
Figure 4: The Queuing and Execution box after the program has terminated
Figure 5: Summary of analysis on known microRNAs
Figure 6: Summary of alignment against different libraries of transcribed sequences
Figure 7: Predicted microRNAs.
Figure 8: Unmatched and filtered reads
Detailed output pages:
- Known microRNAs
This output is contained in a file where the reads are sorted on the level of expresion, in descending order. The total expression value is determined by the sum of the number of copies of all reads which mapped to a particular microRNA. The table shows also the normalized expression values: the first value is the normalization of expression counts to the total number of expressed sequence reads, and the second is the same value normalized according to the number of reads that mapped to microRNAs. Furthermore, the number of reads and their expression counts are given for perfect matches, 1 mismatch and 2 mismatches. Note that if the query was limited to perfect matches or 1 mismatch, the columns corresponding to the other options will just show 0 counts. Finally, the last columns are dedicated to putative target genes and ontological analyses. The target genes can be viewed by clicking the corresponding link. By clicking the “launch AM” link, the target genes are sent automatically to the Annotation-Modules algorithm and the results are depicted in another window. Once an ontological analysis have been run for a list of target genes, the output page detects the presence of some results and after reloading (i.e. pressing F5), it replaces the “launch AM” link with the “see AM results” link. - Mapping to other libraries of transcribed sequences
For each sequence-read which maps perfectly to a mRNA, the following information is given: its expression count, the normalized expression based on the count-sum of all sequence reads, the name(s) of the mRNAs (RefSeq identifiers) separated by a “:”, and the number of mRNAs where each read matched perfectly. - microRNA prediction
The following information is provided for all the reads that are potential sequences of a new microRNA: the sequence read, the read count, the normalized expression value, the chromosome, the start and end coordinates on the chromosome, the sequence and secondary structure of the new microRNA and the mean free energy of the pre-microRNA. - Overlap with repetitive sequences and transposons
Each sequence read can map in several positions in the genome, and therefore, one sequence read may have several overlaps. For each match, the RepeatMasker annotation is checked. In case of an overlap, the repeat is assigned to the sequence read and written out. - Unmapped sequence reads
Counts and expression values of the sequence reads that were not identified in the previous steps (or filtered out) are shown
Header descriptions of detailed output pages:
microRNAs (including mature-star and hairpin)
Figure 9: Detailed output page for known microRNAs.
- microRNA
- the name of the microRNA (in case of multiple mappings this corresponds to a set of all microRNAs mapped by the same reads)
- #reads
- the number of different sequence reads which could be mapped to this microRNA
- read count all
- the sum of the number of copies of all reads mapped to this microRNAs
- norm expr. all
- The normalized read count: the read count divided by the total number of detected reads
- norm expr. mapped
- The read count normalized just to the reads mapped to the given library (mature microRNA, mature*, reverse complementary).
- #Reads PM
- the number of sequence reads that mapped perfectly (perfect matches)
- read count PM
- the sum of the number of copies of all reads that mapped perfectly
- #Reads MM1
- the number of sequence reads which mapped with 1 mismatch
- read count MM1
- the sum of the number of copies of all reads that mapped with 1 mismatch
- #Reads MM2
- the number of sequence reads which mapped with 1 mismatch
- read count MM2
- the sum of the number of copies of all reads that mapped with 1 mismatch
- target genes
- link to the detailed description of the target genes
- Launch AM
- link to analyze the target genes through the Annotation-Modules algorithm
Alignment to other libraries of transcribed sequences
- read sequence
- the sequence of the read
- read count
- the number of times this sequence has been observed in the experiments
- norm expr. all
- the normalized read count: the read count divided by the total number of detected reads
- mRNAs
- the mRNAs where the read sequence matched
- mRNA count
- the number of sequences where the read matched
Prediction of candidate microRNAs
Figure 10: Detailed output page for predicted microRNAs.
Figure 11: Predicted structure and detected reads.
- candidate name
- link to another page that describes all the details of the candidate microRNA (see Figure 11).
- chromosome
- chromosome where the read cluster is located
- chromo start
- first base chromosomal coordinate of the candidate sequence
- chromo end
- last base chromosomal coordinate of the candidate sequence
- strand
- the strand where the microRNA was transcribed from
- #unique reads
- number of unique reads which map to the precursor of this putative microRNA
- Read count
- sum of the number of copies of all unique reads that could mapped to this putative microRNAs
- Norm. read count
- the read count divided by the total number of reads in the experiment (sum of all read counts)
- UCSC
- link to UCSC Genome Bioinformatics Browser depicting the predicted region ´
The Perl script
A Brief description
About 1 to 3 Gbp of data can be generated per Genome Analyzer (Illumina Inc.) experiment. The number of times a given sequence read occurs corresponds directly to its expression level. However, not all the information needs to be transferred via the internet (big files!) since the Genome Analyzer output files can be processed locally generating a single file which holds just the sequence read and its count (number of times it has been found in the sample). With this purpose in mind, we developed a small perl script (GroupGAReads.pl) which reads the s_L_sequence.txt files (being L the lane on the flowcell) and counts the sequence reads. The script implements two quality measures to filter out low quality reads. It also allows the comparison of two samples/conditions (case/control).
Input file (located within the Run-folder-->Data-->Firecrest-->Bustard-->Gerald):
The s_L_sequence.txt file has the following structure (see here for a example file). 4 lines correspond to each sequence read.
@0801-1150:6:1:135:472
AAAAAAAAAAAAAAAA
+0801-1150:6:1:135:472
YY[]]^V]Y]]]W^]]
The first line is an identifier which records the lane, tile and cluster coordinates. The second line is the sequence read. The third line is again the identifier and finally the fourth line contain the quality values for each base in the sequence read. The quality value can be calculated by Q = (ASCII character code) – 64 From Q15 on, the values are roughly the same as Phred scores. Q10 means aprox. 1 error each 10 bases, Q20 means 1 error each 100 bases, and Q30 means 1 error each 1000 bases.
Based on this quality scores, we implemented two quality thresholds:
- minimum quality: the user specifies a Q value threshold that each base in the sequence read must reach, otherwise the read will be filtered out.
- mean quality: the mean Q is calculated for the read and the read is filtered out if the mean quality is below the user given threshold.
We implemented two basic modes. First, a simple count mode, in which the script just counts the number of copies of each sequence read over 1 or more lanes. And second, a comparative mode, in which the script calculates additionally log2-ratios based on the expression counts in two groups (two conditions).
The command line parameters
The options are generally given like this "option=value"
The following options are implemented and mandatory:
input=file1_group1,file2_group1,etc#file1_group2,file2_group2,etc.
This would be the option for a comparison (group 1 vs. group2), in case of a normal "count" it would be
input=file1,file2,etc
outfile=’the output file’
minNr=value: the minimum number of copies a read must have to be written out. Note that due to sequencing errors, many reads exist with just a few copies.
qual=type:value: type can be either ‘min’ or ‘mean’ (see above for explanation of minimum and mean based quality parameters). Value will be a Q threshold (probably values between 10 and 20 will be most appropriate).
Output Files
Normal count mode:
sequence read count
AGCTA 456
ATTGC 300
…
Comparative mode:
(like above but the third column is the log2-ratio group1/group2). If the log2-ratios are positive, the sequence read expression is higher in the first group, and vice versa.
AGCTA 465 0.34
ATTGC 277 2,29
CGTTA 299 -1.3