DREAM - Detecting RNA editing associated with miRNAs

Based on the Following Publication:

Alon S, Erew M, Eisenberg E (2015). DREAM: a webserver for the identification of editing sites in mature miRNAs using deep sequencing data. Bioinformatics, 31:2568-2570.

Overview of the Web Server

Deep sequencing has many possible applications; one of them is the identification and quantification of RNA editing sites. The most common type of RNA editing is adenosine to inosine (A-to-I) editing. A prerequisite for this editing process is a double-stranded RNA (dsRNA) structure. Such dsRNAs are formed as part of the microRNA (miRNA) maturation process, and it is therefore expected that miRNAs are affected by A-to-I editing. Indeed, tens of editing sites were found in miRNAs, some of which change the miRNA binding specificity. This web server was designed for the identification of RNA editing sites in mature miRNAs using deep sequencing data.

Upload Fastq Sequencing File of Mature miRNA

The input for the web server is sequencing reads of mature miRNAs in Fastq format. The web server supports only data obtained using the Illumina platform (in Phred+33 format). The preferred way to supply the sequencing data is to provide a direct link to the data by choosing the 'Direct link' option: upload the data to a public directory in a Dropbox account, click 'Copy public link', and paste the result into the 'Direct link' text box of the web server. In addition to plain Fastq, the direct link option accepts the following compressed file formats: zip, tar, tar.gz, tar.bz2 and rar. Note that the name of the compressed file must match the name of the Fastq file; for example, 'a_data.fastq' should be zipped to 'a_data.zip'.

Fastq files can also be uploaded directly to the web server, but this option is not recommended: upload times can be quite long for large files (typically ~5GB/hour), and network interruptions may prevent the upload from completing. To upload a Fastq file, use the 'Search in folder (Fastq only)' option and press the 'Choose Files' button to locate the file on your computer. Multiple files can be uploaded simultaneously; use the Control key to select the required files. Compressed Fastq files can also be uploaded, using the 'Search in folder (compressed Fastq only)' option. The following compressed file formats are allowed: zip, tar, tar.gz, tar.bz2 and rar. The name of the compressed file must match the name of the Fastq file; for example, 'a_data.fastq' should be zipped to 'a_data.zip'. Again, multiple files can be selected with the Control key. Once the 'Start Analyzing' button is pressed, a progress bar at the bottom of the page shows the progress of the upload. Make sure not to disconnect while the upload is in progress. Once the upload is over, a notification email will be sent to you. Make sure to use a unique file name for every uploaded file.

Alternatively, you may want to use a test dataset by choosing the 'Use test dataset' option. The test dataset is of mature miRNAs from mouse cerebellum (accession number SRR346417). If you choose the test dataset, the parameters are adjusted automatically; just enter your email address and click 'Start Analyzing'.

Pre-Processing of the Fastq File

The Fastq file may contain either raw reads without any filtering, or reads already filtered using Illumina software tools. In the latter case, check 'No, its already pre-processed'. Otherwise, check 'Remove Sequence Adapter', as raw sequencing reads likely contain parts of the adapter sequence. The web server trims these sequences if the user supplies them in the fields '5 Adaptor' and '3 Adaptor', where 5 and 3 refer to the 5' and 3' ends of the mature miRNA sequence. As the expected length of mature miRNAs is ~21 bases, reads that are longer than 28 bases or shorter than 15 bases after trimming are discarded. Moreover, low-quality reads (as defined by the read quality score) are unlikely to be informative and are therefore removed: the web server discards reads with low sequencing quality in more than three positions. Low sequencing quality is defined as a Phred quality score (which ranges from 0 to 40) lower than 20 (Alon et al., Genome Research, 2012).
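The filtering rules above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the web server's own code: the function names are hypothetical, and the adapter trimming shown here (cutting at the first exact match of the 3' adapter) is a simplifying assumption, since real adapter trimmers also handle partial and mismatched adapters.

```python
PHRED_OFFSET = 33               # Phred+33 encoding, as required for the input
MIN_QUALITY = 20                # scores below this count as low quality
MAX_LOW_QUALITY_POSITIONS = 3   # reads with more low-quality positions are removed
MIN_LENGTH, MAX_LENGTH = 15, 28 # allowed read length after trimming


def trim_3prime_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter (simplified)."""
    idx = seq.find(adapter) if adapter else -1
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def passes_filters(seq, qual):
    """Apply the length and quality thresholds described in the text."""
    if not MIN_LENGTH <= len(seq) <= MAX_LENGTH:
        return False
    low = sum(1 for ch in qual if ord(ch) - PHRED_OFFSET < MIN_QUALITY)
    return low <= MAX_LOW_QUALITY_POSITIONS
```

A read would be kept only if `passes_filters` returns `True` for the sequence and quality string returned by `trim_3prime_adapter`.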

Email for Results

Make sure to enter your email address; without it the run cannot start. The 'Start Analyzing' button is activated only once the email address is entered. You should receive two emails. The first indicates that the upload of the Fastq file is complete and gives an estimate of the processing time, based on the size of your data and the files in the queue. The second contains the results of the analysis in an attached text file, which can be opened with spreadsheet software such as Microsoft Excel, together with an explanation of the output (see also below). Note that when using the test dataset only the results email is sent, as there is no need to upload a Fastq file. Please avoid multiple uploads of the same data, unless the upload itself was faulty (based on the first email you received).

Multiple-Correction Procedure

The search for possible RNA editing sites is performed for every position in every mature miRNA separately. As multiple tests are performed, the resulting P-value for each position must be corrected accordingly. The multiple testing correction can be either 'Bonferroni' or 'Benjamini–Hochberg'. Note that the Bonferroni correction is more stringent and may therefore miss true editing sites. With the Bonferroni correction, only positions with a P-value less than the supplied P-value divided by the number of tests are reported. If the Benjamini–Hochberg correction is chosen, the supplied P-value is the False Discovery Rate (FDR). Note that on the web site the default P-value is set to a high value (0.2). The results file contains all the P-values (the raw P-values, the Benjamini–Hochberg-corrected P-values and the Bonferroni-corrected P-values), allowing the user to see directly the effect of various cutoffs on the number of detected sites and to filter the results file accordingly. Users who want to start the process with a more stringent P-value can easily modify it in the web server.
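The two corrections can be illustrated with the standard textbook formulas. The sketch below is a generic implementation of Bonferroni and Benjamini–Hochberg adjusted P-values, written for illustration; the function names are hypothetical and the code is not taken from the web server.

```python
def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, ascending P
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

Comparing a Bonferroni-adjusted P-value with the cutoff is equivalent to comparing the raw P-value with the cutoff divided by the number of tests, as described above. Comparing a Benjamini–Hochberg-adjusted P-value with the cutoff controls the FDR at that level.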

Overview of the Analysis Process

The filtered and trimmed reads are aligned against the genome of interest using Bowtie (Langmead et al., Genome Biology, 2009). The UCSC hg19 and mm9 versions of the genome are used for human and mouse, respectively. We require a unique best alignment, that is, the read cannot be aligned to any other location in the genome with the same number of mismatches. Only alignments with up to one mismatch are used. Taken together, these steps largely solve the cross-mapping problem that significantly hinders identification of true editing sites in mature miRNAs. The last two bases (the 3' end) of mature miRNAs undergo extensive adenylation and uridylation (Burroughs et al., Genome Research, 2010). Therefore, these bases are not considered in the alignment. Naturally, doing so prevents detection of editing at these locations. However, keeping these bases while still demanding a low number of mismatches would severely reduce the number of alignments obtained.
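For readers who want to reproduce a similar alignment locally, the sketch below assembles a Bowtie (version 1) command line that matches the constraints described above. It is an assumption about how these constraints map to Bowtie options, not necessarily the command the web server runs; the helper name and the index and file names are hypothetical.

```python
def bowtie_command(index_prefix, fastq_path, sam_path):
    """Build an argument list for Bowtie 1 reflecting the described constraints."""
    return [
        "bowtie",
        "-v", "1",              # allow at most one mismatch per alignment
        "-a",                   # consider all valid alignments ...
        "--best", "--strata",   # ... but only those in the best mismatch stratum
        "-m", "1",              # drop reads with more than one such alignment
        "--trim3", "2",         # ignore the last two bases at the 3' end
        "-S",                   # write SAM output
        index_prefix,
        fastq_path,
        sam_path,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where Bowtie and a prebuilt genome index (for example, one built from hg19 or mm9) are installed.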

Next, the reads aligned to the genome are converted to counts of each of the four possible nucleotides at each position along the pre-miRNA sequence (miRBase, release 21; http://www.mirbase.org/), for all the pre-miRNAs. This transformation focuses the analysis on bona fide miRNAs only. In the following step, a binomial test is applied to distinguish significant modifications in these regions from sequencing errors. The binomial test requires the number of mismatches of a given type at a given position, the total number of reads at that position, and the a priori sequencing error probability. As only mismatches with a Phred score of at least 30 are allowed (Alon et al., Genome Research, 2012), and a Phred score of 30 corresponds to an error probability of 10^-3, we use 0.1% as the expected base call error rate. Importantly, binomial statistics do not require any arbitrary expression level filter. They are well suited even for lowly expressed miRNAs with a small number of sequencing reads, and the P-values computed reflect the absolute number of reads detected, small or large as the case may be. This analysis is performed for every position (except the last two positions of the miRNA, due to the extensive adenylation and uridylation) in every mature miRNA separately. As multiple tests are performed, the resulting P-values are corrected using either the Bonferroni or the Benjamini–Hochberg correction. Lastly, known SNPs are filtered from the statistically significant modifications detected by the web server, using the lists of common SNPs, dbSNP builds 138 and 142 for mouse and human, respectively.
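The binomial test for a single position can be sketched as follows. The sketch assumes a one-sided (upper tail) test, that is, the probability of observing at least the given number of mismatches if every mismatch were a sequencing error occurring at the 0.1% rate. The function name is hypothetical and the code is an illustration rather than the web server's implementation.

```python
import math

SEQUENCING_ERROR_RATE = 0.001  # 0.1%, the expected base call error rate


def binomial_upper_tail(mismatches, total_reads, error_rate=SEQUENCING_ERROR_RATE):
    """Return P(X >= mismatches) for X ~ Binomial(total_reads, error_rate)."""
    if mismatches <= 0:
        return 1.0
    log_p = math.log(error_rate)
    log_q = math.log1p(-error_rate)
    terms = []
    for k in range(mismatches, total_reads + 1):
        # Log of the binomial probability mass, computed via lgamma
        # to avoid overflow for large read counts.
        log_pmf = (math.lgamma(total_reads + 1) - math.lgamma(k + 1)
                   - math.lgamma(total_reads - k + 1)
                   + k * log_p + (total_reads - k) * log_q)
        terms.append(math.exp(log_pmf))
    return min(1.0, math.fsum(terms))
```

Because the P-value depends on both the number of mismatches and the total number of reads, the same function applies to highly and lowly expressed miRNAs without a separate expression cutoff.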

The Output File

The output text file gives the locations of the significant modifications, as well as a statistical description of each modification. The following fields should appear in the text file:

CHROM = The chromosome number

START = The location of the position before the modification site

END = The location of the modification site

STRAND = The strand in which the modification is located

miRNA name = The miRNA with the modification

Location inside pre-miRNA = The position of the modification inside the pre-miRNA (according to miRBase definitions, release 21)

Modification type = The modification type

Number of reads with the modification = The number of reads which show the modified base

Total number of reads in this position = The total number of reads covering this position, with or without the modification

Location inside mature sequence = The position of the modification inside the mature miRNA (according to miRBase definitions, release 21)

Raw P-value = The P-value obtained from the binomial test

Bonferroni P-value = The Bonferroni corrected P-value

BH P-value = The Benjamini-Hochberg corrected P-value
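As an illustration of filtering the results file with a different cutoff, the sketch below reads the output and keeps only rows below a chosen threshold. It assumes the file is tab-delimited with a header row whose column names match the field names listed above; the delimiter, the function name and the default cutoff are assumptions made for this example.

```python
import csv


def significant_sites(lines, column="BH P-value", cutoff=0.05):
    """Return the rows whose value in `column` is at or below `cutoff`.

    `lines` is any iterable of text lines, for example an open file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if float(row[column]) <= cutoff]
```

For example, calling `significant_sites(open("results.txt"), column="Bonferroni P-value", cutoff=0.01)` would apply a stricter Bonferroni-based filter, where `results.txt` stands for the attached results file.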

Output Interpretation

Most of the modifications are expected to be A-to-G, which can be a result of A-to-I editing. What can be the nature of the other types of modifications? Among the possible explanations are the following: (a) rare SNPs or somatic mutations, (b) 5' adenylation and uridylation, (c) problems in the definition of the miRNA, (d) sequencing artifacts, (e) C-to-U RNA editing, or (f) non-canonical RNA editing events:

(a) Known SNPs are filtered from the statistically significant modifications detected by the web server, using the lists of common SNPs, dbSNP builds 138 and 142 for mouse and human, respectively. However, the modifications detected can be previously undetected SNPs, rare SNPs, or somatic mutations.

(b) Some low-abundance isomirs display 5' sequence modifications similar to the biological modifications reported at the 3' end of mature miRNAs (Burroughs et al., Genome Research, 2010). We previously identified several isomirs that start one or two bases upstream of a different isomir and display sequence modifications at the 5' end in the form of adenylation and uridylation. In all these events the abundance of the modified isomir was significantly lower than that of the unmodified isomir. The web server automatically identifies these events and discards them from the analysis if: (1) the expression of the modified isomir is more than an order of magnitude lower than the expression level of the main isomir, and (2) the position to discard is in the first two 5' bases of the miRNA. However, some adenylation and uridylation events at the 5' end may slip through these filters.

(c) The definitions of the bona fide miRNAs are constantly improving in each new release of miRBase. However, artifacts (for example, fragments of rRNA that are identified as miRNA) can still be present in the database.

(d) We observed several Fastq files of mature miRNAs that show a clear preference for one type of mismatch, for example A-to-C. Interestingly, this mismatch type occurred predominantly at one specific position along the mature miRNA (for example, position 6 counting from the 5' end). In these cases there therefore seems to be a specific mismatch in one cycle of the sequencing process. These mismatches tend to have lower sequence quality than known editing events. However, they can still be detected even when a Phred score of 30 is used as the cutoff.

(e) Cytosine-to-uracil (C-to-U) editing can also happen in several mammalian tissues (see for example: Rosenberg et al., Nature Structural & Molecular Biology, 2011).

(f) It has been suggested that non-canonical types of RNA editing may affect RNA (Li et al., Science, 2011).
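The two discard conditions in item (b) can be written as a small predicate. This is a sketch under stated assumptions: read counts stand in for expression levels, the position is counted from 1 at the 5' end of the mature miRNA, and the function name is hypothetical.

```python
def is_5prime_isomir_artifact(modified_reads, main_reads, position_in_mature):
    """Return True if a candidate site meets both discard conditions of item (b)."""
    # (1) the modified isomir is more than an order of magnitude less
    #     abundant than the main isomir
    much_less_abundant = modified_reads * 10 < main_reads
    # (2) the position lies in the first two 5' bases (1-based, assumed)
    in_first_two_bases = position_in_mature in (1, 2)
    return much_less_abundant and in_first_two_bases
```

A candidate at position 1 supported by 5 reads, next to a main isomir with 1000 reads, would be discarded, whereas the same read counts at position 6 would be kept.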

References for the Analysis Process

Alon S, Mor E, Vigneault F, Gallo A, Locatelli F, Church GM, Shomron N, Eisenberg E (2012). Systematic identification of edited microRNAs in the human brain. Genome Research, 22:1533-1540.

Alon S, Eisenberg E (2013). Identifying RNA editing sites in miRNAs by deep sequencing. Methods in Molecular Biology, 1038:159-170.

Page last modified on July 30, 2015, at 11:08 PM EST