Diament et al. ChimeraUGEM: unsupervised gene expression modeling in any given organism. Bioinformatics, btz080, 2019.
The ChimeraUGEM binaries and source code are provided freely for non-commercial use.
run_ChimeraUGEM.sh
with the location of MATLAB Runtime (which was shown during the installation) as an argument.ChimeraUGEM provides tools for the analysis of gene sequences (coding and non-coding), as well as the design of protein coding sequences for optimized expression, based on the Chimera algorithms (Zur and Tuller, 2015; Diament et al., 2019) and codon usage optimization. The following tutorial will guide you step-by-step through the analysis and design pipelines. The application also provides online hints and status messages for the user. For an in-depth description of the methods (and some interesting results!) see the original paper.
Abbreviation | Description | Reference |
---|---|---|
CAI | Codon Adaptation Index | (Sharp and Li, 1987) |
cARS | ChimeraARS (part of the analysis pipeline) | (Zur and Tuller, 2015) |
cMap | ChimeraMap (part of the design pipeline) | (Zur and Tuller, 2015) |
PScARS | Position-Specific cARS | (Diament et al., 2019) |
PScMap | Position-Specific cMap | (Diament et al., 2019) |
ChimeraUGEM supports fasta files containing nucleotide or amino acid sequences. Select the appropriate alphabet type (AA seq / NT seq), and then click on 'Browse' to select your file. When the fasta file contains multiple proteins, a window will pop-up in order to select the desired ones for analysis or optimization. You may select a single or multiple proteins. While most of the following steps are applicable to both scenarios, batch processing will be discussed in detail at the end of the tutorial.
The program will prompt when illegal chars appear in the sequence(s), it contains partial codons, or missing a stop codon. It will also alert when some of the chars do not appear in the reference set at all.
As long as a target has yet to be selected, an arrow will point at the target protein field and a 'target missing' status will appear at the bottom right corner of the window.
Note, that NT seq targets can be used to define - in addition to the amino acid sequence - regions where the given sequence will be copied into the designed gene (additional details appear below).
ChimeraUGEM supports fasta files containing multiple nucleotide or amino acid sequences. (The latter alphabet is only applicable to the analysis pipeline.) The given reference sequences may be used in 2 ways:
The program will prompt when illegal chars appear in the sequences, when they contain partial codons, or when target chars are missing from the reference.
As long as a reference set has yet to be selected, an arrow will point at the reference seqs field and a 'reference missing' status will appear at the bottom right corner of the window.
An index of the reference sequence will be generated when analysis or design are initiated. Note, that this may take a few minutes for a very large reference. (For example, ~20min for 85k human transcripts.)
The analysis pipeline computes 2 global scores over the entire target sequence (ARS, and CAI), as well as 1 (optional) ARS score that is calculated over a defined "Chimera region" in the gene.
The design pipeline enables the user to combine the 3 approaches described above (ChimeraMap, codon usage bias, and pre-defined NT sequences) by associating regions in the target protein with each method.
When a target sequence is first selected, a "Chimera region" is automatically set to apply to the entire target. If you are only interested in using any of the Chimera algorithms, there is no need for further action.
The bottom half of the window displays region definitions for the currently analyzed or designed target. Region coordinates are measured in codons. (Please note this when analyzing non-coding sequences.) Regions are listed in three separated columns, according to the associated method. At the bottom, a horizontal bar displays a map of the target gene, with positions colored according to the method that will apply to them (green: Chimera, blue: codon usage bias, red: pre-defined sequence).
To add a new region, edit the text boxes in the respective column with the desired codon coordinates, and click on the '+' button. Note, that positive coordinates (such as 60cod in the example below) are measured relatively to the 5'-end of the sequence, while negative coordinates (such as -60cod in the example) are measured relatively to the 3'-end of the sequence (-1 being the last codon).
In practice, the design pipeline handles multiple region types by optimizing the complete target gene accroding to each method. Next, sub-sequences are copied from the two optimized genes (and the third source, if applicable). Therefore, it is possible that a ChimeraMap block will be splitted during this process, if it overlaps with a region boundary.
The program will prompt for user action when an added region is overlapping with an existing region. In the example above, the previously defined Chimera region from 1 to 1749 was split, and a codon optimization region was inserted from 60 to 1690 (the position that is 60cod from the target 3'-end).
To remove a region, edit the text boxes, and click on the '-' button. You may also click the list boxes and the regions map to select a region and update the respective text boxes with its boundaries. A removed region will be defined as 'default' (copied from a pre-defined NT sequence, see the example below).
Note, that when a default region is defined (and a target NT sequence is required), the program will prompt for missing target sequences next to the Design button. However, the analysis pipeline will remain available. In the example above we provided an AA sequence for the target, so when a default region was created from 600 to 800, an arrow appeared next to the target protein field, and the status below showed 'default NT seq missing'. This can be ammended either by selecting a target NT seq, or by replacing the default region with either a ChimeraMap or a CUB region.
This feature allows one to skip the manual definition of regions (as described above). ChimeraUGEM supports fasta files (denoted with the extension '.reg') containing sequences comprised of the letters B (for Chimera Blocks), C (for CUB), and D (for Default). The header (as well as length) of each record in the regions file should match one of the records in the target protein fasta. Each letter defines which of the methods will be applied to the corresponding codon position in the analyzed or designed gene.
Chimera parameters are summarized in a panel above the Chimera regions column. The default parameters were selected in the original paper, and generally do not strongly affect the final result (see the paper of analysis and discussion). However, they can be edited by clicking on the 'options' button, which opens a new window.
The top half of the parameters window contains parameters that affect the search window of the Chimera algorithms. There, parameters can be edited and their effect on the search window is visualized (at the bottom). In order to use the traditional (non-position-specific) cARS or cMap algorithms, one can enter 0 in the window size field.
The 2 homolog filters at the center of the window enable optional heuristic-based filters. Their objective is to ignore reference sequences that are too similar to the target, and therefore may introduce bias into the repetitive substring search. For example, when the filters are off, if a sequence in the reference set were to be almost identical to the target, the computed ARS score would be very high without capturing any reference-wide information content that may relate to gene regulation. Moreover, the design pipeline would duplicate the gene almost in its entirety. Therefore, in a scenario where this behavior is not desireable, the two filters may be calibrated by the user according to the analysis context. It is worth noting that reference sequences detected by the filters are only ignored during the processing of a particular target, and are still a valid reference for the rest of the targets. (Additional details appear in the Supplementary Methods of the original paper.)
cARS and PScARS can be computed using different alphabet representations of the given sequences. When the analysis process begins, and if alphabets other than amino acids (AA) are availbale (based on the given target and reference representations), the user will be prompted to select one.
Codon parameters are set in a panel above the codon regions column. Codons can be scored and optimized based on two souces:
For the second method, ChimeraUGEM supports comma-separated (.csv) files. During analysis, the geometric mean of the codon weights (scores) is computed. During design, synonymous codons with the highest optimality score will be used to encode the target protein. Codons with equal maixmal score will be drawn uniformly. The program will prompt when some AAs are not represented in the given table.
ChimeraUGEM generates two types of outputs:
ChimeraUGEM generates two types of outputs:
pos_s | pos_e | type | gene | gene_loc | block |
---|---|---|---|---|---|
1 | 10 | codon: eColi.fasta | NaN | ||
9 | 12 | cMap | MEPS | 14 | GGGATTCCCGCG |
The table reports that the first 10 codons were optimized based on codon frequencies in E. coli, while codons 11-12 were taken from a cMap block beginning at codon 9 in the target protein (and beginning at codon 14 in the reference gene MEPS).
ChimeraUGEM supports batch processing of multiple target sequences. One may select any number of targets from the fasta file in section 1. The rest of the workflow is similar, with the exception that the regions editor is not enabled when multiple targets are given. If the analysis or design is based exclusively on Chimera methods, there is no need for further action. Otherwise, region definitions must be provided via a regions file as described in section 3.2.