GenoMS: Sequencing whole proteins using template databases


Creating template databases

Construction of the template database is a crucial part of the GenoMS
algorithm. The database contains sequences, called 'templates', which may
be present in the biological sample in a form altered by mutation,
insertion, and deletion. In addition, constraints on the templates specify
how the templates may be used together.

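There are two types of constraints which can be specified: mutual
exclusion constraints, which state that two templates cannot both be used,
and order constraints, which state that one template may precede another.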

GenoMS is able to obtain the constraints from 3 different types of input
sequence files: genome sequences, template class files, and user-created
constraint files. Exactly one of these types of database must be input to
GenoMS.

Genome sequences

In some cases, protein sequences may not be available. The genomic
sequence can then be used as the source of templates. The genome sequence
is translated in all six frames, and each translated portion between stop
codons becomes a template. Templates are mutually exclusive if their
genomic coordinates overlap or if they lie on different strands. The order
of two templates is determined by the order of their genomic coordinates.
The genome sequence can be specified as a FASTA file in the config file
(see Usage).
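
The translation step described above can be illustrated with a minimal
Python sketch. It is not the GenoMS implementation: the function names, the
0-based half-open coordinates reported on the forward strand, the use of the
standard genetic code, and the 'X' placeholder for unrecognized codons are
all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code with codons enumerated in TCAG order.
# Assumption: GenoMS may use a different table or treat ambiguous bases differently.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_templates(genome):
    """Return (strand, start, end, peptide) for each stretch between stop codons.

    start and end are 0-based, half-open coordinates on the forward strand
    (a hypothetical convention chosen for this sketch).
    """
    genome = genome.upper()
    n = len(genome)
    templates = []
    strands = (("+", genome), ("-", genome.translate(COMPLEMENT)[::-1]))
    for strand, seq in strands:
        for frame in range(3):
            aas = [CODON_TABLE.get(seq[p:p + 3], "X") for p in range(frame, n - 2, 3)]
            peptide, seg_start, pos = [], frame, frame
            for aa in aas + ["*"]:  # the trailing stop flushes the final segment
                if aa == "*":
                    if peptide:
                        start, end = seg_start, pos
                        if strand == "-":
                            # Map reverse-strand positions back to forward coordinates.
                            start, end = n - end, n - start
                        templates.append((strand, start, end, "".join(peptide)))
                    peptide, seg_start = [], pos + 3
                else:
                    peptide.append(aa)
                pos += 3
    return templates


def mutually_exclusive(a, b):
    """True if two templates overlap in genomic coordinates or differ in strand."""
    return a[0] != b[0] or (a[1] < b[2] and b[1] < a[2])
```

For the toy sequence ATGGCCTAAATGAAA, forward frame 0 yields two templates,
one on each side of the TAA stop codon. They lie on the same strand and do
not overlap, so under the rule above they are not mutually exclusive and
their order follows their coordinates.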

NOTE: This is not recommended for whole genome studies. Your best chance
is to use the genomic locus of the protein of interest, taken from the
genome of either the organism of interest or a related organism.


Template classes

Some types of protein sequences can be divided into several segments.
For example, immunoglobulin heavy chain proteins are composed of 4 distinct
segments: the variable region, the diversity region, the joining region, and
the constant region. Each protein contains 1 of each of these segments, but
there are hundreds of choices for each segment. Segments of a given type
are mutually exclusive (e.g. only one variable region can be present in the
heavy chain). In addition, the segment types appear in a fixed order in the
final protein (e.g. the variable region always precedes the diversity
region). We call each of these segment types a 'template class'.
Specifically, protein sequences in the same template class are mutually
exclusive, and protein sequences in a template class which precedes another
template class must precede the protein sequences in that second class.
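
The two rules above can be sketched in a few lines of Python. This is an
illustration only, not GenoMS code: the function name, the representation of
a class as a list of template names, and the choice to emit an order pair
for every pair of classes (not only adjacent ones) are assumptions.

```python
from itertools import combinations


def infer_class_constraints(classes):
    """Derive constraints from template classes listed in class order.

    classes: list of lists of (unique) template names.
    Returns (exclusive, ordered), where exclusive is a set of unordered
    pairs and ordered is a set of (earlier, later) pairs.
    """
    exclusive, ordered = set(), set()
    # Templates within the same class are mutually exclusive.
    for members in classes:
        for a, b in combinations(members, 2):
            exclusive.add(frozenset((a, b)))
    # Templates in an earlier class precede templates in any later class.
    for earlier, later in combinations(classes, 2):
        for a in earlier:
            for b in later:
                ordered.add((a, b))
    return exclusive, ordered
```

With hypothetical classes [["V1", "V2"], ["D1"], ["J1", "J2"]], the sketch
marks V1/V2 and J1/J2 as mutually exclusive and places every V template
before D1 and before both J templates.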

Protein sequences can be submitted to GenoMS in the form of template
classes, allowing the constraints to be inferred. This is done by using the
'dbrootname' line in the config file. The usage can best be explained by an
example.

Suppose we are trying to sequence the heavy chain of an immunoglobulin.
There are 4 template classes, so we create 4 FASTA files. Each FASTA file
contains all of the protein sequences for one template class. One file
contains all possible variable region sequences, another file contains all
possible joining region sequences, etc. Each file is named with the same
root name, but the extension indicates the order. In this case, we may have
HeavyChainSequences.1 containing the variable region sequences,
HeavyChainSequences.2 containing the diversity region sequences,
HeavyChainSequences.3 containing the joining region sequences, and
HeavyChainSequences.4 containing the constant region sequences. The extension
numbers must be consecutive and begin with the number 1. In the config file,
we would specify

        dbrootname,HeavyChainSequences
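
The naming rule (consecutive extensions starting at 1) can be checked before
running GenoMS with a small Python sketch. How GenoMS itself locates the
files is not described here; the helper name and the stop-at-first-gap
behaviour are assumptions chosen to show why a gap in the numbering matters.

```python
import os


def find_class_files(rootname):
    """Return [rootname.1, rootname.2, ...] up to the first missing number.

    Hypothetical helper: any file that follows a gap in the numbering is
    not returned.
    """
    files = []
    index = 1
    while os.path.isfile(f"{rootname}.{index}"):
        files.append(f"{rootname}.{index}")
        index += 1
    return files
```

If HeavyChainSequences.1, .2 and .4 exist but .3 does not, this helper
returns only the first two files, which makes the numbering mistake easy to
spot.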
    

Suppose we are trying to sequence 2 proteins at the same time: the heavy
chain and the light chain of an immunoglobulin. The selection of light chain
templates is completely independent of the heavy chain templates, so they
form their own set of template classes, which we divide into 3 FASTA files:
LightChainSequences.1, LightChainSequences.2, and LightChainSequences.3. In
the config file we will have 2 lines

        dbrootname,HeavyChainSequences
        dbrootname,LightChainSequences
    
Constraints are created among the heavy chain templates and among the light
chain templates, but no constraints are created between heavy chain
templates and light chain templates.


Creating your own template constraint file

If your experiment does not fit into one of the constraint inference
frameworks described above, you can create your own constraint file. The
template sequence file is a standard FASTA formatted file. The constraints
file is a 3-column, tab-delimited file with one line for each constraint.
The first column of a line specifying an order constraint is a capital
letter 'O'. The first column of a line specifying a mutual exclusion
constraint is a capital letter 'F'. The second and third columns specify the
pair of template IDs on which the constraint is imposed. The template ID is
the position of the protein in the FASTA file, counting from 0 (the first
sequence in the FASTA file has ID 0). Here is an example constraint file.

        O   0   1
        O   1   2
        O   0   2
        F   0   3
        F   3   4
        F   0   4
    
The file specifies 6 constraints: 3 order constraints and 3 mutual exclusion
constraints. The first constraint specifies that template 0 may directly
precede template 1. The second constraint specifies that template 1 may
directly precede template 2. The third constraint specifies that template 0
may directly precede template 2. This set of constraints may arise if each
template represents an exon and alternative splicing is possible. The final
3 lines specify the mutual exclusion constraints. Note that the constraints
are not transitive, so it is not sufficient to include only the first 2
of these constraints. However, the order of the template IDs in a mutual
exclusion constraint does not matter. There is no need to specify both

        F   3   4
        F   4   3
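
The constraint file format described above can be read with a short Python
sketch such as the following. It is not part of GenoMS: the function name,
the error handling, the skipping of blank lines, and the choice to split on
any whitespace instead of strictly on tabs are assumptions for illustration.

```python
def parse_constraints(text, num_templates):
    """Parse constraint lines into (order, exclusion) sets.

    order holds (first, second) tuples, where direction matters.
    exclusion holds unordered pairs, so 'F 3 4' and 'F 4 3' are the same.
    num_templates is the number of sequences in the FASTA file.
    """
    order, exclusion = set(), set()
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns")
        kind, a, b = fields[0], int(fields[1]), int(fields[2])
        # Template IDs are 0-based positions in the FASTA file.
        if not (0 <= a < num_templates and 0 <= b < num_templates):
            raise ValueError(f"line {line_no}: template ID out of range")
        if kind == "O":
            order.add((a, b))
        elif kind == "F":
            exclusion.add(frozenset((a, b)))
        else:
            raise ValueError(f"line {line_no}: unknown constraint type {kind!r}")
    return order, exclusion
```

Storing mutual exclusion constraints as unordered pairs mirrors the point
made above: listing a pair in both orders adds nothing.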
    

Once the template sequence database and the constraint file have been built, they are specified in the config file as

        dbcombined,MyDatabase.fasta
        templateconstraintfile,MyTemplateConstraints.txt