The template database construction is a crucial part of the GenoMS
algorithm. The database contains sequences, called 'templates', which,
through a process of mutation, insertion, and deletion, may be present in our
biological sample. In addition, constraints on the templates specify how
the templates may be used together.
There are two types of constraints which can be specified: order
constraints and mutual exclusion constraints.
GenoMS is able to infer the constraints from 3 different types of input
sequence files: genome sequences, template class files, and user-created
constraint files. Exactly one of these types of databases must be input to
GenoMS.
In some cases, protein sequences may not be available. The genomic
sequence can then be used by GenoMS as the source of templates. The genome
sequence is translated in all six frames, and each translated portion between
stop codons becomes a template. Templates are mutually exclusive if their
genomic coordinates overlap or if they lie on different strands. The order of
two templates is determined by the order of their genomic coordinates. The
genome sequence can be specified as a FASTA file in the config file (see Usage).
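The six-frame translation described above can be illustrated with a short
Python sketch. This is not the GenoMS implementation; the names Template,
six_frame_templates, mutually_exclusive, and precedes are hypothetical, the
standard genetic code is assumed, and comparing start coordinates to decide
order is an assumption about how "order of genomic coordinates" is applied.

```python
from collections import namedtuple
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Hypothetical record: start/end are forward-strand coordinates, end exclusive.
Template = namedtuple("Template", "strand start end peptide")


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_templates(dna):
    """Translate all six frames; every stretch between stop codons is a template."""
    dna = dna.upper()
    n = len(dna)
    templates = []

    def emit(strand, start, peptide):
        if not peptide:
            return
        end = start + 3 * len(peptide)
        if strand == "-":
            # Map reverse-strand coordinates back onto the forward strand.
            start, end = n - end, n - start
        templates.append(Template(strand, start, end, "".join(peptide)))

    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            peptide, start = [], frame
            for i in range(frame, n - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X: ambiguous codon
                if aa == "*":
                    emit(strand, start, peptide)
                    peptide, start = [], i + 3
                else:
                    peptide.append(aa)
            emit(strand, start, peptide)
    return templates


def mutually_exclusive(a, b):
    """Exclusive if on different strands or if the coordinates overlap."""
    return a.strand != b.strand or (a.start < b.end and b.start < a.end)


def precedes(a, b):
    """Assumed ordering rule: compare forward-strand start coordinates."""
    return a.start < b.start
```

For example, six_frame_templates("ATGGCCTAAGGG") contains a forward-strand
template "MA" covering bases 0 to 6 and another, "G", covering bases 9 to 12;
these two do not overlap, so under the rule above they may be used together.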
Some types of protein sequences can be divided into several segments.
For example, immunoglobulin heavy chain proteins are composed of 4 distinct
segments: the variable region, the diversity region, the joining region, and
the constant region. Each protein contains 1 of each of these segments, but
there are hundreds of choices for each segment. Each segment of a given type
is mutually exclusive (i.e. only one variable region can be present in the
heavy chain). In addition, the segment types appear in order in the final
protein (i.e. the variable region always precedes the diversity region).
We call each of these segment types a 'template class'. Specifically,
protein sequences in the same template class are mutually exclusive, and
protein sequences in a template class which precedes another template
class must precede the protein sequences in that second template class.
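The two rules above can be sketched in Python as follows. This is only an
illustration of the inference, not the GenoMS code: the function name
infer_constraints and the representation of a template class as a list of
template IDs are assumptions made for the example.

```python
from itertools import combinations


def infer_constraints(template_classes):
    """Derive constraints from template classes listed in protein order.

    template_classes: list of lists of template IDs (hypothetical layout).
    Returns (order, exclusive): order holds (earlier, later) pairs,
    exclusive holds unordered pairs as frozensets.
    """
    order, exclusive = set(), set()

    # Templates in the same class are mutually exclusive.
    for members in template_classes:
        for a, b in combinations(members, 2):
            exclusive.add(frozenset((a, b)))

    # Templates in an earlier class precede templates in every later class.
    for i, earlier in enumerate(template_classes):
        for later in template_classes[i + 1:]:
            for a in earlier:
                for b in later:
                    order.add((a, b))

    return order, exclusive
```

With classes [["V1", "V2"], ["D1"], ["J1", "J2"]], the pair ("V1", "D1") is an
order constraint and {"V1", "V2"} is a mutual exclusion constraint.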
Protein sequences can be submitted to GenoMS in the form of template
classes, allowing the constraints to be inferred. This is done by using the
'dbrootname' line in the config file. The usage can best be explained by an
example.
Suppose we are trying to sequence the heavy chain of an immunoglobulin.
There are 4 template classes, and we create 4 FASTA files. Each FASTA file
contains all of the protein sequences for each template class. One file
contains all possible variable region sequences, another file contains all
possible joining region sequences, etc. Each file is named with the same
root name, but the extension indicates the order. In this case, we may have
HeavyChainSequences.1 containing the variable region sequences,
HeavyChainSequences.2 containing the diversity region sequences,
HeavyChainSequences.3 containing the joining region sequences, and
HeavyChainSequences.4 containing the constant region sequences. The extension
numbers must be consecutive and begin with the number 1. In the config file,
we would specify
dbrootname,HeavyChainSequences
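The naming rule for template class files (same root name, extensions that are
consecutive and begin with 1) can be checked before running GenoMS. The helper
below is a hypothetical pre-flight check written for this document, not part
of GenoMS; find_class_files and its directory argument are assumed names.

```python
from pathlib import Path


def find_class_files(rootname, directory="."):
    """Return rootname.1, rootname.2, ... in order, or raise if numbering is broken."""
    directory = Path(directory)
    files = []
    index = 1
    while (directory / f"{rootname}.{index}").is_file():
        files.append(directory / f"{rootname}.{index}")
        index += 1
    if not files:
        raise FileNotFoundError(f"{rootname}.1 not found in {directory}")

    # Any other numbered file with this root means the extensions have a gap
    # or do not start at 1.
    found = {p.name for p in files}
    stray = sorted(p.name for p in directory.glob(f"{rootname}.*")
                   if p.suffix[1:].isdigit() and p.name not in found)
    if stray:
        raise ValueError(f"non-consecutive class files: {stray}")
    return files
```

Calling find_class_files("HeavyChainSequences") in a directory holding the four
files from the example returns them in class order.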
Suppose we are trying to sequence 2 proteins at the same time; the heavy
chain and the light chain of an immunoglobulin. The selection of light chain
templates is completely independent of the heavy chain templates, so they
form their own set of template classes, which we divide into 3 FASTA files:
LightChainSequences.1, LightChainSequences.2, and LightChainSequences.3. In
the config file we will have 2 lines
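dbrootname,HeavyChainSequences
dbrootname,LightChainSequences
While constraints are created between heavy chain proteins and heavy chain
proteins, and between light chain proteins and light chain proteins, no
constraints are created between heavy chain proteins and light chain proteins.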
If your experiment does not fit into one of the constraint inference
frameworks described above, you can create your own constraint file. The
template sequence file is a standard FASTA formatted file. The constraints
file is a 3-column, tab-delimited file with one line for each constraint. The
first column of a line specifying an order constraint is a capital letter 'O'.
The first column of a line specifying a mutual exclusion constraint is a
capital letter 'F'. The second and third columns specify the pair of template
IDs on which the constraint is imposed. The template ID is the zero-based
position of the protein in the FASTA file (i.e. the first sequence in the
FASTA file has ID 0). Here is an example constraint file.
O 0 1
O 1 2
O 0 2
F 0 3
F 3 4
F 0 4
The file specifies 6 constraints: 3 order constraints and 3 mutual exclusion
constraints.
F 3 4
F 4 3
Once the template sequence database and the constraint file have been built, they are specified in the config file as
dbcombined,MyDatabase.fasta
templateconstraintfile,MyTemplateConstraints.txt
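Before running GenoMS it can be useful to check a hand-written constraint file
against the format described above. The following Python sketch is one way to
do so; parse_constraints and its num_templates argument are hypothetical names,
and the parser accepts any whitespace between columns (an assumption made so
that space-separated copies of the file are also readable).

```python
def parse_constraints(text, num_templates=None):
    """Parse constraint lines into (order, exclusive) lists of ID pairs.

    'O' lines are order constraints, 'F' lines are mutual exclusion
    constraints. IDs are zero-based positions in the FASTA file.
    """
    order, exclusive = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        kind = fields[0]
        first, second = int(fields[1]), int(fields[2])
        if num_templates is not None:
            for template_id in (first, second):
                if not 0 <= template_id < num_templates:
                    raise ValueError(f"line {lineno}: unknown template ID {template_id}")
        if kind == "O":
            order.append((first, second))
        elif kind == "F":
            exclusive.append((first, second))
        else:
            raise ValueError(f"line {lineno}: unknown constraint type {kind!r}")
    return order, exclusive
```

Passing the contents of MyTemplateConstraints.txt together with the number of
sequences in MyDatabase.fasta reports malformed lines and out-of-range IDs
before GenoMS is run.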