Maya Gorel, Kirzhner V.M.
Degenerate Coding and Sequence Compacting
Preprint series: ESI preprints
MSC:
92D20 Protein sequences, DNA sequences
Abstract: Superposition of signals in DNA molecule is a sufficiently general principle
of information coding. The degeneracy of the code is necessary for the
effective realization of this principle, with which the possibility to
transmit differently certain information also allows to place some other
information on the same DNA fragment. Code words that are equivalent in the
information sense -- synonyms -- form synonymous group; and the entire set
of code words is broken into the synonymous groups. This paper is dedicated
to construction and analysis of the model of synonymous coding. Sequence
containing one word from each synonymous group is selected as the model
signal. Meaningfully such sequences can be examined, for example, with
respect to the triplet code. Cartesian synonymous groups are selected as a
model of synonymous partition. Namely, assume that there is given a basic
alphabet $A_{ s}$ = {\{}$a_{1}$, {\ldots},$ a_{s}${\}}. Let us examine $q$ of its,
generally speaking, different partitions\textbf{\textit{ A}}$_{i}$ =
{\{}\textbf{a}$_{1,i}$, ..., \textbf{a}$_{ri,i}${\}} where$ i = $1\textit{,{\ldots},q}. Then, by
definition, the set of the words from \textbf{a}$_{j1,1}\times
$\textbf{a}$_{j2,2}\times ${\ldots}$\times $\textbf{a}$_{jq,q}$ forms a
Cartesian synonymous group.


We define and examine Cartesian synonymous partitions and their simplest
properties. It is shown, that the task of constructing the model signal can
be represented as the task of constructing a certain sequence in the special
alphabet, letters of which form the subsets of the initial alphabet. The
necessary concepts of dense sequence in the terms of this alphabet are
introduced and their connection with the dense sequences of the letters of
the basic alphabet is investigated. Transition to the new alphabet removes
some combinatory difficulties related to the selection of the representative
of a synonymous group. This makes possible to apply effectively the
modification of the De Bruijn graph when examining the existence of a dense
signal. It is convenient to investigate the concepts appearing in this case
within the framework of a more general system - the system of abstract
alphabets. This system is a straight generalization of sufficiently real
biological models on the one side. On the other side it makes possible to
describe more complex hypothetical rules of coding. Various conditions of
existence of dense coding are obtained for a such system of abstract
alphabets, based, in particular, on the theorem of Hall about perfect
matchings in a bipartite graph and on the theory of flows in the networks.
It is shown also that the sequences over abstract alphabets cannot always be
transferred into one another by transposition of fragments, in contrast to
the well known property of sequences of letters of a regular alphabet.


The obtained general results are applied to the case of Cartesian synonymous
partition for the words of length 2 and to the triplet code of amino acids.
In the latter case it is proved that the synonymous groups of the amino-acid
code are not always Cartesian, but they can be represented as an association
of a small number of Cartesian synonymous groups. Therefore the analysis of
the amino-acid code is reduced to the independent analysis of 16 versions of
Cartesian synonymous partitions. This decomposition of the amino-acid code
into a series of Cartesian codes made possible not only to analyze an
existence of dense sequences in the triplet code, but also to calculate them
all using a De Bruijn graph.