After working with the intron/exon boundary problem for a while,
and examining different approaches to the boundary prediction problem,
we became increasingly curious about the existence of uniqueness at
such boundaries. In other words, are the sequences around the boundary
site any different than the sequences found elsewhere in the gene? Are
they just random sequences? A computational experiment is designed to
validate the uniqueness of such boundaries using sequence reconstruction
method.
-- Kelly
==========================================================================
Step 1: Two sets of data should be obtained for any given gene sequence.
Set 1: CONTROL Group
cut the sequence at known intron/exon boundary into k-tuples (k=constant)
repeat this step for all the known boundaries.
Set 2: RANDOMNESS Introduced
cut the sequence at random initial positions into k-tuples
repeat this step for [known boundaries] []=cardinality
Example:
seq = AATTGGCCATGCATGC
known boundaries = TT/GG, TG/GC
let k = 4
Set 1:
after cutting #1: k-tuples = (AATT, GGCC, ATGC, ATGC)
after cutting #2: k-tuples = append(AATTG, GCCA, TGCA, TGC)
Set 2:
let random boundaries = AT/TG, AA/TT
after cutting #1: k-tuples = (AAT, TGGC, CATG, CATG, C)
after cutting #2: k-tuples = append(AA, TTGG, CCAT, GCAT, GC)
--------------------------------------------------------------------------
Step 2: The Reconstruction Step for both data sets
Reconstruction of the k-tuples into possible sequences (Euler Path)
--------------------------------------------------------------------------
Step 3: Interpretation of Result
+: a reconstruction is obtained
-: a reconstruction is not obtained
Interpretation:
+/+: insignificant
(maybe hypothesis is wrong)
-/-: insignificant
(inability to obtain a solution may be due to algorithm problems)
+/-: significant
(significant because the difference is caused solely by the introduction
of randomness)
============================================================================
Progress so far:
I spent a lot of time reading other people's approaches. After going through
many papers/websites/textbooks, this experiment gradually becomes clear in
my mind. After many revisions to the design of this experiment, I came up with
the above steps. I know it is far from thorough, and I am continually thinking
and revising it as of now. I have a general purpose sequencing by hybridization
file written (sbh.c), and tried out some sequences manually. Although exciting,
but no conclusive result has been found so far. A fully automated program is
more desirable. But given the time constraint, it is still not available yet.
=============================================================================
Source Code:
sbh.c in the Documentation -- Source Code section
type gcc sbh.c to compile (Unix machine)
and usage instruction follows
code is well commented
=============================================================================
Design & Algorithm:
By reading in a sequence and the length of the oligomers
on the sequencing chip, the program attempts to reconstruct
the sequence.
Reconstructing the sequence is not a trivial task--even
assuming perfect and unique hybridization. So the problem
is: given a list of which k-tuples hybridized, how does one
solve for the sequence? Assuming perfect and unique
hybridization, there will be k-1 nucleotide overlapping among
the probes.
[Construction of the Hamiltonian Path]
An intuitive way of solving this problem is to create a
graph, where the each vertex corresponds to a specific k-tuple,
and arcs map the overlappy among the k-tuples. If a k-1 nucleotide
overlapping is the suffix (last k-1 nucleotides) of a k-tuple t1,
and it is also the prefix (first k-1 nucleotides) of a k-tuple t2,
then an arc is drawn from t1 to t2. Since all the k-tuples are
found in the hybridization, all the vetices of the graph must be
visited, therefore mathematically sequence reconstruction is
equivalent to Hamiltonian path problem, which has been studied
intensively. Once we found the Hamiltonian path in the graph, we
can reconstruct the sequence easily. However, such a path may or
may not be unique even if the hybridization is perfect and unique.
Moreover, for large DNA sequences the overlap graph could become
very complicated. If 3-tuple is to be used, there would be at most
4^3=64 vertices in the graph. However, 3-tuple is too weak for
sequence reconstruction because the overlaps, which have lenth of
2 and thus only 4^2=16 variations, have a great chance to be the
same. So what we plan to use is tuple size of 8. However the graph
would have up to 4^8=65,536 vetices and probably more arcs
(the simplest Hamiltonian path requires n-1 arcs, where n is the
number of vertices, to join all the vertices together, and the real
case of SBH would be much more complicated). Hamiltonian path is
proved to be NP-complete, so it is not possible to apply for
fragments that have hundreds of nucleotides.
[Construction of the Eularian Path]
A similar problem of Hamiltonian path is Eulerian path, where the
goal is to find a path that goes through all the arcs instead of
vertices. Simple linear algorithm for Eulerian path exists and the
reduction of SBH to Eulerian path problem has been devised. In this
approach the overlap graph has all the k-1 tuples (the first and
last k-1 nucleotides of k-tuple) as its vertices. A (k-1)-tuple t1
is jointed by an arc to another (k-1)-tuple t2 if the spectrum
contains a k-tuple whose last k-1 nucleotides and first k-1 nucleotides
coincide with t1 and t2 respectively. Since each arc in the graph
corresponds to an ologonucleotide from the spectrum, we need to find
an Eulerian path, which contains all the arcs. The algorithm to find
an Eulerian path is also well studied and it is linear with respect
to the number of arcs. The Eulerian path is much more efficient and
feasible.