After working with the intron/exon boundary problem for a while, and examining different approaches to the boundary prediction problem, we became increasingly curious about the existence of uniqueness at such boundaries. In other words, are the sequences around the boundary site any different than the sequences found elsewhere in the gene? Are they just random sequences? A computational experiment is designed to validate the uniqueness of such boundaries using sequence reconstruction method. -- Kelly ========================================================================== Step 1: Two sets of data should be obtained for any given gene sequence. Set 1: CONTROL Group cut the sequence at known intron/exon boundary into k-tuples (k=constant) repeat this step for all the known boundaries. Set 2: RANDOMNESS Introduced cut the sequence at random initial positions into k-tuples repeat this step for [known boundaries] []=cardinality Example: seq = AATTGGCCATGCATGC known boundaries = TT/GG, TG/GC let k = 4 Set 1: after cutting #1: k-tuples = (AATT, GGCC, ATGC, ATGC) after cutting #2: k-tuples = append(AATTG, GCCA, TGCA, TGC) Set 2: let random boundaries = AT/TG, AA/TT after cutting #1: k-tuples = (AAT, TGGC, CATG, CATG, C) after cutting #2: k-tuples = append(AA, TTGG, CCAT, GCAT, GC) -------------------------------------------------------------------------- Step 2: The Reconstruction Step for both data sets Reconstruction of the k-tuples into possible sequences (Euler Path) -------------------------------------------------------------------------- Step 3: Interpretation of Result +: a reconstruction is obtained -: a reconstruction is not obtained Interpretation: +/+: insignificant (maybe hypothesis is wrong) -/-: insignificant (inability to obtain a solution may be due to algorithm problems) +/-: significant (significant because the difference is caused solely by the introduction of randomness) ============================================================================ Progress so far: I spent a lot of time reading other people's approaches. After going through many papers/websites/textbooks, this experiment gradually becomes clear in my mind. After many revisions to the design of this experiment, I came up with the above steps. I know it is far from thorough, and I am continually thinking and revising it as of now. I have a general purpose sequencing by hybridization file written (sbh.c), and tried out some sequences manually. Although exciting, but no conclusive result has been found so far. A fully automated program is more desirable. But given the time constraint, it is still not available yet. ============================================================================= Source Code: sbh.c in the Documentation -- Source Code section type gcc sbh.c to compile (Unix machine) and usage instruction follows code is well commented ============================================================================= Design & Algorithm: By reading in a sequence and the length of the oligomers on the sequencing chip, the program attempts to reconstruct the sequence. Reconstructing the sequence is not a trivial task--even assuming perfect and unique hybridization. So the problem is: given a list of which k-tuples hybridized, how does one solve for the sequence? Assuming perfect and unique hybridization, there will be k-1 nucleotide overlapping among the probes. [Construction of the Hamiltonian Path] An intuitive way of solving this problem is to create a graph, where the each vertex corresponds to a specific k-tuple, and arcs map the overlappy among the k-tuples. If a k-1 nucleotide overlapping is the suffix (last k-1 nucleotides) of a k-tuple t1, and it is also the prefix (first k-1 nucleotides) of a k-tuple t2, then an arc is drawn from t1 to t2. Since all the k-tuples are found in the hybridization, all the vetices of the graph must be visited, therefore mathematically sequence reconstruction is equivalent to Hamiltonian path problem, which has been studied intensively. Once we found the Hamiltonian path in the graph, we can reconstruct the sequence easily. However, such a path may or may not be unique even if the hybridization is perfect and unique. Moreover, for large DNA sequences the overlap graph could become very complicated. If 3-tuple is to be used, there would be at most 4^3=64 vertices in the graph. However, 3-tuple is too weak for sequence reconstruction because the overlaps, which have lenth of 2 and thus only 4^2=16 variations, have a great chance to be the same. So what we plan to use is tuple size of 8. However the graph would have up to 4^8=65,536 vetices and probably more arcs (the simplest Hamiltonian path requires n-1 arcs, where n is the number of vertices, to join all the vertices together, and the real case of SBH would be much more complicated). Hamiltonian path is proved to be NP-complete, so it is not possible to apply for fragments that have hundreds of nucleotides. [Construction of the Eularian Path] A similar problem of Hamiltonian path is Eulerian path, where the goal is to find a path that goes through all the arcs instead of vertices. Simple linear algorithm for Eulerian path exists and the reduction of SBH to Eulerian path problem has been devised. In this approach the overlap graph has all the k-1 tuples (the first and last k-1 nucleotides of k-tuple) as its vertices. A (k-1)-tuple t1 is jointed by an arc to another (k-1)-tuple t2 if the spectrum contains a k-tuple whose last k-1 nucleotides and first k-1 nucleotides coincide with t1 and t2 respectively. Since each arc in the graph corresponds to an ologonucleotide from the spectrum, we need to find an Eulerian path, which contains all the arcs. The algorithm to find an Eulerian path is also well studied and it is linear with respect to the number of arcs. The Eulerian path is much more efficient and feasible.