GUI:
(DW)
Typical gene finders and splice site locators (e.g. Genscan, GlimmerM)
are websites with simple forms, but SSS is a Java application.
The original purpose of writing an application, as opposed to an applet or simply using CGI scripts,
was to allow for expansion of features. Unfortunately, we were unable to add those non-vital functions
(except the About window =D ). The GUI would be able to analyze other annotated files, but the algorithm for this has not been added.
The main parts are operational. The monkey on the project main page is supposed
to be part of the splash window. The splash window was never incorporated because
it slowed the program down and the analysis time was short; it is the repainting
that the jar file seems to have trouble with.
Algorithm:
(KH)
PROGRAM ARCHITECTURE -- coded by Kelly Han for Part I
(Source code by Kelly for Part I and Part II is in the Documentation section:
parse.pl, search.pl, show.pl, sbh.c)
---------
PARSE.PL
---------
This program takes in a gene file (which contains all the nucleotides and
annotated exon indices) and an integer n (which specifies the number
of nucleotides taken on each side of an intron/exon boundary). It parses
the gene file to find the nucleotide pattern at each intron-exon and
exon-intron boundary, and saves the patterns and their occurrence counts
to another file. Optionally, the user can specify a maximum number of
patterns to be saved, so that only the most frequent boundary patterns
are kept. Each boundary pattern includes n nucleotides before and n
nucleotides after the boundary, so its length is actually 2n.
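As a small illustration of the 2n window, here is a sketch in Python (not
the original Perl from parse.pl). The function name boundary_pattern and
the convention that a boundary is the 0-based index of the first nucleotide
after it are assumptions made for this example only.

```python
def boundary_pattern(seq, boundary, n):
    """Return the 2n-long pattern centred on a boundary.

    `boundary` is assumed to be the 0-based index of the first
    nucleotide after the boundary. Returns None when the window
    would run off either end of `seq`.
    """
    start, end = boundary - n, boundary + n
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


if __name__ == "__main__":
    seq = "AAACAGGTAAGTCCC"
    # n = 3 gives 3 nucleotides before and 3 after the boundary.
    print(boundary_pattern(seq, 6, 3))  # CAGGTA
```

With n = 3 the returned pattern has length 6, i.e. 2n.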
In addition to saving the boundary patterns, the program also
calculates the composition of A, T, C, G in all the patterns. The composition
is appended to the end of the pattern file.
The algorithm for the program is fairly simple. It consists of several
steps:
1. get the header of a gene
2. parse the header to get the beginning and end positions of the gene
3. parse the header to get the position of each exon
4. for each exon, get from the gene file the nucleotide pattern at its
preceding and subsequent intron/exon boundaries.
5. for each pattern found in step 4, accumulate its occurrence
6. repeat steps 1-5 until the end of the gene file
7. sort all the patterns according to their occurrence
8. print out all the patterns and their occurrence
9. calculate and print out the composition of A, T, C, G in all the patterns
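Steps 5 and 7-9 can be sketched as follows. This is Python rather than the
original Perl, and it starts from a list of patterns that has already been
extracted, since the header format is not described here. The name
summarize_patterns is hypothetical, and counting the composition over every
occurrence (rather than over distinct patterns) is an assumption.

```python
from collections import Counter


def summarize_patterns(patterns, max_patterns=None):
    """Rank boundary patterns by occurrence and compute base composition.

    Returns (ranked, composition) where `ranked` is a list of
    (pattern, count) pairs, most frequent first, and `composition`
    maps each base to its fraction of all nucleotides seen.
    """
    counts = Counter(patterns)                 # step 5: accumulate occurrences
    ranked = counts.most_common(max_patterns)  # step 7: sort, optionally truncate

    bases = Counter()
    for pattern in patterns:                   # step 9: tally A, T, C, G
        bases.update(pattern)
    total = sum(bases.values())
    composition = {b: bases[b] / total for b in "ATCG"} if total else {}
    return ranked, composition


if __name__ == "__main__":
    ranked, composition = summarize_patterns(["CAGGTA", "CAGGTA", "AAGGTG"])
    for pattern, count in ranked:              # step 8: print patterns
        print(pattern, count)
    for base, fraction in composition.items():
        print(base, round(fraction, 3))
```

Passing max_patterns keeps only the most frequent patterns, which matches
the optional limit described above.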
Since the gene sequence can be big, reading all the nucleotides into memory
and then searching for exons may not be very efficient, so the program
does the search on the fly. A subroutine 'get_seq' is written for this
purpose. It is very much like substr() except that it works on a string
that is in the file. It takes three arguments: $from for the start position
of the sequence, $to for the end position of the sequence, and $total for
the total length of the gene. It maintains a pointer $cur to keep track of
the current position within the gene. $cur is initialized to 0 and will
eventually hit $total. get_seq() first jumps to $from by advancing $cur,
then it keeps getting nucleotides until $cur becomes equal to $to.
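The idea behind get_seq() can be sketched like this. It is a Python
approximation, not the Perl subroutine itself: the class name
SequenceReader, the half-open 0-based range, and the skipping of
whitespace between nucleotides are all assumptions, and the total length
is passed once to the constructor instead of on every call.

```python
import io


class SequenceReader:
    """Forward-only reader over the nucleotides of one gene in a file."""

    def __init__(self, handle, total):
        self.handle = handle
        self.total = total  # total length of the gene
        self.cur = 0        # nucleotides consumed so far

    def _next_base(self):
        """Read one nucleotide, skipping line breaks; None when exhausted."""
        while self.cur < self.total:
            ch = self.handle.read(1)
            if ch == "":
                return None
            if ch.isspace():
                continue
            self.cur += 1
            return ch
        return None

    def get_seq(self, start, end):
        """Return the nucleotides at positions [start, end)."""
        if start < self.cur:
            raise ValueError("cannot move backwards in the file")
        while self.cur < start:        # jump to `start` by advancing cur
            if self._next_base() is None:
                break
        out = []
        while self.cur < end:          # collect until cur reaches `end`
            base = self._next_base()
            if base is None:
                break
            out.append(base)
        return "".join(out)


if __name__ == "__main__":
    reader = SequenceReader(io.StringIO("ACGT\nACGT\nAC"), total=10)
    print(reader.get_seq(2, 6))  # GTAC
```

Because the reader only moves forward, requests have to arrive in
increasing position order.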
The gene sequence can be in either ascending (i.e. beginning position less
than ending position) or descending order (i.e. beginning position greater
than ending position). The program works with both.
I noticed that in some genes' headers, the positions of the exons are not
in order, so I have to sort them first before calling get_seq(), because
I don't want to move the file pointer back and forth.
I also noticed that in some genes' headers, the exon positions fall outside
the total length of the gene. These exons are simply tossed out, because
they cannot be found in the sequence anyway.
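A sketch of this clean-up step, in Python rather than the original Perl.
It assumes an ascending gene and exons given as (start, end) pairs in the
same coordinates as the gene's beginning and end positions; the name
prepare_exons is hypothetical.

```python
def prepare_exons(exons, gene_start, gene_end):
    """Drop exons outside the gene and sort the rest by start position.

    Sorting lets a forward-only reader such as get_seq() visit the
    exons without moving the file pointer backwards.
    """
    kept = [
        (start, end)
        for start, end in exons
        if gene_start <= start and end <= gene_end
    ]
    return sorted(kept)


if __name__ == "__main__":
    exons = [(500, 600), (100, 200), (900, 1200)]
    # (900, 1200) runs past the end of the gene and is tossed out.
    print(prepare_exons(exons, 1, 1000))  # [(100, 200), (500, 600)]
```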
===========================================================================
----------
SEARCH.PL
----------
This program uses the pattern file generated by parse.pl to search a gene
file. The gene file may contain one or multiple genes. Moreover, each
gene may or may not have a header. This way an unknown sequence can be
searched to see if there are any possible intron/exon boundaries. For known
genes, the program compares the search result with the exon positions
listed in the header and reports all false positives and false negatives.
The program can be summarized in the following steps:
1. read the pattern file to form a Perl search pattern
2. if header exists, get the header and parse it to find out all the exons
3. get the entire gene sequence
4. search the gene sequence for the pattern formed in step 1
5. if header exists, offset the boundary position with the gene's beginning
position
6. print out the boundary position
7. if header exists, compare the boundary position with the known exon
positions and report any discrepancies.
8. repeat steps 4-7 until the end of the gene
9. repeat steps 2-8 until the end of the gene file
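Steps 1 and 4-7 can be sketched as follows, in Python rather than the
original Perl. How search.pl actually builds its search pattern is not
described here, so the lookahead alternation below is only one possible
choice. The function names, the rule that a boundary sits n nucleotides
after the start of a match, and the known boundaries being given as a
plain list of positions are all assumptions.

```python
import re


def find_boundaries(seq, patterns, n, offset=0):
    """Return sorted boundary positions where any pattern matches.

    Each pattern has length 2n, so the boundary is taken to be n
    nucleotides after the start of the match. `offset` is the gene's
    beginning position when a header is available (step 5).
    """
    if not patterns:
        return []
    # Step 1: one search pattern from all saved patterns. The lookahead
    # lets overlapping matches be reported as well.
    alternation = "|".join(re.escape(p) for p in patterns)
    regex = re.compile("(?=(" + alternation + "))")
    # Step 4: scan the sequence; a set removes duplicate positions.
    return sorted({m.start() + n + offset for m in regex.finditer(seq)})


def compare(predicted, known):
    """Step 7: return (false_positives, false_negatives)."""
    predicted, known = set(predicted), set(known)
    return sorted(predicted - known), sorted(known - predicted)


if __name__ == "__main__":
    seq = "AAACAGGTAAGTCCCAGGTACC"
    found = find_boundaries(seq, ["CAGGTA"], n=3, offset=100)
    print(found)                         # [106, 117]
    print(compare(found, [106, 200]))    # ([117], [200])
```

For a sequence without a header, offset stays at 0 and the comparison
step is simply skipped.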