wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
riddles >> cs >> Re: substring search
(Message started by: sameer_r on Apr 25th, 2003, 2:31pm)

Title: Re: substring search
Post by sameer_r on Apr 25th, 2003, 2:31pm
Hi everyone,

I didn't see any message posted regarding the substring search problem.  Pardon me if a thread already exists for it.

I was thinking about a solution to the problem and came up with the following algorithm (only time efficient)

Say A is the string to be searched in the string B.
for e.g.
A = "xyxz"
B = "xyzxyxz"

Sort A and Sort B keeping track of the original indices of each character.
A = "xxyz" with original indices 1,3,2,4
B = "xxxyyzz" with original indices 1,4,6,2,5,3,7

Allocate an integer array (counter) equal to the size of B.

Now for every character in B, do a binary search to locate the corresponding character in A (there could be more than one occurence of the same character).

Say for the first x in B, we locate the first x in A with original index 1. (since its sorted the other x's would be easy to find by just increasing the index until we came to a character > x)

Now if this character in B were to be a part of the substring A, then the beginning of the string would have been (original index of this character in B - original index of this character in A + 1).

In the case of the first x in A: 1-1+1
or in the case of second x in A : 1-3+1<0 (hence cannot be)

This means that if the first x in B were a part of the substring A, then it could be the first character of A and not the 3rd.

Now if this x is the first character of substring A, then 1 should be the index from which the substring starts in B and hence in Count array we increment Count[1] by 1.

Finally when we are done looking at all the elements of B, then we scan through the Count array and all those offsets that have the value equal to the length of substring A, are in fact the beginning locations of the substring A in B.

Thanks,
Sameer.

Title: Re: substring search
Post by Papa Homer on Apr 25th, 2003, 3:06pm
I am feeling a bit slow right now, so could you provide some more details for you solution?  Specifically, the runtime complexity for each step as well as an explaination of why it actually works.

Thanks.

Title: Re: substring search
Post by sameer_r on Apr 25th, 2003, 3:13pm
The algorithm has the following steps:

sort A and B
 sorting A theta(m)
 sorting B theta(n)
 assuming that we know in advance the range of characters that the strings can contain.

go thru each element of B (n)
 search that element in A theta(lg(n))
 go thru the matching element in A (at worst m) (this part I am not sure of. looks like an average case analysis like in quick sort is needed)

finally go thru the count array and find the offsets where your substring starts in B. theta(n)

so i guess the complexity in worst case comes out to be theta(nm), though I think it should be better than that.

Title: Re: substring search
Post by Papa Homer on Apr 25th, 2003, 3:34pm
Yeah, that's the complexity of a brute force method: for every character in B, see if the next m characters make up A.

Title: Re: substring search
Post by sameer_r on Apr 25th, 2003, 4:06pm
For Brute force method, the run-time is always theta(nm), but for this algo, the worst case is theta(nm).  It could perform better.  Even for quick-sort the worst case run-time is theta(n^2), but average case is theta(nlgn), which is pretty good, in fact the theoritically the best possible for comparison sort.

I still hope (fingers crossed) that this algorithm is better than brute force.

Title: Re: substring search
Post by Papa Homer on Apr 25th, 2003, 4:18pm
That's not entire correct.  For the brute force method the best case is m (if A appears in the beginning of B).  However, the worst and AVERAGE cases are still mn.  So, the question is, what is the average performance of your algorithm on randomk data.

Title: Re: substring search
Post by towr on Apr 26th, 2003, 5:25am
there is a sublinear search algorithm for this problem..
You need a little preprocessing (linear time) and a few simple tricks..
Of course without my book I can't really describe it adequatly..

Title: Re: substring search
Post by towr on Apr 27th, 2003, 1:37pm
I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space)

[hide]Suppose we start with two strings

A = abbac
B = abcaccbcabcbabbaccbabcccabababbabbcc
start comparing from the back of A

abcaccbcabcbabbaccbabcccabababbabbcc
abbac
 xxx
since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters

Compare at that point again

abcaccbcabcbabbaccbabcccabababbabbcc
    abbac
 xxx    x
it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again

abcaccbcabcbabbaccbabcccabababbabbcc
      abbac
 xxx    x x
same again as before

abcaccbcabcbabbaccbabcccabababbabbcc
        abbac
 !xx    x x x
and again, compare, you can skip the substring (one character in this case) you allready know is there

abcaccbcabcbabbaccbabcccabababbabbcc
           abbac
 xxx    x xXxXXX
done

best case when there is a match m; best case when there isn't n/m; on avarage  < n, worst case n+m (with preprocessing)
[/hide]

Title: Re: substring search
Post by akasina9 on Nov 17th, 2012, 6:30am

on 04/27/03 at 13:37:55, towr wrote:
I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space)

[hide]Suppose we start with two strings

A = abbac
B = abcaccbcabcbabbaccbabcccabababbabbcc
start comparing from the back of A

abcaccbcabcbabbaccbabcccabababbabbcc
abbac
 xxx
since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters

Compare at that point again

abcaccbcabcbabbaccbabcccabababbabbcc
    abbac
 xxx    x
it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again

abcaccbcabcbabbaccbabcccabababbabbcc
      abbac
 xxx    x x
same again as before

abcaccbcabcbabbaccbabcccabababbabbcc
        abbac
 !xx    x x x
and again, compare, you can skip the substring (one character in this case) you allready know is there

abcaccbcabcbabbaccbabcccabababbabbcc
           abbac
 xxx    x xXxXXX
done

best case when there is a match m; best case when there isn't n/m; on avarage  < n, worst case n+m (with preprocessing)
[/hide]


I think towr was referring to the Boyer Moore algorithm.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm.

Title: Re: substring search
Post by towr on Nov 17th, 2012, 7:35am
Probably, yeah.
God, that was long ago... Did we even have wikipedia back then?

Title: Re: substring search
Post by TenaliRaman on Nov 23rd, 2012, 11:13am
We did have wikipedia back then, but it wasn't popular as it is now nor was it as rich in information as it is today. The oldest wikipedia article on Boyer Moore that wayback machine has is around 2007. So, I guess its reasonable enough to say that Boyer Moore article wasn't around when you wrote your answer :D

Title: Re: substring search
Post by birbal on Dec 13th, 2012, 10:39am

on 11/17/12 at 07:35:08, towr wrote:
Probably, yeah.
God, that was long ago... Did we even have wikipedia back then?

Isn't this algo kmp ?
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

Title: Re: substring search
Post by towr on Dec 13th, 2012, 11:19am
No, I don't think it's that one.
I can't really be bothered to get into it, but Boyer Moore seems more like it.



Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board