wu :: forums « wu :: forums - Re: substring search » Welcome, Guest. Please Login or Register. May 19th, 2024, 11:23am RIDDLES SITE WRITE MATH! Home Help Search Members Login Register
 wu :: forums    riddles    cs (Moderators: ThudnBlunder, SMQ, Icarus, Eigenray, william wu, Grimbal, towr)    Re: substring search « Previous topic | Next topic »
 Pages: 1 Reply Notify of replies Send Topic Print
 Author Topic: Re: substring search  (Read 9255 times)
sameer_r
Guest

 Re: substring search   « on: Apr 25th, 2003, 2:31pm » Quote Modify Remove

Hi everyone,

I didn't see any message posted regarding the substring search problem.  Pardon me if a thread already exists for it.

I was thinking about a solution to the problem and came up with the following algorithm (only time efficient)

Say A is the string to be searched in the string B.
for e.g.
A = "xyxz"
B = "xyzxyxz"

Sort A and Sort B keeping track of the original indices of each character.
A = "xxyz" with original indices 1,3,2,4
B = "xxxyyzz" with original indices 1,4,6,2,5,3,7

Allocate an integer array (counter) equal to the size of B.

Now for every character in B, do a binary search to locate the corresponding character in A (there could be more than one occurence of the same character).

Say for the first x in B, we locate the first x in A with original index 1. (since its sorted the other x's would be easy to find by just increasing the index until we came to a character > x)

Now if this character in B were to be a part of the substring A, then the beginning of the string would have been (original index of this character in B - original index of this character in A + 1).

In the case of the first x in A: 1-1+1
or in the case of second x in A : 1-3+1<0 (hence cannot be)

This means that if the first x in B were a part of the substring A, then it could be the first character of A and not the 3rd.

Now if this x is the first character of substring A, then 1 should be the index from which the substring starts in B and hence in Count array we increment Count[1] by 1.

Finally when we are done looking at all the elements of B, then we scan through the Count array and all those offsets that have the value equal to the length of substring A, are in fact the beginning locations of the substring A in B.

Thanks,
Sameer.
 IP Logged
Papa Homer
Guest

 Re: substring search   « Reply #1 on: Apr 25th, 2003, 3:06pm » Quote Modify Remove

I am feeling a bit slow right now, so could you provide some more details for you solution?  Specifically, the runtime complexity for each step as well as an explaination of why it actually works.

Thanks.
 IP Logged
sameer_r
Guest

 Re: substring search   « Reply #2 on: Apr 25th, 2003, 3:13pm » Quote Modify Remove

The algorithm has the following steps:

sort A and B
sorting A theta(m)
sorting B theta(n)
assuming that we know in advance the range of characters that the strings can contain.

go thru each element of B (n)
search that element in A theta(lg(n))
go thru the matching element in A (at worst m) (this part I am not sure of. looks like an average case analysis like in quick sort is needed)

finally go thru the count array and find the offsets where your substring starts in B. theta(n)

so i guess the complexity in worst case comes out to be theta(nm), though I think it should be better than that.
 IP Logged
Papa Homer
Guest

 Re: substring search   « Reply #3 on: Apr 25th, 2003, 3:34pm » Quote Modify Remove

Yeah, that's the complexity of a brute force method: for every character in B, see if the next m characters make up A.
 IP Logged
sameer_r
Guest

 Re: substring search   « Reply #4 on: Apr 25th, 2003, 4:06pm » Quote Modify Remove

For Brute force method, the run-time is always theta(nm), but for this algo, the worst case is theta(nm).  It could perform better.  Even for quick-sort the worst case run-time is theta(n^2), but average case is theta(nlgn), which is pretty good, in fact the theoritically the best possible for comparison sort.

I still hope (fingers crossed) that this algorithm is better than brute force.
 IP Logged
Papa Homer
Guest

 Re: substring search   « Reply #5 on: Apr 25th, 2003, 4:18pm » Quote Modify Remove

That's not entire correct.  For the brute force method the best case is m (if A appears in the beginning of B).  However, the worst and AVERAGE cases are still mn.  So, the question is, what is the average performance of your algorithm on randomk data.
 IP Logged
towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender:
Posts: 13730
 Re: substring search   « Reply #6 on: Apr 26th, 2003, 5:25am » Quote Modify

there is a sublinear search algorithm for this problem..
You need a little preprocessing (linear time) and a few simple tricks..
Of course without my book I can't really describe it adequatly..
 IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender:
Posts: 13730
 Re: substring search   « Reply #7 on: Apr 27th, 2003, 1:37pm » Quote Modify

I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space)

`  A = abbac  B = abcaccbcabcbabbaccbabcccabababbabbcc  `
start comparing from the back of A
`  abcaccbcabcbabbaccbabcccabababbabbcc  abbac    xxx  `
since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters

Compare at that point again
`  abcaccbcabcbabbaccbabcccabababbabbcc       abbac    xxx    x  `
it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again
`  abcaccbcabcbabbaccbabcccabababbabbcc         abbac    xxx    x x  `
same again as before
`  abcaccbcabcbabbaccbabcccabababbabbcc           abbac    !xx    x x x  `
and again, compare, you can skip the substring (one character in this case) you allready know is there
`  abcaccbcabcbabbaccbabcccabababbabbcc              abbac    xxx    x xXxXXX  `
done

best case when there is a match m; best case when there isn't n/m; on avarage  < n, worst case n+m (with preprocessing)
 « Last Edit: Apr 27th, 2003, 1:46pm by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
akasina9
Newbie

x = 0x2B | ~ 0x2B. x is the question

Gender:
Posts: 9
 Re: substring search   « Reply #8 on: Nov 17th, 2012, 6:30am » Quote Modify

on Apr 27th, 2003, 1:37pm, towr wrote:
 I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space)   Suppose we start with two strings` A = abbac B = abcaccbcabcbabbaccbabcccabababbabbcc `start comparing from the back of A` abcaccbcabcbabbaccbabcccabababbabbcc abbac   xxx `since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters   Compare at that point again` abcaccbcabcbabbaccbabcccabababbabbcc      abbac   xxx    x `it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again` abcaccbcabcbabbaccbabcccabababbabbcc        abbac   xxx    x x `same again as before` abcaccbcabcbabbaccbabcccabababbabbcc          abbac   !xx    x x x `and again, compare, you can skip the substring (one character in this case) you allready know is there` abcaccbcabcbabbaccbabcccabababbabbcc             abbac   xxx    x xXxXXX `done   best case when there is a match m; best case when there isn't n/m; on avarage  < n, worst case n+m (with preprocessing)

I think towr was referring to the Boyer Moore algorithm.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm .
 IP Logged
towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender:
Posts: 13730
 Re: substring search   « Reply #9 on: Nov 17th, 2012, 7:35am » Quote Modify

Probably, yeah.
God, that was long ago... Did we even have wikipedia back then?
 IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
TenaliRaman
Uberpuzzler

I am no special. I am only passionately curious.

Gender:
Posts: 1001
 Re: substring search   « Reply #10 on: Nov 23rd, 2012, 11:13am » Quote Modify

We did have wikipedia back then, but it wasn't popular as it is now nor was it as rich in information as it is today. The oldest wikipedia article on Boyer Moore that wayback machine has is around 2007. So, I guess its reasonable enough to say that Boyer Moore article wasn't around when you wrote your answer
 IP Logged

Self discovery comes when a man measures himself against an obstacle - Antoine de Saint Exupery
birbal
Full Member

Gender:
Posts: 250
 Re: substring search   « Reply #11 on: Dec 13th, 2012, 10:39am » Quote Modify

on Nov 17th, 2012, 7:35am, towr wrote:
 Probably, yeah.   God, that was long ago... Did we even have wikipedia back then?

Isn't this algo kmp ?
http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorith m
 IP Logged

The only thing we have to fear is fear itself!
towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender:
Posts: 13730
 Re: substring search   « Reply #12 on: Dec 13th, 2012, 11:19am » Quote Modify

No, I don't think it's that one.
I can't really be bothered to get into it, but Boyer Moore seems more like it.
 IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
 Pages: 1 Reply Notify of replies Send Topic Print

 Forum Jump: ----------------------------- riddles -----------------------------  - easy   - medium   - hard   - what am i   - what happened   - microsoft => cs   - putnam exam (pure math)   - suggestions, help, and FAQ   - general problem-solving / chatting / whatever ----------------------------- general -----------------------------  - guestbook   - truth   - complex analysis   - wanted   - psychology   - chinese « Previous topic | Next topic »