wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
riddles >> cs >> RegEx: ab, ba
(Message started by: william wu on Feb 24th, 2003, 2:39pm)

Title: RegEx: ab, ba
Post by william wu on Feb 24th, 2003, 2:39pm
Write a regular expression (or equivalently, construct a deterministic finite automaton) that accepts strings over the alphabet { a, b } in which the number of occurrences of 'ab' = number of occurrences of 'ba'.



Note 1: What's a regular expression? Basically, it matches character strings that fit a certain pattern. You are restricted to the operators OR, CONCATENATE, and Kleene Star. And you can use all the parentheses you want. http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=regular+expressions

Note 2: This problem often trips people up because regular expressions are not supposed to have unbounded counting ability. There's a one-to-one correspondence between a regular expression and finite state machines, and the ability to count arbitrarily high requires an unbounded amount of state space. It follows that you cannot make a regular expression that matches all 0-1 strings of the form 0n1n, where n is an integer, and the notation 0n refers to the string of n consecutive zeroes. However, in this case there's something slightly cunning you can observe which will allow you to realize a Regular Expression.

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 24th, 2003, 7:26pm
I am not particularly knowledgable about CS topics so I usually leave this forum alone, but this one touches on something I use all the time!

A couple of questions though:

Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?

When you say that only the Kleene star, OR and CONCATENATE operators are allowed, are you also disallowing anchors? Or is the RE always required to match the entire pattern, not just some part of it?

If I can force the RE to match the entire pattern, then I know the solution.

Title: Re: RegEx: ab, ba
Post by Jeremiah Smith on Feb 24th, 2003, 8:46pm

on 02/24/03 at 19:26:49, Icarus wrote:
Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?


The Perl one (matches the preceding pattern any number of times).

Title: Re: RegEx: ab, ba
Post by william wu on Feb 24th, 2003, 10:21pm
I should probably add that by definition, a string is a finite sequence of characters.


on 02/24/03 at 19:26:49, Icarus wrote:
Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?

When you say that only the Kleene star, OR and CONCATENATE operators are allowed, are you also disallowing anchors? Or is the RE always required to match the entire pattern, not just some part of it?


> Kleene star = Matches the preceding pattern any number of times.
> No anchors are needed ... I don't see how they apply here. I presume by anchors you mean the operators ^ $ \b \B, for beginning and ends of strings, and word boundaries. Since the problem statement says the alphabet consists only of the letters a and b, there are no space characters to create word boundaries.

So basically the universe of strings you have to worry about are strings of a's and b's. If the number of occurrences of ab in the whole string is equal to the number of occurrences of ba, your RegEx should accept that string.

Title: Re: RegEx: ab, ba
Post by william wu on Feb 24th, 2003, 10:25pm
Follow up Question:

Prove that it is impossible to write a regular expression (or equivalently, construct a deterministic finite automaton) that accepts strings over the alphabet { a, b, c } in which the number of occurrences of 'ab' = number of occurrences of 'ba'.  


Note: This might be too hard to do if you haven't heard of something in CS theory called the [hide]pumping lemma[/hide], but it'd be interesting to see if an unacquainted mind ends up rediscovering it while thinking about this problem.

Title: Re: RegEx: ab, ba
Post by James Fingas on Feb 25th, 2003, 9:01am
Original Problem:
[hide]The only reason you can solve the problem when the alphabet is {a,b} is that the occurrances of ab and the occurrances of ba must alternate. The regexp is simple:

a*a or b*b[/hide]

However, in the case where the language is {a,b,c}, then your automaton must be able to count indefinitely high (and no finite automaton can do this).

Proof:
Consider the string abcabcabc ... cbacbacba, where there are m repetitions of "abc" and n repetitions of "cba". When the automaton has received all the "abc"s, it must be able to wait for m repetitions of "cba", accept the string if it ends there, or reject the string if it doesn't. As each of these m "cba"s passes, it must enter a different state (this should be obvious). However, there is no limit to how high m can be, so there is no limit to the number of states that such a successful automaton would need (ie. a successful automaton would not be finite).

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 25th, 2003, 3:50pm
Given the definition of the Kleene Star, the solution would have to be [hide] (a(a|b)*a)|(b(a|b)*b)    ( | = OR )[/hide].

As for the need for anchors: most implementations I know of, this will still match "abab" since there is nothing to require it to match the whole string. It will match the "aba" portion and ignore the rest. I don't see how this can be solved without some means of forcing a full string match. Either by (1) having your REs always required to match the whole string, or (2) allowing anchors:[hide] ^(a(a|b)*a)|(b(a|b)*b)$ [/hide]

Title: Re: RegEx: ab, ba
Post by TimMann on Feb 26th, 2003, 12:18am
In the theory of regular expressions, the language defined by a regular expression is the set of strings that match the regular expression. That is, the whole string matches the regular expression, not a substring. William was assuming people knew that for this problem.

Grep's regular expressions are an extension of the kind usually used in the theory. One way to look at the "anchoring" issue is that grep gives you an implicit ".*" at the beginning and end of the pattern unless you start it with ^ and/or end it with $. Here "." itself is a grep extension, meaning of course the disjunction of all characters in the alphabet; that is, (a|b|...).

Title: Re: RegEx: ab, ba
Post by towr on Feb 26th, 2003, 12:36am

on 02/25/03 at 15:50:48, Icarus wrote:
Given the definition of the Kleene Star, the solution would have to be [hide] (a(a|b)*a)|(b(a|b)*b)    ( | = OR )[/hide].

This doesn't match "a", or "b" or "" for that matter..

so how about: [hide](a (b+ a+)* | b (a+ b+)* |)[/hide]

here's an interesting link if you like regular expressions:
http://odur.let.rug.nl/~vannoord/fsademo/
watch the syntax though, ab => [a,b] and a|b => {a,b} and the empty string is {}
It also functions as tranducer, in case you want to translate..

Title: Re: RegEx: ab, ba
Post by william wu on Feb 26th, 2003, 12:40am
Regarding Icarus's solution: Almost ... forgetting some special cases. For instance, the string "a" has zero occurrences of both 'ab' and 'ba', and thus should be in our language. Likewise, the empty string should be accepted. The empty string is usually referred to in RegEx theory as epsilon, and is another symbol available to us which I forgot to mention.

The intuition is right though -- the key observation is [hide]every string in the language must start with the same letter it ends with.[/hide]

Title: Re: RegEx: ab, ba
Post by william wu on Feb 26th, 2003, 12:44am
Oops. towr sneaked in while I was posting :)

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 26th, 2003, 8:03pm

on 02/26/03 at 00:18:57, TimMann wrote:
In the theory of regular expressions, the language defined by a regular expression is the set of strings that match the regular expression. That is, the whole string matches the regular expression, not a substring. William was assuming people knew that for this problem.


As I said, I don't know much CS theory. As far as actual RE implementations go, grep is hardly alone. About the only things I've seen that do require matching the whole string are shells.

Title: Re: RegEx: ab, ba
Post by rmsgrey on May 15th, 2003, 9:36pm

on 02/26/03 at 00:36:17, towr wrote:
[hide](a (b+ a+)* | b (a+ b+)* |)[/hide]

doesn't work - rejects aaba

quick fix:
[hide] (a+(b*a+)*|b+(a*b+)*|e)[/hide]
using (e) to denote empty string and noting (a+) is shorthand for (aa*) (the original problem ruled out using anything but the three basic operations)

Title: Re: RegEx: ab, ba
Post by S. Owen on May 26th, 2003, 7:41am
I think that answer is right... I worked out the state machine for this and came up with this alternate regular expression, which may be a little simpler:

[hide](   e   |   a ( a | b+ a )*   |   b  ( b | a+ b )*   )[/hide]

Title: Re: RegEx: ab, ba
Post by rmsgrey on Aug 14th, 2003, 3:32pm

on 05/26/03 at 07:41:45, S. Owen wrote:
I think that answer is right... I worked out the state machine for this and came up with this alternate regular expression, which may be a little simpler:
*'hide'd portion removed*


::[hide]
(e|a(b*a)*|b(a*b)*)[/hide]
::
seems simpler still. In fact, I'm going to go out on a limb and say I reckon it's the simplest possible - you need to cover the three cases ([hide]e, a..., b...[/hide]) and in the latter two, you need to ensure that each run of the wrong letter is followed by one of the right one.



Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board