wu :: forums - Print Page


    
      
        wu :: forums
        (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
      

        riddles >> cs >> RegEx: ab, ba
        
(Message started by: william wu on Feb 24^th, 2003, 2:39pm)

Title: RegEx: ab, ba
Post by william wu on Feb 24^th, 2003, 2:39pm

Write a regular expression (or equivalently, construct a deterministic finite automaton) that accepts strings over the alphabet { a, b } in which the number of occurrences of 'ab' = number of occurrences of 'ba'.

Note 1: What's a regular expression? Basically, it matches character strings that fit a certain pattern. You are restricted to the operators OR, CONCATENATE, and Kleene Star. And you can use all the parentheses you want. http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=regular+expressions

Note 2: This problem often trips people up because regular expressions are not supposed to have unbounded counting ability. There's a one-to-one correspondence between a regular expression and finite state machines, and the ability to count arbitrarily high requires an unbounded amount of state space. It follows that you cannot make a regular expression that matches all 0-1 strings of the form 0ⁿ1ⁿ, where n is an integer, and the notation 0ⁿ refers to the string of n consecutive zeroes. However, in this case there's something slightly cunning you can observe which will allow you to realize a Regular Expression.

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 24^th, 2003, 7:26pm

I am not particularly knowledgable about CS topics so I usually leave this forum alone, but this one touches on something I use all the time!

A couple of questions though:

Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?

When you say that only the Kleene star, OR and CONCATENATE operators are allowed, are you also disallowing anchors? Or is the RE always required to match the entire pattern, not just some part of it?

If I can force the RE to match the entire pattern, then I know the solution.

Title: Re: RegEx: ab, ba
Post by Jeremiah Smith on Feb 24^th, 2003, 8:46pm

on 02/24/03 at 19:26:49, Icarus wrote:

Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?

The Perl one (matches the preceding pattern any number of times).

Title: Re: RegEx: ab, ba
Post by william wu on Feb 24^th, 2003, 10:21pm

I should probably add that by definition, a string is a finite sequence of characters.

on 02/24/03 at 19:26:49, Icarus wrote:

Is the "Kleene star" * the star in most shells (matches any pattern), or the star of most UNIX tools, PERL, etc (matches the preceding pattern any number of times)?

When you say that only the Kleene star, OR and CONCATENATE operators are allowed, are you also disallowing anchors? Or is the RE always required to match the entire pattern, not just some part of it?

> Kleene star = Matches the preceding pattern any number of times.
> No anchors are needed ... I don't see how they apply here. I presume by anchors you mean the operators ^ $ \b \B, for beginning and ends of strings, and word boundaries. Since the problem statement says the alphabet consists only of the letters a and b, there are no space characters to create word boundaries.

So basically the universe of strings you have to worry about are strings of a's and b's. If the number of occurrences of ab in the whole string is equal to the number of occurrences of ba, your RegEx should accept that string.

Title: Re: RegEx: ab, ba
Post by william wu on Feb 24^th, 2003, 10:25pm

Follow up Question:

Prove that it is impossible to write a regular expression (or equivalently, construct a deterministic finite automaton) that accepts strings over the alphabet { a, b, c } in which the number of occurrences of 'ab' = number of occurrences of 'ba'.

Note: This might be too hard to do if you haven't heard of something in CS theory called the [hide]pumping lemma[/hide], but it'd be interesting to see if an unacquainted mind ends up rediscovering it while thinking about this problem.

Title: Re: RegEx: ab, ba
Post by James Fingas on Feb 25^th, 2003, 9:01am

Original Problem:
[hide]The only reason you can solve the problem when the alphabet is {a,b} is that the occurrances of ab and the occurrances of ba must alternate. The regexp is simple:

a*a or b*b[/hide]

However, in the case where the language is {a,b,c}, then your automaton must be able to count indefinitely high (and no finite automaton can do this).

Proof:
Consider the string abcabcabc ... cbacbacba, where there are m repetitions of "abc" and n repetitions of "cba". When the automaton has received all the "abc"s, it must be able to wait for m repetitions of "cba", accept the string if it ends there, or reject the string if it doesn't. As each of these m "cba"s passes, it must enter a different state (this should be obvious). However, there is no limit to how high m can be, so there is no limit to the number of states that such a successful automaton would need (ie. a successful automaton would not be finite).

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 25^th, 2003, 3:50pm

Given the definition of the Kleene Star, the solution would have to be [hide] (a(a|b)*a)|(b(a|b)*b) ( | = OR )[/hide].

As for the need for anchors: most implementations I know of, this will still match "abab" since there is nothing to require it to match the whole string. It will match the "aba" portion and ignore the rest. I don't see how this can be solved without some means of forcing a full string match. Either by (1) having your REs always required to match the whole string, or (2) allowing anchors:[hide] ^(a(a|b)*a)|(b(a|b)*b)$ [/hide]

Title: Re: RegEx: ab, ba
Post by TimMann on Feb 26^th, 2003, 12:18am

In the theory of regular expressions, the language defined by a regular expression is the set of strings that match the regular expression. That is, the whole string matches the regular expression, not a substring. William was assuming people knew that for this problem.

Grep's regular expressions are an extension of the kind usually used in the theory. One way to look at the "anchoring" issue is that grep gives you an implicit ".*" at the beginning and end of the pattern unless you start it with ^ and/or end it with $. Here "." itself is a grep extension, meaning of course the disjunction of all characters in the alphabet; that is, (a|b|...).

Title: Re: RegEx: ab, ba
Post by towr on Feb 26^th, 2003, 12:36am

on 02/25/03 at 15:50:48, Icarus wrote:

Given the definition of the Kleene Star, the solution would have to be [hide] (a(a|b)*a)|(b(a|b)*b) ( | = OR )[/hide].

This doesn't match "a", or "b" or "" for that matter..

so how about: [hide](a (b+ a+)* | b (a+ b+)* |)[/hide]

here's an interesting link if you like regular expressions:
http://odur.let.rug.nl/~vannoord/fsademo/
watch the syntax though, ab => [a,b] and a|b => {a,b} and the empty string is {}
It also functions as tranducer, in case you want to translate..

Title: Re: RegEx: ab, ba
Post by william wu on Feb 26^th, 2003, 12:40am

Regarding Icarus's solution: Almost ... forgetting some special cases. For instance, the string "a" has zero occurrences of both 'ab' and 'ba', and thus should be in our language. Likewise, the empty string should be accepted. The empty string is usually referred to in RegEx theory as epsilon, and is another symbol available to us which I forgot to mention.

The intuition is right though -- the key observation is [hide]every string in the language must start with the same letter it ends with.[/hide]

Title: Re: RegEx: ab, ba
Post by william wu on Feb 26^th, 2003, 12:44am

Oops. towr sneaked in while I was posting :)

Title: Re: RegEx: ab, ba
Post by Icarus on Feb 26^th, 2003, 8:03pm

on 02/26/03 at 00:18:57, TimMann wrote:

As I said, I don't know much CS theory. As far as actual RE implementations go, grep is hardly alone. About the only things I've seen that do require matching the whole string are shells.

Title: Re: RegEx: ab, ba
Post by rmsgrey on May 15^th, 2003, 9:36pm

on 02/26/03 at 00:36:17, towr wrote:

[hide](a (b+ a+)* | b (a+ b+)* |)[/hide]

doesn't work - rejects aaba

quick fix:
[hide] (a+(b*a+)*|b+(a*b+)*|e)[/hide]
using (e) to denote empty string and noting (a+) is shorthand for (aa*) (the original problem ruled out using anything but the three basic operations)

Title: Re: RegEx: ab, ba
Post by S. Owen on May 26^th, 2003, 7:41am

I think that answer is right... I worked out the state machine for this and came up with this alternate regular expression, which may be a little simpler:

[hide]( e | a ( a | b+ a )* | b ( b | a+ b )* )[/hide]

Title: Re: RegEx: ab, ba
Post by rmsgrey on Aug 14^th, 2003, 3:32pm

on 05/26/03 at 07:41:45, S. Owen wrote:

I think that answer is right... I worked out the state machine for this and came up with this alternate regular expression, which may be a little simpler:
*'hide'd portion removed*

::[hide]
(e|a(b*a)*|b(a*b)*)[/hide]
::
seems simpler still. In fact, I'm going to go out on a limb and say I reckon it's the simplest possible - you need to cover the three cases ([hide]e, a..., b...[/hide]) and in the latter two, you need to ensure that each run of the wrong letter is followed by one of the right one.