wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
riddles >> cs >> Dirty Comment Regex
(Message started by: playful on Nov 27th, 2011, 7:01pm)

Title: Dirty Comment Regex
Post by playful on Nov 27th, 2011, 7:01pm
Here is a little puzzle I gave myself this morning. It was a lot of fun so I thought I would share it.

The goal is to make a regular expression. The regex has to do with comments delimited with /* and */, as in C and PHP. Some of the comments are "dirty" and you have been asked to clean them up.
Here is an example of a dirty comment embedded in a line of text:
asdf///*_**\\\*****Get Me!*--///*/****//asdf

And here are the rules.

1. The regex must match the whole comment. For instance, for string 1:
I went up the hill /*it was steep*/ last night.
the regex must match "/*it was steep*/" (without the quotes).

2. The regex must capture the comment itself as group 1. For instance, in string 1, capture group 1 must be "it was steep" (without the quotes).

3. The programmer has made a lot of comments where, just after the /* or just before the */, there are a number of special characters, for instance, lots of stars, as in "/***hello***/". These extra characters help her read the code, but your boss thinks they're dirty. You have to remove all leading or trailing "special" characters in this group: {*$ _/+\-} (not including the curly braces). Note that the third character is a space.

This means that in string 2:
asdf///*_**\\\*****Get Me!*--///*/****//asdf
group 1 will only capture "Get Me!" (without the quotes).
In string 3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.

4. There is no capture group beyond group 1.

If anything is unclear, please ask and I will try to clarify.
I will check all answers to the best of my ability.
Enjoy!!!

;D

Title: Re: Dirty Comment Regex
Post by towr on Nov 27th, 2011, 10:37pm

on 11/27/11 at 19:01:57, playful wrote:
In string 3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.
Why wouldn't there be a capture here?
Shouldn't it break down as follows:
   asdf//
   /*
   _**\\\***/-**
   Don't Get-Me!
   *--///
   */
   ***

The match is parts 2-6, the leading/trailing characters are part 3 and 5, and the captured, clean comment is part 4.

Or do you want an ungreedy match?

Title: Re: Dirty Comment Regex
Post by playful on Nov 27th, 2011, 11:18pm
Hi Towr,

Great to hear from you.


Quote:
Or do you want an ungreedy match?


Yes, that's right.

More precisely, I want the match to agree with how a compiler (or code-aware text editor) looks at /**/ comments.

To clarify for anyone else on the same track, on string3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***

for the compiler / interpreter or code-aware text editor, the only /**/ comment is:
/*_**\\\***/

That is the sought match, with no capture.

After checking on two of my editors, I confirm that is how they treat comments on the /**/*/ pattern (they take the leftmost comment).

The thing that may throw off your editor (as it did mine) is that string 3 includes a double slash, which comments out the rest of the line on most editors. But that is not part of the puzzle. ;)

Title: Re: Dirty Comment Regex
Post by playful on Nov 27th, 2011, 11:49pm
Removing all comments, my regex pattern has 57 characters. I wouldn't be surprised if someone here got it tighter.

Title: Re: Dirty Comment Regex
Post by towr on Nov 28th, 2011, 12:16am
I think the following should work
//\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\//U
(where the U is the modifier for ungreedy and the first and last / delimit the regexp)

Might depend a bit on how the regexp is compiled; since there's some ambiguity in what might be captured.

Title: Re: Dirty Comment Regex
Post by playful on Nov 28th, 2011, 12:36am
Hi Towr,

Hey, that was fast!

Here is your string without the delimiters.
/\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\/

To which part(s) of the expression do you want to apply the ungreedy modifiers?

I see three expanding patterns in there.
Just place a U after any of the greedy stars and I will test it for you.

For the record  in PCRE syntax and several others the lazy modifier is a question mark, as in .*?

Title: Re: Dirty Comment Regex
Post by playful on Nov 28th, 2011, 12:40am
For the record, if we made all of your stars lazy:
/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*\/

Then your regex would match all three strings!

However, group 1 would capture:
string1: it was steep  (correct)
string2: _**\\\*****Get Me!  (too much)
string3: _**\\\  (we don't want any capture)


Quote:
Might depend a bit on how the regexp is compiled


No worries, I have a tool to check how the expression performs for a variety of regex engines: .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XPath. :)

If anyone's in need of testing the match and capture of an expression, I believe this should get you started on PHP with the above expression (didn't test it though):


Code:
$string1="I went up the hill /*it was steep*/ last night.";
if (preg_match('%/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*/%', $string1, $groups)) {

echo "match:".$groups[0]."</br>";

echo "capture:".$groups[1]."</br>";
} else {

echo "sorry no match";
}

Title: Re: Dirty Comment Regex
Post by towr on Nov 28th, 2011, 12:58am
Only the * of the capturing subexpression one should be lazy.

Title: Re: Dirty Comment Regex
Post by playful on Nov 28th, 2011, 1:18am
Hi Towr,

With
/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*\/

We get
match1: /*it was steep*/ (great)
capture1: it was steep (great)

match2: /*_**\\\*****Get Me!*--///*/****/ (too much)
capture2:  \\\*****Get Me!

match3: /*_**\\\***/ (great)
capture3:  \\\ (too much)

Looks like your left star is too greedy!

Here is (now tested) code to check a pattern:


Code:
<?php
$pattern="%/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*/%";
$string[0]="I went up the hill /*it was steep*/ last night.";
$string[1]="asdf///*_**\\\*****Get Me!*--///*/****//asdf";
$string[2]="asdf///*_**\\\***/-**Don't Get-Me!*--///*/***";
for ($i=0;$i<3;$i++) {
echo "Test String #"; echo $i+1; echo ": ".$string[$i]."<br />";
if (preg_match($pattern, $string[$i], $groups)) {

echo "match: ".$groups[0]."</br>";

echo "capture:".$groups[1]."</br><br />";
} else {

echo "sorry no match<br />";
}
}
?>


Sorry, no time to upload it with a form field, quite late in New Zealand and need to get off the screen. :)

Wishing you a beautiful day


Title: Re: Dirty Comment Regex
Post by towr on Nov 28th, 2011, 2:23am
There's a number of problems here; because you're using php strings to represent the pattern, \'s need to be escaped. Unfortunately that wouldn't help me much because it needs to be escaped in the regexp too (at least for php), so [*$ _/+\-]* should have been [*$ _/+\\\\-]* to do what I expect it to do.

It still wouldn't give the desired result, unfortunately, but at least then it'd be entirely my fault it doesn't work :P

Title: Re: Dirty Comment Regex
Post by towr on Nov 28th, 2011, 3:27am
Ok, here's another attempt:
/\*(?:[$ _/+\\-]|\*(?!/))*(.*?)[*$ _/+\\-]*?\*/
47 characters (but a bit more when you escape it to put it into a php string)

Title: Re: Dirty Comment Regex
Post by playful on Nov 28th, 2011, 11:22am
Hi towr,

You win!
Shorter than mine too. ;)
Well done!

On the left side of the capture, we have the exact same number of characters, but we picked a different strategy (or came up with, in my case---I didn't "pick" as the other one didn't occur to me).

Where you use an alternation:
|\*(?!/) (or a star not followed by a slash)
I bundled the star in the lookahead:
(?:(?!\*/)[*$ _/+\\-])*
(any of these characters, not followed by a */)

On the right side of the capture, I was lazy and reused the code from the left:
(?:(?!\*/)[*$ _/+\\-])*
But, as you showed, that was not needed, and true laziness (expressed by a *?) was enough! I am so glad your expression brought that out.

Very cool too to see the other idea on the left side.


Quote:
There's a number of problems here


Yes I was a bit tired last night.


Well, I think I'll go run some quick benchmarks just for the fun of it and report back. Congratulations.  ;D

Title: Re: Dirty Comment Regex
Post by playful on Nov 28th, 2011, 11:35am
EDIT: A full walk-through of the solution now lives on my page about greedy and lazy regex (http://www.asiteaboutnothing.net/regexp/regex-greed.html). Ctrl + F and look for Dirty Comments.

Okay, the benchmarks are in.

You win again! :D

Here are the number of steps to find a match for each string according to my debugger.
Three expressions were tested: yours, mine, and a third one that uses my lookahead on the left with your lazy quantifier on the right.

String 1:
towr: 74 steps
playful: 108 steps
towr meets playful: 91 steps

String 2:
towr: 102 steps
playful: 142 steps
towr meets playful: 126 steps

String 3:
towr: 39 steps
playful: 49 steps
towr meets playful: 46 steps
 
The relative performance of the expressions couldn't be more clear.

So what's happening here?

For the left side, I have a negative lookahead at each step, whereas you greedily eat up every special character, only jumping to the other side of the OR when we meet a star. Then only you have a lookahead.

For the right side, you lazily expand, so at each step you have to at least check ahead if there is a star and you might be able to stop. Whereas I check if there are two characters ahead, the star and the slash.
At least that's what I think is going on. :)

Thank you for playing.

Wishing you a fun day,



Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board