wu :: forums - Dirty Comment Regex

wu :: forums « wu :: forums - Dirty Comment Regex » Welcome, Guest. Please Login or Register. Apr 29^th, 2024, 7:48am
RIDDLES SITE WRITE MATH! Home Help Search Members Login Register

   wu :: forums
   riddles
   cs (Moderators: Icarus, Eigenray, william wu, ThudnBlunder, Grimbal, towr, SMQ)
   Dirty Comment Regex

« Previous topic | Next topic »

Pages: 1

Notify of replies

Send Topic

Author

Topic: Dirty Comment Regex (Read 2418 times)

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Dirty Comment Regex
« on: Nov 27^th, 2011, 7:01pm »

Quote

Modify

Here is a little puzzle I gave myself this morning. It was a lot of fun so I thought I would share it.

The goal is to make a regular expression. The regex has to do with comments delimited with /* and */, as in C and PHP. Some of the comments are "dirty" and you have been asked to clean them up.
Here is an example of a dirty comment embedded in a line of text:
asdf///*_**\\\*****Get Me!*--///*/****//asdf

And here are the rules.

1. The regex must match the whole comment. For instance, for string 1:
I went up the hill /*it was steep*/ last night.
the regex must match "/*it was steep*/" (without the quotes).

2. The regex must capture the comment itself as group 1. For instance, in string 1, capture group 1 must be "it was steep" (without the quotes).

3. The programmer has made a lot of comments where, just after the /* or just before the */, there are a number of special characters, for instance, lots of stars, as in "/***hello***/". These extra characters help her read the code, but your boss thinks they're dirty. You have to remove all leading or trailing "special" characters in this group: {*$ _/+\-} (not including the curly braces). Note that the third character is a space.

This means that in string 2:
asdf///*_**\\\*****Get Me!*--///*/****//asdf
group 1 will only capture "Get Me!" (without the quotes).
In string 3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.

4. There is no capture group beyond group 1.

If anything is unclear, please ask and I will try to clarify.
I will check all answers to the best of my ability.
Enjoy!!!

Grin

IP Logged

my favorite puzzles

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: Dirty Comment Regex
« Reply #1 on: Nov 27^th, 2011, 10:37pm »

Quote

Modify

on Nov 27^th, 2011, 7:01pm, playful wrote:

In string 3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.

Why wouldn't there be a capture here?
Shouldn't it break down as follows:
   asdf//
   /*
   _**\\\***/-**
   Don't Get-Me!
   *--///
   */
   ***

The match is parts 2-6, the leading/trailing characters are part 3 and 5, and the captured, clean comment is part 4.

Or do you want an ungreedy match?

« Last Edit: Nov 27^th, 2011, 10:40pm by towr »

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #2 on: Nov 27^th, 2011, 11:18pm »

Quote

Modify

Hi Towr,

Great to hear from you.

Quote:

Or do you want an ungreedy match?

Yes, that's right.

More precisely, I want the match to agree with how a compiler (or code-aware text editor) looks at /**/ comments.

To clarify for anyone else on the same track, on string3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***

for the compiler / interpreter or code-aware text editor, the only /**/ comment is:
/*_**\\\***/

That is the sought match, with no capture.

After checking on two of my editors, I confirm that is how they treat comments on the /**/*/ pattern (they take the leftmost comment).

The thing that may throw off your editor (as it did mine) is that string 3 includes a double slash, which comments out the rest of the line on most editors. But that is not part of the puzzle. Wink

« Last Edit: Nov 27^th, 2011, 11:44pm by playful »

IP Logged

my favorite puzzles

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #3 on: Nov 27^th, 2011, 11:49pm »

Quote

Modify

Removing all comments, my regex pattern has 57 characters. I wouldn't be surprised if someone here got it tighter.

IP Logged

my favorite puzzles

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: Dirty Comment Regex
« Reply #4 on: Nov 28^th, 2011, 12:16am »

Quote

Modify

I think the following should work
//\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\//U
(where the U is the modifier for ungreedy and the first and last / delimit the regexp)

Might depend a bit on how the regexp is compiled; since there's some ambiguity in what might be captured.

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #5 on: Nov 28^th, 2011, 12:36am »

Quote

Modify

Hi Towr,

Hey, that was fast!

Here is your string without the delimiters.
/\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\/

To which part(s) of the expression do you want to apply the ungreedy modifiers?

I see three expanding patterns in there.
Just place a U after any of the greedy stars and I will test it for you.

For the record in PCRE syntax and several others the lazy modifier is a question mark, as in .*?

IP Logged

my favorite puzzles

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #6 on: Nov 28^th, 2011, 12:40am »

Quote

Modify

For the record, if we made all of your stars lazy:
/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*\/

Then your regex would match all three strings!

However, group 1 would capture:
string1: it was steep (correct)
string2: _**\\\*****Get Me! (too much)
string3: _**\\\ (we don't want any capture)

Quote:

Might depend a bit on how the regexp is compiled

No worries, I have a tool to check how the expression performs for a variety of regex engines: .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XPath.

If anyone's in need of testing the match and capture of an expression, I believe this should get you started on PHP with the above expression (didn't test it though):

Code:

$string1="I went up the hill /*it was steep*/ last night.";
if (preg_match('%/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*/%', $string1, $groups)) {

echo "match:".$groups[0]."";

echo "capture:".$groups[1]."";
} else {

echo "sorry no match";
}

« Last Edit: Nov 28^th, 2011, 12:55am by playful »

IP Logged

my favorite puzzles

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: Dirty Comment Regex
« Reply #7 on: Nov 28^th, 2011, 12:58am »

Quote

Modify

Only the * of the capturing subexpression one should be lazy.

« Last Edit: Nov 28^th, 2011, 12:59am by towr »

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #8 on: Nov 28^th, 2011, 1:18am »

Quote

Modify

Hi Towr,

With
/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*\/

We get
match1: /*it was steep*/ (great)
capture1: it was steep (great)

match2: /*_**\\\*****Get Me!*--///*/****/ (too much)
capture2: \\\*****Get Me!

match3: /*_**\\\***/ (great)
capture3: \\\ (too much)

Looks like your left star is too greedy!

Here is (now tested) code to check a pattern:

Code:

<?php
$pattern="%/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*/%";
$string[0]="I went up the hill /*it was steep*/ last night.";
$string[1]="asdf///*_**\\\*****Get Me!*--///*/****//asdf";
$string[2]="asdf///*_**\\\***/-**Don't Get-Me!*--///*/***";
for ($i=0;$i<3;$i++) {
echo "Test String #"; echo $i+1; echo ": ".$string[$i]." ";
if (preg_match($pattern, $string[$i], $groups)) {

echo "match: ".$groups[0]."";

echo "capture:".$groups[1]." ";
} else {

echo "sorry no match ";
}
}
?>

Sorry, no time to upload it with a form field, quite late in New Zealand and need to get off the screen.

Wishing you a beautiful day

« Last Edit: Nov 28^th, 2011, 1:22am by playful »

IP Logged

my favorite puzzles

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: Dirty Comment Regex
« Reply #9 on: Nov 28^th, 2011, 2:23am »

Quote

Modify

There's a number of problems here; because you're using php strings to represent the pattern, \'s need to be escaped. Unfortunately that wouldn't help me much because it needs to be escaped in the regexp too (at least for php), so [*$ _/+\-]* should have been [*$ _/+\\\\-]* to do what I expect it to do.

It still wouldn't give the desired result, unfortunately, but at least then it'd be entirely my fault it doesn't work Tongue

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: Dirty Comment Regex
« Reply #10 on: Nov 28^th, 2011, 3:27am »

Quote

Modify

Ok, here's another attempt:
/\*(?:[$ _/+\\-]|\*(?!/))*(.*?)[*$ _/+\\-]*?\*/
47 characters (but a bit more when you escape it to put it into a php string)

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #11 on: Nov 28^th, 2011, 11:22am »

Quote

Modify

Hi towr,

You win!
Shorter than mine too. Wink

Well done!

On the left side of the capture, we have the exact same number of characters, but we picked a different strategy (or came up with, in my case---I didn't "pick" as the other one didn't occur to me).

Where you use an alternation:
|\*(?!/) (or a star not followed by a slash)
I bundled the star in the lookahead:
(?:(?!\*/)[*$ _/+\\-])*
(any of these characters, not followed by a */)

On the right side of the capture, I was lazy and reused the code from the left:
(?:(?!\*/)[*$ _/+\\-])*
But, as you showed, that was not needed, and true laziness (expressed by a *?) was enough! I am so glad your expression brought that out.

Very cool too to see the other idea on the left side.

Quote:

There's a number of problems here

Yes I was a bit tired last night.

Well, I think I'll go run some quick benchmarks just for the fun of it and report back. Congratulations. Grin

IP Logged

my favorite puzzles

playful
Junior Member

meet me at the second fork in the road

Gender: male

Posts: 100

Re: Dirty Comment Regex
« Reply #12 on: Nov 28^th, 2011, 11:35am »

Quote

Modify

EDIT: A full walk-through of the solution now lives on my page about greedy and lazy regex. Ctrl + F and look for Dirty Comments.

Okay, the benchmarks are in.

You win again! Cheesy

Here are the number of steps to find a match for each string according to my debugger.
Three expressions were tested: yours, mine, and a third one that uses my lookahead on the left with your lazy quantifier on the right.

String 1:
towr: 74 steps
playful: 108 steps
towr meets playful: 91 steps

String 2:
towr: 102 steps
playful: 142 steps
towr meets playful: 126 steps

String 3:
towr: 39 steps
playful: 49 steps
towr meets playful: 46 steps

The relative performance of the expressions couldn't be more clear.

So what's happening here?

For the left side, I have a negative lookahead at each step, whereas you greedily eat up every special character, only jumping to the other side of the OR when we meet a star. Then only you have a lookahead.

For the right side, you lazily expand, so at each step you have to at least check ahead if there is a star and you might be able to stop. Whereas I check if there are two characters ahead, the star and the slash.
At least that's what I think is going on.

Thank you for playing.

Wishing you a fun day,

« Last Edit: Mar 8^th, 2012, 1:23pm by playful »

IP Logged

my favorite puzzles

Pages: 1

Notify of replies

Send Topic


« Previous topic \| Next topic »