wu :: forums
« wu :: forums - Dirty Comment Regex »

Welcome, Guest. Please Login or Register.
Apr 29th, 2024, 7:48am

RIDDLES SITE WRITE MATH! Home Home Help Help Search Search Members Members Login Login Register Register
   wu :: forums
   riddles
   cs
(Moderators: Icarus, Eigenray, william wu, ThudnBlunder, Grimbal, towr, SMQ)
   Dirty Comment Regex
« Previous topic | Next topic »
Pages: 1  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print
   Author  Topic: Dirty Comment Regex  (Read 2418 times)
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Dirty Comment Regex  
« on: Nov 27th, 2011, 7:01pm »
Quote Quote Modify Modify

Here is a little puzzle I gave myself this morning. It was a lot of fun so I thought I would share it.
 
The goal is to make a regular expression. The regex has to do with comments delimited with /* and */, as in C and PHP. Some of the comments are "dirty" and you have been asked to clean them up.  
Here is an example of a dirty comment embedded in a line of text:
asdf///*_**\\\*****Get Me!*--///*/****//asdf
 
And here are the rules.
 
1. The regex must match the whole comment. For instance, for string 1:
I went up the hill /*it was steep*/ last night.
the regex must match "/*it was steep*/" (without the quotes).
 
2. The regex must capture the comment itself as group 1. For instance, in string 1, capture group 1 must be "it was steep" (without the quotes).
 
3. The programmer has made a lot of comments where, just after the /* or just before the */, there are a number of special characters, for instance, lots of stars, as in "/***hello***/". These extra characters help her read the code, but your boss thinks they're dirty. You have to remove all leading or trailing "special" characters in this group: {*$ _/+\-} (not including the curly braces). Note that the third character is a space.  
 
This means that in string 2:
asdf///*_**\\\*****Get Me!*--///*/****//asdf
group 1 will only capture "Get Me!" (without the quotes).
In string 3:  
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.
 
4. There is no capture group beyond group 1.  
 
If anything is unclear, please ask and I will try to clarify.
I will check all answers to the best of my ability.
Enjoy!!!
 
Grin
IP Logged

my favorite puzzles
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Dirty Comment Regex  
« Reply #1 on: Nov 27th, 2011, 10:37pm »
Quote Quote Modify Modify

on Nov 27th, 2011, 7:01pm, playful wrote:
In string 3:  
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
there is no capture at all, though there is a match.
Why wouldn't there be a capture here?
Shouldn't it break down as follows:
    asdf//
    /*
    _**\\\***/-**
    Don't Get-Me!
    *--///
    */
    ***
 
The match is parts 2-6, the leading/trailing characters are part 3 and 5, and the captured, clean comment is part 4.
 
Or do you want an ungreedy match?
« Last Edit: Nov 27th, 2011, 10:40pm by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #2 on: Nov 27th, 2011, 11:18pm »
Quote Quote Modify Modify

Hi Towr,
 
Great to hear from you.  
 
Quote:
Or do you want an ungreedy match?

 
Yes, that's right.  
 
More precisely, I want the match to agree with how a compiler (or code-aware text editor) looks at /**/ comments.
 
To clarify for anyone else on the same track, on string3:
asdf///*_**\\\***/-**Don't Get-Me!*--///*/***
 
for the compiler / interpreter or code-aware text editor, the only /**/ comment is:
/*_**\\\***/
 
That is the sought match, with no capture.  
 
After checking on two of my editors, I confirm that is how they treat comments on the /**/*/ pattern (they take the leftmost comment).
 
The thing that may throw off your editor (as it did mine) is that string 3 includes a double slash, which comments out the rest of the line on most editors. But that is not part of the puzzle. Wink
« Last Edit: Nov 27th, 2011, 11:44pm by playful » IP Logged

my favorite puzzles
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #3 on: Nov 27th, 2011, 11:49pm »
Quote Quote Modify Modify

Removing all comments, my regex pattern has 57 characters. I wouldn't be surprised if someone here got it tighter.  
IP Logged

my favorite puzzles
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Dirty Comment Regex  
« Reply #4 on: Nov 28th, 2011, 12:16am »
Quote Quote Modify Modify

I think the following should work
//\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\//U
(where the U is the modifier for ungreedy and the first and last / delimit the regexp)
 
Might depend a bit on how the regexp is compiled; since there's some ambiguity in what might be captured.
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #5 on: Nov 28th, 2011, 12:36am »
Quote Quote Modify Modify

Hi Towr,
 
Hey, that was fast!
 
Here is your string without the delimiters.
/\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\/
 
To which part(s) of the expression do you want to apply the ungreedy modifiers?
 
I see three expanding patterns in there.
Just place a U after any of the greedy stars and I will test it for you.  
 
For the record  in PCRE syntax and several others the lazy modifier is a question mark, as in .*?
IP Logged

my favorite puzzles
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #6 on: Nov 28th, 2011, 12:40am »
Quote Quote Modify Modify

For the record, if we made all of your stars lazy:
/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*\/
 
Then your regex would match all three strings!
 
However, group 1 would capture:  
string1: it was steep  (correct)
string2: _**\\\*****Get Me!  (too much)
string3: _**\\\  (we don't want any capture)
 
Quote:
Might depend a bit on how the regexp is compiled

 
No worries, I have a tool to check how the expression performs for a variety of regex engines: .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XPath. Smiley
 
If anyone's in need of testing the match and capture of an expression, I believe this should get you started on PHP with the above expression (didn't test it though):
 
Code:

$string1="I went up the hill /*it was steep*/ last night.";
if (preg_match('%/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*/%', $string1, $groups)) {
 
echo "match:".$groups[0]."</br>";
 
echo "capture:".$groups[1]."</br>";
} else {
 
echo "sorry no match";
}
 
« Last Edit: Nov 28th, 2011, 12:55am by playful » IP Logged

my favorite puzzles
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Dirty Comment Regex  
« Reply #7 on: Nov 28th, 2011, 12:58am »
Quote Quote Modify Modify

Only the * of the capturing subexpression one should be lazy.
« Last Edit: Nov 28th, 2011, 12:59am by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #8 on: Nov 28th, 2011, 1:18am »
Quote Quote Modify Modify

Hi Towr,
 
With  
/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*\/
 
We get  
match1: /*it was steep*/ (great)
capture1: it was steep (great)
 
match2: /*_**\\\*****Get Me!*--///*/****/ (too much)
capture2:  \\\*****Get Me!
 
match3: /*_**\\\***/ (great)
capture3:  \\\ (too much)
 
Looks like your left star is too greedy!  
 
Here is (now tested) code to check a pattern:
 
Code:

<?php  
$pattern="%/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*/%";
$string[0]="I went up the hill /*it was steep*/ last night.";
$string[1]="asdf///*_**\\\*****Get Me!*--///*/****//asdf";
$string[2]="asdf///*_**\\\***/-**Don't Get-Me!*--///*/***";
for ($i=0;$i<3;$i++) {
echo "Test String #"; echo $i+1; echo ": ".$string[$i]."<br />";
if (preg_match($pattern, $string[$i], $groups)) {
 
echo "match: ".$groups[0]."</br>";
 
echo "capture:".$groups[1]."</br><br />";
} else {
 
echo "sorry no match<br />";
}
}
?>

 
Sorry, no time to upload it with a form field, quite late in New Zealand and need to get off the screen. Smiley
 
Wishing you a beautiful day
 
« Last Edit: Nov 28th, 2011, 1:22am by playful » IP Logged

my favorite puzzles
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Dirty Comment Regex  
« Reply #9 on: Nov 28th, 2011, 2:23am »
Quote Quote Modify Modify

There's a number of problems here; because you're using php strings to represent the pattern, \'s need to be escaped. Unfortunately that wouldn't help me much because it needs to be escaped in the regexp too (at least for php), so [*$ _/+\-]* should have been [*$ _/+\\\\-]* to do what I expect it to do.
 
It still wouldn't give the desired result, unfortunately, but at least then it'd be entirely my fault it doesn't work Tongue
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Dirty Comment Regex  
« Reply #10 on: Nov 28th, 2011, 3:27am »
Quote Quote Modify Modify

Ok, here's another attempt:
/\*(?:[$ _/+\\-]|\*(?!/))*(.*?)[*$ _/+\\-]*?\*/
47 characters (but a bit more when you escape it to put it into a php string)
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #11 on: Nov 28th, 2011, 11:22am »
Quote Quote Modify Modify

Hi towr,  
 
You win!
Shorter than mine too. Wink
Well done!
 
On the left side of the capture, we have the exact same number of characters, but we picked a different strategy (or came up with, in my case---I didn't "pick" as the other one didn't occur to me).  
 
Where you use an alternation:
|\*(?!/) (or a star not followed by a slash)
I bundled the star in the lookahead:
(?:(?!\*/)[*$ _/+\\-])*
(any of these characters, not followed by a */)
 
On the right side of the capture, I was lazy and reused the code from the left:
(?:(?!\*/)[*$ _/+\\-])*
But, as you showed, that was not needed, and true laziness (expressed by a *?) was enough! I am so glad your expression brought that out.
 
Very cool too to see the other idea on the left side.
 
Quote:
There's a number of problems here

 
Yes I was a bit tired last night.  
 
 
Well, I think I'll go run some quick benchmarks just for the fun of it and report back. Congratulations.  Grin
IP Logged

my favorite puzzles
playful
Junior Member
**




meet me at the second fork in the road

   


Gender: male
Posts: 100
Re: Dirty Comment Regex  
« Reply #12 on: Nov 28th, 2011, 11:35am »
Quote Quote Modify Modify

EDIT: A full walk-through of the solution now lives on my page about greedy and lazy regex. Ctrl + F and look for Dirty Comments.
 
Okay, the benchmarks are in.
 
You win again! Cheesy
 
Here are the number of steps to find a match for each string according to my debugger.  
Three expressions were tested: yours, mine, and a third one that uses my lookahead on the left with your lazy quantifier on the right.
 
String 1:
towr: 74 steps
playful: 108 steps
towr meets playful: 91 steps
 
String 2:
towr: 102 steps
playful: 142 steps
towr meets playful: 126 steps
 
String 3:
towr: 39 steps
playful: 49 steps
towr meets playful: 46 steps
  
The relative performance of the expressions couldn't be more clear.
 
So what's happening here?
 
For the left side, I have a negative lookahead at each step, whereas you greedily eat up every special character, only jumping to the other side of the OR when we meet a star. Then only you have a lookahead.
 
For the right side, you lazily expand, so at each step you have to at least check ahead if there is a star and you might be able to stop. Whereas I check if there are two characters ahead, the star and the slash.  
At least that's what I think is going on. Smiley
 
Thank you for playing.  
 
Wishing you a fun day,
« Last Edit: Mar 8th, 2012, 1:23pm by playful » IP Logged

my favorite puzzles
Pages: 1  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print

« Previous topic | Next topic »

Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board