Author |
Topic: Dirty Comment Regex (Read 2418 times) |
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Dirty Comment Regex
« on: Nov 27th, 2011, 7:01pm » |
Quote Modify
|
Here is a little puzzle I gave myself this morning. It was a lot of fun so I thought I would share it. The goal is to make a regular expression. The regex has to do with comments delimited with /* and */, as in C and PHP. Some of the comments are "dirty" and you have been asked to clean them up. Here is an example of a dirty comment embedded in a line of text: asdf///*_**\\\*****Get Me!*--///*/****//asdf And here are the rules. 1. The regex must match the whole comment. For instance, for string 1: I went up the hill /*it was steep*/ last night. the regex must match "/*it was steep*/" (without the quotes). 2. The regex must capture the comment itself as group 1. For instance, in string 1, capture group 1 must be "it was steep" (without the quotes). 3. The programmer has made a lot of comments where, just after the /* or just before the */, there are a number of special characters, for instance, lots of stars, as in "/***hello***/". These extra characters help her read the code, but your boss thinks they're dirty. You have to remove all leading or trailing "special" characters in this group: {*$ _/+\-} (not including the curly braces). Note that the third character is a space. This means that in string 2: asdf///*_**\\\*****Get Me!*--///*/****//asdf group 1 will only capture "Get Me!" (without the quotes). In string 3: asdf///*_**\\\***/-**Don't Get-Me!*--///*/*** there is no capture at all, though there is a match. 4. There is no capture group beyond group 1. If anything is unclear, please ask and I will try to clarify. I will check all answers to the best of my ability. Enjoy!!!
|
|
IP Logged |
my favorite puzzles
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Dirty Comment Regex
« Reply #1 on: Nov 27th, 2011, 10:37pm » |
Quote Modify
|
on Nov 27th, 2011, 7:01pm, playful wrote:In string 3: asdf///*_**\\\***/-**Don't Get-Me!*--///*/*** there is no capture at all, though there is a match. |
| Why wouldn't there be a capture here? Shouldn't it break down as follows: asdf// /* _**\\\***/-** Don't Get-Me! *--/// */ *** The match is parts 2-6, the leading/trailing characters are part 3 and 5, and the captured, clean comment is part 4. Or do you want an ungreedy match?
|
« Last Edit: Nov 27th, 2011, 10:40pm by towr » |
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #2 on: Nov 27th, 2011, 11:18pm » |
Quote Modify
|
Hi Towr, Great to hear from you. Quote:Or do you want an ungreedy match? |
| Yes, that's right. More precisely, I want the match to agree with how a compiler (or code-aware text editor) looks at /**/ comments. To clarify for anyone else on the same track, on string3: asdf///*_**\\\***/-**Don't Get-Me!*--///*/*** for the compiler / interpreter or code-aware text editor, the only /**/ comment is: /*_**\\\***/ That is the sought match, with no capture. After checking on two of my editors, I confirm that is how they treat comments on the /**/*/ pattern (they take the leftmost comment). The thing that may throw off your editor (as it did mine) is that string 3 includes a double slash, which comments out the rest of the line on most editors. But that is not part of the puzzle.
|
« Last Edit: Nov 27th, 2011, 11:44pm by playful » |
IP Logged |
my favorite puzzles
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #3 on: Nov 27th, 2011, 11:49pm » |
Quote Modify
|
Removing all comments, my regex pattern has 57 characters. I wouldn't be surprised if someone here got it tighter.
|
|
IP Logged |
my favorite puzzles
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Dirty Comment Regex
« Reply #4 on: Nov 28th, 2011, 12:16am » |
Quote Modify
|
I think the following should work //\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\//U (where the U is the modifier for ungreedy and the first and last / delimit the regexp) Might depend a bit on how the regexp is compiled; since there's some ambiguity in what might be captured.
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #5 on: Nov 28th, 2011, 12:36am » |
Quote Modify
|
Hi Towr, Hey, that was fast! Here is your string without the delimiters. /\*[*$ _/+\-]*(.*)[*$ _/+\-]*\*\/ To which part(s) of the expression do you want to apply the ungreedy modifiers? I see three expanding patterns in there. Just place a U after any of the greedy stars and I will test it for you. For the record in PCRE syntax and several others the lazy modifier is a question mark, as in .*?
|
|
IP Logged |
my favorite puzzles
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #6 on: Nov 28th, 2011, 12:40am » |
Quote Modify
|
For the record, if we made all of your stars lazy: /\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*\/ Then your regex would match all three strings! However, group 1 would capture: string1: it was steep (correct) string2: _**\\\*****Get Me! (too much) string3: _**\\\ (we don't want any capture) Quote:Might depend a bit on how the regexp is compiled |
| No worries, I have a tool to check how the expression performs for a variety of regex engines: .NET, Java, Perl, PCRE, JavaScript, Python, Ruby, Tcl ARE, XPath. If anyone's in need of testing the match and capture of an expression, I believe this should get you started on PHP with the above expression (didn't test it though): Code: $string1="I went up the hill /*it was steep*/ last night."; if (preg_match('%/\*[*$ _/+\-]*?(.*?)[*$ _/+\-]*?\*/%', $string1, $groups)) { echo "match:".$groups[0]."</br>"; echo "capture:".$groups[1]."</br>"; } else { echo "sorry no match"; } |
|
|
« Last Edit: Nov 28th, 2011, 12:55am by playful » |
IP Logged |
my favorite puzzles
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Dirty Comment Regex
« Reply #7 on: Nov 28th, 2011, 12:58am » |
Quote Modify
|
Only the * of the capturing subexpression one should be lazy.
|
« Last Edit: Nov 28th, 2011, 12:59am by towr » |
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #8 on: Nov 28th, 2011, 1:18am » |
Quote Modify
|
Hi Towr, With /\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*\/ We get match1: /*it was steep*/ (great) capture1: it was steep (great) match2: /*_**\\\*****Get Me!*--///*/****/ (too much) capture2: \\\*****Get Me! match3: /*_**\\\***/ (great) capture3: \\\ (too much) Looks like your left star is too greedy! Here is (now tested) code to check a pattern: Code: <?php $pattern="%/\*[*$ _/+\-]*(.*?)[*$ _/+\-]*\*/%"; $string[0]="I went up the hill /*it was steep*/ last night."; $string[1]="asdf///*_**\\\*****Get Me!*--///*/****//asdf"; $string[2]="asdf///*_**\\\***/-**Don't Get-Me!*--///*/***"; for ($i=0;$i<3;$i++) { echo "Test String #"; echo $i+1; echo ": ".$string[$i]."<br />"; if (preg_match($pattern, $string[$i], $groups)) { echo "match: ".$groups[0]."</br>"; echo "capture:".$groups[1]."</br><br />"; } else { echo "sorry no match<br />"; } } ?> |
| Sorry, no time to upload it with a form field, quite late in New Zealand and need to get off the screen. Wishing you a beautiful day
|
« Last Edit: Nov 28th, 2011, 1:22am by playful » |
IP Logged |
my favorite puzzles
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Dirty Comment Regex
« Reply #9 on: Nov 28th, 2011, 2:23am » |
Quote Modify
|
There's a number of problems here; because you're using php strings to represent the pattern, \'s need to be escaped. Unfortunately that wouldn't help me much because it needs to be escaped in the regexp too (at least for php), so [*$ _/+\-]* should have been [*$ _/+\\\\-]* to do what I expect it to do. It still wouldn't give the desired result, unfortunately, but at least then it'd be entirely my fault it doesn't work
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Dirty Comment Regex
« Reply #10 on: Nov 28th, 2011, 3:27am » |
Quote Modify
|
Ok, here's another attempt: /\*(?:[$ _/+\\-]|\*(?!/))*(.*?)[*$ _/+\\-]*?\*/ 47 characters (but a bit more when you escape it to put it into a php string)
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #11 on: Nov 28th, 2011, 11:22am » |
Quote Modify
|
Hi towr, You win! Shorter than mine too. Well done! On the left side of the capture, we have the exact same number of characters, but we picked a different strategy (or came up with, in my case---I didn't "pick" as the other one didn't occur to me). Where you use an alternation: |\*(?!/) (or a star not followed by a slash) I bundled the star in the lookahead: (?:(?!\*/)[*$ _/+\\-])* (any of these characters, not followed by a */) On the right side of the capture, I was lazy and reused the code from the left: (?:(?!\*/)[*$ _/+\\-])* But, as you showed, that was not needed, and true laziness (expressed by a *?) was enough! I am so glad your expression brought that out. Very cool too to see the other idea on the left side. Quote:There's a number of problems here |
| Yes I was a bit tired last night. Well, I think I'll go run some quick benchmarks just for the fun of it and report back. Congratulations.
|
|
IP Logged |
my favorite puzzles
|
|
|
playful
Junior Member
meet me at the second fork in the road
Gender:
Posts: 100
|
|
Re: Dirty Comment Regex
« Reply #12 on: Nov 28th, 2011, 11:35am » |
Quote Modify
|
EDIT: A full walk-through of the solution now lives on my page about greedy and lazy regex. Ctrl + F and look for Dirty Comments. Okay, the benchmarks are in. You win again! Here are the number of steps to find a match for each string according to my debugger. Three expressions were tested: yours, mine, and a third one that uses my lookahead on the left with your lazy quantifier on the right. String 1: towr: 74 steps playful: 108 steps towr meets playful: 91 steps String 2: towr: 102 steps playful: 142 steps towr meets playful: 126 steps String 3: towr: 39 steps playful: 49 steps towr meets playful: 46 steps The relative performance of the expressions couldn't be more clear. So what's happening here? For the left side, I have a negative lookahead at each step, whereas you greedily eat up every special character, only jumping to the other side of the OR when we meet a star. Then only you have a lookahead. For the right side, you lazily expand, so at each step you have to at least check ahead if there is a star and you might be able to stop. Whereas I check if there are two characters ahead, the star and the slash. At least that's what I think is going on. Thank you for playing. Wishing you a fun day,
|
« Last Edit: Mar 8th, 2012, 1:23pm by playful » |
IP Logged |
my favorite puzzles
|
|
|
|