Channel: Regex Guru » Regex Philosophy

Escape Characters Only When Necessary


A lot of people seem to have a habit of escaping all non-alphanumeric characters that they want to treat as literals in their regular expressions. E.g. to match #1+1=2 they’ll write \#1\+1\=2 instead of #1\+1=2. Though these regexes are equivalent in all modern regex flavors, the extraneous backslashes don’t exactly make the pattern more readable. And when formatted as a C++ string, "\\#1\\+1\\=2" is definitely a step back from "#1\\+1=2".
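As a quick check of the equivalence claimed above, here is a minimal sketch using Python's `re` module as one example flavor (the variable names are only illustrative). It also shows why the one backslash before `+` is not optional.

```python
import re

text = "#1+1=2"

over_escaped = r"\#1\+1\=2"   # every non-alphanumeric escaped
minimal = r"#1\+1=2"          # only the metacharacter + escaped
unescaped = r"#1+1=2"         # + left as a quantifier

# Both escaped forms match the literal text.
both_match = all(re.fullmatch(p, text) for p in (over_escaped, minimal))

# Without the backslash, + means "one or more of the preceding 1",
# so the pattern matches "#11=2" and not the intended text.
unescaped_on_text = re.fullmatch(unescaped, text)
unescaped_on_other = re.fullmatch(unescaped, "#11=2")

print(both_match, unescaped_on_text is None, unescaped_on_other is not None)
```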

Beyond readability, needlessly escaping characters can also lead to subtle problems. In most flavors, < and \< both match a literal <. But in some flavors, like the GNU flavors, < is a literal and \< is a word boundary.

Similarly, _ and \_ usually simply match _. But the .NET framework treats \_ as an error, just like most modern flavors treat escaped letters that don’t form a regex token, like \j, as an error. This is done to reserve these letters for future expansion. I recommend that you treat non-alphanumerics the same, and escape only metacharacters.
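How a given flavor treats a needless escape is easy to probe. The sketch below does so for Python's `re` module only, assuming Python 3.6 or later; it says nothing about GNU or .NET, and the `compiles` helper is a hypothetical name introduced for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if this flavor accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# In Python, \< and \_ are accepted and match the literal character...
accepts_lt = compiles(r"\<") and re.fullmatch(r"\<", "<") is not None
accepts_underscore = compiles(r"\_") and re.fullmatch(r"\_", "_") is not None

# ...but an escaped letter that forms no token, like \j, is rejected.
accepts_j = compiles(r"\j")

print(accepts_lt, accepts_underscore, accepts_j)
```

A pattern that avoids the needless backslashes behaves the same in Python and sidesteps the question in flavors that give those escapes a different meaning.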

Modern regex flavors have 11 metacharacters outside character classes: the opening square bracket [, the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening round bracket ( and the closing round bracket ).
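The list above translates directly into a small escaping routine. This is only a sketch: `escape_minimal` and `METACHARACTERS` are hypothetical names, and the routine additionally escapes the opening curly brace as a conservative choice, since a literal `{` can otherwise be read as the start of a quantifier.

```python
import re

# The 11 metacharacters listed above for use outside character classes.
METACHARACTERS = set("[\\^$.|?*+()")

def escape_minimal(literal: str) -> str:
    """Backslash-escape only the characters that need it (hypothetical helper)."""
    out = []
    for ch in literal:
        # "{" is an assumption of this sketch: escaped conservatively
        # because it might otherwise form a quantifier such as {3}.
        if ch in METACHARACTERS or ch == "{":
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

pattern = escape_minimal("#1+1=2")
print(pattern)                                # #1\+1=2
print(bool(re.fullmatch(pattern, "#1+1=2")))  # True
```

Compared with escaping every non-alphanumeric character, the output stays close to the text it is meant to match.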

The closing square bracket and the curly braces are indeed not in this list. The closing square bracket is an ordinary character outside character classes. Sometimes I do escape it for readability, e.g. when using a regex like \[[0-9a-f]\] to match [a]. The opening curly brace only needs to be escaped if it would otherwise form a quantifier like {3}. An exception to this rule is Java, which always requires { to be escaped.
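The bracket and brace rules above can be checked as follows. This sketch uses Python's `re` module as one example flavor; as noted, Java handles the opening curly brace differently, so the brace results should not be assumed to carry over.

```python
import re

# The closing square bracket is literal outside a character class,
# so escaping it is purely a readability choice.
with_escape = re.fullmatch(r"\[[0-9a-f]\]", "[a]")
without_escape = re.fullmatch(r"\[[0-9a-f]]", "[a]")

# {b} cannot be a quantifier, so the brace is taken literally here.
literal_brace = re.fullmatch(r"a{b}", "a{b}")

# {3} does form a quantifier: the pattern means "aaa", not "a{3}".
as_quantifier = re.fullmatch(r"a{3}", "aaa")
not_literal = re.fullmatch(r"a{3}", "a{3}")

# Escaping the opening brace restores the literal reading.
escaped_brace = re.fullmatch(r"a\{3}", "a{3}")

print(bool(with_escape), bool(without_escape), bool(literal_brace),
      bool(as_quantifier), not_literal is None, bool(escaped_brace))
```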

Inside character classes, different metacharacters apply. Namely, the caret ^, the hyphen -, the closing bracket ] and the backslash itself are metacharacters. You can actually avoid escaping these, except for the backslash, by positioning them so that their special meaning cannot apply. You can place ] right after the opening bracket, - right before the closing bracket and ^ anywhere except right after the opening bracket. So []^\\-] matches any of the 4 metacharacters inside character classes.

Again, one flavor has to deviate from normal practice. The JavaScript standard treats [] as an empty character class. This is not very useful, as it can never match anything. No surprise that the Internet Explorer developers got this wrong, and follow the usual practice of treating ] after [ as a literal.

I recommend that you escape the 4 metacharacters inside character classes for maximum compatibility with various flavors, and to make your regex easier to understand by other developers who may be confused by something like []^\\-]. But don’t needlessly add backslashes to a regex like [*/._] which is perfectly fine without.
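To compare the positional trick with the fully escaped form, here is a minimal sketch in Python's `re` module, which follows the usual practice of treating ] right after [ as a literal. The variable names are illustrative, and the result does not apply to JavaScript, as explained above.

```python
import re

positional = r"[]^\\-]"    # relies on where ], ^ and - are placed
escaped = r"[\]\^\\\-]"    # escapes all four metacharacters

class_metachars = ["]", "^", "\\", "-"]

# Both classes match each of the four characters and nothing else tested.
positional_ok = all(re.fullmatch(positional, ch) for ch in class_metachars)
escaped_ok = all(re.fullmatch(escaped, ch) for ch in class_metachars)
rejects_other = (re.fullmatch(positional, "a") is None
                 and re.fullmatch(escaped, "a") is None)

# Characters that are not metacharacters inside a class need no backslash.
plain_class_ok = all(re.fullmatch(r"[*/._]", ch) for ch in "*/._")

print(positional_ok, escaped_ok, rejects_other, plain_class_ok)
```

The escaped form is longer, but it does not depend on character order and reads the same way in flavors that disagree about a leading ].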

