Perl Cookbook

6.4. Commenting Regular Expressions

6.4.1. Problem

You want to make your complex regular expressions understandable and maintainable.

6.4.2. Solution

You have several techniques at your disposal: electing alternate delimiters to avoid so many backslashes, placing comments outside the pattern or inside it using the /x modifier, and building up patterns piecemeal in named variables.

6.4.3. Discussion

The sample code in Example 6-1 uses the first two techniques, and its initial comment describes the overall intent of the regular expression. For simple patterns, this may be all that is needed. More complex patterns, as in the example, require more documentation.

Example 6-1. resname

  #!/usr/bin/perl -p
  # resname - change all "foo.bar.com" style names in the input stream
  # into "foo.bar.com [204.148.40.9]" (or whatever) instead
  use Socket;                 # load inet_ntoa
  s{
      (                       # capture the hostname in $1
          (?:                 # these parens for grouping only
              (?! [-_]  )     # lookahead for neither underscore nor dash
              [\w-] +         # hostname component
              \.              # and the domain dot
          ) +                 # now repeat that whole thing a bunch of times
          [A-Za-z]            # next must be a letter
          [\w-] +             # now trailing domain part
      )                       # end of $1 capture
  }{                          # replace with this:
      "$1 " .                 # the original bit, plus a space
             ( ($addr = gethostbyname($1))   # if we get an addr
              ? "[" . inet_ntoa($addr) . "]" #        format it
              : "[???]"                      # else mark dubious
             )
  }gex;               # /g for global
                      # /e for execute
                      # /x for nice formatting

For aesthetics, the example uses alternate delimiters. When you split your match or substitution over multiple lines, using matching braces aids readability. A more common use of alternate delimiters is for patterns and replacements that themselves contain slashes, such as in s/\/\//\/..\//g. Alternate delimiters, as in s!//!/../!g or s{//}{/../}g, avoid escaping the non-delimiting slashes with backslashes, again improving legibility.
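As a quick check that the three spellings above are the same substitution, the following sketch applies each one to a copy of a single string. The sample path and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; any string with doubled slashes will do.
my $path = "usr//local//bin";

(my $slashes = $path) =~ s/\/\//\/..\//g;   # default delimiters, escaped slashes
(my $bangs   = $path) =~ s!//!/../!g;       # ! as the delimiter
(my $braces  = $path) =~ s{//}{/../}g;      # paired braces

print "$_\n" for $slashes, $bangs, $braces;
```

All three lines of output should be identical, which is the point: only the readability changes.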

The /x pattern modifier makes Perl ignore whitespace in the pattern (outside a character class) and treat # characters and their following text as comments. The /e modifier changes the replacement portion from a string into code to run. Since it's code, you can put regular comments there, too.
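A smaller illustration of /x and /e working together may help. In this sketch the input string and the variable names are invented; the pattern half carries a regex comment and the replacement half carries an ordinary Perl comment.

```perl
use strict;
use warnings;

my $text = "3 apples and 12 pears";     # hypothetical input

(my $doubled = $text) =~ s{
    (\d+)               # a run of digits, captured
}{
    $1 * 2              # the replacement is an expression: twice the number
}gex;

print "$doubled\n";
```

Keep braces out of comments in either half when braces are the delimiters, since Perl locates the closing delimiter before it looks at the comments.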

To include literal whitespace or # characters in a pattern to which you've applied /x, escape them with a backslash:

s/                  # replace
  \#                #   a pound sign
  (\w+)             #   the variable name
  \#                #   another pound sign
/${$1}/xg;          # with the value of the global variable
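Here is a self-contained sketch of the same escaping rule that does not depend on global variables. The input line is hypothetical. It also shows a one-character class, which is another way to keep a literal space, since whitespace inside a character class is not ignored under /x.

```perl
use strict;
use warnings;

my $line = "item #42 of 100";   # hypothetical input

my ($num, $total) = $line =~ m{
    item \  \# (\d+)    # "item", an escaped space, an escaped "#", the number
    [ ] of [ ]          # a one-space character class works as well
    (\d+)               # the total
}x;

print "number $num of $total\n";
```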

Remember that comments should explain what you're doing and why, not merely restate the code. Using "$i++ # add one to i" is apt to lose points in your programming course or at least get you talked about in substellar terms by your coworkers.

The last technique for rendering patterns more legible (and thus, more maintainable) is to place each semantic unit into a variable given an appropriate name. We use single quotes instead of doubles so backslashes don't get lost.

$optional_sign      = '[-+]?';
$mandatory_digits   = '\d+';
$decimal_point      = '\.?';
$optional_digits    = '\d*';

$number = $optional_sign    
        . $mandatory_digits       
        . $decimal_point          
        . $optional_digits;

Then use $number in further patterns:

if (/($number)/) {      # parse out one
    $found = $1;
}

@allnums = /$number/g;  # parse all out

unless (/^$number$/) {  # any extra?
    print "need a number, just a number\n";
}
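Put together as a runnable sketch, with lexical variables and an invented input string, the three uses look like this:

```perl
use strict;
use warnings;

# The same pieces as above, joined in one step.
my $number = '[-+]?' . '\d+' . '\.?' . '\d*';

my $text = "low -3.5, high +12, step 0.25";     # hypothetical input

my ($first) = $text =~ /($number)/;     # parse out one
my @allnums = $text =~ /$number/g;      # parse all out

print "first: $first\n";
print "all:   @allnums\n";

# Anchoring both ends rejects anything that is not just a number.
my $only_number = "42x" =~ /^$number$/ ? 1 : 0;
print "need a number, just a number\n" unless $only_number;
```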

We can even combine all of these techniques:

# check for line of whitespace-separated numbers
m{
    ^ \s *              # optional leading whitespace
    $number             # at least one number
    (?:                 # begin optional cluster
        \s +            # must have some separator
        $number         # and then the next number
    ) *                 # repeat at will
    \s * $              # optional trailing whitespace
}x

which is certainly a lot better than writing:

/^\s*[-+]?\d+\.?\d*(?:\s+[-+]?\d+\.?\d*)*\s*$/
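To see the commented form at work, this sketch compiles it with qr// and runs it over a few invented lines. The names $numbers_only, @lines, and @good are illustrative only.

```perl
use strict;
use warnings;

my $number = '[-+]?\d+\.?\d*';      # the number pattern, as a plain string

my $numbers_only = qr{
    ^ \s *              # optional leading whitespace
    $number             # at least one number
    (?:                 # begin optional cluster
        \s +            # must have some separator
        $number         # and then the next number
    ) *                 # repeat at will
    \s * $              # optional trailing whitespace
}x;

# Hypothetical test lines: the first two should pass, the last two should not.
my @lines = ("1 2 3", "  -4.5 +6  ", "7 eight 9", "12abc");
my @good  = grep { $_ =~ $numbers_only } @lines;

print "ok: '$_'\n" for @good;
```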

Patterns that you put in variables should probably not contain capturing parentheses or backreferences, since a capture in one variable could change the numbering of those in others.
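The renumbering problem is easy to reproduce. In this sketch, $key_capture is a hypothetical variant of a key pattern that someone has "improved" by capturing the first letter; the enclosing pattern is otherwise unchanged, yet the value it was capturing moves from $1 to $2.

```perl
use strict;
use warnings;

my $key_plain   = '\w+';        # no capture inside the variable
my $key_capture = '(\w)\w*';    # hypothetical variant with a capture inside

my $line = "color = red";       # hypothetical input

# In list context a match returns its captures in order.
my @plain   = $line =~ /$key_plain\s*=\s*(\w+)/;
my @capture = $line =~ /$key_capture\s*=\s*(\w+)/;

print "plain:   @plain\n";      # the value is the first capture
print "capture: @capture\n";    # the value is now the second capture
```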

Clustering parentheses (that is, /(?:...)/ instead of /(...)/) are fine, though. Not only are they fine, they're necessary if you want to apply a quantifier to the whole variable. For example:

$number = "(?:" 
        .   $optional_sign    
        .   $mandatory_digits       
        .   $decimal_point          
        .   $optional_digits
        . ")";

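Now you can say /$number+/ and have the plus apply to the whole number group. Without the grouping, the plus would have shown up right after the last star. Older versions of Perl reject that as a nested quantifier, and since Perl 5.10 it is read as the possessive quantifier *+; either way, it does not repeat the whole number.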
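The difference shows up on a run of signed numbers with nothing between them. This sketch uses an invented input and variable names, and wraps the ungrouped match in eval so that a Perl that refuses to compile the pattern is simply counted as a failed match.

```perl
use strict;
use warnings;

my $ungrouped = '[-+]?\d+\.?\d*';
my $grouped   = "(?:$ungrouped)";

my $run = "+1-2+3";     # hypothetical input: three signed numbers back to back

# With the cluster, + repeats the whole number.
my $grouped_ok = $run =~ /^$grouped+$/ ? 1 : 0;

# Without it, + attaches only to the trailing \d* of a single number.
my $ungrouped_ok = eval { $run =~ /^$ungrouped+$/ } ? 1 : 0;

print "grouped: $grouped_ok, ungrouped: $ungrouped_ok\n";
```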

One more trick with clustering parentheses is that you can embed a modifier switch that applies only to that cluster. For example:

$hex_digit = '(?i:[0-9a-f])';
$hdr_line  = '(?m:^[^:]*:.*)';

The qr// construct does this automatically using cluster parentheses, enabling any modifiers you specified and disabling any you didn't for that cluster:

$hex_digit = qr/[0-9a-f]/i;
$hdr_line  = qr/^[^:]*:.*/m;

print "hex digit is: $hex_digit\n";
print "hdr line is: $hdr_line\n";

hex digit is: (?i-xsm:[0-9a-f])
hdr line is: (?m-xis:^[^:]*:.*)

That is the notation printed by Perl releases before 5.14; later releases print the shorter (?^i:[0-9a-f]) and (?^m:^[^:]*:.*), where the caret stands for the default modifiers.
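Because the modifiers travel with the compiled piece, a qr// object can be dropped into a larger pattern that uses different modifiers. In this sketch (the candidate strings and names are invented, and the class stops at f because hex digits do), the digits match case-insensitively while the literal 0x prefix stays case-sensitive.

```perl
use strict;
use warnings;

my $hex_digit = qr/[0-9a-f]/i;              # /i applies to this piece only

# No /i here, so the "0x" prefix must be lowercase.
my $hex_literal = qr/^0x$hex_digit+$/;

my @candidates = ("0xBEEF", "0Xbeef", "0xfeed");    # hypothetical inputs
my @matched    = grep { $_ =~ $hex_literal } @candidates;

print "matched: @matched\n";
```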

It's probably a good idea to use qr// in the first place:

$optional_sign      = qr/[-+]?/;
$mandatory_digits   = qr/\d+/;
$decimal_point      = qr/\.?/;
$optional_digits    = qr/\d*/;

$number = qr{
                 $optional_sign    
                 $mandatory_digits       
                 $decimal_point          
                 $optional_digits
          }x;

Although the output can be a bit odd to read:

print "Number is $number\n";

Number is (?x-ism:
                     (?-xism:[-+]?) 
                     (?-xism:\d+)   
                     (?-xism:\.?)   
                     (?-xism:\d*)   
              ) 

6.4.4. See Also

The /x modifier in perlre(1) and Chapter 5 of Programming Perl; the "Comments Within a Regular Expression" section of Chapter 7 of Mastering Regular Expressions



Copyright © 2003 O'Reilly & Associates. All rights reserved.