Learning Perl, 3rd Edition

9.5. The Match Variables

Do you remember the regular expression memories, which we used with backreferences in the previous chapter? Those memories are also available after the pattern match is done, after we return to Perl. They're strings, so they are kept in scalar variables with names like $1 and $2. There are as many of these variables as there are pairs of memory parentheses in the pattern. As you'd expect, $4 means the string matched by the fourth set of parentheses. This is the same string that \4 referred to inside the pattern match.

Why are there two different ways to refer to that same string? They're not really referring to the same string at the same time; $4 means the fourth memory of an already completed pattern match, while \4 is a backreference referring back to the fourth memory of the currently matching regular expression. Besides, backreferences work inside regular expressions only; once we're back in the world of Perl, we'll use $4.
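Here is a minimal sketch of that difference. The sample string and the variable names are made up for illustration. Inside the pattern, \1 refers to memory one while the match is still in progress; after the match, $1 holds that same memory:

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen only for illustration.
my $text = "this is is a doubled word";

my $doubled;
# \1 refers back to memory one while the pattern is still matching...
if ($text =~ /\b(\w+) \1\b/) {
  # ...and $1 holds that memory once the match has finished.
  $doubled = $1;
  print "doubled word: $doubled\n";
}
```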

These match variables are a big part of the power of regular expressions, because they let us pull out the parts of a string:

$_ = "Hello there, neighbor";
if (/\s(\w+),/) {             # memorize the word between space and comma
  print "the word was $1\n";  # the word was there
}

Or you could use more than one memory at once:

$_ = "Hello there, neighbor";
if (/(\S+) (\S+), (\S+)/) {
  print "words were $1 $2 $3\n";
}

That tells us that the words were Hello there neighbor. Notice that there's no comma in the output: because the comma is outside of the memory parentheses in the pattern, there is no comma in memory two. Using this technique, we can choose exactly what we want in the memories, as well as what we want to leave out.

You could even have an empty match variable,[199] if that part of the pattern might be empty. That is, a match variable may contain the empty string:

[199]As opposed to an undefined one. If you have three or fewer sets of parentheses in the pattern, $4 will be undef.

my $dino = "I fear that I'll be extinct after 1000 years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # 1000
}

$dino = "I fear that I'll be extinct after a few million years.";
if ($dino =~ /(\d*) years/) {
  print "That said '$1' years.\n";  # empty string
}
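If you want to check the footnote's distinction between empty and undefined, a sketch like this one should do it. The names $one_state and $two_state are just for illustration:

```perl
use strict;
use warnings;

my $dino = "I fear that I'll be extinct after a few million years.";

my ($one_state, $two_state);
if ($dino =~ /(\d*) years/) {
  # Memory one took part in the match but matched no digits:
  # it is defined, yet empty.
  $one_state = defined($1) ? "defined, length " . length($1) : "undef";
  # The pattern has only one pair of parentheses, so $2 is undef.
  $two_state = defined($2) ? "defined" : "undef";
  print "\$1 is $one_state; \$2 is $two_state\n";
}
```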

9.5.1. The Persistence of Memory

These match variables generally stay around until the next successful pattern match.[200] That is, an unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn't use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. The following (bad) example is supposed to print a word matched from $wilma. But if the match fails, it's using whatever leftover string happens to be found in $1:

[200]The actual scoping rule is much more complex (see the documentation if you need it), but as long as you don't expect the match variables to be untouched many lines after a pattern match, you shouldn't have problems.

$wilma =~ /(\w+)/;  # BAD! Untested match result
print "Wilma's word was $1... or was it?\n";

This is another reason that a pattern match is almost always found in the conditional expression of an if or while:

if ($wilma =~ /(\w+)/) {
  print "Wilma's word was $1.\n";
} else {
  print "Wilma doesn't have a word.\n";
}
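To watch a leftover memory appear, try a small sketch like this one. The variable $fred and both sample strings are hypothetical. It deliberately repeats the untested-match mistake so that $1 visibly keeps its old value:

```perl
use strict;
use warnings;

# Hypothetical values, for illustration only.
my $fred  = "flintstone";
my $wilma = "!!!";          # contains no word characters

$fred  =~ /(\w+)/;          # succeeds, so $1 is "flintstone"
$wilma =~ /(\w+)/;          # fails, so $1 is left alone

my $leftover = $1;          # still the memory from the match against $fred
print "leftover: $leftover\n";
```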

Since these memories don't stay around forever, you shouldn't use a match variable like $1 more than a few lines after its pattern match. If your maintenance programmer adds a new regular expression between your regular expression and your use of $1, you'll be getting the value of $1 for the second match, rather than the first. For this reason, if you need a memory for more than a few lines, it's generally best to copy it into an ordinary variable. Doing this helps make the code more readable at the same time:

if ($wilma =~ /(\w+)/) {
  my $wilma_word = $1;
  ...
}
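As a sketch of why the copy matters, suppose a second pattern match is later added inside the same block. The $note variable and its contents are made up for this example:

```perl
use strict;
use warnings;

# Hypothetical sample values.
my $wilma = "wilma flintstone";
my $note  = "size 42";

my ($wilma_word, $later_1);
if ($wilma =~ /(\w+)/) {
  $wilma_word = $1;         # copy made right after the match

  $note =~ /(\d+)/;         # a match added later; it succeeds and resets $1
  $later_1 = $1;            # now the memory from the second match

  print "copy: $wilma_word, \$1 now: $later_1\n";
}
```

The copy in $wilma_word still holds the word from the first match, while $1 has moved on to the second one.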

Later, in Chapter 17, "Some Advanced Perl Techniques", we'll see how to get the memory value directly into the variable at the same time as the pattern match happens, without having to use $1 explicitly.

9.5.2. The Automatic Match Variables

There are three more match variables that you get for free,[201] whether the pattern has memory parentheses or not. That's the good news; the bad news is that these variables have weird names.

[201]Yeah, right. There's no such thing as a free match. These are "free" only in the sense that they don't require match parentheses. Don't worry; we'll mention their real cost a little later, though.

Now, Larry probably would have been happy enough to call these by slightly-less-weird names, like perhaps $gazoo or $ozmodiar. But those are names that you just might want to use in your own code. To keep ordinary Perl programmers from having to memorize the names of all of Perl's special variables before choosing their first variable names in their first programs,[202] Larry has given strange names to many of Perl's builtin variables, names that "break the rules." In this case, the names are punctuation marks: $&, $`, and $'. They're strange, ugly, and weird, but those are their names.[203]

[202]You should still avoid a few classical variable names like $ARGV, but these few are all in all-caps. All of Perl's builtin variables are documented in the perlvar manpage.

[203]If you really can't stand these names, check out the English module, which attempts to give all of Perl's strangest variables nearly normal names. But the use of this module has never really caught on; instead, Perl programmers have grown to love the punctuation-mark variable names, strange as they are.

The part of the string that actually matched the pattern is automatically stored in $&:

if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That actually matched '$&'.\n";
}

That tells us that the part that matched was " there," (with a space, a word, and a comma). Memory one, in $1, has just the five-letter word there, but $& has the entire matched section.

Whatever came before the matched section is in $`, and whatever was after it is in $'. Another way to say that is that $` holds whatever the regular expression engine had to skip over before it found the match, and $' has the remainder of the string that the pattern never got to. If you glue these three strings together in order, you'll always get back the original string:

if ("Hello there, neighbor" =~ /\s(\w+),/) {
  print "That was ($`)($&)($').\n";
}

The message shows the string as (Hello)( there,)( neighbor), showing the three automatic match variables in action. This may seem familiar, and for good reason: These automatic memory variables are what the pattern test program (from Chapter 7, "Concepts of Regular Expressions") was using in its line of "mystery" code, to show what part of the string was being matched by the pattern:

print "Matched: |$`<$&>$'|\n";  # The three automatic match variables

Any or all of these three automatic match variables may be empty, of course, just like the numbered match variables. And they have the same scope as the numbered match variables. Generally, that means that they'll stay around until the next successful pattern match.
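A short sketch can check both claims: that the three pieces glue back into the original string, and that one of them may be empty. The variable names here are only for illustration:

```perl
use strict;
use warnings;

my $text = "Hello there, neighbor";

my ($before, $matched, $after, $glued);
if ($text =~ /\s(\w+),/) {
  ($before, $matched, $after) = ($`, $&, $');
  $glued = $before . $matched . $after;
  print "glued: ", ($glued eq $text ? "same as original" : "different"), "\n";
}

# A match at the very start of the string leaves nothing before it.
my $prefix_length;
if ($text =~ /Hello/) {
  $prefix_length = length($`);
  print "length before the match: $prefix_length\n";
}
```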

Now, we said earlier that these three are "free." Well, freedom has its price. In this case, the price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Now, this isn't a giant slowdown, but it's enough of a worry that many Perl programmers will simply never use these automatic match variables.[204] Instead, they'll use a workaround. For example, if the only one you need is $&, just put parentheses around the whole pattern and use $1 instead (you may need to renumber the pattern's memories, of course).

[204]Most of these folks haven't actually benchmarked their programs to see whether their workarounds actually save time, though; it's as though these variables were poisonous or something. But we can't blame them for not benchmarking -- many programs that could benefit from these three variables take up only a few minutes of CPU time in a week, so benchmarking and optimizing would be a waste of time. But in that case, why fear a possible extra millisecond? By the way, the Perl developers are working on this problem, but there will probably be no solution before Perl 6.
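Here is a sketch of that workaround, applied to the pattern from the earlier examples. The names $whole and $word are just for illustration. With parentheses around the entire pattern, the whole match lands in $1, and the word that used to be in memory one moves to $2:

```perl
use strict;
use warnings;

my $text = "Hello there, neighbor";

my ($whole, $word);
# The earlier pattern was /\s(\w+),/ with the word in memory one.
# Wrapping the whole pattern in parentheses renumbers the memories.
if ($text =~ /(\s(\w+),)/) {
  ($whole, $word) = ($1, $2);
  print "whole match '$whole', word '$word'\n";
}
```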

Match variables (both the automatic ones and the numbered ones) are most often used in substitutions, which are the topic of the next section.




Copyright © 2002 O'Reilly & Associates. All rights reserved.