Learning Perl, 3rd Edition

8.4. Memory Parentheses

Recall that parentheses ("( )") may be used for grouping together parts of a pattern. They also have a second function: they tell the regular expression engine to remember the substring matched by the part of the pattern inside the parentheses. That is to say, the engine doesn't remember the pattern itself; it remembers the corresponding part of the string. Whenever you use parentheses for grouping, they automatically work as memory parentheses as well.

So, if you use /./, you'll match any single character (except newline); if you use /(.)/, you'll still match any single character, but now it will be kept in a regular expression memory. For each pair of parentheses in the pattern, you'll have one regular expression memory.

8.4.1. Backreferences

A backreference refers back to a memory that was saved earlier in the current pattern's processing. Backreferences are made with a backslash, which is easy to remember. For example, \1 contains the first regular expression memory (that is, the part of the string matched by the first pair of parentheses).

Backreferences are used to go back and match the exact same[182] string that was matched earlier in the pattern. So, /(.)\1/ means to match any one character, remember it as memory one, then match memory one again. In other words, match any character, followed by the same character. This pattern will match strings with doubled letters, as in bamm-bamm and betty. Of course, the dot will match characters other than letters, so a string with two spaces in a row, two tabs in a row, or two asterisks in a row will match as well.

[182]Well, if the pattern is case-insensitive, as we'll learn in the next chapter, the capitalization doesn't have to match. Other than that, though, the string must be the same.

That's not the same as the pattern /../, which will match any character followed by any character -- those two could be the same, or they could be different. /(.)\1/ means to match any character followed by the same character.
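To see the difference in practice, here is a minimal sketch that tries both patterns on a few strings. The binding operator (=~) and the qr// quoting are Perl features this chapter hasn't introduced yet; they are used here only as a convenient way to try out the patterns. The helper name matching and the sample strings are made up for this illustration.

```perl
use strict;
use warnings;

# Return the strings that the given pattern matches.
# (Hypothetical helper, just for this illustration.)
sub matching {
    my ($pattern, @strings) = @_;
    return grep { $_ =~ $pattern } @strings;
}

my @names = ('bamm-bamm', 'betty', 'fred', 'a  b', 'x');

# /(.)\1/ needs the same character twice in a row.
my @doubled = matching(qr/(.)\1/, @names);

# /../ needs only two characters in a row, whatever they are.
my @any_two = matching(qr/../, @names);

print "doubled: ", join(', ', @doubled), "\n";
print "any two: ", join(', ', @any_two), "\n";
```

Only bamm-bamm, betty, and the string with two spaces in a row should appear in the first list; the second list should hold every string that is at least two characters long.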

A typical use of these memories is processing some HTML-like[183] text. For example, maybe you want to match a tag like these two, which may use either single quotes or double quotes:

<image source='fred.png'>
<image source="fred's-birthday.png">

[183]These examples are intentionally not HTML, because there are too many tricky things that crop up in real HTML, or any similar markup language like XML or SGML. If you need to work with HTML, don't use simple patterns like these. Get a robust module from CPAN, so that you can start with code that's already written and debugged. If you don't, we promise that you'll be sorry. Don't say we didn't warn you.

The tag may have either single quotes or double quotes, since the quoted data may include the other kind of mark (as with the apostrophe in the second example tag). So the pattern might look like this: /<image source=(['"]).*\1>/. That says that the opening quote mark may be of either type, but there must be a matching mark at the end of the quote.[184]

[184]If you realize that there may be problems with using this pattern on a markup language like HTML, that's okay. There are lots of problems with that! This is just an example to illustrate a use of a backreference. You shouldn't use simple patterns to parse anything as complex as HTML anyway.
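As a quick check of that pattern, the sketch below runs it against the two example tags plus a third tag with mismatched quote marks. The third tag and the variable names are assumptions added for illustration; as the footnotes say, this is a demonstration of a backreference, not a way to parse real markup.

```perl
use strict;
use warnings;

# The pattern from the text: the closing quote mark must be
# the same character that memory one captured.
my $tag_pattern = qr/<image source=(['"]).*\1>/;

my @candidates = (
    q{<image source='fred.png'>},
    q{<image source="fred's-birthday.png">},
    q{<image source='fred.png">},    # mismatched quote marks (made-up case)
);

my @results;
for my $text (@candidates) {
    my $ok = $text =~ $tag_pattern ? 1 : 0;
    push @results, $ok;
    print(($ok ? "match:    " : "no match: "), $text, "\n");
}
```

The first two tags should match, and the third should not, because nothing after its opening single quote supplies a single quote followed by the closing angle bracket.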

If you have more sets of parentheses, you can have more backreferences. As you might guess, \17 is the contents of the seventeenth regular expression memory, if you have at least that many sets of parentheses.[185]

[185]If you don't have that many sets of parentheses before that point in the pattern, backreferences \10 and beyond will be treated as octal character escapes. To keep an octal character escape like \12 from accidentally meaning a backreference, just use a leading zero: \012 is always a character, never a backreference.
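The rule in that footnote can be tried out with a small sketch like the following. The sample strings and variable names are invented for the illustration; the behavior shown assumes the standard Perl rule that a backslash followed by a number of 10 or more is a backreference only when at least that many pairs of parentheses have already appeared in the pattern.

```perl
use strict;
use warnings;

# Ten pairs of parentheses come before \10, so \10 is a
# backreference to the tenth memory (the "j").
my $tenth_memory =
    'abcdefghijj' =~ /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/ ? 1 : 0;

# No parentheses at all here, so \12 is the octal escape for
# character number 10 (a newline), not a backreference.
my $octal_escape = "a\nb" =~ /a\12b/ ? 1 : 0;

# A leading zero removes the ambiguity: \012 is always the character.
my $leading_zero = "a\nb" =~ /a\012b/ ? 1 : 0;

print "tenth memory: $tenth_memory\n";
print "octal escape: $octal_escape\n";
print "leading zero: $leading_zero\n";
```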

In numbering backreferences, you can just count the left (opening) parentheses. The pattern /((fred|wilma) (flintstone)) \1/ says to match strings like fred flintstone fred flintstone, since the first opening parenthesis and its corresponding closing parenthesis hold a pattern that matches fred flintstone.[186]

[186]This pattern would also match wilma flintstone wilma flintstone.

If we wrote /((fred|wilma) (flintstone)) \2/ instead, we would match strings like fred flintstone fred; memory two is the choice of fred or wilma. (Notice that it wouldn't match fred flintstone wilma, since the backreference can match only the same name that was matched earlier: either fred or wilma. But it could match wilma flintstone wilma, since that one uses the same name.) And the pattern /((fred|wilma) (flintstone)) \3/ would match strings like fred flintstone flintstone. It's uncommon to have a literal string like flintstone in memory parentheses, though; we did that one just to have a third example.
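The numbering can be checked with a short sketch that tries each of the three patterns on the strings discussed above. The variable names ($whole, $first, $last, @tests) are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Memory one is the whole name, memory two is the first name,
# and memory three is the last name (count the opening parentheses).
my $whole = qr/((fred|wilma) (flintstone)) \1/;
my $first = qr/((fred|wilma) (flintstone)) \2/;
my $last  = qr/((fred|wilma) (flintstone)) \3/;

my @tests = (
    [ $whole, 'fred flintstone fred flintstone'   ],
    [ $whole, 'wilma flintstone wilma flintstone' ],
    [ $first, 'wilma flintstone wilma'            ],
    [ $first, 'fred flintstone wilma'             ],  # different name: no match
    [ $last,  'fred flintstone flintstone'        ],
);

my @results;
for my $test (@tests) {
    my ($pattern, $string) = @$test;
    my $ok = $string =~ $pattern ? 1 : 0;
    push @results, $ok;
    print(($ok ? "match:    " : "no match: "), $string, "\n");
}
```

Every string except fred flintstone wilma should be reported as a match.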

8.4.2. Memory Variables

When we get to the next chapter and back into the world of Perl, we'll see that the contents of these regular expression memories are available to us in special variables like $1 after the pattern match is done. We mention this here just so you'll know that the memories aren't merely used for backreferences; if you see what seem to be unnecessary parentheses in a pattern, they may actually be setting up those memories.
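As a small preview, and only as a sketch ahead of the fuller explanation in the next chapter, the following shows the memories from one of the earlier patterns being read back after a successful match. The sample string and the array name @memories are assumptions for the illustration; $2 and $3 are taken to work the same way as $1 for the second and third memories.

```perl
use strict;
use warnings;

my @memories;

# After a successful match, $1, $2, and $3 hold the parts of the
# string matched by the first, second, and third pairs of parentheses.
if ('wilma flintstone' =~ /((fred|wilma) (flintstone))/) {
    @memories = ($1, $2, $3);
}

for my $number (1 .. scalar @memories) {
    print "memory $number: $memories[$number - 1]\n";
}
```

The memory variables are read only inside the if block, so they are used only when the match has actually succeeded.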




Copyright © 2002 O'Reilly & Associates. All rights reserved.