Learning Perl, 3rd Edition

7.2. Using Simple Patterns

To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/), as we do here:

$_ = "yabba dabba doo";
if (/abba/) {
  print "It matched!\n";
}

The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value. In this case, it's found more than once, but that doesn't make any difference. If it's found at all, it's a match; if it's not in there at all, it fails.

Because a pattern match generally returns a true or false value, it is almost always found in the conditional expression of if or while.

All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the eleven characters of coke, a tab, and sprite.
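Here's a minimal sketch of that claim. The variable names $tab_matched and $space_matched are ours, chosen only for illustration:

```perl
use strict;
use warnings;

# A double-quoted string with a real tab character between the words.
$_ = "coke\tsprite";

# Inside the pattern, \t stands for a tab, just as it does in the string.
my $tab_matched = /coke\tsprite/ ? 1 : 0;

# A space is not a tab, so the same pattern should fail here.
$_ = "coke sprite";
my $space_matched = /coke\tsprite/ ? 1 : 0;

print "with a tab:   ", ($tab_matched   ? "matched" : "failed"), "\n";
print "with a space: ", ($space_matched ? "matched" : "failed"), "\n";
```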

7.2.1. About Metacharacters

Of course, if patterns matched only simple literal strings, they wouldn't be very useful. That's why there are a number of special characters, called metacharacters, that have special meanings in regular expressions.

For example, the dot (.) is a wildcard character -- it matches any single character except a newline (which is represented by "\n"). So, the pattern /bet.y/ would match betty. Or it would match betsy, or bet=y, or bet.y, or any other string that has bet, followed by any one character (except a newline), followed by y. It wouldn't match bety or betsey, though, since those don't have exactly one character between the t and the y. The dot always matches exactly one character.

So, if you wanted to match a period in the string, you could use the dot. But that would match any possible character (except a newline), which might be more than you wanted. If you want the dot to match just a period, simply backslash it. In fact, that rule goes for all of Perl's regular expression metacharacters: a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn't have a wildcard character; it matches only the literal string 3.14159.

So the backslash is our second metacharacter. If you mean a real backslash, just use a pair of them -- a rule that applies just as well everywhere else in Perl.
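To see the dot and the backslash at work, here's a small sketch that tries the strings discussed above. The names @candidates, %dot, %literal, and the sample string back\slash are our own inventions for the example:

```perl
use strict;
use warnings;

# Try /bet.y/ against each string; the loop puts each one into $_.
my @candidates = ("betty", "betsy", "bet=y", "bet.y", "bety", "betsey");
my %dot;
for (@candidates) {
    $dot{$_} = /bet.y/ ? 1 : 0;
    print "/bet.y/ on $_: ", ($dot{$_} ? "matches" : "fails"), "\n";
}

# The dot won't match a newline, so this one should fail.
$_ = "bet\ny";
my $newline_matched = /bet.y/ ? 1 : 0;

# With the dot backslashed, only a real period will do.
my %literal;
for ("3.14159", "3x14159") {
    $literal{$_} = /3\.14159/ ? 1 : 0;
}

# A pair of backslashes in the pattern matches one real backslash.
# (In a single-quoted string, \s is just a backslash followed by s.)
$_ = 'back\slash';
my $backslash_matched = /back\\slash/ ? 1 : 0;
```

Only bety, betsey, the string with the newline, and 3x14159 should fail; everything else should match.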

7.2.2. Simple Quantifiers

It often happens that you'll need to repeat something in a pattern. The star (*) means to match the preceding item zero or more times. So, /fred\t*barney/ matches any number of tab characters between fred and barney. That is, it matches "fred\tbarney" with one tab, or "fred\t\tbarney" with two tabs, or "fred\t\t\tbarney" with three tabs, or even "fredbarney" with nothing in between at all. That's because the star means "zero or more" -- so you could even have hundreds of tab characters in between, but nothing other than tabs. You may find it helpful to think of star as saying, "that previous thing, any number of times, even zero times."

What if you wanted to allow something besides tab characters? The dot matches any character[167], so .* will match any character, any number of times. That means that the pattern /fred.*barney/ matches "any old junk" between fred and barney. Any line that mentions fred and (somewhere later) barney will match that pattern. We often call .* the "any old junk" pattern, because it can match any old junk in your strings.

[167]Except newline. But we're going to stop reminding you of that so often, because you know it by now. Most of the time it doesn't matter, anyway, because your strings will most often not have newlines. But don't forget this detail, because someday a newline will sneak into your string and you'll need to remember that the dot doesn't match newline.
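The star and the "any old junk" pattern can be tried out with a sketch like this one. The hash names %star and %junk and the sample strings are ours, not part of any standard interface:

```perl
use strict;
use warnings;

# Star: zero or more tabs, and nothing but tabs, between the names.
my %star;
for ("fredbarney", "fred\tbarney", "fred\t\t\tbarney", "fred barney") {
    $star{$_} = /fred\t*barney/ ? "matches" : "fails";
}
print "no tabs:    ", $star{"fredbarney"},         "\n";
print "one tab:    ", $star{"fred\tbarney"},       "\n";
print "three tabs: ", $star{"fred\t\t\tbarney"},   "\n";
print "a space:    ", $star{"fred barney"},        "\n";

# Dot-star: any old junk (except a newline) between the names.
my %junk;
for ("fred and then barney", "barney then fred", "fred\nbarney") {
    $junk{$_} = /fred.*barney/ ? "matches" : "fails";
}
print "junk between:    ", $junk{"fred and then barney"}, "\n";
print "wrong order:     ", $junk{"barney then fred"},     "\n";
print "newline between: ", $junk{"fred\nbarney"},         "\n";
```

The space-separated string fails the first pattern because a space isn't a tab, and the last string fails the second pattern because the dot won't cross the newline.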

The star is formally called a quantifier, meaning that it specifies a quantity of the preceding item. But it's not the only quantifier; the plus ("+") is another. The plus means to match the preceding item one or more times: /fred +barney/ matches if fred and barney are separated by spaces and only spaces. (The space is not a metacharacter.) This won't match fredbarney, since the plus means that there must be one or more spaces between the two names, so at least one space is required. It may be helpful to think of the plus as saying, "that last thing, plus any number more of the same thing."

There's a third quantifier like the star and plus, but more limited. It's the question mark ("?"), which means that the preceding item is optional. That is, the preceding item may occur once or not at all. Like the other two quantifiers, the question mark means that the preceding item appears a certain number of times. It's just that in this case the item may match one time (if it's there) or zero times (if it's not). There aren't any other possibilities. So, /bamm-?bamm/ matches either spelling: bamm-bamm or bammbamm. This is easy to remember, since it's saying "that last thing, maybe? Or maybe not?"

All three of these quantifiers must follow something, since they tell how many times the previous item may repeat.
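Here's a brief sketch of the plus and the question mark in action. The hashes %plus and %optional are hypothetical names for this example, and the test strings are ours:

```perl
use strict;
use warnings;

# Plus: one or more spaces, and only spaces, between the names.
my %plus;
for ("fred barney", "fred     barney", "fredbarney", "fred\tbarney") {
    $plus{$_} = /fred +barney/ ? 1 : 0;
}
print "one space:   ", ($plus{"fred barney"}     ? "matches" : "fails"), "\n";
print "five spaces: ", ($plus{"fred     barney"} ? "matches" : "fails"), "\n";
print "no space:    ", ($plus{"fredbarney"}      ? "matches" : "fails"), "\n";
print "a tab:       ", ($plus{"fred\tbarney"}    ? "matches" : "fails"), "\n";

# Question mark: the hyphen may appear once or not at all.
my %optional;
for ("bamm-bamm", "bammbamm", "bamm--bamm") {
    $optional{$_} = /bamm-?bamm/ ? 1 : 0;
    print "$_: ", ($optional{$_} ? "matches" : "fails"), "\n";
}
```

The string with two hyphens fails, since the question mark allows at most one.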

7.2.3. Grouping in Patterns

As in mathematics, parentheses ("( )") may be used for grouping. So, parentheses are also metacharacters. As an example, the pattern /fred+/ matches strings like freddddddddd, but strings like that don't show up often in real life. But the pattern /(fred)+/ matches strings like fredfredfred, which is more likely to be what you wanted. And what about the pattern /(fred)*/? That matches strings like hello, world.[168]

[168]The star means to match zero or more repetitions of fred. When you're willing to settle for zero, it's hard to be disappointed! That pattern will match any string, even the empty string.
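To see what the parentheses change, it helps to look at the text that actually matched. The sketch below assumes Perl's built-in $& variable, which holds the text of the most recent successful match; this section hasn't introduced it, and the other variable names are our own:

```perl
use strict;
use warnings;

my ($plain, $grouped, $star_text) = ("", "", "not matched");

$_ = "fredfredfred";

# Without parentheses, the plus applies only to the letter d.
if (/fred+/) {
    $plain = $&;
}

# With parentheses, the plus applies to the whole word fred.
if (/(fred)+/) {
    $grouped = $&;
}

# Zero repetitions of fred are enough to satisfy the star,
# so this succeeds by matching nothing at all.
$_ = "hello, world";
my $star_matched = 0;
if (/(fred)*/) {
    $star_matched = 1;
    $star_text    = $&;
}

print "plain:   '$plain'\n";
print "grouped: '$grouped'\n";
print "star:    '$star_text'\n";
```

The first pattern should match just fred, the second all of fredfredfred, and the third an empty string at the very start of hello, world.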

7.2.4. Alternatives

The vertical bar (|), often pronounced "or" in this usage, means that either the left side may match, or the right side. That is, if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match. So, /fred|barney|betty/ will match any string that mentions fred, or barney, or betty.

Now we can make patterns like /fred( |\t)+barney/, which matches if fred and barney are separated by spaces, tabs, or a mixture of the two. The plus means to repeat one or more times; each time it repeats, the ( |\t) has the chance to match either a space or a tab.[169] There must be at least one character between the two names.

[169]This particular match would normally be done more efficiently with a character class, as we'll see in the next chapter.

If you wanted the characters between fred and barney to all be the same, you could rewrite that pattern as /fred( +|\t+)barney/. In this case, the separators must be all spaces, or all tabs.
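The difference between those two separator patterns shows up when spaces and tabs are mixed. Here's a sketch; the names %mixed and %uniform and the sample strings are assumptions made for the example:

```perl
use strict;
use warnings;

my @strings = ("fred   barney", "fred\t\tbarney", "fred \tbarney", "fredbarney");

# Each repetition may pick a space or a tab, so mixtures are fine.
my %mixed;
# The choice is made once: a run of spaces, or a run of tabs.
my %uniform;

for (@strings) {
    $mixed{$_}   = /fred( |\t)+barney/   ? 1 : 0;
    $uniform{$_} = /fred( +|\t+)barney/ ? 1 : 0;
}

print "space then tab, mixed pattern:   ",
    ($mixed{"fred \tbarney"}   ? "matches" : "fails"), "\n";
print "space then tab, uniform pattern: ",
    ($uniform{"fred \tbarney"} ? "matches" : "fails"), "\n";
```

Both patterns should accept the all-spaces and all-tabs strings and reject fredbarney; only the first should accept the space followed by a tab.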

The pattern /fred (and|or) barney/ matches any string containing either of the two possible strings: fred and barney, or fred or barney.[170] We could match the same two strings with the pattern /fred and barney|fred or barney/, but that would be too much typing. It would probably also be less efficient, depending upon what optimizations are built into the regular expression engine.

[170]Note that the words "and" and "or" are not operators in regular expressions! Here they are simply literal text, part of the strings being matched.
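As a rough check that the grouped pattern and the spelled-out pattern agree, this sketch runs both against a few strings. The hashes %short and %long and the test strings are ours, picked for illustration:

```perl
use strict;
use warnings;

my @strings = ("fred and barney", "fred or barney",
               "fred nor barney", "fred barney");

my (%short, %long);
for (@strings) {
    # Parentheses limit the alternation to the middle word.
    $short{$_} = /fred (and|or) barney/ ? 1 : 0;
    # Without them, each alternative must spell out the whole phrase.
    $long{$_}  = /fred and barney|fred or barney/ ? 1 : 0;
    print "$_: short=$short{$_} long=$long{$_}\n";
}
```

Both patterns should accept the first two strings and reject the last two.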




Copyright © 2002 O'Reilly & Associates. All rights reserved.