13.3. Matching Words

13.3.1. Problem

You want to pull out all words from a string.

13.3.2. Solution

The key is to define carefully what you mean by a word. Once you've settled on a definition, use the special character types and character classes to build your regular expression:

/\S+/         // everything that isn't whitespace
/[A-Z'-]+/i   // all upper and lowercase letters, apostrophes, and hyphens
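As a minimal sketch of how these two patterns behave, the following passes each one to preg_match_all(). The sample sentence and the variable names are made up for illustration and are not part of the recipe:

```php
<?php
// Hypothetical sample text, chosen to include punctuation, an apostrophe, and a hyphen.
$text = "Don't panic: it's a well-known PHP trick.";

// Definition 1: a word is any run of non-whitespace characters.
// Punctuation stays attached to the neighboring letters.
preg_match_all('/\S+/', $text, $matches);
$by_whitespace = $matches[0];
// "Don't", "panic:", "it's", "a", "well-known", "PHP", "trick."

// Definition 2: a word is a run of letters, apostrophes, and hyphens.
// Colons and periods are not in the class, so they are dropped.
preg_match_all("/[A-Z'-]+/i", $text, $matches);
$by_letters = $matches[0];
// "Don't", "panic", "it's", "a", "well-known", "PHP", "trick"
```

Which definition is right depends on your data; the second is usually closer to what people mean by a word in English prose.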

13.3.3. Discussion

The simple question "what is a word?" is surprisingly complicated. While Perl-compatible regular expressions have a built-in word character type, \w, it's important to understand exactly how PHP defines a word. Otherwise, your results may not be what you expect.

Normally, because it comes directly from Perl's definition of a word, \w encompasses all letters, digits, and underscores; this means a_z is a word, but the email address php@example.com is not.
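To see this definition at work, here is a short sketch that applies \w+ to both examples from the paragraph above. The variable names are illustrative only:

```php
<?php
// \w+ matches runs of letters, digits, and underscores.
preg_match_all('/\w+/', 'a_z php@example.com', $matches);
$words = $matches[0];
// a_z survives intact; the address is split at "@" and at "."
// "a_z", "php", "example", "com"

// Anchoring the pattern tests whether an entire string is one \w word.
$is_word_a_z   = preg_match('/^\w+$/', 'a_z') === 1;              // true
$is_word_email = preg_match('/^\w+$/', 'php@example.com') === 1;  // false
```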

In this recipe, we consider only English words, but other languages use different alphabets. Because Perl-compatible regular expressions take their character settings from the current locale, altering the locale can change the definition of a letter, which in turn changes the meaning of a word.
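The following is only a rough sketch of how you might check this on your own system. The locale names are assumptions (they vary by platform and may not be installed), so the code records whether the switch succeeded and makes no claim about the outcome of the match:

```php
<?php
// Remember the current LC_CTYPE setting; passing "0" queries without changing it.
$previous = setlocale(LC_CTYPE, '0');

// Assumption: these locale names are examples only. setlocale() returns
// false if none of them is available on this machine.
$switched = setlocale(LC_CTYPE, 'fr_FR.ISO-8859-1', 'fr_FR');

// "cafe" with an accented final e, encoded as a single Latin-1 byte (0xE9).
$subject = "caf\xE9";

// Whether \w accepts byte 0xE9 depends on the locale actually in effect.
$matched_whole = preg_match('/^\w+$/', $subject) === 1;

// Put the original locale back so later code is unaffected.
setlocale(LC_CTYPE, $previous);
```

If $switched is false, the test ran under the original locale and tells you nothing about the French one.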

To combat this, you may want to explicitly enumerate the characters that belong to your words inside a character class. To add a nonstandard character, use \ddd, where ddd is the character's octal code.
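As an illustration of the octal escape, the sketch below adds one accented letter to the character class from the Solution. It assumes the text is in a single-byte Latin-1 encoding, where the accented e is byte 0xE9 (octal 351); the sample string is made up, and this byte-level approach does not carry over as written to UTF-8 text:

```php
<?php
// Hypothetical Latin-1 input: "\xE9" is the single byte for an accented e.
$text = "caf\xE9 au lait";

// Without the extra character, the word is cut off at the accented letter.
preg_match_all('/[A-Z\'-]+/i', $text, $matches);
$plain = $matches[0];
// "caf", "au", "lait"

// \351 inside the class is the octal code for byte 0xE9. The pattern is
// single-quoted so PHP passes the backslash sequence through to the regex engine.
preg_match_all('/[A-Z\351\'-]+/i', $text, $matches);
$extended = $matches[0];
// "caf\xE9", "au", "lait"
```

Because the class lists its members explicitly, the result does not depend on what the current locale considers a letter.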

13.3.4. See Also

Recipe 16.3 for information about setting locales.




Copyright © 2003 O'Reilly & Associates. All rights reserved.