
12.5. Parsing XML with SAX

12.5.1. Problem

You want to parse an XML document and format it on an event basis, such as when the parser encounters a new opening or closing element tag. For instance, you want to turn an RSS feed into HTML.

12.5.2. Solution

Use the parsing functions in PHP's XML extension:

$xml = xml_parser_create();
$obj = new Parser_Object;  // a class to assist with parsing

xml_set_object($xml,$obj);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

$fp = fopen('data.xml', 'r') or die("Can't read XML data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse XML data");
}       
fclose($fp);

xml_parser_free($xml);

12.5.3. Discussion

These XML parsing functions require the expat library. However, because Apache 1.3.7 and later is bundled with expat, this library is already installed on most machines. Therefore, PHP enables these functions by default, and you don't need to explicitly configure PHP to support XML.

expat parses XML documents and allows you to configure the parser to call functions when it encounters different parts of the file, such as an opening or closing element tag or character data (the text between tags). Based on the tag name, you can then choose whether to format or ignore the data. This is known as event-based parsing and contrasts with DOM XML, which uses a tree-based parser.
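
To make the event sequence concrete, here is a minimal sketch that records each event the parser reports for a small document. The Event_Logger class and the log_events( ) function are hypothetical names used only for illustration. The sketch passes the handlers as array callables, an alternative to binding an object that the XML extension also accepts.

```php
<?php
// Hypothetical helper class: records each parser event in the order it occurs.
class Event_Logger {
  var $events = array();

  function start_element($parser, $tag, $attributes) {
    $this->events[] = "start:$tag";
  }

  function end_element($parser, $tag) {
    $this->events[] = "end:$tag";
  }

  function character_data($parser, $data) {
    // Ignore whitespace-only runs between tags
    if (trim($data) !== '') {
      $this->events[] = "text:$data";
    }
  }
}

// Hypothetical wrapper: parse a complete document held in a string
// and return the list of recorded events.
function log_events($document) {
  $logger = new Event_Logger;
  $xml = xml_parser_create();
  xml_set_element_handler($xml, array($logger, 'start_element'),
                                array($logger, 'end_element'));
  xml_set_character_data_handler($xml, array($logger, 'character_data'));
  xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
  xml_parse($xml, $document, true) or die("Can't parse XML data");
  return $logger->events;
}

print_r(log_events('<list><entry>one</entry><entry>two</entry></list>'));
```

Each opening tag, run of text, and closing tag shows up as its own entry, in document order. Nothing is kept in memory beyond what the handlers choose to store.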

A popular API for event-based XML parsing is SAX: Simple API for XML. Originally developed only for Java, SAX has spread to other languages. PHP's XML functions follow SAX conventions. For more on the latest version of SAX — SAX2 — see SAX2 by David Brownell (O'Reilly).

PHP supports two interfaces to expat: a procedural one and an object-oriented one. Since the procedural interface practically forces you to use global variables to accomplish any meaningful task, we prefer the object-oriented version. With the object-oriented interface, you can bind an object to the parser and interact with the object while processing XML. This allows you to use object properties instead of global variables.

Here's an example application of expat that shows how to process an RSS feed and transform it into HTML. For more on RSS, see Recipe 12.12. The script starts with the standard XML processing code, followed by the objects created to parse RSS specifically:

$xml = xml_parser_create( );
$rss = new pc_RSS_parser;

xml_set_object($xml, $rss);
xml_set_element_handler($xml, 'start_element', 'end_element');
xml_set_character_data_handler($xml, 'character_data');
xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);

$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}       
fclose($fp);

xml_parser_free($xml);

After creating a new XML parser and an instance of the pc_RSS_parser class, configure the parser. First, bind the object to the parser; this tells the parser to call the object's methods instead of global functions. Then call xml_set_element_handler( ) and xml_set_character_data_handler( ) to specify the method names the parser should call when it encounters elements and character data. The first argument to both functions is the parser instance; the other arguments are the function names. With xml_set_element_handler( ), the middle and last arguments are the functions to call when a tag opens and closes, respectively. The xml_set_character_data_handler( ) function takes only one additional argument — the function to call when it processes character data.

Because an object has been associated with our parser, when that parser finds the string <tag>data</tag>, it calls $rss->start_element( ) when it reaches <tag>; $rss->character_data( ) when it reaches data; and $rss->end_element( ) when it reaches </tag>. The parser can't be configured to automatically call individual methods for each specific tag; instead, you must handle this yourself. However, the PEAR package XML_Transform provides an easy way to assign handlers on a tag-by-tag basis.
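
One way to handle per-tag dispatch yourself is to route each element to a method named after the tag. The following is only a sketch of that idea, not how XML_Transform works. The Tag_Dispatcher class, the start_ and end_ method naming scheme, and dispatch_tags( ) are all assumptions made for this example.

```php
<?php
// Hypothetical dispatcher: <title> calls start_title() and end_title()
// when those methods exist; tags without a matching method are ignored.
class Tag_Dispatcher {
  var $calls = array();

  function start_element($parser, $tag, $attributes) {
    $method = 'start_' . $tag;
    if (method_exists($this, $method)) {
      $this->$method($attributes);
    }
  }

  function end_element($parser, $tag) {
    $method = 'end_' . $tag;
    if (method_exists($this, $method)) {
      $this->$method();
    }
  }

  // Per-tag handlers; here they only record that they were called
  function start_title($attributes) { $this->calls[] = 'start_title'; }
  function end_title()              { $this->calls[] = 'end_title'; }
  function start_link($attributes)  { $this->calls[] = 'start_link'; }
}

// Hypothetical wrapper: parse a string and return the handlers that ran.
function dispatch_tags($document) {
  $dispatcher = new Tag_Dispatcher;
  $xml = xml_parser_create();
  xml_set_element_handler($xml, array($dispatcher, 'start_element'),
                                array($dispatcher, 'end_element'));
  xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
  xml_parse($xml, $document, true) or die("Can't parse XML data");
  return $dispatcher->calls;
}

print_r(dispatch_tags('<item><title>t</title><link>l</link></item>'));
```

The generic handlers stay small, and supporting a new tag means adding a method rather than extending an if/elseif chain.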

The last XML parser configuration option tells the parser not to automatically convert all tags to uppercase. By default, the parser folds tags into capital letters, so <tag> and <TAG> both become the same element. Since XML is case-sensitive, and most feeds use lowercase element names, this feature should be disabled.
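
The effect of case folding is easy to see by parsing the same document twice. This sketch uses a hypothetical element_names( ) function that collects the tag names passed to the start handler, once with folding enabled and once with it disabled.

```php
<?php
// Hypothetical helper: return the element names reported to the start
// handler, with case folding switched on or off.
function element_names($document, $fold) {
  $names = array();
  $xml = xml_parser_create();
  xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, $fold);
  xml_set_element_handler($xml,
    function ($parser, $tag, $attributes) use (&$names) {
      $names[] = $tag;
    },
    function ($parser, $tag) {
      // closing tags are not needed for this demonstration
    });
  xml_parse($xml, $document, true) or die("Can't parse XML data");
  return $names;
}

$document = '<rss><Item/></rss>';
print_r(element_names($document, true));   // folding on: RSS, ITEM
print_r(element_names($document, false));  // folding off: rss, Item
```

With folding left on, a comparison such as 'item' == $tag never matches, because the handler receives ITEM instead.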

With the parser configured, feed the data to the parser:

$feed = 'http://pear.php.net/rss.php';
$fp = fopen($feed, 'r') or die("Can't read RSS data.");
while ($data = fread($fp, 4096)) {
  xml_parse($xml, $data, feof($fp)) or die("Can't parse RSS data");
}       
fclose($fp);

To curb memory usage, load the file in 4096-byte chunks and feed each piece to the parser one at a time. This requires handler functions that accommodate text arriving across multiple calls, rather than assuming the entire string comes in all at once.
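
The sketch below feeds a document to the parser in deliberately small chunks to show why handlers should append rather than assign. Text_Collector and collect_in_chunks( ) are hypothetical names, and the document comes from a string instead of a file so the example is self-contained. How many times character_data( ) runs depends on the underlying parser library and the chunk size, so treat the reported count as informational only.

```php
<?php
// Hypothetical collector: appends every piece of character data it is given.
class Text_Collector {
  var $text = '';
  var $calls = 0;

  function start_element($parser, $tag, $attributes) { }
  function end_element($parser, $tag) { }

  function character_data($parser, $data) {
    $this->calls++;
    $this->text .= $data;   // append: one text node may arrive in pieces
  }
}

// Hypothetical wrapper: split a document into fixed-size chunks and
// feed them to the parser one at a time.
function collect_in_chunks($document, $chunk_size) {
  $collector = new Text_Collector;
  $xml = xml_parser_create();
  xml_set_element_handler($xml, array($collector, 'start_element'),
                                array($collector, 'end_element'));
  xml_set_character_data_handler($xml, array($collector, 'character_data'));

  $chunks = str_split($document, $chunk_size);
  $last = count($chunks) - 1;
  foreach ($chunks as $i => $chunk) {
    // The third argument tells the parser whether this is the final chunk
    xml_parse($xml, $chunk, $i == $last) or die("Can't parse XML data");
  }
  return $collector;
}

$document = '<note>' . str_repeat('0123456789', 100) . '</note>';
$collector = collect_in_chunks($document, 64);
printf("%d bytes of text arrived in %d call(s)\n",
       strlen($collector->text), $collector->calls);
```

Because the handler appends, the collected text is complete no matter how the parser divides it among calls.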

Last, while PHP cleans up any open parsers when the request ends, you can also manually close the parser by calling xml_parser_free( ).

Now that the generic parsing is properly set up, add the pc_RSS_item and pc_RSS_parser classes, as shown in Example 12-1 and Example 12-2, to handle an RSS document.

Example 12-1. pc_RSS_item

class pc_RSS_item {

  var $title = '';
  var $description = '';
  var $link = '';

  function display() {
    // Escape every field, including the link, before placing it in HTML
    printf('<p><a href="%s">%s</a><br />%s</p>',
            htmlspecialchars($this->link),
            htmlspecialchars($this->title),
            htmlspecialchars($this->description));
  }
}

Example 12-2. pc_RSS_parser

class pc_RSS_parser {
  
  var $tag;
  var $item;
  
  function start_element($parser, $tag, $attributes) {
    if ('item' == $tag) {
      $this->item = new pc_RSS_item;
    } elseif (!empty($this->item)) {
      $this->tag = $tag;
    }
  }
  
  function end_element($parser, $tag) {
    if ('item' == $tag) {
      $this->item->display();
      unset($this->item); 
    }
  }
  
  function character_data($parser, $data) {
    if (!empty($this->item)) {
      if (isset($this->item->{$this->tag})) {
        $this->item->{$this->tag} .= trim($data);
      }
    }
  }
}  

The pc_RSS_item class provides an interface to an individual feed item. This removes the details of displaying each item from the general parsing code and makes it easy to reset the data for a new item by calling unset( ).

The pc_RSS_item::display( ) method prints out an HTML-formatted RSS item. It calls htmlspecialchars( ) to reencode any necessary entities, because expat decodes them into regular characters while parsing the document. This reencoding, however, breaks on feeds that place HTML in the title and description instead of plaintext.

Within the pc_RSS_parser class, the start_element( ) method takes three parameters: the XML parser, the name of the tag, and an array of attribute/value pairs (if any) from the element. PHP automatically supplies these values to the handler as part of the parsing process.
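
The recipe's handlers never use the attribute array, so here is a short sketch of reading from it. The rss_version( ) function is a hypothetical helper that pulls the version attribute from the root rss element. It assumes case folding is disabled, so tag and attribute names arrive exactly as written in the document.

```php
<?php
// Hypothetical helper: return the version attribute of the <rss> element,
// or null if the element or the attribute is missing.
function rss_version($document) {
  $version = null;
  $xml = xml_parser_create();
  xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
  xml_set_element_handler($xml,
    function ($parser, $tag, $attributes) use (&$version) {
      // $attributes maps attribute names to their values
      if ('rss' == $tag && isset($attributes['version'])) {
        $version = $attributes['version'];
      }
    },
    function ($parser, $tag) {
      // nothing to do when a tag closes
    });
  xml_parse($xml, $document, true) or die("Can't parse XML data");
  return $version;
}

var_dump(rss_version('<rss version="0.93"><channel/></rss>'));
```

An element without attributes passes an empty array, which is why the handler checks isset( ) before reading the value.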

The start_element( ) method checks the value of $tag. If it's item, the parser has found a new RSS item, and a new pc_RSS_item object is instantiated. Otherwise, it checks to see if $this->item is empty( ); if it isn't, the parser is inside an item element. It's then necessary to record the tag's name, so that the character_data( ) method knows which property to assign its value to. If it is empty, this part of the RSS feed isn't necessary for our application, and it's ignored.

When the parser finds a closing item tag, the corresponding end_element( ) method first prints the RSS item, then cleans up by deleting the object.

Finally, the character_data( ) method is responsible for assigning the values of title, description, and link to the RSS item. After making sure it's inside an item element, it checks that the current tag is one of the properties of pc_RSS_item. Without this check, if the parser encountered an element other than those three, its value would also be assigned to the object. The { } s are needed to set the object property dereferencing order. Notice how trim($data) is appended to the property instead of a direct assignment. This is done to handle cases in which the character data is split across the 4096-byte chunks retrieved by fread( ); it also removes the surrounding whitespace found in the RSS feed.

If you run the code on this sample RSS feed:

<?xml version="1.0"?>
<rss version="0.93">
<channel>
  <title>PHP Announcements</title>
  <link>http://www.php.net/</link>
  <description>All the latest information on PHP.</description>

  <item>
    <title>PHP 5.0 Released!</title>
    <link>http://www.php.net/downloads.php</link>
    <description>The newest version of PHP is now available.</description>
  </item>
</channel>
</rss>

It produces this HTML:

<p><a href="http://www.php.net/downloads.php">PHP 5.0 Released!</a><br />
The newest version of PHP is now available.</p>
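
One caveat about the character_data( ) method in Example 12-2: because it trims every piece of text separately, whitespace next to a piece boundary is discarded. For example, if the parser delivers a title such as Tom &amp;amp; Jerry as three pieces, the spaces around the ampersand are lost. The following sketch shows one possible variant that appends the raw data and trims each field once, when its closing tag arrives. Trim_Late_Parser and parse_items( ) are hypothetical names, and the variant collects items in an array instead of printing them so the result is easy to inspect.

```php
<?php
// Hypothetical variant of pc_RSS_parser: trims each field only once,
// when the field's closing tag is reached.
class Trim_Late_Parser {
  var $items = array();
  var $item = null;   // fields of the item being read; null outside <item>
  var $tag = '';      // name of the field currently being read

  function start_element($parser, $tag, $attributes) {
    if ('item' == $tag) {
      $this->item = array('title' => '', 'description' => '', 'link' => '');
    } elseif (is_array($this->item)) {
      $this->tag = $tag;
    }
  }

  function end_element($parser, $tag) {
    if ('item' == $tag) {
      $this->items[] = $this->item;
      $this->item = null;
    } elseif (is_array($this->item) && isset($this->item[$tag])) {
      // The field is complete: strip surrounding whitespace once
      $this->item[$tag] = trim($this->item[$tag]);
    }
    $this->tag = '';
  }

  function character_data($parser, $data) {
    if (is_array($this->item) && isset($this->item[$this->tag])) {
      $this->item[$this->tag] .= $data;   // keep inner whitespace intact
    }
  }
}

// Hypothetical wrapper: parse a string and return the collected items.
function parse_items($document) {
  $rss = new Trim_Late_Parser;
  $xml = xml_parser_create();
  xml_set_element_handler($xml, array($rss, 'start_element'),
                                array($rss, 'end_element'));
  xml_set_character_data_handler($xml, array($rss, 'character_data'));
  xml_parser_set_option($xml, XML_OPTION_CASE_FOLDING, false);
  xml_parse($xml, $document, true) or die("Can't parse RSS data");
  return $rss->items;
}

$items = parse_items('<rss><channel><item>
  <title> Tom &amp; Jerry </title>
  <link>http://www.example.com/</link>
</item></channel></rss>');
print_r($items);
```

Resetting $this->tag in end_element( ) also keeps the whitespace between elements from being appended to the field that was read last.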

12.5.4. See Also

Recipe 12.4 for tree-based XML parsing with DOM; Recipe 12.12 for more on parsing RSS; documentation on xml_parser_create( ) at http://www.php.net/xml-parser-create, xml_set_element_handler( ) at http://www.php.net/xml-set-element-handler, xml_set_character_data_handler( ) at http://www.php.net/xml-set-character-data-handler, xml_parse( ) at http://www.php.net/xml-parse, and the XML functions in general at http://www.php.net/xml; the official SAX site at http://www.saxproject.org/.




Copyright © 2003 O'Reilly & Associates. All rights reserved.