XHTML
Direct Display of XML in Browsers
Authoring Compound Documents with Modular XHTML
Prospects for Improved Web-Search Methods
XML began as an effort to bring the full power and structure of SGML to the Web in a form that was simple enough for nonexperts to use. Like most great inventions, XML turned out to have uses far beyond what its creators originally envisioned. Indeed, there's a lot more XML off the Web than on it. Nonetheless, XML is still a very attractive language in which to write and serve web pages. Since XML documents must be well-formed and parsers must reject malformed documents, XML pages are less likely to have annoying cross-browser incompatibilities. Since XML documents are highly structured, they're much easier for robots to parse. Since XML tag and attribute names reflect the nature of the content they hold, search-engine spiders can more easily determine the true meaning of a page.
XML on the Web comes in three flavors. The first is XHTML, an XMLized variant of HTML 4.0 that tightens up HTML to match XML's syntax. For instance, XHTML requires that all start-tags correspond to a matching end-tag and that all attribute values be quoted. XHTML also adds a few bits of syntax to HTML, such as the XML declaration and empty-element tags that end with />. Most of XHTML can be displayed quite well in legacy browsers, with a few notable exceptions.
The second flavor of XML on the Web is direct display of XML documents that use arbitrary vocabularies in web browsers. Generally, the formatting of the document is supplied either by a CSS stylesheet or by an XSLT stylesheet that transforms the document into HTML (perhaps XHTML). This flavor requires an XML-aware browser and is only beginning to be supported by the installed base of web clients.
A third option is to mix raw XML vocabularies such as MathML and SVG with XHTML using Modular XHTML. Modular XHTML lets you embed RDF cataloging information, MathML equations, SVG pictures, and more inside your XHTML documents. Namespaces sort out which elements belong to which applications.
XHTML is an official W3C recommendation. It defines an XML-compatible version of HTML, or rather it redefines HTML as an XML application instead of as an SGML application. Just looking at an XHTML document, you might not even realize that there's anything different about it. It still uses the same <p>, <li>, <table>, <h1>, and other tags with which you're familiar. Elements and attributes have the same, familiar names they have in HTML. The syntax is still basically the same.
The difference is not so much what's allowed but what's not allowed. <p> is a legal XHTML tag, but <P> is not. <table border="0" width="515"> is legal XHTML; <table border=0 width=515> is not. A paragraph prefixed with a <p> and suffixed with a </p> is legal XHTML, but a paragraph that omits the closing </p> tag is not. Most existing HTML documents require substantial editing before they become well-formed and valid XHTML documents. However, once they are valid XHTML documents, they are automatically valid XML documents that can be manipulated with the same editors, parsers, and other tools you use to work with any XML document.
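The well-formedness half of these rules can be checked with any XML parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is ours, chosen for illustration. It tests only well-formedness, so it would accept <P>: uppercase names are well-formed XML and are rejected only when the document is validated against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Quoted attribute values parse; unquoted ones are rejected.
print(is_well_formed('<table border="0" width="515"></table>'))
print(is_well_formed('<table border=0 width=515></table>'))

# A paragraph with its closing </p> parses; one without it is rejected.
print(is_well_formed('<div><p>A paragraph.</p></div>'))
print(is_well_formed('<div><p>A paragraph.</div>'))
```

The first and third calls report True and the second and fourth report False, because the parser stops at the first well-formedness error instead of guessing what the author meant.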
Most of the changes required to turn an existing HTML document into an XHTML document involve making the document well-formed. For instance, given a legacy HTML document, you'll probably have to make at least some of these changes to turn it into XHTML:
Add missing end-tags like </p> and </li>.
Rewrite elements so that they nest rather than overlap. For example, change <p><em>an emphasized paragraph</p></em> to <p><em>an emphasized paragraph</em></p>.
Put double or single quotes around your attribute values. For example, change <p align=center> to <p align="center">.
Add values (which are the same as the name) to all minimized Boolean attributes. For example, change <input type="checkbox" checked> to <input type="checkbox" checked="checked">.
Replace any occurrences of & or < in character data or attribute values with &amp; and &lt;. For instance, change A&P to A&amp;P and <a href="http://www.google.com/search?client=googlet&q=Java%20XML"> to <a href="http://www.google.com/search?client=googlet&amp;q=Java%20XML">.
Make sure the document has a single root html element.
Change empty elements like <hr> to <hr/> or <hr></hr>.
Add hyphens to comments so that <! this is a comment> becomes <!-- this is a comment -->.
Encode the document in UTF-8 or UTF-16, or add an XML declaration that specifies in which character set it is encoded.
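If you generate pages from a program, the escaping step in this list can be automated. This is a minimal sketch, assuming Python's standard xml.sax.saxutils module; the sample strings come from the list above. Note that escape is not idempotent, so it should be applied only to text that has not already been escaped.

```python
from xml.sax.saxutils import escape, quoteattr

# Escape reserved characters in character data.
print(escape("A&P"))

# quoteattr escapes the value and wraps it in quotes, ready to follow "href=".
url = "http://www.google.com/search?client=googlet&q=Java%20XML"
print("<a href=" + quoteattr(url) + ">")
```

The first call yields A&amp;P, and the second yields the start-tag with &amp; in place of the bare & in the query string.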
However, XHTML doesn't merely require well-formedness; it requires validity. In order to create a valid XHTML document, you'll need to make these changes as well:
Add a DOCTYPE declaration to the document pointing to one of the three XHTML DTDs.
Make all element and attribute names lowercase.
Make any other changes you have to make to your markup so that the document validates against the DTD: for example, eliminating nonstandard elements like marquee, adding required attributes like the alt attribute of img, or moving child elements out from inside elements where they're not allowed such as a blockquote inside a p.
In addition, the XHTML specification imposes several requirements that, strictly speaking, are not required for either well-formedness or validity. However, they do make parsing XHTML documents a little easier. These are:
The root element of the document must be html.
There must be a DOCTYPE declaration that uses a PUBLIC ID to identify one of the three XHTML DTDs.
The root element of the document must have an xmlns attribute identifying the default namespace as http://www.w3.org/1999/xhtml.
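These three requirements are easy to test mechanically. The following is a rough sketch, assuming Python's standard xml.dom.minidom module; the function name check_xhtml_requirements and the sample document are ours, and the check does not validate the document against the DTD.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}


def check_xhtml_requirements(markup):
    """Return a list of problems; an empty list means all three checks passed."""
    doc = minidom.parseString(markup)
    root = doc.documentElement
    problems = []
    if root.tagName != "html":
        problems.append("root element is not html")
    if doc.doctype is None or doc.doctype.publicId not in XHTML_PUBLIC_IDS:
        problems.append("no DOCTYPE with an XHTML PUBLIC ID")
    if root.namespaceURI != XHTML_NS:
        problems.append("default namespace is not " + XHTML_NS)
    return problems


# Hypothetical sample document used only to exercise the checks.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body><p>Hello</p></body>
</html>"""

print(check_xhtml_requirements(SAMPLE))
```

For the sample document the function returns an empty list. A bare <html> element with no DOCTYPE and no xmlns attribute would be reported as failing the second and third checks.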
Finally, if you wish, you may--but do not have to--add an XML declaration or an xml-stylesheet processing instruction to the prolog of your document.
Example 7-1 shows an HTML document from the O'Reilly web site that exhibits many of the validity problems you'll find on the Web today. In fact, this is a much neater page than most. Nonetheless, not all attribute values are quoted. The noshade attribute of the HR element doesn't even have a value. There's no document type declaration. Tags are a mix of upper- and lowercase, mostly uppercase. The DD elements are missing end-tags, and there's some character data inside the second definition that's not part of a DT or a DD.
<HTML><HEAD>
<TITLE>O'Reilly Shipping Information</TITLE>
</HEAD>
<BODY BGCOLOR="#ffffff" VLINK="#0000CC" LINK="#990000" TEXT="#000000">
<table border=0 width=515>
<tr>
<td>
<IMG SRC="/www/graphics_new/generic_ora_header_wide.gif" BORDER=0>
<H2>U.S. Shipping Information </H2>
<HR size="1" align=left noshade>
<DL>
<DT> <B>UPS Ground Service (Continental US only -- 5-7 business days):</B></DT>
<DD>
<PRE>
$ 5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50
</PRE>
<DT> <B>Federal Express:</B></DT>
(Shipping within 24 hours of receipt of order by O'Reilly)
<DD>
<PRE>
<EM>1 or 2 books</EM>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50
</PRE>
</DL>
<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.
<P>
<A HREF="int-ship.html"><b>International Shipping Information</b></A>
<P>
<CENTER>
<HR SIZE="1" NOSHADE>
<FONT SIZE="1" FACE="Verdana, Arial, Helvetica">
<A HREF="http://www.oreilly.com/">
<B>O'Reilly Home</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/sales/bookstores">
<B>O'Reilly Bookstores</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/order_new/">
<B>How to Order</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/oreilly/contact.html">
<B>O'Reilly Contacts<BR></B></A>
<A HREF="http://www.oreilly.com/international/">
<B>International</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/oreilly/about.html">
<B>About O'Reilly</B></A> <B> | </B>
<A HREF="http://www.oreilly.com/affiliates.html">
<B>Affiliated Companies</B></A><p>
<EM>© 2000, O'Reilly & Associates, Inc.</EM>
</FONT>
</CENTER>
</td>
</tr>
</table>
</BODY>
</HTML>
Example 7-2 shows this document after it's been converted to XHTML. All the previously noted problems and a few more besides have been fixed. A number of deprecated presentational attributes, such as the size and noshade attributes of hr, had to be replaced with CSS styles. We've also added the necessary document type and namespace declarations. This document can now be read by both HTML and XML browsers and parsers.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<style type="text/css">
body {background-color: #FFFFFF; color: #000000}
a:visited {color: #0000CC}
a:link {color: #990000}
</style>
<title>O'Reilly Shipping Information</title>
</head>
<body>
<table border="0" width="515">
<tr>
<td><img src="/www/graphics_new/generic_ora_header_wide.gif"
    style="border-width: 0" alt="O'Reilly"/>
<h2>U.S. Shipping Information</h2>
<hr style="height: 1px; text-align: left"/>
<dl>
<dt><b>UPS Ground Service (Continental US only -- 5-7 business days):</b></dt>
<dd>
<pre>
$ 5.95 - $ 49.99 ......................... $ 4.50
$ 50.00 - $ 99.99 ......................... $ 6.50
$100.00 - $149.99 ......................... $ 8.50
$150.00 - $199.99 ......................... $10.50
$200.00 - $249.99 ......................... $12.50
$250.00 - $299.99 ......................... $14.50
</pre>
</dd>
<dt><b>Federal Express:</b></dt>
<dd>(Shipping within 24 hours of receipt of order by O'Reilly)</dd>
<dd>
<pre>
<em>1 or 2 books</em>:
Economy 2-day ............................. $ 8.75
Overnight Standard (Afternoon Delivery) ... $12.75
Overnight Priority (Morning Delivery) ..... $16.50
</pre>
</dd>
</dl>
<b>Alaska and Hawaii:</b> add $10 to Federal Express rates.
<p><a href="int-ship.html"><b>International Shipping Information</b></a></p>
<div style="font-size: xx-small; font-family: Verdana, Arial, Helvetica; text-align: center">
<hr style="height: 1px"/>
<a href="http://www.oreilly.com/"><b>O'Reilly Home</b></a> <b>|</b>
<a href="http://www.oreilly.com/sales/bookstores"><b>O'Reilly Bookstores</b></a> <b>|</b>
<a href="http://www.oreilly.com/order_new/"><b>How to Order</b></a> <b>|</b>
<a href="http://www.oreilly.com/oreilly/contact.html"><b> O'Reilly Contacts<br /> </b></a>
<a href="http://www.oreilly.com/international/"><b> International</b></a> <b>|</b>
<a href="http://www.oreilly.com/oreilly/about.html"><b>About O'Reilly</b></a> <b>|</b>
<a href="http://www.oreilly.com/affiliates.html"><b>Affiliated Companies</b></a></div>
<p style="font-size: xx-small; font-family: Verdana, Arial, Helvetica"><em>© 2000, O'Reilly &amp; Associates, Inc.</em></p>
</td>
</tr>
</table>
</body>
</html>
TIP: Making all these changes can be quite tedious for large documents or collections of many documents. Fortunately, there's an open source tool that can do most of the work for you. Dave Raggett's Tidy, http://tidy.sourceforge.net, is a C program that has been ported to most major operating systems and can convert some pretty nasty HTML into valid XHTML. For example, to convert the file bad.html to good.xml, you would type:
% tidy --output-xhtml yes bad.html > good.xml
Tidy fixes as much as it can and warns you about what it can't fix so you can fix it manually; for instance, it tells you when a required alt attribute is missing from an img element.
XHTML comes in three flavors, depending on which DTD you choose:
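Strict:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd" >
This DTD leaves out the deprecated presentational elements and attributes. Example 7-2 used this DTD.
Transitional:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd" >
This DTD also allows the deprecated presentational elements and attributes, such as font and center.
Frameset:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "DTD/xhtml1-frameset.dtd" >
This DTD is for documents that use frames.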
All three DTDs use the same http://www.w3.org/1999/xhtml namespace. You should choose the strict DTD unless you've got a specific reason to use another one.
Many current web browsers, especially Internet Explorer 5.0 and earlier and Netscape 4.79 and earlier, deal inconsistently with XHTML. Certainly they don't require it, accepting as they do such a wide variety of malformed, invalid, and out-and-out mistaken HTML. However, beyond that they do have some problems when they encounter certain common XHTML constructs.
Some browsers display processing instructions and the XML declaration inline. These should be omitted if possible.
Few, if any, browsers recognize or respect the encoding declaration in the XML declaration. Furthermore, many browsers won't automatically recognize UTF-8 or UCS-2 Unicode text. If you use a non-ASCII character set, you should also include a meta element in the header identifying the character set. For example:
<meta http-equiv="Content-type" content='text/html; charset=UTF-8'></meta>
Browsers deal inconsistently with both forms of empty element syntax. That is, some browsers understand <hr/> but not <hr></hr> (typically rendering it as two horizontal lines rather than one), while others recognize <hr></hr> but not <hr/> (typically omitting the horizontal line completely). The most consistent rendering seems to be achieved by using an empty-element tag with an optional attribute such as class or id, for example, <hr class="empty" />. There's no real reason for the class attribute here, except that its presence keeps browsers from choking on the />. Any other attribute the DTD allows would serve equally well.
On the other hand, if a particular instance of an element happens to be empty, but not all instances of the element have to be empty--for instance, a p that doesn't contain any text--you should use two tags like <p></p> rather than one empty-element tag <p/>.
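When pages are generated by a program, the distinction between elements that must always be empty and elements that merely happen to be empty can be applied at serialization time. The following is a minimal sketch, assuming Python's standard library. The function name serialize and the set ALWAYS_EMPTY (the elements the XHTML 1.0 Strict DTD declares EMPTY) are ours. It puts a space before the closing /> and does not add the placeholder class attribute described above. It also ignores namespaces, comments, and processing instructions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Elements declared EMPTY in the XHTML 1.0 Strict DTD.
ALWAYS_EMPTY = {"area", "base", "br", "col", "hr", "img", "input",
                "link", "meta", "param"}


def serialize(elem):
    """Serialize an element tree, minimizing only always-empty elements."""
    attrs = "".join(f" {name}={quoteattr(value)}"
                    for name, value in elem.attrib.items())
    if elem.tag in ALWAYS_EMPTY:
        out = f"<{elem.tag}{attrs} />"
    else:
        inner = escape(elem.text or "")
        inner += "".join(serialize(child) for child in elem)
        out = f"<{elem.tag}{attrs}>{inner}</{elem.tag}>"
    # Text that follows the element belongs to its parent's content.
    return out + escape(elem.tail or "")


tree = ET.fromstring('<div><hr class="empty"/><p/></div>')
print(serialize(tree))
```

Here the hr is written as a single empty-element tag, while the empty p is written as a start-tag and end-tag pair.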
Embedded scripts often contain reserved characters like & or <, so the document that contains them is not well-formed. However, most JavaScript and VBScript interpreters won't recognize &amp; or &lt; in place of the operators they represent. If the script can't be rewritten without these operators (for instance, by changing a less-than comparison to a greater-than comparison with the arguments flipped), then you should move to external scripts instead of embedded ones.
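The effect on well-formedness can be seen by handing script elements to an XML parser. This is a small sketch, assuming Python's standard xml.etree.ElementTree module; the helper name parses, the script bodies, and the filename compare.js are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(markup):
    """Return True if the markup is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A raw < or & inside an embedded script breaks well-formedness.
less_than = '<script type="text/javascript">if (a < b) swap();</script>'
logical_and = '<script type="text/javascript">if (a && b) swap();</script>'

# Flipping the comparison, or moving the code to an external file, avoids it.
flipped = '<script type="text/javascript">if (b > a) swap();</script>'
external = '<script type="text/javascript" src="compare.js"></script>'

for markup in (less_than, logical_and, flipped, external):
    print(parses(markup))
```

The first two script elements are rejected and the last two are accepted, since a > character is legal in character data and an external script contributes no character data at all.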
Furthermore, most non-XML-aware browsers don't recognize the &apos; predefined entity reference. You should avoid this if possible and just use the literal ' character instead. The only place this might be a problem is inside attribute values that are enclosed in single quotes because they contain double quotes. However, most browsers do recognize the &quot; entity reference for the " character, so you can enclose the attribute value in double quotes and escape the double quotes that are part of the attribute value as &quot;.
There are a few other subtle differences between how browsers handle XHTML and how XHTML expects to be handled. For instance, XHTML allows character references and CDATA sections although almost no current browsers understand these constructs. However, you're unlikely to encounter these when converting from HTML to XHTML, and you can generally do without them if you're writing XHTML from scratch.
Mozilla, Opera 5.0 and later, Internet Explorer 5.5 and later, and Netscape 6.0 and later can parse and display valid XHTML without any difficulties and without requiring page authors to jump through these hoops. However, since many users have not upgraded their browsers to the level XHTML requires, user-friendly web designers will be jumping through these hoops for some years to come.
Copyright © 2002 O'Reilly & Associates. All rights reserved.