Whitespace Processing (XML Schema)

4.2. Whitespace Processing

The handling of whitespace characters (tabs, linefeeds, carriage returns, and spaces, which are often used only to "pretty print" XML documents) has always been controversial. W3C XML Schema imposes a generic two-step algorithm that is applied to most of the predefined datatypes. There are two exceptions: xs:string, which skips both steps, and xs:normalizedString, which skips the second.

Whitespace replacement: This is the first step of whitespace processing applied to the parsed value. Every occurrence of #x9 (tab), #xA (linefeed), or #xD (carriage return) is replaced with a space (#x20). The number of characters is not changed by this step, which is applied to all the predefined datatypes except xs:string, whose parsed value is left untouched.
Whitespace collapse: The second step removes leading and trailing spaces and replaces each run of contiguous spaces with a single space character. It is applied to all the predefined datatypes except xs:string, on which no whitespace processing is performed at all, and xs:normalizedString, on which whitespace is only replaced (the sense in which the string is "normalized") and never collapsed.
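
The two steps can be sketched in a few lines of Python. This is only an illustration of the algorithm described above, not the code of any schema processor: the function names and the WHITESPACE_MODE table are hypothetical, and the table lists only the two exceptional datatypes, treating every other predefined datatype as collapsed.

```python
import re

# Hypothetical lookup table: the two datatypes that escape full processing.
# Every other predefined datatype is assumed to be collapsed.
WHITESPACE_MODE = {
    "xs:string": "preserve",           # neither step is applied
    "xs:normalizedString": "replace",  # step 1 only
}


def replace_whitespace(value: str) -> str:
    """Step 1: turn each tab, linefeed, and carriage return into one space."""
    return value.translate({0x9: " ", 0xA: " ", 0xD: " "})


def collapse_whitespace(value: str) -> str:
    """Step 2 (after step 1): trim the ends and squeeze runs of spaces."""
    replaced = replace_whitespace(value)
    return re.sub(" +", " ", replaced).strip(" ")


def process(value: str, datatype: str) -> str:
    """Apply the whitespace processing that the given datatype calls for."""
    mode = WHITESPACE_MODE.get(datatype, "collapse")
    if mode == "preserve":
        return value
    if mode == "replace":
        return replace_whitespace(value)
    return collapse_whitespace(value)


raw = "\t XML\r\n  Schema \n"
print(repr(process(raw, "xs:string")))            # unchanged
print(repr(process(raw, "xs:normalizedString")))  # same length, spaces only
print(repr(process(raw, "xs:token")))             # 'XML Schema'
```

Note that the xs:normalizedString result has exactly as many characters as the input, whereas the collapsed result is shorter.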

TIP: This notion of "normalized string" does not match the XPath function normalize-space(), which corresponds to what W3C XML Schema calls whitespace collapsing. It is also different from the DOM normalize() method, which merges adjacent text nodes.
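
To make the last distinction concrete, here is a small sketch using Python's standard xml.dom.minidom module. The helper name merge_adjacent_text and the sample strings are assumptions chosen for illustration. It shows that DOM normalize() merges neighboring text nodes but leaves their whitespace alone.

```python
from xml.dom.minidom import Document


def merge_adjacent_text():
    """Build an element with two adjacent text nodes, then normalize it.

    Returns the node counts before and after, plus the merged text.
    """
    doc = Document()
    title = doc.createElement("title")
    doc.appendChild(title)
    title.appendChild(doc.createTextNode("  XML  "))
    title.appendChild(doc.createTextNode("Schema\t"))

    before = len(title.childNodes)
    title.normalize()  # DOM normalize(): merge adjacent text nodes
    after = len(title.childNodes)
    return before, after, title.firstChild.data


before, after, text = merge_adjacent_text()
print(before, after, repr(text))
```

The two text nodes become one, but the tab and the repeated spaces are still present in the merged text: no replacement or collapsing has taken place.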

