XML is not just text!

To many intelligent and articulate XML programmers, "XML is just Unicode with pointy brackets" is almost a statement of faith. Predictably, these people also complain that it is difficult to process XML without a parser. For instance, Joe Gregorio notes that in XML this document:

must be treated as identical to this document:

To many of us, this is merely an indication that XML must first be parsed and converted to an appropriate data model before it can easily be processed. In fairness to Joe, he initially assumed this as well, but then changed his mind:

More XML experience is gained by yours truly and on many occasions I have found myself pining for the ability to do regular expression processing of XML. If only the pathologies of the above examples didn't exist then I could use a combination of XPath and regular expressions to perform XML manipulations that would be easier for me to implement, understand and maintain.

Today I reached the breaking point. The problem isn't with regular expressions, the problem is with XML. The pathologies in XML that preclude the use of regular expressions are just that, pathologies, and ones that need to be excised.

As a result, he suggests that XML be subsetted as follows:

  • All namespace declarations must be done in the root element.
  • Never a declaration for the "" namespace. I.e. if an element sits the "" namespace then the element name will never have a namespace qualifier.
  • No CDATA sections.
  • No DTDs

The above restrictions would make it easier for a programmer to work with XML without using an XML parser, but it is unlikely that the XML community will replace XML with something along these lines - especially since there are important usage scenarios for features like DTDs, schemas, and the ability to build compound documents without knowing, at the root level, all of the namespaces that may be used in a document. More to the point, Joe's original reason for trying to solve these problems with XPath and regular expressions was that the standard APIs do not make it easy to solve many simple problems. Looking at his article as a whole, and other articles he has written, we believe that many of these difficulties are caused by the same kind of semantic mismatches that a Native XML Programming Language is designed to solve.

In this paper, we assume that XML will remain as is, and that for general processing, the best approach is to use an XML parser to build a data model instance from the XML documents, and query the data model instance. Not everybody believes this is the best approach. Tim Bray, one of the editors of the original XML specification, objects to the Native XML Programming solution because he objects to the notion of an XML data model:

The notion that there is an "XML data model" is silly and unsupported by real-world evidence. The definition of XML is syntactic: the "Infoset" is an afterthought and in any case is far indeed from being a data model specification that a programmer could work with. Empirical evidence: I can point to a handful of different popular XML-in-Java APIs each of which has its own data model and each of which works. So why would you think that there's a data model there to build a language around?

Tim first says that there is no data model for XML, then argues that there are several. The differences among these data models, while annoying, are not great, and could have been avoided if XML had had a full-fledged data model. The differences between the DOM data model and the XPath data model are well known in the XML world. XQuery, XPath, and XSLT now use one common data model, which can represent both XML and the XML Schema datatypes. Although it would have been convenient if XML had defined a data model, there is no requirement that the data model used by a Native XML Programming Language be the same as any particular data model used in a Java API. As long as the data model supports the structure of XML directly, without losing or adding information in violation of the XML spec, it can be used as the basis for processing.

Tim also suggests that XML is "syntactic", as though this implies that there is no data model. This implies that syntax and structure are opposites, which is rather surprising, since the purpose of a syntax is to describe the structure of a language. In the XML Recommendation, the structure that corresponds to a data model is called the logica structure:

Each XML document has both a logical and a physical structure. [. . .] Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.

Like most modern computer languages, XML uses a BNF to describe the syntactic representation of these structures. For instance, here is a production from the XML Recommendation:

The XML Recommendation is largely a description of these logical structures and the relationships among them. For instance, consider the following text:

Example: The element structure of an XML document may, for validation purposes, be constrained using element type and attribute-list declarations. An element type declaration constrains the element's content.

Element, XML document, and content all refer to logical structures that are represented in the BNF. These logical structures, taken together with the relationships among them as described in the XML Recommendation, come very close to being a data model, but the data model was not fully described.

The whole point of parsing is to create structures from a sequence of characters, using a grammar to determine which structures to create. When a parser is used to interpret the characters of a program in Java, it creates an Abstract Syntax Tree. When it is used to interpret the characters of XML, it creates a data model instance. We use parsers because (1) the parsed structure is more convenient for further processing, (2) the parsed structure distinguishes information from noise, eliminating differences in the character representation that are not significant in the relevant model, and (3) the parsed structure can fill in information not explicitly represented in the serialized form.

However, an XML parser is not enough. A parser creates a convenient representation of XML. We need a Native XML Programming Language to provide convenient processing of this XML.

Try DataDirect XQuery® Free!

Put the power, scalability, and performance of DataDirect XQuery® to work for you today! Our free trial lets you see for yourself how easy it is to build data integration applications that access relational, EDI, and other file formats as XML!

Download DataDirect XQuery® today!

Back to School with DataDirect XQuery®

Getting your mind around the possibilities of a data integration technology as promising as XQuery can be difficult, but our XML developers Webinars will help you understand the power and versatility of XQuery, and our favorite XQuery processor, DataDirect XQuery®.

Stay Informed!

XQuery is one of the hottest XML technologies being developed today. Stay informed with vital news about standards, tools, and trends by signing up for the DataDirect XQuery® newsletter.

New Case Study

Gevity produces sales proposals in real time using DataDirect XQuery®. See how Gevity uses DataDirect XQuery® to combine Web service data from SalesForce.com with relational data in Oracle in a pricing engine for HR management.

New Features in DataDirect XQuery®

DataDirect XQuery® is now released!

DataDirect XQuery® provides full update support for relational data, easy integration for Web Services, additional enhancements for performance and scalability and more!

Download now!

Where is XQuery Heading

We've collected numerous reports and surveys to help you understand the lay of the XQuery land and how DataDirect XQuery® can help you launch your data integration applications to new heights.

From XML industry observers to surveys of your peers, see why DataDirect XQuery® is right for you. Or better yet, try DataDirect XQuery® today for free!