XQJ Tutorial Part XI: Processing Large Inputs

This chapter explains how to handle and query large XML documents through the XQJ API.

The Roots of DOM

Since XML became a standard in the late 1990s, we have been taught that XML is a tree, and the most intuitive and popular representation of such a tree has been, and still is, the Document Object Model (DOM). When you think about querying XML documents using XQuery, XSLT or XPath, you usually think about a processor that navigates the DOM tree, extracts values, compares the values it needs, and creates another DOM as a result of those operations.

This is indeed what happens in typical XML processing implementations. Although today's processors use a more efficient representation than DOM, one problem remains the same: scalability. What happens if the XML you need to process cannot be represented within the memory available to your application? That is usually the limit that typical "in-memory" XQuery, XSLT and XPath implementations hit when processing DOM, and short of writing your own SAX- or StAX-based processing, you really have no alternatives.

But what if you are able to forget about DOMs, forget about materializing the whole XML tree in memory, and do XML processing in a purely streaming fashion?

The Benefits of XML Streaming

Using an XQuery streaming processor, like DataDirect XQuery, is a good start. But a chain is only as strong as its weakest link. Besides the streaming capabilities of your XQuery implementation, the API must be able to handle those large XML fragments. From an XQuery API perspective, it is crucial that the input to your query can be handled in a streaming fashion.

In XQJ Tutorial Part VIII: Binding External Variables, we learned how to bind values to external variables declared in an XQuery.
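For a sense of what the hand-written streaming alternative mentioned above looks like, here is a minimal StAX sketch that uses only the JDK's javax.xml.stream package. It counts elements by local name while reading the input one event at a time, so memory use does not grow with document size. The class name StaxElementCounter, the count method and the sample orders document are made up for illustration; they are not part of XQJ or DataDirect XQuery.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: counts elements by local name without building a tree. */
final class StaxElementCounter {

    static long count(Reader xml, String localName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        long matches = 0;
        try {
            // Pull one parse event at a time; only the current event is held in memory.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    matches++;
                }
            }
        } finally {
            reader.close();
        }
        return matches;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Illustrative sample input, not taken from the tutorial series.
        String xml = "<orders><order id='1'/><order id='2'/><note/></orders>";
        System.out.println(count(new StringReader(xml), "order")); // prints 2
    }
}
```

Even this trivial task needs an explicit event loop. Anything closer to a real query, with predicates, joins or result construction, has to be coded by hand in the same style.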
By default, when a value is bound to an XQExpression or XQPreparedExpression using bindXXX(), it is consumed during the binding process, and it stays active and valid for all subsequent execution cycles. We say that XQJ operates in 'immediate binding mode.' Let's look closely at one of the pipeline examples from XQJ Tutorial Part X: XML Pipelines in this series.

...

Binding and Buffering

During the bindSequence() call, the complete xqs1 sequence is consumed. Subsequently, we can safely close the xqe1 expression, freeing up any runtime resources it holds. On the other hand, consuming the complete sequence during bindSequence() implies that the XQJ implementation has to buffer the data one way or another for subsequent query evaluations.

All this works perfectly fine as long as we're handling relatively small XML instances. But because the data is buffered, the underlying XQuery processor loses the opportunity to take advantage of its streaming capabilities.

But what if you know that the data bound to the external variable will be used for only a single XQuery execution? Is there then a way to inform the XQJ/XQuery implementation of possible optimization opportunities, and use its streaming capabilities?

The default binding mode in XQJ is 'immediate,' which means the value bound to an external variable is consumed during the bindXXX() method. An application can also set the binding mode to 'deferred.' With deferred binding mode, the application gives a hint to the XQJ implementation and the underlying XQuery processor that they may take advantage of streaming. In deferred binding mode, bindings are only active for a single execution cycle. The application is required to explicitly re-bind values to every external variable before each execution.
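The trade-off between the two binding modes can be pictured with a plain-Java analogy. The sketch below does not use XQJ at all: ImmediateBinding, DeferredBinding and their execute() methods are hypothetical stand-ins, and "executing a query" is reduced to counting bytes. It only illustrates the behaviour described above. Eager consumption buffers the whole input but allows repeated executions, while lazy consumption reads in small chunks but is good for a single execution.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Plain-Java analogy of the two binding modes; none of this is XQJ API. */
final class BindingModeAnalogy {

    /** "Immediate": the input is drained and buffered at bind time. */
    static final class ImmediateBinding {
        private final byte[] buffered;

        ImmediateBinding(InputStream in) throws IOException {
            buffered = in.readAllBytes(); // whole input held in memory (Java 9+)
        }

        /** May be called repeatedly; the original stream is no longer needed. */
        long execute() {
            return buffered.length;
        }
    }

    /** "Deferred": the input is only remembered and is read during execution. */
    static final class DeferredBinding {
        private InputStream pending;

        DeferredBinding(InputStream in) {
            pending = in; // nothing is read yet
        }

        /** Streams the input in fixed-size chunks; valid for one execution only. */
        long execute() throws IOException {
            if (pending == null) {
                throw new IllegalStateException("value must be re-bound before each execution");
            }
            InputStream in = pending;
            pending = null;
            long total = 0;
            byte[] chunk = new byte[8192];
            for (int n = in.read(chunk); n != -1; n = in.read(chunk)) {
                total += n;
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<orders><order/></orders>".getBytes(StandardCharsets.UTF_8);

        ImmediateBinding immediate = new ImmediateBinding(new ByteArrayInputStream(doc));
        System.out.println(immediate.execute() == immediate.execute()); // prints true

        DeferredBinding deferred = new DeferredBinding(new ByteArrayInputStream(doc));
        System.out.println(deferred.execute() == doc.length); // prints true
        // A second deferred.execute() would throw IllegalStateException.
    }
}
```

In XQJ terms, the buffered copy corresponds to what an implementation has to keep around in immediate mode, and the one-shot reference corresponds to a deferred binding that must be re-bound before each execution.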
You can change the binding mode through the XQStaticContext interface, as shown in the following example (refer to XQJ Tutorial Part VI: Manipulating the Static Context for more information on how to manipulate the static context):

...

In deferred mode, the application cannot assume that the bound value will be consumed during the invocation of the bindXXX() method. The XQJ implementation is free to read the bound value either at bind time or during the subsequent evaluation and processing of the query results. This has consequences for resource clean-up: the first example will not work properly in deferred binding mode, because xqe1 was closed right after calling bindSequence(). We can address this by modifying the example as follows:

...

Binding Streams for Efficient Processing

The previous example shows how to build a pipeline of XQueries. But the deferred binding mode also applies to the other bindXXX() methods. In the following example, we show how to bind a StreamSource to the context item. Because the binding mode is deferred, the implementation can handle the query in streaming mode, and can therefore process huge XML documents that would not otherwise fit in available memory.

...

To conclude, using deferred binding mode requires a little more care than is immediately apparent. But the potential improvements when querying large XML documents are enormous. Of course, the API needs to provide the necessary functionality, but the heavy lifting is performed in the underlying XQuery processor. This is especially true with DataDirect XQuery, whose deferred binding mode allows you to take advantage of both XML document projection and its XML streaming capabilities. This allows efficient querying of XML documents in the hundreds of megabytes, even in the gigabytes!





