Querying Large XML Files

DataDirect XQuery® provides several ways to query an XML file. For the best performance, especially with large XML files, use fn:doc() to access the XML in a query. Performance is much worse if you parse the XML to create a DOM tree, bind the DOM tree to an external variable, and query the external variable.

In most XQuery processors, a large percentage of the time needed to query a large XML file is spent parsing the file and creating an in-memory representation that can be queried. The in-memory representation may be many times the size of the original XML file, and Java VM out-of-memory errors may occur at run time. When your query addresses an XML file using fn:doc(), DataDirect XQuery® uses a technique known as Document Projection, which allows it to build only the part of a document needed by the query. This results in dramatic improvements in memory usage and scalability, and significant improvements in performance. Using Document Projection, you can generally query documents many times the size of available memory. If your application uses XML Deployment Adapters, it also addresses these adapters using fn:doc(), and DataDirect XQuery® applies Document Projection to them just as it does for XML documents.

Document projection uses the path expressions in a query to determine what parts of a document need to be built. Path expressions that use wildcards ('*') provide less information for document projection, so they should be avoided when querying large documents. For instance, in the examples below, the second path expression will perform much better for large documents.

Example 6. Avoid path expressions with wildcards

doc("scenario.xml")//*[role='teacher']

Example 7. Specific path expressions aid document projection

doc("scenario.xml")/scenario/people/person[role='teacher']
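To make the difference concrete, the following Java sketch imitates the idea behind document projection using the standard StAX API. It is not DataDirect XQuery®'s implementation. The class name, the helper methods, and the props, prop, and name elements in the sample document are assumptions made for illustration. The sketch counts how many elements a projector guided by a specific path could keep, compared with a wildcard query that gives it no path information to prune with.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: NOT DataDirect's implementation of document projection.
class ProjectionSketch {

    // Hypothetical sample document; only scenario/people/person/role come from the text.
    static final String SAMPLE =
            "<scenario>"
          + "<props><prop>a</prop><prop>b</prop></props>"
          + "<people>"
          + "<person><name>Ann</name><role>teacher</role></person>"
          + "<person><name>Bob</name><role>student</role></person>"
          + "</people>"
          + "</scenario>";

    /**
     * Returns {kept, total}: how many elements a path-driven projection would
     * keep versus how many elements the document contains. An empty path
     * models a wildcard query that offers nothing to prune with.
     */
    static int[] keptVersusTotal(String xml, String[] path) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        Deque<String> ancestors = new ArrayDeque<>(); // root first, current element last
        int kept = 0;
        int total = 0;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                ancestors.addLast(reader.getLocalName());
                total++;
                if (isNeeded(ancestors, path)) {
                    kept++;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                ancestors.removeLast();
            }
        }
        reader.close();
        return new int[] {kept, total};
    }

    /** An element is needed if it lies on the path or below the path's last step. */
    static boolean isNeeded(Deque<String> ancestors, String[] path) {
        int depth = 0;
        for (String name : ancestors) {
            if (depth < path.length && !name.equals(path[depth])) {
                return false; // this subtree leaves the path, so it can be skipped
            }
            depth++;
        }
        return true;
    }

    public static void main(String[] args) throws XMLStreamException {
        int[] specific = keptVersusTotal(SAMPLE, new String[] {"scenario", "people", "person"});
        int[] wildcard = keptVersusTotal(SAMPLE, new String[0]);
        System.out.println("specific path keeps " + specific[0] + " of " + specific[1]);
        System.out.println("wildcard keeps " + wildcard[0] + " of " + wildcard[1]);
    }
}
```

With the specific path, the props subtree is never kept. With no path information, every element has to be retained, which is why wildcard expressions give a projector less to work with.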

If you parse a document to create a DOM tree, then bind the DOM tree to an external variable, you bypass DataDirect XQuery®'s Document Projection. This forces your program to create the entire in-memory representation of your XML document, not just the part that DataDirect XQuery® needs to process your queries. In addition to the savings due to Document Projection, DataDirect XQuery®'s internal representation of an XML document is much more efficient than most DOM implementations, so using fn:doc() gives you two substantial optimizations that you lose if you parse the document yourself.
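The cost of the DOM route can be seen with nothing more than the standard JAXP API. The sketch below shows only the first step of the discouraged approach, parsing into a DOM tree, and counts the nodes that the parse materializes. The step that binds the tree to an external variable is omitted because it requires the XQJ library. The class name, method names, and sample document are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Illustrative only: shows that a DOM parse builds every node up front.
class DomCostSketch {

    // Hypothetical sample document; only scenario/people/person/role come from the text.
    static final String SAMPLE =
            "<scenario>"
          + "<props><prop>a</prop><prop>b</prop></props>"
          + "<people>"
          + "<person><name>Ann</name><role>teacher</role></person>"
          + "<person><name>Bob</name><role>student</role></person>"
          + "</people>"
          + "</scenario>";

    /** Parses the whole document into a DOM tree, as the discouraged approach does. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Counts the given node and every node below it (elements and text nodes). */
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Every element and text node exists in memory before any query runs.
        System.out.println("nodes materialized: " + countNodes(doc.getDocumentElement()));
    }
}
```

All of these nodes are allocated regardless of which ones a later query touches, which is the overhead that fn:doc() with Document Projection avoids.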

DataDirect XQuery® also supports a second technique known as Streaming. Streaming processes a document sequentially, discarding portions of the document that are no longer needed to produce further query results. This reduces memory usage because only the portion of a document needed at a given stage of query processing is instantiated in memory. Unlike Document Projection, however, streaming is not generally a performance win: it usually involves a relatively minor performance penalty. In addition, some queries require much of the document to be instantiated at a given time, which diminishes the benefits of streaming. For other queries, streaming can be a substantial win, especially if the query results exceed available memory. For these reasons, streaming can be enabled, but it is not enabled by default.
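The general idea of sequential processing can be sketched with the standard StAX API. This is a hand-written analogy, not how DataDirect XQuery® implements Streaming. The class name, the method name, and the name element in the sample data are assumptions made for illustration. The sketch reads one person at a time, emits a result if the role matches, and then drops the per-person state so that memory use does not grow with the size of the input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative only: a hand-written analogy for sequential (streaming) processing.
class StreamingSketch {

    /** Returns the names of all persons whose role is 'teacher', reading sequentially. */
    static List<String> teacherNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        String name = null; // state for the person currently being read
        String role = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = reader.getLocalName();
                if (tag.equals("name")) {
                    name = reader.getElementText();
                } else if (tag.equals("role")) {
                    role = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("person")) {
                if ("teacher".equals(role)) {
                    result.add(name);
                }
                // This person is no longer needed for further results: discard it.
                name = null;
                role = null;
            }
        }
        reader.close();
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<scenario><people>"
                + "<person><name>Ann</name><role>teacher</role></person>"
                + "<person><name>Bob</name><role>student</role></person>"
                + "</people></scenario>";
        System.out.println(teacherNames(xml));
    }
}
```

Only one person's fields are held at any moment. A query that had to compare every person with every other person could not discard state this way, which mirrors the caveat above about queries that need much of the document at once.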

To enable streaming, set the ddtek:xml-streaming option to 'yes' in the XQuery Prolog:

Example 8. Enabling Streaming in the Query Prolog

declare option ddtek:xml-streaming 'yes';

You can also enable streaming at the DataSource level, which means streaming will be used for all queries that use the DataSource. One way to do this is to set the Pragmas property of the DDXQDataSource class.

Example 9. Enabling Streaming for a DataSource in the Java API


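    // Declare the variable as DDXQDataSource: the setters below are
    // DataDirect-specific and are not part of the XQDataSource interface.
    DDXQDataSource dataSource = new DDXQDataSource();
    dataSource.setJdbcUrl("jdbc:xquery:sqlserver://server1:1433;databaseName=stocks");
    dataSource.setPragmas("xml-streaming=yes");  // enable streaming for all queries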
            

Another way to enable streaming at the DataSource level is to specify the option using a pragma element in a Source Configuration File:

Example 10. Enabling Streaming for a DataSource in a Source Configuration File

<?xml version="1.0" encoding="UTF-8"?>
<XQJConnection xmlns="http://www.datadirect.com/xquery">
        
    <maxPooledQueries>20</maxPooledQueries>  
    <pragma name="xml-streaming">yes</pragma>	  
    <JDBCConnection name="stocks">
        <url>jdbc:xquery:sqlserver://localhost:1433</url>
        <sqlxmlMapping>
             <forest>true</forest>
             <identifierEscaping>none</identifierEscaping>
        </sqlxmlMapping>
    </JDBCConnection>
</XQJConnection>    
            
