weily's blog: Schemaless Java-XML Data Binding with VTD-XML

Limitations of Schema-based XML Data Binding
XML data binding APIs are a class of XML processing tools that automatically map XML data into custom, strongly typed objects or data structures, relieving XML developers of the drudgery of DOM or SAX parsing. For traditional, static XML data binding tools (e.g., JAXB, Castor, and XMLBeans) to work, developers must have the XML schema (or its equivalent) of the document available. As a first step, most XML data binders compile XML schemas into a set of class files, which the calling applications then include to perform the corresponding "unmarshalling."
However, developers dealing with XML documents don't always have their schemas on hand. And even when the XML schemas are available, slight changes to them (often due to evolving business requirements) require class files to be generated anew. Also, XML data binding is most effective when processing shallow, regular-shaped XML data. When the underlying structure of XML documents is complex, users still need to manually navigate the typed hierarchical trees, a task which can require significant coding.
Most limitations of XML data binding come from its rigid dependency on XML schema. Unlike many binary data formats, XML is intended primarily as a schemaless data format flexible enough to represent virtually any kind of information. For advanced uses, XML is also extensible: applications may use only the portion of the XML document that they need. Because of this extensibility, Web services and SOA applications are far less likely to break in the face of changes.
The schemaless nature of XML has subtle performance implications for XML data binding. In many cases, only a small subset of an XML document (as opposed to the whole data set) is necessary to drive the application logic. Yet the traditional approach indiscriminately converts entire data sets into objects, producing unnecessary memory and processing overhead.
Binding XML with VTD-XML and XPath
Motivation
While the concept of XML data binding has remained essentially unchanged since the early days of XML, the landscape of XML processing has evolved considerably. The primary purpose of XML data binding APIs is to map XML to objects; the presence of XML schemas merely helps lighten the coding effort of XML processing. In other words, if mapping XML to objects is sufficiently simple, you not only don't need schemas, but have a strong incentive to avoid them because of all the issues they introduce.
As you probably have guessed by looking at the title of this section, the combination of VTD-XML and XPath is ideally suited to schemaless data binding.
Why XPath and VTD-XML?
There are three main reasons why XPath lends itself to our new approach. First, when properly written, your data binding code needs only approximate knowledge (e.g., topology, tag names, etc.) of the XML tree structure, which you can determine by looking at the XML data. XML schemas are no longer mandatory. Second, XPath allows your application to bind the relevant data items and filter out everything else, avoiding wasteful object creation. Finally, XPath-based code is easy to understand, simple to write and debug, and generally quite maintainable.
But XPath still needs the parsed tree of XML to work. Superior to both DOM and SAX, VTD-XML offers a long list of features and benefits relevant to data binding, some of which are highlighted in the following list.
High performance, low memory usage, and ease of use: The SAX parser uses a constant amount of memory regardless of document size, but doesn't expose the hierarchical structure of XML, which makes it difficult to use. It doesn't even support XPath. The DOM parser builds an in-memory tree, is easier to use, and supports XPath. But it is also very slow and incurs exorbitant memory usage. VTD-XML pushes the XML processing envelope to a whole new level. Like DOM, VTD-XML builds an in-memory tree and is capable of random access. But it consumes only 1/5 the memory of DOM. Performance-wise, VTD-XML not only outperforms DOM by 5x to 12x, but is also typically twice as fast as SAX with a null content handler (SAX's maximum performance). The benchmark comparison can be found here.
Non-blocking XPath implementation: VTD-XML also pioneers incremental, non-blocking XPath evaluation. Unlike traditional XPath engines that return the entire evaluated node set all at once, VTD-XML's AutoPilot-based evaluation returns a qualified node as soon as it is evaluated, resulting in unsurpassed performance and flexibility. For further reading, please visit http://www.devx.com/xml/Article/34045.
Native XML indexing: VTD-XML is a native XML indexer that allows your applications to run XPath queries without parsing.
Incremental update: VTD-XML is the only XML processing API that allows you to update XML content without touching irrelevant parts of the XML document (see this article on devx.com), improving performance and efficiency from a different angle.
Process Description
The process for our new schemaless XML data binding roughly consists of the following steps.
1. Observe the XML document and write down the XPath expressions corresponding to the data fields of interest.
2. Define the class file and member variables to which those data fields are mapped.
3. Refactor the XPath expressions in step 1 to reduce navigation cost.
4. Write the XPath-based data binding routine that does the object mapping. An XPath 1.0 expression evaluates to one of four data types: string, Boolean, number (a double), and node set. The string type can be further converted to additional data types.
5. If the XML processing requires the ability to both read and write, use VTD-XML's XMLModifier to update the XML content. You may need to record more information to take advantage of VTD-XML's incremental update capability.
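To make the four XPath 1.0 result types mentioned in the steps above concrete, here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath API rather than VTD-XML, purely so that it runs without extra dependencies; the class name XPathTypesDemo, its helper methods, and the abbreviated two-record XML string are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: JDK XPath stands in for VTD-XML here.
class XPathTypesDemo {
    // Abbreviated two-record stand-in for a CD catalog (an assumption).
    static final String XML =
        "<CATALOG>"
        + "<CD><TITLE>Empire Burlesque</TITLE><PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"
        + "<CD><TITLE>Greatest Hits</TITLE><PRICE>9.90</PRICE><YEAR>1982</YEAR></CD>"
        + "</CATALOG>";

    // Evaluates expr against XML and returns the result as the requested XPath type.
    static Object eval(String expr, QName type) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expr, new InputSource(new StringReader(XML)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    // String result: the text of the first CD's TITLE element.
    static String firstTitle() {
        return (String) eval("/CATALOG/CD[1]/TITLE", XPathConstants.STRING);
    }

    // Number result: XPath numbers are doubles.
    static double firstPrice() {
        return (Double) eval("/CATALOG/CD[1]/PRICE", XPathConstants.NUMBER);
    }

    // Boolean result: true if at least one CD has YEAR greater than 1982.
    static boolean hasCdAfter1982() {
        return (Boolean) eval("boolean(/CATALOG/CD[YEAR > 1982])", XPathConstants.BOOLEAN);
    }

    // Node-set result: count the matched CD elements.
    static int cdCount() {
        return ((NodeList) eval("/CATALOG/CD", XPathConstants.NODESET)).getLength();
    }

    // A string result converted further, here to an int.
    static int firstYear() {
        String year = (String) eval("/CATALOG/CD[1]/YEAR", XPathConstants.STRING);
        return Integer.parseInt(year.trim());
    }

    public static void main(String[] args) {
        System.out.println(firstTitle());
        System.out.println(firstPrice());
        System.out.println(hasCdAfter1982());
        System.out.println(cdCount());
        System.out.println(firstYear());
    }
}
```

Re-parsing the string on every call keeps the sketch short; a real binding routine would parse once and reuse the tree.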
A Sample Project
Let me show you how to put this new XML binding into action. This project, written in Java, follows the steps outlined above to create simple data binding routines. The first part of this project creates read-only objects that are not modified by application logic. The second part extracts more information that allows the XML document to be updated incrementally. The last part adds VTD+XML indexing to the mix. The XML document I use in this example looks like the following:

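<CATALOG>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
    </CD>
    <CD>
        <TITLE>Still Got the Blues</TITLE>
        <ARTIST>Gary Moore</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>Virgin Records</COMPANY>
        <PRICE>10.20</PRICE>
        <YEAR>1990</YEAR>
    </CD>
    <CD>
        <TITLE>Hide Your Heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
    </CD>
</CATALOG>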

Read Only
The application logic is driven by CD record objects between 1982 and 1990 (non-inclusive), corresponding to the XPath expression "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]". The class definition (shown below) contains four fields, corresponding to the title, artist, price, and year of a CD.
public class CDRecord {
    String title;   // bound from the TITLE element
    String artist;  // bound from the ARTIST element
    double price;   // bound from the PRICE element
    int year;       // bound from the YEAR element
}
The mapping between the object member and its corresponding XPath expression is as follows:
The TITLE field corresponds to "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]/TITLE".
The ARTIST field corresponds to "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]/ARTIST".
The PRICE field corresponds to "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]/PRICE".
The YEAR field corresponds to "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]/YEAR".
The XPath expressions can be further refactored (for efficiency reasons) as follows:
Use "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]" to navigate to the CD node.
Use "TITLE" to extract the TITLE field (a string).
Use "ARTIST" to extract the ARTIST field (a string).
Use "PRICE" to extract the PRICE field (a double).
Use "YEAR" to extract the YEAR field (an integer).
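The refactored expressions above can be sketched as a small binding routine: one absolute XPath selects the CD nodes of interest, and four short relative XPaths pull the fields out of each match. To stay runnable with only the JDK, this sketch substitutes the standard javax.xml.xpath API over a DOM tree for VTD-XML; with VTD-XML, the same two-level pattern would be driven by AutoPilot instead. The CDBinder class, its bind method, and the abbreviated SAMPLE string (which keeps only the four bound fields of each CD) are names and data I made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Target object: one instance per matching CD element.
class CDRecord {
    String title;
    String artist;
    double price;
    int year;
}

// Illustrative binder: JDK DOM + XPath stand in for VTD-XML here.
class CDBinder {
    // Abbreviated copy of the catalog: only the four bound fields are kept.
    static final String SAMPLE =
        "<CATALOG>"
        + "<CD><TITLE>Empire Burlesque</TITLE><ARTIST>Bob Dylan</ARTIST>"
        + "<PRICE>10.90</PRICE><YEAR>1985</YEAR></CD>"
        + "<CD><TITLE>Still Got the Blues</TITLE><ARTIST>Gary Moore</ARTIST>"
        + "<PRICE>10.20</PRICE><YEAR>1990</YEAR></CD>"
        + "<CD><TITLE>Hide Your Heart</TITLE><ARTIST>Bonnie Tyler</ARTIST>"
        + "<PRICE>9.90</PRICE><YEAR>1988</YEAR></CD>"
        + "<CD><TITLE>Greatest Hits</TITLE><ARTIST>Dolly Parton</ARTIST>"
        + "<PRICE>9.90</PRICE><YEAR>1982</YEAR></CD>"
        + "</CATALOG>";

    // Binds every CD with 1982 < YEAR < 1990 to a CDRecord, in document order.
    static List<CDRecord> bind(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Step 1: a single absolute expression navigates to the CD nodes.
            NodeList cds = (NodeList) xpath.evaluate(
                "/CATALOG/CD[YEAR < 1990 and YEAR > 1982]", doc, XPathConstants.NODESET);

            List<CDRecord> records = new ArrayList<>();
            for (int i = 0; i < cds.getLength(); i++) {
                Node cd = cds.item(i);
                CDRecord r = new CDRecord();
                // Step 2: relative expressions are evaluated against each CD node.
                r.title = xpath.evaluate("TITLE", cd);
                r.artist = xpath.evaluate("ARTIST", cd);
                r.price = (Double) xpath.evaluate("PRICE", cd, XPathConstants.NUMBER);
                r.year = ((Double) xpath.evaluate("YEAR", cd, XPathConstants.NUMBER)).intValue();
                records.add(r);
            }
            return records;
        } catch (Exception e) {
            throw new IllegalStateException("Binding failed", e);
        }
    }

    public static void main(String[] args) {
        for (CDRecord r : bind(SAMPLE)) {
            System.out.println(r.title + " | " + r.artist + " | " + r.price + " | " + r.year);
        }
    }
}
```

Only the records that pass the YEAR filter are ever turned into objects, which is the point of the approach: the rest of the document is never bound.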

Tuesday, March 10, 2009
