Monday, August 29, 2011

XML DTDs Vs XML Schema

XML is a very handy format for storing data and communicating it between disparate systems in a platform-independent fashion. It is more than just a format for computers: a guiding principle in its creation was that it should be human-readable and easy to create.

XML allows UNIX systems written in C to communicate with Web Services that, for example, run on the Microsoft .NET architecture and are built with ASP.NET. XML is, however, only the meta-language that the systems understand, and they both need to agree on the format that the XML data will be in. Typically, one of the partners in the process will offer a service to the other: that partner is in charge of the format of the data, and publishes a formal definition of it.

The definition serves two purposes. The first is to ensure that the data that makes it past the parsing stage is at least in the right structure. As such, it’s a first level at which ‘garbage’ input can be rejected. Secondly, the definition documents the protocol in a standard, formal way, which makes it easier for developers to understand what’s available.

DTD – The Document Type Definition

The first method used to provide this definition was the DTD, or Document Type Definition. This defines the elements that may be included in your document, what attributes these elements have, and the ordering and nesting of the elements.

The DTD is declared in a DOCTYPE declaration beneath the XML declaration contained within an XML document:
Inline Definition:

<?xml version="1.0"?>
<!DOCTYPE documentelement [definition]>

External Definition:

<?xml version="1.0"?>
<!DOCTYPE documentelement SYSTEM "documentelement.dtd">

The actual body of the DTD itself contains definitions in terms of elements and their attributes. For example, the following short DTD defines a bookstore. It states that a bookstore has a name, and stocks books on at least one topic.

Each topic has a name and 0 or more books in stock. Each book has a title, an author and an ISBN. The name of the topic and the name of the bookstore are defined as the same element, which stores PCDATA: parsed character data, that is, plain text that the XML parser still scans for markup and entity references. The title and author of the book are PCDATA as well, because a DTD cannot declare an element’s content as CDATA. The ISBN is stored as an attribute of the book, of type CDATA (plain character data), with a default value of "0":

<!DOCTYPE bookstore [
  <!ELEMENT bookstore (name,topic+)>
  <!ELEMENT topic (name,book*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT book (title,author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST book isbn CDATA "0">
]>

An example of a book store’s inline definition might be:

<?xml version="1.0"?>
<!DOCTYPE bookstore [
  <!ELEMENT bookstore (name,topic+)>
  <!ELEMENT topic (name,book*)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT book (title,author)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!ATTLIST book isbn CDATA "0">
]>
<bookstore>
  <name>Mike's Store</name>
  <topic>
    <name>XML</name>
    <book isbn="123-456-789">
      <title>Mike's Guide To DTD's and XML Schemas</title>
      <author>Mike Jervis</author>
    </book>
  </topic>
</bookstore>

Using an inline definition is handy when you only have a few documents and they’re offline, as the definition is always in the file. However, if, for example, your DTD defines the XML protocol used to talk between two separate systems, re-transmitting the DTD with each document adds an overhead to the communications. Having an external DTD eliminates the need to re-send it each time. We could remove the DTD from the document, and place it in a DTD file on a Web server that’s accessible by the two systems:

<?xml version="1.0"?>
<!DOCTYPE bookstore SYSTEM "http://webserver/bookstore.dtd">
<bookstore>
  <name>Mike's Store</name>
  <topic>
    <name>XML</name>
    <book isbn="123-456-789">
      <title>Mike's Guide To DTD's and XML Schemas</title>
      <author>Mike Jervis</author>
    </book>
  </topic>
</bookstore>

The file bookstore.dtd would contain the full definition in a plain text file:

<!ELEMENT bookstore (name,topic+)>
<!ELEMENT topic (name,book*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT book (title,author)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ATTLIST book isbn CDATA "0">

The lowest level of definition in a DTD is character data: PCDATA (Parsed Character Data) for element content, and CDATA (Character Data) for attribute values. We can only define an element’s content as text, and with this limitation it is not possible, for example, to force an element to be numeric. Attributes can be restricted to an enumerated list of values, but they can’t be forced to be numeric either.

So, for example, if you stored your application’s settings in an XML file, it could be manually edited so that the window’s start coordinates were arbitrary strings, and you’d still need to validate this in your code rather than have the parser do it for you.
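
To make the limitation concrete, here is a minimal sketch of such a settings file. The element and attribute names (settings, window, state, x, y) are invented for illustration and are not taken from any real application. The state attribute can be limited to an enumerated list, but x and y can only be declared as CDATA, so the non-numeric coordinates below still pass DTD validation:

```xml
<?xml version="1.0"?>
<!DOCTYPE settings [
  <!ELEMENT settings (window)>
  <!ELEMENT window EMPTY>
  <!-- state is limited to three allowed values and defaults to "normal" -->
  <!ATTLIST window state (normal|maximized|minimized) "normal">
  <!-- x and y are meant to be numbers, but CDATA is the closest a DTD offers -->
  <!ATTLIST window x CDATA #REQUIRED
                   y CDATA #REQUIRED>
]>
<settings>
  <!-- valid according to the DTD, although the coordinates are not numbers -->
  <window state="maximized" x="left" y="top"/>
</settings>
```

A validating parser would reject state="huge", since that value is not in the list, but it has no way to object to x="left".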

XML Schemas

XML Schemas provide a much more powerful means by which to define your XML document structure and limitations. XML Schemas are themselves XML documents. They reference the XML Schema namespace (http://www.w3.org/2001/XMLSchema), and even have their own DTD.

What XML Schemas do is provide an Object Oriented approach to defining the format of an XML document. XML Schemas provide a set of basic types. These types are much wider ranging than the basic PCDATA and CDATA of DTDs. They include most basic programming types such as integer, byte, string and floating point numbers, but they also expand into Internet data types such as ISO country and language codes (en-GB, for example). The full list is given in the XML Schema Part 2: Datatypes specification.
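
As a rough sketch of what these built-in types look like in use, the fragment below binds a handful of elements to them. The element names are hypothetical and chosen only for illustration; the type names are the built-in ones from the XML Schema namespace:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- hypothetical elements, each bound to a built-in type -->
  <xsd:element name="quantity"  type="xsd:integer"/>
  <xsd:element name="flags"     type="xsd:byte"/>      <!-- -128 to 127 -->
  <xsd:element name="price"     type="xsd:float"/>
  <xsd:element name="title"     type="xsd:string"/>
  <xsd:element name="locale"    type="xsd:language"/>  <!-- e.g. en-GB -->
  <xsd:element name="published" type="xsd:date"/>      <!-- e.g. 2011-08-29 -->
</xsd:schema>
```

With declarations like these, a validating parser can reject a quantity of "lots" before the document ever reaches your code.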
The author of an XML Schema then uses these core types, along with various operators and modifiers, to create complex types of their own. These complex types are then used to define an element in the XML Document.
As a simple example, let’s try to create a basic XML Schema for defining the bookstore that we used as an example for DTDs. Firstly, we must declare this as an XSD Document, and, as we want this to be very user friendly, we’re going to add some basic documentation to it:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:annotation>
  <xsd:documentation xml:lang="en">
    XML Schema for a Bookstore as an example.
  </xsd:documentation>
</xsd:annotation>

Now, in the previous example, the bookstore consisted of the sequence of a name and at least one topic. We can easily do that in an XML Schema:

<xsd:element name="bookstore" type="bookstoreType"/>
<xsd:complexType name="bookstoreType">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="topic" type="topicType" minOccurs="1" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

In this example, we’ve defined an element, bookstore, that will equate to an XML element in our document. We’ve defined it to be of type bookstoreType, which is not a standard type, and so we provide a definition of that type next.

We then define a complexType, which defines bookstoreType as a sequence of name and topic elements. Our name element is an xsd:string, a type defined by the XML Schema namespace, and so we’ve fully defined that element.

The topic element, however, is of type topicType, another custom type that we must define. We’ve also defined our topic element with minOccurs="1", which means there must be at least one topic at all times, and with maxOccurs="unbounded", which means there is no upper limit to the number of topics that might be included. Both attributes default to 1, so if we had specified neither, the result would be exactly one instance, as is used in the name element. Next, we define the schema for the topicType.

<xsd:complexType name="topicType">
  <xsd:sequence>
    <xsd:element name="name" type="xsd:string"/>
    <xsd:element name="book" type="bookType" minOccurs="0" maxOccurs="unbounded"/>
  </xsd:sequence>
</xsd:complexType>

This is all similar to the declaration of the bookstoreType, but note that we have to re-define our name element within the scope of this type. If we’d created a named type for name, such as nameType, based on xsd:string, and defined it outside our complex types, we could re-use it in both. However, to illustrate the point, I decided to define it within each section. XML Schema gets interesting when we get to defining our bookType:

<xsd:complexType name="bookType">
  <xsd:sequence>
    <xsd:element name="title" type="xsd:string"/>
    <xsd:element name="author" type="xsd:string"/>
  </xsd:sequence>
  <xsd:attribute name="isbn" type="isbnType"/>
</xsd:complexType>
<xsd:simpleType name="isbnType">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[0-9]{3}-[0-9]{3}-[0-9]{3}"/>
  </xsd:restriction>
</xsd:simpleType>

The element content of bookType is not particularly interesting, but the definition of its isbn attribute is. Not only does XML Schema support built-in types such as xsd:nonNegativeInteger, but we can also create our own simple types from these basic types using various modifiers. In the example for isbnType above, we base it on a string, and restrict it to match a given regular expression. Excusing my poor regex, that should limit any isbn attribute to the simplified format used in this example: three groups of three digits separated by dashes.
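
Assembled into a single file, the schema fragments in this section might look like the following sketch. The file name bookstore.xsd is an assumption made for illustration. The sketch wraps child elements in xsd:sequence and uses maxOccurs="unbounded" wherever the DTD used + or *, which is what a validating parser needs in order to accept repeated elements:

```xml
<?xml version="1.0"?>
<!-- bookstore.xsd (hypothetical file name) -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="bookstore" type="bookstoreType"/>

  <!-- a bookstore has a name and one or more topics -->
  <xsd:complexType name="bookstoreType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="topic" type="topicType" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- a topic has a name and zero or more books -->
  <xsd:complexType name="topicType">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="book" type="bookType" minOccurs="0" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- attributes are declared after the content model -->
  <xsd:complexType name="bookType">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="isbn" type="isbnType"/>
  </xsd:complexType>

  <!-- three groups of three digits, separated by dashes -->
  <xsd:simpleType name="isbnType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9]{3}-[0-9]{3}-[0-9]{3}"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

A document can then point a validating parser at the schema. The xsi:noNamespaceSchemaLocation attribute is only a hint, and the relative path assumes the schema sits next to the document:

```xml
<?xml version="1.0"?>
<bookstore xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="bookstore.xsd">
  <name>Mike's Store</name>
  <topic>
    <name>XML</name>
    <book isbn="123-456-789">
      <title>Mike's Guide To DTD's and XML Schemas</title>
      <author>Mike Jervis</author>
    </book>
  </topic>
</bookstore>
```

Changing the attribute to isbn="12-3456-789" should make schema validation fail, whereas the DTD would have accepted it.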
This is just a simple example, but it should give you a taste of the many things you can do to control the content of an attribute or an element. You have far more control over what is considered a valid XML document using a schema. You can even
  • extend your types from other types you’ve created,
  • require uniqueness within scope, and
  • provide lookups.
It’s a nicely object oriented approach. You could build a library of complexTypes and simpleTypes for re-use throughout many projects, and even find other definitions of common types (such as an "address", for example) from the Internet and use these to provide powerful definitions of your XML documents.
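
The three features in the list above can be sketched in one small schema. Everything named here (nameType, audioBookType, catalogue, review, minutes) is hypothetical and exists only to illustrate the mechanisms: xsd:extension for deriving one type from another, xsd:key for uniqueness within a scope, and xsd:keyref for lookups against that key:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- a reusable simple type, shared by every element that holds a name -->
  <xsd:simpleType name="nameType">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="1"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:complexType name="bookType">
    <xsd:sequence>
      <xsd:element name="title" type="nameType"/>
      <xsd:element name="author" type="nameType"/>
    </xsd:sequence>
    <xsd:attribute name="isbn" type="xsd:string" use="required"/>
  </xsd:complexType>

  <!-- extension: an audio book is a book plus a running time -->
  <xsd:complexType name="audioBookType">
    <xsd:complexContent>
      <xsd:extension base="bookType">
        <xsd:sequence>
          <xsd:element name="minutes" type="xsd:positiveInteger"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <xsd:element name="catalogue">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="book" type="bookType" maxOccurs="unbounded"/>
        <xsd:element name="review" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:simpleContent>
              <xsd:extension base="xsd:string">
                <xsd:attribute name="isbn" type="xsd:string" use="required"/>
              </xsd:extension>
            </xsd:simpleContent>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
    <!-- uniqueness: no two books in one catalogue may share an isbn -->
    <xsd:key name="bookIsbn">
      <xsd:selector xpath="book"/>
      <xsd:field xpath="@isbn"/>
    </xsd:key>
    <!-- lookup: every review must refer to the isbn of an existing book -->
    <xsd:keyref name="reviewIsbn" refer="bookIsbn">
      <xsd:selector xpath="review"/>
      <xsd:field xpath="@isbn"/>
    </xsd:keyref>
  </xsd:element>

</xsd:schema>
```

The identity constraints are scoped to the catalogue element, so the same isbn could still appear in two different catalogue documents.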

DTD vs XML Schema

The DTD provides a basic grammar for defining an XML Document in terms of the metadata that comprise the shape of the document. An XML Schema provides this, plus a detailed way to define what the data can and cannot contain. It provides far more control for the developer over what is legal, and it provides an Object Oriented approach, with all the benefits this entails.

So, if XML Schemas provide an Object Oriented approach to defining an XML document’s structure, and if XML Schemas give us the power to define re-useable types such as an ISBN number based on a wide range of pre-defined types, why would we use a DTD? There are in fact several good reasons for using the DTD instead of the schema.
Firstly, and rather an important point, is that XML Schema is a newer technology than the DTD. This means that whilst some XML parsers support it fully, many still don’t. If you use XML to communicate with a legacy system, perhaps it won’t support XML Schema.
Many systems’ interfaces are already defined as a DTD. They are mature definitions, rich and complex. The effort in re-writing the definition may not be worthwhile.
DTD is also established, and examples of common objects defined in a DTD abound on the Internet — freely available for re-use. A developer may be able to use these to define a DTD more quickly than they would be able to accomplish a complete re-development of the core elements as a new schema.
Finally, you must also consider the fact that the XML Schema is an XML document. It has an XML Namespace to refer to, and an XML DTD to define it. This is all overhead. When a parser examines the document, it may have to link this all in, interpret the DTD for the Schema, load the namespace, and validate the schema, etc., all before it can parse the actual XML document in question. If you’re using XML as a protocol between two systems that are in heavy use, and need a quick response, then this overhead may seriously degrade performance.

Then again, if your system is available for third party developers as a Web service, then the detailed enforcement of the XML Schema may protect your application a lot more effectively from malicious — or just plain bad — XML packets. As an example, Muse.net is an interesting technology. They have a publicly-available SOAP API defined with an XML Schema that provides their developers more control over what they receive from the user community.
