Sunday, May 27, 2007

Understanding the Complexity of XML Schema

Variant Content Tree


XML Schema outlaws non-deterministic content models. However, the following features still allow the content of a valid instance to vary:
  1. Occurrences of components
  2. Optional attributes
  3. Namespace attributes in the instance documents
  4. Substitution group replacement
  5. xsi:nil
  6. xsi:type (a Mechanism for Replacement by Derived Type)
  7. Wildcard components
  8. Dangling types



Occurrences of Components

In a schema definition, the following components can carry occurrence constraints (minOccurs and maxOccurs) that control how many times a subtree or element may occur:
  • all
  • choice
  • sequence
  • element
  • any
  • group
The model group, particle, and wildcard components contribute to the portion of a complex type definition that controls an element information item's content.

A model group is a constraint in the form of a grammar fragment that applies to lists of element information items. It consists of a list of particles, i.e., element declarations, wildcards and model groups. There are three varieties of model group:

  • Sequence
  • All (or conjunction)
  • Choice (or Disjunction)
A particle is a term in the grammar for element content, consisting of either an element declaration, a wildcard or a model group, together with occurrence constraints.
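
As a rough illustration of how occurrence constraints attach to particles, the sketch below declares a hypothetical order element (the element names are invented for this example and are not from any real vocabulary). The constraints sit on a sequence, on individual element declarations, and on a nested choice.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical "order" element, used only to illustrate occurrence constraints. -->
  <xsd:element name="order">
    <xsd:complexType>
      <!-- The whole sequence may repeat one to three times. -->
      <xsd:sequence minOccurs="1" maxOccurs="3">
        <!-- No constraints given: minOccurs and maxOccurs both default to 1. -->
        <xsd:element name="item" type="xsd:string"/>
        <!-- Optional element: zero or one occurrence. -->
        <xsd:element name="note" type="xsd:string" minOccurs="0"/>
        <!-- A nested choice that may occur any number of times, including none. -->
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
          <xsd:element name="giftWrap" type="xsd:boolean"/>
          <xsd:element name="express" type="xsd:boolean"/>
        </xsd:choice>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Every valid order instance follows the same grammar, yet the number of children it contains can differ from one document to the next.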

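A wildcard is a special kind of particle which matches element and attribute information items dependent on their namespace name, independently of their local names. In other words, to tell the schema processor to accept markup that the schema does not declare, use the xs:any particle (or xs:anyAttribute for attributes) in your content model; with processContents="skip" the processor ignores the matched markup entirely.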

There are two types of convenience definitions provided to enable the re-use of pieces of complex type definitions:

  • model group definitions
  • attribute group definitions
Group definition components are a macro-like mechanism to allow the re-use of pieces of complex type definitions in defining a complex type in a schema file. These definition components cannot be referenced in instance documents.
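
A minimal sketch of both kinds of group definition follows. The names nameParts, auditAttrs, and personType are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Model group definition: a reusable piece of content model. -->
  <xsd:group name="nameParts">
    <xsd:sequence>
      <xsd:element name="first" type="xsd:string"/>
      <xsd:element name="last" type="xsd:string"/>
    </xsd:sequence>
  </xsd:group>

  <!-- Attribute group definition: a reusable set of attribute declarations. -->
  <xsd:attributeGroup name="auditAttrs">
    <xsd:attribute name="createdBy" type="xsd:string"/>
    <xsd:attribute name="createdOn" type="xsd:date"/>
  </xsd:attributeGroup>

  <!-- Both groups are pulled into a complex type by reference. -->
  <xsd:complexType name="personType">
    <xsd:sequence>
      <xsd:group ref="nameParts"/>
      <xsd:element name="email" type="xsd:string" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attributeGroup ref="auditAttrs"/>
  </xsd:complexType>
</xsd:schema>
```

An instance document only ever sees the first, last, and email elements and the two attributes; the group names themselves never appear in it.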


Optional Attributes

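Attributes can be optional. An attribute declared in a complex type is optional unless its declaration says use="required", so a valid instance may or may not carry it. An optional attribute can also be given a default value, which the schema processor supplies when the attribute is absent.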
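
The sketch below shows the three common cases side by side. The lens element and its attribute names are assumptions made up for this example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical "lens" element with required, optional, and defaulted attributes. -->
  <xsd:element name="lens">
    <xsd:complexType>
      <!-- Must be present in every instance. -->
      <xsd:attribute name="model" type="xsd:string" use="required"/>
      <!-- No use attribute: optional, which is the default. -->
      <xsd:attribute name="coating" type="xsd:string"/>
      <!-- Optional with a default value supplied when the attribute is absent. -->
      <xsd:attribute name="mount" type="xsd:string" default="standard"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under this schema both <lens model="X1"/> and <lens model="X1" coating="multi" mount="bayonet"/> are valid, while <lens coating="multi"/> is not, because model is missing.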


Namespace Attributes in the Instance Documents

The targetNamespace attribute in an xs:schema element places all the components in a schema within one category. Namespaces allow you to “visually categorize” components in instance documents, for example olympus:zoom and olympus:f-stop.

Namespace bindings are made per document, and no bindings are transferred between documents. The one exception is the chameleon include: when a schema with no targetNamespace is included into a schema that has one, all definitions and references of the form {None}xxx are turned into definitions and references of the form {tns}xxx, where 'tns' is the target namespace of the including schema. Namespace bindings are provided by namespace declaration attributes, which can be given explicitly or defaulted from the schema or DTD of the document. If needed, they can be specified anywhere in a document.


QName Interpretation
The XML Schema REC extends the XML Namespace REC to attribute values of type QName in a very simple way, which is always local to a single XML document. On this basis it constructs *expanded names* (written here in the form {ns or None}local) with respect to which all subsequent processing is done, including import, include, redefine and validation. For example, consider b.xsd:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="boo" type="foo"/>
</xsd:schema>

type='foo' is interpreted in the context of b.xsd as an XML document. Because b.xsd has no default namespace declaration, this is a reference to {None}foo as it stands; only a chameleon include can change that interpretation.

The namespace of a locally declared element is completely determined by its form attribute (or, if that is absent, the elementFormDefault attribute of its ancestor xs:schema element) together with the targetNamespace attribute of that xs:schema element. A globally declared element is always in the target namespace. Note that elementFormDefault applies just to the schema document that it is in. It does not apply to schemas that it includes or imports.

The interpretation of all attributes in XML Schema documents declared to be of type QName, including 'type', 'base' and 'ref', is always strictly per the Namespaces REC for element names: a prefixed QName takes the namespace bound to its prefix, and an unprefixed QName takes the default namespace if one is in scope and is otherwise unqualified.

The interpretation of all 'name' attributes (which are NCNames) in XML Schema documents is that they provide names for the associated definition or declaration in the namespace determined by the 'targetNamespace' attribute of the enclosing xs:schema element (with an exception for chameleon includes).
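
To make the two rules concrete, here is a variant of b.xsd written as a sketch. The namespace http://example.org/b and the element bar are assumptions added for illustration; they are not part of the original b.xsd.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/b"
            xmlns="http://example.org/b"
            xmlns:b="http://example.org/b">
  <!-- name="foo" is an NCName: it defines {http://example.org/b}foo
       because of the targetNamespace attribute above. -->
  <xsd:simpleType name="foo">
    <xsd:restriction base="xsd:string"/>
  </xsd:simpleType>

  <!-- Unprefixed QName: resolved through the default namespace declaration,
       so it refers to {http://example.org/b}foo rather than {None}foo. -->
  <xsd:element name="boo" type="foo"/>

  <!-- Prefixed QName: resolved through the b prefix to the same expanded name. -->
  <xsd:element name="bar" type="b:foo"/>
</xsd:schema>
```

If the xmlns="http://example.org/b" declaration were removed, type="foo" would again mean {None}foo and the reference on boo would no longer resolve to the type defined here.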

Namespace Processing in DOM
  • Namespace declaration attributes are exposed and can be manipulated just like any other attribute.
    By definition, all namespace attributes (including those named xmlns, whose [prefix] property has no value) have a namespace URI of http://www.w3.org/2000/xmlns/.
  • Nodes are permanently bound to namespace URIs when they are created.
  • Namespace validation is not enforced; the DOM application is responsible for it.
    DOM Level 3 provides the normalizeDocument method so that a DOM application can fix up namespaces at a time of its choosing.
  • Elements and attributes created with DOM Level 1 methods do not have any namespace prefix, namespace URI, or local name. Put another way, DOM Level 1 methods are namespace ignorant.
    DOM Level 1 methods identify attribute nodes solely by their nodeName. In contrast, the namespace-aware DOM Level 2 methods identify attribute nodes by their namespaceURI and localName.



Element Substitution Group (a Mechanism for Replacement by Substitution Group Members)

An element substitution group allows global elements to be substituted for other global elements. When an instance document contains element substitutions whose types are derived (by restriction or extension) from those of their head elements, it is not necessary to identify the derived types using the xsi:type construction.

XML Schema provides a mechanism that controls which substitution groups may be used in instance documents. If an element declaration's block attribute is set to 'substitution', it prevents users from substituting members of the substitution group of the element for the element itself in the instance document.
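
A small sketch of a substitution group follows. The publication and book elements and their types are hypothetical names chosen for this example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="publicationType">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- The member's type must be derived from the head's type. -->
  <xsd:complexType name="bookType">
    <xsd:complexContent>
      <xsd:extension base="publicationType">
        <xsd:sequence>
          <xsd:element name="isbn" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Head element and one member of its substitution group; both are global. -->
  <xsd:element name="publication" type="publicationType"/>
  <xsd:element name="book" type="bookType" substitutionGroup="publication"/>

  <!-- The content model only mentions the head element. -->
  <xsd:element name="catalogue">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="publication" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A catalogue may then contain book elements with title and isbn children, with no xsi:type attribute needed. Adding block="substitution" to the publication declaration would make those book elements invalid inside catalogue.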




xsi:nil

XML Schema's nil mechanism enables an element to appear with or without a non-nil value. This is useful for elements that sometimes don't hold any information. A nillable element can always appear in the instance document, even when it has no value, which avoids introducing non-determinism. The nil mechanism applies only to element values, and not to attribute values.

At validation time, we need to have nillable information on the element declaration to know whether an element is nillable or not.
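
As a sketch, the declaration below marks a hypothetical shipDate element as nillable (shipment and shipDate are names invented for this example).

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="shipment">
    <xsd:complexType>
      <xsd:sequence>
        <!-- nillable="true" permits xsi:nil="true" on instances of this element. -->
        <xsd:element name="shipDate" type="xsd:date" nillable="true"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With the xsi prefix bound to http://www.w3.org/2001/XMLSchema-instance, an instance may contain either <shipDate>2007-05-27</shipDate> or <shipDate xsi:nil="true"/>. A plain <shipDate/> without xsi:nil is still invalid, because the empty string is not a valid xsd:date.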

As an alternative, a union can be used instead of xsi:nil. For example, given a requirement that a single number can be either empty or can have 10 digits:

<number/> or <number>1234567890</number>
Then, you can use a union of an integer type and a string type, and <number/> will validate as the empty string, which is a valid value for the string type. To keep arbitrary text from validating as well, the string member should be restricted to the empty string.
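
One possible way to write such a union is sketched here. The type names tenDigits, emptyString, and optionalNumber are hypothetical, and the string member is restricted to length 0 so that only the empty value is accepted through it.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Integer member: exactly ten digits in the lexical form. -->
  <xsd:simpleType name="tenDigits">
    <xsd:restriction base="xsd:integer">
      <xsd:pattern value="[0-9]{10}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- String member: only the empty string is allowed. -->
  <xsd:simpleType name="emptyString">
    <xsd:restriction base="xsd:string">
      <xsd:length value="0"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- A value is valid if it is valid against at least one member type. -->
  <xsd:simpleType name="optionalNumber">
    <xsd:union memberTypes="tenDigits emptyString"/>
  </xsd:simpleType>

  <xsd:element name="number" type="optionalNumber"/>
</xsd:schema>
```

Unlike xsi:nil, this approach needs nothing from the XMLSchema-instance namespace in the instance document.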




xsi:type (a Mechanism for Replacement by Derived Type)

To use a derived type in place of a base type in an instance document, the derived type must be identified in the instance document using xsi:type.

At validation time, we need to have type derivation information other than element tree to know whether a derived type can be used in place of a base type or not.

XML Schema provides a mechanism that controls which derivations may be used in instance documents. If an element declaration's block attribute is set to 'restriction' or 'extension', users cannot use xsi:type on instances of that element to switch to a type derived by the blocked method.
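
The following sketch shows a base type, a type derived from it by extension, and two elements that differ only in their block attribute. All names (addressType, usAddressType, shipTo, billTo) are assumptions made for this example.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="addressType">
    <xsd:sequence>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Derived by extension: adds a zip element after the inherited content. -->
  <xsd:complexType name="usAddressType">
    <xsd:complexContent>
      <xsd:extension base="addressType">
        <xsd:sequence>
          <xsd:element name="zip" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

  <!-- Instances may carry xsi:type="usAddressType". -->
  <xsd:element name="shipTo" type="addressType"/>

  <!-- Types derived by extension are blocked for this element. -->
  <xsd:element name="billTo" type="addressType" block="extension"/>
</xsd:schema>
```

With the xsi prefix bound to http://www.w3.org/2001/XMLSchema-instance, <shipTo xsi:type="usAddressType"> with street, city, and zip children is valid, while the same xsi:type on billTo is rejected.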




Wildcard Components

A wildcard is a special kind of particle which matches element and attribute information items dependent on their namespace name, independently of their local names. At validation time, wildcard components can match any element or attribute instance from a specified set of namespaces.
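
A sketch of both wildcard forms is given below. The report element and the namespace http://example.org/report are hypothetical.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/report"
            elementFormDefault="qualified">
  <xsd:element name="report">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="summary" type="xsd:string"/>
        <!-- Element wildcard: any number of elements from namespaces other than
             the target namespace; validated only if a declaration can be found. -->
        <xsd:any namespace="##other" processContents="lax"
                 minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
      <!-- Attribute wildcard: any attribute from any namespace, not validated. -->
      <xsd:anyAttribute namespace="##any" processContents="skip"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The namespace attribute selects which namespaces the wildcard matches, and processContents (strict, lax, or skip) controls how much validation the matched items receive.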


Dangling Type

In XML Schema, you are allowed to declare an element with variable content by giving it a type from another namespace, one that is named in an import component for which no schemaLocation is provided. At validation time, the needed type information is supplied by associating that namespace with a schema via xsi:schemaLocation in the instance document.

Example:



<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.weather-station.org"
            xmlns="http://www.weather-station.org"
            xmlns:s="http://www.sensor.org"
            elementFormDefault="qualified">
  <xsd:import namespace="http://www.sensor.org"/>
  <xsd:element name="weather-station">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="sensor" type="s:sensor_type"
                     maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

Notes:


  1. An import with no schemaLocation!
  2. sensor is the variable content container. It contains a sensor_type for which no schema has been indicated!
  3. The instance document must then identify a schema that implements sensor_type. Thus, at run time (validation time) we are matching up the reference to sensor_type with the implementation of sensor_type. For example, an instance document may have this:
xsi:schemaLocation=
"http://www.weather-station.org weather-station.xsd
http://www.sensor.org boston-sensors.xsd"

In this instance document, xsi:schemaLocation identifies a schema, boston-sensors.xsd, which provides an implementation of sensor_type.
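
The content of boston-sensors.xsd is not given here, so the following is only a guess at its general shape. The thermometer and barometer children are invented for illustration; the parts that matter are the targetNamespace and the type name.

```xml
<?xml version="1.0"?>
<!-- Hypothetical boston-sensors.xsd: one possible implementation of sensor_type. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.sensor.org"
            elementFormDefault="qualified">
  <!-- The name and namespace must match the s:sensor_type reference
       made by the weather-station schema. -->
  <xsd:complexType name="sensor_type">
    <xsd:sequence>
      <xsd:element name="thermometer" type="xsd:decimal"/>
      <xsd:element name="barometer" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

A different instance document could point http://www.sensor.org at another schema file with a different sensor_type, which is what makes the content of sensor variable.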

Chameleon Include

When an integrating schema includes components from a schema with no targetNamespace, those no-namespace components "take on" the namespace of the integrating schema. This is called the Chameleon Effect. From now on the no-namespace components will be referred to as Chameleon components.
In DOM, nodes are permanently bound to namespace URIs as they get created. In a chameleon include, by contrast, the integrated schema's components assume the integrating schema's targetNamespace. In particular, the include replaces "absent" in the following places:
  1. The {target namespace} of named schema components, both at the top level and (in the case of nested type definitions and nested attribute and element declarations whose form was qualified) nested within definitions;
  2. The {namespace constraint} of a wildcard, whether negated or not.
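
The effect can be sketched with two schema documents. The file names common.xsd and camera.xsd, the namespace http://example.org/camera, and the component names are all assumptions made for this example.

common.xsd (no targetNamespace):

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- On their own these are {None}zoomType and {None}zoom. -->
  <xsd:simpleType name="zoomType">
    <xsd:restriction base="xsd:decimal"/>
  </xsd:simpleType>
  <xsd:element name="zoom" type="zoomType"/>
</xsd:schema>
```

camera.xsd (the integrating schema):

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/camera"
            xmlns="http://example.org/camera"
            elementFormDefault="qualified">
  <!-- Chameleon include: the included components take on this schema's namespace. -->
  <xsd:include schemaLocation="common.xsd"/>
  <xsd:element name="camera">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Resolves to {http://example.org/camera}zoom via the default namespace. -->
        <xsd:element ref="zoom"/>
        <xsd:element name="f-stop" type="xsd:decimal"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The type="zoomType" reference inside common.xsd is converted along with the definitions, so it still finds its type after the include.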

Redefine

A "redefine" element can be used to redefine simpleType, complexType, group, and attributeGroup definitions in another schema document which either has the same targetNamespace as the redefining schema document, or has no targetNamespace at all, in which case the redefined schema document is converted to the redefining schema document's targetNamespace. Note that you cannot redefine components that are in a different namespace.

Consider the following example:


v1.xsd:

<xs:complexType name="personName">
  <xs:sequence>
    <xs:element name="title" minOccurs="0"/>
    <xs:element name="forename" minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
<xs:element name="addressee" type="personName"/>


v2.xsd:

<xs:redefine schemaLocation="v1.xsd">
  <xs:complexType name="personName">
    <xs:complexContent>
      <!-- The base refers to the original personName from v1.xsd. -->
      <xs:extension base="personName">
        <xs:sequence>
          <xs:element name="generation" minOccurs="0"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:redefine>

<xs:element name="author" type="personName"/>



The schema corresponding to v2.xsd has everything specified by v1.xsd, with the personName type redefined, as well as everything it specifies itself. According to this schema, elements constrained by the personName type may end with a generation element. This includes not only the author element, but also the addressee element.

Redefining components such as simpleType and complexType are looked up at run time, and other redefining components such as group and attributeGroup are dereferenced at run time. Because of the search order, the redefining schema's type definitions and group definitions are found first. So, in either case, the redefining schema's components take precedence over the original schema's components.

In this implementation, both original definitions and redefining definitions are kept separately. So, after a validation session, the original schema definitions can be re-exposed and be reused.