Lore’s XML Data Model

In Lore’s new XML-based data model, an XML element is a pair (eid, value), where eid is a unique element identifier and value is either an atomic text string or a complex value containing the following four components:

1. A string-valued tag corresponding to the XML tag for that element.

2. An ordered list of attribute-name/atomic-value pairs, where each attribute-name is a string and each atomic-value has an atomic type drawn from integer, real, string, etc., or ID, IDREF, or IDREFS.

3. An ordered list of crosslink subelements of the form (label, eid), where label is a string. Crosslink subelements are introduced via an attribute of type IDREF or IDREFS.

4. An ordered list of normal subelements of the form (label, eid), where label is a string. Normal subelements are
introduced via lexical nesting within an XML document.
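
The four components above can be written down as a small data structure. The following is only a sketch in Python, not Lore's actual implementation: the class names `Element` and `ComplexValue`, the field names, and the use of integers for eids are all assumptions made for illustration, and attribute values are kept as plain strings rather than typed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class ComplexValue:
    """Hypothetical container for the four components of a complex value."""
    tag: str                                                         # 1. XML tag
    attributes: List[Tuple[str, str]] = field(default_factory=list)  # 2. (name, value) pairs, ordered
    crosslinks: List[Tuple[str, int]] = field(default_factory=list)  # 3. (label, eid), from IDREF/IDREFS
    subelements: List[Tuple[str, int]] = field(default_factory=list) # 4. (label, eid), from nesting


@dataclass
class Element:
    """An element is a pair (eid, value); value is atomic text or complex."""
    eid: int
    value: Union[str, ComplexValue]

    def is_atomic(self) -> bool:
        return isinstance(self.value, str)


# Illustrative data only: a Member with a Name child and an advisor crosslink.
name_text = Element(4, "Smith")
name = Element(3, ComplexValue(tag="Name", subelements=[("Text", 4)]))
member = Element(
    2,
    ComplexValue(
        tag="Member",
        attributes=[("id", "m1"), ("advisor", "m2")],
        crosslinks=[("advisor", 5)],   # eid 5 stands for the element whose ID is "m2"
        subelements=[("Name", 3)],
    ),
)
```

Subelements and crosslinks hold eids rather than nested objects, which mirrors the (label, eid) form in the definition and lets the same element be reached along more than one edge.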

An XML document is mapped easily into our data model. Note that we ignore comments and whitespace
between tagged elements. As a base case, text between tags is translated into an atomic text
element; we do the same thing for CDATA sections, used in XML to escape text that might otherwise
be interpreted as markup. Otherwise, a document element is translated into a complex data element such that:

1. The tag of the data element is the tag of the document element.

2. The list of attribute-name/atomic-value pairs in the data element is derived directly from the document
element’s attribute list.

3. For each attribute value i of type IDREF in the document element, or component i of an attribute value of
type IDREFS, there is one crosslink subelement (label, eid) in the data element, where label is the
corresponding attribute name and eid identifies the unique data element whose ID attribute value matches i.

4. The subelements of the document element appear, in order, as the normal subelements of the data element.
The label for each data subelement is the tag of that document subelement, or Text if the document subelement
is atomic.
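
The translation rules above can be sketched with Python's standard `xml.etree.ElementTree` parser. This is an illustration under several assumptions, not Lore's loader: the function name `load`, the dictionary layout, and the integer eids are invented here, and because ElementTree does not read attribute types from a DTD, the caller has to say which attribute names are of type ID, IDREF, or IDREFS. The sample document is also made up.

```python
import xml.etree.ElementTree as ET


def load(xml_text, id_attr="id", idref_attrs=(), idrefs_attrs=()):
    """Map an XML string to (root eid, {eid: value}); hypothetical helper."""
    elements = {}   # eid -> atomic string, or dict holding the four components
    by_id = {}      # ID attribute value -> eid of the element carrying it
    pending = []    # (source eid, attribute name, referenced ID), resolved at the end
    counter = 0

    def new_eid():
        nonlocal counter
        counter += 1
        return counter

    def add_text(parent, text):
        # Whitespace-only text between tags is ignored; other text becomes
        # an atomic element labeled "Text".
        if text and text.strip():
            eid = new_eid()
            elements[eid] = text
            parent["subelements"].append(("Text", eid))

    def visit(node):
        eid = new_eid()
        value = {"tag": node.tag, "attributes": list(node.attrib.items()),
                 "crosslinks": [], "subelements": []}
        elements[eid] = value
        for name, val in node.attrib.items():
            if name == id_attr:
                by_id[val] = eid
            elif name in idref_attrs:
                pending.append((eid, name, val))
            elif name in idrefs_attrs:
                # IDREFS is a whitespace-separated list; one crosslink per component.
                pending.extend((eid, name, ref) for ref in val.split())
        add_text(value, node.text)
        for child in node:
            value["subelements"].append((child.tag, visit(child)))
            add_text(value, child.tail)
        return eid

    root = visit(ET.fromstring(xml_text))
    # Resolve crosslinks once every ID is known (a dangling reference raises KeyError).
    for eid, name, ref in pending:
        elements[eid]["crosslinks"].append((name, by_id[ref]))
    return root, elements


doc = """<Group>
  <Member id="m1" advisor="m2"><Name>Smith</Name></Member>
  <Member id="m2"><Name>Jones</Name></Member>
</Group>"""
root, elements = load(doc, idref_attrs={"advisor"})
```

Crosslinks are resolved in a second pass because an IDREF may point forward to an element that has not been visited yet. The default ElementTree parser already discards comments and delivers CDATA sections as ordinary text, which matches the treatment described above.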

Note that multiple XML documents can be loaded into a single database, and any system of cross-document
links (e.g., XLink or XPointer) can be used provided information that uniquely identifies elements is not lost.

Figure 1:  XML document and its graph

Once one or more XML documents are mapped into our data model it is convenient to visualize the data as a
directed, labeled, ordered graph. The nodes in the graph represent the data elements and the edges represent
the element-subelement relationship. Each node representing a complex data element contains a tag and an
ordered list of attribute-name/atomic-value pairs; atomic data element nodes contain string values. There are
two different types of edges in the graph: (i) normal subelement edges, labeled with the tag of the
destination subelement; (ii) crosslink edges, labeled with the attribute name that introduced the crosslink.
Note that the graph representation is isomorphic to the data model, so they can be discussed interchangeably.

It is useful to view the XML data in one of two modes: semantic or literal. Semantic mode is used when the
user or application wishes to view the database as an interconnected graph. The graph representing the
semantic mode omits attributes of type IDREF and IDREFS, and the distinction between subelement and
crosslink edges is gone. Literal mode is available when the user wishes to view the database as an XML
document. IDREF and IDREFS attributes are visible as textual strings, while crosslink edges are invisible.
In literal mode, the database is always a tree.
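
The difference between the two modes can be illustrated as a per-node view function. This is a sketch under assumptions: the dictionary layout and the name `view` are hypothetical, IDREF and IDREFS attributes are recognized by matching attribute names against the crosslink labels of the same element, and the order in which semantic mode lists normal and crosslink edges is an arbitrary choice made here.

```python
def view(elements, eid, mode):
    """Return the attributes and outgoing edges visible at a node in a given mode."""
    value = elements[eid]
    if isinstance(value, str):              # atomic text element: nothing to show
        return {"attributes": [], "edges": []}
    if mode == "literal":
        # IDREF/IDREFS attributes stay visible as strings; crosslinks are hidden.
        return {"attributes": list(value["attributes"]),
                "edges": list(value["subelements"])}
    if mode == "semantic":
        # Reference attributes are omitted and both kinds of edge are merged.
        ref_names = {label for label, _ in value["crosslinks"]}
        attrs = [(n, v) for n, v in value["attributes"] if n not in ref_names]
        return {"attributes": attrs,
                "edges": value["subelements"] + value["crosslinks"]}
    raise ValueError("mode must be 'literal' or 'semantic'")


# Illustrative database: two members, the first pointing at the second.
elements = {
    1: {"tag": "Group", "attributes": [], "crosslinks": [],
        "subelements": [("Member", 2), ("Member", 3)]},
    2: {"tag": "Member", "attributes": [("id", "m1"), ("advisor", "m2")],
        "crosslinks": [("advisor", 3)], "subelements": []},
    3: {"tag": "Member", "attributes": [("id", "m2")],
        "crosslinks": [], "subelements": []},
}

literal = view(elements, 2, "literal")
semantic = view(elements, 2, "semantic")
```

In this example node 3 is reachable only through Group in literal mode, but through both Group and the advisor edge in semantic mode, which is why the literal view is a tree while the semantic view is a general graph.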

Figure 1 shows a small sample XML document and the graph representation in our data model. Element
identifiers (eids) appear within nodes and are written as &1, &2, etc. Attribute-name/atomic-value pairs
are shown next to the associated nodes (surrounded by {}), with IDREF attributes in italics. Subelement
edges are solid and crosslink edges are dashed. The ordering of subelements is left-to-right. We have
not shown the tag associated with each element since it is straightforward to deduce for this simple
database. (For example, node &3 has the tag Member and not Advisor.) In semantic mode, the database in
Figure 1 does not include the (italicized) IDREF attributes. In literal mode, the (dashed) crosslinks are
not included. Note that there is some structural heterogeneity in the data even though the sample data
was kept purposefully small.

