A fast index for semistructured data

Queries navigate semistructured data via path expressions, and can be accelerated using
an index. The solution encodes paths as strings, and inserts those strings into a
special index that is highly optimized for long and complex keys. It describes the
Index Fabric, an indexing structure that provides the efficiency and flexibility are
needed. We discuss how "raw paths" are used to optimize ad hoc queries over
semistructured data, and how "refined paths" optimize specific access paths. Although
we can use knowledge about the queries and structure of the data to create refined
paths, no such knowledge is needed for raw paths. A performance study shows that our
techniques, when implemented on top of a commercial relational database system,
outperform the more traditional approach of using the commercial system’s indexing
mechanisms to query the XML.

Semistructured data is often represented as a graph, with a set of data elements
connected by labeled relationships, and this self-describing relationship is no need
for a priori knowledge of the schema of the data, since the paths we encode are
extracted from the data itself. Second, our approach has high performance even when the
structure of the data is changing, variable or irregular. Third, the same index can
accelerate queries along many different, complex access paths. This is because our
indexing mechanism scales gracefully with the number of keys inserted, and is not affected
by long or complex keys (representing long or complex paths).

Our indexing mechanism, called the Index Fabric, utilizes the aggressive key
compression inherent in a Patricia trie to index a large number of strings in a
compact and efficient structure. Moreover, the Index Fabric is inherently 
balanced, so that all accesses to the index require the same small number of I/Os.
As a result, we can index a large, complex, irregularly-structured, disk-resident
semistructured data set while providing efficient navigation over paths in the data.

We manage two types of paths for semistructured data. First, we can index paths
that exist in the raw data (called raw paths) to accelerate any ad hoc query. We can
also reorganize portions of the data, to create refined paths, in order to better
optimize particular queries. Both kinds of paths are encoded as strings and inserted
into the Index Fabric. Because the index grows so slowly as we add new keys, we can 
create many refined paths and thus optimize many access patterns, even complex patterns
that traditional techniques cannot easily handle. As a result, we can answer general
queries efficiently using raw paths, even as we further optimize certain queries
using refined paths. Maintaining all of the paths in the same index structure reduces
the resource contention that occurs with multiple indexes, and provides a uniform
mechanism that can be tuned for different needs.

A popular syntax for semistructured data is XML, and in this paper we focus on using the Index Fabric
to index XML-encoded data. XML encodes information as data elements surrounded by tags, and tags can be
nested within other tags. This nesting structure can be viewed as a tree, and raw paths represent root-to-leaf
traversals of this tree. Refined paths represent traversing the tree in some other way (e.g. from sibling to sibling).

The index provides a flexible, uniform and efficient mechanism to access data, while
utilizing a stable storage manager to provide properties such as concurrency, fault tolerance, or security.

The Index Fabric as an index on top of a popular commercial relational DBMS. To
evaluate performance, we indexed an XML data set using both the Index Fabric and the DBMS’s native B-trees.

In the Index Fabric, we have constructed both refined and raw paths, while the relational index utilized an edge

mapping as well as a schema extracted by the STORED system. Both refined and raw paths are significantly
faster than the DBMS’s native indexing mechanism, sometimes by an order of magnitude or more. The
difference is particularly striking for data with irregular structure, or queries that must navigate multiple paths.

The structure of the Index Fabric and how it can be used to optimize searches over
semistructured databases. Specifically, we make the following contributions:

1.  How to utilize the Index Fabric’s support for long and complex keys to index semi structured
data paths encoded as strings.
2·  It examines a simple encoding of the raw paths in a semistructured document, and discuss how to answer
complex path queries over data with irregular structure using raw paths.
3· It represents refined paths, a method for aggressively optimizing frequently occurring and important access
patterns. Refined paths support answering complicated queries using a single index lookup.
4·  Reports the results of a performance study which shows that a semistructured index based on the Index
Fabric can be an order of magnitude faster than traditional indexing schemes.

Advertisements

, , , , , , , ,

  1. Leave a comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: