Queries navigate semistructured data via path expressions, and can be accelerated using an index. The solution encodes paths as strings, and inserts those strings into a special index that is highly optimized for long and complex keys. It describes the Index Fabric, an indexing structure that provides the efficiency and flexibility are needed. We discuss how "raw paths" are used to optimize ad hoc queries over semistructured data, and how "refined paths" optimize specific access paths. Although we can use knowledge about the queries and structure of the data to create refined paths, no such knowledge is needed for raw paths. A performance study shows that our techniques, when implemented on top of a commercial relational database system, outperform the more traditional approach of using the commercial system’s indexing mechanisms to query the XML. Semistructured data is often represented as a graph, with a set of data elements connected by labeled relationships, and this self-describing relationship is no need for a priori knowledge of the schema of the data, since the paths we encode are extracted from the data itself. Second, our approach has high performance even when the structure of the data is changing, variable or irregular. Third, the same index can accelerate queries along many different, complex access paths. This is because our indexing mechanism scales gracefully with the number of keys inserted, and is not affected by long or complex keys (representing long or complex paths). Our indexing mechanism, called the Index Fabric, utilizes the aggressive key compression inherent in a Patricia trie to index a large number of strings in a compact and efficient structure. Moreover, the Index Fabric is inherently balanced, so that all accesses to the index require the same small number of I/Os. As a result, we can index a large, complex, irregularly-structured, disk-resident semistructured data set while providing efficient navigation over paths in the data. We manage two types of paths for semistructured data. First, we can index paths that exist in the raw data (called raw paths) to accelerate any ad hoc query. We can also reorganize portions of the data, to create refined paths, in order to better optimize particular queries. Both kinds of paths are encoded as strings and inserted into the Index Fabric. Because the index grows so slowly as we add new keys, we can create many refined paths and thus optimize many access patterns, even complex patterns that traditional techniques cannot easily handle. As a result, we can answer general queries efficiently using raw paths, even as we further optimize certain queries using refined paths. Maintaining all of the paths in the same index structure reduces the resource contention that occurs with multiple indexes, and provides a uniform mechanism that can be tuned for different needs.
A popular syntax for semistructured data is XML, and in this paper we focus on using the Index Fabric
to index XML-encoded data. XML encodes information as data elements surrounded by tags, and tags can be
nested within other tags. This nesting structure can be viewed as a tree, and raw paths represent root-to-leaf
traversals of this tree. Refined paths represent traversing the tree in some other way (e.g. from sibling to sibling).
The index provides a flexible, uniform and efficient mechanism to access data, while
utilizing a stable storage manager to provide properties such as concurrency, fault tolerance, or security.
The Index Fabric as an index on top of a popular commercial relational DBMS. To
evaluate performance, we indexed an XML data set using both the Index Fabric and the DBMS’s native B-trees.
In the Index Fabric, we have constructed both refined and raw paths, while the relational index utilized an edge
mapping as well as a schema extracted by the STORED system. Both refined and raw paths are significantly
faster than the DBMS’s native indexing mechanism, sometimes by an order of magnitude or more. The
difference is particularly striking for data with irregular structure, or queries that must navigate multiple paths.
The structure of the Index Fabric and how it can be used to optimize searches over
semistructured databases. Specifically, we make the following contributions:
1. How to utilize the Index Fabric’s support for long and complex keys to index semi structured
data paths encoded as strings.
2· It examines a simple encoding of the raw paths in a semistructured document, and discuss how to answer
complex path queries over data with irregular structure using raw paths.
3· It represents refined paths, a method for aggressively optimizing frequently occurring and important access
patterns. Refined paths support answering complicated queries using a single index lookup.
4· Reports the results of a performance study which shows that a semistructured index based on the Index
Fabric can be an order of magnitude faster than traditional indexing schemes.