The amount of information available on-line is proliferating at a tremendous rate. At one extreme,
traditional database systems are managing large amounts of structured, well-understood data that
can be queried via declarative languages such as SQL. At the other extreme, millions of unstructured
Web pages are being collected and indexed by search engines for keyword-based search. Recently,
XML (the eXtensible Markup Language) has emerged as a simple, practical way to model and
exchange semistructured data across the Internet, without the rigid constraints of traditional database
systems. Here we describe work towards unifying and integrating query techniques for traditional
databases, search engines, and XML. First, we describe our contributions to the Lore DBMS for
managing semistructured data, focusing on ways to enhance system usability for effective querying
and searching. Next, we discuss algorithms and indexing techniques that enable effective keyword-based
search over traditional and semistructured databases. We then describe how we have migrated
and enhanced our research on semistructured data to support the subtle but important nuances of
XML. Finally, we describe a new platform that enables efficient combined querying over structured
traditional databases and existing Web search engines.
Keyword Search Over Semistructured and Structured Databases
Keyword-based search is very useful for unstructured documents, and often is the only way to query
such data. Keyword search also can be very useful over more structured data, since it is inherently
simple for users to master and often is sufficient for the task at hand. However, some IR concepts and
algorithms must be reconsidered in a database setting. In particular, proximity search benefits from
a new approach in a database setting. Traditionally, proximity search in IR systems is implemented
using the “near” operator. If we search our document collection for “Harrison Ford” near “Carrie
Fisher”, we are looking for documents where those two names appear “close” to each other, where
closeness is measured by textual proximity. In this sense, proximity search is a relatively simple,
“intra-object” operation: we measure proximity along a single dimension (text) in each document.
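To make the one-dimensional nature of textual proximity concrete, the following is a minimal sketch of a “near” test over a single document. The function names and the `window` parameter are assumptions introduced for illustration; they are not the operator of any particular IR system.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def phrase_positions(tokens, phrase):
    """Return every token offset at which the phrase starts."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def near(text, phrase_a, phrase_b, window=10):
    """True if the two phrases start within `window` tokens of each other.

    Proximity is measured along a single dimension: token offsets in the text.
    """
    tokens = tokenize(text)
    pos_a = phrase_positions(tokens, phrase_a)
    pos_b = phrase_positions(tokens, phrase_b)
    return any(abs(i - j) <= window for i in pos_a for j in pos_b)

# Hypothetical one-sentence document used only for illustration.
doc = "The movie stars Harrison Ford and Carrie Fisher."
print(near(doc, "Harrison Ford", "Carrie Fisher", window=5))
```

Nothing in this test looks beyond the token sequence of one document, which is why it is an “intra-object” operation.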
Now, suppose that we have fully migrated our movie document collection to XML. Each movie
might begin with a MOVIE tag, followed by nested tags for that movie’s actors, producers, etc.
In this setting, we want to account for “structural proximity” in the database, while textual proximity
may not be relevant. For example, if Harrison Ford and Carrie Fisher both star in the same movie,
then they will both be subelements of a specific MOVIE element. In the textual representation,
however, there may be many other actors lexically in between these actors. Similarly, we may find
that the last actor listed for some movie X is textually close to the first actor listed for an adjacent
movie Y, but this doesn’t mean that the two actors are related in any way. Thus, we need to extend
the notion of proximity search to handle the structure inherent in a semistructured database.
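The contrast between textual and structural proximity can be sketched on a small XML document. The sample data, the ACTOR and MOVIES tag names, and the choice of tree-path length as the distance measure are all assumptions made for illustration; “Actor A” and “Actor B” are placeholders, and this is not necessarily the distance function used by our algorithms.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: Carrie Fisher is the last actor of movie X,
# textually adjacent to the first actor of movie Y.
SAMPLE = """
<MOVIES>
  <MOVIE title="X">
    <ACTOR>Harrison Ford</ACTOR>
    <ACTOR>Actor A</ACTOR>
    <ACTOR>Carrie Fisher</ACTOR>
  </MOVIE>
  <MOVIE title="Y">
    <ACTOR>Actor B</ACTOR>
  </MOVIE>
</MOVIES>
"""

def find_by_text(root, text):
    """Return all elements whose own text equals `text`."""
    return [e for e in root.iter() if (e.text or "").strip() == text]

def path_to_root(elem, parent):
    """Return the list [elem, parent(elem), ..., root]."""
    path = [elem]
    while elem in parent:
        elem = parent[elem]
        path.append(elem)
    return path

def structural_distance(root, a, b):
    """Number of parent/child edges on the tree path between a and b."""
    parent = {child: p for p in root.iter() for child in p}
    depth_from_a = {node: d for d, node in enumerate(path_to_root(a, parent))}
    for d_b, node in enumerate(path_to_root(b, parent)):
        if node in depth_from_a:  # lowest common ancestor reached
            return depth_from_a[node] + d_b
    raise ValueError("elements are not in the same tree")

root = ET.fromstring(SAMPLE)
ford = find_by_text(root, "Harrison Ford")[0]
fisher = find_by_text(root, "Carrie Fisher")[0]
actor_b = find_by_text(root, "Actor B")[0]

same_movie = structural_distance(root, ford, fisher)        # ACTOR -> MOVIE -> ACTOR
across_movies = structural_distance(root, fisher, actor_b)  # path goes through MOVIES
print(same_movie, across_movies)
```

Under this measure the two co-stars are closer to each other than the two textually adjacent actors from different movies, even though another actor is listed between the co-stars.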
In fact, algorithms and techniques for performing proximity search over
a graph-structured (semistructured) database are applicable to a traditional relational or object-oriented
database as well. We can (logically) translate a relational database into a graph based
on the schema and on primary/foreign key relationships. We can then use our proximity search techniques to measure the distance between database elements based on the graph representation. Viewing an object-oriented database as a graph is of course even simpler. By combining proximity search with traditional indexing techniques for identifying tables or attribute values that contain given keywords, we can provide keyword-based search (and browsing) for traditional databases.
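The translation described above can be sketched as follows, assuming one graph node per tuple and one undirected edge per foreign-key reference. The schema, the sample rows, the in-memory table layout, and the use of breadth-first search for distance are hypothetical choices made for illustration; foreign keys are assumed to reference single-column primary keys.

```python
from collections import defaultdict, deque

# Hypothetical relational data: pk is a tuple of column names,
# fks maps a column to the (table, column) it references.
TABLES = {
    "movie": {"pk": ("id",), "fks": {}, "rows": [
        {"id": 1, "title": "Movie X"}, {"id": 2, "title": "Movie Y"}]},
    "actor": {"pk": ("id",), "fks": {}, "rows": [
        {"id": 1, "name": "Harrison Ford"}, {"id": 2, "name": "Carrie Fisher"},
        {"id": 3, "name": "Actor B"}]},
    "cast": {"pk": ("movie_id", "actor_id"),
             "fks": {"movie_id": ("movie", "id"), "actor_id": ("actor", "id")},
             "rows": [{"movie_id": 1, "actor_id": 1}, {"movie_id": 1, "actor_id": 2},
                      {"movie_id": 2, "actor_id": 1}, {"movie_id": 2, "actor_id": 3}]},
}

def node_id(table, row, pk):
    """Identify a tuple by its table name and primary-key values."""
    return (table, tuple(row[c] for c in pk))

def build_graph(tables):
    """One node per tuple, one undirected edge per foreign-key reference."""
    graph = defaultdict(set)
    for name, t in tables.items():
        for row in t["rows"]:
            src = node_id(name, row, t["pk"])
            graph[src]  # touching the key registers tuples that have no edges
            for col, (ref_table, _ref_col) in t["fks"].items():
                dst = (ref_table, (row[col],))
                graph[src].add(dst)
                graph[dst].add(src)
    return graph

def build_keyword_index(tables):
    """Map each lowercase word in a string attribute to the tuples containing it."""
    index = defaultdict(set)
    for name, t in tables.items():
        for row in t["rows"]:
            for value in row.values():
                if isinstance(value, str):
                    for word in value.lower().split():
                        index[word].add(node_id(name, row, t["pk"]))
    return index

def distance(graph, start, goal):
    """Shortest path length in edges (breadth-first search), or None."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

graph = build_graph(TABLES)
index = build_keyword_index(TABLES)
(ford,) = index["ford"]
(fisher,) = index["fisher"]
(actor_b,) = index["b"]
print(distance(graph, ford, fisher), distance(graph, fisher, actor_b))
```

The keyword index locates the tuples that match each keyword, and the graph distance then ranks how closely those tuples are related: here the two actors who share a movie are nearer to each other than two actors connected only through a third.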