RDF and ontologies form the remaining piece of the enterprise data integration puzzle.
Many disparate legacy systems may be projected onto a common ontology using different
rules, providing instant content for the semantic web. One example of this is OpenLink’s
ongoing project of mapping popular Web 2.0 applications such as WordPress, MediaWiki,
phpBB and others onto SIOC through Virtuoso’s RDF Views system.
The problem domain is well recognized, with work by D2RQ, SPASQL and DBLP, among others.
Virtuoso differs from these primarily in that it combines the mapping with native
triple storage and may offer better distributed SQL query optimization through its
long history as a SQL federated database.
In Virtuoso, an RDF mapping schema consists of declarations of one or more quad storages.
The default quad storage declares that the system table RDF_QUAD consists of four
columns (G, S, P and O) that contain fields of stored triples, using special formats that are
suitable for arbitrary RDF nodes and literals. The storage can be extended as follows:
An IRI class defines how an SQL value or a tuple of SQL values can be converted into
an IRI, e.g., the IRI of a user account can be built from the user ID, and the
permalink of a blog post consists of host name, user name and post ID. A conversion
of this sort may be declared as a bijection, so that an IRI can be parsed back into the
original SQL values. The compiler knows that a join on two IRIs calculated by the same IRI
class can be replaced with a join on the raw SQL values, which can efficiently use the native
indexes of the relational tables. It is also possible to declare one IRI class A as subClassOf
another class B, so the optimizer may simplify joins between values made by A and B if A is
a bijection.
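The round trip that a bijective IRI class provides can be sketched in a few lines of Python. This is only an illustration, not Virtuoso code: the `IriClass` name and its methods are hypothetical, the format string is an example, and the sketch assumes that `%s` values never contain a slash, which is what makes parsing unambiguous here.

```python
import re

class IriClass:
    """Hypothetical model of a bijective IRI class built from a sprintf-style format."""

    def __init__(self, fmt):
        self.fmt = fmt
        self.converters = []
        pattern = ""
        # Turn the format into a regex: %s -> text without "/", %d -> digits.
        for part in re.split(r"(%s|%d)", fmt):
            if part == "%s":
                pattern += r"([^/]+)"
                self.converters.append(str)
            elif part == "%d":
                pattern += r"(\d+)"
                self.converters.append(int)
            else:
                pattern += re.escape(part)
        self.regex = re.compile(pattern)

    def to_iri(self, *values):
        """Format SQL values into an IRI."""
        return self.fmt % values

    def from_iri(self, iri):
        """Parse an IRI back into SQL values, or return None if it does not match."""
        match = self.regex.fullmatch(iri)
        if match is None:
            return None
        return tuple(conv(text) for conv, text in zip(self.converters, match.groups()))

# Example format (an assumption for this sketch): blog home name plus post ID.
permalink = IriClass("http://myhost/%s/%d")
iri = permalink.to_iri("alice", 42)
values = permalink.from_iri(iri)
```

Because `from_iri(to_iri(v)) == v` for every admissible tuple `v`, two IRIs of this class are equal exactly when their source values are equal, which is the property that lets a join on IRIs be replaced by a join on the raw columns.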
Most IRI classes are defined by format strings similar to those used by the standard
C sprintf function. Complex transformations may be specified by user-defined functions.
In either case the definition may optionally provide a list of sprintf-style formats such
that any IRI made by the IRI class always matches one of these formats. The SPARQL optimizer
pays attention to the formats of created IRIs in order to eliminate joins between IRIs created
by totally disjoint IRI classes. For two given sprintf format strings, the SPARQL optimizer can
find a common subformat of the two or try to prove that no IRI can match both formats.
prefix : <http://www.openlinksw.com/schemas/oplsioc#>
create iri class :user-iri "http://myhost/sys/users/%s" (
in login_name varchar not null ) .
create iri class :blog-home "http://myhost/%s/home" (
in blog_home varchar not null ) .
create iri class :permalink "http://myhost/%s/%d" (
in blog_home varchar not null,
in post_id integer not null ) .
make :user-iri subclass of :grantee-iri .
make :group-iri subclass of :grantee-iri .
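How an optimizer might try to prove that two formats are disjoint can be illustrated with a deliberately simple Python check. This is a sketch under stated assumptions, not the algorithm Virtuoso uses: it only compares the fixed text before the first placeholder and after the last one, it ignores `%%` escapes, and the `sys/groups` format is a hypothetical example that does not appear in the declarations above.

```python
import re

def literal_ends(fmt):
    """Return the fixed text before the first and after the last %s/%d placeholder."""
    parts = re.split(r"%[sd]", fmt)
    return parts[0], parts[-1]

def provably_disjoint(fmt_a, fmt_b):
    """Sound but incomplete test: True only if no string can match both formats."""
    head_a, tail_a = literal_ends(fmt_a)
    head_b, tail_b = literal_ends(fmt_b)
    # A string matching both formats must start with both heads and end with both tails,
    # so one head must be a prefix of the other and one tail a suffix of the other.
    heads_compatible = head_a.startswith(head_b) or head_b.startswith(head_a)
    tails_compatible = tail_a.endswith(tail_b) or tail_b.endswith(tail_a)
    return not (heads_compatible and tails_compatible)

users_vs_groups = provably_disjoint(
    "http://myhost/sys/users/%s",
    "http://myhost/sys/groups/%s",  # hypothetical format for a group IRI class
)
users_vs_blog_home = provably_disjoint(
    "http://myhost/sys/users/%s",
    "http://myhost/%s/home",
)
```

The first pair is proven disjoint because the fixed prefixes diverge, so a join between such IRIs can be dropped as empty. The second pair is not, and rightly so: if `%s` may contain slashes, `http://myhost/sys/users/home` matches both formats.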
IRI classes describe how to format SQL values but do not specify the origin of those values.
This part of the mapping declaration starts with a set of table aliases, somewhat similar to
the FROM and WHERE clauses of an SQL SELECT statement. It lists some relational tables,
assigns distinct aliases to them and provides logical conditions to join tables and to apply
restrictions on table rows. When a SPARQL query selects relational data using some of the
table aliases, the final SQL statement contains the related table names and all conditions
that refer to the used aliases and do not refer to unused ones.
from SYS_USERS as user
from SYS_BLOGS as blog
where (^{blog.}^.OWNER_ID = ^{user.}^.U_ID)
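The rule for which tables and conditions reach the final SQL can be sketched as follows. The Python below is an illustration only: the `ALIASES` and `CONDITIONS` structures and the `build_from_where` helper are hypothetical, and the function merely assembles a string rather than running a query.

```python
# Declared table aliases and conditions; each condition is tagged with the aliases it mentions.
ALIASES = {"user": "SYS_USERS", "blog": "SYS_BLOGS"}
CONDITIONS = [
    ({"blog", "user"}, "blog.OWNER_ID = user.U_ID"),
]

def build_from_where(used_aliases):
    """Keep only the tables in use and the conditions that mention no unused alias."""
    used = set(used_aliases)
    tables = [f"{ALIASES[alias]} AS {alias}" for alias in sorted(used)]
    conditions = [sql for refs, sql in CONDITIONS if refs <= used]
    text = "FROM " + ", ".join(tables)
    if conditions:
        text += " WHERE " + " AND ".join(conditions)
    return text

only_users = build_from_where({"user"})
users_and_blogs = build_from_where({"user", "blog"})
```

A query that touches only the `user` alias yields a statement without SYS_BLOGS and without the join condition, while a query that needs both aliases pulls in both tables and the condition that links them.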
A quad map value describes how to compose one of four fields of an RDF quad. It may
be an RDF literal constant, an IRI constant or an IRI class with a list of columns of table
aliases where SQL values come from. A special case of a value class is the identity class,
which is denoted simply by a table alias and a column name.
Four quad map values (for G, S, P and O) form a quad map pattern that specifies how the
column values of table aliases are combined into an RDF quad. The quad map pattern can
also specify restrictions on column values that can be mapped. E.g., the following pattern
will map a join of SYS_USERS and SYS_BLOGS into quads with :homepage predicate.
graph <http://myhost/users>
subject :user-iri (user.U_ID)
predicate :homepage
object :blog-home (blog.HOMEPAGE)
where (not ^{user.}^.U_ACCOUNT_DISABLED) .
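To make the effect of such a pattern concrete, the following Python sketch materializes the quads it describes from a small in-memory SQLite database. Virtuoso does not work this way (it rewrites SPARQL into SQL instead of generating quads up front); the sample rows, the reduced table layouts and the `homepage_quads` helper are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SYS_USERS (U_ID INTEGER, U_ACCOUNT_DISABLED INTEGER);
    CREATE TABLE SYS_BLOGS (OWNER_ID INTEGER, HOMEPAGE TEXT);
    -- Hypothetical sample data: user 2 is disabled and must not be mapped.
    INSERT INTO SYS_USERS VALUES (1, 0), (2, 1);
    INSERT INTO SYS_BLOGS VALUES (1, 'alice-blog'), (2, 'bob-blog');
""")

# The ":" prefix expands to the oplsioc namespace declared earlier.
HOMEPAGE = "http://www.openlinksw.com/schemas/oplsioc#homepage"

def homepage_quads(connection):
    """Return (G, S, P, O) tuples for every enabled user that owns a blog."""
    rows = connection.execute("""
        SELECT u.U_ID, b.HOMEPAGE
        FROM SYS_USERS AS u JOIN SYS_BLOGS AS b ON b.OWNER_ID = u.U_ID
        WHERE NOT u.U_ACCOUNT_DISABLED
        ORDER BY u.U_ID
    """)
    return [
        (
            "http://myhost/users",                    # constant graph
            "http://myhost/sys/users/%s" % user_id,   # :user-iri format
            HOMEPAGE,                                 # constant predicate
            "http://myhost/%s/home" % homepage,       # :blog-home format
        )
        for user_id, homepage in rows
    ]

quads = homepage_quads(conn)
```

Each row of the filtered join becomes exactly one quad; the row for the disabled account is excluded by the pattern's restriction.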
Quad map patterns may be organized into trees. A quad map pattern may act as the root of a
subtree if it specifies only some of the four quad map values; the other patterns of the
subtree specify the rest. A typical use case is a root pattern that specifies only the graph
value, whereas every subordinate pattern specifies S, P and O and inherits G from the root, as below:
graph <http://myhost/users> option (exclusive) {
:user-iri (user.U_ID)
rdf:type foaf:Person ;
foaf:name user.U_FULL_NAME ;
foaf:mbox user.U_E_MAIL ;
foaf:homepage :blog-home (blog.HOMEPAGE) . }
This grouping is not merely syntactic sugar. In this example, the exclusive option of the root
pattern permits the SPARQL optimizer to assume that the RDF graph contains only triples
mapped by the four subordinates.
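What the optimizer gains from that assumption can be shown with a toy model. The Python below is a simplified sketch, not Virtuoso's planner: `RootPattern`, `sources_for` and the source labels are hypothetical, and the model assumes that any graph without the exclusive option may also hold physically stored quads.

```python
from dataclasses import dataclass, field

@dataclass
class RootPattern:
    graph: str
    exclusive: bool
    predicates: list = field(default_factory=list)  # predicates of the subordinate patterns

def sources_for(root, graph, predicate):
    """List the data sources a triple pattern (graph, ?s, predicate, ?o) must consult."""
    if graph != root.graph:
        return ["physical quads"]
    sources = [f"mapped:{p}" for p in root.predicates if p == predicate]
    if not root.exclusive:
        # Without the option, stored quads in the same graph might also match.
        sources.append("physical quads")
    return sources

subordinates = ["rdf:type", "foaf:name", "foaf:mbox", "foaf:homepage"]
exclusive_root = RootPattern("http://myhost/users", True, subordinates)
shared_root = RootPattern("http://myhost/users", False, subordinates)

mapped_only = sources_for(exclusive_root, "http://myhost/users", "foaf:name")
nothing_to_scan = sources_for(exclusive_root, "http://myhost/users", "foaf:knows")
must_scan_quads = sources_for(shared_root, "http://myhost/users", "foaf:knows")
```

With the exclusive option, a pattern on an unmapped predicate has no source at all and can be answered as empty at compile time; without it, the same pattern still has to consult the stored quads.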
A tree of a quad map pattern and all its subordinates is called an “RDF view” if the “root”
pattern of the tree is not a subordinate of any other quad map pattern. Quad map
patterns can be named; these names are used to alter mapping rules without destroying
and re-creating the whole mapping schema. The top-level items of the data mapping
metadata are quad storages. A quad storage is a named list of RDF views. A SPARQL query
will be executed using only quad patterns of views of the specified quad storage.
Declarations of IRI classes, value classes and quad patterns are shared between all quad
storages of an RDF mapping schema but any quad storage contains only a subset of all
available quad patterns. Two quad storages are always defined: a default one that is used if
no storage is specified in the SPARQL query and a storage that refers to a single table of
physical quads. The RDF mapping schema is stored as triples in a dedicated graph in the
RDF_QUAD table so it can be queried via SPARQL or exported for debug/backup purposes.