Posts Tagged metadata
The Semantic Web – A Web of Metadata
Posted by protogenist in Application Development on September 12, 2012
Here we have introduced and discussed some prototypical scenarios that are on a “wish-list” for
next-generation tourism information systems. The Semantic Web, which is currently being investigated by
different research communities, is a suitable candidate to support the development of such
systems. The Semantic Web is based on machine-readable and machine-processable
metadata.
The Syntax Layer: The interchange of data represented in the Semantic Web must
be facilitated through a concrete serialization syntax. XML is an obvious choice frequently
used by the upper layers. However, it is important to mention that the Semantic
Web is not tied to a particular syntax. Within the syntax layer, Unicode is used,
which provides a unique number for every character, no matter what the platform, what
the program, or what the language is. Besides Unicode, the use of so-called URIs
(Uniform Resource Identifiers) is essential.
The RDF(S) Layer: The Semantic Web concept is to do for data what HTML did for textual
information systems: to provide sufficient flexibility to represent all databases, together
with logic rules that link them and so add great value. The first steps in
this direction were taken by the World Wide Web Consortium (W3C) in defining the Resource
Description Framework (RDF), a simple language for expressing relationships as triples
in which any element of the triple can be a first-class web object. The RDF-Schema Specification,
which became a W3C candidate recommendation in March 2000, is an RDF application that
introduces an object-oriented, extensible type system to RDF. RDF-Schema is a minimalist
model that includes primitives for representing classes, properties, subclasses, subproperties,
domain and range restrictions, and comments and labels.
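As a minimal sketch of the triple model and the RDF-Schema subclass primitive, the following Python fragment stores triples as plain tuples and derives the types of a resource by following subclass links. The tourism vocabulary (ex:Hotel, ex:Accommodation and so on) is a hypothetical example, not taken from any published schema, and the code is only an illustration of the idea, not an RDF(S) reasoner.

```python
# Hypothetical tourism vocabulary; every ex: name is an illustrative assumption.
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:Hotel", SUBCLASS_OF, "ex:Accommodation"),
    ("ex:Accommodation", SUBCLASS_OF, "ex:TourismObject"),
    ("ex:GrandHotel", RDF_TYPE, "ex:Hotel"),
    ("ex:GrandHotel", "ex:locatedIn", "ex:Vienna"),
}


def superclasses(cls, graph):
    """Return every class reachable from cls via rdfs:subClassOf links."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found


def types_of(resource, graph):
    """Return the asserted types of a resource plus those implied by subclassing."""
    direct = {o for s, p, o in graph if s == resource and p == RDF_TYPE}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, graph)
    return direct | inferred


print(sorted(types_of("ex:GrandHotel", triples)))
```

Only the type ex:Hotel is asserted for ex:GrandHotel; the other two types follow from the subclass statements, which is the kind of inference a schema layer adds on top of bare triples.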
Figure 1: Semantic Web Representation Layers
The Ontology Layer: This layer includes more complex representation primitives,
such as transitive properties, cardinalities, etc. We refer the interested reader to the
recent research initiatives OIL and DAML+OIL, which are built on top of RDF(S).
For example, OIL unifies the epistemologically rich modeling primitives of frames with the formal
semantics and efficient reasoning support of description logics, and is mapped to the
standard Web metadata language proposals.
The Logical Layer: The logic layer consists of rules that enable inferences, e.g., to
choose courses of action and answer questions. The proof layer is required to provide
explanations about the answers given by automated agents that consume the provided
information. Naturally, you might want to check the results deduced by your agent;
this requires the translation of its internal reasoning mechanisms into a unifying
proof representation language.
The Proof & Trust Layer: Proof and trust mechanisms are still to be developed; at
this stage in the development of the Semantic Web, the problem has not been tackled.
In most applications, the construction of a proof is done according to some fairly constrained
rules, and all that the other party has to do is validate a general proof.
Logical Layer, metadata, OIL, Ontology Layer, Proof & Trust Layer, RDF, RDF(S) Layer, Semantic Web, Syntax Layer, Web of Metadata
Replica Creation and Replica Selection in Data Grid Service
Posted by protogenist in Application Development on April 25, 2012
Replica selection is interesting because it does not build on top of the core services, but rather
relies on the functions provided by the replica management component described in the preceding
section. Replica selection is the process of choosing a replica that will provide an application with data
access characteristics that optimize a desired performance criterion, such as absolute performance (i.e.,
speed), cost, or security. The selected file instance may be local or accessed remotely. Alternatively,
the selection process may initiate the creation of a new replica whose performance will be superior to
that of the existing ones.
Where replicas are to be selected based on access time, Grid information services can provide
information about network performance, and perhaps the ability to reserve network bandwidth, while
the metadata repository can provide information about the size of the file. Based on this, the selector
can rank all of the existing replicas to determine which one will yield the fastest data access time.
Alternatively, the selector can consult the same information sources to determine whether there is a
storage system that would result in better performance if a replica were created on it.
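The ranking step can be sketched as follows. This is a simplified illustration, assuming that access time is estimated as latency plus file size divided by bandwidth; the Replica fields, the site names and the numbers are hypothetical stand-ins for what a Grid information service and the metadata repository would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    location: str
    bandwidth_mbps: float  # assumed to come from a Grid information service
    latency_s: float       # assumed to come from the same source


def estimated_access_time(replica, file_size_mb):
    """Estimate seconds to fetch the file: latency plus transfer time.

    The file size is in megabytes and the bandwidth in megabits per second,
    hence the factor of 8.
    """
    return replica.latency_s + (file_size_mb * 8.0) / replica.bandwidth_mbps


def rank_replicas(replicas, file_size_mb):
    """Order replicas from fastest to slowest estimated access time."""
    return sorted(replicas, key=lambda r: estimated_access_time(r, file_size_mb))


replicas = [
    Replica("site-a", bandwidth_mbps=100.0, latency_s=0.05),
    Replica("site-b", bandwidth_mbps=1000.0, latency_s=0.20),
    Replica("local", bandwidth_mbps=400.0, latency_s=0.001),
]

# The file size would be read from the metadata repository.
ranked = rank_replicas(replicas, file_size_mb=500.0)
for replica in ranked:
    print(replica.location, round(estimated_access_time(replica, 500.0), 3))
```

The same estimate could be evaluated for a storage system that holds no replica yet, to decide whether creating one there would beat every existing copy.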
A more general selection service may consider access to subsets of a file instance. Scientific
experiments often produce large files containing data for many variables, time steps, or events, and some
application processing may require only a subset of this data. In this case, the selection function may
provide an application with a file instance that contains only the needed subset of the data found
in the original file instance. This can obviously reduce the amount of data that must be accessed or
moved.
This type of replica management has been implemented in other data-management systems. For
example, STACS is often capable of satisfying requests from High Energy Physics applications by
extracting a subset of data from a file instance. It does this using a complex indexing scheme that
represents application metadata for the events contained within the file. Other mechanisms for providing
similar functionality may be built on application metadata obtainable from self-describing file formats
such as NetCDF or HDF.
Providing this capability requires the ability to invoke filtering or extraction programs that
understand the structure of the file and produce the required subset of data. This subset becomes a
file instance with its own metadata and physical characteristics, which are provided to the replica
manager. Replication policies determine whether this subset is recognized as a new logical file (with
an entry in the metadata repository and a file instance recorded in the replica catalog), or whether
the file should be known only locally, to the selection manager.
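A toy version of this flow is sketched below. The file instance is modeled as a Python dictionary of named variables, which is only a stand-in for a self-describing format; the function names, the naming scheme for the subset and the boolean policy flag are all assumptions made for illustration.

```python
def extract_subset(file_instance, variables):
    """Build a new file instance holding only the requested variables."""
    missing = [name for name in variables if name not in file_instance["data"]]
    if missing:
        raise KeyError(f"variables not present in file instance: {missing}")
    data = {name: file_instance["data"][name] for name in variables}
    subset_name = file_instance["name"] + "[" + ",".join(variables) + "]"
    # The subset carries its own metadata (here just a name and a size).
    return {
        "name": subset_name,
        "data": data,
        "size": sum(len(values) for values in data.values()),
    }


def record_subset(subset, replica_catalog, local_files, register_as_logical_file):
    """Apply a (hypothetical) replication policy to the extracted subset.

    Either the subset becomes a new logical file in the replica catalog,
    or it is remembered only locally by the selection manager.
    """
    target = replica_catalog if register_as_logical_file else local_files
    target[subset["name"]] = subset


original = {
    "name": "run42.dat",
    "data": {
        "temperature": [1.0, 2.0, 3.0],
        "pressure": [9.0, 8.0, 7.0],
        "humidity": [0.1, 0.2, 0.3],
    },
}

replica_catalog, local_files = {}, {}
subset = extract_subset(original, ["temperature"])
record_subset(subset, replica_catalog, local_files, register_as_logical_file=False)
print(subset["name"], subset["size"])
```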
Data selection with subsetting may exploit Grid-enabled servers, whose capabilities involve common
operations such as reformatting data, extracting a subset, converting data for storage in a different
type of system, or transferring data directly to another storage system in the Grid. The utility of this
approach has been demonstrated as part of the Active Data Repository. The subsetting function
could also exploit the more general capabilities of a computational Grid such as that provided by
Globus. This offers the ability to support arbitrary extraction and processing operations on files as
part of a data management activity.
Active Data Repository, cost, Data Grid, events, file instance, HDF, metadata, NetCDF, network bandwidth, Replica Creation, Replica Selection, Security, speed, STACS, storage system, time steps, variables
Mapping Legacy Relational Data into RDF for SPARQL Access
Posted by protogenist in Technology Research on April 2, 2012
RDF and ontologies form the remaining piece of the enterprise data integration puzzle.
Many disparate legacy systems may be projected onto a common ontology using different
rules, providing instant content for the semantic web. One example of this is OpenLink’s
ongoing project of mapping popular Web 2.0 applications such as WordPress, MediaWiki,
phpBB and others onto SIOC through Virtuoso’s RDF Views system.
The problem domain is well recognized, with work by D2RQ, SPASQL, DBLP among others.
Virtuoso differs from these primarily in that it combines the mapping with native
triple storage and may offer better distributed SQL query optimization through its
long history as a SQL federated database.
In Virtuoso, an RDF mapping schema consists of declarations of one or more quad storages.
The default quad storage declares that the system table RDF_QUAD consists of four
columns (G, S, P and O) that contain fields of stored triples, using special formats that are
suitable for arbitrary RDF nodes and literals. The storage can be extended as follows:
An IRI class defines how an SQL value or a tuple of SQL values can be converted into
an IRI, e.g., an IRI of a user account can be built from the user ID, and a
permalink of a blog post consists of the host name, user name and post ID. A conversion
of this sort may be declared as a bijection, so that an IRI can be parsed back into the original SQL values.
The compiler knows that a join on two IRIs calculated by the same IRI class can be replaced
with a join on raw SQL values, which can efficiently use the native indexes of the relational tables. It
is also possible to declare one IRI class A as a subclass of another class B, so the optimizer
may simplify joins between values made by A and B if A is a bijection.
Most IRI classes are defined by format strings similar to those used by the standard
C sprintf function. Complex transformations may be specified by user-defined functions.
In either case the definition may optionally provide a list of sprintf-style formats such
that any IRI made by the IRI class always matches one of these formats. The SPARQL optimizer
pays attention to the formats of created IRIs in order to eliminate joins between IRIs created by totally
disjoint IRI classes. For two given sprintf format strings, the SPARQL optimizer can find a
common subformat of the two or try to prove that no IRI can match both formats.
prefix : <http://www.openlinksw.com/schemas/oplsioc#>
create iri class :user-iri "http://myhost/sys/users/%s" (
in login_name varchar not null ) .
create iri class :blog-home "http://myhost/%s/home" (
in blog_home varchar not null ) .
create iri class :permalink "http://myhost/%s/%d" (
in blog_home varchar not null,
in post_id integer not null ) .
# :group_iri and :grantee_iri are not declared in this listing and are assumed to exist.
make :user-iri subclass of :grantee_iri .
make :group_iri subclass of :grantee_iri .
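To make the bijection and the format-disjointness ideas concrete, here is a small Python sketch. It is not Virtuoso's implementation: the IriClass helper is hypothetical, only the %s and %d placeholders are handled, %s is assumed never to contain a slash, and the disjointness test is a deliberately conservative one that compares only the literal text before the first placeholder and after the last one. The group format string is also an assumption.

```python
import re

PLACEHOLDER = re.compile(r"%[sd]")
# Assumption: a %s value never contains "/", so parsing is unambiguous.
GROUPS = {"%s": r"([^/]+)", "%d": r"(-?\d+)"}


class IriClass:
    """Hypothetical helper: a sprintf-style IRI format that can be inverted."""

    def __init__(self, fmt):
        self.fmt = fmt
        self.kinds = PLACEHOLDER.findall(fmt)   # e.g. ["%s", "%d"]
        self.pieces = PLACEHOLDER.split(fmt)    # literal text between placeholders
        pattern = re.escape(self.pieces[0])
        for kind, literal in zip(self.kinds, self.pieces[1:]):
            pattern += GROUPS[kind] + re.escape(literal)
        self.regex = re.compile(pattern)

    def to_iri(self, *values):
        """SQL values -> IRI."""
        return self.fmt % values

    def from_iri(self, iri):
        """IRI -> SQL values, or None if the IRI was not made by this class."""
        match = self.regex.fullmatch(iri)
        if match is None:
            return None
        return tuple(int(text) if kind == "%d" else text
                     for kind, text in zip(self.kinds, match.groups()))


def provably_disjoint(a, b):
    """True only if no string can match both formats (conservative check).

    Any IRI made by a format starts with its literal prefix and ends with its
    literal suffix, so incompatible prefixes or suffixes prove disjointness.
    """
    pa, pb = a.pieces[0], b.pieces[0]
    sa, sb = a.pieces[-1], b.pieces[-1]
    prefixes_clash = not (pa.startswith(pb) or pb.startswith(pa))
    suffixes_clash = not (sa.endswith(sb) or sb.endswith(sa))
    return prefixes_clash or suffixes_clash


user_iri = IriClass("http://myhost/sys/users/%s")
blog_home = IriClass("http://myhost/%s/home")
permalink = IriClass("http://myhost/%s/%d")
group_iri = IriClass("http://myhost/sys/groups/%s")  # hypothetical format

iri = permalink.to_iri("alice", 42)
print(iri, permalink.from_iri(iri))
print(provably_disjoint(user_iri, group_iri))
print(provably_disjoint(user_iri, blog_home))
```

Because from_iri recovers the original values, an equality between two IRIs of the same class can be rewritten as an equality between the underlying columns. When two formats are provably disjoint, a join between their IRIs can be dropped as empty; when the check returns False, nothing has been proved and the join must be kept.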
IRI classes describe how to format SQL values but do not specify the origin of those values.
That part of a mapping declaration starts from a set of table aliases, somewhat similar to the
FROM and WHERE clauses of an SQL SELECT statement. It lists some relational tables,
assigns distinct aliases to them and provides logical conditions to join tables and to apply
restrictions on table rows. When a SPARQL query selects relational data through some of the
table aliases, the final SQL statement contains the related table names and all conditions that
refer to the used aliases, and none that refer to unused ones.
from SYS_USERS as user
from SYS_BLOGS as blog
where (^{blog.}^.OWNER_ID = ^{user.}^.U_ID)
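The pruning of unused aliases can be illustrated with a short Python sketch that only assembles an SQL string. The post alias, the SYS_POSTS table and its join condition are hypothetical additions made so that there is something to prune; the rule that a condition is kept only when all of its aliases are in use is one reading of the behaviour described above, not a statement about Virtuoso's compiler.

```python
# Alias -> table. "user" and "blog" follow the declaration above;
# "post" and SYS_POSTS are hypothetical.
ALIASES = {"user": "SYS_USERS", "blog": "SYS_BLOGS", "post": "SYS_POSTS"}

# Each condition is stored with the set of aliases it refers to.
CONDITIONS = [
    ({"blog", "user"}, "blog.OWNER_ID = user.U_ID"),
    ({"post", "blog"}, "post.BLOG_ID = blog.B_ID"),  # hypothetical condition
]


def build_sql(columns, used_aliases):
    """Assemble a SELECT that mentions only the aliases a query actually uses."""
    tables = [f"{ALIASES[alias]} AS {alias}" for alias in sorted(used_aliases)]
    # Keep a condition only if every alias it refers to is in use.
    where = [text for aliases, text in CONDITIONS if aliases <= used_aliases]
    sql = "SELECT " + ", ".join(columns) + " FROM " + ", ".join(tables)
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


print(build_sql(["user.U_ID"], {"user"}))
print(build_sql(["user.U_ID", "blog.HOMEPAGE"], {"user", "blog"}))
```

A query that touches only user columns never joins SYS_BLOGS, while a query that needs both aliases picks up the join condition between them.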
A quad map value describes how to compose one of the four fields of an RDF quad. It may
be an RDF literal constant, an IRI constant or an IRI class with a list of columns of table
aliases where the SQL values come from. A special case of a value class is the identity class,
which is simply marked by a table alias and a column name.
Four quad map values (for G, S, P and O) form a quad map pattern that specifies how the
column values of table aliases are combined into an RDF quad. The quad map pattern can
also specify restrictions on the column values that can be mapped. For example, the following pattern
maps a join of SYS_USERS and SYS_BLOGS into quads with the :homepage predicate.
graph <http://myhost/users>
subject :user-iri (user.U_ID)
predicate :homepage
object :blog-home (blog.HOMEPAGE)
where (not ^{user.}^.U_ACCOUNT_DISABLED) .
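What this pattern produces can be imitated in a few lines of Python. The sketch evaluates the join, the restriction on disabled accounts and the two IRI formats over in-memory rows; the sample rows are invented, and a real RDF view would of course be translated to SQL rather than looped over.

```python
GRAPH = "http://myhost/users"
HOMEPAGE = "http://www.openlinksw.com/schemas/oplsioc#homepage"

# Invented sample rows standing in for SYS_USERS and SYS_BLOGS.
sys_users = [
    {"U_ID": 1, "U_ACCOUNT_DISABLED": False},
    {"U_ID": 2, "U_ACCOUNT_DISABLED": True},
]
sys_blogs = [
    {"OWNER_ID": 1, "HOMEPAGE": "alice"},
    {"OWNER_ID": 2, "HOMEPAGE": "bob"},
]


def homepage_quads(users, blogs):
    """Yield (G, S, P, O) quads for enabled users joined with their blogs."""
    for user in users:
        if user["U_ACCOUNT_DISABLED"]:
            continue  # the pattern's own restriction on column values
        for blog in blogs:
            if blog["OWNER_ID"] == user["U_ID"]:  # join condition of the aliases
                subject = "http://myhost/sys/users/%s" % user["U_ID"]
                obj = "http://myhost/%s/home" % blog["HOMEPAGE"]
                yield (GRAPH, subject, HOMEPAGE, obj)


quads = list(homepage_quads(sys_users, sys_blogs))
for quad in quads:
    print(quad)
```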
Quad map patterns may be organized into trees. A quad map pattern may act as the root of a
subtree if it specifies only some of the four quad map values; the other patterns of the subtree
specify the rest. A typical use case is a root pattern that specifies only the graph value,
whereas every subordinate pattern specifies S, P and O and inherits G from the root, as below:
graph <http://myhost/users> option (exclusive) {
:user-iri (user.U_ID)
rdf:type foaf:Person ;
foaf:name user.U_FULL_NAME ;
foaf:mbox user.U_E_MAIL ;
foaf:homepage :blog-home (blog.HOMEPAGE) . }
This grouping is not merely syntactic sugar. In this example, the exclusive option of the root
pattern permits the SPARQL optimizer to assume that the RDF graph contains only triples
mapped by the four subordinates.
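One consequence of that assumption can be sketched as follows: a triple pattern whose predicate is not among those mapped in an exclusive graph cannot have solutions, so no SQL needs to be generated for it. The function and the table of mapped predicates are hypothetical and only illustrate the reasoning, not how the Virtuoso optimizer is built.

```python
# Predicates produced by the four subordinate patterns of the exclusive root.
EXCLUSIVE_GRAPHS = {
    "http://myhost/users": {"rdf:type", "foaf:name", "foaf:mbox", "foaf:homepage"},
}


def needs_sql(graph, predicate):
    """Return False when a triple pattern provably has no solutions.

    predicate=None stands for an unbound predicate variable.
    """
    mapped = EXCLUSIVE_GRAPHS.get(graph)
    if mapped is None:
        # Graph is not declared exclusive: other triples may exist, so evaluate.
        return True
    return predicate is None or predicate in mapped


print(needs_sql("http://myhost/users", "foaf:name"))
print(needs_sql("http://myhost/users", "foaf:knows"))
```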
A tree of a quad map pattern and all its subordinates is called an “RDF view” if the root
pattern of the tree is not a subordinate of any other quad map pattern. Quad map
patterns can be named; these names are used to alter mapping rules without destroying
and re-creating the whole mapping schema. The top-level items of the data mapping
metadata are quad storages. A quad storage is a named list of RDF views. A SPARQL query
will be executed using only quad patterns of views of the specified quad storage.
Declarations of IRI classes, value classes and quad patterns are shared between all quad
storages of an RDF mapping schema but any quad storage contains only a subset of all
available quad patterns. Two quad storages are always defined: a default one that is used if
no storage is specified in the SPARQL query, and a storage that refers to a single table of
physical quads. The RDF mapping schema is stored as triples in a dedicated graph in the
RDF_QUAD table so it can be queried via SPARQL or exported for debug/backup purposes.
D2RQ, database, DBLP, IRI, Mapping, metadata, pattern, quad storages, query optimization, RDF, Relational Data, SIOC, SPASQL, sprintf function, SQL, triple storage, user-defined function, Virtuoso
