Archive for June, 2012

Non-relational Parallel Database Machines

While open research issues remain in the area of parallel database machines for relational
database systems, building a highly parallel database machine for an object-oriented database
system presents a number of new challenges. One of the first issues to resolve is how declustering
should be handled. For example, should one decluster all sets (such as set-valued attributes of a
complex object) or just top-level sets? Another question is how inter-object references should be
handled. In a relational database machine, such references are handled by doing a join between the
two relations of interest, but in an object-oriented DBMS references are generally handled via
pointers. In particular, a tension exists between declustering a set in order to parallelize scan
operations on that set and clustering an object and the objects it references in order to reduce the
number of disk accesses necessary to access the components of a complex object. Since clustering
in a standard object-oriented database system remains an open research issue, mixing in
declustering makes the problem even more challenging.

Another open area is parallel query processing in an OODBMS. Most OODBMSs provide a
relational-like query language based on an extension to relational algebra. While it is possible to
parallelize these operators, how should class-specific methods be handled? If the method operates
on a single object, it is certainly not worthwhile parallelizing it. However, if the method operates on
a set of values or objects that are declustered, then it almost certainly must be parallelized if one is to
avoid moving all the referenced data to a single processor for execution. Since it is, at this point in
time, impossible to parallelize arbitrary method code, one possible solution might be to insist that
if a method is to be parallelized that it be constructed using the primitives from the underlying
algebra, perhaps embedded in a normal programming language.
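
To make the last point concrete, here is a minimal Python sketch (not from the original text) of a set-valued method written purely in terms of scan/filter/aggregate-style primitives, so each declustered fragment can evaluate it independently and the partial results can be merged at the destination. The partition layout and the total_value method are invented for illustration.

# Sketch: a set-valued method built only from algebra-like primitives
# (scan, filter, aggregate), so each declustered fragment runs it independently.
from multiprocessing import Pool

# Each element of `partitions` stands for the fragment of a declustered set
# stored on one processor/disk (hypothetical data).
partitions = [
    [{"id": 1, "value": 10}, {"id": 2, "value": 7}],
    [{"id": 3, "value": 4}, {"id": 4, "value": 25}],
]

def total_value(fragment, threshold=5):
    """Method body expressed as scan + filter + aggregate over one fragment."""
    return sum(obj["value"] for obj in fragment if obj["value"] > threshold)

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        partial_results = pool.map(total_value, partitions)  # one worker per fragment
    print(sum(partial_results))  # merge step at the destination: 42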


Responsibility, Resolution, and Residue of Computer Ethics

There are many levels of relativity in value judgments. Some of our values
are relative to our being human. If we were angels or creatures from another
dimension, our core values might be different. And then, of course, different
cultures articulate the core human values differently. And different individuals
within a culture may differ in their assessments of values. Indeed, some
values of one individual may change over time. I have been arguing that
such relativity is compatible with rational discussion of ethical issues and
resolution of at least some ethical disputes. We are, after all, human beings,
not angels or creatures from another dimension. We share core values. This
provides us with a set of standards with which to assess policies even in
situations in which no previous policies exist and with which to assess other
value frameworks when disagreements occur.

Ethical responsibility begins by taking the ethical point of view. We must
respect others and their core values. If we can avoid policies that result in
significant harm to others, that would be a good beginning toward responsible
ethical conduct. Some policies are so obviously harmful that they are
readily rejected by our core-value standards. Selling computer software
which is known to malfunction in a way which is likely to result in death is
an obvious example. Other policies easily meet our standards. Building computer
interfaces which facilitate use by the disabled is a clear example. And
of course, some policies for managing computer technology will be disputed.
However, as I have been emphasizing, some of the ethical policies under
dispute may be subject to further rational discussion and resolution. The
major resolution technique, which I have been emphasizing, is the empirical
investigation of the actual consequences of proposed policies. For instance,
some people might propose a limitation on free speech on the Internet on
the grounds that such freedom would lead to an unstable society or to severe
psychological damage of some citizens. Advocates of free speech might
appeal to its usefulness in transmitting knowledge and its effectiveness
in calling attention to the flaws of government. To some extent these are
empirical claims that can be confirmed or disconfirmed, which in turn
may suggest compromises and modifications of policies.

Another resolution technique is to assume an impartial position when
evaluating policies. Imagine yourself as an outsider not being benefited or
harmed by a policy. Is it a fair policy? Is it a policy which you would advocate
if you were suddenly placed in a position in which you were affected by the
policy? It may be tempting to be the seller of defective software, but nobody
wants to be a buyer of defective software. And finally, analogies are sometimes
useful in resolving disagreements. If a computing professional would
not approve of her stockbroker’s withholding information from her about
the volatility of stock she is considering buying, it would seem by analogy
she should share information with a client about the instability of a computer
program which the client is considering purchasing.

All of these techniques for resolution can help form a consensus about
acceptable policies. But when the resolution techniques have gone as far as
they can, some residue of disagreement may remain. Even in these situations
alternative policies may be available which all parties can accept. But,
a residue of ethical difference is not to be feared. Disputes occur in every
human endeavor and yet progress is made. Computer ethics is no different
in this regard. The chief threat to computer ethics is not the possibility that
a residue of disagreements about which policies are best will remain after
debates on the issues are completed, but a failure to debate the ethical issues
of computing technology at all. If we naively regard the issues of computer
ethics as routine or, even worse, as unsolvable, then we are in the greatest
danger of being harmed by computer technology. Responsibility requires
us to adopt the ethical point of view and to engage in ongoing conceptual
analysis and policy formulation and justification with regard to this ever
evolving technology. Because the computer revolution now engulfs the
entire world, it is crucial that the issues of computer ethics be addressed on
a global level. The global village needs to conduct a global conversation
about the social and ethical impact of computing and what should be
done about it. Fortunately, computing may help us to conduct exactly that
conversation.


C14N Denial of Service

Attack surface: Canonicalization

Attack impact: Denial of service

Description: C14N can be an expensive operation, requiring complex processing (Boyer ’01), including entity expansion and normalization of whitespace, namespace declarations, and coalescing of adjacent text and CDATA nodes. This requires building a DOM and performing memory- and processor-intensive operations.

Exploit scenario: Attacker replaces the SignedInfo or XML content identified by a Reference with a very large set of XML data containing many namespace declarations, redundant adjacent text nodes, etc., leading to a denial of service condition.

Mitigation: Limit the total size of XML submitted for canonicalization.

Applies to XML Encryption? No
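
As a rough illustration of the mitigation above, the following Python sketch caps the size of XML accepted for canonicalization; lxml and the 1 MB limit are my own illustrative choices, not part of the original note.

# Sketch of the mitigation: cap the size of XML accepted for canonicalization
# and parse it defensively. The limit value should be tuned to the application.
from lxml import etree

MAX_C14N_INPUT_BYTES = 1_000_000  # illustrative limit

def canonicalize(xml_bytes: bytes) -> bytes:
    if len(xml_bytes) > MAX_C14N_INPUT_BYTES:
        raise ValueError("XML too large for canonicalization")
    # Refuse entity expansion so the size check cannot be bypassed via entities.
    parser = etree.XMLParser(resolve_entities=False, huge_tree=False)
    root = etree.fromstring(xml_bytes, parser=parser)
    return etree.tostring(root, method="c14n")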


Encryption and Digital Signature

Digital signature

The process of digitally signing starts by taking a mathematical summary (called
a hash code) of the message. This hash code is a uniquely identifying digital
fingerprint of the message. If even a single bit of the message changes, the hash
code will dramatically change. The next step in creating a digital signature is to
sign the hash code with your private key. This signed hash code is then
appended to the message.

How is this a signature? Well, the recipient of your message can verify the hash
code sent by you, using your public key. At the same time, a new hash code can
be created from the received message and compared with the original signed
hash code. If the hash codes match, then the recipient has verified that the
message has not been altered. The recipient also knows that only you could
have sent the message because only you have the private key that signed the original
hash code.
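
The sign-then-verify flow described above can be sketched with the Python cryptography package; RSA with PSS padding and SHA-256 are illustrative choices rather than anything mandated by the text.

# Sketch of sign-then-verify using the "cryptography" package.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"Pay 100 to account 42"

# Sender: hash the message and sign the hash with the private key.
signature = private_key.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Recipient: recompute the hash and check it against the signed hash using the
# sender's public key; this raises InvalidSignature if anything was altered.
public_key.verify(
    signature,
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)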

Confidentiality and encryption

Once the electronic message is digitally signed, it can be encrypted using a high-speed
mathematical transformation with a key that will be used later to decrypt
the document. This is often referred to as a symmetric key system because the
same key is used at both ends of the process. As the message is sent over the
network, it is unreadable without the key. The next challenge is to securely
deliver the symmetric key to the bank.

Public-key cryptography for delivering symmetric keys

Public-key encryption is used to solve the problem of delivering the symmetric
encryption key to the bank in a secure manner. To do so, you would encrypt
the symmetric key using the receiver’s (here, the bank’s) public key. Since only the
receiver (the bank) has the corresponding private key, only the receiver will be able
to recover the symmetric key and decrypt the message.

Why use this combination of public-key and symmetric cryptography?

The reason is simple. Public-key cryptography is relatively slow and is only
suitable for encrypting small amounts of information – such as symmetric keys.
Symmetric cryptography is much faster and is suitable for encrypting large
amounts of information.
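
A minimal sketch of this hybrid approach, assuming AES-GCM for the symmetric cipher and RSA-OAEP for wrapping the key (both illustrative choices, not requirements from the text), might look like this in Python:

# Sketch of the hybrid scheme: encrypt the message with a fast symmetric cipher,
# then encrypt the symmetric key with the bank's public key.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

bank_private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bank_public_key = bank_private_key.public_key()

message = b"signed payment instruction"

# Symmetric encryption of the (signed) message.
sym_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(sym_key).encrypt(nonce, message, None)

# Public-key encryption of the symmetric key for delivery to the bank.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = bank_public_key.encrypt(sym_key, oaep)

# Bank side: unwrap the symmetric key, then decrypt the message.
recovered_key = bank_private_key.decrypt(wrapped_key, oaep)
assert AESGCM(recovered_key).decrypt(nonce, ciphertext, None) == message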


The Super Database Computer

The Super Database Computer (SDC) project at the University of Tokyo presents an
interesting contrast to other database system projects. This system
takes a combined hardware and software approach to the performance problem. The basic unit,
called a processing module (PM), consists of one or more processors with a shared memory.
These processors are augmented by a special purpose sorting engine that sorts at high speed
(3MB/s at present), and by a disk subsystem. Clusters of processing modules are connected via an
omega network which not only provides a non-blocking NxN interconnect, but also
provides some dynamic routing to support data distribution during hash joins. The
SDC is designed to scale to thousands of PMs, and so considerable attention is paid to the problem
of data skew. Data is declustered among the PMs by hashing.
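
As a rough illustration of hash-based data distribution during a join (not the SDC's actual code), the following Python sketch routes the tuples of both relations to processing modules by hashing the join key, so that matching tuples always meet on the same PM; the relations and PM count are invented.

# Sketch: hash-partition both relations on the join key so each PM can join
# its own buckets independently (and, in a real system, in parallel).
NUM_PMS = 4

def route(tuples, key_index):
    """Group tuples by destination PM = hash(join key) mod NUM_PMS."""
    buckets = [[] for _ in range(NUM_PMS)]
    for t in tuples:
        buckets[hash(t[key_index]) % NUM_PMS].append(t)
    return buckets

orders = [(1, "ord-a"), (2, "ord-b"), (3, "ord-c")]   # (customer_id, order)
customers = [(1, "Ann"), (2, "Bo"), (3, "Cy")]        # (customer_id, name)

order_buckets = route(orders, 0)
customer_buckets = route(customers, 0)

result = [(c[1], o[1])
          for pm in range(NUM_PMS)
          for c in customer_buckets[pm]
          for o in order_buckets[pm]
          if c[0] == o[0]]
print(result)  # pairs each customer with its matching orders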

The SDC software includes a unique operating system and a relational database query
executor. Most published work so far has been on query execution and on efficient algorithms to
execute the relational operations, rather than on query planning or data declustering. The SDC is a
shared-nothing design with a software dataflow architecture. This is consistent with our assertion
that current parallel database systems use conventional hardware. But the special-purpose
design of the omega network and of the hardware sorter clearly contradicts the thesis that
special-purpose hardware is not a good investment of development resources. Time will tell
whether these special-purpose components offer better price/performance or peak performance than
shared-nothing designs built of conventional hardware.


Declustering

Declustering a relation involves distributing the tuples of a relation among two or more disk
drives according to some distribution criteria such as applying a hash function to the key attribute
of each tuple. Declustering has its origins in the concept of horizontal partitioning initially
developed as a distribution mechanism for distributed DBMS [RIES78] (see Figure 1). One of the
key reasons for using declustering in a parallel database system is to enable the DBMS software to
exploit the I/O bandwidth reading and writing multiple disks in parallel. By declustering the tuples
of a relation the task of parallelizing a scan operator becomes trivial. All that is required is to start a
copy of the operator on each processor or disk containing relevant tuples, and to merge their
outputs at the destination. For operations requiring a sequential scan of an entire relation, this
approach can provide the same I/O bandwidth as a RAID-style system [SALE84], [PATT88]
without needing any specialized hardware.

While tuples can simply be declustered in a round-robin fashion, more interesting
alternatives exist. One is to apply a hashing function to the key attribute of each tuple to distribute
the tuples among the disks. This distribution mechanism allows exact match selection operations
on the partitioning attribute to be directed to a single disk, avoiding the overhead of starting such
queries on multiple disks. On the other hand, range queries on the partitioning attribute must be
sent to all disks over which a relation has been declustered. A hash declustering mechanism is
provided by Arbre, Bubba, Gamma, and Teradata.

Figure 1: The three basic declustering schemes: range declustering maps contiguous fragments of a
table to various disks. Round-Robin declustering maps the i’th record to disk i mod n. Hashed
declustering maps each record to a disk location based on some hash function. Each of these schemes
spreads data among a collection of disks, allowing parallel disk access and parallel processing.
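
The three schemes in Figure 1 can be sketched as simple placement functions in Python; the boundaries and hash function below are illustrative and are not taken from any of the systems discussed.

# Sketch of the three declustering schemes as tuple-to-disk placement functions.
import bisect

def round_robin(i, n_disks):
    """Round-robin: place the i'th tuple on disk i mod n."""
    return i % n_disks

def hashed(key, n_disks):
    """Hashed: place a tuple on the disk chosen by hashing its partitioning attribute."""
    return hash(key) % n_disks

def range_partition(key, boundaries):
    """Range: place a tuple in the range its key falls into.
    boundaries are the inclusive upper bounds of the first n-1 ranges, sorted."""
    return bisect.bisect_left(boundaries, key)

# Example with 4 disks and ranges (-inf, 25], (25, 50], (50, 75], (75, +inf):
print(round_robin(10, 4))                  # 2
print(hashed("smith", 4))                  # some disk in 0..3 (string hashing varies per process)
print(range_partition(60, [25, 50, 75]))   # 2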

An alternative declustering strategy is to associate a distinct range of partitioning attribute
values with each disk by, for example, dividing the range of possible values into N units, one for
each of the N processors in the system (range partitioning). The advantage of range declustering is
that it can isolate the execution of both range and exact-match selection operations to the minimal
number of processors. Another advantage of range partitioning is that it can be used to deal with
non-uniformly distributed partitioning attribute values. A range-partitioning mechanism is provided
by Arbre, Bubba, Gamma, and Tandem.

While declustering is a simple concept that is easy to implement, it raises a number of new
physical database design issues. In addition to selecting a declustering strategy for each relation,
the number of disks over which a relation should be declustered must also be decided. While
Gamma declusters all relations across all disk drives (primarily to simplify the implementation), the
Bubba, Tandem, and Teradata systems allow a subset of the disks to be used. In general,
increasing the degree of declustering reduces the response time for an individual query and
(generally) increases the overall throughput of the system. For sequential scan queries, the
response time decreases because more processors and disks are used to execute the query. For
indexed selections on the partitioning attribute, the response time improves because fewer tuples
are stored at each node and hence the size of the index that must be searched decreases. However,
there is a point beyond which further declustering actually increases the response time of a query.
This point occurs when the cost of starting a query on a node becomes a significant fraction of the
actual execution time [COPE88, DEWI88, GHAN90a]. In general, full declustering is not always
a good idea, especially in very large configurations. Bubba [COPE88] refines the concept of
range-partitioning by considering the heat of its tuples when declustering a relation; the goal
being to balance the frequency with which each disk is accessed rather than the actual number of
tuples on each disk. In [COPE88] the effect of the degree of the declustering on the multiuser
throughput of Bubba is studied. In [GHAN90a] the impact of the alternative partitioning strategies
on the multiuser throughput of Gamma is evaluated.
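
A rough, invented approximation of the heat-balancing idea (a sketch only, not Bubba's actual algorithm) is to walk the key space in order and close a range whenever its accumulated heat reaches an equal share:

# Sketch: greedy heat-balanced range boundaries.
def heat_balanced_ranges(key_heat_pairs, n_disks):
    """key_heat_pairs: (key, access frequency) pairs sorted by key.
    Returns inclusive upper bounds for the first n_disks - 1 ranges."""
    total_heat = sum(h for _, h in key_heat_pairs)
    target = total_heat / n_disks
    boundaries, acc = [], 0.0
    for key, heat in key_heat_pairs:
        acc += heat
        if acc >= target and len(boundaries) < n_disks - 1:
            boundaries.append(key)   # close the current range at this key
            acc = 0.0
    return boundaries  # usable with a range_partition function like the one sketched above

# The hot key 30 closes a range quickly, so each disk sees similar traffic.
print(heat_balanced_ranges([(10, 3), (20, 2), (30, 6), (40, 1), (50, 1)], 3))  # [20, 30]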


Two-Party Query Computation Model

The basic two-party query computation model describes the types of privacy preserving protocols.
Figure 1 illustrates the basic two-party query computation model, comprising four different
entities: the randomizer, the computing engine, the query front end engine, and the individual
databases. The two primary query computation entities in the system are the randomizer and the
computing engine. The query front end engine, which receives queries from different users,
forwards each query to the randomizer and an encoded version of the query (which contains the
type of the query) to the computing engine, which in turn coordinates with the individual
databases to compute the query result. Our model assumes that all entities in the system
require strong privacy guarantees but act in an honest but curious manner. In other words, every
participating entity acts in an “honest” fashion and follows the protocol specification, but is
“curious” to infer the entries of other participating databases.

Figure 1: The system model

Given this model, the basic steps in our query computation process are illustrated in Figure 1.
The randomizer, upon receiving a query, forwards the query to each individual database along with
a set of randomization parameters. The randomizer also provides the corresponding set of
derandomization parameters to the query front end (which the querier may use to encode the query
in case the selection predicate is based on a certain computation over the distributed data and
should be protected). Every database independently computes the local query response for the
query and then obfuscates the query response using the randomization parameters. According to the
query type, the computing engine performs the query computation by “combining” the individual
obfuscated responses from the individual databases and produces an obfuscated response to the
query. The query front end then makes use of the derandomization parameters to deobfuscate the
query response from the computing engine.
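
As one hypothetical instance of this flow, a SUM query can be protected with additive blinding; the sketch below is illustrative only, all names in it are invented, and the actual protocols cover more query types than this.

# Sketch: the randomize / obfuscate / combine / deobfuscate steps for a SUM query.
import random

local_values = {"db1": 120, "db2": 45, "db3": 300}   # each DB's private local result

# Randomizer: one random blind per database, plus the derandomization
# parameter (the total blind) for the query front end.
blinds = {db: random.randrange(10**9) for db in local_values}
derandomizer = sum(blinds.values())

# Each database obfuscates its local result before sending it on.
obfuscated = {db: local_values[db] + blinds[db] for db in local_values}

# Computing engine combines the obfuscated responses; it never sees any
# individual unblinded value.
combined = sum(obfuscated.values())

# Query front end removes the blinding to recover the true answer.
result = combined - derandomizer
print(result)  # 465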

We make the simplifying assumption that all the databases share the same schema, which is also known
to the querier; in practice, even if the individual schemas differ, the query processing engine at
the individual databases can convert their individual results to a common schema.
