Archive for March, 2012

Attacks on Secure Embedded Systems

At the top level, attacks are classified into three main categories based on their functional objectives.

  • Privacy attacks: The objective of these attacks is to gain knowledge of sensitive information stored,
    communicated, or manipulated within an embedded system.
  • Integrity attacks: These attacks attempt to change data or code associated with an embedded system.
  • Availability attacks: These attacks disrupt the normal functioning of the system by misappropriating
    system resources so that they are unavailable for normal operation.

A second level of classification of attacks on embedded systems is based on the agents or means used to
launch the attacks. These agents are typically grouped into three main categories as shown in Figure 1:

Figure 1: Taxonomy of attacks on embedded systems

  1. Software attacks: attacks launched through software agents such as viruses, Trojan horses,
    worms, etc.
  2. Physical or invasive attacks: attacks that require physical intrusion into the system
    at some level (chip, board, or system level).
  3. Side-channel attacks: attacks based on observing properties of the system while it performs
    cryptographic operations, e.g., execution time, power consumption, or behavior in the
    presence of faults.

The agents used to launch attacks may either be passive in the sense that they do not interfere in any
manner with system execution (e.g., merely probe or observe certain properties), or may actively
interfere with the target system’s operation. Integrity and availability attacks require interference
with the system in some manner, and hence can be launched only through active agents.

It bears mentioning that, although we have classified attacks into various categories for the sake of
understanding, in practice attackers often use a combination of techniques to achieve their
objectives. For example, physical attacks may be used as a precursor to side-channel attacks
(removing a chip’s packaging before observing the values on global wires within the chip). Our
classification is also by no means exhaustive, nor is it intended to be — the ingenuity of attackers
who invariably come up with new schemes to break security is arguably the greatest challenge to
tamper-resistant design.


Specialized Parallel Relational Operators

Some algorithms for relational operators are especially appropriate for parallel execution,
either because they minimize data flow, or because they better tolerate data and execution skew.
Improved algorithms have been found for most of the relational operators. The evolution of join
operator algorithms is sketched here as an example of these improved algorithms.

Recall that the join operator combines two relations, A and B, to produce a third relation
containing all tuple pairs from A and B with matching attribute values. The conventional way of
computing the join is to sort both A and B into new relations ordered by the join attribute. These
two intermediate relations are then compared in sorted order, and matching tuples are inserted in
the output stream. This algorithm is called sort-merge join.
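As a rough illustration (not any particular system's implementation), the following Python sketch computes a sort-merge join over two in-memory relations represented as lists of dictionaries; the relation layout and attribute names are assumptions made for the example.

# Minimal sort-merge join sketch. A real engine works on pages and sorted
# runs rather than in-memory lists; this only shows the algorithmic shape.
def sort_merge_join(a, b, key):
    a_sorted = sorted(a, key=lambda t: t[key])   # O(n log n) sort of A
    b_sorted = sorted(b, key=lambda t: t[key])   # O(n log n) sort of B
    out, i, j = [], 0, 0
    while i < len(a_sorted) and j < len(b_sorted):
        ka, kb = a_sorted[i][key], b_sorted[j][key]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Collect the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(a_sorted) and a_sorted[i_end][key] == ka:
                i_end += 1
            j_end = j
            while j_end < len(b_sorted) and b_sorted[j_end][key] == ka:
                j_end += 1
            for ta in a_sorted[i:i_end]:
                for tb in b_sorted[j:j_end]:
                    out.append((ta, tb))
            i, j = i_end, j_end
    return out

if __name__ == "__main__":
    A = [{"dept": 1, "name": "alice"}, {"dept": 2, "name": "bob"}]
    B = [{"dept": 1, "budget": 100}, {"dept": 1, "budget": 200}]
    print(sort_merge_join(A, B, "dept"))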

Many optimizations of sort-merge join are possible, but since sorting has execution cost
n log(n), sort-merge join has an n log(n) execution cost. Sort-merge join works well in a parallel
dataflow environment unless there is data skew. In case of data skew, some sort partitions may
be much larger than others. This in turn creates execution skew and limits speedup and scaleup.
These skew problems do not appear in centralized sort-merge joins.

Hash-join is an alternative to sort-merge join. It has linear execution cost rather than
n log(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join
unless the input streams are already in sorted order. Hash join works as follows. Each of the
relations A and B is first hash partitioned on the join attribute. A hash partition of relation A is
hashed into a main-memory hash table. The corresponding partition of relation B is scanned, and each tuple
is compared against the main-memory hash table for the A partition. If there is a match, the pair
of tuples is sent to the output stream. Each pair of hash partitions is compared in this way.
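The following Python sketch, again purely illustrative, shows the partition-build-probe structure just described; the number of partitions and the relation layout are assumptions made for the example.

# Minimal hash-join sketch: partition A and B on the join attribute, then
# join each pair of partitions with an in-memory hash table built from A.
from collections import defaultdict

def hash_join(a, b, key, num_partitions=4):
    a_parts = [[] for _ in range(num_partitions)]
    b_parts = [[] for _ in range(num_partitions)]
    for t in a:
        a_parts[hash(t[key]) % num_partitions].append(t)
    for t in b:
        b_parts[hash(t[key]) % num_partitions].append(t)

    out = []
    for a_part, b_part in zip(a_parts, b_parts):
        # Build: hash the A partition into memory, keyed by the join attribute.
        table = defaultdict(list)
        for ta in a_part:
            table[ta[key]].append(ta)
        # Probe: scan the corresponding B partition against the hash table.
        for tb in b_part:
            for ta in table.get(tb[key], []):
                out.append((ta, tb))
    return out

In a parallel setting each pair of partitions can be handed to a different processor, which is why the algorithm speeds up and scales up as long as the hash function spreads the join keys evenly.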

The hash join algorithm breaks a big join into many little joins. If the hash function is
good and if the data skew is not too bad, then there will be little variance in the hash bucket size.
In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup. Many
optimizations of the parallel hash-join algorithm have been discovered over the last decade. In
pathological skew cases, when many or all tuples have the same attribute value, one bucket may
contain all the tuples. In these cases no algorithm is known that provides speedup or scaleup.


XML Full Text Indexes

Full-text search is a common operation in document and content-centric XML applications. DB2’s existing
text search capabilities have been extended to work with the new XML column type. Full-text indexes
with awareness of XML document structures can be defined on any native XML column. The documents in an
XML column can be fully indexed or partially indexed, e.g. if it is known in advance that only a
certain part of each document will be subject to full-text search, such as a “description” or “comment”
element. Correspondingly, text search expressions can be applied to specific paths in a document.

The following statement defines a text index which fully indexes the documents in the XML column deptdoc
in our table dept in the database personneldb:

create index myIndex for text on dept (deptdoc) format xml connect to personneldb

The following query exploits this index but restricts the search to a specific element. The query
retrieves all documents where the element ‘/dept/comment’ contains the word “Brazil”:

select deptdoc from dept where
contains(deptdoc, 'sections("/dept/comment") "Brazil"') = 1

Text search in specific parts of the documents is a critical feature for many applications. Standard
text search features are also available, such as scoring and ranking of search results as well as
thesaurus-based synonym search.

For best performance of XML insert, update, and delete operations, the text index is maintained
asynchronously, i.e. not within the context of a DML transaction. However, an "update index" command
is available to force synchronization of the text index.


The Knowledge Acquisition Process in Data Mining

The Knowledge Base is a dynamic part of the system that can be supplemented and refreshed through the
Intelligent KB Editor. There are two potential sources of knowledge to be discovered for the proposed
system: the analysis of the theoretical background that lies behind the feature extraction and
classification methods, and field experiments.

In the first case, knowledge is formulated by an expert in the area of the specific feature extraction methods
and classification schemes, and then represented as a set of rules by a knowledge engineer in terms of a
knowledge representation language supported by the system. We argue that it is possible and
reasonable to categorise the facts and rules present in the Knowledge Base. Categorisation can be
done according to the way the knowledge has been obtained: was it obtained from the analysis of
experimental results or from the domain theory, and was it entered automatically by the Intelligent KB Editor
or by a knowledge engineer (who could be a data miner as well)? Another categorisation criterion is the level
of confidence of a rule. The expert may be certain of one fact but merely hypothesize about another.
Similarly, a rule that has just been generated from the analysis of experiments on artificially generated
data sets but has never been verified on real-world data sets should not have the same level of confidence
as a rule that has been verified on a number of real-world problems.

In addition to these trust-based criteria for categorising the rules, it is possible to adapt the system to a
particular researcher's needs and preferences by giving higher weights to the rules that were contributed by
that user.

In the second case, a data miner can discover knowledge during the analysis of results obtained from
the experiments, as separate facts, trends, and dependencies. In the same manner, the discovered knowledge is
represented as a set of rules by a knowledge engineer using the knowledge representation language.
Alternatively, the knowledge acquisition process can be automatic, i.e. the knowledge discovery process
is accomplished without any involvement of a human expert. This can be achieved by deriving new rules
and updating the old ones based on the analysis of results obtained during self-run experiments.

In both of these latter cases, the problem is to learn how the Intelligent KB Editor should build a
classification or regression model on the meta-data resulting from the experiments. In this context, the input
parameters for a classification model are specific data set characteristics together with the classification
model's outputs, which include accuracy, sensitivity, specificity, time complexity, etc. The combination of a
feature extraction method's and a classification model's names with their parameter values represents a class
label. When building a regression model, the meta-data-set attributes are the data set characteristics and the
names of the feature extraction method and the classification model, and one of the model's output
characteristics is the attribute whose (continuous) value has to be predicted.
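As one concrete (and purely hypothetical) reading of this setup, the sketch below trains a decision tree on a small meta-data set: the inputs are data set characteristics and the class label names a feature extraction/classifier combination. The meta-features, combination names, and use of scikit-learn are illustrative assumptions, not part of the system described here.

# Hypothetical meta-learning sketch: predict a best-suited
# "feature extraction + classifier" combination from data set characteristics.
from sklearn.tree import DecisionTreeClassifier

# Meta-features per data set: [n_instances, n_attributes, n_classes, avg_correlation]
meta_X = [
    [150,  4,  3, 0.30],
    [5000, 60, 2, 0.85],
    [1000, 20, 10, 0.10],
]
# Class label: the combination that performed best in past experiments (made up here).
meta_y = ["PCA+NaiveBayes", "RandomProjection+kNN", "PCA+C4.5"]

meta_model = DecisionTreeClassifier().fit(meta_X, meta_y)

# For a new data set, its characteristics are measured and a recommended
# combination is predicted from the meta-model.
print(meta_model.predict([[800, 15, 4, 0.25]]))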

The results obtained at the present stage of the research show a high level of complexity in the
dependencies between the data set characteristics and the best-suited scheme for the data mining process.
In order to further develop our understanding, it is necessary to proceed with the following
iterations:

  • Generation of artificial data sets with known characteristics (simple, statistical and information theoretic
    measures);
  • Design of experiments on the generated artificial data sets;
  • Derivation of dependencies and definition of the criteria from the obtained results;
  • Development of a knowledge base defining a set of rules on the set of obtained criteria;
  • Proof of the constructed theory with a set of experiments on real-world data sets.

Thus, three basic research methods are used in the research: the theoretical approach, the constructive
approach, and the experimental approach. These approaches are closely related and are applied in parallel.
The theoretical backgrounds are exploited during the constructive work and the constructions are used for
experimentation. The results of constructive and experimental work are used to refine the theory.
An example of such a procedure can be presented as:

  1. Generation of artificial data sets with the number of attributes ranging from 2 to 100, the number of
    instances from 150 to 5000, the number of classes from 2 to 10, the average correlation
    between the attributes from 10% to 90%, the average noisiness of attributes from 10% to 50%,
    and the proportion of irrelevant attributes from 10% to 50% of the total (see the sketch after this list).
  2. Design of experiments on the generated artificial data sets, analysing the accuracy and efficiency of
    classification models built with different learning algorithms and different feature extraction
    methods. Tuning of the input parameters for each combination is required.
  3. Analysis of the dependencies and trends between output accuracies and efficiencies, feature
    extraction methods and classifiers, their input parameters, and pre-defined data set characteristics.
  4. Definition of a set of rules that reflect found dependencies and trends.
  5. Execution of a number of experiments on UCI data sets using DSS for the best-suited feature
    extraction method and classifier selection.
  6. Addition to the knowledge base of the derived rules that were successfully validated during the tests
    on the benchmark data sets.
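A minimal sketch of step 1, assuming scikit-learn's make_classification as the generator; mapping the listed characteristics onto its parameters (correlation approximated via redundant attributes, noisiness via added Gaussian noise) is a simplification chosen for illustration.

# Illustrative generator for artificial data sets with controlled characteristics.
import numpy as np
from sklearn.datasets import make_classification

def generate_dataset(n_instances=1000, n_attributes=20, n_classes=3,
                     irrelevant_fraction=0.3, noise_level=0.2, seed=0):
    n_irrelevant = int(n_attributes * irrelevant_fraction)
    n_informative = max(2, (n_attributes - n_irrelevant) // 2)
    n_redundant = n_attributes - n_irrelevant - n_informative  # correlated combinations
    X, y = make_classification(
        n_samples=n_instances,
        n_features=n_attributes,
        n_informative=n_informative,
        n_redundant=n_redundant,
        n_classes=n_classes,
        random_state=seed,
    )
    # Approximate "noisiness of attributes" by adding Gaussian noise to the features.
    rng = np.random.default_rng(seed)
    X += noise_level * X.std(axis=0) * rng.standard_normal(X.shape)
    return X, y

X, y = generate_dataset()
print(X.shape, len(set(y)))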


Lore’s XML Data Model

In Lore's new XML-based data model, an XML element is a pair (eid, value), where eid is a unique element identifier and value is either an atomic text string or a complex value containing the following four components:

1. A string-valued tag corresponding to the XML tag for that element.

2. An ordered list of attribute-name/atomic-value pairs, where each attribute-name is a string and each atomic-value has an atomic type drawn from integer, real, string, etc., or ID, IDREF, or IDREFS.

3. An ordered list of crosslink subelements of the form (label, eid), where label is a string. Crosslink subelements are introduced via an attribute of type IDREF or IDREFS.

4. An ordered list of normal subelements of the form (label, eid), where label is a string. Normal subelements are
introduced via lexical nesting within an XML document.
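The structure above can be transcribed almost directly; the following Python sketch is one possible in-memory rendering of the data model (the class and field names are ours, not Lore's).

# One possible in-memory rendering of the data model described above.
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Eid = int  # unique element identifier

@dataclass
class ComplexValue:
    tag: str                                                             # the element's XML tag
    attributes: List[Tuple[str, object]] = field(default_factory=list)   # (attribute name, atomic value)
    crosslinks: List[Tuple[str, Eid]] = field(default_factory=list)      # introduced via IDREF/IDREFS
    subelements: List[Tuple[str, Eid]] = field(default_factory=list)     # introduced via lexical nesting

# An element is (eid, value); value is an atomic text string or a ComplexValue.
Element = Tuple[Eid, Union[str, ComplexValue]]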

An XML document is mapped easily into our data model. Note that we ignore comments and whitespace
between tagged elements. As a base case, text between tags is translated into an atomic text
element; we do the same thing for CDATA sections, used in XML to escape text that might otherwise
be interpreted as markup. Otherwise, a document element is translated into a complex data element such that:

1. The tag of the data element is the tag of the document element.

2. The list of attribute-name/atomic-value pairs in the data element is derived directly from the document
element’s attribute list.

3. For each attribute value i of type IDREF in the document element, or component i of an attribute value of
type IDREFS, there is one crosslink subelement (label, eid) in the data element, where label is the
corresponding attribute name and eid identifies the unique data element whose ID attribute value matches i.

4. The subelements of the document element appear, in order, as the normal subelements of the data element.
The label for each data subelement is the tag of that document subelement, or Text if the document subelement
is atomic.
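A rough sketch of this translation, assuming Python's xml.etree and a caller-supplied set of attribute names known (e.g., from the DTD) to be of type IDREF or IDREFS; it builds plain dictionaries rather than Lore's actual structures, ignores comments, and simplifies mixed content.

# Rough translation of an XML document into the element structure above, using
# plain dicts. IDREF/IDREFS attribute names must be supplied by the caller,
# since ElementTree does not read attribute types from the DTD; detecting the
# ID attribute by its name is likewise a simplification.
import itertools
import xml.etree.ElementTree as ET

def load(xml_text, idref_attrs=frozenset()):
    next_eid = itertools.count(1)
    elements = {}                      # eid -> atomic string or complex dict
    id_to_eid = {}                     # XML ID value -> eid (filled during translation)
    pending_crosslinks = []            # (complex dict, attr name, referenced ID)

    def translate(node):
        eid = next(next_eid)
        data = {"tag": node.tag, "attributes": [], "crosslinks": [], "subelements": []}
        elements[eid] = data
        for name, value in node.attrib.items():
            data["attributes"].append((name, value))
            if name.lower() == "id":
                id_to_eid[value] = eid
            if name in idref_attrs:
                for ref in value.split():          # an IDREFS value may hold several IDs
                    pending_crosslinks.append((data, name, ref))
        if node.text and node.text.strip():        # atomic text becomes a Text subelement
            text_eid = next(next_eid)
            elements[text_eid] = node.text.strip()
            data["subelements"].append(("Text", text_eid))
        for child in node:                         # normal subelements, in document order
            data["subelements"].append((child.tag, translate(child)))
        return eid

    root_eid = translate(ET.fromstring(xml_text))
    for data, name, ref in pending_crosslinks:     # resolve crosslinks in a second pass
        data["crosslinks"].append((name, id_to_eid[ref]))   # assumes every IDREF resolves
    return root_eid, elements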

Note that multiple XML documents can be loaded into a single database, and any system of cross-document
links (e.g., XLink or XPointer) can be used provided information that uniquely identifies elements is not lost.

Figure 1: XML document and its graph

Once one or more XML documents are mapped into our data model it is convenient to visualize the data as a
directed, labeled, ordered graph. The nodes in the graph represent the data elements and the edges represent
the element-subelement relationship. Each node representing a complex data element contains a tag and an
ordered list of attribute-name/atomic value pairs; atomic data element nodes contain string values. There are
two different types of edges in the graph: (i) normal subelement edges, labeled with the tag of the
destination subelement; (ii) crosslink edges, labeled with the attribute name that introduced the crosslink.
Note that the graph representation is isomorphic to the data model, so they can be discussed interchangeably.

It is useful to view the XML data in one of two modes: semantic or literal. Semantic mode is used when the
user or application wishes to view the database as an interconnected graph. The graph representing the
semantic mode omits attributes of type IDREF and IDREFS, and the distinction between subelement and
crosslink edges is gone. Literal mode is available when the user wishes to view the database as an XML
document. IDREF and IDREFS attributes are visible as textual strings, while crosslink edges are invisible.
In literal mode, the database is always a tree.

Figure 1 shows a small sample XML document and the graph representation in our data model. Element
identifiers (eids) appear within nodes and are written as &1, &2, etc. Attribute-name/atomic-value pairs
are shown next to the associated nodes (surrounded by {}), with IDREF attributes in italics. Subelement
edges are solid and crosslink edges are dashed. The ordering of subelements is left-to-right. We have
not shown the tag associated with each element since it is straightforward to deduce for this simple
database. (For example, node &3 has the tag Member and not Advisor.) In semantic mode, the database in
Figure 1 does not include the (italicized) IDREF attributes. In literal mode, the (dashed) crosslinks are
not included. Note that there is some structural heterogeneity in the data even though the sample data
was kept purposefully small.


Polygon-Assisted Synthetic Images for Textured versus Untextured Surfaces

High-end graphics workstations (such as the Silicon Graphics RealityEngine) are typically optimized
for the display of textured surfaces, while low-end workstations (such as the Silicon Graphics Indy)
are typically optimized for the display of untextured surfaces. Given these capabilities, the most
obvious way to partition rendering between a high-end server and a low-end client is to omit surface
texture on the client. To demonstrate this, we consider a room composed of flat surfaces that exhibit
smooth shading and texture (see figure 2). The model contains 1131 polygons with a fixed color at each
vertex. This color was calculated using a hierarchical radiosity algorithm that approximates the
diffuse interreflection among textured surfaces. The high-quality rendering (figure 2a) employs
antialiasing, Gouraud-interpolated shading, and texturing. The low-quality rendering (figure 2b)
employs antialiasing and Gouraud-interpolated shading but no texturing. The difference between the two
renderings is shown in figure 2c.

Figures 2d through 2g show image-based compression of the high-quality rendering using varying JPEG
quality factors. Figures 2h through 2k show polygon-assisted compression using quality factors selected
to match as closely as possible the code sizes in figures 2d through 2g, assuming that the geometric model
resides on both machines. The quality factors, code sizes, and compression rates are given below each
image. Figures 2l and 2m give one more pair, enlarged so that details may be seen.

Table I: Image-based compression versus polygon-assisted compression, compared for the scenes pictured in
figures 2 and 3. In D, we select a quality factor that gives 20 frames per second while requiring 2 Mb/s or
less of network bandwidth. In E, we select a quality factor that matches as closely as possible the code
size obtained in D, assuming that the low-quality geometric model resides on both client and server. In F,
we estimate the number of bytes required to generate D assuming that a losslessly compressed
low-quality model is transmitted from server to client.

In every case, polygon-assisted compression is superior to image-based compression. There are two distinct
reasons for this:

1. The polygon-assisted rendering contains undegraded edges and smoothly shaded areas – precisely those
features that fare poorly in JPEG compression.

2. The difference image contains less information than the high-quality rendering, so it can be compressed
using a higher JPEG quality factor without increasing code size – even higher than is required to compensate
for the division by 2 in the difference image representation. Thus, texture features, which are present only
in the difference image, fare better using our method.
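To make point 2 concrete, here is a minimal numpy/Pillow sketch of the difference-image round trip, assuming the (high - low)/2 + 128 offset representation implied by the division by 2 above; the quality factor and file name are placeholders.

# Illustrative difference-image round trip for polygon-assisted compression.
# The server knows the high-quality frame; the client renders the low-quality
# (untextured) frame itself, so only the difference needs to cross the network.
import numpy as np
from PIL import Image

def encode_difference(high, low, quality=75, path="diff.jpg"):
    # Map the signed difference into 0..255: d = (high - low)/2 + 128.
    diff = (high.astype(np.int16) - low.astype(np.int16)) // 2 + 128
    Image.fromarray(diff.clip(0, 255).astype(np.uint8)).save(path, quality=quality)

def decode_difference(low, path="diff.jpg"):
    diff = np.asarray(Image.open(path), dtype=np.int16)
    # Invert the mapping; the division by 2 costs one bit of precision.
    high = low.astype(np.int16) + 2 * (diff - 128)
    return high.clip(0, 255).astype(np.uint8)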

As an alternative to comparing images at matching code sizes, we can compare the code sizes of images of
equal quality. Unfortunately, such comparisons are difficult because the degradations of the two methods
are different – polygon-assisted compression always produces perfect edges and smooth shading, while JPEG
never does. If one allows that figure 2j generated using polygon-assisted compression is comparable in
quality to figure 2d generated using image-based compression, then our method gives an additional 3x
compression for this scene.

Table I also estimates the number of bytes required to generate figure 2m if the model is transmitted from
server to client using the lossless compression method proposed. This size (13529 bytes) lies between the
code sizes of figures 2e and 2f. Even in this case, polygon-assisted compression is superior in image quality
to image-based compression, both in terms of its edges and smooth shading and in terms of the JPEG quality
factor used to transmit the texture information.


SNR Based Automatic Rate Control for Optimized Video Streaming over 802.11 by CLS

A fundamental limit of indirect statistics-based feedback is that it classifies link conditions as
either “good” or “bad.” This binary information provides some notion about the direction in which
to adapt the rate setting, but does not suffice to select the appropriate rate at once. This leads
to a slow step-by-step accommodation to large changes in conditions, and introduces the risk of
oscillation in stable conditions. A better approach is to use direct measurements of the link
conditions. The SNR is directly related to the bit error rate (BER) in the link, and hence to the
frame error rate (FER). Consequently, the SNR is linked to packet delay, jitter, and throughput, and has the
potential to provide rich feedback for automatic rate control. Knowing the current SNR and the
throughput vs SNR curves for each rate setting (e.g., Fig. 2) would solve the rate selection
problem instantly.

Figure 2. Throughput vs. SNR for some 802.11 modulation schemes
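A toy Python sketch of that selection step is shown below; the rate set and the throughput-vs-SNR table stand in for curves like those in Fig. 2 and contain illustrative numbers, not measurements.

# Toy SNR-based rate selection: given the current SNR estimate (e.g. derived
# from the last ACK frame under a channel-symmetry assumption), pick the
# 802.11 rate whose throughput-vs-SNR curve predicts the highest throughput.
# The table below is illustrative, not measured data.
THROUGHPUT_VS_SNR = {
    # rate (Mbit/s): [(min SNR in dB, expected throughput in Mbit/s), ...]
    1:   [(2, 0.9), (6, 0.9), (10, 0.9)],
    2:   [(2, 0.1), (6, 1.7), (10, 1.8)],
    5.5: [(2, 0.0), (6, 2.5), (10, 4.5)],
    11:  [(2, 0.0), (6, 0.5), (10, 6.5)],
}

def expected_throughput(curve, snr):
    # Use the highest tabulated SNR point not exceeding the current SNR.
    tput = 0.0
    for point_snr, point_tput in curve:
        if snr >= point_snr:
            tput = point_tput
    return tput

def select_rate(snr):
    return max(THROUGHPUT_VS_SNR,
               key=lambda rate: expected_throughput(THROUGHPUT_VS_SNR[rate], snr))

print(select_rate(7))   # -> 5.5 with the illustrative table above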

Despite the advantages, SNR-based rate control has not been applied in practice yet because of the
following three problems:

1) In reality, for certain link conditions the relation between the optimal rate and the SNR is highly
variable. This is due to the imperfection of the models describing the radio channel.

2) It is not trivial to obtain a reliable estimate of the SNR of a link. Many radio interfaces only
provide an uncalibrated signal strength indication (SSI).

3) The rate controller, which is on the sending side, in fact needs the SNR observed at the
receiving side.

Most work on using SNR information for automatic rate control is based on simulation and does not
consider the practical difficulties of obtaining good SNR estimates; it concentrates on the way in
which the noisy and drifting SNR can be used to determine the correct rate setting. Other work addresses
the issue of how to communicate SNR values back to the sender, but its rate selection algorithm still
relies on a straight SNR threshold technique. Another approach, discussed elsewhere, assumes
that the channel is symmetric, meaning that the SNR observed at either station is very similar at
any given point in time. This assumption allows the SNR of the last ACK frame to be used as an indicator
of the SNR at the other side, and hence for selecting the rate of the next data frame to be sent.
