Archive for November, 2011
Graphics Processing Units Using the Compute Unified Device Architecture
Posted by protogenist in Technology Research on November 30, 2011
Graphics Processing Units (GPUs) have mainly been game- and video-centric devices.
Due to the increasing computational requirements of graphics-processing applications, GPUs have become
very powerful parallel processors, and this in turn has spurred research interest in computing outside
the graphics community. Until recently, however, programming GPUs was limited to graphics libraries
such as OpenGL and Direct3D, and for many applications, especially those based on integer arithmetic,
the performance improvements over CPUs were minimal or even negative. The release of NVIDIA’s G80
series and ATI’s HD2000 series GPUs (which implemented the unified shader architecture), along with the
companies’ release of higher-level language support with Compute Unified Device Architecture (CUDA),
Close to Metal (CTM) and the more recent Open Computing Language (OpenCL), has, however, facilitated the
development of massively parallel general-purpose applications for GPUs. These general-purpose GPUs have
become a common target for numerically intensive applications given their ease of programming
(compared to previous-generation GPUs) and their ability to outperform CPUs in data-parallel applications,
commonly by orders of magnitude.
In addition to the common floating point processing capabilities of previous generation GPUs, starting
with the G80 series, NVIDIA’s GPU architecture added support for integer arithmetic, including 32-bit
addition/subtraction and bit-wise operations, scatter/gather memory access and different memory spaces.
Each GPU contains between 10 and 30 streaming multiprocessors (SMs), each equipped with eight scalar
processor (SP) cores, fast 16-way banked on-chip shared memory (16KB/SM), a multithreaded instruction unit,
a large register file (8192 registers for G80-based GPUs, 16384 for the newer GT200 series), read-only caches for
constant (8KB/SM) and texture memories (varying between 6 and 8 KB/SM), and two special function units
(for transcendentals).
CUDA is an extension of the C language that employs a new massively parallel programming model:
single-instruction multiple-thread (SIMT). SIMT differs from SIMD in that the underlying vector size is hidden and the
programmer is restricted to writing scalar code that is parallel at the thread-level. The programmer defines
kernel functions, which are compiled for and executed on the SPs of each SM, in parallel: each light-weight
thread executes the same code, operating on different data. A number of threads (at most 512) are grouped
into a thread block which is scheduled on a single SM, the threads of which timeshare the SPs. This additional
hierarchy provides for threads within the same block to communicate using the on-chip shared memory and
synchronize their execution using barriers. Moreover, multiple thread blocks can be executed simultaneously
on the GPU as part of a grid. A maximum of eight thread blocks can be scheduled per SM, and in order to hide
instruction and memory latencies (among others), it is important that at least two blocks be scheduled on each SM.
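The thread, block and grid hierarchy described above can be illustrated with a small emulation. This is a minimal sketch in plain Python rather than CUDA: the names `launch` and `vector_add_kernel` are hypothetical, and the loops run the threads one after another, whereas a real GPU schedules blocks on SMs and runs their threads in parallel. It only shows how each thread runs the same scalar code and derives the element it works on from its block and thread indices.

```python
MAX_THREADS_PER_BLOCK = 512  # assumed per-block thread limit for this sketch


def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Scalar code executed by every thread; each thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):  # the last block may contain surplus threads
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch over a 1-D grid of 1-D thread blocks."""
    if not 0 < block_dim <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block_dim must be between 1 and MAX_THREADS_PER_BLOCK")
    for block_idx in range(grid_dim):        # on a GPU: blocks are scheduled on SMs
        for thread_idx in range(block_dim):  # on a GPU: threads time-share the SPs
            kernel(block_idx, thread_idx, block_dim, *args)


a = list(range(10))
b = [10 * x for x in a]
out = [0] * len(a)
block_dim = 4
grid_dim = (len(a) + block_dim - 1) // block_dim  # enough blocks to cover the data
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
print(out)
```

The bounds check inside the kernel matters because the grid is rounded up to whole blocks, so the last block usually has more threads than remaining elements.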
Proposition of the Independent Recovery Protocols
Posted by protogenist in Technology Research on November 30, 2011
A great deal of effort has been invested in research on an ideal independent recovery protocol.
The results are mainly negative.
Prop. 1
Independent recovery protocols exist only for single-site failures. There exists no independent recovery
protocol which is resilient to multiple-site failures.
Prop. 2
There exists no nonblocking protocol that is resilient to a network partition if messages are lost when the
partition occurs.
Prop. 3
There exist nonblocking protocols which are resilient to a single network partition if all undeliverable messages
are sent back to the sender.
Prop. 4
There exists no nonblocking protocol which is resilient to multiple partitions.
Thus there exists no general solution to this problem.
Practical solution: the largest partition terminates the transaction so that it is not blocked.
Problem: which partition is the largest?
Primary site approach and the majority approach
There are different methods to decide which partition is the largest and may therefore terminate the group-level transaction.
Primary site approach:
A site is designated as the primary site, and the partition containing this primary site is allowed to terminate
the transaction. It is usual to assign the role of primary site to the coordinator. In this case all
transactions within this partition are terminated correctly.
If the primary site differs from the coordinator site, then a 3PC termination protocol should be used to
terminate all transactions of the group with the primary site.
Majority approach:
Only the group containing the majority of sites can terminate the transaction. The sites in the groups may
vote for aborting or for committing. The majority of sites must agree on the abort or commit before the
transaction terminates.
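The two approaches above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: sites are plain strings, a partition is the set of sites that can still reach each other, and the function names are hypothetical. It does not implement a termination protocol such as 3PC, only the test for which partition may proceed.

```python
def primary_site_can_terminate(partition, primary_site):
    """Primary site approach: only the partition holding the primary site proceeds."""
    return primary_site in partition


def majority_can_terminate(partition, all_sites):
    """Majority approach: the partition must hold more than half of all sites."""
    members = set(partition) & set(all_sites)
    return 2 * len(members) > len(set(all_sites))


def majority_decision(votes, all_sites):
    """Return "commit" or "abort" once a majority of all sites agrees, else None.

    votes maps a site name to the string "commit" or "abort".
    """
    needed = len(set(all_sites)) // 2 + 1
    for outcome in ("commit", "abort"):
        if sum(1 for vote in votes.values() if vote == outcome) >= needed:
            return outcome
    return None


sites = {"A", "B", "C", "D", "E"}
print(majority_can_terminate({"A", "B", "C"}, sites))   # the three-site partition
print(majority_can_terminate({"D", "E"}, sites))        # the two-site partition
print(majority_decision({"A": "commit", "B": "commit", "C": "commit"}, sites))
```

Note that with an even number of sites split exactly in half, neither partition holds a majority, so both remain blocked under the majority approach.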
Coding in Feature Driven Development
Posted by protogenist in Technology Research on November 28, 2011
The coding process in FDD is not as exciting and challenging as it is in XP (eXtreme Programming). This
is because, by coding time, the features have already been extensively discussed during
Process One, the iteration kick-off meeting, and the design review meeting. Classes and methods are
defined by now, and their purpose is described in the code documentation. Coding often becomes
a mechanical process.
Unlike XP, FDD strongly discourages refactoring. The main argument against
refactoring here is that it takes time and does not bring any value to the customer. The
quality of code is addressed during code review meetings.
FDD encourages strong code ownership. The main idea is that every developer
knows the owned code and better understands the consequences of changes. FDD addresses the
problem of team members leaving from a different angle:
- Sufficient code documentation simplifies understanding somebody else’s code.
- Developers know what other people’s code does, since they reviewed the design.
- Developers will look at each other’s code during code review.
5 Reasons Why People Spam Your Blog
Posted by protogenist in Technology Research on November 25, 2011
No aspect of the World Wide Web is immune to spam – not even the blogosphere. No matter how strong your anti-spam server is, you may get hit every once in a while. Of course, the type of spam seen on personal blogs differs from the normal spam you might be used to in that, instead of arriving in your private inbox, these messages are displayed on your blog for the entire world to see. Furthermore, the professional spammers who distribute unsolicited commercial e-mail for a living have different reasons for spamming a personal blog than for sending unwanted junk mail to somebody’s inbox. So bloggers need a good anti-spam solution in order to protect their blogs.
#1: To advertise a website, product, or service. Perhaps the most generic reason for spamming a blog is for advertisement purposes. Through a blog it is easy to reach thousands of people every single day; this holds true for the owner of the blog as much as the ones who are spamming it.
#2: To get backlinks to their site. Many spammers simply leave a comment with nothing more than their website address, hoping to get as many clicks as possible.
#3: It is cheap when compared to other methods of spam. Even in the world of spam marketing, it takes money to make money – unless you’re spamming blogs, of course.
#4: The process can easily be automated to save time. Unlike some of the other spamming techniques, the entire process of spamming a blog can be automated.
#5: To collect e-mail addresses. Many times a user’s e-mail address will be listed in their online profile, or even right alongside their post. Spammers collect these addresses in order to send them unsolicited commercial e-mail at a later time.
Dynamic Coordination of Information Management Services for Processing Dynamic Web Content
Posted by protogenist in Technology Research on November 24, 2011
Dynamic Web content provides us with time-sensitive and continuously changing data. To glean up-to-date information, users need to regularly browse, collect and analyze this Web content. Without proper tool support this information management task is tedious, time-consuming and error prone, especially when the quantity of the dynamic Web content is large, when many information management services are needed to analyze it, and when the underlying services or network are not completely reliable. This post describes a multi-level, lifecycle (design-time and run-time) coordination mechanism that enables rapid, efficient development and execution of information management applications that are especially useful for processing dynamic Web content. Such a coordination mechanism brings dynamism to coordinating independent, distributed information management services. Dynamic parallelism spawns/merges multiple execution service branches based on available data, and dynamic run-time reconfiguration coordinates service execution to overcome faulty services and bottlenecks. These features enable information management applications to be more efficient in handling content and format changes in Web resources, and enable the applications to be evolved and adapted to process dynamic Web content.
The coverage of individual Web sites that provide such dynamic content is often incomplete, since individually they are limited by time and resource constraints. To obtain a complete picture about time-sensitive or wide-range topics, people tend to access multiple Web sites. For example, during the terror attacks, since it was such a time-critical situation, no single news site could provide complete information to understand what was happening. People needed to browse different news sites to access different coverage and different opinions, then compile and assemble the information together to understand the full situation. In addition, to understand different reactions from different parts of the world, news sources from various countries needed to be accessed. If a Web site is unresponsive due to congestion, people tend to switch to another Web site and come back later. People exhibit other forms of dynamism as well. For example, they will select a different set of information sources based on their topic area and geographic region of interest, and they will mentally filter and analyze the news articles based on the articles’ content, structure and format.
Any information management tool that supports this process of gleaning information from dynamic Web content should help alleviate the tedious and repetitive aspects, but should be flexible enough to allow users to incorporate the dynamic aspects of information analysis. This post describes a dynamic service coordination mechanism that brings dynamism to information management systems for processing dynamic Web content. This coordination mechanism allows users to incrementally develop information management applications on different abstraction levels through the design/run-time lifecycle, which is essential for processing dynamic Web content efficiently and correctly. This mechanism has been adopted by USC ISI’s GeoWorlds system, and has proven practically effective for developing information management applications for processing dynamic Web content.
The characteristics of this class of information management tasks:
1. The information is time-sensitive and continuously changing
2. The information needs to be joined together from multiple sources
3. Multiple complex analysis steps are needed to jointly process the information
4. The analysis steps need to be reconfigured to adapt to specific tasks
5. The tasks are repetitive. They need to be performed periodically
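The two dynamic features mentioned earlier, spawning and merging parallel branches based on the available data and reconfiguring at run time to get around faulty services, can be illustrated with a small sketch. This is not the GeoWorlds mechanism itself: the function names are hypothetical, services are modelled as plain Python callables, and a service counts as faulty simply when it raises an exception.

```python
from concurrent.futures import ThreadPoolExecutor


def call_with_fallback(services, item):
    """Try equivalent services in order and return the first successful result."""
    errors = []
    for service in services:
        try:
            return service(item)
        except Exception as exc:  # a faulty service: move on to the next one
            errors.append(exc)
    raise RuntimeError(f"all services failed for {item!r}: {errors}")


def process_all(items, services, max_workers=4):
    """Spawn one branch per available item, then merge results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda item: call_with_fallback(services, item), items))


def flaky_summarizer(text):
    raise ConnectionError("service unreachable")  # simulated outage


def backup_summarizer(text):
    return text.upper()  # stand-in for a real analysis step


articles = ["site a story", "site b story", "site c story"]
print(process_all(articles, [flaky_summarizer, backup_summarizer]))
```

The number of branches follows the number of items that happen to be available, and the fallback list plays the role of a user switching to another site when one is unresponsive.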
Polygon-Assisted JPEG and MPEG Compression of Synthetic Images
Posted by protogenist in Technology Research on November 23, 2011
Real-time image compression and decompression hardware makes it possible for a high-performance
graphics engine to operate as a rendering server in a networked environment. If the client is a
low-end workstation or set-top box, then the rendering task can be split across the two devices.
We explore one strategy for doing this. For each frame, the server generates a high-quality
rendering and a low-quality rendering, subtracts the two, and sends the difference in compressed
form. The client generates a matching low-quality rendering, adds the decompressed difference image,
and displays the composite. Within this paradigm, there is wide latitude to choose what constitutes
a high-quality versus low-quality rendering. We have experimented with textured versus untextured
surfaces, fine versus coarse tessellation of curved surfaces, Phong versus Gouraud interpolated
shading, and antialiased versus nonantialiased edges. In all cases, our polygon-assisted compression
looks subjectively better for a fixed network bandwidth than compressing and sending the high-quality
rendering. We describe a software simulation that uses JPEG and MPEG-1 compression, and we show results
for a variety of scenes.
We consider an alternative solution that partitions the rendering task between client and server. We use
the server to render those features that cannot be rendered in real time on the client – typically
textures and complex shading. These are compressed using JPEG or MPEG and sent to the client. We use the
client to render those features that compress poorly using JPEG or MPEG – typically edges and smooth
shading. The two renderings are combined in the client for display on its screen. The resulting image is
subjectively better for the same bandwidth than can be obtained using JPEG or MPEG alone. Alternatively,
we can produce an image of comparable quality using less bandwidth.
Client-server relationship
The hardware consists of a high-performance workstation (henceforth called the server), a low-performance
workstation (henceforth called the client), and a network. To produce each frame of synthetic imagery,
these two machines perform the following three steps:
(1) On the server, compute a high-quality and low-quality rendering of the scene using one of the
partitioning strategies described.
(2) Subtract the two renderings, apply lossy compression to the difference image, and send it to the client.
(3) On the client, decompress the difference image, compute a low-quality rendering that matches the
low-quality rendering computed on the server, add the two images, and display the resulting composite image.
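The three steps above can be sketched on a single row of grayscale pixels. This is a minimal illustration under loud assumptions: the images are short lists of 8-bit values, and a uniform quantizer stands in for the lossy JPEG or MPEG codec, which the sketch does not implement. The function names and the `step` parameter are hypothetical.

```python
def quantize(values, step):
    """Stand-in for lossy compression: snap each value to a multiple of step."""
    return [round(v / step) * step for v in values]


def server_encode(high, low, step=8):
    """Steps 1 and 2: subtract the two renderings and lossily code the difference."""
    difference = [h - l for h, l in zip(high, low)]
    return quantize(difference, step)


def client_decode(low, coded_difference):
    """Step 3: add the decoded difference to the client's own low-quality rendering."""
    return [min(255, max(0, l + d)) for l, d in zip(low, coded_difference)]


high_quality = [200, 35, 90, 255]   # e.g. textured, antialiased rendering
low_quality = [180, 40, 90, 250]    # rendering that both machines can compute
step = 8

coded = server_encode(high_quality, low_quality, step)
composite = client_decode(low_quality, coded)
worst_error = max(abs(c - h) for c, h in zip(composite, high_quality))
print(coded, composite, worst_error)
```

Because only the difference is coded, pixels where the low-quality rendering already matches the high-quality one cost nothing to transmit, which is the intuition behind the bandwidth saving.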
Depending on the partitioning strategy, there may be two geometric models describing the scene or one model
with two rendering options. The low-quality model may reside on both machines, or it may be transmitted from
server to client (or client to server) for each frame. If the model resides on both machines, this can be
implemented using display lists or two cooperating copies of the application program. The latter solution is
commonly used in networked visual simulation applications. To provide interactive performance, the server in
such a system would normally be a graphics workstation with hardware accelerated rendering. The client might
be a lower-end hardware-accelerated workstation, or it might be a PC performing rendering in software, or it
might be a set-top box utilizing a combination of software and hardware. Differencing and compression on the
server, and decompression and addition on the client, would most likely be performed in hardware, although
real-time software implementations are also beginning to appear. One important caveat regarding the selection
of client and server is that there are often slight differences in pixel values between equivalent-quality
renderings computed by high-performance and low-performance machines, even if manufactured by the same
vendor. If both renderings are antialiased, these differences are likely to be small.
Streaming Tetrahedral Volume Meshes
Posted by protogenist in Technology Research on November 22, 2011
In a streaming mesh format, tetrahedra and the vertices they reference are stored in an interleaved fashion.
This makes it possible to start operating on the data immediately without having to first load all the vertices, as is common practice with standard indexed formats. Furthermore, streaming formats provide explicit information about when vertices are referenced for the last time. This makes it possible to complete operations on these vertices and free the corresponding data structures for immediate reuse. The width of a streaming mesh is the maximal number of vertices that need to be in memory simultaneously. Those are vertices that have already streamed in but have not been finalized yet. The width is the lower bound for the amount of memory needed for processing a streaming mesh since any mesh processing application has to store at least that many vertices simply to dereference the mesh.
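The notion of width can be made concrete with a short sketch. The record layout used here is an assumption for illustration only, not an actual streaming file format: ("v", id) introduces a vertex, ("t", (a, b, c, d)) is a tetrahedron, and ("f", id) marks a vertex as finalized. The function name `stream_width` is hypothetical.

```python
def stream_width(stream):
    """Return the maximal number of vertices that are in memory at the same time."""
    active = set()  # vertices that have streamed in but are not yet finalized
    width = 0
    for kind, payload in stream:
        if kind == "v":
            active.add(payload)
            width = max(width, len(active))
        elif kind == "t":
            missing = [v for v in payload if v not in active]
            if missing:
                raise ValueError(f"tetrahedron references inactive vertices {missing}")
        elif kind == "f":
            active.discard(payload)  # storage for this vertex can be reused
        else:
            raise ValueError(f"unknown record kind {kind!r}")
    return width


# Two tetrahedra sharing a face, with vertices interleaved and finalized early.
interleaved = [
    ("v", 0), ("v", 1), ("v", 2), ("v", 3), ("t", (0, 1, 2, 3)), ("f", 0),
    ("v", 4), ("t", (1, 2, 3, 4)), ("f", 1), ("f", 2), ("f", 3), ("f", 4),
]
# The same mesh in indexed style: every vertex first, nothing finalized early.
indexed = [("v", i) for i in range(5)] + [("t", (0, 1, 2, 3)), ("t", (1, 2, 3, 4))]

print(stream_width(interleaved), stream_width(indexed))
```

For the indexed ordering the width equals the total vertex count, which is why such layouts force the whole vertex array into memory.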
The streaming approach to compression relies on the input meshes either being stored or produced in a streaming manner.
The set of example volume meshes that we use to test our compressor, however, does not fulfill these expectations at all.
Not only are these tetrahedral meshes distributed in conventional, non-streaming formats, they also come with absolutely “un-streamable” element orders, as illustrated by the layout diagrams. The horizontal axis represents the tetrahedra (in the order they occur in the file), and the vertical axis represents the vertices (also in the order they occur in the file). The few unclassified data sets that are currently used by the visualization community for performance measurements were created several years ago. Back then, the difficulties of using random-access in-core algorithms for producing larger and larger meshes were overcome simply by employing sufficiently powerful computer equipment. But only when there is enough main memory to hold the entire mesh is it possible to output meshes whose vertices and tetrahedra are ordered so “randomly” in the file.
In the near future we anticipate a new generation of meshing algorithms that produces and outputs volume mesh
data in a more coherent fashion. This is a necessity if algorithms are to scale to increasingly large data sets. An algorithm for tetrahedral mesh refinement, for example, could be designed to sweep over the data set and restrict its operation at any time to the currently active set until it achieves the desired element quality. For a mesh generation algorithm operating in this manner, it is natural to output reasonably coherent meshes in a streaming manner. To stream legacy data stored in non-streaming formats or with highly incoherent layouts, we describe several conversion strategies.
Gathering Criterion-Related Evidence of Validity
Posted by protogenist in Technology Research on November 21, 2011
Gathering criterion-related evidence of validity is an important task for all language testers. This task is particularly difficult for the CAEL test given some of the unique features of the CAEL Assessment. That is, the use of constructed-response test items in a topic-based fully integrated language test is essentially a unique approach to language testing at the present time. These aspects of the CAEL Assessment strengthen the claim made by test developers that the CAEL Assessment is a reasonable approximation of the language demands of English for academic purposes, particularly in Canadian university contexts. However, the essentially unique nature of the test means that gathering criterion-related evidence of validity is problematic.
CAEL test scores have been compared with the performance of test takers on the Test of English as a Foreign Language (TOEFL). However, the TOEFL is clearly measuring English language proficiency in a very different manner. Does this mean, then, that a correlation of the two test scores provides criterion-related evidence in support of the CAEL test? Clearly, the establishment of an appropriate criterion is always a challenge when gathering this type of validity evidence.
One procedure that has been adopted in an effort to gather more meaningful criterion-related evidence of validity is to conduct follow-up studies of CAEL test takers who score at various proficiency levels. One such follow-up study was conducted for this manual. The university course performance of 79 test takers who achieved an Overall Result at a band score of 70 or greater was collected. The basic design was to determine the grade point averages (GPA) of these students in their first full term of study after achieving an Overall Result of 70 or greater on the CAEL Assessment. The score of 70 was selected for this study because test takers who achieve this score are permitted to register for regular courses at the university without any further ESL/EAP (English for Academic Purposes) training. Data was collected for each test taker for the term immediately following their CAEL Assessment in an effort to avoid measuring the impact of language learning which occurred after the test was completed.
Authenticating Streamed Data in the Presence of Random Packet Loss
Posted by protogenist in Technology Research on November 20, 2011
This is a scheme for authenticating streamed data delivered in real time over an insecure network.
The difficulty of signing live streams is twofold. First, authentication must be efficient so
the stream can be processed without delay. Secondly, authentication must be possible even if
some packets in the sequence are missing. Streams of audio or video provide a good example.
They must be processed in real-time and are commonly exchanged over UDP, with no guarantee that
every packet will be delivered. Existing solutions to the problem of signing streams have been
designed to resist worst-case packet loss. In practice however, network loss is not malicious
but occurs in patterns of consecutive packets known as bursts. Based on this realistic model of
network loss, we propose an authentication scheme for streams which achieves better performance
as well as much lower communication overhead than existing solutions.
There are two issues to consider when signing streams. On the one hand, the signature scheme must
be efficient enough to permit authentication on the fly without introducing delays. On the other
hand, the signature scheme must be robust enough that authentication remains possible even if some
packets are lost. The naive solution to authenticate a stream is to sign each packet in the stream
individually. The receiver checks the signatures of packets as they arrive and stops processing the
stream immediately if an invalid signature is discovered. Immediate authentication is possible, but
the computational load on both the sender and the receiver is too great to make this approach practical.
A more efficient solution is proposed by Gennaro and Rohatgi. They observe that one-time signatures
can be used in combination with a single digital signature to authenticate a sequence of packets. Each
packet carries a public key, which is used in a one-time signature scheme to sign the following packet.
Only the first packet needs to be signed with a regular digital signature. Since one-time signatures
are an order of magnitude faster to apply than digital signatures, and can also be verified somewhat
more efficiently, this solution offers a significant improvement in execution speed.
However, there is a major difficulty with this approach. Recall that audio and video streams are sent
using UDP, which provides only “best-effort” service and does not guarantee that all packets will be
delivered. If a packet is missing, the authentication chain is broken and subsequent packets cannot
be authenticated. (Another problem is that one-time signatures incur a substantial communication
overhead.) If a sequence is received incomplete, we would still like to be able to authenticate all
the packets that were not lost. This defines resistance to loss in a strong sense: a packet is either
lost or authenticable. A weaker alternative would allow a few packets to be received unauthenticated
in case of packet loss. We offer two justifications for adopting the strong definition. First, it is
essential for some applications that only authenticated content be received. Consider a stream that
delivers stock quotes in real time. While it might be acceptable to lose a quote, we must ensure that
only authenticated quotes are ever displayed. Secondly, our constructions which resist loss in the
strong sense can easily be adapted to the weaker notion of resistance.
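The broken-chain problem described above can be demonstrated with a simplified chain. To keep the sketch short it swaps in different primitives from the ones discussed: each packet carries a SHA-256 hash of the following packet instead of a one-time public key, and an HMAC tag stands in for the single regular digital signature on the first packet. The function names and packet layout are hypothetical, and this is not the scheme proposed in this post.

```python
import hashlib
import hmac


def packet_digest(packet):
    payload, next_digest = packet
    return hashlib.sha256(payload + next_digest).digest()


def build_chain(payloads, key):
    """Build packets back to front so each one embeds the digest of its successor."""
    packets, next_digest = [], b""
    for payload in reversed(payloads):
        packet = (payload, next_digest)
        packets.append(packet)
        next_digest = packet_digest(packet)
    packets.reverse()
    # HMAC over the first packet's digest: stand-in for the one digital signature.
    tag = hmac.new(key, next_digest, hashlib.sha256).digest()
    return packets, tag


def verify_stream(received, tag, key):
    """Return the payloads that can be authenticated; None marks a lost packet."""
    authenticated, expected = [], None
    for index, packet in enumerate(received):
        if packet is None:
            break  # chain broken: nothing after the gap can be checked
        digest = packet_digest(packet)
        if index == 0:
            reference = hmac.new(key, digest, hashlib.sha256).digest()
            valid = hmac.compare_digest(reference, tag)
        else:
            valid = hmac.compare_digest(digest, expected)
        if not valid:
            break
        authenticated.append(packet[0])
        expected = packet[1]
    return authenticated


key = b"demo key, not a real secret"
packets, tag = build_chain([b"quote 1", b"quote 2", b"quote 3", b"quote 4"], key)
with_loss = [packets[0], None, packets[2], packets[3]]
print(verify_stream(packets, tag, key))
print(verify_stream(with_loss, tag, key))
```

With one packet lost, only the packets before the gap are authenticated, even though the later ones arrived intact. That is exactly the weakness that loss-resistant constructions aim to remove.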
Existing authentication schemes that resist packet loss have been designed to resist worst-case packet
loss. Any number of packets may be lost anywhere in the sequence, without interfering with the receiver’s
ability to authenticate the packets that arrived. Studies conducted on packet loss in UDP suggest that
resisting worst-case packet loss is overkill. The focus should instead be on resisting random packet
loss. We will show how that leads to much more efficient constructions. Since packet loss on the network
is not malicious, it is natural to analyze the patterns of loss and design our authentication schemes
accordingly. Paxson shows that on the Internet consecutive packets tend to get lost together in a burst.
We adopt this model and propose authentication schemes designed to resist bursty loss. Specifically, our
goal is to maximize the size of the longest single burst of loss that our authenticated streams can
withstand. Of course, this is not to say that our constructions resist only a single burst. As will be
clear, once a few packets have been received after a burst, our scheme recovers and is ready to maintain
authentication even if further loss occurs.
Reasons for not using assembly code
Posted by protogenist in Technology Research on November 17, 2011
There are so many disadvantages and problems involved in assembly programming that it
is advisable to consider the alternatives before deciding to use assembly code for a
particular task. The most important reasons for not using assembly programming are:
1. Development time : Writing code in assembly language takes much longer than
in a high level language.
2. Reliability and security : It is easy to make errors in assembly code. The assembler does
not check whether the calling conventions and register save conventions are obeyed.
Nobody checks for you whether the number of PUSH and POP instructions is the same in
all possible branches and paths. There are so many possibilities for hidden errors in
assembly code that it affects the reliability and security of the project unless you
have a very systematic approach to testing and verifying.
3. Debugging and verifying : Assembly code is more difficult to debug and verify
because there are more possibilities for errors than in high level code.
4. Maintainability : Assembly code is more difficult to modify and maintain because the
language allows unstructured spaghetti code and all kinds of dirty tricks that are
difficult for others to understand. Thorough documentation and a consistent
programming style are needed.
5. System code can use intrinsic functions instead of assembly : The best modern C++
compilers have intrinsic functions for accessing system control registers and other
system instructions. Assembly code is no longer needed for device drivers and other
system code when intrinsic functions are available.
6. Application code can use intrinsic functions or vector classes instead of assembly:
The best modern C++ compilers have intrinsic functions for vector operations and
other special instructions that previously required assembly programming. It is no
longer necessary to use old fashioned assembly code to take advantage of the
Single-Instruction-Multiple-Data (SIMD) instructions.
7. Portability: Assembly code is very platform-specific. Porting to a different platform is
difficult. Code that uses intrinsic functions instead of assembly is portable to all x86
and x86-64 platforms.
8. Compilers have been improved a lot in recent years : The best compilers are now
better than the average assembly programmer in many situations.
9. Compiled code may be faster than assembly code because compilers can perform
inter-procedural optimization and whole-program optimization : The assembly
programmer usually has to make well-defined functions with a well-defined call
interface that obeys all calling conventions in order to make the code testable and
verifiable. This prevents many of the optimization methods that compilers use, such
as function inlining, register allocation, constant propagation, common subexpression
elimination across functions, scheduling across functions, etc. These
advantages can be obtained by using C++ code with intrinsic functions instead of
assembly code.