Dynamic Web Document Processing Task

Although the Web itself was developed to support information management, as its size and density increase, users are having a more difficult time locating, retrieving, and organizing the information they need. Many users have resorted to front-end information management systems, which access the Web as an information source and provide various information management tools. One such system is USC ISI’s component-based GeoWorlds system, which handles both geographic and Web-based information. GeoWorlds combines Geographic Information Systems software with tools for finding, filtering, sorting, characterizing, and analyzing collections of documents accessed via the Web. GeoWorlds is currently in experimental use by the Crisis Operations Planning Team at the US Pacific Command (PACOM), and by the Virtual Information Center, also at PACOM. Example uses at PACOM include: (1) mapping terrorist bombings in the Philippines; (2) locating patterns of recurring natural and technological disasters in China and India; and (3) investigating drug trafficking and piracy in various locales.

As a test application for our dynamic coordination mechanism, we have extended GeoWorlds to build USC ISI’s GeoTopics, an information management application that daily examines news reportage from the websites of a growing number of large, elite English-language news operations. Its contribution is to help identify the “hot topics” and the most frequently referenced places found that day in this very large collection of reports. The goal is to help users look at what’s going on worldwide in either of two ways: what places are relevant to a topic, or what topics are hot in a place. GeoTopics also ranks topics and places by the number of references found to them that day. It automatically compares these to the previous day’s results, flags new topics that emerged that day, and tells users whether ongoing topics seem to be getting more or less attention than the day before. Clicking on any of a day’s topics or places gives users both the full set of reports under that heading and a breakdown of those reports (i.e., reports on each topic are broken down by places referenced, and reports on places are broken down by topics referenced).
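The ranking and day-over-day comparison described above can be sketched as follows. This is a minimal illustration and not the GeoTopics implementation: the report structure (a dict with a "topics" list) and the function names rank_topics and compare_with_previous are assumptions introduced here, and the sketch assumes topics have already been extracted from each report.

```python
from collections import Counter


def rank_topics(reports):
    """Return (topic, count) pairs, most-referenced first.

    Hypothetical input: each report is a dict with a "topics" list.
    A topic is counted at most once per report.
    """
    counts = Counter()
    for report in reports:
        counts.update(set(report["topics"]))
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def compare_with_previous(today, yesterday):
    """Label each of today's topics as 'new', 'up', 'down', or 'steady'."""
    previous = dict(yesterday)
    trends = {}
    for topic, count in today:
        if topic not in previous:
            trends[topic] = "new"
        elif count > previous[topic]:
            trends[topic] = "up"
        elif count < previous[topic]:
            trends[topic] = "down"
        else:
            trends[topic] = "steady"
    return trends


# Made-up sample data, for illustration only.
yesterday_reports = [
    {"topics": ["election", "flood"]},
    {"topics": ["election"]},
]
today_reports = [
    {"topics": ["flood", "earthquake"]},
    {"topics": ["flood"]},
    {"topics": ["flood", "election"]},
]

today_ranking = rank_topics(today_reports)
trends = compare_with_previous(today_ranking, rank_topics(yesterday_reports))
```

The same two functions would apply unchanged to places instead of topics, since both are ranked by reference counts.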

In principle, the underlying GeoWorlds system could be used to perform this news analysis task without the dynamic coordination mechanism. GeoWorlds’ analysis services and visual editing facilities provide enough functionality that they could be chained together manually to generate similar results. However, this generation process would be very labor-intensive. Currently, with the enhancements, it takes about 20 minutes to perform the daily updates for GeoTopics. Without the enhancements, it would take several hours. The coordination between these information management services forms the GeoTopics application. That coordinated set of services can be reused for different sets of information sources, and can be repeated for recurring analyses of the same set of information sources.

Although the high-level processing steps are the same (extracting articles, filtering and classifying them, and generating the HTML report), the selection and coordination of the information management services need to be flexible and reconfigurable to handle dynamic situations. For example, most of the 10 news sites used for the current GeoTopics have sidebars and footers in their articles, which cause false-matching problems (e.g., ‘NASDAQ’ was ranked high because it appeared in the sidebars of many of the news articles). The coordination mechanism allows an additional filter to be added that removes the sidebars and footers and returns only the pure article text. Also, when biased rankings are detected (e.g., ‘Los Angeles’ was ranked high because there were many LA Times articles about the city), a category weighting service can be added to give more weight to the categories extracted from multiple news sources. Finally, the coordination mechanism allows alternative ranking services to be substituted, for example, a ranking service that takes weight values into account as well as article counts.
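The kind of reconfiguration described above (adding a filter, substituting a ranking service) can be illustrated with a small sketch. Everything here is hypothetical: the Pipeline class, the service functions, the article fields, and the weighting formula (article count multiplied by the number of distinct sources) are assumptions chosen for illustration, not the GeoWorlds coordination mechanism or its actual services.

```python
from collections import Counter

# Hypothetical fixed vocabulary standing in for a real classifier.
VOCAB = ("NASDAQ", "Los Angeles", "earthquake")


class Pipeline:
    """An ordered list of named services; each consumes the previous output."""

    def __init__(self, steps):
        self.steps = list(steps)  # (name, callable) pairs

    def _index(self, name):
        return [n for n, _ in self.steps].index(name)

    def insert_before(self, name, new_name, service):
        self.steps.insert(self._index(name), (new_name, service))

    def replace(self, name, service):
        self.steps[self._index(name)] = (name, service)

    def run(self, data):
        for _, service in self.steps:
            data = service(data)
        return data


def extract_text(articles):
    # Naive extraction: sidebar text ends up mixed into the article text.
    return [{"source": a["source"], "text": a["body"] + " " + a["sidebar"]}
            for a in articles]


def strip_sidebars(articles):
    # Additional filter: blank out sidebars so only body text is extracted.
    return [dict(a, sidebar="") for a in articles]


def classify(articles):
    return [{"source": a["source"],
             "categories": {t for t in VOCAB if t in a["text"]}}
            for a in articles]


def rank_by_count(classified):
    counts = Counter()
    for a in classified:
        counts.update(a["categories"])
    return sorted(counts.items(), key=lambda p: (-p[1], p[0]))


def rank_weighted(classified):
    # Assumed weighting: article count times number of distinct sources.
    counts, sources = Counter(), {}
    for a in classified:
        for c in a["categories"]:
            counts[c] += 1
            sources.setdefault(c, set()).add(a["source"])
    scores = {c: counts[c] * len(sources[c]) for c in counts}
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))


# Made-up sample articles, for illustration only.
articles = [
    {"source": "A", "body": "earthquake hits coast", "sidebar": "NASDAQ up"},
    {"source": "B", "body": "earthquake relief begins", "sidebar": "NASDAQ up"},
    {"source": "LA", "body": "Los Angeles council vote", "sidebar": "NASDAQ up"},
    {"source": "LA", "body": "Los Angeles traffic plan", "sidebar": "NASDAQ up"},
    {"source": "LA", "body": "Los Angeles budget", "sidebar": ""},
]

pipeline = Pipeline([("extract", extract_text),
                     ("classify", classify),
                     ("rank", rank_by_count)])
before = pipeline.run(articles)

pipeline.insert_before("extract", "strip_sidebars", strip_sidebars)
filtered = pipeline.run(articles)

pipeline.replace("rank", rank_weighted)
reweighted = pipeline.run(articles)
```

In this toy data, the sidebar term tops the first ranking, the single-source city tops the second, and the topic covered by two sources tops the third. The service functions themselves are never modified; only the composition changes.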

Dynamic service coordination, both at development time and at run time, should be possible without modifying the underlying information management system. It is also desirable for end users to perform application composition and reconfiguration at a level they can understand and manage, instead of getting bogged down in technical details. In addition, the reconfiguration process should not alter the original context of the information management task. The requirements for a dynamic service coordination mechanism for processing dynamic Web content can be summarized as follows:

Composability: an information management application should be composable by selecting and combining various information sources and services.

Reusability: the composed information management applications should be reusable for similar information management cases and recurring analyses.

Reconfigurability: a coordinated service should be reconfigurable so that it can be incrementally developed and adapted for dynamic situations.

Context-sensitivity: the context of the intended information management should be maintained and ensured during the composition and reconfiguration process.


