Replica selection is interesting because it does not build on top of the core services, but rather
relies on the functions provided by the replica management component described in the preceding
section. Replica selection is the process of choosing a replica that will provide an application with data
access characteristics that optimize a desired performance criterion, such as absolute performance (i.e.,
speed), cost, or security. The selected file instance may be local or accessed remotely. Alternatively,
the selection process may initiate the creation of a new replica whose performance will be superior to
the existing ones.
Where replicas are to be selected based on access time, Grid information services can provide
information about network performance, and perhaps the ability to reserve network bandwidth, while
the metadata repository can provide information about the size of the file. Based on this, the selector
can rank all of the existing replicas to determine which one will yield the fastest data access time.
Alternatively, the selector can consult the same information sources to determine whether there is a
storage system that would result in better performance if a replica were created on it.
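The ranking step described above can be sketched as follows. This is a minimal illustration, not the selection service's actual interface: the `Site` record, the function names, and the cost model (startup latency plus file size divided by bandwidth) are all assumptions made for the example. In practice the bandwidth and latency estimates would come from Grid information services and the file size from the metadata repository. The sketch also ignores the cost of creating a new replica, which a real selector would have to weigh.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Site:
    """A storage system with network estimates as seen from the client.

    Hypothetical record; the figures stand in for what Grid information
    services would report.
    """
    name: str
    bandwidth_mbit_s: float  # estimated bandwidth to the client
    latency_s: float         # estimated startup latency


def access_time_s(site: Site, file_size_mb: float) -> float:
    """Estimated seconds to fetch the file: latency + size / bandwidth."""
    if site.bandwidth_mbit_s <= 0:
        return float("inf")  # an unreachable site should never be chosen
    return site.latency_s + file_size_mb * 8.0 / site.bandwidth_mbit_s


def rank_replicas(replica_sites: List[Site], file_size_mb: float) -> List[Site]:
    """Order existing replicas from fastest to slowest estimated access."""
    return sorted(replica_sites, key=lambda s: access_time_s(s, file_size_mb))


def better_site(replica_sites: List[Site], candidate_sites: List[Site],
                file_size_mb: float) -> Optional[Site]:
    """Return the fastest candidate that beats every existing replica, if any."""
    best_existing = min(
        (access_time_s(s, file_size_mb) for s in replica_sites),
        default=float("inf"),
    )
    faster = [c for c in candidate_sites
              if access_time_s(c, file_size_mb) < best_existing]
    if not faster:
        return None
    return min(faster, key=lambda c: access_time_s(c, file_size_mb))


# Illustrative values only; site names and numbers are made up.
replicas = [Site("site-a", 100.0, 0.5), Site("site-b", 800.0, 2.0)]
candidates = [Site("site-c", 1600.0, 1.0), Site("site-d", 50.0, 0.1)]
ranked = rank_replicas(replicas, file_size_mb=1000.0)
proposal = better_site(replicas, candidates, file_size_mb=1000.0)
```

With these made-up figures, the high-bandwidth replica is ranked first despite its higher latency, and only a candidate whose estimate beats the best existing replica is proposed as a location for a new copy.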
A more general selection service may consider access to subsets of a file instance. Scientific
experiments often produce large files containing data for many variables, time steps, or events, and some
application processing may require only a subset of this data. In this case, the selection function may
provide an application with a file instance that contains only the needed subset of the data found
in the original file instance. This can obviously reduce the amount of data that must be accessed or
transferred.
This type of replica management has been implemented in other data-management systems. For
example, STACS is often capable of satisfying requests from High Energy Physics applications by
extracting a subset of data from a file instance. It does this using a complex indexing scheme that
represents application metadata for the events contained within the file. Other mechanisms for provid-
ing similar function may be built on application metadata obtainable from self-describing file formats
such as NetCDF or HDF.
Providing this capability requires the ability to invoke filtering or extraction programs that
understand the structure of the file and produce the required subset of data. This subset becomes a
file instance with its own metadata and physical characteristics, which are provided to the replica
manager. Replication policies determine whether this subset is recognized as a new logical file (with
an entry in the metadata repository and a file instance recorded in the replica catalog), or whether
the file should be known only locally, to the selection manager.
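The flow in the preceding paragraph can be sketched as follows, under several assumptions. The classes `FileInstance`, `ReplicaManager`, and `SelectionManager`, the in-memory dictionaries standing in for the metadata repository and replica catalog, and the boolean `publish` flag standing in for a replication policy are all hypothetical and chosen only for illustration. A real filtering program would parse the file's actual structure rather than a list of records.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FileInstance:
    """A file instance: a name, its records, and descriptive metadata."""
    name: str
    records: List[dict]
    metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class ReplicaManager:
    """Toy stand-in holding the metadata repository and replica catalog."""
    metadata_repository: Dict[str, Dict[str, object]] = field(default_factory=dict)
    replica_catalog: Dict[str, List[str]] = field(default_factory=dict)

    def register(self, instance: FileInstance, location: str) -> None:
        """Record the instance as a new logical file with one replica."""
        self.metadata_repository[instance.name] = dict(instance.metadata)
        self.replica_catalog.setdefault(instance.name, []).append(location)


@dataclass
class SelectionManager:
    """Extracts subsets and applies a (hypothetical) replication policy."""
    replica_manager: ReplicaManager
    local_instances: Dict[str, FileInstance] = field(default_factory=dict)

    def subset(self, source: FileInstance, keep: Callable[[dict], bool],
               name: str, location: str, publish: bool) -> FileInstance:
        # Filtering step: keep only the records the application needs.
        records = [r for r in source.records if keep(r)]
        # The subset is a file instance with its own metadata.
        instance = FileInstance(
            name, records,
            {"derived_from": source.name, "record_count": len(records)},
        )
        if publish:
            # Policy: recognize the subset as a new logical file.
            self.replica_manager.register(instance, location)
        else:
            # Policy: keep the subset known only to the selection manager.
            self.local_instances[name] = instance
        return instance


# Illustrative data only; file and site names are made up.
source = FileInstance("run42", [
    {"variable": "temp", "t": 0},
    {"variable": "pressure", "t": 0},
    {"variable": "temp", "t": 1},
])
manager = ReplicaManager()
selector = SelectionManager(manager)


def is_temp(record: dict) -> bool:
    return record["variable"] == "temp"


private = selector.subset(source, is_temp, "run42.temp.local", "site-a", publish=False)
public = selector.subset(source, is_temp, "run42.temp", "site-a", publish=True)
```

Only the published subset appears in the catalog and metadata repository; the private one remains visible to the selection manager alone, mirroring the two policy outcomes described above.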
Data selection with subsetting may exploit Grid-enabled servers, whose capabilities involve com-
mon operations such as reformatting data, extracting a subset, converting data for storage in a different
type of system, or transferring data directly to another storage system in the Grid. The utility of this
approach has been demonstrated as part of the Active Data Repository. The subsetting function
could also exploit the more general capabilities of a computational Grid such as that provided by
Globus. This offers the ability to support arbitrary extraction and processing operations on files as
part of a data management activity.