Brain Meets Brawn: Why Grid and Agents Need Each Other
Ian Foster, Nicholas R. Jennings, Carl Kesselman
3rd Int. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS 2004), New York, USA, 2004.
Abstract
The Grid and agent communities both develop concepts and mechanisms for open distributed systems, albeit from different perspectives. The Grid community has historically focused on “brawn”: infrastructure, tools, and applications for reliable and secure resource sharing within dynamic and geographically distributed virtual organizations. In contrast, the agents community has focused on “brain”: autonomous problem solvers that can act flexibly in uncertain and dynamic environments. Yet as the scale and ambition of both Grid and agent deployments increase, we see a convergence of interests, with agent systems requiring robust infrastructure and Grid systems requiring autonomous, flexible behaviors. Motivated by this convergence of interests, we review the current state of the art in both areas, review the challenges that concern the two communities, and propose research and technology development activities that can allow for mutually supportive efforts.
1 Introduction
In open distributed systems, independent components
cooperate to achieve individual and shared goals. Both
individual components and the system as a whole are
designed to cope with change and evolution in the number
and nature of the participating entities. Such systems are
important in many contexts, from large scientific
collaborations to enterprise systems and sensor networks.
The Grid and agent communities are both pursuing the
development of such open distributed systems, albeit from
different perspectives. The Grid community [12] has
historically focused on what we refer to here as “brawn”:
interoperable infrastructure and tools for secure and
reliable resource sharing within dynamic and
geographically distributed virtual organizations (VOs)
[14], and applications of the same to various resource
federation scenarios. In contrast, those working on agents
have focused on “brains,” i.e., on the development of
concepts, methodologies, and algorithms for autonomous
problem solvers that can act flexibly in uncertain and
dynamic environments in order to achieve their aims and
objectives [21]. A key component of this research is
motivated by the fact that such agents are often required
to form themselves into collectives (i.e., VOs) and act in a
coordinated manner. This need to support aggregation
has, in turn, led to much research into rich and flexible
mechanisms for managing such interactions.
As these two communities mature and turn their
attention to fundamental problems of scope, both are
encountering challenging problems in terms of scale and
application. This maturation process is causing an
increasing overlap in the problems that they address.
Specifically, current Grid systems are somewhat rigid and
inflexible in terms of their interoperation and their
interactions, while agent systems are typically not
engineered as serious distributed systems that need to
scale, that are robust, and that are secure [34].
Nevertheless, each is working its way towards the other's
territory, as Grids seek to become more flexible and agile,
and agent systems seek to be more reliable and scalable.
Given this background, it is fruitful to examine work in
these two domains, first to communicate to each
community what has been done by the other, and second
to identify opportunities for cross fertilization. We seek to
take a first step towards that goal in this paper. To this
end, we first review the state of the art in Grids and agents
(Sections 2 and 3), compare and contrast the two
approaches (Section 4), present a common vision of
service-oriented architecture (Section 5), and conclude
with a list of significant research challenges (Section 6).
Limited time and space require that we restrict
ourselves in this article to the work being performed
within the Grid and agents communities. Thus, we do not
cover the highly relevant and interesting work pertaining
to open distributed systems that can be found in other
domains, including robotics, peer-to-peer networking,
semantic web, distributed systems, artificial intelligence,
and autonomic systems.
2 Grids
Grids aim to enable “resource sharing and coordinated problem solving in dynamic, multi-institutional VOs” [12]. In other words, Grids provide an infrastructure for federated resource sharing across trust domains. Much like the Internet on which they build, current Grids define protocols and middleware that mediate access to resources; applications use the capabilities provided by this layer to discover, aggregate, and harness resources. These applications span a wide spectrum. Moreover, the standardization of the protocols and interfaces used to construct systems is an important part of the overall research and development program.
2.1 Technologies
Grid technologies have evolved through at least three
distinct generations: early ad hoc solutions, de facto
standards based on the Globus Toolkit (GT), and the
current emergence of more formal Web services (WS)-
based standards within the context of the Open Grid
Services Architecture (OGSA) [13].
OGSA adopts WS standards such as Web Services
Description Language (WSDL) as a basis for a service-oriented
architecture within which arbitrary services can
be defined, discovered, and invoked in terms of their
interfaces rather than their implementations. This
approach provides a basis for virtualization,
interoperability, and composition.
The Grid community has participated in, and in some
cases led, the development of WS specifications that
address other Grid requirements. The WS-Resource
Framework (WSRF) defines uniform mechanisms for
defining, inspecting, and managing remote state, a crucial
concern in many settings. WSRF mechanisms underlie
work on service management (WSDM, in OASIS) and
negotiation (WS-Agreement, in GGF), efforts that are
crucial to the Grid vision of large-scale, reliable, and
interoperable Grid applications and services. Other
relevant efforts are aimed at standardizing interfaces to
data, computers, and other classes of resources.
Work on Grid-related standards is driven by, and
influences, the work of a vibrant open source community.
GT (in its most recent instantiation, Web services-based
and WSRF-compliant) provides basic middleware to
create VOs, addressing such issues as specification and
enforcement of VO-wide policy, discovery, provisioning
and management of services and resources, and
federation, replication, discovery, and movement of data.
At deployment, depending on available resources and
planned applications, specific service implementations
can be chosen and deployed, often in conjunction with
other GT-based components.
Grid technology R&D has produced specifications and
technologies for realizing service-oriented architectures
according to robust distributed system principles. Global
control mechanisms able to deal reliably with failure and
adapt to changing environmental conditions and
application concerns have been a lesser concern.
2.2 Applications
Early application drivers were largely from scientific
computing [6, 10, 19], and included large-scale
distributed computing [2, 15] (federation of computers),
integration of large-scale data repositories (data grids [7]),
collaboration [31], and tele-instrumentation [23, 26].
More recently, the technology has seen considerable
uptake in industry as a means of addressing issues of
virtualization and distributed system management [13].
GT is in production use across VOs integrating
resources from 20-50 sites with thousands of
computational and data resources, and is expected to scale
to hundreds of sites, with thousands of sites as a future goal. In the
remainder of this section, we list a few examples to show
the range and scope of Grid deployments.
The U.S. Network for Earthquake Engineering
Simulation Grid (NEESgrid) connects experimental
facilities (e.g., shake tables), data archives, computers,
and a user community of earthquake engineers. Its
service-oriented architecture defines standard interfaces
for telepresence, monitoring, and control of remote
scientific instruments, and for publishing, discovering,
and accessing data produced by these instruments [26].
NEESgrid experiments have linked facilities at three sites
and more than 50 remote participants.
Grid3 [15] links 28 sites with clusters totaling some
3000 processors. These resources are used by science
communities from high energy physics, astronomy,
biology, chemistry, and computer science for large-scale
simulation and data analysis computations.
In contrast, Access Grid [31] is focused on
interpersonal communication, via sharing of audio, video,
and applications within collaborative spaces. Grid
technologies are used in Access Grid for such purposes as
security, discovery, and resource management.
Butterfly.net is creating a GT-based provisioning
infrastructure for multiplayer online games, in which the
demands for computation, storage, and network resources
can vary dramatically as the popularity of games changes
over time [24]. As a second example of a commercial
Grid deployment, GlobeXplorer is using GT to support
integration and processing of satellite image data [17].
Experiences with such applications reveal issues that
must be addressed if Grids are to be scaled to larger
communities, more diverse resources, and more complex
applications. We review those challenges in Section 6.
3 Agent-Based Computing
An agent “is an encapsulated computer system that is
situated in some environment, and that is capable of
flexible, autonomous action in that environment in order
to meet its design objectives” [33]. In more detail [21],
agents are: (i) clearly identifiable problem solving entities
with well-defined boundaries and interfaces; (ii) situated
(embedded) in a particular environment—they receive
inputs related to the state of their environment through
sensors and they act on the environment through
effectors; (iii) designed to fulfill a specific role—they
have particular objectives to achieve and have particular
problem solving capabilities (services) that they can bring
to bear to this end; (iv) autonomous—they have control
both over their internal state and over their own behavior;
and (v) capable of exhibiting flexible problem solving
behavior in pursuit of their design objectives—they need
to be both reactive (able to respond in a timely fashion to
changes that occur in their environment) and proactive
(able to opportunistically adopt goals and take the
initiative).
When adopting an agent-oriented view of the world, it
soon becomes apparent that most problems require or
involve multiple agents: to represent the decentralized
nature of the problem, multiple loci of control, multiple
perspectives, or competing interests. Moreover, these
agents need to interact, either to achieve their individual
objectives or to manage the dependencies that ensue from
being situated in a common environment. Thus, in any
given system there may be both cooperative and selfish
agents whose aims are, respectively, to maximize the
social welfare of the system and to maximize their own
individual return. These interactions are built on some
form of semantic integration (Section 2.3), may well
involve trust relationships, and also include the traditional
service discovery and invocation discussed above, as well
as the more sophisticated social interactions related to the
ability to cooperate, coordinate and negotiate about which
services are performed by which agents at what time.
In the majority of cases, agents act to achieve
objectives either on behalf of individuals (or companies)
or as part of some wider problem solving initiative. (Note
the similarity to the VO concept.) Thus, when agents
interact there is typically some underpinning
organizational context that defines the relationship among
them. For example, agents may be peers working together
in a team or one may be the manager of the other agents.
To capture such links, agent systems typically have
explicit constructs for modeling organizational
relationships or roles such as peer, manager, or team
member. In many cases, these relationships are subject to
ongoing change: social interaction means existing
relationships evolve (e.g., a team of peers may elect a
leader) and new relations are created (e.g., a number of
agents may form a VO to deliver a particular service that
no one individual can offer). The temporal extent of these
relationships can also vary enormously: from just long
enough to deliver a particular service once, to a
permanent bond.
Whatever the nature of the social process, there are
two points that qualitatively differentiate agent
interactions from those that occur in other computational
models. First, agent-oriented interactions tend to be more
sophisticated than in other contexts, dealing, for example,
with notions of cooperation, coordination, and
negotiation. Second, agents are flexible problem solvers,
operating in an environment over which they have only
partial control and observability. Thus, interactions need
to be handled in a similarly flexible manner, and agents
need the computational apparatus to make context-dependent
decisions about the nature and scope of their
interactions and to initiate (and respond to) interactions
that were not foreseen at design time. The downside of
this autonomy and flexibility, however, is that it is
difficult to ensure that desirable global behaviors emerge.
To this end, a range of techniques (such as reinforcement
learning, mechanism design, and electronic institutions)
are often deployed to try to impose greater order.
Drawing these points together, Figure 1 shows that
adopting an agent-oriented approach to system
engineering means decomposing the problem into
multiple, interacting, autonomous components that have
particular objectives to achieve and are capable of
performing particular services. The key abstraction
models that define the agent-oriented mindset are agents,
interactions and organizations. Finally, explicit structures
and mechanisms are often used to describe and manage
the complex and changing web of organizational
relationships that exist between the agents.
Figure 1: Canonical view of a multiagent system
3.1 Technologies
In contrast to Grid computing, there is less focus on
identifiable agent technologies that can be used
off the shelf to build applications. Traditionally, more
attention has been given to theories and models of how
agents can be developed and how they can communicate,
cooperate, and negotiate. This work has resulted in the
development of a range of algorithms that can be used
both to build individual agents and to manage their
interactions. In the former case, algorithms and
architectures have been developed that enable an agent to
plan an effective course of action to achieve a goal in
uncertain and unpredictable environments, to adapt its
behavior to its prevailing circumstances, and to strike an
effective balance between being too responsive (and
continually changing its aim such that no task is ever
completed) and too committed to its present course of
action (such that more important activities are not dealt
with in a timely fashion). In the latter case, algorithms
have been developed that agents can use to achieve
efficient negotiation outcomes, to form teams composed
of the optimal set of parties, and to determine the degree
of trust that should be placed in a particular agent, based
upon its social and organizational relationships.
There has recently been an increasing trend towards
making agent technology a serious basis for building
complex, distributed systems. Several agent development
environments support specific agent architectures and
provide libraries of interaction protocols (e.g., JACK,
JADE, Cougaar, and ZEUS), software engineering
methodologies have been devised to analyze and design
agent-based systems (e.g., Gaia, Tropos, and AUML), and
there have been efforts to standardize various aspects of
agent systems, such as inter-agent communication (e.g.,
FIPA, KQML). Moreover, as in the Grid community,
there is an increasing reliance on Web services and
semantic web technologies for providing the
computational infrastructure for such systems and an
increasing acceptance of the importance of trust as a
central issue in interaction.
3.2 Applications
Agent technology has been deployed in a number of isolated applications over the past ten years. However, in the past few years the number and range of applications have increased significantly. In particular, many large companies are now interested in developing applications using agent technologies, and deployed applications exist for domains such as manufacturing, electronic commerce, process control, telecommunication systems, traffic and transportation management, information filtering and gathering, business process management, defense, entertainment, and medical care [25].
4 Brains and Brawn
We see that a common thread underlies both agents and
Grids, namely, the creation of communities or VOs bound
together by a common goal or cause. Yet the two
communities have focused on different aspects of this
common problem. In the case of Grids, the primary
concern has been the mechanisms by which communities
form and operate. Thus, we see much effort devoted to
how community standards are represented via explicit
policy, how policy is enforced, how community members
identify one another, how actions within the community
are implemented, and how commitments by community
members are specified, monitored and enforced. On the
other hand, our understanding of how to use these
mechanisms to create large-scale systems with stable
collective behavior is less mature. For example,
commonly used Grid tools provide uniform mechanisms
for accessing data on different storage systems, but not for
the semantic integration of that data; for accessing service
and resource state, but not for anticipating, detecting, and
diagnosing problems implied by changes to that state; and
for securely authenticating users and services, but not for
inferring whether or not specific users or services can be
trusted to perform specific actions. To this extent, Grids
are all brawn and no brain.
Agents also focus on creating community. Out of the
flexible local decision making of system components,
sensible community-wide behaviors emerge through rich
social interactions and explicit organizational structures.
However, in building all this flexibility and sophistication,
scant attention has been paid to how these tasks should be
performed in realistic distributed environments. For
example, agent frameworks provide sophisticated internal
reasoning capabilities, but offer no support for secure
interaction or service discovery; cooperation algorithms
produce socially optimal outcomes, but assume the agents
have complete knowledge of all outcomes that any
potential grouping can produce; and negotiation
algorithms achieve optimal outcomes for the participating
agents, but assume that all parties in the system are
known at the outset of the negotiation and will not change
during the system’s operation. Thus, one may say that
agents are all brain and no brawn.
Clearly, neither situation is ideal: for Grids to be
effective in their goals, they must be imbued with
flexible, decentralized decision making capabilities.
Likewise, agents need a robust distributed computing
platform that allows them to discover, acquire, federate,
and manage the capabilities necessary to execute their
decisions. In other words, there are good opportunities for
exploiting synergies between Grids and agents.
One approach to exploiting such synergies might be a
simple layering of the technologies, i.e., to implement
agent systems on top of Grid mechanisms. However, it
seems more likely that the true benefits of an integrated
Grid/agent approach will only be achieved via a more
fine-grain intertwining of the two technologies, with Grid
technologies becoming more agent-like and agent-based
systems becoming more Grid-like.
As an early example of such a tighter coupling, we
point to work on agent-based resource selection, in which
reinforcement learning is used to drive the
assignment of tasks to resources [16]. In this case, the
“agent” (i.e., the logic used to make the task assignment
decisions) uses Grid functions for status monitoring,
resource discovery, and task submission. The agent, in
turn, provides a valuable Grid function, with the
collection of agents implementing a robust global
resource management behavior that might not otherwise
be achieved. A second example is the use of automated
negotiation techniques (specifically, various forms of
auctions) to allocate resources in Grid systems [32]. Here,
designers evaluate the effectiveness of both commodity
market and Vickrey auction protocols for the problem of
allocating resources within a distributed system. This
example also shows how techniques familiar to agents
researchers can be integrated with other more standard
components within a Grid architecture.
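To make the auction idea concrete, the following Python sketch shows a sealed-bid second-price (Vickrey) auction for a single resource slot. It is illustrative only and is not taken from [32]: the function name, the reserve-price parameter, and the alphabetical tie-break are assumptions made for this example.

```python
from typing import Dict, Optional, Tuple


def vickrey_auction(bids: Dict[str, float],
                    reserve: float = 0.0) -> Optional[Tuple[str, float]]:
    """Run a sealed-bid second-price auction for one resource slot.

    Returns (winner, price), or None if no bid meets the reserve price.
    """
    # Keep only bids at or above the reserve price.
    valid = {name: amount for name, amount in bids.items() if amount >= reserve}
    if not valid:
        return None
    # Rank bidders by bid, highest first; ties are broken by name (an assumption).
    ranked = sorted(valid.items(), key=lambda item: (-item[1], item[0]))
    winner = ranked[0][0]
    # The winner pays the second-highest bid, or the reserve if unopposed.
    price = ranked[1][1] if len(ranked) > 1 else reserve
    return winner, price
```

Because the price paid does not depend on the winner's own bid, bidding one's true valuation is a dominant strategy in this protocol, which keeps the bidding logic of each participant simple.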
This level of integration will undoubtedly create new
challenges for both agents and Grids. However, the result
could be frameworks for constructing robust, large-scale,
agile distributed systems that are qualitatively and
quantitatively superior to current best practice.
5 Robust Agile Service-Oriented Systems
Having described key agent and Grid concepts, we now draw the two parallel lines of research together to highlight their commonalities and complementarities.
5.1 Autonomous Services
A core unifying concept that underlies Grids and agent
systems is that of a service: an entity that provides a
capability to a client via a well-defined message exchange
[4]. Within third-generation Grids, service interactions are
structured via Web service mechanisms, and thus all
entities are services. However, while every agent can be
considered a service (in that it interacts with other agents
and its environment via message exchanges), we might
reasonably state that not every Grid service is necessarily
an agent (in that it may not participate in message
exchanges that exhibit flexible autonomous actions).
This notion of autonomous action is thus central to the
question of how agents and Grids can interoperate. To
illustrate the issues, let us consider a service that
encapsulates a database. In a local area network, we might
find a version of this service that responds to requests to
“read a record” or “write a record.” Such an
implementation does not exhibit autonomous behavior.
On the other hand, in a more distributed,
administratively heterogeneous, and failure-prone
environment, the implementation of such a service might
exhibit more sophisticated behavior. For example, the
database might be replicated, with the number of replicas
determined dynamically by knowledge-based models of
system reliability and performance. Distributed
negotiation protocols might be used to establish the query
throughput achievable on individual copies, such that
community throughput is optimized. Finally, distributed
planning and scheduling algorithms might be used to map
queries to specific database replicas so as to minimize the
latency of user requests. In all these cases, a robust
database service, designed to operate in an open
distributed system, is exhibiting flexible autonomous
actions (in the sense that its behaviors are not driven
solely by a client request, but also by other
considerations, including local policies and the outcomes
of negotiations with the client). In short, such services
will exhibit agent behavior.
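As a rough illustration of the last of these behaviors, the sketch below maps queries to replicas with a simple greedy rule: each query goes to the replica predicted to finish it earliest. This is a deliberately simplified stand-in for the distributed planning and scheduling algorithms mentioned above; the cost and speed estimates, and all names, are hypothetical.

```python
from typing import Dict, List, Tuple


def assign_queries(query_costs: List[float],
                   replica_speeds: Dict[str, float]) -> List[Tuple[int, str]]:
    """Greedily map each query to the replica that would finish it earliest.

    query_costs: estimated work per query (arbitrary units).
    replica_speeds: work units each replica processes per second.
    """
    # Seconds of work already queued on each replica.
    busy_until = {name: 0.0 for name in replica_speeds}
    plan = []
    for index, cost in enumerate(query_costs):
        # Predicted completion time on each replica, given its current queue.
        finish = {name: busy_until[name] + cost / replica_speeds[name]
                  for name in replica_speeds}
        # Pick the earliest finisher; ties go to the alphabetically first name.
        best = min(sorted(finish), key=finish.get)
        busy_until[best] = finish[best]
        plan.append((index, best))
    return plan
```

A service built this way responds to a client request, but which replica answers depends on the service's own model of load and performance rather than on the request alone.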
5.2 Rich Service Models
Both agent and Grid systems consist of dynamic and
stateful services. The underlying service model is
dynamic in that new services can be created and
destroyed over the lifetime of the system. Here an
important contribution of Grid technologies is a robust
lifetime and naming model for dynamic services [13].
Implicit in this model are the notion of service failure and
the definition of a scalable distributed systems semantics.
In contrast, agent-based systems rarely consider such
issues, but they could clearly benefit from exploiting this
approach to representing and managing dynamic services.
Statefulness is another important aspect of the service
model. A stateful service (or, more-or-less equivalently, a
resource [11]) has internal state that persists over multiple
interactions. It can often be useful to make this state
externally visible, so that, for example, another participant
in a distributed system can determine the current load on a
server, the policies that govern access to a service, and/or
the schema(s) supported by a database. Again, Grid
technologies have addressed this issue, defining a general
model for representing and querying service state [11].
This model includes mechanisms for describing state
“lifetime”, as well as a means of specifying and enforcing
policy with respect to access and modification.
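The ingredients of this state model (externally visible state, a lifetime, and an access policy) can be sketched in a few lines of Python. The class below is an assumption-laden illustration, not the WSRF interface: the class name, method names, and the simple reader set used as a policy are all invented for this example.

```python
import time
from typing import Any, Dict, Optional, Set


class StatefulResource:
    """A resource with queryable properties, a lifetime, and a read policy."""

    def __init__(self, properties: Dict[str, Any], lifetime_s: float,
                 readers: Set[str], now: Optional[float] = None) -> None:
        start = time.time() if now is None else now
        self._properties = dict(properties)
        self._expires_at = start + lifetime_s
        self._readers = set(readers)  # access policy: who may inspect state

    def is_alive(self, now: Optional[float] = None) -> bool:
        """Report whether the resource's lifetime has not yet expired."""
        current = time.time() if now is None else now
        return current < self._expires_at

    def extend_lifetime(self, extra_s: float) -> None:
        """Push the expiry time further into the future."""
        self._expires_at += extra_s

    def get_property(self, caller: str, name: str,
                     now: Optional[float] = None) -> Any:
        """Return one state property, enforcing lifetime and read policy."""
        if not self.is_alive(now):
            raise LookupError("resource lifetime has expired")
        if caller not in self._readers:
            raise PermissionError(f"{caller} may not read this resource")
        return self._properties[name]
```

A client could, for instance, read a hypothetical "load" property to decide whether to submit work, and the resource would disappear from view once its lifetime lapsed unless some party extended it.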
The Grid state model defines how state is represented
and accessed, but does not speak to the structure or
semantics of the state that is thus exposed. Typical
practice is to define state in terms of fixed schema or
attributes. In contrast, agent systems address semantics
but do not provide a consistent state model. An integrated
approach can allow for the publication of richer semantic
information within the Grid state model, thus enhancing
the ability of applications to discover, configure, and
manage services in an interoperable manner [18].
5.3 Negotiation and Service Contracts
Negotiation is emblematic of the brain/brawn schism
between current Grid and agent systems. In general, it
cannot be assumed that a service will actually provide a
particular capability to a user: a provider may be unable
or unwilling to provide the service to a putative consumer.
Hence, if the system is to have any type of predictable
behavior, it becomes necessary to obtain commitments
(contracts) about the willingness to provide a service and
the characteristics, or quality, of its provision.
Given the ability to provision a resource to provide a
desired level of service, we are faced with the question of
exactly what levels of service can and should be obtained.
The process by which this is determined will necessarily
be some form of negotiation, since the autonomous
entities involved need to come to a mutually acceptable
agreement on the matter. If this negotiation is successful
(i.e., both parties come to an agreement) then the outcome
of the procurement is a contract (service level agreement)
between the service provider and the service consumer.
This negotiation can be arranged in many different
ways; there are many protocols, with varying
properties, and agent researchers have invested significant
effort in determining which protocols are appropriate in
which circumstances [9]. In this context, the negotiation is
driven by the operational policy of both the service
provider and the service consumer. Specifically, policy
terms to be considered may involve aspects such as the
current load, the identity and reputation of the requestor,
and the requestor’s ability to pay.
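One very simple protocol of this kind can be sketched as follows: a provider and a consumer alternate concessions on price until their offers cross. The sketch is only one of the many possible protocols, and everything in it is an assumption made for illustration, including the pricing policy that derives the provider's private limit from load and requestor reputation.

```python
from typing import Optional


def provider_floor(base_price: float, load: float, reputation: float) -> float:
    """Hypothetical policy: charge more when loaded, less to reputable requestors.

    load and reputation are both assumed to lie in [0, 1].
    """
    return base_price * (1.0 + load) * (1.5 - 0.5 * reputation)


def negotiate_price(floor: float, ask: float, ceiling: float, bid: float,
                    step: float = 1.0, max_rounds: int = 50) -> Optional[float]:
    """Alternate concessions until the bid meets the ask, or give up.

    floor: lowest price the provider will accept (private).
    ceiling: highest price the consumer can pay (private).
    Returns the agreed price (a contract term), or None if there is no deal.
    """
    for _ in range(max_rounds):
        if bid >= ask:
            # Offers have crossed: settle at the midpoint.
            return (ask + bid) / 2
        # Each side concedes by one step but never past its private limit.
        new_ask = max(floor, ask - step)
        new_bid = min(ceiling, bid + step)
        if new_ask == ask and new_bid == bid:
            return None  # both at their limits and still apart
        ask, bid = new_ask, new_bid
    return None
```

The agreed price always lies between the provider's floor and the consumer's ceiling, so neither party is forced outside its own policy; when the two limits do not overlap, the negotiation ends without a contract.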
The use of negotiation as a means of establishing
service contracts is a topic of considerable interest in both
the agent [22] and Grid [8] communities. One promising
approach within Grids has been to represent agreement as
the creation of a shared policy statement and to define
robust extensible protocols for exchanging and agreeing
to policy terms. Creating these agreements in the face of a
Byzantine failure model can be complex. Having
designed such protocols, the next step is to determine the
strategy that the system components should adopt to
achieve their policy objectives. Strategies can vary from
the simple (e.g., an agent bidding its true valuation for a
service) to the complex (reasoning about the other
participants and their likely strategies).
5.4 Virtual Organization Management
A common interaction modality in both Grid and agent
systems occurs when several agents come together to
form a new VO. Such VOs can be viewed as a form of
dynamic service composition: a number of initially
distinct entities come together, under a set of operating
conditions, to form a new entity that offers a new service.
In such cases, one of the key challenges is for the
participating agents to determine who else should be
involved in the coalition and what their various roles and
responsibilities should be. Again, this activity typically
involves negotiation among participants, in this case to
determine a mutually acceptable agreement concerning
the division of labor and responsibilities.
Dynamic creation also raises the issue of service
discovery. Experience in the Grid community indicates
that this discovery should not simply be on the basis of
service type, but rather should incorporate notions of
service state and should be based on an understanding of
the capabilities of the service (i.e., semantics). While Grid
technologies provide the means for describing and
grouping services, these higher level matchmaking and
discovery capabilities are not currently part of Grid
infrastructure. Fortunately, this is an area where much
work has been done in the space of agents, and thus
incorporation of this technology would do much to
improve matters. This integration may have an impact on
how state is represented and how services are organized.
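The difference between type-only discovery and discovery that also considers state and capabilities can be shown with a small matchmaking sketch. The record layout, the use of a capability set as a stand-in for richer semantic descriptions, and the "load" state attribute are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceRecord:
    """A registry entry: type, advertised capabilities, and current state."""
    name: str
    service_type: str
    capabilities: Set[str] = field(default_factory=set)
    state: Dict[str, float] = field(default_factory=dict)


def discover(registry: List[ServiceRecord], service_type: str,
             required: Set[str], max_load: float) -> List[str]:
    """Return names of matching services, least loaded first."""
    matches = [
        record for record in registry
        if record.service_type == service_type
        and required <= record.capabilities            # capability match
        and record.state.get("load", 1.0) <= max_load  # state-based match
    ]
    # Services that report no load are treated as fully loaded (an assumption).
    matches.sort(key=lambda record: (record.state.get("load", 1.0), record.name))
    return [record.name for record in matches]
```

A query by type alone would return every database service in the registry; adding the capability and state filters narrows the result to services that can actually do the job now.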
5.5 Authentication, Trust, and Policy
Establishing identity and authenticating with dynamically created services has long been an
integral part of Grid infrastructure. A common approach
to this problem is to map identities into a global
namespace and then apply delegation as a means for
building federated namespaces for dynamically created
entities. More recent work has focused on the application
of richer policy statements and the creation of community
based authorization and assertion authorities [27].
Also fundamental to the creation of collaboration and
community, and building upon the aforementioned
notions of authentication, are notions of trust. The
effective management of trust and policy within a
community, like VO formation, requires flexible,
autonomous mechanisms able to consider, when
organizing communities, not only the semantics of policy
statements but also the ability to negotiate policy terms
and to manage restricted delegation of rights.
As with other aspects of agents and Grids, we expect
to see the adaptation of agent algorithms and technologies
as they incorporate policy specification and enforcement
into their basic operations and we expect to see Grid
algorithms make use of some of the richness of the
various agent trust and reputation models that have been
developed [28]. We also expect that the types of policy
statements made, along with how they are disseminated
and applied, will evolve as agent-based techniques
become more completely integrated into Grids. For
example, reputation-based authentication mechanisms,
which lend themselves to agent-based implementations,
show great promise in the Grid environment.
6 Ten Research Problems
We conclude by outlining ten areas (in no particular
order) in which research is needed to realize an integrated
agent-Grid approach to open distributed systems.
Service architecture. The convergence of agent and Grid
concepts and technologies will be accelerated if we can
define an integrated service architecture providing a
robust foundation for autonomous behaviors. This
architecture would define baseline interfaces and
behaviors supporting dynamic and stateful services, and a
suite of higher-level interfaces and services codifying
important negotiation, monitoring, and management
patterns. The definition of an appropriate set of such
architectural elements is an important research goal in its
own right, and, in addition, can facilitate the creation,
reuse, and composition of interoperable components.
Trust negotiation and management. All but the most
trivial distributed systems involve interactions with
entities (services) that one does not fully trust. Thus,
authorization decisions must often be made in
the absence of strong existing trust relationships. Grid
middleware addresses secure authentication, but not the
far harder problems of establishing, monitoring, and
managing trust in a dynamic, open, multi-valent system.
We need new techniques for expressing and reasoning
about trust. Reputation mechanisms [29] and the ability to
integrate assertions from multiple authorities (“A says M
can do X, but B disagrees”) will be important in many
contexts, with the identity and/or prior actions of an entity
requesting some action or asserting some fact being as
important as other metrics, such as location or willingness
to pay. Trust issues can also impinge on data integration,
in that our confidence in the “data” provided by an entity
may depend on our trust in that entity, so that, for
example, our confidence in an assertion “A says M is
green” depends on our past experiences with A.
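To make the last point concrete, the following is a minimal sketch, under our own illustrative assumptions, of how confidence in a disputed assertion might be weighted by past experience with each asserting authority. The names Reputation and confidence, and the simple beta-style estimate (good + 1) / (good + bad + 2), are hypothetical choices made for illustration; they are not drawn from the reputation mechanisms cited above or from any existing Grid or agent toolkit.

```python
from dataclasses import dataclass


@dataclass
class Reputation:
    """Hypothetical record of past experiences with one authority."""
    good: int = 0
    bad: int = 0

    def record(self, outcome_ok: bool) -> None:
        # Update the history after checking one of the authority's claims.
        if outcome_ok:
            self.good += 1
        else:
            self.bad += 1

    @property
    def trust(self) -> float:
        # Mean of a Beta(good + 1, bad + 1) distribution: 0.5 with no
        # history, moving toward 1.0 or 0.0 as evidence accumulates.
        return (self.good + 1) / (self.good + self.bad + 2)


def confidence(assertions, reputations):
    """Trust-weighted support for a claim, in the range [0, 1].

    assertions:  dict mapping authority name -> True (asserts the claim)
                 or False (disputes it).
    reputations: dict mapping authority name -> Reputation.
    Unknown authorities receive the neutral trust of an empty history.
    """
    neutral = Reputation()
    total = 0.0
    support = 0.0
    for authority, says_yes in assertions.items():
        weight = reputations.get(authority, neutral).trust
        total += weight
        if says_yes:
            support += weight
    return support / total if total else 0.5


# "A says M is green, but B disagrees": A has been right 8 times out of 8,
# B has been right once out of 4 (illustrative numbers only).
reputations = {"A": Reputation(good=8, bad=0), "B": Reputation(good=1, bad=3)}
print(confidence({"A": True, "B": False}, reputations))
```

A scheme of this kind captures only the dependence on past experience; it says nothing about negotiation, delegation, or how reputations are shared among parties, which remain the harder problems.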
System management and troubleshooting. Grid
technologies make it feasible to access large numbers of
resources securely, reliably, and uniformly. However, the
coordinated management of these resources requires new
abstractions, mechanisms, and standards for the
quasi-automated (“autonomic” [20]) management of the
ensemble in the face of multiple, perhaps competing,
objectives from different parties and complex failure
scenarios. A closely related problem is troubleshooting,
i.e., detecting, diagnosing, and ultimately responding to
the unexpected behavior of an individual component in a
distributed system, or indeed of the system as a whole.
This requirement will motivate the development of robust
and secure logging and auditing mechanisms. The
registration, discovery, monitoring, and management of
available logging points, and the development of
techniques for detecting and responding to “trouble” (e.g.,
overload or fraud), remain open problems. We also
require advances in the summarization and explanation
(e.g., visualization) of large-scale distributed systems.
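As a simple illustration of what detecting “trouble” might involve at a single logging point, the sketch below flags a load reading as anomalous when it exceeds the recent mean by more than k standard deviations. The class name OverloadDetector, the window size, the threshold k, and the min_sigma floor are all assumptions introduced for this example; a realistic mechanism would need to correlate many such points across administrative domains.

```python
from collections import deque
from statistics import mean, pstdev


class OverloadDetector:
    """Hypothetical per-resource detector based on a sliding window."""

    def __init__(self, window: int = 5, k: float = 3.0, min_sigma: float = 0.5):
        self.history = deque(maxlen=window)  # most recent normal readings
        self.k = k                           # threshold in standard deviations
        self.min_sigma = min_sigma           # floor to avoid a zero-width band

    def observe(self, load: float) -> bool:
        """Return True if `load` looks anomalous relative to recent history."""
        anomalous = False
        if len(self.history) == self.history.maxlen:
            mu = mean(self.history)
            sigma = max(pstdev(self.history), self.min_sigma)
            anomalous = load > mu + self.k * sigma
        if not anomalous:
            # Only normal readings update the baseline, so a sustained
            # overload does not quietly become the new "normal".
            self.history.append(load)
        return anomalous


detector = OverloadDetector()
for reading in [10, 11, 9, 10, 10, 11, 50, 10]:
    if detector.observe(reading):
        print("possible overload:", reading)
```

The sketch addresses detection only; diagnosis and response, and the registration and discovery of the logging points themselves, are outside its scope.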
Negotiation. We have already discussed negotiation at
some length; here we simply note that major open
problems remain in this vital area.
Service composition. The realization of a specific user or
VO requirement may require the dynamic composition of
multiple services. Web service technologies define
conventions for describing service interfaces and
workflows, and WSRF provides mechanisms for
inspecting service state and organizing service
collections. Yet we need far more powerful techniques for
describing, discovering, composing, monitoring,
managing, and adapting such service collections.
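The simplest form of dynamic composition can be sketched as a search over service descriptions. In the example below, each service is described only by the type of data it consumes and the type it produces, and a breadth-first search finds a shortest chain from what the user has to what the user wants. The Service record, the compose function, and the service and type names are hypothetical; they do not correspond to Web service or WSRF interfaces, and real descriptions would also carry state, policy, cost, and quality-of-service information.

```python
from collections import deque
from typing import NamedTuple


class Service(NamedTuple):
    """Hypothetical, deliberately minimal service description."""
    name: str
    consumes: str
    produces: str


def compose(services, have, want):
    """Return the names of a shortest service chain from `have` to `want`.

    Returns an empty list if no service is needed and None if no chain exists.
    """
    queue = deque([(have, [])])
    seen = {have}
    while queue:
        data_type, chain = queue.popleft()
        if data_type == want:
            return chain
        for service in services:
            if service.consumes == data_type and service.produces not in seen:
                seen.add(service.produces)
                queue.append((service.produces, chain + [service.name]))
    return None


# Illustrative registry; the names are invented for this example.
registry = [
    Service("align", "raw_reads", "alignment"),
    Service("call", "alignment", "variants"),
    Service("plot", "alignment", "figure"),
]
print(compose(registry, "raw_reads", "variants"))
```

Type matching of this kind covers discovery and composition only in the narrowest sense; monitoring, managing, and adapting the resulting collection as services fail or change are where the more powerful techniques called for above are needed.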
VO formation and management. While the notion of a
VO seems to be intuitive and natural, we still do not have
clear definitions of what constitutes a VO or well-defined
procedures for deciding when a new VO should be
formed, who should be in that VO, what they should do,
when the VO should be changed, and when the VO
should ultimately be disbanded.
System predictability. While open distributed systems
are inherently unpredictable, it can be important to
provide guarantees about system performance (e.g.,
liveness or safety properties, or stochastic performance
bounds). However, such guarantees require a deeper
understanding of emergent behavior in complex systems.
Human-computer collaboration. Many VOs will be
hybrids in which some problem solving is undertaken by
humans and some by programs. These components must
interwork in a seamless fashion to achieve their aims.
New collaboration models are necessary to capture the
rich social interplay in such hybrid teams.
Evaluation. Meaningful comparison of new approaches
and technologies requires the definition of appropriate
benchmarks and challenge problems and the creation of
environments in which realistic evaluation can occur.
Perhaps the single most effective means of advancing
agent-Grid integration is the definition of
appropriately attractive challenge problems. Such
problems should demand both the brawn of Grid and the
brains of agents, and should define rigorous metrics that
can be used to drive development in both areas. Potential
challenge problems might include the distributed
monitoring and management of large-scale Grids, and
robust and long-lived operation of agent applications.
Evaluation can occur in both simulated and physical
environments. Rapid progress has been made in
simulation systems for both agents and Grids (e.g., [30]).
Production deployments such as Grid3 [15], TeraGrid [5],
and NEESgrid [26], and testbeds such as PlanetLab [1],
are potentially available as experimental platforms for the
evaluation of converged systems, for example within the
context of the challenge problems just mentioned.
Semantic integration. Open distributed systems involve
multiple stakeholders that interact to procure and deliver
services. Meaningful interactions are difficult to achieve
in any open system because different entities typically
have distinct information models. Advances are required
in such interrelated areas as ontology definition, schema
mediation, and semantic mediation [3]. Again, issues of
trust and cost have vital roles to play.
Acknowledgments
The work of the first author was supported in part by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38. The second author acknowledges the support of the EPSRC project “Virtual organisations for e-Science” (GR/S62710/01).