Skip to content

Conversation

@challenger71498
Copy link

Please ensure you have read the contribution guide before creating a pull request.

Link to Issue or Description of Change

Link to an existing issue:

Problem:

Since I already described the issue here, I will just briefly introduce the problem in short.

When multiple runners which shares the same agent are running in parallel, first finished runner closes the toolset.
This make other runners throw resource-related exception.

This may occur in 2 scenarios:

  • executing Runner concurrently, agent of the tool is with stateful toolset.
  • calling AgentTool in parallel, agent of the tool is with stateful toolset(e.g. McpToolset).

I will attach 2 diagrams which describes the second scenario, since this PR only fixes that.
You may find other diagrams at the issue I've mentioned.

Diagrams

Assume that the root agent is configured with AgentTool. The agent of AgentTool uses McpToolset.

playground-adk-parallel-agent-tool-1 drawio

If SubRunner 1 finishes its work first, it closes the toolset. If SubRunner 2 is still running and waiting responses of tool call, the tool call will end up raising the exception, indicates that the resource has been exhausted.

playground-adk-parallel-agent-tool-2 drawio

Solution:

I managed to solve the second problem, calling AgentTool in parallel, by introducing a singleton AgentToolManager which basically does reference-couting between the agent and the runner.

  • When AgentTool::run_async invoked, right after when it initialized the runner, it registers the runner to AgentToolManager with the agent it is running.
  • AgentToolManager has its own dictionary matching the agent to runners. By registering the runner, it simply adds it to the dictionary.
  • After the runner finished its work, AgentTool deregisters the runner.
  • When deregistering happens, the manager returns an Boolean type of AsyncContextManager to hold the lock until the actual closing logic is finished.
    • Its boolean flag indicates that whether toolsets should be cleaned up or not.
  • Current interface of Runner.close() does not allow any options, so I added a cleanup_toolsets to propagate the flag.
    • The default value of the parameter is True, which is the behaviour AS-IS. Therefore it will not cause any side-effects.

This behaviour lets AgentTool to close toolset only when there is no more runners using the toolset, ensures that there should no more early-release of the shared resource.

I will attach diagrams below to illustrate the solution more readable.

Solution with diagrams

This diagram illustrates the flow of the AgentTool::run_async with AgentToolManager.

playground-adk-agent-tool-manager-1 drawio

When AgentTool executes the runner, it registers the runner to the AgentToolManager.
Since both shares the agent, their key is same.

playground-adk-agent-tool-manager-2 drawio

Assume that SubRunner 1 has finished its job first.
When the runner finished its work, AgentTool deregisters the runner from AgentToolManager.
AgentToolManager returns False since there is a runner still running.

playground-adk-agent-tool-manager-3 drawio

AgentTool calls runner.close() with cleanup_toolsets to False, prevents toolset cleanup when closing the runner.
SubRunner 2, waiting the tool call response, will have no effect on it.

playground-adk-agent-tool-manager-4 drawio

Now SubRunner 2 is done, AgentTool deregisters it from AgentToolManager.

playground-adk-agent-tool-manager-5 drawio

No more runners are registered to that agent, AgentToolManager returns True.
AgentTool calls runner.close() with cleanup_toolsets to True. It actually closes the toolset.

playground-adk-agent-tool-manager-6 drawio

Now all AgentTool calls are finished; root agent proceeds with parallel tool call results.

playground-adk-agent-tool-manager-7 drawio

Testing Plan

I made 2 sample agents at contributing/samples/adk_concurrent_agent_tool_call which reproduces the issue in a different way.

  • agent_tool_parallel: calling AgentTool in parallel.
  • runner_shared_toolset: executing Runner concurrently.

Both raises exception AS-IS.
With this PR, agent_tool_parallel does not raise exception, runner_shared_toolset still does.

Also I made a unit test for the new component, AgentToolManager.
The unit test is focused on testing the manager when multiple tasks are accessing the manager concurrently.

Unit Tests:

  • I have added or updated unit tests for my change.
  • All unit tests pass locally.
========================================================================================= 3693 passed, 1 skipped, 2208 warnings in 57.43s ==========================================================================================

Manual End-to-End (E2E) Tests:

Tested by:

  • Executing 2 sample agents that I've made
  • E2E test by using dedicate application server which utilizes Google ADK.

Checklist

  • I have read the CONTRIBUTING.md document.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.
  • I have manually tested my changes end-to-end.
  • Any dependent changes have been merged and published in downstream modules. No deps changed

Additional context

Caveats: what if the newborn runner re-uses the already-closed-toolset?
Well the answer is, depends. At least McpToolset is not affected by reusing closed session.
It is miles better at least, since the code AS-IS will not allow multiple runners to run simultaneously.

I will describe details in below.

Case of toolset re-use after close

This IS a problem since the behaviour of resurrecting the already-closed toolset is unknown. It depends whether the toolset supports it or not.

playground-adk-agent-tool-manager-limitations-1 drawio

At least McpToolset is fine, closing McpToolset basically just closes McpSessionManager, and what that does is just closing all sessions alive. Creating new session on a once-close-has-been-called manager is no problem with it.

playground-adk-agent-tool-manager-limitations-2 drawio

Also to solve this whole issue perfectly including concurrency problem with multiple runners, we need to handle the lifecycle of the toolset properly.

But I have not figured out whether should we separate the lifecycle of the toolset to the runner lifecycle. In some point it seems like the toolset lifecycle should be separated, in other point it should not..

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @challenger71498, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical concurrency bug where shared toolsets were being prematurely closed when multiple AgentTool calls or runners operated in parallel, causing resource exhaustion errors. The solution introduces a new AgentToolManager that intelligently tracks active runners for each agent, ensuring that toolsets are only deallocated when no active components are utilizing them. This prevents interruptions and stabilizes parallel execution flows within the system.

Highlights

  • Problem Addressed: This PR fixes a concurrency issue where toolsets are prematurely closed when multiple runners or AgentTool instances share the same agent and run in parallel, leading to resource-related exceptions.
  • New Component: AgentToolManager: A singleton AgentToolManager has been introduced to manage the lifecycle of toolsets. It uses reference counting to track active runners associated with a given agent.
  • Controlled Toolset Cleanup: AgentTool now registers runners with the AgentToolManager upon invocation and deregisters them upon completion. The manager determines if a toolset should be cleaned up (i.e., if it's the last active runner for that agent) and passes this decision via a new cleanup_toolsets flag to the Runner.close() method.
  • Updated Runner.close(): The Runner.close() method has been modified to accept a cleanup_toolsets boolean parameter, allowing conditional cleanup of toolsets. Its default value is True to maintain existing behavior.
  • New Sample Agents and Unit Tests: Two new sample agents (agent_tool_parallel and runner_shared_toolset) have been added to demonstrate and test the concurrency scenarios. Comprehensive unit tests for the AgentToolManager have also been included, focusing on concurrent registration and unregistration.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical concurrency issue where toolsets were being cleaned up prematurely when multiple AgentTool instances run in parallel. The introduction of a singleton AgentToolManager to perform reference counting on runners is a solid and well-implemented solution. The changes are thoughtfully integrated, with backward compatibility maintained in the Runner.close method. The addition of extensive unit tests, especially those covering various concurrency scenarios, and the inclusion of new sample tests to demonstrate the fix, significantly increase confidence in the change. This is a high-quality contribution that improves the robustness of the library.

Comment on lines +59 to +62
# TODO: can we use the name of the agent as the key?
if id(agent) not in self._runners_by_agent:
self._runners_by_agent[id(agent)] = set()
self._runners_by_agent[id(agent)].add(runner)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The TODO here raises a good question. Using id(agent) is the correct approach to track runners on a per-agent-instance basis. If different agent instances share the same name but have distinct toolsets, using agent.name as a key would incorrectly group them, leading to race conditions or incorrect cleanup. Since the goal is to manage resources for a specific shared agent object, id(agent) is the right key. I suggest resolving this TODO by replacing it with a comment explaining this rationale.

Suggested change
# TODO: can we use the name of the agent as the key?
if id(agent) not in self._runners_by_agent:
self._runners_by_agent[id(agent)] = set()
self._runners_by_agent[id(agent)].add(runner)
# Use `id(agent)` as the key to track runners on a per-agent-instance
# basis. This is crucial because different agent instances, even with
# the same name, may have distinct toolsets or states. Using
# `agent.name` could lead to incorrect resource management.
if id(agent) not in self._runners_by_agent:
self._runners_by_agent[id(agent)] = set()
self._runners_by_agent[id(agent)].add(runner)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I am still wondering is it okay to use agent name as the key.

Agent name is unique only in an agent tree. Different root agent may have same name.
Since AgentToolManager is a singleton object, it would be dangerous to use agent name.

But I doubt that using id(object) as the key is okay..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core [Component] This issue is related to the core interface and implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] When executing Runners(of which agent contains stateful toolset) in parallel, toolset is being released when only one of them has finished

2 participants