fix: manage Toolset cleanup properly when runner is running concurrently #4046
Conversation
Summary of Changes
Hello @challenger71498, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical concurrency bug where shared toolsets were being prematurely closed when multiple AgentTool instances run in parallel.
Code Review
This pull request effectively addresses a critical concurrency issue where toolsets were being cleaned up prematurely when multiple AgentTool instances run in parallel. The introduction of a singleton AgentToolManager to perform reference counting on runners is a solid and well-implemented solution. The changes are thoughtfully integrated, with backward compatibility maintained in the Runner.close method. The addition of extensive unit tests, especially those covering various concurrency scenarios, and the inclusion of new sample tests to demonstrate the fix, significantly increase confidence in the change. This is a high-quality contribution that improves the robustness of the library.
```python
# TODO: can we use the name of the agent as the key?
if id(agent) not in self._runners_by_agent:
  self._runners_by_agent[id(agent)] = set()
self._runners_by_agent[id(agent)].add(runner)
```
The TODO here raises a good question. Using id(agent) is the correct approach to track runners on a per-agent-instance basis. If different agent instances share the same name but have distinct toolsets, using agent.name as a key would incorrectly group them, leading to race conditions or incorrect cleanup. Since the goal is to manage resources for a specific shared agent object, id(agent) is the right key. I suggest resolving this TODO by replacing it with a comment explaining this rationale.
Suggested change:

```python
# Use `id(agent)` as the key to track runners on a per-agent-instance
# basis. This is crucial because different agent instances, even with
# the same name, may have distinct toolsets or states. Using
# `agent.name` could lead to incorrect resource management.
if id(agent) not in self._runners_by_agent:
  self._runners_by_agent[id(agent)] = set()
self._runners_by_agent[id(agent)].add(runner)
```
Yup, I am still wondering whether it is okay to use the agent name as the key.
An agent name is unique only within an agent tree; different root agents may share the same name.
Since AgentToolManager is a singleton object, it would be dangerous to use the agent name.
But I am also not fully sure that using id(object) as the key is safe.
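To make the trade-off concrete, here is a minimal, self-contained illustration (plain Python, not ADK code; `FakeAgent` and the runner strings are made up) of how keying by name mixes up runners from unrelated agents that happen to share a name, while keying by `id()` keeps them separate:

```python
class FakeAgent:
  """Stand-in for an agent; only the name matters for this illustration."""

  def __init__(self, name: str):
    self.name = name


a, b = FakeAgent("helper"), FakeAgent("helper")

runners_by_name = {}
runners_by_name.setdefault(a.name, set()).add("runner-1")
runners_by_name.setdefault(b.name, set()).add("runner-2")
print(runners_by_name)     # {'helper': {'runner-1', 'runner-2'}}  <- two agents mixed into one entry

runners_by_id = {}
runners_by_id.setdefault(id(a), set()).add("runner-1")
runners_by_id.setdefault(id(b), set()).add("runner-2")
print(len(runners_by_id))  # 2  <- one entry per agent instance
```

The remaining concern with `id()` is object lifetime: an `id` value is only meaningful while the object is alive and may be reused after garbage collection, so entries must be removed promptly on deregistration.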
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
Link to an existing issue:
Problem:
Since I already described the issue here, I will just briefly summarize the problem.
When multiple runners that share the same agent are running in parallel, the first runner to finish closes the toolset. This makes the other runners throw resource-related exceptions.
This may occur in 2 scenarios:
- Executing `Runner` concurrently, where the agent is configured with a stateful toolset.
- Calling `AgentTool` in parallel, where the agent of the tool is configured with a stateful toolset (e.g. `McpToolset`).

I will attach 2 diagrams which describe the second scenario, since this PR only fixes that. You may find other diagrams in the issue I've mentioned.
Diagrams
Assume that the root agent is configured with AgentTool, and the agent of AgentTool uses McpToolset.
If SubRunner 1 finishes its work first, it closes the toolset. If SubRunner 2 is still running and waiting for tool call responses, its tool call ends up raising an exception indicating that the resource has been exhausted.
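For readers without the diagrams, here is a small conceptual repro (plain asyncio, not ADK code; `FakeToolset` and the timings are made up) of the failure mode described above: the first runner to finish closes the shared toolset, and the slower runner then fails on its next tool call.

```python
import asyncio


class FakeToolset:
  """Stands in for a stateful toolset shared by two concurrent runners."""

  def __init__(self):
    self.closed = False

  async def call_tool(self):
    if self.closed:
      raise RuntimeError("toolset already closed")
    await asyncio.sleep(0.1)

  async def close(self):
    self.closed = True


async def run(toolset, start_delay):
  await asyncio.sleep(start_delay)
  await toolset.call_tool()
  await toolset.close()  # AS-IS behaviour: every runner closes the shared toolset.


async def main():
  toolset = FakeToolset()
  results = await asyncio.gather(
      run(toolset, 0), run(toolset, 0.2), return_exceptions=True
  )
  print(results)  # The second run fails with "toolset already closed".


asyncio.run(main())
```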
Solution:
I managed to solve the second problem, calling AgentTool in parallel, by introducing a singleton AgentToolManager which basically does reference-counting between the agent and the runner.
- When AgentTool::run_async is invoked, right after it initializes the runner, it registers the runner to AgentToolManager together with the agent it is running.
- AgentToolManager has its own dictionary matching the agent to runners. Registering the runner simply adds it to that dictionary.
- When the runner finishes, AgentTool deregisters the runner.
- Deregistration returns a Boolean-valued AsyncContextManager, to hold the lock until the actual closing logic is finished.
- Runner.close() does not allow any options, so I added a cleanup_toolsets flag to propagate whether toolsets should be cleaned up.
- The flag defaults to True, which is the behaviour AS-IS, so it will not cause any side effects.

This behaviour lets AgentTool close the toolset only when there are no more runners using it, ensuring there is no more early release of the shared resource. I will attach diagrams below to illustrate the solution more readably (a rough sketch of the manager follows right after this list).
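Before the diagrams, here is a minimal sketch of what such a reference-counting manager could look like, assuming the shape described above (the class and method names follow the PR description; the actual implementation in the PR may differ):

```python
import asyncio
import contextlib


class AgentToolManager:
  """Tracks which runners are still using a given agent instance."""

  _instance = None

  def __new__(cls):
    # Simple singleton: all AgentTool instances share one manager.
    if cls._instance is None:
      cls._instance = super().__new__(cls)
      cls._instance._runners_by_agent = {}
      cls._instance._lock = asyncio.Lock()
    return cls._instance

  async def register(self, agent, runner):
    async with self._lock:
      # Keyed by id(agent) so distinct agent objects never share an entry.
      self._runners_by_agent.setdefault(id(agent), set()).add(runner)

  @contextlib.asynccontextmanager
  async def deregister(self, agent, runner):
    # Hold the lock until the caller finishes its close logic, and yield
    # True only when this was the last runner using the agent.
    async with self._lock:
      runners = self._runners_by_agent.get(id(agent), set())
      runners.discard(runner)
      is_last = not runners
      if is_last:
        self._runners_by_agent.pop(id(agent), None)
      yield is_last
```

With this shape, the caller holds the manager's lock for the duration of its close logic, so a concurrent registration for the same agent cannot race with the final cleanup.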
Solution with diagrams
This diagram illustrates the flow of AgentTool::run_async with AgentToolManager.
When AgentTool executes the runner, it registers the runner to the AgentToolManager. Since both share the same agent, their key is the same.
Assume that SubRunner 1 has finished its job first.
When a runner finishes its work, AgentTool deregisters it from AgentToolManager.
AgentToolManager returns False since there is still a runner running. AgentTool therefore calls runner.close() with cleanup_toolsets set to False, which prevents toolset cleanup when closing the runner. SubRunner 2, which is still waiting for its tool call response, is not affected.
Now SubRunner 2 is done, and AgentTool deregisters it from AgentToolManager.
Since no more runners are registered to that agent, AgentToolManager returns True, and AgentTool calls runner.close() with cleanup_toolsets set to True, which actually closes the toolset.
Now all AgentTool calls are finished, and the root agent proceeds with the parallel tool call results.
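Put together with the manager sketch above, the close path described in these diagrams could look roughly like this (a hedged sketch; the helper name is made up, and `cleanup_toolsets` is the flag this PR adds to `Runner.close()`):

```python
async def _close_sub_runner(manager: AgentToolManager, agent, runner):
  # Hold the manager lock while closing; `is_last` is True only for the
  # final runner still using this agent.
  async with manager.deregister(agent, runner) as is_last:
    # Earlier runners close without touching the shared toolsets;
    # only the last one performs the actual toolset cleanup.
    await runner.close(cleanup_toolsets=is_last)
```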
Testing Plan
I made 2 sample agents at contributing/samples/adk_concurrent_agent_tool_call, each of which reproduces the issue in a different way:
- agent_tool_parallel: calling AgentTool in parallel.
- runner_shared_toolset: executing Runner concurrently.

Both raise an exception AS-IS. With this PR, agent_tool_parallel no longer raises an exception, while runner_shared_toolset still does.
I also made a unit test for the new component, AgentToolManager.
AgentToolManager.The unit test is focused on testing the manager when multiple tasks are accessing the manager concurrently.
Unit Tests:
Manual End-to-End (E2E) Tests:
Tested by:
Checklist
Any dependent changes have been merged and published in downstream modules. No deps changed.
Additional context
Caveats: what if a newborn runner re-uses the already-closed toolset?
Well, the answer is: it depends. At least McpToolset is not affected by reusing a closed session.
It is still miles better than before, since the code AS-IS will not allow multiple runners to run simultaneously at all.
I will describe the details below.
Case of toolset re-use after close
This IS a problem, since the behaviour of resurrecting an already-closed toolset is unknown; it depends on whether the toolset supports it or not.
At least McpToolset is fine: closing McpToolset basically just closes McpSessionManager, and all that does is close every live session. Creating a new session on a manager whose close has already been called is not a problem.
Also, to solve this whole issue completely, including the concurrency problem with multiple runners, we need to handle the lifecycle of the toolset properly.
But I have not figured out whether we should separate the lifecycle of the toolset from the runner lifecycle. From one point of view it seems like the toolset lifecycle should be separated; from another, it should not.