Found a critical liveness issue in BaseGroupChatManager. The manager waits for ALL active speakers to respond before transitioning to the next turn. If a single agent hangs or never sends a response, the entire group chat stalls indefinitely (it is effectively deadlocked).
Proof of Concept:
I have implemented a minimal reproduction showing that if 2 agents are active and only 1 responds, the manager never calls _transition_to_next_speakers.
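For readers who want to see the failure mode in isolation, here is a rough standalone sketch of the wait-for-all pattern described above. It does not import or reproduce the real BaseGroupChatManager: `ToyManager` and `handle_agent_response` are hypothetical names invented for illustration, and only `_active_speakers` and `_transition_to_next_speakers` are borrowed from this report.

```python
import asyncio


class ToyManager:
    """Hypothetical stand-in for the manager; NOT the real BaseGroupChatManager."""

    def __init__(self, active_speakers):
        self._active_speakers = list(active_speakers)
        self.transitions = 0  # counts calls to _transition_to_next_speakers

    async def handle_agent_response(self, name):
        # Mark this speaker as done for the current turn.
        self._active_speakers.remove(name)
        if self._active_speakers:
            # Still waiting on at least one speaker, so do nothing yet.
            return
        await self._transition_to_next_speakers()

    async def _transition_to_next_speakers(self):
        self.transitions += 1


async def reproduce():
    manager = ToyManager(["agent_a", "agent_b"])
    await manager.handle_agent_response("agent_a")
    # agent_b hangs and never responds, so nothing else ever happens.
    return manager


manager = asyncio.run(reproduce())
print(manager.transitions, manager._active_speakers)  # prints: 0 ['agent_b']
```

With no further event arriving for `agent_b`, nothing in this pattern can ever trigger the transition.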
Proposed Fix:
Implement a timeout mechanism for active speakers. If a speaker fails to respond within a configurable number of seconds, it should be removed from _active_speakers and a timeout error recorded, allowing the conversation to proceed with the remaining speakers.
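One possible shape for this, sketched on a toy class rather than the real manager: start a per-speaker timer when the turn begins, cancel it when the speaker responds, and on expiry drop the speaker and continue. Everything here except `_active_speakers` and `_transition_to_next_speakers` is an assumption made for illustration (`TimeoutManager`, `speaker_timeout`, `start_turn`, `handle_agent_response`, `errors`); the actual fix would need to fit the manager's existing message handling.

```python
import asyncio


class TimeoutManager:
    """Hypothetical sketch of a per-speaker timeout; not the library's API."""

    def __init__(self, speaker_timeout):
        self._speaker_timeout = speaker_timeout  # seconds, assumed configurable
        self._active_speakers = []
        self._timers = {}  # speaker name -> asyncio.Task running _expire
        self.errors = []
        self.transitions = 0

    def start_turn(self, speakers):
        # Must be called from inside a running event loop.
        self._active_speakers = list(speakers)
        for name in speakers:
            self._timers[name] = asyncio.create_task(self._expire(name))

    async def _expire(self, name):
        await asyncio.sleep(self._speaker_timeout)
        self._timers.pop(name, None)
        if name in self._active_speakers:
            self.errors.append(f"{name} timed out after {self._speaker_timeout}s")
            await self._finish(name)

    async def handle_agent_response(self, name):
        if name not in self._active_speakers:
            return  # late reply from a speaker that already timed out
        timer = self._timers.pop(name, None)
        if timer is not None:
            timer.cancel()
        await self._finish(name)

    async def _finish(self, name):
        self._active_speakers.remove(name)
        if not self._active_speakers:
            await self._transition_to_next_speakers()

    async def _transition_to_next_speakers(self):
        self.transitions += 1


async def demo():
    manager = TimeoutManager(speaker_timeout=0.05)
    manager.start_turn(["agent_a", "agent_b"])
    await manager.handle_agent_response("agent_a")
    await asyncio.sleep(0.2)  # agent_b never responds; its timer fires
    return manager


manager = asyncio.run(demo())
print(manager.transitions, manager.errors)
```

In this sketch a late reply from a timed-out speaker is silently ignored; whether the real manager should drop it, log it, or surface it to the caller is a design decision left open here.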
This is critical for production deployments where LLM reliability is not 100%.