Prevent update_chat_ctx from deleting in-flight function calls using function call attributes #5021
Conversation
…they have arrived in local context
@StianHanssen I just realized the error. That means the tool is already done and

```python
chat_ctx = self._rt_session.chat_ctx.copy()
chat_ctx.items.extend(new_fnc_outputs)
```

This happens because there is no lock between local
@longcw Thanks for the reply, I really appreciate all the time you have dedicated to this issue. From what I understand, this is what we have observed happening: even if the operation succeeded on the OpenAI server, the diff afterwards still finds certain items missing in the remote context and tries to insert them into the OpenAI server again. So even in the event of success on the OpenAI server, you get duplicate messages pushed. We have seen this manifest in the agent losing track, by repeating itself or even repeating tool calls. In our system, repeating tool calls like that has a very negative effect because we use them for conversation flow control in the agent. From what I gather from my analysis, the in-flight function call removal triggers a cascading failure with a noticeable negative effect on the agent.
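The duplicate-insert failure mode described above can be modeled with a toy diff-based sync. This is a sketch, not LiveKit code: `compute_creates` and the snapshot/server lists are hypothetical stand-ins for the real diff logic.

```python
# Toy model of a diff-based context sync: the client compares its local
# context against a *snapshot* of the remote context and re-sends any
# items the snapshot appears to be missing.

def compute_creates(local: list[str], remote_snapshot: list[str]) -> list[str]:
    """Items present locally but absent from the remote snapshot."""
    return [item for item in local if item not in remote_snapshot]

local_ctx = ["msg1", "msg2", "tool_call_1"]

# The insert of tool_call_1 already succeeded server-side...
server_truth = ["msg1", "msg2", "tool_call_1"]

# ...but the client's snapshot is stale and does not reflect it yet.
stale_snapshot = ["msg1", "msg2"]

# The diff decides tool_call_1 is "missing" and re-sends the create.
to_create = compute_creates(local_ctx, stale_snapshot)
for item in to_create:
    server_truth.append(item)  # server accepts it again -> duplicate

assert server_truth == ["msg1", "msg2", "tool_call_1", "tool_call_1"]
```

The point of the sketch: a stale snapshot makes a successful insert indistinguishable from a missing one, so the "repair" itself creates the duplicate.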
Yeah, this is exactly what I mean, and from my understanding this happens because the remote_chat_ctx got modified while the OAI API is generating a new response. To avoid this, we can await the potential
Actually, I tested that removing the tool call from the chat_ctx before tool execution and returning the tool output alone works fine; the OAI realtime can generate a tool response without the tool call in the chat ctx. I didn't mean this is the correct behavior, but most likely this is not the reason for the error you have seen. So my proposal is to track the in-progress
Hmm, I don't think the problem is solely that a tool call arrives during
no, I mean a tool response triggered by |
We have some more extensive tests in our product's test suite that simulate the agent running with this scenario, and we were able to reproduce the issue. I also found that the solution I have put forth resolves it. We have been using my fork of LiveKit for the past week or so without the issue appearing again. All that said, it could very well be that a more nuanced issue is happening underneath that this fix resolves accidentally.
Sorry, perhaps I am still misunderstanding. I will take some time to read it over and double-check my understanding with you.
Yeah, your fix works by avoiding deleting the tool call item, but I think the root cause is that the tool response arrives during the
@longcw Just to confirm, am I correct to describe the scenario you mean as:
I looked over the logs for such failure cases. And this is what we see:
So based on the logs, the `function_call` was deleted ~400ms after creation but ~600ms before tool execution even started. If I understand you correctly, your scenario assumes the tool has already executed.

That is not to say the scenario you outlined isn't realistic; it could also be a real race condition. But it looks like two different issues to me. Sorry if I am still misunderstanding your proposal here 🙏

Perhaps, if we have a branch with the lock solution, I could apply it to our product and give it a test. It should be quite quick to confirm whether it solves our issue.
Yes, that's my assumption.
From your logs, the error happened here, well after the function call was deleted at T+434ms. The question is how this new item was generated and why it has the deleted FC as its previous item: is it the tool response or a voice-triggered response? It looks like it's not the tool response, since the previous item is the function call; otherwise the previous one should be the tool output. If it's a voice-triggered response, it was returned after the FC was deleted but uses the FC as the previous item. Then it looks like a race condition on OAI's side to me. IMO this is a general issue that can happen to an FC or a regular message whenever the latest item is deleted while a new response is being created.

proposal
For your specific case, maybe you can always keep the latest item after summarization. If you already have that logic, then the in-flight FC is the exception in that it's deleted unexpectedly, and that's why preventing its deletion fixed the issue?
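The locking idea floated earlier in this thread (await any in-flight response before applying context edits) could be sketched with `asyncio`. This is a toy reduction under stated assumptions: `ToySession`, `generate_response`, and `update_chat_ctx` are hypothetical names, not the LiveKit API.

```python
import asyncio

class ToySession:
    """Toy model: context edits wait for any in-flight response to finish."""

    def __init__(self) -> None:
        self._response_lock = asyncio.Lock()
        self.items: list[str] = []

    async def generate_response(self) -> None:
        async with self._response_lock:
            # The server is producing a response that links to the latest item.
            await asyncio.sleep(0.05)
            self.items.append("assistant_reply")

    async def update_chat_ctx(self, new_items: list[str]) -> None:
        # Deletions/replacements wait until no response is being generated,
        # so we never delete the item a pending response points at.
        async with self._response_lock:
            self.items = new_items + self.items

async def main() -> list[str]:
    s = ToySession()
    s.items = ["user_msg"]
    gen = asyncio.create_task(s.generate_response())
    await asyncio.sleep(0)                 # let the response start first
    await s.update_chat_ctx(["summary"])   # blocks until the response lands
    await gen
    return s.items

print(asyncio.run(main()))
```

The trade-off raised below (how long this can stall `update_chat_ctx`) is visible here: the edit is delayed by the full response-generation time.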
I am quite puzzled by this as well. I thought it could be the function_call_output was placed at the end of the conversation (via the diff's previous_item_id, not next to the function_call) because the function_call was already deleted from the snapshot when the diff ran, but I am really not sure if that could be the case 🤔
This seems sensible to me. I am a little concerned about how much it can delay the response from the model, as the timeout for
Hmm, yes, that could maybe work. I think it might trigger the diff to try to reorder again on the next
Yeah, this makes total sense if we can assume function_call_output always arrives after function_call. I really do think this is what normally happens, but I feel a bit of unease because I have logs showing otherwise.
That is exactly right. Our summarization only targets old items (50% of the oldest chat context). The in-flight FC was brand new; it should never have been in

Perhaps a good next action on my part is to try to understand why the next item was not
@longcw I did some more investigation. I added logging to

Starting state:

Server conversation: the server has just received user speech (msg5) and starts generating a response.

Step 1: Server generates a response containing a

Server creates
Step 2: User speaks. The server starts generating a new response (VAD)

While the

That response was still generating (audio streaming etc.), so

Step 3: Summarization's

Our summarization calls

Diff compares:
Server:

Step 4: Tool executes
Step 5: LiveKit's internal
```python
chat_ctx = self._rt_session.chat_ctx.copy()  # reads _remote_chat_ctx, NOT _agent._chat_ctx
chat_ctx.items.extend(new_fnc_outputs)       # appends function_call_output
```

This builds the context from

Diff compares:
Only
Step 6: A later server-initiated response arrives

The next failing item is a server-created message with

This item belongs to the response started in Step 2, so the predecessor link had already been committed before the delete in Step 3.
TLDR: Two different
And the first item that fails to insert is not the
I also found that this happened each time I reproduced it. It was never function_call_output that was the next item to fail insertion.
In conclusion, these are my thoughts:

I think your proposal may technically work, though I cannot say for certain without testing it. Because it attempts to repair the issue rather than prevent it, I think it comes with some risks:
My personal view is:
Hi @StianHanssen, thanks for your investigation! I agree with your concerns about my proposal, and I noticed them as well when I was thinking about the implementation. The root cause is the out-of-sync between

I'll create a fix that creates placeholder items in the local chat_ctx when the remote_chat_ctx is updated, for both function calls and chat messages.
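The placeholder approach described above could look roughly like this. It is a toy reduction under stated assumptions: the dict-based contexts and `on_conversation_item_added` / `diff_deletions` helpers are hypothetical, not the actual fix.

```python
# Toy model of the placeholder fix: whenever the remote context gains an
# item, mirror a placeholder into the local context immediately, so a later
# diff never interprets the not-yet-propagated item as an intentional delete.

remote_chat_ctx: list[dict] = []
local_chat_ctx: list[dict] = []

def on_conversation_item_added(item_id: str, item_type: str) -> None:
    remote_chat_ctx.append({"id": item_id, "type": item_type})
    # Mirror a placeholder so local and remote stay id-aligned.
    local_chat_ctx.append({"id": item_id, "type": item_type, "placeholder": True})

def diff_deletions() -> list[str]:
    """Remote items missing locally would be scheduled for deletion."""
    local_ids = {it["id"] for it in local_chat_ctx}
    return [it["id"] for it in remote_chat_ctx if it["id"] not in local_ids]

on_conversation_item_added("fc_1", "function_call")

# The in-flight function_call is never diffed away:
assert diff_deletions() == []
```

Because the placeholder is created in the same step as the remote update, there is no window in which the diff can mistake an in-flight item for a deleted one.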
This should fix the issue: #5114
Summary
`update_chat_ctx` can delete in-flight `function_call` items from the OpenAI Realtime server, causing cascading `"failed to insert item: previous_item_id not found"` corruption of `_remote_chat_ctx`.

The root cause is a timing gap between two context-tracking structures:
- `_remote_chat_ctx`: updated immediately when the server sends `conversation.item.added`
- `_agent._chat_ctx`: updated later, only when tool execution starts (`_tool_execution_started_cb`)

If `update_chat_ctx` runs during this window (e.g. from context management), the diff sees the `function_call` in remote but not in local, treats it as intentionally removed, and sends a delete event. The existing `_is_content_empty` guard only protects `message` items; `function_call` items pass through unconditionally.

A unit test gist replicating the exact pipeline demonstrates how `update_chat_ctx` deletes in-flight `function_call` items.

Fix
Use a shared-object flag (`extra["dispatched"]`) on `FunctionCall` items to distinguish in-flight from intentionally removed function calls.

- `openai_item_to_livekit_item` sets `extra["dispatched"] = False` when creating a `FunctionCall` from a server event.
- `_handle_function_call` reuses the same `FunctionCall` object from `_remote_chat_ctx` instead of creating a new one. This is safe because `conversation.item.added` always precedes `response.output_item.done` on the websocket. The same Python object is now shared across both contexts.
- `_create_update_chat_ctx_events` skips deletion for any `function_call` with `extra["dispatched"] == False`. Once the flag is `True`, summarization and other callers can delete the item normally.
- `agent_activity.py` sets `extra["dispatched"] = True` when tool execution starts or when all function calls for a generation are finalized (after `await exe_task`). Since the object is shared, this is visible to the diff guard immediately; no cross-package signaling needed.

Future consideration
This fix only tracks `function_call` items. `function_call_output` items are currently client-initiated (`manual_function_calls=True`), so they enter `_agent._chat_ctx` before `_remote_chat_ctx` and are not vulnerable to this race. If `auto_tool_reply_generation` is enabled in a future configuration (server-generated outputs), a guard should be added to cover `function_call_output` items as well.