ProjectTech4DevAI · Prajna1999 · Feb 20, 2026 · Feb 21, 2026 · Feb 21, 2026 · Feb 27, 2026
diff --git a/backend/app/api/docs/llm/speech_to_speech.md b/backend/app/api/docs/llm/speech_to_speech.md
@@ -0,0 +1,228 @@
+# Speech-to-Speech (STS) with RAG
+
+Execute a complete speech-to-speech workflow with knowledge base retrieval.
+
+## Endpoint
+
+```
+POST /llm/sts
+```
+
+## Flow
+
+```
+Voice Input → STT (auto language) → RAG (Knowledge Base) → TTS → Voice Output
+```
+
+## Input
+
+- **Voice note**: WhatsApp-compatible audio format (required)
+- **Knowledge base IDs**: One or more knowledge bases for RAG (required)
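+- **Languages**: Input and output languages as BCP-47 codes such as `hi-IN` (optional; input defaults to automatic detection, output defaults to the input or detected language)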
+- **Models**: STT, LLM, and TTS model selection (optional, defaults to Sarvam)
+
+## Output
+
+You will receive **3 callbacks** to your webhook URL:
+
+1. **STT Callback** (Intermediate): Transcribed text from audio
+2. **LLM Callback** (Intermediate): RAG-enhanced response text
+3. **TTS Callback** (Final): Audio output + response text
+
+Each callback includes:
+- Output from that step
+- Token usage
+- Latency information (check timestamps)
+
+## Supported Languages
+
+### Primary Indian Languages
+- English, Hindi, Hinglish (code-switching)
+- Bengali, Kannada, Malayalam, Marathi
+- Odia, Punjabi, Tamil, Telugu, Gujarati
+
+### Additional Languages (Sarvam Saaras V3)
+- Assamese, Urdu, Nepali
+- Konkani, Kashmiri, Sindhi
+- Sanskrit, Santali, Manipuri
+- Bodo, Maithili, Dogri
+
+**Total: 24 languages listed above**, with automatic language detection
+
+## Available Models
+
+### STT (Speech-to-Text)
+- `saaras:v3` - Sarvam Saaras V3 (**default**, fast, auto language detection, optimized for Indian languages)
+- `gemini-2.5-pro` - Google Gemini 2.5 Pro
+
+**Note:** Sarvam STT uses automatic language detection. No need to specify input language.
+
+### LLM (RAG)
+- `gpt-4o` - OpenAI GPT-4o (**default**, best quality)
+- `gpt-4o-mini` - OpenAI GPT-4o Mini (faster, lower cost)
+
+### TTS (Text-to-Speech)
+- `bulbul:v3` - Sarvam Bulbul V3 (**default**, natural Indian voices, MP3 output)
+- `gemini-2.5-pro-preview-tts` - Google Gemini 2.5 Pro (OGG OPUS output)
+
+## Edge Cases & Error Handling
+
+### Empty STT Output
+If speech-to-text returns empty/blank:
+- Chain fails immediately
+- Error message: "STT returned no transcription"
+- No subsequent blocks are executed
+
+### Audio Size Limit
+WhatsApp limit: 16MB
+- TTS providers may fail if output exceeds limit
+- Error is caught and reported in callback
+- Consider using shorter responses or compression
+
+### Invalid Audio Format
+If input audio format is unsupported:
+- STT provider fails with format error
+- Error reported in callback
+- Supported: MP3, WAV, OGG, OPUS, M4A
+
+### Provider Failures
+Each block has independent error handling:
+- STT fails → Chain stops, STT error reported
+- LLM fails → Chain stops, RAG error reported
+- TTS fails → Chain stops, TTS error reported
+
+## Example Request
+
+```bash
+curl -X POST https://api.kaapi.ai/llm/sts \
+  -H "Authorization: Bearer YOUR_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d @- <<EOF
+{
+  "query": {
+    "type": "audio",
+    "content": {
+      "format": "base64",
+      "value": "base64_encoded_audio_data",
+      "mime_type": "audio/ogg"
+    }
+  },
+  "knowledge_base_ids": ["kb_abc123"],
+  "input_language": "hi-IN",
+  "output_language": "hi-IN",
+  "callback_url": "https://your-app.com/webhook"
+}
+EOF
+```
+
+**Note:** `stt_model`, `llm_model`, and `tts_model` are optional and will use defaults if not specified.
+
+## Example Callbacks
+
+### Callback 1: STT Output (Intermediate)
+```json
+{
+  "success": true,
+  "data": {
+    "block_index": 1,
+    "total_blocks": 3,
+    "response": {
+      "provider_response_id": "stt_xyz789",
+      "provider": "sarvamai-native",
+      "model": "saaras:v3",
+      "output": {
+        "type": "text",
+        "content": {
+          "value": "नमस्ते, मुझे अपने अकाउंट के बारे में जानकारी चाहिए"
+        }
+      }
+    },
+    "usage": {
+      "input_tokens": 0,
+      "output_tokens": 12,
+      "total_tokens": 12
+    }
+  },
+  "metadata": {
+    "speech_to_speech": true,
+    "input_language": "hi-IN"
+  }
+}
+```
+
+### Callback 2: LLM Output (Intermediate)
+```json
+{
+  "success": true,
+  "data": {
+    "block_index": 2,
+    "total_blocks": 3,
+    "response": {
+      "provider_response_id": "chatcmpl_abc123",
+      "provider": "openai",
+      "model": "gpt-4o",
+      "output": {
+        "type": "text",
+        "content": {
+          "value": "आपके अकाउंट में कुल बैलेंस ₹5,000 है। पिछले महीने में 3 ट्रांजैक्शन हुए हैं।"
+        }
+      }
+    },
+    "usage": {
+      "input_tokens": 150,
+      "output_tokens": 45,
+      "total_tokens": 195
+    }
+  },
+  "metadata": {
+    "speech_to_speech": true
+  }
+}
+```
+
+### Callback 3: TTS Output (Final)
+```json
+{
+  "success": true,
+  "data": {
+    "response": {
+      "provider_response_id": "tts_def456",
+      "provider": "sarvamai-native",
+      "model": "bulbul:v3",
+      "output": {
+        "type": "audio",
+        "content": {
+          "format": "base64",
+          "value": "base64_encoded_audio_output",
+          "mime_type": "audio/ogg"
+        }
+      }
+    },
+    "usage": {
+      "input_tokens": 15,
+      "output_tokens": 0,
+      "total_tokens": 15
+    }
+  },
+  "metadata": {
+    "speech_to_speech": true,
+    "output_language": "hi-IN"
+  }
+}
+```
+
+## Latency Tracking
+
+Calculate latency from callback timestamps:
+- **STT latency**: Time from request to first callback
+- **LLM latency**: Time between first and second callback
+- **TTS latency**: Time between second and third callback
+- **Total latency**: Time from request to final callback
+
+## Best Practices
+
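+1. **Language Consistency**: If you are not translating, set `output_language` to the same code as `input_language`, or omit `output_language` so it falls back to the input or detected language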
+2. **Model Selection**: Use Sarvam models for Indian languages (faster, better quality)
+3. **Knowledge Base**: Ensure KB is properly indexed and relevant to expected queries
+4. **Error Handling**: Implement retry logic for transient provider failures
+5. **Webhook Security**: Validate webhook signatures and use HTTPS
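
The "Latency Tracking" section of the doc above derives per-stage latency from callback timing. A minimal Python sketch of that arithmetic follows. It assumes the webhook receiver records its own arrival timestamp for each of the three callbacks; the function name `stage_latencies` and its parameters are hypothetical and not part of this PR.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stage_latencies(
    request_sent: datetime, callbacks_received: List[datetime]
) -> Dict[str, timedelta]:
    """Derive per-stage latency from the arrival times of the three callbacks.

    `callbacks_received` must hold the arrival times of the STT, LLM, and TTS
    callbacks, in that order (timestamps recorded by the receiver: an assumption).
    """
    if len(callbacks_received) != 3:
        raise ValueError("expected exactly three callback timestamps (STT, LLM, TTS)")
    stt_at, llm_at, tts_at = callbacks_received
    return {
        "stt": stt_at - request_sent,    # request to first callback
        "llm": llm_at - stt_at,          # first to second callback
        "tts": tts_at - llm_at,          # second to third callback
        "total": tts_at - request_sent,  # request to final callback
    }


if __name__ == "__main__":
    sent = datetime(2026, 2, 21, 10, 0, 0)
    arrivals = [sent + timedelta(seconds=s) for s in (2, 5, 6)]
    for stage, latency in stage_latencies(sent, arrivals).items():
        print(stage, latency.total_seconds())
```

Because the three stage latencies are differences of consecutive timestamps, they always sum to the total, which makes a quick sanity check on recorded data.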
diff --git a/backend/app/api/main.py b/backend/app/api/main.py
@@ -7,10 +7,12 @@
     config,
     doc_transformation_job,
     documents,
+    llm_sts,
     login,
     languages,
     llm,
     llm_chain,
+    llm_sts,
     organization,
     openai_conversation,
     project,
@@ -43,6 +45,7 @@
 api_router.include_router(languages.router)
 api_router.include_router(llm.router)
 api_router.include_router(llm_chain.router)
+api_router.include_router(llm_sts.router)
 api_router.include_router(login.router)
 api_router.include_router(onboarding.router)
 api_router.include_router(openai_conversation.router)

diff --git a/backend/app/api/routes/llm_sts.py b/backend/app/api/routes/llm_sts.py
@@ -0,0 +1,144 @@
+"""Speech-to-Speech (STS) API endpoint with RAG."""
+
+import logging
+
+from fastapi import APIRouter, Depends, HTTPException
+
+from app.api.deps import AuthContextDep, SessionDep
+from app.api.permissions import Permission, require_permission
+from app.models import Message
+from app.models.llm.request import (
+    LLMChainRequest,
+    QueryParams,
+    SpeechToSpeechRequest,
+)
+from app.services.llm.chain.utils import (
+    SUPPORTED_LANGUAGE_CODES,
+    build_rag_block,
+    build_stt_block,
+    build_tts_block,
+)
+from app.services.llm.jobs import start_chain_job
+from app.utils import APIResponse, load_description, validate_callback_url
+
+logger = logging.getLogger(__name__)
+
+router = APIRouter(tags=["LLM"])
+
+
+@router.post(
+    "/llm/sts",
+    description=load_description("llm/speech_to_speech.md"),
+    response_model=APIResponse[Message],
+    dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
+)
+def speech_to_speech(
+    _current_user: AuthContextDep,
+    _session: SessionDep,
+    request: SpeechToSpeechRequest,
+) -> APIResponse[Message]:
+    """
+    Speech-to-speech (STS) endpoint with RAG.
+
+    Executes a 3-block chain:
+    1. STT (Speech-to-Text) - Transcribes audio to text (auto-detects language for Sarvam)
+    2. RAG (Retrieval-Augmented Generation) - Processes text with knowledge base
+    3. TTS (Text-to-Speech) - Converts response back to audio
+
+    Input: Voice note (WhatsApp-compatible audio)
+    Output 1: Voice note (final callback)
+    Output 2: Text (via intermediate callbacks)
+    """
+    project_id = _current_user.project_.id
+    organization_id = _current_user.organization_.id
+
+    # Validate callback URL
+    if request.callback_url:
+        validate_callback_url(str(request.callback_url))
+
+    # Validate BCP-47 language codes
+    if (
+        request.input_language
+        and request.input_language not in SUPPORTED_LANGUAGE_CODES
+    ):
+        return APIResponse.failure_response(
+            error=f"Unsupported input language code: {request.input_language}. Supported: {', '.join(sorted(SUPPORTED_LANGUAGE_CODES))}",
+            metadata={"status_code": 400},
+        )
+
+    if (
+        request.output_language
+        and request.output_language not in SUPPORTED_LANGUAGE_CODES
+    ):
+        return APIResponse.failure_response(
+            error=f"Unsupported output language code: {request.output_language}. Supported: {', '.join(sorted(SUPPORTED_LANGUAGE_CODES))}",
+            metadata={"status_code": 400},
+        )
+
+    # Determine language codes (already BCP-47, no conversion needed)
+    input_lang_code = request.input_language or "auto"
+
+    # If output_language not set, default to input_language
+    # If input is "auto", use "{{detected}}" marker to signal TTS to use detected language
+    if request.output_language:
+        output_lang_code = request.output_language
+    elif input_lang_code == "auto":
+        output_lang_code = "{{detected}}"  # Marker to use detected language from STT
+    else:
+        output_lang_code = input_lang_code
+
+    logger.info(
+        f"[speech_to_speech] Starting STS chain | "
+        f"project_id={project_id}, "
+        f"input_lang={input_lang_code}, "
+        f"output_lang={output_lang_code}, "
+        f"stt_model={request.stt_model.value}, "
+        f"llm_model={request.llm_model.value}, "
+        f"tts_model={request.tts_model.value}"
+    )
+
+    # Build 3-block chain: STT → RAG → TTS
+    blocks = [
+        build_stt_block(request.stt_model, input_lang_code),
+        build_rag_block(request.llm_model, request.knowledge_base_ids),
+        build_tts_block(request.tts_model, output_lang_code),
+    ]
+
+    metadata = request.request_metadata or {}
+    metadata.update(
+        {
+            "speech_to_speech": True,
+            "input_language": input_lang_code,
+            "output_language": output_lang_code,
+            "stt_model": request.stt_model.value,
+            "llm_model": request.llm_model.value,
+            "tts_model": request.tts_model.value,
+        }
+    )
+
+    # Create chain request
+    chain_request = LLMChainRequest(
+        query=QueryParams(input=request.query),
+        blocks=blocks,
+        callback_url=request.callback_url,
+        request_metadata=metadata,
+    )
+
+    # Start async chain job
+    start_chain_job(
+        db=_session,
+        request=chain_request,
+        project_id=project_id,
+        organization_id=organization_id,
+    )
+
+    return APIResponse.success_response(
+        data=Message(
+            message=(
+                "Speech-to-speech processing initiated. "
+                "You will receive intermediate callbacks for STT and LLM outputs, "
+                "followed by the final callback with audio and text."
+            )
+        )
+    )
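
The language fallback inside `speech_to_speech` above can be read as a small pure function, which is easier to unit-test than the route itself. The sketch below restates that branch logic outside the endpoint; `resolve_language_codes` and `DETECTED_MARKER` are hypothetical names introduced for illustration, not helpers added by this PR.

```python
from typing import Optional, Tuple

# Marker the route passes to the TTS block when the language must come from STT.
DETECTED_MARKER = "{{detected}}"


def resolve_language_codes(
    input_language: Optional[str], output_language: Optional[str]
) -> Tuple[str, str]:
    """Return (input_code, output_code) using the same fallbacks as the route."""
    # A missing input language means "let the STT model detect it".
    input_code = input_language or "auto"

    if output_language:
        # An explicit output language always wins.
        output_code = output_language
    elif input_code == "auto":
        # Nothing known up front: defer to the language detected by STT.
        output_code = DETECTED_MARKER
    else:
        # Otherwise reply in the language the caller declared for the input.
        output_code = input_code
    return input_code, output_code


if __name__ == "__main__":
    print(resolve_language_codes(None, None))
    print(resolve_language_codes("hi-IN", None))
    print(resolve_language_codes("hi-IN", "en-IN"))
```

Keeping this logic separate from the FastAPI handler would let the three branches be covered by plain assertions without building a request or a database session.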