feat: Unified Telemetry Layer for Non-LangGraph Trace Pipelines (M2) #64

Draft
mjehanzaib999 wants to merge 41 commits into AgentOpt:experimental from mjehanzaib999:m2-unified-telemetry

Conversation

@mjehanzaib999

Summary

This PR implements the "Generic Unified Telemetry" layer (Milestone 2), enabling OTEL span emission for non-LangGraph Trace pipelines while preserving all existing LangGraph instrumentation behavior.

After M1, only LangGraph pipelines could emit OTEL spans. This PR extends telemetry coverage so that any Trace pipeline using @trace.bundle or call_llm can produce OTEL-compatible spans when a TelemetrySession is active — with zero changes to existing code when no session is active.

What's new

  • Session activation via contextvars — TelemetrySession supports with-statement context management and activate() for global discovery by Trace hooks
  • OTEL spans around @trace.bundle ops — controlled by BundleSpanConfig (enable/disable, suppress default ops, capture inputs)
  • MessageNode-to-span binding — MessageNodeTelemetryConfig binds message.id to the current span for stable node identity in TGJ conversion
  • call_llm provider span — emits a child OTEL span with trace.temporal_ignore=true when a session is active (visible for monitoring, excluded from output node selection)
  • Session activation in LangGraph root span — InstrumentedGraph._root_invocation_span now calls session.activate() so Trace-level hooks discover the session automatically
  • Optional MLflow autologging — opto.features.mlflow.autolog() enables mlflow.trace wrapping on bundle ops; safe no-op when MLflow is not installed
  • Export naming alignment — export_run_bundle() now writes otlp.json / tgj.json (aligned with repo demos), with backward-compatible aliases (otlp_trace.json / trace_graph.json)
  • Manifest + node records — manifest.json and message_nodes.jsonl are included in the export bundle for debugging

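A minimal sketch of the contextvar-based session activation described above. The class body here is illustrative only (not the PR's actual implementation); it shows how a with-statement and activate()/deactivate() can make the session globally discoverable via TelemetrySession.current():

```python
import contextvars
from typing import Optional

# Contextvar holding the currently active session (None outside any session).
_current_session: contextvars.ContextVar = contextvars.ContextVar(
    "telemetry_session", default=None
)

class TelemetrySession:
    """Illustrative stand-in: contextvar-backed session activation."""

    @classmethod
    def current(cls) -> Optional["TelemetrySession"]:
        # Hooks call this to discover an active session; None means no-op.
        return _current_session.get()

    def activate(self) -> "TelemetrySession":
        # Store the reset token so deactivation restores the previous state.
        self._token = _current_session.set(self)
        return self

    def deactivate(self) -> None:
        _current_session.reset(self._token)

    def __enter__(self) -> "TelemetrySession":
        return self.activate()

    def __exit__(self, *exc) -> bool:
        self.deactivate()
        return False
```

Using contextvars (rather than a plain global) keeps activation correct under async code and threads, which matters for the async_forward path mentioned below.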
Files changed (9 files, +664 / -71)

opto/trace/settings.py — New: global MLflow autologging toggle
opto/features/mlflow/__init__.py — New: MLflow integration package
opto/features/mlflow/autolog.py — New: autolog() / disable_autolog()
opto/trace/__init__.py — Expose settings and mlflow in the public API
opto/trace/bundle.py — Optional OTEL span in sync_forward/async_forward; MLflow mlflow.trace wrapping
opto/trace/io/telemetry_session.py — Major expansion: activation, BundleSpanConfig, MessageNodeTelemetryConfig, span helpers, MLflow helpers, export alignment
opto/trace/io/instrumentation.py — Wrap root span with session.activate()
opto/trace/nodes.py — Hook MessageNode.__init__ to call on_message_node_created()
opto/trace/operators.py — call_llm emits a temporal-ignore provider span

Non-breaking guarantees

  • No session active → identical behavior — all hooks are guarded by TelemetrySession.current() is None checks
  • postprocess_output signature unchanged — preserves compatibility with existing callers
  • preprocess_inputs preserved — data extraction inside trace_nodes context is untouched
  • MLflow is optional — all imports are guarded; code works without MLflow installed

Test plan

  • opto.trace import works without errors
  • TelemetrySession + BundleSpanConfig + MessageNodeTelemetryConfig import correctly
  • Bundle ops without a session produce identical results to M1 (no regression)
  • Bundle ops with active session emit OTEL spans with trace.bundle.* and inputs.* attributes
  • TelemetrySession.current() returns None outside context, active session inside
  • export_run_bundle() produces otlp.json, tgj.json, manifest.json + legacy aliases
  • autolog(silent=True) gracefully disables when MLflow is not installed
  • Run M1 notebook end-to-end to confirm no regressions
  • Run M2 demo notebook (generic_unified_telemetry_demo.ipynb)
  • pytest suite passes in clean environment

doxav and others added 30 commits February 12, 2026 15:01
…tion do not lose initial node to optimize (TODO: trainer might have a better solution)
- Add T1 technical plan for LangGraph OTEL Instrumentation API
- Add architecture & strategy doc (unified OTEL instrumentation design)
- Add M0 README with before/after boilerplate reduction comparison
- Add feedback analysis and API strategy comparison (Trace-first, dual semconv)
- Add prototype_api_validation.py with real LangGraph StateGraph + OpenRouter/StubLLM
- Add Jupyter notebook (prototype_api_validation.ipynb) for Colab-ready demo
- Add example trace output JSON files (notebook_trace_output, optimization_traces)
- Add .env.example for OpenRouter configuration
- Replace hardcoded API key with 3-tier auto-lookup (Colab Secrets → env → .env)
- Save all trace outputs to RUN_FOLDER (Google Drive on Colab, local fallback)
- Add run_summary.json export with scores and history
- Update configuration docs with key setup priority table
- Fix Colab badge URL with actual repo/branch path
Deliver Milestone 1 — drop-in OTEL instrumentation and end-to-end
optimization for any LangGraph agent via two function calls.

New modules (opto/trace/io/):
- instrumentation.py: instrument_graph() + InstrumentedGraph wrapper
- optimization.py: optimize_graph() loop + EvalResult/EvalFn contracts
- telemetry_session.py: TelemetrySession (TracerProvider + flush/export)
- bindings.py: Binding dataclass + apply_updates() + make_dict_binding()
- otel_semconv.py: emit_reward(), emit_trace(), record_genai_chat()

Modified modules:
- langgraph_otel_runtime.py: TracingLLM dual semconv (param.* parent +
  gen_ai.* child spans with trace.temporal_ignore)
- __init__.py: export all new M1 public APIs

Tests (63 passing, StubLLM-only, CI-safe):
- Unit tests for bindings, semconv, session, instrumentation, optimization
- E2E integration test (test_e2e_m1_pipeline.py): real LangGraph with
  StubLLM proving full pipeline instrument → invoke → OTLP → TGJ →
  optimizer → apply_updates → re-invoke with updated template

Notebook + docs:
- 01_m1_instrument_and_optimize.ipynb: dual-mode (StubLLM + live
  OpenRouter), Colab badge, executed outputs, <=3 item dataset,
  temperature=0, max_tokens=256 budget guard
- docs/m1_README.md: architecture, API reference, data flow, semantic
  conventions, acceptance criteria status
- requirements.txt: pinned dependencies for uv/pip environments
A. Live mode error handling:
 - A1: TracingLLM raises LLMCallError on HTTP errors/empty content instead of passing error strings as assistant content
 - A2: Notebook only prints [OK] when provider call actually succeeds with non-empty content
 - A3: gen_ai.provider.name correctly set to "openrouter" (not "openai") when using OpenRouter
 - A4: optimize_graph forces score=0 on invocation failure, bypassing eval_fn

B. TelemetrySession API correctness + redaction:
 - B5: flush_otlp(clear=False) properly peeks at spans without clearing the exporter
 - B6: span_attribute_filter now applied during flush_otlp; supports drop (return {}), redact, and truncate

C. TGJ/ingest correctness and optimizer safety:
 - C7: _deduplicate_param_nodes() strips numeric suffixes to collapse duplicate ParameterNodes
 - C8: _select_output_node() excludes child LLM spans, selects the true sink (synthesizer)

D. OTEL topology and temporal chaining:
 - D9: Root invocation span wraps graph.invoke(), producing a single trace ID per invocation
 - D10: Temporal chaining uses trace.temporal_ignore attribute instead of OTEL parent presence

E. optimize_graph semantics + trace-linked reward:
 - E11: best_parameters is a real snapshot captured at the best-scoring iteration
 - E12: eval.score attached to root invocation span before flush, linking reward to trace

F. Non-saturating scoring for Stub mode:
 - F13: StubLLM and eval_fn are structure-aware; stub optimization demonstrates score improvement

Files changed:
 - langgraph_otel_runtime.py: LLMCallError, _validate_content, flush_otlp(clear=)
 - telemetry_session.py: flush_otlp delegation, _apply_attribute_filter
 - otel_adapter.py: root span exclusion, trace.temporal_ignore chaining
 - instrumentation.py: _root_invocation_span context manager, root span on invoke/stream
 - optimization.py: _deduplicate_param_nodes, _select_output_node, _snapshot_parameters, eval-in-trace
 - __init__.py: export LLMCallError
 - test_optimization.py: updated for best_parameters field
 - 01_m1_instrument_and_optimize.ipynb: all fixes reflected in notebook
 - test_client_feedback_fixes.py: 20 new tests covering all 13 issues
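The B6 attribute-filter contract above can be sketched as follows (function names and the redact helper are illustrative; the contract — return {} to drop, or a transformed dict to redact/truncate — is from the commit message):

```python
def apply_attribute_filter(attrs, attr_filter):
    # The filter receives a copy of the span attributes and may drop
    # everything by returning {}, or redact/truncate individual values.
    if attr_filter is None:
        return dict(attrs)
    return attr_filter(dict(attrs))

def redact(keys):
    """Build a filter that masks the values of the given attribute keys."""
    def _filter(attrs):
        return {k: ("<redacted>" if k in keys else v) for k, v in attrs.items()}
    return _filter
```

Applying the filter at flush time (rather than at span creation) means sensitive values never reach the exported OTLP payload while remaining visible to in-process code.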
… code

Make the instrumentation layer fully generic and provider-agnostic:

- TracingLLM: default provider_name "openai" → "llm",
  default llm_span_name "openai.chat.completion" → "llm.chat.completion"
- init_otel_runtime: default service_name "trace-langgraph-demo" → "trace-otel-runtime"
- DEFAULT_EVAL_METRIC_KEYS: remove example-specific "plan_quality",
  add generic "score"
- instrument_graph: add llm_span_name, input_key, output_key parameters
  so callers explicitly configure provider/schema specifics
- InstrumentedGraph: add input_key field; invoke()/stream() use it
  instead of hardcoded "query" for the root span hint
- optimize_graph: add output_key parameter; _make_state uses
  graph.input_key instead of hardcoded "query"; error fallback
  no longer assumes result["answer"]
- _select_output_node: replace hardcoded "openai"/"chat.completion"
  name checks with trace.temporal_ignore attribute from info.otel
- otel_adapter: propagate temporal_ignore flag into TGJ info dict
- tgj_ingest: preserve info.otel metadata through conversion and
  onto MessageNode objects

Tests and notebook updated to explicitly pass example-specific values
(provider_name, llm_span_name, output_key) rather than relying on defaults.

All 88 tests pass.
…st iteration

Previously, best_updates was overwritten on every iteration where updates
were applied, regardless of whether that iteration achieved the best score.
This caused best_updates to always contain the last applied updates rather
than the updates that produced the best-performing parameters.

Introduce last_applied_updates to track the most recently applied updates
separately, and snapshot it at the start of each iteration as
applied_updates_for_this_iter. best_updates is now only assigned inside
the best-score guard (avg_score > best_score), ensuring it accurately
reflects the updates that led to best_parameters.

Addresses PR feedback item doxav#1: optimize_graph() best_updates tracking.
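The fix can be sketched with a stand-in loop (not the repo's optimize_graph): snapshot the updates already in effect before the best-score check, and assign best_updates only inside the guard.

```python
def optimize_loop(iterations, apply_fn):
    # iterations: iterable of (updates, avg_score); apply_fn applies updates.
    best_score = float("-inf")
    best_updates = None
    last_applied_updates = None
    for updates, avg_score in iterations:
        # Snapshot: the updates that were in effect when this score was measured.
        applied_updates_for_this_iter = last_applied_updates
        if avg_score > best_score:  # best-score guard
            best_score = avg_score
            best_updates = applied_updates_for_this_iter
        if updates is not None:
            apply_fn(updates)
            last_applied_updates = updates
    return best_updates, best_score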
optimize_graph() previously ignored the graph's configured output_key
unless the caller explicitly passed output_key=..., causing incorrect
eval payload shape. Now auto-inherits graph.output_key when the parameter
is not provided, and logs a debug note when an explicit override disagrees
with the graph's configuration.

Addresses PR feedback item doxav#2: output_key fallback in optimize_graph.
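The fallback can be sketched as a small resolver (names illustrative): inherit the graph's configured key when the caller passes nothing, and log a debug note on a disagreeing override.

```python
import logging

log = logging.getLogger("optimize_graph")

def resolve_output_key(graph_output_key, explicit=None):
    # No explicit value: auto-inherit the graph's configuration.
    if explicit is None:
        return graph_output_key
    # Explicit override wins, but a disagreement is worth a debug note.
    if explicit != graph_output_key:
        log.debug("output_key=%r overrides graph.output_key=%r",
                  explicit, graph_output_key)
    return explicit
```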
enable_code_optimization was accepted by instrument_graph() but never
used — TracingLLM.emit_code_param always remained None. Now constructs
a _emit_code_param callback when the flag is True that emits source code,
SHA-256 hash, truncation metadata, and trainable marker as param.__code_*
span attributes. Source is capped at 10K chars with truncation flag.

Addresses PR feedback item doxav#3: enable_code_optimization no-op.
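The emitted attributes described above could look roughly like this (the param.__code_* key names follow the commit message, but the exact keys and the helper itself are assumptions):

```python
import hashlib

def code_param_attributes(source: str, max_chars: int = 10_000) -> dict:
    """Build span attributes for a code parameter: source (capped at
    max_chars), SHA-256 hash of the full source, truncation flag, and
    trainable marker."""
    truncated = len(source) > max_chars
    return {
        "param.__code_source": source[:max_chars],
        "param.__code_sha256": hashlib.sha256(source.encode()).hexdigest(),
        "param.__code_truncated": truncated,
        "param.__code_trainable": True,
    }
```

Hashing the full source (before truncation) lets downstream tooling detect code changes even when the emitted source itself is capped.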
(4A) otel_adapter: after temporal hierarchy resolution, null out
effective_psid when it still references a skipped root invocation span,
preventing dangling parent edges in the TGJ graph.

(4B) langgraph_otel_runtime: capture child LLM span ref and propagate
error/error.type attributes to it on LLMCallError and unexpected
exceptions, so OTEL UIs correctly flag the LLM call as failed.

Addresses PR feedback item doxav#4.
…race validation

Notebook trace validation used "openai" in name to detect child spans,
which silently matched nothing after the generic refactoring. Now uses
trace.temporal_ignore attribute for provider-agnostic detection and
asserts the set is non-empty. Also adds root invocation span assertion
to enforce the D9 single-trace-ID invariant.

Addresses PR feedback item doxav#6.
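The provider-agnostic detection can be sketched over plain-dict spans (illustrative shape, not the notebook's actual objects): filter on the trace.temporal_ignore attribute and assert the result is non-empty.

```python
def child_llm_spans(spans):
    # Select provider child spans by attribute, not by matching "openai"
    # in the span name; raise if none are found so silent misses are caught.
    found = [s for s in spans
             if s.get("attributes", {}).get("trace.temporal_ignore")]
    assert found, "expected at least one provider child span"
    return found
```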
…into m1-for-upstream

…e spans

Library (langgraph_otel_runtime.py):
- Restructure child LLM span error handling: catch errors inside the
  child span context manager so attributes are set before the span ends
- Add error.message attribute (truncated to 500 chars) on both parent
  and child spans for LLMCallError and unexpected exceptions

Notebook (01_m1_instrument_and_optimize.ipynb):
- Rewrite graph to 6-node architecture aligned with reference demo:
  planner → executor → web_researcher/wikidata_researcher → synthesizer → evaluator
- Use Command routing from langgraph.types for dynamic node dispatch
- Switch to DEMO_QUERIES (French Revolution / Tesla / CRISPR)
- Add 3 trainable templates (planner, executor, synthesizer) with output_key=final_answer
- Rewrite StubLLM to produce JSON plans, routing JSON, and topic-aware
  answers; respond to prompt template changes for non-saturating scoring
- Rewrite stub_eval_fn: base 0.2 + plan richness + answer length, cap 0.95
- Fix live section: provider_name="openrouter", trace invariant checks,
  only print [OK] on actual success
- Fix ParameterNode deduplication in TGJ inspection (id-based dedup)
- Update Colab Drive paths to OpenTrace_runs/M1/{OPENTRACE_REF}
- Add optimization table output (iteration → avg_score → best_score)

Verified: 41 tests pass, notebook runs end-to-end, baseline=0.75 → best=0.95
…imeV2 optimizer

- Remove custom OpenRouterLLM HTTP class; use opto.utils.llm.LiteLLM
  which natively supports OpenRouter via the "openrouter/" model prefix
- Upgrade auto-created optimizer from OptoPrime to OptoPrimeV2 in
  optimize_graph() so live section uses real optimization
- Rewrite live Section 9 to mirror Section 8 structure with real
  optimizer (optimizer=None auto-creates OptoPrimeV2) and same eval_fn
- Fix Colab install cell: add sed patch for Python 3.12 compat,
  correct repo URL and branch for checkout
- Fix Colab badge URL to point to fork (mjehanzaib999/NewTrace)
- Fix StubLLM scoring: baseline no longer saturates at 0.95;
  optimization now demonstrates clear improvement (0.47 → 0.64)
- Replace rate-limited meta-llama/llama-3.3-70b-instruct:free with
  qwen/qwen3-next-80b-a3b-instruct:free (instruction-tuned, no thinking traces)
- Use eval_fn=None in Section 9 live optimization so optimize_graph()
  uses the library's _default_eval_fn which reads eval.score from the
  evaluator span in the OTLP trace
- Fix Cell 30 header to say 'openai client' instead of 'Trace LiteLLM'
apply_updates() now normalizes ParameterNode object keys to strings
via _normalize_key(), so OptoPrimeV2 updates are no longer silently
skipped. ingest_tgj() gains a param_cache to reuse stable
ParameterNode instances across multi-query iterations. The backward
pass now iterates all output nodes, and stale OTLP spans are flushed
at the start of optimize_graph().

- bindings.py: accept Dict[Any, Any], return applied dict
- tgj_ingest.py: add param_cache kwarg for ParameterNode reuse
- optimization.py: flush stale spans, use param_cache, fix backward
  loop, use applied dict from apply_updates()
- notebook: enable INFO logging in live optimization cell
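The key normalization can be sketched as follows (the ParameterNode class here is a minimal stand-in for illustration):

```python
class ParameterNode:
    """Minimal stand-in: real nodes carry a .name attribute."""
    def __init__(self, name):
        self.name = name

def _normalize_key(key):
    # Accept either a node object or a plain string, so updates keyed by
    # ParameterNode objects are no longer silently skipped.
    return key if isinstance(key, str) else getattr(key, "name", str(key))

def apply_updates(params, updates):
    """Apply updates whose keys may be strings or node objects; return the
    dict of updates actually applied (string-keyed)."""
    applied = {}
    for key, value in updates.items():
        name = _normalize_key(key)
        if name in params:
            params[name] = value
            applied[name] = value
    return applied
```

Returning the applied dict (rather than nothing) lets callers detect which updates matched a known parameter and which were dropped.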
The GraphPropagator asserts that user_feedback is identical when
aggregating across multiple backward passes. Running zero_feedback →
backward → step per query (matching the BBEH notebook pattern) avoids
this and lets each query contribute updates independently.
…optimizer steps

Replace the per-query backward/step loop with Trace's canonical minibatch
pattern: batchify all output nodes into a single batched target and all
per-query feedback into a single batched feedback string, then call
backward() and step() once. This avoids the GraphPropagator assertion
("user feedback should be the same for all children") while ensuring all
queries' graph paths contribute to the optimization gradient.

The batchify import is lazy-loaded via _ensure_trace_imports() to avoid
pulling in numpy and the trainer package at module level.
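The batching step can be sketched with a plain-Python stand-in for Trace's batchify (illustrative, not the library's API): all per-query outputs become one batched target, and all per-query feedback collapses into one string so a single backward()/step() sees one consistent feedback value.

```python
def batch_targets_and_feedback(output_nodes, feedbacks):
    # One batched target covering every query's output node.
    target = list(output_nodes)
    # One batched feedback string, so the propagator's "same feedback for
    # all children" invariant holds across the whole minibatch.
    feedback = "\n".join(f"query {i}: {fb}" for i, fb in enumerate(feedbacks))
    return target, feedback
```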
Implement TelemetrySession activation via contextvars so @trace.bundle
ops and MessageNode creation can emit OTEL spans outside LangGraph.

- Add BundleSpanConfig and MessageNodeTelemetryConfig to control span
  emission and node-to-span binding (message.id)
- Add bundle_span() context manager and on_message_node_created() hook
  in TelemetrySession for non-LangGraph OTEL visibility
- Wrap sync_forward/async_forward in optional OTEL span when session active
- Emit temporal-ignore child span in call_llm for provider monitoring
- Activate session inside InstrumentedGraph root span so Trace hooks
  discover it automatically
- Add opto.features.mlflow with autolog/disable_autolog (safe no-op
  when MLflow not installed)
- Add opto.trace.settings for global MLflow toggle
- Align export naming to otlp.json/tgj.json with legacy aliases
- Add manifest.json and message_nodes.jsonl to export bundle
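The export layout described in the last two bullets could look like this (file names are from the PR description; the writing logic and manifest contents here are guesses):

```python
import json
import pathlib

def export_run_bundle(out_dir, otlp, tgj, nodes):
    """Write the bundle: new names, legacy aliases, manifest, node records."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "otlp.json").write_text(json.dumps(otlp))
    (out / "tgj.json").write_text(json.dumps(tgj))
    # Backward-compatible aliases for pre-M2 consumers.
    (out / "otlp_trace.json").write_text(json.dumps(otlp))
    (out / "trace_graph.json").write_text(json.dumps(tgj))
    # Manifest of primary files, for debugging.
    (out / "manifest.json").write_text(json.dumps({"files": ["otlp.json", "tgj.json"]}))
    # One JSON object per line for each message node record.
    with (out / "message_nodes.jsonl").open("w") as f:
        for n in nodes:
            f.write(json.dumps(n) + "\n")
```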