feat: Unified Telemetry Layer for Non-LangGraph Trace Pipelines (M2) #64

Draft
mjehanzaib999 wants to merge 41 commits into AgentOpt:experimental from mjehanzaib999:m2-unified-telemetry

Conversation

@mjehanzaib999

Summary

This PR implements the "Generic Unified Telemetry" layer (Milestone 2), enabling OTEL span emission for non-LangGraph Trace pipelines while preserving all existing LangGraph instrumentation behavior.

After M1, only LangGraph pipelines could emit OTEL spans. This PR extends telemetry coverage so that any Trace pipeline using @trace.bundle or call_llm can produce OTEL-compatible spans when a TelemetrySession is active — with zero changes to existing code when no session is active.

What's new

  • Session activation via contextvars — TelemetrySession supports with-statement context management and activate() for global discovery by Trace hooks
  • OTEL spans around @trace.bundle ops — controlled by BundleSpanConfig (enable/disable, suppress default ops, capture inputs)
  • MessageNode-to-span binding — MessageNodeTelemetryConfig binds message.id to the current span for stable node identity in TGJ conversion
  • call_llm provider span — emits a child OTEL span with trace.temporal_ignore=true when a session is active (visible for monitoring, excluded from output node selection)
  • Session activation in LangGraph root span — InstrumentedGraph._root_invocation_span now calls session.activate() so Trace-level hooks discover the session automatically
  • Optional MLflow autologging — opto.features.mlflow.autolog() enables mlflow.trace wrapping on bundle ops; safe no-op when MLflow is not installed
  • Export naming alignment — export_run_bundle() now writes otlp.json / tgj.json (aligned with repo demos), with backward-compatible aliases (otlp_trace.json / trace_graph.json)
  • Manifest + node records — manifest.json and message_nodes.jsonl are included in the export bundle for debugging

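A minimal sketch of the contextvar-based session activation described above. The class body here is illustrative only (not the PR's actual implementation); it shows how a with-statement and activate()/deactivate() can make the session globally discoverable via TelemetrySession.current():

```python
import contextvars
from typing import Optional

# Contextvar holding the currently active session (None outside any session).
_current_session: contextvars.ContextVar = contextvars.ContextVar(
    "telemetry_session", default=None
)

class TelemetrySession:
    """Illustrative stand-in: contextvar-backed session activation."""

    @classmethod
    def current(cls) -> Optional["TelemetrySession"]:
        # Hooks call this to discover an active session; None means no-op.
        return _current_session.get()

    def activate(self) -> "TelemetrySession":
        # Store the reset token so deactivation restores the previous state.
        self._token = _current_session.set(self)
        return self

    def deactivate(self) -> None:
        _current_session.reset(self._token)

    def __enter__(self) -> "TelemetrySession":
        return self.activate()

    def __exit__(self, *exc) -> bool:
        self.deactivate()
        return False
```

Using contextvars (rather than a plain global) keeps activation correct under async code and threads, which matters for the async_forward path mentioned below.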
Files changed (9 files, +664 / -71)

opto/trace/settings.py — New: global MLflow autologging toggle
opto/features/mlflow/__init__.py — New: MLflow integration package
opto/features/mlflow/autolog.py — New: autolog() / disable_autolog()
opto/trace/__init__.py — Expose settings and mlflow in the public API
opto/trace/bundle.py — Optional OTEL span in sync_forward/async_forward; MLflow mlflow.trace wrapping
opto/trace/io/telemetry_session.py — Major expansion: activation, BundleSpanConfig, MessageNodeTelemetryConfig, span helpers, MLflow helpers, export alignment
opto/trace/io/instrumentation.py — Wrap root span with session.activate()
opto/trace/nodes.py — Hook MessageNode.__init__ to call on_message_node_created()
opto/trace/operators.py — call_llm emits a temporal-ignore provider span

Non-breaking guarantees

  • No session active → identical behavior — all hooks are guarded by TelemetrySession.current() is None checks
  • postprocess_output signature unchanged — preserves compatibility with existing callers
  • preprocess_inputs preserved — data extraction inside trace_nodes context is untouched
  • MLflow is optional — all imports are guarded; code works without MLflow installed

Test plan

  • opto.trace import works without errors
  • TelemetrySession + BundleSpanConfig + MessageNodeTelemetryConfig import correctly
  • Bundle ops without a session produce identical results to M1 (no regression)
  • Bundle ops with active session emit OTEL spans with trace.bundle.* and inputs.* attributes
  • TelemetrySession.current() returns None outside context, active session inside
  • export_run_bundle() produces otlp.json, tgj.json, manifest.json + legacy aliases
  • autolog(silent=True) gracefully disables when MLflow is not installed
  • Run M1 notebook end-to-end to confirm no regressions
  • Run M2 demo notebook (generic_unified_telemetry_demo.ipynb)
  • pytest suite passes in clean environment

doxav and others added 30 commits February 12, 2026 15:01
…tion do not lose initial node to optimize (TODO: trainer might have a better solution)
- Add T1 technical plan for LangGraph OTEL Instrumentation API
- Add architecture & strategy doc (unified OTEL instrumentation design)
- Add M0 README with before/after boilerplate reduction comparison
- Add feedback analysis and API strategy comparison (Trace-first, dual semconv)
- Add prototype_api_validation.py with real LangGraph StateGraph + OpenRouter/StubLLM
- Add Jupyter notebook (prototype_api_validation.ipynb) for Colab-ready demo
- Add example trace output JSON files (notebook_trace_output, optimization_traces)
- Add .env.example for OpenRouter configuration
- Replace hardcoded API key with 3-tier auto-lookup (Colab Secrets → env → .env)
- Save all trace outputs to RUN_FOLDER (Google Drive on Colab, local fallback)
- Add run_summary.json export with scores and history
- Update configuration docs with key setup priority table
- Fix Colab badge URL with actual repo/branch path
Deliver Milestone 1 — drop-in OTEL instrumentation and end-to-end
optimization for any LangGraph agent via two function calls.

New modules (opto/trace/io/):
- instrumentation.py: instrument_graph() + InstrumentedGraph wrapper
- optimization.py: optimize_graph() loop + EvalResult/EvalFn contracts
- telemetry_session.py: TelemetrySession (TracerProvider + flush/export)
- bindings.py: Binding dataclass + apply_updates() + make_dict_binding()
- otel_semconv.py: emit_reward(), emit_trace(), record_genai_chat()

Modified modules:
- langgraph_otel_runtime.py: TracingLLM dual semconv (param.* parent +
  gen_ai.* child spans with trace.temporal_ignore)
- __init__.py: export all new M1 public APIs

Tests (63 passing, StubLLM-only, CI-safe):
- Unit tests for bindings, semconv, session, instrumentation, optimization
- E2E integration test (test_e2e_m1_pipeline.py): real LangGraph with
  StubLLM proving full pipeline instrument → invoke → OTLP → TGJ →
  optimizer → apply_updates → re-invoke with updated template

Notebook + docs:
- 01_m1_instrument_and_optimize.ipynb: dual-mode (StubLLM + live
  OpenRouter), Colab badge, executed outputs, <=3 item dataset,
  temperature=0, max_tokens=256 budget guard
- docs/m1_README.md: architecture, API reference, data flow, semantic
  conventions, acceptance criteria status
- requirements.txt: pinned dependencies for uv/pip environments
A. Live mode error handling:
 - A1: TracingLLM raises LLMCallError on HTTP errors/empty content instead of passing error strings as assistant content
 - A2: Notebook only prints [OK] when provider call actually succeeds with non-empty content
 - A3: gen_ai.provider.name correctly set to "openrouter" (not "openai") when using OpenRouter
 - A4: optimize_graph forces score=0 on invocation failure, bypassing eval_fn

B. TelemetrySession API correctness + redaction:
 - B5: flush_otlp(clear=False) properly peeks at spans without clearing the exporter
 - B6: span_attribute_filter now applied during flush_otlp; supports drop (return {}), redact, and truncate

C. TGJ/ingest correctness and optimizer safety:
 - C7: _deduplicate_param_nodes() strips numeric suffixes to collapse duplicate ParameterNodes
 - C8: _select_output_node() excludes child LLM spans, selects the true sink (synthesizer)

D. OTEL topology and temporal chaining:
 - D9: Root invocation span wraps graph.invoke(), producing a single trace ID per invocation
 - D10: Temporal chaining uses trace.temporal_ignore attribute instead of OTEL parent presence

E. optimize_graph semantics + trace-linked reward:
 - E11: best_parameters is a real snapshot captured at the best-scoring iteration
 - E12: eval.score attached to root invocation span before flush, linking reward to trace

F. Non-saturating scoring for Stub mode:
 - F13: StubLLM and eval_fn are structure-aware; stub optimization demonstrates score improvement

Files changed:
 - langgraph_otel_runtime.py: LLMCallError, _validate_content, flush_otlp(clear=)
 - telemetry_session.py: flush_otlp delegation, _apply_attribute_filter
 - otel_adapter.py: root span exclusion, trace.temporal_ignore chaining
 - instrumentation.py: _root_invocation_span context manager, root span on invoke/stream
 - optimization.py: _deduplicate_param_nodes, _select_output_node, _snapshot_parameters, eval-in-trace
 - __init__.py: export LLMCallError
 - test_optimization.py: updated for best_parameters field
 - 01_m1_instrument_and_optimize.ipynb: all fixes reflected in notebook
 - test_client_feedback_fixes.py: 20 new tests covering all 13 issues
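The B6 attribute-filter contract above can be sketched as follows (function names and the redact helper are illustrative; the contract — return {} to drop, or a transformed dict to redact/truncate — is from the commit message):

```python
def apply_attribute_filter(attrs, attr_filter):
    # The filter receives a copy of the span attributes and may drop
    # everything by returning {}, or redact/truncate individual values.
    if attr_filter is None:
        return dict(attrs)
    return attr_filter(dict(attrs))

def redact(keys):
    """Build a filter that masks the values of the given attribute keys."""
    def _filter(attrs):
        return {k: ("<redacted>" if k in keys else v) for k, v in attrs.items()}
    return _filter
```

Applying the filter at flush time (rather than at span creation) means sensitive values never reach the exported OTLP payload while remaining visible to in-process code.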
… code

Make the instrumentation layer fully generic and provider-agnostic:

- TracingLLM: default provider_name "openai" → "llm",
  default llm_span_name "openai.chat.completion" → "llm.chat.completion"
- init_otel_runtime: default service_name "trace-langgraph-demo" → "trace-otel-runtime"
- DEFAULT_EVAL_METRIC_KEYS: remove example-specific "plan_quality",
  add generic "score"
- instrument_graph: add llm_span_name, input_key, output_key parameters
  so callers explicitly configure provider/schema specifics
- InstrumentedGraph: add input_key field; invoke()/stream() use it
  instead of hardcoded "query" for the root span hint
- optimize_graph: add output_key parameter; _make_state uses
  graph.input_key instead of hardcoded "query"; error fallback
  no longer assumes result["answer"]
- _select_output_node: replace hardcoded "openai"/"chat.completion"
  name checks with trace.temporal_ignore attribute from info.otel
- otel_adapter: propagate temporal_ignore flag into TGJ info dict
- tgj_ingest: preserve info.otel metadata through conversion and
  onto MessageNode objects

Tests and notebook updated to explicitly pass example-specific values
(provider_name, llm_span_name, output_key) rather than relying on defaults.

All 88 tests pass.
…st iteration

Previously, best_updates was overwritten on every iteration where updates
were applied, regardless of whether that iteration achieved the best score.
This caused best_updates to always contain the last applied updates rather
than the updates that produced the best-performing parameters.

Introduce last_applied_updates to track the most recently applied updates
separately, and snapshot it at the start of each iteration as
applied_updates_for_this_iter. best_updates is now only assigned inside
the best-score guard (avg_score > best_score), ensuring it accurately
reflects the updates that led to best_parameters.

Addresses PR feedback item doxav#1: optimize_graph() best_updates tracking.
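The fix can be sketched with a stand-in loop (not the repo's optimize_graph): snapshot the updates already in effect before the best-score check, and assign best_updates only inside the guard.

```python
def optimize_loop(iterations, apply_fn):
    # iterations: iterable of (updates, avg_score); apply_fn applies updates.
    best_score = float("-inf")
    best_updates = None
    last_applied_updates = None
    for updates, avg_score in iterations:
        # Snapshot: the updates that were in effect when this score was measured.
        applied_updates_for_this_iter = last_applied_updates
        if avg_score > best_score:  # best-score guard
            best_score = avg_score
            best_updates = applied_updates_for_this_iter
        if updates is not None:
            apply_fn(updates)
            last_applied_updates = updates
    return best_updates, best_score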
optimize_graph() previously ignored the graph's configured output_key
unless the caller explicitly passed output_key=..., causing incorrect
eval payload shape. Now auto-inherits graph.output_key when the parameter
is not provided, and logs a debug note when an explicit override disagrees
with the graph's configuration.

Addresses PR feedback item doxav#2: output_key fallback in optimize_graph.
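The fallback can be sketched as a small resolver (names illustrative): inherit the graph's configured key when the caller passes nothing, and log a debug note on a disagreeing override.

```python
import logging

log = logging.getLogger("optimize_graph")

def resolve_output_key(graph_output_key, explicit=None):
    # No explicit value: auto-inherit the graph's configuration.
    if explicit is None:
        return graph_output_key
    # Explicit override wins, but a disagreement is worth a debug note.
    if explicit != graph_output_key:
        log.debug("output_key=%r overrides graph.output_key=%r",
                  explicit, graph_output_key)
    return explicit
```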
enable_code_optimization was accepted by instrument_graph() but never
used — TracingLLM.emit_code_param always remained None. Now constructs
a _emit_code_param callback when the flag is True that emits source code,
SHA-256 hash, truncation metadata, and trainable marker as param.__code_*
span attributes. Source is capped at 10K chars with truncation flag.

Addresses PR feedback item doxav#3: enable_code_optimization no-op.
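The emitted attributes described above could look roughly like this (the param.__code_* key names follow the commit message, but the exact keys and the helper itself are assumptions):

```python
import hashlib

def code_param_attributes(source: str, max_chars: int = 10_000) -> dict:
    """Build span attributes for a code parameter: source (capped at
    max_chars), SHA-256 hash of the full source, truncation flag, and
    trainable marker."""
    truncated = len(source) > max_chars
    return {
        "param.__code_source": source[:max_chars],
        "param.__code_sha256": hashlib.sha256(source.encode()).hexdigest(),
        "param.__code_truncated": truncated,
        "param.__code_trainable": True,
    }
```

Hashing the full source (before truncation) lets downstream tooling detect code changes even when the emitted source itself is capped.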
(4A) otel_adapter: after temporal hierarchy resolution, null out
effective_psid when it still references a skipped root invocation span,
preventing dangling parent edges in the TGJ graph.

(4B) langgraph_otel_runtime: capture child LLM span ref and propagate
error/error.type attributes to it on LLMCallError and unexpected
exceptions, so OTEL UIs correctly flag the LLM call as failed.

Addresses PR feedback item doxav#4.
…race validation

Notebook trace validation used "openai" in name to detect child spans,
which silently matched nothing after the generic refactoring. Now uses
trace.temporal_ignore attribute for provider-agnostic detection and
asserts the set is non-empty. Also adds root invocation span assertion
to enforce the D9 single-trace-ID invariant.

Addresses PR feedback item doxav#6.
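The provider-agnostic detection can be sketched over plain-dict spans (illustrative shape, not the notebook's actual objects): filter on the trace.temporal_ignore attribute and assert the result is non-empty.

```python
def child_llm_spans(spans):
    # Select provider child spans by attribute, not by matching "openai"
    # in the span name; raise if none are found so silent misses are caught.
    found = [s for s in spans
             if s.get("attributes", {}).get("trace.temporal_ignore")]
    assert found, "expected at least one provider child span"
    return found
```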
…into m1-for-upstream

…e spans

Library (langgraph_otel_runtime.py):
- Restructure child LLM span error handling: catch errors inside the
  child span context manager so attributes are set before the span ends
- Add error.message attribute (truncated to 500 chars) on both parent
  and child spans for LLMCallError and unexpected exceptions

Notebook (01_m1_instrument_and_optimize.ipynb):
- Rewrite graph to 6-node architecture aligned with reference demo:
  planner → executor → web_researcher/wikidata_researcher → synthesizer → evaluator
- Use Command routing from langgraph.types for dynamic node dispatch
- Switch to DEMO_QUERIES (French Revolution / Tesla / CRISPR)
- Add 3 trainable templates (planner, executor, synthesizer) with output_key=final_answer
- Rewrite StubLLM to produce JSON plans, routing JSON, and topic-aware
  answers; respond to prompt template changes for non-saturating scoring
- Rewrite stub_eval_fn: base 0.2 + plan richness + answer length, cap 0.95
- Fix live section: provider_name="openrouter", trace invariant checks,
  only print [OK] on actual success
- Fix ParameterNode deduplication in TGJ inspection (id-based dedup)
- Update Colab Drive paths to OpenTrace_runs/M1/{OPENTRACE_REF}
- Add optimization table output (iteration → avg_score → best_score)

Verified: 41 tests pass, notebook runs end-to-end, baseline=0.75 → best=0.95
…imeV2 optimizer

- Remove custom OpenRouterLLM HTTP class; use opto.utils.llm.LiteLLM
  which natively supports OpenRouter via the "openrouter/" model prefix
- Upgrade auto-created optimizer from OptoPrime to OptoPrimeV2 in
  optimize_graph() so live section uses real optimization
- Rewrite live Section 9 to mirror Section 8 structure with real
  optimizer (optimizer=None auto-creates OptoPrimeV2) and same eval_fn
- Fix Colab install cell: add sed patch for Python 3.12 compat,
  correct repo URL and branch for checkout
- Fix Colab badge URL to point to fork (mjehanzaib999/NewTrace)
- Fix StubLLM scoring: baseline no longer saturates at 0.95;
  optimization now demonstrates clear improvement (0.47 → 0.64)
- Replace rate-limited meta-llama/llama-3.3-70b-instruct:free with
  qwen/qwen3-next-80b-a3b-instruct:free (instruction-tuned, no thinking traces)
- Use eval_fn=None in Section 9 live optimization so optimize_graph()
  uses the library's _default_eval_fn which reads eval.score from the
  evaluator span in the OTLP trace
- Fix Cell 30 header to say 'openai client' instead of 'Trace LiteLLM'
apply_updates() now normalizes ParameterNode object keys to strings
via _normalize_key(), so OptoPrimeV2 updates are no longer silently
skipped. ingest_tgj() gains a param_cache to reuse stable
ParameterNode instances across multi-query iterations. The backward
pass now iterates all output nodes, and stale OTLP spans are flushed
at the start of optimize_graph().

- bindings.py: accept Dict[Any, Any], return applied dict
- tgj_ingest.py: add param_cache kwarg for ParameterNode reuse
- optimization.py: flush stale spans, use param_cache, fix backward
  loop, use applied dict from apply_updates()
- notebook: enable INFO logging in live optimization cell
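The key normalization can be sketched as follows (the ParameterNode class here is a minimal stand-in for illustration):

```python
class ParameterNode:
    """Minimal stand-in: real nodes carry a .name attribute."""
    def __init__(self, name):
        self.name = name

def _normalize_key(key):
    # Accept either a node object or a plain string, so updates keyed by
    # ParameterNode objects are no longer silently skipped.
    return key if isinstance(key, str) else getattr(key, "name", str(key))

def apply_updates(params, updates):
    """Apply updates whose keys may be strings or node objects; return the
    dict of updates actually applied (string-keyed)."""
    applied = {}
    for key, value in updates.items():
        name = _normalize_key(key)
        if name in params:
            params[name] = value
            applied[name] = value
    return applied
```

Returning the applied dict (rather than nothing) lets callers detect which updates matched a known parameter and which were dropped.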
The GraphPropagator asserts that user_feedback is identical when
aggregating across multiple backward passes. Running zero_feedback →
backward → step per query (matching the BBEH notebook pattern) avoids
this and lets each query contribute updates independently.
…optimizer steps

Replace the per-query backward/step loop with Trace's canonical minibatch
pattern: batchify all output nodes into a single batched target and all
per-query feedback into a single batched feedback string, then call
backward() and step() once. This avoids the GraphPropagator assertion
("user feedback should be the same for all children") while ensuring all
queries' graph paths contribute to the optimization gradient.

The batchify import is lazy-loaded via _ensure_trace_imports() to avoid
pulling in numpy and the trainer package at module level.
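The batching step can be sketched with a plain-Python stand-in for Trace's batchify (illustrative, not the library's API): all per-query outputs become one batched target, and all per-query feedback collapses into one string so a single backward()/step() sees one consistent feedback value.

```python
def batch_targets_and_feedback(output_nodes, feedbacks):
    # One batched target covering every query's output node.
    target = list(output_nodes)
    # One batched feedback string, so the propagator's "same feedback for
    # all children" invariant holds across the whole minibatch.
    feedback = "\n".join(f"query {i}: {fb}" for i, fb in enumerate(feedbacks))
    return target, feedback
```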
Implement TelemetrySession activation via contextvars so @trace.bundle
ops and MessageNode creation can emit OTEL spans outside LangGraph.

- Add BundleSpanConfig and MessageNodeTelemetryConfig to control span
  emission and node-to-span binding (message.id)
- Add bundle_span() context manager and on_message_node_created() hook
  in TelemetrySession for non-LangGraph OTEL visibility
- Wrap sync_forward/async_forward in optional OTEL span when session active
- Emit temporal-ignore child span in call_llm for provider monitoring
- Activate session inside InstrumentedGraph root span so Trace hooks
  discover it automatically
- Add opto.features.mlflow with autolog/disable_autolog (safe no-op
  when MLflow not installed)
- Add opto.trace.settings for global MLflow toggle
- Align export naming to otlp.json/tgj.json with legacy aliases
- Add manifest.json and message_nodes.jsonl to export bundle
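The export layout described in the last two bullets could look like this (file names are from the PR description; the writing logic and manifest contents here are guesses):

```python
import json
import pathlib

def export_run_bundle(out_dir, otlp, tgj, nodes):
    """Write the bundle: new names, legacy aliases, manifest, node records."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "otlp.json").write_text(json.dumps(otlp))
    (out / "tgj.json").write_text(json.dumps(tgj))
    # Backward-compatible aliases for pre-M2 consumers.
    (out / "otlp_trace.json").write_text(json.dumps(otlp))
    (out / "trace_graph.json").write_text(json.dumps(tgj))
    # Manifest of primary files, for debugging.
    (out / "manifest.json").write_text(json.dumps({"files": ["otlp.json", "tgj.json"]}))
    # One JSON object per line for each message node record.
    with (out / "message_nodes.jsonl").open("w") as f:
        for n in nodes:
            f.write(json.dumps(n) + "\n")
```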