[llm-d-legacy] Import the TOPSAIL legacy LLM-D project for more advance testing by kpouget · Pull Request #42 · openshift-psap/forge

kpouget · 2026-04-24T13:35:39Z

Summary by CodeRabbit

Release Notes

New Features
- Added comprehensive LLM inference service testing, deployment, and benchmarking framework.
- Introduced new Grafana dashboards for monitoring container, Kubernetes, GPU, vLLM, and workload metrics.
- Added performance analysis and benchmarking visualization capabilities with regression tracking.
- Implemented CI orchestration and cluster preparation workflows for LLM testing.

openshift-ci · 2026-04-24T13:35:44Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign albertoperdomo2 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-04-24T13:35:49Z

Warning

Rate limit exceeded

@kpouget has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 37 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8135a592-dbd0-43bf-af6b-84447dcb49ad

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1d536 and 7fc6400.

📒 Files selected for processing (68)

projects/core/library/ci.py
projects/legacy/library/run.py
projects/llm_d_legacy/orchestration/ci.py
projects/llm_d_legacy/testing/command_args.yml.j2
projects/llm_d_legacy/testing/config.yaml
projects/llm_d_legacy/testing/epp-config/epp-approximate-prefix-cache.yaml
projects/llm_d_legacy/testing/epp-config/epp-default-rhoai-3.4-ea.1.yaml
projects/llm_d_legacy/testing/epp-config/epp-pd.v0.4.yaml
projects/llm_d_legacy/testing/epp-config/epp-pd.v0.6.yaml
projects/llm_d_legacy/testing/epp-config/epp-precise-prefix-cache.yaml
projects/llm_d_legacy/testing/grafana/dashboards/container-overview.yaml
projects/llm_d_legacy/testing/grafana/dashboards/k8s-dashboard-starslcn.yaml
projects/llm_d_legacy/testing/grafana/dashboards/kubernetes-app-metrics.yaml
projects/llm_d_legacy/testing/grafana/dashboards/kubernetes-by-namespace-instance.yaml
projects/llm_d_legacy/testing/grafana/dashboards/llmd-vllm-wva.yaml
projects/llm_d_legacy/testing/grafana/dashboards/nvidia.yaml
projects/llm_d_legacy/testing/grafana/dashboards/vllm.yaml
projects/llm_d_legacy/testing/grafana/dashboards/workload-variant-autoscaler.yaml
projects/llm_d_legacy/testing/grafana/datasource.yaml
projects/llm_d_legacy/testing/llmisvcs/llmisvc-pd.yaml
projects/llm_d_legacy/testing/llmisvcs/llmisvc-simple.yaml
projects/llm_d_legacy/testing/prepare_llmd.py
projects/llm_d_legacy/testing/test.py
projects/llm_d_legacy/testing/test_llmd.py
projects/llm_d_legacy/toolbox/llmd.py
projects/llm_d_legacy/toolbox/llmd_capture_isvc_state/defaults/main/config.yml
projects/llm_d_legacy/toolbox/llmd_capture_isvc_state/tasks/main.yml
projects/llm_d_legacy/toolbox/llmd_capture_isvc_state/vars/main/resources.yml
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/defaults/main/config.yml
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/files/.keep
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/meta/main.yml
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/tasks/main.yml
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/templates/gateway.yaml.j2
projects/llm_d_legacy/toolbox/llmd_deploy_gateway/vars/main/resources.yml
projects/llm_d_legacy/toolbox/llmd_deploy_llm_inference_service/defaults/main/config.yml
projects/llm_d_legacy/toolbox/llmd_deploy_llm_inference_service/tasks/main.yml
projects/llm_d_legacy/toolbox/llmd_deploy_llm_inference_service/vars/main/resources.yml
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/defaults/main/config.yml
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/tasks/main.yml
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/templates/copy_helper_pod.yaml.j2
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_job.yaml.j2
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/templates/guidellm_benchmark_pvc.yaml.j2
projects/llm_d_legacy/toolbox/llmd_run_guidellm_benchmark/vars/main/resources.yml
projects/llm_d_legacy/toolbox/storage_download_to_pvc/defaults/main/config.yml
projects/llm_d_legacy/toolbox/storage_download_to_pvc/files/entrypoint.sh
projects/llm_d_legacy/toolbox/storage_download_to_pvc/meta/main.yml
projects/llm_d_legacy/toolbox/storage_download_to_pvc/tasks/main.yml
projects/llm_d_legacy/toolbox/storage_download_to_pvc/templates/pod.yml.j2
projects/llm_d_legacy/toolbox/storage_download_to_pvc/templates/pvc.yml.j2
projects/llm_d_legacy/toolbox/storage_download_to_pvc/vars/main/resources.yml
projects/llm_d_legacy/visualizations/llmd_inference/analyze/__init__.py
projects/llm_d_legacy/visualizations/llmd_inference/data/plots.yaml
projects/llm_d_legacy/visualizations/llmd_inference/models/__init__.py
projects/llm_d_legacy/visualizations/llmd_inference/models/kpi.py
projects/llm_d_legacy/visualizations/llmd_inference/models/lts.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/__init__.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/error_report.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/prometheus.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/prometheus_reports.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/report.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/throughput_analysis.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/throughput_comparisons.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/utils.py
projects/llm_d_legacy/visualizations/llmd_inference/plotting/vllm_metrics.py
projects/llm_d_legacy/visualizations/llmd_inference/requirements.txt
projects/llm_d_legacy/visualizations/llmd_inference/store/__init__.py
projects/llm_d_legacy/visualizations/llmd_inference/store/lts_parser.py
projects/llm_d_legacy/visualizations/llmd_inference/store/parsers.py

📝 Walkthrough

This PR introduces comprehensive LLM-D testing infrastructure, including CI orchestration, cluster preparation, Kubernetes inference service deployments, Grafana monitoring dashboards, benchmark running with GuideLLM, visualization pipelines, and matrix benchmarking support. It also updates CI error-handling logic in core and legacy libraries.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CI and Error Handling `projects/core/library/ci.py`, `projects/legacy/library/run.py` | Minor updates: conditional FAILURE file writing based on `env.ARTIFACT_DIR` presence, and switching from `sys.exit(1)` to `raise SystemExit(1)` during exception handling. |
| LLM-D CI Orchestration `projects/llm_d_legacy/orchestration/ci.py` | New CLI entrypoint with `main` command group and `test` subcommand; wraps test execution with logging, environment initialization, caliper export, and structured exit codes. |
| LLM-D Testing Configuration and Execution `projects/llm_d_legacy/testing/config.yaml`, `projects/llm_d_legacy/testing/command_args.yml.j2`, `projects/llm_d_legacy/testing/test.py`, `projects/llm_d_legacy/testing/test_llmd.py`, `projects/llm_d_legacy/testing/prepare_llmd.py` | Core testing modules: configuration for vaults, cluster profiles, models, and benchmarking; CLI entrypoints for test orchestration; test execution with multi-flavor ISVC deployment, GuideLLM benchmarking, and Prometheus capture; cluster preparation including operators, Grafana, monitoring, pull-secret management, model PVC downloads, and GPU readiness. |
| LLM-D Kubernetes Manifests `projects/llm_d_legacy/testing/epp-config/*.yaml`, `projects/llm_d_legacy/testing/llmisvcs/*.yaml` | Kubernetes custom resources: endpoint picker configurations (queue/cache/max-score scoring), and LLM inference service definitions (with pod templates, routing, GPU resources, prefill workloads, and VLLM tuning). |
| LLM-D Grafana Monitoring `projects/llm_d_legacy/testing/grafana/dashboards/*.yaml`, `projects/llm_d_legacy/testing/grafana/datasource.yaml` | Grafana custom resources: datasource configuration (Thanos/Prometheus), and seven dashboards covering container overview, Kubernetes app metrics, namespace/instance filtering, GPU (NVIDIA) metrics, vLLM inference metrics, workload autoscaler, and baseline resource monitoring. |
| LLM-D Ansible Toolbox Roles `projects/llm_d_legacy/toolbox/llmd.py`, `projects/llm_d_legacy/toolbox/llmd_*/...`, `projects/llm_d_legacy/toolbox/storage_download_to_pvc/...` | Ansible role implementations: `Llmd` class with methods for gateway deployment, ISVC deployment, GuideLLM benchmark execution, ISVC state capture; supporting roles with defaults, tasks, templates, and vars for gateway setup, ISVC deployment, GuideLLM job orchestration, and PVC downloads (with multi-protocol support: HF, HTTPS/git, S3, DMF). |
| LLM-D Visualization and Analysis `projects/llm_d_legacy/visualizations/llmd_inference/...` | Complete visualization pipeline: LTS data models (settings, metadata, results, KPIs), store/parser infrastructure for GuideLLM benchmarks and Prometheus metrics, and Dash-based reporting modules (error reports, Prometheus resource/GPU/system health, GuideLLM performance analysis and throughput scaling, baseline/routing/P-D comparisons, VLLM metrics). |
| Matrix Benchmarking Framework `projects/matrix_benchmarking/library/matbenchmark.py`, `projects/matrix_benchmarking/library/visualize.py`, `projects/matrix_benchmarking/subproject/...` | Benchmarking execution (prepare/save/run matbench files and commands) and comprehensive visualization orchestration CLI with matrix parsing, LTS generation, historical downloads, regressions analysis via hunter/stdev/z-score methods, and per-filter visualization output. Includes plugin architecture documentation, executable wrapper, and regression analyzers. |
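
For context on the z-score method named in the Matrix Benchmarking Framework entry, here is a minimal sketch of how a z-score regression check generally works. The function names `z_score` and `is_regression`, the sample data, and the threshold of 3.0 are assumptions for illustration only; this is not the analyzer shipped in `projects/matrix_benchmarking`.

```python
import math
import statistics


def z_score(history, new_value):
    """Return how many sample standard deviations new_value is from the mean of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    if stdev == 0:
        # All historical values are identical: any deviation is infinitely unusual.
        if new_value == mean:
            return 0.0
        return math.copysign(math.inf, new_value - mean)
    return (new_value - mean) / stdev


def is_regression(history, new_value, threshold=3.0):
    """Flag new_value when it deviates from history by more than `threshold` deviations.

    The threshold is a hypothetical default, not a value taken from the PR.
    """
    return abs(z_score(history, new_value)) > threshold


if __name__ == "__main__":
    # Hypothetical latency samples (ms) from earlier runs.
    previous_runs = [100, 102, 98, 101, 99]
    print(is_regression(previous_runs, 101))  # within normal variation
    print(is_regression(previous_runs, 110))  # far outside normal variation
```

A real analyzer would also need to know whether higher or lower is better for a given KPI; this sketch flags deviations in either direction.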

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Test CLI
    participant Prepare as Cluster Prepare
    participant Deploy as ISVC Deploy
    participant Bench as GuideLLM Bench
    participant Capture as State Capture
    participant Visualize as Visualization

    CLI->>Prepare: prepare_ci()
    Prepare->>Prepare: Setup operators/namespace
    Prepare->>Prepare: Deploy Grafana/monitoring
    Prepare->>Prepare: Update pull secrets
    Prepare->>Prepare: Download models to PVC
    Prepare->>Prepare: Preload container images
    
    CLI->>Deploy: test_ci() per flavor
    Deploy->>Deploy: Parse & reshape ISVC YAML
    Deploy->>Deploy: Apply EPP routing config
    Deploy->>Deploy: Deploy LLMInferenceService
    Deploy->>Deploy: Wait for readiness
    
    Deploy->>Bench: run_guidellm_benchmark()
    Bench->>Bench: Create Job + PVC
    Bench->>Bench: Run GuideLLM benchmark
    Bench->>Bench: Extract results.json
    
    CLI->>Capture: capture_llm_inference_service_state()
    Capture->>Capture: Dump ISVC/pods/logs
    Capture->>Capture: Capture Prometheus metrics
    
    CLI->>Visualize: generate_visualization()
    Visualize->>Visualize: Parse results → LTS payload
    Visualize->>Visualize: Generate plots/reports

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

PR #20: Updates same projects/core/library/ci.py module for error-summary FAILURE file handling tied to env.ARTIFACT_DIR.
PR #17: Prior modifications to projects/core/library/ci.py error-reporting flow and _display_error_summary / _write_error_summary_to_file.
PR #6: Adds projects/core/library/env.py which defines ARTIFACT_DIR variable used in this PR's CI error-handling logic.
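
To illustrate the error-handling pattern these PRs revolve around (writing a FAILURE file only when an artifact directory is set, then exiting via `raise SystemExit(1)`), here is a minimal sketch. The helper names `write_failure_file` and `run_and_exit` are hypothetical, and reading `ARTIFACT_DIR` from the process environment is an assumption; the actual code in `projects/core/library/ci.py` may be structured differently.

```python
import os
import pathlib


def write_failure_file(message, artifact_dir=None):
    """Write a FAILURE marker, but only if an artifact directory is configured.

    Falls back to the ARTIFACT_DIR environment variable (an assumption here).
    Returns the path written, or None when there is nowhere to write.
    """
    artifact_dir = artifact_dir or os.environ.get("ARTIFACT_DIR")
    if not artifact_dir:
        return None
    failure_file = pathlib.Path(artifact_dir) / "FAILURE"
    failure_file.write_text(message + "\n")
    return failure_file


def run_and_exit(func, artifact_dir=None):
    """Run func; on any exception, record it and exit with status 1."""
    try:
        func()
    except Exception as exc:
        write_failure_file(f"{type(exc).__name__}: {exc}", artifact_dir)
        # sys.exit(1) raises SystemExit(1) under the hood, so this is equivalent,
        # and `from exc` keeps the original exception attached as the cause.
        raise SystemExit(1) from exc
```

Since `sys.exit()` is itself implemented by raising `SystemExit`, the switch does not change the exit status; it only makes the control flow explicit.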

Poem

🐰 A testing warren grows today,
With grafana dashboards on display,
LLM benchmarks leap and bound,
Inference pipelines all around—
From cluster prep to metrics gleam,
This infrastructure is a dream! 🚀

kpouget · 2026-04-24T13:45:57Z

/test fournos llm_d_legacy psap_h200 intelligentrouting-flavors
/cluster athena-fire
/var fournos.namespace: psap-automation-wip
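
As a reading aid for the three directives above, here is a minimal sketch of how such a comment could be split into a test target, a cluster name, and variable overrides. The function name `parse_directives` and the shape of the returned dictionary are assumptions for illustration; this is not the parser used by the FOURNOS launcher.

```python
def parse_directives(comment):
    """Split a PR comment into /test, /cluster and /var directives.

    Lines that do not start with '/' are ignored. The output layout is hypothetical.
    """
    result = {"test": [], "cluster": None, "vars": {}}
    for line in comment.splitlines():
        line = line.strip()
        if not line.startswith("/"):
            continue
        name, _, rest = line[1:].partition(" ")
        rest = rest.strip()
        if name == "test":
            result["test"] = rest.split()
        elif name == "cluster":
            result["cluster"] = rest
        elif name == "var":
            # "/var key: value" is split on the first colon only.
            key, _, value = rest.partition(":")
            result["vars"][key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    comment = (
        "/test fournos llm_d_legacy psap_h200 intelligentrouting-flavors\n"
        "/cluster athena-fire\n"
        "/var fournos.namespace: psap-automation-wip\n"
    )
    print(parse_directives(comment))
```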

psap-forge-bot · 2026-04-24T13:47:27Z

🔴 Test of 'llm_d_legacy test' failed after 00 hours 00 minutes 00 seconds 🔴

• Link to the test results.

• No reports index generated...

• No test configuration (variable_overrides.yaml) available.

• Failure indicator: Empty.
• Execution logs

psap-forge-bot · 2026-04-24T13:47:45Z

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 31 seconds 🔴

• Link to the test results.

• No reports index generated...

Test configuration:

/test fournos llm_d_legacy psap_h200 intelligentrouting-flavors
/cluster athena-fire
/var fournos.namespace: psap-automation-wip

• Failure indicator:

## /logs/artifacts/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/fournos_launcher/toolbox/submit_and_wait/main.py:169
~~ TASK: wait_for_job_completion: Wait for FOURNOS job to complete
~~ ARTIFACT_DIR: /logs/artifacts/001__submit_and_wait
~~ LOG_FILE: /logs/artifacts/001__submit_and_wait/task.log
~~ ARGS:
~~     cluster_name: athena-fire
~~     project: llm_d_legacy
~~     args:
~~     - psap_h200
~~     - intelligentrouting-flavors
~~     variables_overrides: {}
~~     job_name: ''
~~     namespace: psap-automation-wip
~~     owner: kpouget
~~     display_name: llm_d_legacy psap_h200 intelligentrouting-flavors
~~     pipeline_name: forge-test-only
~~     env:
~~       JOB_TYPE: presubmit
~~       JOB_NAME: pull-ci-openshift-psap-forge-main-fournos
~~       JOB_SPEC: '{"type":"presubmit","job":"pull-ci-openshift-psap-forge-main-fournos","buildid":"2047673414344249344","prowjobid":"b250c4d4-b130-4c29-9c76-e1d95fbdaafc","refs":{"org":"openshift-psap","repo":"forge","repo_link":"https://github.com/openshift-psap/forge","base_ref":"main","base_sha":"2e20a6a265b879a6b4edbe6c81afe14ca104d9d3","base_link":"https://github.com/openshift-psap/forge/commit/2e20a6a265b879a6b4edbe6c81afe14ca104d9d3","pulls":[{"number":42,"author":"kpouget","sha":"276f23d453ea7df4a3cb7f1193c8608e0cce2f06","title":"[llm-d-legacy]
~~         Import the TOPSAIL legacy LLM-D project for more advance testing","head_ref":"llm_d_legacy","link":"https://github.com/openshift-psap/forge/pull/42","commit_link":"https://github.com/openshift-psap/forge/pull/42/commits/276f23d453ea7df4a3cb7f1193c8608e0cce2f06","author_link":"https://github.com/kpouget"}]},"decoration_config":{"timeout":"23h0m0s","grace_period":"15s","utility_images":{"clonerefs":"us-docker.pkg.dev/k8s-infra-prow/images/clonerefs:v20260421-d25a17867","initupload":"us-docker.pkg.dev/k8s-infra-prow/images/initupload:v20260421-d25a17867","entrypoint":"us-docker.pkg.dev/k8s-infra-prow/images/entrypoint:v20260421-d25a17867","sidecar":"us-docker.pkg.dev/k8s-infra-prow/images/sidecar:v20260421-d25a17867"},"resources":{"clonerefs":{"limits":{"memory":"3Gi"},"requests":{"cpu":"100m","memory":"500Mi"}},"initupload":{"limits":{"memory":"200Mi"},"requests":{"cpu":"100m","memory":"50Mi"}},"place_entrypoint":{"limits":{"memory":"100Mi"},"requests":{"cpu":"100m","memory":"25Mi"}},"sidecar":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"250Mi"}}},"gcs_configuration":{"bucket":"test-platform-results","path_strategy":"single","default_org":"openshift","default_repo":"origin","mediaTypes":{"log":"text/plain"},"compress_file_types":["txt","log","json","tar","html","yaml"]},"gcs_credentials_secret":"gce-sa-credentials-gcs-publisher","skip_cloning":true,"censor_secrets":true,"censoring_options":{"minimum_secret_length":6}}}'
~~       OPENSHIFT_CI: 'true'
~~       JOB_NAME_SAFE: fournos
~~       BUILD_ID: '2047673414344249344'
~~       PULL_PULL_SHA: 276f23d453ea7df4a3cb7f1193c8608e0cce2f06
~~       PULL_NUMBER: '42'
~~       PULL_BASE_REF: main
~~       REPO_NAME: forge
~~       REPO_OWNER: openshift-psap
~~       PULL_BASE_SHA: 2e20a6a265b879a6b4edbe6c81afe14ca104d9d3
~~       PULL_TITLE: '[llm-d-legacy] Import the TOPSAIL legacy LLM-D project for more advance
~~         testing'
~~       PULL_REFS: main:2e20a6a265b879a6b4edbe6c81afe14ca104d9d3,42:276f23d453ea7df4a3cb7f1193c8608e0cce2f06
~~       PULL_HEAD_REF: llm_d_legacy
~~     status_dest: /logs/artifacts
~~     ci_label: pr42_b2047673414344249344
~~     artifact_dir: /logs/artifacts/001__submit_and_wait
~~ CONTEXT:
~~     final_job_name: forge-llm-d-legacy-20260424-134714
~~     manifest_file: /logs/artifacts/001__submit_and_wait/src/forge-llm-d-legacy-20260424-134714-manifest.yaml
~~
~~ EXCEPTION: RuntimeError
~~     Job forge-llm-d-legacy-20260424-134714 failed: Tasks Completed: 1 (Failed: 1, Cancelled 0), Skipped: 0
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]
