ci: stabilize iluvatar runner and test images #625

Merged
voltjia merged 3 commits into InfiniTensor:master from zhangyue207:ci/iluvatar-runner-budget
May 29, 2026

@zhangyue207 (Collaborator) commented May 29, 2026

Summary

  • Reserve one Iluvatar device for the iluvatar_gpu CI job in .github/ci_config.yml.
  • Reduce Iluvatar test parallelism from pytest -n 8 to pytest -n 4 and extend the job timeout from 3600 seconds to 7200 seconds.
  • Pin InfiniOps reusable CI workflows to InfiniTensor/ci revision b45d360c8cc529747ee31c5451d7eac96ac9f309, which reuses unchanged local test images via content-based tags, supports static GPU IDs when probing is unavailable, and falls back from unwritable host lock directories.
  • Set Iluvatar to gpu_ids: "0" so the CI runner takes a deterministic file-lock lease without relying on ixsmi host probing.
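
The settings listed above can be summarized as a small consistency check. This is only a sketch: the dictionary layout, the `pytest_workers` key, and the `check_job` helper are hypothetical and do not reflect the actual schema of .github/ci_config.yml. The values are the ones reported in this PR (gpu_ids "0", ngpus 1, a 120-minute timeout, a 7200-second queue window, and pytest -n 4).

```python
# Hypothetical in-memory view of the Iluvatar job settings described in this PR.
# Key names follow the ones quoted in the PR text where possible; the real
# schema of .github/ci_config.yml may differ.
ILUVATAR_JOB = {
    "gpu_ids": "0",
    "ngpus": 1,
    "timeout_minutes": 120,
    "queue_timeout": 7200,  # seconds
    "pytest_workers": 4,    # hypothetical key standing in for `pytest -n 4`
}


def check_job(job):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    ids = [i for i in job["gpu_ids"].split(",") if i.strip()]
    if job["ngpus"] < 1:
        problems.append("ngpus must be at least 1 so the job takes a device lease")
    if ids and len(ids) != job["ngpus"]:
        problems.append("gpu_ids count does not match ngpus")
    if job["queue_timeout"] < job["timeout_minutes"] * 60:
        problems.append("queue window is shorter than the test timeout")
    return problems


print(check_job(ILUVATAR_JOB))  # []
```

With the pre-PR values (ngpus 0 or a 600-second queue window) the same check would report a problem, which is the class of misconfiguration this PR removes.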

Motivation

Recent Iluvatar checks on the upstream PR stack failed with full-regression timeouts and exit 137 kills. The Iluvatar job was configured with ngpus: 0, so the CI agent did not take a device lease for the job, while the test stage still ran the full Iluvatar suite with high parallelism.
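
For readers unfamiliar with the exit code: shells report 128 + N when a process is terminated by signal N, so 137 corresponds to signal 9 (SIGKILL). That is consistent with the memory-pressure explanation, although this PR does not include kernel logs confirming an out-of-memory kill. The helper name below is illustrative only.

```python
import signal


def describe_exit(status):
    """Map a shell-style exit status to a signal name when it encodes one."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            # The offset does not correspond to a signal on this platform.
            return None
    return None


print(describe_exit(137))  # SIGKILL on POSIX systems
```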

This PR makes the job reserve one Iluvatar device, lowers local memory pressure, and pulls in the reusable CI workflow update from #626 so unchanged test images can be reused instead of rebuilt.
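
The idea behind content-based tags can be sketched as follows. This is not the InfiniTensor/ci implementation, which is not shown in this PR: the choice of SHA-256, the 12-character tag length, and the function names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def content_tag(paths, length=12):
    """Derive a short tag from the names and bytes of the files that define an image."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()[:length]


def needs_rebuild(tag, existing_tags):
    """An image is rebuilt only when no image with this tag exists yet."""
    return tag not in existing_tags


# Demonstration with a throwaway file standing in for an image definition.
with tempfile.TemporaryDirectory() as tmp:
    dockerfile = Path(tmp) / "Dockerfile"
    dockerfile.write_text("FROM base:1\n")
    first = content_tag([dockerfile])
    unchanged = content_tag([dockerfile])
    dockerfile.write_text("FROM base:2\n")
    changed = content_tag([dockerfile])

print(first == unchanged, first == changed)  # True False
```

Because the tag depends only on file contents, an unchanged image definition maps to a tag that already exists locally and the build step can be skipped.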

Successive hardware runs of this PR exposed four further problems. The first run showed that ixsmi probing on the Iluvatar runner can return no devices before the container starts. The second showed that the /tmp/infinitensor-ci-resource-locks directory on the NVIDIA runner is owned by root. The third showed the Iluvatar shadow job correctly waiting on the static device lock but timing out after the old 600-second queue window. A later NVIDIA run showed that the same 600-second window is also too short when all detected GPUs are already busy.

The follow-up CI pin addresses each of these. Explicit gpu_ids now use host file locks directly, the lock directory falls back to a user-writable location when the default is unavailable, and Iluvatar sets CUDA_VISIBLE_DEVICES=0 without depending on auto-probing. The Iluvatar queue window now matches the 7200-second test timeout so the shadow job can wait for the main job, and the other 60-minute platforms use a 3600-second queue window to avoid false failures under normal runner contention.
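
The lock-directory fallback and the queue window can be illustrated with a minimal POSIX sketch. It assumes an flock-based lease and a `gpu-<id>.lock` file name, neither of which is confirmed by this PR, and the function names are hypothetical; the actual logic lives in InfiniTensor/ci.

```python
import fcntl
import os
import time


def pick_lock_dir(preferred, fallback):
    """Return `preferred` if it can be created and written to, else `fallback`."""
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK | os.X_OK):
            return preferred
    except OSError:
        pass  # for example, a parent that is not writable or not a directory
    os.makedirs(fallback, exist_ok=True)
    return fallback


def acquire_device_lock(lock_dir, gpu_id, queue_timeout, poll_interval=1.0):
    """Poll for an exclusive flock on a per-device file for up to `queue_timeout` seconds.

    Returns the open file descriptor; closing it releases the lease.
    """
    path = os.path.join(lock_dir, f"gpu-{gpu_id}.lock")  # hypothetical file name
    deadline = time.monotonic() + queue_timeout
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            # Another holder owns the device; keep waiting until the window closes.
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"device {gpu_id} still busy after {queue_timeout}s")
            time.sleep(poll_interval)
```

In this sketch a second job that requests the same device waits until the first closes its descriptor. If `queue_timeout` is shorter than the time the first job may legitimately run, the waiter fails even though nothing is wrong, which is the false failure the longer queue windows are meant to avoid.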

Closes N/A

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | N/A | Passed in hardware CI. | ci / unit / nvidia and ci-v2-shadow / ci-v2-shadow / nvidia succeeded; queue timeout is now 3600 seconds. |
| Iluvatar | N/A | Passed in hardware CI. | ci / unit / iluvatar and ci-v2-shadow / ci-v2-shadow / iluvatar succeeded; local matrix confirms gpu_ids=0, ngpus=1, timeout_minutes=120, queue_timeout=7200, and pytest -n 4. |
| MetaX | N/A | Passed in hardware CI. | ci / unit / metax and ci-v2-shadow / ci-v2-shadow / metax succeeded. |
| Cambricon | N/A | Passed in hardware CI. | ci / unit / cambricon and ci-v2-shadow / ci-v2-shadow / cambricon succeeded. |
| Moore | N/A | Passed in hardware CI. | ci / unit / moore and ci-v2-shadow / ci-v2-shadow / moore succeeded. |
| Ascend | N/A | Passed in hardware CI. | ci / unit / ascend and ci-v2-shadow / ci-v2-shadow / ascend succeeded. |

Full `pytest` output (optional)
git diff --check upstream/master..HEAD
# passed

python .ci/config_to_matrix.py --config .github/ci_config.yml --dump-by-type
# Validated nvidia, iluvatar, metax, moore, cambricon, and ascend matrix entries.
# The latest PR run at 8a56c691760d10e1df8e86aafc2a7de6997ba0a0 has all
# generated matrix, format, lint, hardware CI, and CI v2 shadow checks passing.

cd .ci
python -m py_compile build.py run.py config_to_matrix.py
python -m pytest tests/test_resource.py tests/test_ci_agent.py tests/test_run.py tests/test_config_to_matrix.py tests/test_workflow.py tests/test_shadow_workflow.py tests/test_workflows.py -q
# 126 passed in 1.57s

Benchmark / Performance Impact

N/A

Notes for Reviewers

  • This is a CI-only change. It trades Iluvatar wall-clock time for lower memory pressure and explicit device leasing.
  • This PR also includes the content-tag image reuse workflow pin from #626 (fix(ci): reuse unchanged test images), plus a follow-up static-device lease fix for Iluvatar runners where ixsmi probing is unavailable.
  • The latest run of #625 passed all generated matrix, format, lint, hardware CI, and CI v2 shadow checks at commit 8a56c691760d10e1df8e86aafc2a7de6997ba0a0.

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Each commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • N/A: Public API changes. This PR changes only CI configuration.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • N/A: Identifiers in comments and error messages. No code comments or error messages were changed.
  • N/A: All comments and error messages are in English. No code comments or error messages were changed.
  • N/A: Comments and error messages are complete sentences. No code comments or error messages were changed.

C++ Specific (if C++ files changed)

  • N/A: No C++ files changed.

Python Specific (if Python files changed)

  • N/A: No Python files changed in this repository.

Testing

  • N/A: Full Iluvatar hardware pytest must run in CI; this PR changes the CI runner configuration itself.
  • For any platform that could not be tested, an explicit reason is given in the table.
  • N/A: New functionality has matching tests under tests/. No operator or runtime functionality was added.
  • N/A: Pytest parameterization. No tests were added in this repository.
  • N/A: pytest.mark.auto_act_and_assert. No tests were added.
  • N/A: Default dtype / device parameterization. No tests were added.
  • N/A: Flaky test documentation. No tests were added.
  • N/A: Bug-fix regression test. This is a CI resource configuration change.

Build, CI, and Tooling

  • N/A: Fresh platform build. This PR does not change source or build files used by local compilation.
  • N/A: compile_commands.json. This PR does not change CMake configuration.
  • N/A: New backend or device auto-detection. No backend was added.
  • N/A: CUDA-like mutual exclusion. This PR does not change backend selection.
  • CI matrix generation was validated for Iluvatar with .ci/config_to_matrix.py.
  • Focused reusable CI workflow tests passed in .ci.
  • N/A: Runtime dependencies. No runtime dependency was added.

Documentation

  • N/A: User-facing documentation. This PR only changes CI runner configuration.
  • N/A: New operators, dispatch helpers, or public utilities. None were added.
  • N/A: Breaking change. This PR has no user-visible API impact.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • N/A: Third-party code. No third-party code was added.
  • N/A: Unsafe pointer arithmetic, uninitialized reads, or missing bounds checks. No source code was changed.

@zhangyue207 zhangyue207 changed the title from "ci: stabilize iluvatar runner budget" to "ci: stabilize iluvatar runner and test images" May 29, 2026
@zhangyue207 zhangyue207 force-pushed the ci/iluvatar-runner-budget branch 2 times, most recently from 526f158 to 2cc7bf6 Compare May 29, 2026 03:37
@zhangyue207 zhangyue207 force-pushed the ci/iluvatar-runner-budget branch from 2cc7bf6 to 8a56c69 Compare May 29, 2026 03:50
@zhangyue207 zhangyue207 marked this pull request as ready for review May 29, 2026 05:44
@zhangyue207 zhangyue207 requested a review from a team May 29, 2026 05:44
@zhangyue207 zhangyue207 requested a review from voltjia May 29, 2026 05:46
@voltjia voltjia merged commit 4273849 into InfiniTensor:master May 29, 2026
16 checks passed