fix: deflake //rs/registry/canister:registry_canister_integration_test_tests/create_subnet#10003
basvandijk wants to merge 1 commit into master
…t_tests/create_subnet
Set timeout = "long" on the registry_canister_integration_test target.
The create_subnet test binary runs 6 integration tests, each spinning up
its own replica runtime; on loaded CI runners the total wallclock
routinely exceeds the default 'moderate' (300s) timeout, causing flakes.
Across the 7 flaky runs in the last week, the fingerprint is consistent:
5 tests complete and a 6th (varying: usually test_accepted_proposal_with_
schnorr_gets_keys_from_other_subnet, sometimes the trivial
test_the_anonymous_user_cannot_create_a_subnet) is killed mid-execution
with no panic or deadlock. The chain-key waiters (wait_for_{ecdsa,
schnorr,vetkd}_setup) are bounded (100 x 500ms), so there is no real
hang to hide by extending the timeout.
Pull request overview
Deflakes the //rs/registry/canister:registry_canister_integration_test_tests/create_subnet Bazel test by increasing its allowed execution time to avoid CI wall-clock timeouts under load.
Changes:
- Set timeout = "long" on registry_canister_integration_test so generated integration-test targets (including ..._tests/create_subnet) get a longer Bazel timeout.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
1. Update unreleased_changelog.md (if there are behavior changes, even if they are non-breaking).
2. Are there BREAKING changes?
3. Is a data migration needed?
4. Security review?
How to Satisfy This Automatic Review
- Go to the bottom of the pull request page.
- Look for where it says this bot is requesting changes.
- Click the three dots to the right.
- Select "Dismiss review".
- In the text entry box, respond to each of the numbered items in the previous section by declaring one of the following:
  - Done.
  - $REASON_WHY_NO_NEED. E.g. for unreleased_changelog.md, "No canister behavior changes.", or for item 2, "Existing APIs behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate and how to
maintain their continuity of operations.
If your changes are behind a feature flag, do NOT add entries to
unreleased_changelog.md in this PR. Instead, add them later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
Set timeout = "long" on //rs/registry/canister:registry_canister_integration_test_tests/create_subnet.
Root cause
The target did not specify a timeout, so it used Bazel's default 'moderate' (300s).
The create_subnet test binary runs 6 integration tests. Each spins up its own replica runtime via state_machine_test_on_nns_subnet / canister_test_with_config_async, and the three chain-key variants additionally poll for DKG completion via wait_for_chain_key_setup.
Across the 7 flaky runs in the last week (2026-04-16 .. 2026-04-22, all with "Test timed out"), the fingerprint was always the same: the log shows "running 6 tests", 5 tests complete, and the 6th is killed mid-execution. The killed test is usually test_accepted_proposal_with_schnorr_gets_keys_from_other_subnet (occasionally …_ecdsa_… / …_vetkd_…), and in one run even the trivial test_the_anonymous_user_cannot_create_a_subnet was the 6th one running at kill time.
No panics, no stack traces, no deadlock signatures. The polling helpers (wait_for_{ecdsa,schnorr,vetkd}_setup) are each bounded to 100 × 500ms ≈ 50s, so there is no infinite wait being hidden.
This is a classic total-wallclock overrun on loaded CI runners, not a test bug; the appropriate remedy is bumping the Bazel timeout bucket.
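The bounded-polling argument can be illustrated with a minimal Rust sketch. This is not the repository's code: poll_until, worst_case_wait and PollResult are hypothetical names, and the only values taken from this PR are the 100 attempts and the 500ms interval. The point is that a loop with a fixed attempt count has a computable upper bound on its sleep time, so it cannot wait forever.

```rust
use std::time::Duration;

/// Outcome of a bounded poll (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum PollResult {
    Ready { attempts: u32 },
    GaveUp { attempts: u32 },
}

/// Calls `is_ready` at most `max_attempts` times, sleeping `interval`
/// after each unsuccessful attempt. Always returns after a finite number
/// of attempts, so the poll itself cannot hang indefinitely.
fn poll_until(
    max_attempts: u32,
    interval: Duration,
    mut is_ready: impl FnMut() -> bool,
) -> PollResult {
    for attempt in 1..=max_attempts {
        if is_ready() {
            return PollResult::Ready { attempts: attempt };
        }
        std::thread::sleep(interval);
    }
    PollResult::GaveUp { attempts: max_attempts }
}

/// Upper bound on the time spent sleeping inside `poll_until`
/// (excludes the time taken by the readiness check itself).
fn worst_case_wait(max_attempts: u32, interval: Duration) -> Duration {
    interval * max_attempts
}

fn main() {
    // Values from the PR description: 100 attempts, 500ms apart.
    let bound = worst_case_wait(100, Duration::from_millis(500));
    println!("worst-case sleep: {}s", bound.as_secs());

    // Demo with a 1ms interval so the example finishes quickly:
    // the condition becomes true on the third check.
    let mut calls = 0;
    let result = poll_until(100, Duration::from_millis(1), || {
        calls += 1;
        calls >= 3
    });
    println!("{:?}", result);
}
```

Note that the bound covers sleep time only; each readiness check adds its own latency, which is one way a loaded runner can stretch the total wallclock without any single wait being unbounded.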
Verification
All 3 runs passed (max 32.3s, min 31.9s locally). On CI the same binary has been observed exceeding 300s, so 'long' (900s) gives adequate headroom while still catching any true hang.
PR created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.
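For readers unfamiliar with Bazel's timeout buckets, the choice of bucket can be sketched as a small Rust helper. The bucket limits are Bazel's default values (short 60s, moderate 300s, long 900s, eternal 3600s); the function name smallest_bucket and the headroom factor are assumptions made for illustration and are not part of this PR or of Bazel.

```rust
/// Bazel's default test timeout buckets, in seconds.
static BUCKETS: [(&str, u64); 4] = [
    ("short", 60),
    ("moderate", 300),
    ("long", 900),
    ("eternal", 3600),
];

/// Returns the smallest bucket whose limit is at least
/// `observed_secs * headroom`, or None if even "eternal" is too small.
/// `headroom` is a hypothetical safety factor (e.g. 1.5 for 50% slack).
fn smallest_bucket(observed_secs: f64, headroom: f64) -> Option<&'static str> {
    let needed = observed_secs * headroom;
    BUCKETS
        .iter()
        .find(|(_, limit)| *limit as f64 >= needed)
        .map(|(name, _)| *name)
}

fn main() {
    // Local runs from the Verification section (about 32s) fit a small bucket.
    println!("{:?}", smallest_bucket(32.3, 1.5));
    // A CI run just past the 300s limit needs the next bucket up.
    println!("{:?}", smallest_bucket(301.0, 1.0));
}
```

The gap between the two calls mirrors the situation described above: local timings say little about which bucket a loaded CI runner needs.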