fix: deflake //rs/tests/consensus/upgrade:upgrade_downgrade_nns_subnet_test #9972

Closed
basvandijk wants to merge 2 commits into master from
ai/deflake-upgrade_downgrade_nns_subnet_test-2026-04-22


@basvandijk basvandijk commented Apr 22, 2026

I noticed some flaky runs of upgrade_downgrade_nns_subnet_test caused by a failure to mount /var. See:

$ bazel run //ci/githubstats:query -- last --flaky //rs/tests/consensus/upgrade:upgrade_downgrade_nns_subnet_test --since=81554a3 --download-ic-logs --download-console-logs
...
$ rg --no-ignore '(Failed unmounting var.mount|/var/lib/ic.*Permission denied)' logs
logs/upgrade_downgrade_nns_subnet_test/2026-04-22T11:15:10/2026-04-20T15:33:47_dddd0e47-4205-447b-bc1b-a38dcdf49c99/1/ic_logs/3gw4p-2ob5a-emk36-lxl4i-iy3sc-ifozv-emov3-mvvia-7ynb4-54tgj-aae.log
11538:2026-04-20 15:21:37.481995 Failed unmounting var.mount - /var.
11789:2026-04-20 15:23:08.884652 Error creating DB directory /var/lib/ic/data/ic_adapter/dogecoin_testnet_cache/headers: Permission denied (os error 13)
11790:2026-04-20 15:23:08.885665 Error creating DB directory /var/lib/ic/data/ic_adapter/dogecoin_mainnet_cache/headers: Permission denied (os error 13)
11908:2026-04-20 15:23:08.922718 Failed to create dir /var/lib/ic/data/images: Permission denied (os error 13)

logs/upgrade_downgrade_nns_subnet_test/2026-04-22T11:15:10/2026-04-20T14:27:00_59474c91-6c69-4916-8241-67b5222a5f58/1/ic_logs/terzt-dgicf-zqgoe-kj37i-4ao7r-oozjw-co3du-7nki7-lucxc-5d3s7-zae.log
11377:2026-04-20 14:18:45.403327 Failed unmounting var.mount - /var.
11653:2026-04-20 14:20:17.056633 Error creating DB directory /var/lib/ic/data/ic_adapter/dogecoin_testnet_cache/headers: Permission denied (os error 13)
11660:2026-04-20 14:20:17.140964 Error creating DB directory /var/lib/ic/data/ic_adapter/dogecoin_mainnet_cache/headers: Permission denied (os error 13)
11726:2026-04-20 14:20:17.419073 Failed to create dir /var/lib/ic/data/images: Permission denied (os error 13)
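
The timestamps in these excerpts give a rough measure of how much time passed between the failed unmount and the first Permission denied error in each log. A minimal sketch, assuming both timestamps fall on the same day; gap_seconds is a hypothetical helper name, not something from the repository:

```shell
# Hypothetical helper: seconds elapsed between two HH:MM:SS.ffffff
# timestamps, assuming both fall on the same day.
gap_seconds() {
    LC_ALL=C awk -v t0="$1" -v t1="$2" 'BEGIN {
        split(t0, s, ":")
        split(t1, e, ":")
        printf "%.3f\n", (e[1] * 3600 + e[2] * 60 + e[3]) - (s[1] * 3600 + s[2] * 60 + s[3])
    }'
}

gap_seconds 15:21:37.481995 15:23:08.884652   # first log
gap_seconds 14:18:45.403327 14:20:17.056633   # second log
```

Both gaps come out at roughly 91 s (91.403 and 91.653), which is in the same range as the ~90 s dependency timeout mentioned under Root cause.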

Root cause

Disclaimer: the following root cause analysis might not cover the whole issue but could address one part of it:

On GuestOS boot, the fstab-generator creates var.mount from the fstab
entry /dev/mapper/var_crypt /var ext4 defaults, and the
mount-generator at
ic-os/components/upgrade/systemd-generators/mount-generator
emits systemd-cryptsetup@var_crypt.service with:

BindsTo=${SYSTEMD_DEVICE}   # /dev/disk/by-partuuid/<var partition>
Conflicts=umount.target
Before=umount.target

During the very first boot transaction systemd sometimes has to prune
jobs for devices that will not appear in this boot (e.g.
dev-sev-guest.device when TEE is disabled,
dev-mapper-store-shared-*.device before
lvm-activate-store.service runs). The logs show this as
"Unnecessary job was removed for dev-..." lines around the var.mount
window.

Because systemd-cryptsetup@var_crypt.service has
BindsTo=${SYSTEMD_DEVICE}, a transient udev removal/reappearance of
the underlying encrypted partition during this churn enqueues a
stop of the cryptsetup service; through
Conflicts=umount.target this cascades into a stop of var.mount
right after the kernel has mounted the filesystem, producing the
spurious:

Mounting var.mount - /var...
EXT4-fs (dm-1): mounted filesystem ... r/w
Failed unmounting var.mount - /var.

Note that a matching "Mounted var.mount" line is absent on these boots:
the unit never transitioned through the mounted state. The kernel mount
stays live (boot continues and ic-replica eventually starts), but the
unit is left in a bad state. In the same transaction, a ~90 s dependency
timeout on dev-mapper-store-shared--swap.device/swap.target follows,
which plausibly pushes the test past its deadline.
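
The signature described here (a Mounting line followed by Failed unmounting, with no Mounted line in between) can be checked mechanically. A minimal sketch, assuming the log contains the message texts as quoted in this description and that one log file is passed at a time; has_spurious_var_unmount is a hypothetical name:

```shell
# Hypothetical helper: exits 0 and prints the offending line when a log shows
# "Mounting var.mount" followed by "Failed unmounting var.mount" with no
# "Mounted var.mount" in between; exits 1 otherwise.
has_spurious_var_unmount() {
    awk '
        /Mounting var\.mount/ { pending = 1; next }
        /Mounted var\.mount/  { pending = 0; next }
        /Failed unmounting var\.mount/ {
            if (pending) { print; found = 1 }
            pending = 0
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Tracking the state per Mounting line, rather than grepping the whole file for "Mounted var.mount", keeps the check usable on logs that span more than one boot.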

The systemd-fsck@dev-mapper-var_crypt.service override already
drops BindsTo on the same device to avoid a very similar udev-race
lock-up (commit 9795661, PR #8368). This change applies the same
treatment to systemd-cryptsetup@var_crypt.service.

Fix

Replace BindsTo=${SYSTEMD_DEVICE} with Requires=${SYSTEMD_DEVICE}
on the generated cryptsetup unit (keeping After=). BindsTo= implies
Requires= plus stop-propagation; dropping only the stop-propagation
preserves the "pull the device into the transaction and wait for it"
semantics while fixing the spurious var.mount stop.
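
The resulting dependency stanza might look roughly like the sketch below. This is an illustration only, not the actual mount-generator code: emit_cryptsetup_deps is a hypothetical function, and the device unit name passed to it is a placeholder for whatever ${SYSTEMD_DEVICE} expands to.

```shell
# Hypothetical sketch of the dependency lines for the generated
# systemd-cryptsetup@var_crypt.service; not the real generator code.
emit_cryptsetup_deps() {
    local device_unit="$1"   # placeholder for ${SYSTEMD_DEVICE}
    cat <<EOF
[Unit]
# Requires= pulls the device into the transaction; After= waits for it.
# Unlike BindsTo=, this does not stop the unit when the device unit
# unexpectedly becomes inactive.
Requires=${device_unit}
After=${device_unit}
Conflicts=umount.target
Before=umount.target
EOF
}

emit_cryptsetup_deps 'dev-example.device'
```

The Conflicts= and Before= lines are carried over unchanged from the unit quoted under Root cause; only the BindsTo= line differs.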

Validation

bazel test --test_output=errors --runs_per_test=3 --jobs=3 \
    //rs/tests/consensus/upgrade:upgrade_downgrade_nns_subnet_test

All 3 runs PASSED, stats over 3 runs: max = 885.8 s, min = 747.3 s,
avg = 815.4 s.

Created following the steps in
.claude/skills/fix-flaky-tests/SKILL.md.

Copilot AI left a comment

Pull request overview

This PR aims to deflake //rs/tests/consensus/upgrade:upgrade_downgrade_nns_subnet_test by preventing a udev “device flapping” scenario from stopping systemd-cryptsetup@var_crypt.service and cascading into an unhealthy var.mount unit state during early boot.

Changes:

  • Remove BindsTo=${SYSTEMD_DEVICE} from the generated systemd-cryptsetup@var_crypt.service unit.
  • Add in-script rationale documenting the udev race and why After=${SYSTEMD_DEVICE} is kept.

Comment thread ic-os/components/upgrade/systemd-generators/mount-generator Outdated
…e var.mount

The GuestOS mount-generator currently emits systemd-cryptsetup@var_crypt.service
with BindsTo=${SYSTEMD_DEVICE} on the underlying encrypted partition. A transient
udev removal/reappearance of that device during boot enqueues a stop of the
cryptsetup service which, via Conflicts=umount.target, cascades into stopping
var.mount right after the kernel has mounted the filesystem. This produces a
spurious 'Failed unmounting var.mount - /var.' during boot on
upgrade_downgrade_nns_subnet_test{,_head_nns} runs.

The systemd-fsck@dev-mapper-var_crypt.service override already drops BindsTo for
the same reason (commit 9795661, PR #8368). This change applies the same
treatment to systemd-cryptsetup@var_crypt.service. Ordering via After= is
sufficient; setup-var-encryption.sh will fail loudly if the device is absent.

Validated with: bazel test --runs_per_test=3 --jobs=3 \
  //rs/tests/consensus/upgrade:upgrade_downgrade_nns_subnet_test
All 3 runs passed (747-886 s, avg 815 s).

Created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.
@basvandijk basvandijk force-pushed the ai/deflake-upgrade_downgrade_nns_subnet_test-2026-04-22 branch from a9f2d58 to ee7d05d Compare April 22, 2026 11:03
@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Apr 22, 2026
@basvandijk basvandijk marked this pull request as ready for review April 22, 2026 12:11
@basvandijk basvandijk requested a review from a team as a code owner April 22, 2026 12:11
@github-actions github-actions Bot added the @node label Apr 22, 2026
@basvandijk

Closed in favour of #9984.

@basvandijk basvandijk closed this Apr 24, 2026
@basvandijk basvandijk deleted the ai/deflake-upgrade_downgrade_nns_subnet_test-2026-04-22 branch April 24, 2026 08:32