Fix race condition causing sshd start failure during provisioning #460
s4heid wants to merge 1 commit into cloudfoundry:ubuntu-jammy from
Conversation
Run first-boot tasks via systemd so sshd never races with host-key regeneration.

The old `rc.local` script ran after network.target, but in parallel with other regular system services, like ssh.service. Therefore, ssh.service often started (and restarted) while `/root/firstboot.sh` was deleting keys. cloud-init’s set-passwords module made this worse by restarting ssh mid-run.

* Replace `rc.local` with a oneshot firstboot.service (delete keys, create new keys, reconfigure sysstat) that runs Before=ssh.service and leaves the `/root/firstboot_done` file as a marker.
* Add a cloud-config.service drop-in so cloud-init's config stage waits for firstboot.service.
* Update walinuxagent.service to wait for firstboot.service, ensuring ssh keys have been regenerated.

This guarantees sshd, cloud-init, and WALinuxAgent all start only after the first-boot tasks succeed.
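The drop-in itself is not quoted in this conversation. As a rough sketch of what such an ordering drop-in could look like, assuming it is installed under `/etc/systemd/system/cloud-config.service.d/` (the directives below are an illustration, not the contents of the PR's `firstboot-blocker.conf`):

```ini
# /etc/systemd/system/cloud-config.service.d/firstboot-blocker.conf
# Hypothetical contents: the PR names this file but its body is not shown here.
[Unit]
# Start cloud-init's config stage only after firstboot.service has finished.
After=firstboot.service
# Assumption: also pull firstboot.service into the transaction if nothing else did.
Wants=firstboot.service
```

`After=` only orders the two units; `Wants=` (or the stricter `Requires=`) is what queues firstboot.service in the first place. On later boots the `ConditionPathExists=!/root/firstboot_done` check skips firstboot.service, and a skipped condition does not count as a failure, so cloud-config.service still starts.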
Warning: It's important to be aware that this change could affect how the ssh service behaves. If the firstboot script was intended only for host key regeneration, using the […]
We should not introduce this within jammy. We currently have similar issues on noble as well, since we have set bosh-agent to use systemd.
Pull request overview
This PR fixes a race condition where the SSH daemon could start before host keys are regenerated during first boot, causing provisioning failures. The fix replaces the rc.local-based firstboot mechanism with a proper systemd service that establishes explicit ordering dependencies.
Key Changes
- Introduces firstboot.service (oneshot systemd unit) that runs before ssh.service to regenerate host keys and configure sysstat
- Removes the legacy rc.local script and firstboot.sh in favor of systemd-native orchestration
- Updates walinuxagent.service to depend on firstboot.service completion instead of polling for the marker file
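The updated walinuxagent.service is not reproduced in this conversation. A minimal sketch of how a polling loop can be replaced by a unit dependency, assuming only the `[Unit]` section needs to change (every line below is illustrative, not taken from the PR):

```ini
# Hypothetical excerpt of walinuxagent.service; the real unit is not shown in this PR text.
[Unit]
# Order the agent after the first-boot tasks instead of polling for /root/firstboot_done.
After=firstboot.service
# Assumption: a weak dependency, so the agent still starts on boots where
# firstboot.service is skipped by its ConditionPathExists= check.
Wants=firstboot.service
```

With this ordering systemd holds the agent's start job until firstboot.service has either finished or been skipped, so no shell loop watching the marker file is needed.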
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `stemcell_builder/stages/base_ubuntu_firstboot/assets/etc/systemd/system/firstboot.service` | New systemd oneshot service that deletes old SSH keys, generates new ones, and reconfigures sysstat before SSH starts |
| `stemcell_builder/stages/base_ubuntu_firstboot/assets/etc/rc.local` | Removed legacy rc.local script that previously executed firstboot tasks |
| `stemcell_builder/stages/base_ubuntu_firstboot/assets/root/firstboot.sh` | Removed shell script containing firstboot logic, now handled by systemd service |
| `stemcell_builder/stages/base_ubuntu_firstboot/apply.sh` | Updated to install and enable the new firstboot.service instead of copying rc.local and firstboot.sh scripts |
| `stemcell_builder/stages/system_azure_init/assets/etc/systemd/system/cloud-config.service.d/firstboot-blocker.conf` | New drop-in configuration ensuring cloud-init waits for firstboot.service completion |
| `stemcell_builder/stages/system_azure_init/assets/etc/waagent/walinuxagent.service` | Replaced polling loop with proper systemd dependency on firstboot.service |
| `stemcell_builder/stages/system_azure_init/apply.sh` | Added installation of cloud-config.service drop-in to ensure proper ordering |
    [Unit]
    Description=Run first boot tasks
    ConditionPathExists=!/root/firstboot_done
    Before=ssh.service

Consider adding "Before=sshd-keygen.service" to ensure firstboot.service completes key regeneration before the system's default SSH key generation service (sshd-keygen.service) attempts to run. While both services likely check for existing keys, explicitly ordering them would prevent any potential race condition between the two key generation mechanisms.
Suggested change:

    - Before=ssh.service
    + Before=sshd-keygen.service ssh.service

    ExecStart=/usr/bin/ssh-keygen -A -v
    ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
    ExecStartPost=/usr/bin/touch /root/firstboot_done
    RemainAfterExit=yes

If the ssh-keygen command in ExecStart fails, the ExecStartPost commands (including the creation of the marker file) will still execute. This could leave the system in an inconsistent state where firstboot_done exists but SSH keys were never generated. Consider moving the marker file creation to a separate ExecStartPost command that only runs on success, or use a more robust approach to ensure the service fails if key generation fails.
Suggested change:

    - ExecStart=/usr/bin/ssh-keygen -A -v
    - ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
    - ExecStartPost=/usr/bin/touch /root/firstboot_done
    - RemainAfterExit=yes
    + ExecStart=/bin/sh -c '/usr/bin/ssh-keygen -A -v && /usr/bin/touch /root/firstboot_done'
    + ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
    + RemainAfterExit=yes

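For context when weighing this suggestion: the systemd.service documentation states that `ExecStartPost=` commands run only after the `ExecStart=` commands have been invoked successfully, which for `Type=oneshot` means after the last `ExecStart=` process has exited successfully. Under that reading, a unit along the following lines would not write the marker when key generation fails. This is a sketch only; the lines marked as assumptions do not appear in the excerpts quoted in this review.

```ini
# Sketch of a complete firstboot.service, assembled from the excerpts above.
[Unit]
Description=Run first boot tasks
ConditionPathExists=!/root/firstboot_done
Before=ssh.service

[Service]
# Assumption: the PR describes the unit as oneshot, but the excerpt omits this line.
Type=oneshot
# Assumption: delete the host keys baked into the image. systemd does not expand
# globs itself, so the command is wrapped in a shell.
ExecStart=/bin/sh -c 'rm -f /etc/ssh/ssh_host_*_key /etc/ssh/ssh_host_*_key.pub'
# For Type=oneshot, multiple ExecStart= lines run in order; a failure stops the
# sequence and marks the unit as failed.
ExecStart=/usr/bin/ssh-keygen -A -v
ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
ExecStartPost=/usr/bin/touch /root/firstboot_done
RemainAfterExit=yes

[Install]
# Assumption: the target used when apply.sh enables the unit is not shown.
WantedBy=multi-user.target
```

Note that a failing `dpkg-reconfigure` would also prevent the marker from being written in this layout, since `ExecStartPost=` lines run in sequence and stop at the first failure, so the whole unit would be retried on the next boot.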
Resolves #458