
Operator's guide to boot interface selection (docs/) #2740

Description

@chet

Goal

Write an operator's guide to boot interface selection under docs/, explaining the behind-the-scenes flows that decide which NIC a machine boots from -- and why -- across the explored, predicted, and managed stages. Include a flow diagram.

This is the doc-facing capstone of epic #2660 (boot-interface standardization). It should be accurate to the finished epic, so it is scheduled after #2659 and #2668 land (both change selection logic -- see Dependencies).

Audience & scope

For operators / on-call: enough to reason about "why did this machine pick this NIC to boot from, and how do I steer it?" Not an internal design doc -- favor the observable model, the declaration knobs, and the admin endpoints.

Outline (draft)

  1. Core model

    • MachineBootInterface = (MAC + Redfish interface_id); the "full pair."
    • primary_interface is the boot interface by construction (no separate is_boot flag); derived projections.
    • Store A (explored_endpoints, the pre-ownership explored default) vs Store B (machine_interfaces, authoritative once a machine owns the endpoint).
  2. The selection lifecycle (the spine of the doc)

    • Explore (site-explorer): the explored default -- fetch_host_primary_interface_mac (declared ExpectedHostNic.primary > lowest-PCI DPU host-PF), complete_boot_interfaces, and last-known-good / retained behavior.
    • Predict (predicted_machine_interfaces, the pre-first-lease window): pick_boot_prediction; how predictions are minted and the admin-primary invariant.
    • Own (handoff): predicted -> managed; promotion onto machine_interfaces; one_primary_interface_per_machine partial unique index; the NULL-ownership window.
    • Select (managed): pick_boot_interface precedence -- declared > DPU-takeover > lowest-MAC non-underlay.
  3. site-explorer <-> machine-controller interactions -- who computes what, when; how the explored default feeds preingestion actions and how the controller takes over post-ownership.

  4. Retained boot interfaces -- what "retained" means, when a retained interface is kept vs cleared, the force-delete / re-ingest interaction, and the power-cycle behavior from #2367 ("feat(site-explorer): power cycle [not just Dell] to apply a queued NIC mode change").

  5. Admin endpoints -- resolve_admin_boot_interface_target, machine_setup, set_dpu_first_boot_order / set_host_boot_order, BIOS/boot-order config, and how an operator overrides a pick.

  6. Declared primary precedence -- ExpectedHostNic.primary wins across all three stores. Covers the work in #2657 ("Honor a host's declared primary interface when picking its boot device"), #2658 ("Resolve the boot interface from predictions in the machine-controller"), and #2662 ("Honor a declared primary interface when computing the explored boot default").

  7. DPU mode effects -- DpuMode / NicMode / NoDpu (zero-DPU) and how each changes selection, including booting a declared integrated NIC while DPUs stay managed (#2668, "Boot from a declared integrated NIC while keeping its DPUs managed").

  8. Flow diagram -- explored -> predicted -> managed across the actors (site-explorer, machine-controller, admin API), likely mermaid.
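
To make the two precedence chains in item 2 concrete for reviewers, here is a minimal sketch of how they could be modeled. It is an illustration only, not the project's implementation: the `Nic` type, its field names, and the `explored_default` / `managed_pick` helpers are all hypothetical. The issue does not say what qualifies a NIC for "DPU-takeover", so the sketch takes that candidate as an input rather than guessing the rule. It also assumes MACs and PCI addresses are fixed-width strings that sort lexically.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Nic:
    """Hypothetical view of one host interface; every field name is illustrative."""
    mac: str
    pci_address: str = ""           # fixed-width PCI address, e.g. "0000:3b:00.0"
    declared_primary: bool = False  # stands in for ExpectedHostNic.primary
    is_dpu_host_pf: bool = False    # host-facing PF of a DPU
    is_underlay: bool = False       # underlay NICs are excluded from the fallback


def _declared(nics: Sequence[Nic]) -> Optional[Nic]:
    # A declared primary wins in every stage.
    return next((n for n in nics if n.declared_primary), None)


def explored_default(nics: Sequence[Nic]) -> Optional[Nic]:
    """Explore stage: declared primary > lowest-PCI DPU host-PF."""
    declared = _declared(nics)
    if declared is not None:
        return declared
    host_pfs = [n for n in nics if n.is_dpu_host_pf]
    return min(host_pfs, key=lambda n: n.pci_address.lower(), default=None)


def managed_pick(nics: Sequence[Nic],
                 dpu_takeover: Optional[Nic] = None) -> Optional[Nic]:
    """Managed stage: declared > DPU-takeover > lowest-MAC non-underlay."""
    declared = _declared(nics)
    if declared is not None:
        return declared
    if dpu_takeover is not None:
        # Assumption: the caller has already decided which NIC, if any,
        # is the DPU-takeover candidate.
        return dpu_takeover
    candidates = [n for n in nics if not n.is_underlay]
    return min(candidates, key=lambda n: n.mac.lower(), default=None)
```

Under these assumptions, a host with one onboard NIC and two DPU host-PFs and no declaration gets the lower-PCI host-PF as its explored default, while the managed fallback picks whichever non-underlay interface has the lowest MAC. Marking any one NIC as declared primary overrides both.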

Dependencies (why this is scheduled last)

Done when

A reviewed docs/ guide covers items 1-7 with a flow diagram, accurate to merged epic code, and an operator can answer "which NIC will this machine boot from, and how do I change it?" from the doc alone.

Part of #2660.

Projects

Status
In Progress
