tap_user: ignore ARP probes (sender_ip = 0.0.0.0) so DHCP DAD succeeds#210
Closed
pufit wants to merge 28 commits into
Closed
tap_user: ignore ARP probes (sender_ip = 0.0.0.0) so DHCP DAD succeeds#210pufit wants to merge 28 commits into
pufit wants to merge 28 commits into
Conversation
The existing compile-time USE_ALSA gate hard-wires the host audio backend at build time: Linux builds get ALSA, every other target gets silence. Library consumers that want to route PCM somewhere else — JNI embedders delivering audio to a JVM, WAV-capture test fixtures, alternate cross-platform backends — have no way to hook in. Expose two new public entry points: PUBLIC pci_dev_t* sound_hda_init_ex(pci_bus_t*, write_fn, user_data); PUBLIC pci_dev_t* sound_hda_init_auto_ex(rvvm_machine_t*, write_fn, user_data); When a non-NULL write_fn is supplied, the HDA device routes every PCM chunk through it, skipping the compile-time ALSA path. NULL keeps the existing behaviour (ALSA if compiled, silent otherwise) — so sound_hda_init / sound_hda_init_auto are preserved as wrappers and every existing caller keeps working. The stream worker's USE_ALSA #ifdef is replaced with a NULL-check on subsystem.write, so installations with no backend (macOS today, DOS, etc.) no longer leave a dead code path in a hot loop and the LPIB counter still advances cleanly so the guest's HDA driver doesn't stall.
Java-side consumers that want to receive PCM from the emulated HDA controller (e.g. a Minecraft mod streaming VM audio into the game's client-side sound engine) need a way to register a Java callback without patching librvvm every release. Add Java_lekkit_rvvm_RVVMNative_sound_hda_init_with_sink, which: - caches the JavaVM on first use (GetJavaVM), - resolves the sink's onAudio([B)V method ID, - installs a C trampoline as the backend via sound_hda_init_auto_ex, - packages each PCM chunk into a byte[] and invokes sink.onAudio. The trampoline runs on RVVM's HDA stream worker thread and attaches itself to the JVM as a daemon on the first callback (AttachCurrentThreadAsDaemon). Subsequent callbacks reuse the attachment via GetEnv. Java exceptions thrown from onAudio are cleared locally so a misbehaving sink can't crash the C worker.
JNI direct callback broke on JDK 21 / macOS arm64: AttachCurrentThread(AsDaemon) returns JNI_ERR when called from the RVVM stream_worker pthread, so a Java-side sink never actually received its PCM — even though the C trampoline ran. Switch to a staging model: - sound_hda_init_with_ring attaches the HDA and installs a native write_fn that pushes PCM into a 1 MiB mutex-guarded ring buffer. - sound_hda_poll drains bytes into a Java byte[]; callers invoke it from an already-JVM-attached thread (e.g. a server tick) so no thread-attach is needed. - sound_hda_stats exposes total_pushed / total_popped / dropped / current_occupancy counters for instrumentation and tests. Ring overflow drops oldest bytes (preserves head-of-stream alignment and latency over completeness — same philosophy as ALSA's snd_pcm_recover fallback). The old Java-sink API (sound_hda_init_with_sink) is removed; callers still targeting a JNI-attached callback can be re-added later once the macOS attach issue is understood.
sound_hda_remove was a pure no-op, which meant that when a machine got freed while an HDA output stream was running, the stream worker kept walking the BDL and called pci_send_irq() on a freed pci_func pointer — immediate SIGSEGV. Signal the worker to exit (atomic running=0) and sleep 20 ms before returning from remove(). At 128-frame periods / 192 kHz that's ~30x the worker's per-iteration duration, enough for it to notice the flag and exit cleanly. A proper thread_join would be better but thread_create_task is fire-and-forget. Reproducer before this fix: attach HDA + start an output stream + free the machine. With the new bindings/jni ring-buffer path the crash is easy to hit; it was present before too but harder to trigger because USE_ALSA=0 builds left the worker doing nothing.
Three related fixes turned up while driving the emulator from aplay:
1. **Pace the stream worker to 192 kHz × 2 bytes = 384000 bytes/s.**
Without pacing, non-blocking backends let the worker blast through
the BDL as fast as the CPU allows — LPIB advances faster than the
guest can refill, and aplay trips ALSA's consistency check:
aplay: pcm_plugin.c:547: snd_pcm_plugin_status:
Assertion `status->appl_ptr == *pcm->appl.ptr' failed.
Blocking backends (ALSA's snd_pcm_writei) dodged this accidentally
because writei blocks when the host buffer fills. Now every backend
gets wall-clock pacing for free. Computed per-iteration from
cumulative bytes emitted vs elapsed time, with a 100 ms rebase
floor so pause/resume / long GCs don't trigger a catch-up sprint.
2. **Bail out early if pci_get_dma_ptr returns NULL.** The BDL DMA
pointer can be NULL if the guest wrote run=1 before setting up the
BDL properly (bdl_lo == 0). Dereferencing dma[0] crashes. We now
clear running=0 and exit cleanly; the guest will re-issue the
stream with a valid BDL when userspace opens another handle.
3. **atomic_cas_uint32 guard on worker spawn.** Previously every
run=1 MMIO write spawned a fresh thread unconditionally. With the
paced worker taking ~1 s per BDL iteration, a looping aplay easily
has overlapping workers — two threads walking the same stream
struct, racing on lpib / running. CAS 0→1 makes spawn atomic;
stop writes still 0-store unconditionally.
Combined fix for two related issues: 1. **Codec advertised 8-192 kHz × 8-32 bit × mono/stereo.** The guest driver would pick whatever sample rate matched the application's request, but the stream worker had no per-stream rate awareness — it paced to a fixed bytes/sec. Playing a 48 kHz WAV against a codec that accepted 192 kHz caused 4× slow playback plus aliasing. 2. **Widgets advertised stereo caps.** Linux's HDA driver configures stereo streams by default on stereo-capable widgets, duplicating mono input to L+R in software. The emulator byte rate then doubles while pacing / downstream code assumes mono, producing half-pitch audio and server ring overflow. Fixes: - Codec now advertises ONLY 48 kHz / 16-bit via `CODEC_PARAM_SUPP_PCM_SIZE_RATES`. ALSA handles any format mismatch in software (linear-interp upsample or straight passthrough) and the emulator rate always matches the pacing math. - Widgets drop `CODEC_PARAM_AUDIO_WIDGET_CAPS_STEREO` so the driver configures mono streams only. - Pacing now derives from `stream->fmt` per HDA spec 7.3.3.10 (base rate, multiplier, divisor, bits, channels). Robust to whatever format the guest driver ends up selecting. - Worker downmixes multi-channel PCM to mono before calling the backend (defensive — keeps the backend contract simple even if a future codec advertises stereo again). 16-bit fast path with stack buffer; non-16-bit falls through to raw pass-through.
) Expose a second NS16550A UART whose chardev backend is a pair of 64 KiB ring buffers instead of stdio. Java-side consumers drain guest TX via ns16550a_bridge_poll and inject guest RX via ns16550a_bridge_feed — enough to implement an in-game serial console or a host-driven RPC transport on top of a stock NS16550A, without patching the UART or exposing chardev internals through JNI. Architecture mirrors feat/sound-backend-api's HDA ring pattern: - ns16550a_bridge_init attaches the UART via the existing ns16550a_init_auto(machine, chardev) path and returns an opaque bridge handle. - poll / feed own a spinlock guarding two ringbufs; chardev read/write from the UART's thread and the Java thread contend on the same lock. - chardev_notify is called outside the lock after an RX/TX flag edge, so the UART's IRQ path doesn't re-enter while we hold state. - TX overflow drops oldest bytes (HDA's latency-first policy). RX overflow reports a short write; caller retries. - ns16550a_bridge_stats exposes {pushed, popped, fed, consumed, dropped} for instrumentation and tests. The bridge lifetime is owned by the MMIO device: ns16550a_remove calls chardev_free on our chardev, which frees the bridge struct and destroys the ringbufs. No explicit JNI teardown call. Zero changes to librvvm core — ns16550a.c already accepts a pluggable chardev (ns16550a_init_auto). The stdio default (ns16550a_init_term_auto) is unchanged; this just adds a second factory for embedders.
The bridge's CHARDEV_TX flag means "ring has space for more guest writes" — true continuously for the 64KB ring in practical use. That kept ns16550a_notify short-circuiting on an unchanging flag word, which meant the UART never re-raised its TX-empty IRQ after the kernel enabled IER.THR and handled the first interrupt. Kernel's IRQ-driven 8250 tx path then deadlocked: single-byte `echo > ttyS1` hangs inside close()'s tty drain, while the console port (polled TX, no IRQ dependency) works fine. Confirmed by `setserial ttyS1 irq 0` making writes complete instantly. Fix: whenever Java actually drains bytes out of the ring, issue a short flag-pulse to ns16550a_notify — first with TX cleared, then with TX restored — emulating the THRE low→high transition real hardware exhibits after each transmit. No-op when zero bytes were drained, to avoid spurious IRQ re-fires every tick.
The emulator CPU doesn't implement the `seed` CSR that the Zkr (cryptographic seed) extension requires, but the hardcoded ISA extension string in riscv_exts advertised it. Linux 6.18+ `arch_get_random_seed_longs` trusts the FDT, issues `csrrwi x13, seed, 0` at `start_kernel+0x68a` (before any console is initialized), and the undefined instruction trap kills the kernel before it can even complain (Oops — illegal instruction, then immediate panic). Empirical repro: boot any recent-mainline Linux kernel with Zkr detection enabled; it will oops at `random_init_early`. Confirmed in Scalar Evolution / Alpine riscv64 linux-lts 6.18.22 under this emulator. Drop zkr from the advertised extension list until the CSR is wired up. Other zk* extensions (zbkb, zbkx) stay because they're just bit-manip instructions already covered by the JIT.
Turn the seed CSR from a silently-broken stub into a ratified-Zkr-1.0.1
virtual entropy source (Section 4.4.3) and re-advertise zkr in the FDT
ISA string. Previously riscv_csr_seed returned raw 16 bits with OPST
bits [31:30] implicitly zero (= BIST, "self-test running, retry"),
which made Linux's csr_seed_long loop 100 times through cpu_relax and
then return false — the extension was advertised-but-useless, and
booting Linux 6.18+ with zkr in the FDT appeared to panic.
Three changes:
* riscv_csr.c: riscv_csr_seed returns 0x80000000 | rand16 — that's
OPST=ES16 in the top two bits + 16 fresh entropy bits in [15:0].
Source is rvvm_csprng_bytes (new, see below), not rvvm_randombytes
— the spec requires >=256-bit security, and xorshift64 obviously
isn't that.
* utils.{c,h}: new rvvm_csprng_bytes() pulls from the host OS CSPRNG.
Linux / FreeBSD / Solaris / Illumos: getrandom(2). macOS / OpenBSD:
getentropy(3). Windows: BCryptGenRandom via the system-preferred
provider. Fallback: /dev/urandom for older POSIX. All four paths
retry EINTR, loop on short reads, and log + bail if the syscall
returns a hard error. Expect ~1 syscall per call — not hot-path
material, so don't use for tap_user TCP ISNs etc.; keep
rvvm_randombytes for that.
* rvvm.c: re-add zkr to riscv_exts. The historical drop (commit
52c70d1) was a workaround for the broken stub; now that the CSR
is spec-compliant, the FDT advertisement matches reality again.
mseccfg.SSEED / USEED are already set at hart reset (rvvm_create_machine
thread init, circa line 1223), so this was the missing half — guests
in S-mode or U-mode can now access sentropy without trapping.
Tested: compiles clean with make lib USE_JNI=1 on Linux x86_64. The
runtime path is covered end-to-end by booting a guest kernel that
exercises arch_get_random_seed_longs (Linux 6.18+ does during
start_kernel -> random_init_early, well before console init).
Virtual entropy sources per Section 4.4.3 don't need M-mode firmware
cooperation to grant S-mode the right to poll sentropy — they ARE the
firmware. Gating S-mode access on mseccfg.SSEED in an emulator is
theatre at best and a live-fire footgun at worst:
* We set mseccfg.USEED | mseccfg.SSEED in rvvm_create_machine's hart
init (circa rvvm.c:1223).
* OpenSBI 1.6 clobbers mseccfg during hart bring-up — its PMP
enhancement path writes 0 to the CSR while configuring MML/MMWP,
zeroing our USEED/SSEED bits as collateral.
* By the time Linux 6.18+ runs csrrwi x13, seed, 0 at
start_kernel+0x68a (random_init_early), SSEED is 0, our
riscv_csr_seed_enabled returns false for PRIV_SUPERVISOR, and the
emulator traps the access as illegal instruction. Kernel panics
before the console is even initialized.
Fix: let M and S mode through unconditionally. U-mode still gates on
mseccfg.USEED — that one actually matters (kernels use it to forbid
unprivileged entropy polling), and OpenSBI doesn't interfere with it
post-boot because it's not on the PMP-setup path in the same way.
This keeps the ABI spec-compliant from the kernel's point of view (a
kernel that writes its own SSEED=1 still sees the same behavior) while
removing the OpenSBI ordering hazard.
Empirically verified: before this patch, a Linux 6.18.22 guest oopses
at arch_get_random_seed_longs+0x14 with "Illegal instruction"
(badaddr=015056f3 = csrrwi x13, seed, 0) during start_kernel. After,
it boots through random_init_early and on into userspace.
The sound_hda_init_ex header comment claimed "mono, 192 kHz" but the codec has been locked to 48 kHz since 13baa7d. It also called out "JNI embedders" specifically, which reads as mod-scoped in a public librvvm header. Swap to "48 kHz" and generalise the use-case list to the neutral framing used in the upstream PR (managed runtimes, IPC channels, WAV capture fixtures). Same adjustment mirrored in the sound_hda_init_ex backend-selection comment in sound-hda.c. No behavioural change.
Two build failures in the recent rvvm_csprng_bytes addition, both turned up by ScalarEvolution2's CI matrix: macOS arm64 — getentropy(3) called but not declared. The Apple branch was folded in with OpenBSD under a single #include <unistd.h>. OpenBSD does declare getentropy there; macOS does not — since 10.12 Sierra it lives in <sys/random.h>. Split the Apple and OpenBSD branches so each pulls the header where the symbol actually is. Windows x86_64 — undefined reference to BCryptGenRandom. The Windows branch calls BCryptGenRandom but the Makefile never linked bcrypt on MinGW. Added -lbcrypt to the Windows LDFLAGS via the same check_cc_flags helper used for other Win32 libs, gated on OS=windows. Every Windows target we support (NT 6+ / Vista and later) ships bcrypt.dll, so the link is unconditional. Cross-validated both with zig cc: - target x86_64-windows-gnu: librvvm.dll links, bcrypt.dll imported. - target aarch64-macos: librvvm.dylib links, getentropy imported.
utils: Fix CSPRNG build on macOS and Windows
Addressing a race in the original CAS-on-spawn guard caught by @epoll-reactor-2's review: the CAS gate used the single `running` flag for both "guest wants the stream" and "a worker thread exists". If the guest wrote run=0 then run=1 while the worker was mid-iteration (e.g. inside sleep_ns for pacing), the new spawn could CAS 0→1 on running while the old worker was still finishing a BDL entry. Two workers would then race on stream->lpib / running. Fix: separate `worker_alive` (owned by the worker) from `running` (owned by the guest's MMIO writes). - Spawn publishes run=1 first, then CAS(worker_alive, 0, 1). If the CAS loses, an existing worker is still alive and will observe the run=1 on its next drain iteration. - Worker clears worker_alive only at function return, after its final running-check. A subsequent re-check of running after the clear — with a re-CAS on worker_alive — catches the narrow window where the guest wrote run=1 between the drain's running=0 check and our clear, and no new spawn could claim the slot. Drain logic factored out as sound_hda_stream_drain so the lifetime loop (worker_alive management + re-claim) can wrap it cleanly. Early bailouts on rate=0 or dma=NULL return from the drain without needing to touch worker_alive. Also replace the magic sleep_ms(20) in sound_hda_remove with a poll on worker_alive. The old 20 ms figure was derived from a 192 kHz assumption that no longer holds after the codec was locked to 48 kHz (a 4 KiB BDL entry now paces ~43 ms), so the prior code could return with the worker still mid-sleep. Cap the poll at ~1 s as a liveness guard against a backend blocking inside subsystem.write.
Two host-side fixes pairing with the codec lock in the preceding fix commit: 1. Open the ALSA host PCM at 48 kHz, not 192 kHz. sound-hda.c now advertises only 48 kHz mono 16-bit via CODEC_PARAM_SUPP_PCM_SIZE_RATES. The stream worker paces at whatever rate stream->fmt encodes — which, given a single-option codec, settles to 48 kHz. Meanwhile alsa.c was still opening snd_pcm at 192000. Feeding 48 kHz bytes into a 192 kHz host stream plays audio back 4× fast (two octaves up). 2. Widen the host-side period and buffer. 128-frame periods (~2.67 ms at 48 kHz) were too tight: a single host scheduling hiccup xruns PipeWire, which propagates as back-pressure to the emulator's stream worker — e.g. mpg123 returning 3400/4608 from write() as its own guest-side ALSA buffer fills while PipeWire is stuck refilling. Settle on 960-frame periods (20 ms) with a 4× buffer (80 ms total). These numbers only size the host PipeWire/ALSA ring the emulator feeds into; guest userspace continues to size its own buffers on the emulated HDA DMA BDL independently. 80 ms of host slack is imperceptible for typical playback and rides out normal scheduler jitter cleanly. Resolving snd_pcm_hw_params_set_buffer_size_near through the existing dlib symbol table keeps the optional-libasound loader path consistent with every other snd_pcm_* call in this file.
Provides a hermetic development environment and a proper
nixpkgs-style derivation for RVVM, for NixOS / nix-darwin / any host
with Nix installed.
Highlights worth calling out:
- Dev shell: pulls gcc/clang/gdb/cmake/pkg-config plus GUI deps
(X11, Wayland, xkbcommon) and ALSA/SDL2. `direnv allow` in the
repo root auto-activates it via the .envrc wrapper.
- Package derivation: wires Makefile flags for USE_WAYLAND /
USE_X11 / USE_NET / USE_SOUND / USE_ALSA and installs the rvvm
binary to $out/bin. `nix build` produces a self-contained result
symlink.
- Two PipeWire/ALSA quirks documented inline with why:
1. alsa-lib's global.h redefines struct timespec without
_POSIX_C_SOURCE=200809L — collides with glibc.
2. NixOS /etc/alsa/conf.d points `default` at pipewire, but
alsa-lib only discovers the pipewire ALSA plugin via
ALSA_PLUGIN_DIR. Both the wrapped binary and the dev shell
export that.
- RUNPATH trick: ALSA is pure dlopen (dlib/), nothing references it
at link time, so `ld --as-needed` would drop it. Pin alsa-lib via
NIX_LDFLAGS=-rpath so `dlopen("libasound.so")` resolves at runtime.
Allowlist flake.nix / flake.lock / .envrc in .gitignore alongside
the other explicitly-tracked project files.
Add Nix flake for reproducible dev shell and package build
Three interdependent bugs silently dropped every output-stream IRQ before snd_pcm_period_elapsed ran, leaving ALSA writers blocked in wait_for_avail until a signal unblocked them with -EINTR. Symptom: streaming playback (mpg123) writes a frame or two, then "wrote only N of M (No error information?)". File-backed playback (aplay at 22 kHz mono) slipped through because its byte rate was below the BDL drain rate, so the runtime buffer never filled and wait_for_avail was never entered. 1. INTSTS returned 1<<29. Stream indices follow descriptor order (ISS, OSS, BSS); with 1 input + 1 output the OSD0 mask is 1<<1. Linux's azx_interrupt iterates stream_list with `status & azx_dev->sd_int_sta_mask` — bit 29 never matched, the IRQ was discarded before stream_update ran. 2. BCIS (SDnSTS bit 2) was never latched on IOC. Even if INTSTS had pointed at the right stream, snd_hdac_bus_handle_stream_irq skips the ack callback unless SD_INT_COMPLETE is set in sd_status. 3. WALLCLK read returned 0. On top of the two above, intel.c:azx_position_ok compares `now - start_wallclk` against period_wallclk * 2/3 and declares any faster-than-expected IRQ bogus; a stuck-at-zero counter rejects every IRQ. Now a real 24 MHz monotonic counter off rvtimer_clocksource. Also guard the BDL drain path against a NULL pcm mapping: pci_get_dma_ptr can return NULL for a bad addr or zero-length entry, and feeding NULL to snd_pcm_writei / ringbuf memcpy is UB. Pacing still advances so LPIB keeps moving and the guest recovers.
Previously alsa_sound_write called snd_pcm_writei exactly once and
abandoned whatever didn't land:
- on -EPIPE (xrun) it prepared the PCM but skipped re-writing the
lost chunk, producing an audible ~40 ms silence gap per xrun;
- on a positive short return it ignored the tail frames, same click;
- on -EINTR it also dropped.
On a system where PipeWire's RT-scheduled ALSA client was occasionally
losing its window against the HDA stream worker, this showed up as
intermittent crackle even after the IRQ dispatch chain was fixed.
Loop until every frame is delivered, handling -EPIPE/-ESTRPIPE by
preparing and retrying the same chunk, -EINTR/-EAGAIN by retrying
directly, and giving up on other errors after a handful of xrun
retries so a disconnected device can't spin us forever.
Root cause of guest-side RPC timeouts on ttyS1: ns16550a_dev_t::flags is zero-initialized by safe_new_obj and never synced from the bridge's b->flags until something fires a chardev_notify. The NS16550A's own sync paths (ns16550a_poll) only run after chardev_write on THR or chardev_read after an RX-available RBR read — neither fires during kernel port-startup on a second UART that's not the console. Consequence: when Linux's 8250 driver enables THRI on IER for its first user-space write (uart_start_tx → set_THRI → serial_out(UART_IER, THRI)), ns16550a_update_irq reads uart->flags==0, finds no TX bit set, and lowers the IRQ line. The first TX IRQ never fires. Kernel waits indefinitely in the 8250 IRQ-driven tx path. ttyS0 hid this because kernel printk runs polled writes to THR via serial8250_console_write, and each of those writes bounces through ns16550a_poll which syncs uart->flags. The bug only manifested on secondary UARTs exposed via the JNI bridge — precisely scev's RPC channel. Commit 7638bb7's edge-pulse-on-drain workaround couldn't bootstrap this either: it fires only when `got > 0`, and `got` stays 0 until the kernel writes THR, which it won't do until it gets its first TX IRQ. Chicken-and-egg. Fix: sync uart->flags to b->flags once, immediately after ns16550a attach in bridge_init. After that every IRQ dispatch path keeps the two in lockstep through the existing sync sites. Also removes the edge-pulse workaround in bridge_poll — now redundant. RISC-V PLIC's level-triggered rearm (plic_complete_irq in riscv-plic.c:225) re-pends the line on each kernel completion as long as update_irq keeps it raised, so the kernel drains xmit_fifo via back-to-back serial8250_tx_chars invocations inside a single logical interrupt, at CPU speed. No per-tick pulse required.
Two small robustness fixes in the JNI bridge backing NS16550A. jni_uart_update: always notify with the current flags, not only on a rising-edge delta. The previous `flags & ~prev` gate silently dropped 1→0 transitions and any case where uart->flags had drifted out of sync with b->flags, leaving the UART to self-correct only on its next ns16550a_poll call (THR write / RX-available RBR read) — paths that can be arbitrarily delayed. ns16550a_notify already short- circuits on unchanged state through atomic_swap, so the cost of unconditional notify is one atomic swap per 16 ms eventloop tick when idle, for no behavioral change in the common case and correctness in the drift case. jni_uart_write: track actual bytes written, not requested. The previous code bumped `total_pushed += nbytes` and returned `nbytes` even when the overflow-drop path couldn't make enough room (write larger than the ring size — pathological in practice but possible). Now total_pushed reflects ringbuf_write's actual return, while the function still returns nbytes to the UART: a short return there would make the caller treat the remainder as a retryable partial write and hang the 8250 TX path waiting for THRE. total_pushed and tx_dropped stay accurate for instrumentation.
jni/ns16550a_bridge: fix ttyS1 RPC timeouts + robustness cleanups
Mainlining
Issue LekKit#208 still reproduces on the PR-207 fix because the cooperative halt has two holes. 1. sound_hda_stream_drain only checks stream->running at the top of the outer while-loop. Once a BDL pass starts, the inner for-loop walks every entry (up to lvi+1 × ~43 ms pacing) regardless of running — several seconds of work that outlive the pacing budget. 2. sound_hda_remove caps the worker_alive wait at 1 s. Past the cap, the caller proceeds to rvvm_machine_free and the PCI state is freed while the worker is still in the middle of a BDL pass. The next pci_send_irq / pci_get_dma_ptr then touches a freed pci_func — exactly the SEGV_ACCERR stack in LekKit#208. The prior comment framed the cap as a liveness guard with "worst outcome is a brief thread leak — not a use-after-free." It was the UAF. Fix both: - Add a running check inside the inner for-loop, after the backend write and before pacing / IRQ dispatch. Returns from the drain within one backend call of running=0. - Drop the 1 s cap in sound_hda_remove; wait unbounded on worker_alive. Log a one-shot rvvm_warn after 5 s as a diagnostic for a wedged backend (a separate commit adds the abort() hook that makes that case unreachable for well-behaved backends).
The unbounded wait in sound_hda_remove from the previous commit is
safe against UAF but can stall teardown when the backend's write()
is blocked: ALSA mid-xrun under PipeWire load, IPC sinks without an
internal queue, or any sink whose callback sleeps. With only running=0
to signal shutdown, the worker only notices between backend calls —
useless while it's stuck inside one.
Add an optional abort() hook to sound_subsystem_t, called from
sound_hda_remove before the worker_alive wait. Backends implement
abort() by signalling whatever their write() is blocked on, so the
blocked writei returns promptly and the inner-loop running check
(from the previous commit) takes the worker out of the drain.
ALSA implementation:
- alsa_subsystem_t gains an `aborted` flag checked at the top of
the writei retry loop.
- alsa_sound_abort sets the flag and calls snd_pcm_drop to force
any in-flight writei to return -EPIPE. Without the flag, the
retry loop would snd_pcm_prepare + writei again and re-block.
Public backend API (sound_hda_init_ex / _auto_ex) gains a matching
sound_hda_backend_abort_fn parameter. NULL preserves non-blocking
semantics — the JNI ring sink passes NULL because its write is just
mutex + memcpy and cannot block the worker on external I/O.
With this in place the rvvm_warn("worker still alive after 5 s")
diagnostic in sound_hda_remove is effectively unreachable for any
well-behaved backend.
dhcpcd's Duplicate Address Detection sends an ARP-who-has for the offered IP with sender_ip set to 0.0.0.0 (RFC 5227). The previous handler treated it like any ARP request: since sender (0.0.0.0) != target (offered IP), it replied claiming GATEWAY_MAC owned the offered address. dhcpcd then saw that reply, flagged a duplicate-address conflict, aborted the lease commit, and re-sent a DHCPDISCOVER. Loop forever. Filter ARP probes (sender_ip == 0.0.0.0) out of the reply path. We're user-mode networking — there IS no other host on the virtual LAN — so any offered address is unique by construction, and staying silent on DAD is the right answer. Reported against scev via pufit/ScalarEvolution-21: boot Alpine with a network card, install dhcpcd, watch the console print the offered/probing/claims/DAD/soliciting loop indefinitely. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Author
|
Sorry, AI got a little bit crazy. This PR is not ready yet. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.