diff --git a/crates/openshell-driver-podman/NETWORKING.md b/crates/openshell-driver-podman/NETWORKING.md
new file mode 100644
index 000000000..87eef079c
--- /dev/null
+++ b/crates/openshell-driver-podman/NETWORKING.md
@@ -0,0 +1,418 @@
+# Rootless Podman Networking
+
+Deep-dive into how networking works in the Podman compute driver when running
+rootless with pasta as the network backend. Covers the external tooling
+(Podman, Netavark, pasta, aardvark-dns), the three nested namespace layers, and
+the complete data paths for SSH, outbound traffic, and supervisor-to-gateway
+communication.
+
+For the general Podman driver architecture, lifecycle, API surface, and driver
+comparison, see [README.md](README.md).
+
+## Component Stack
+
+Podman's networking is composed of four independent projects:
+
+| Component | Language | Role |
+|---|---|---|
+| Podman | Go | Container runtime; orchestrates network lifecycle. |
+| Netavark | Rust | Network backend; creates interfaces, bridges, firewall rules. |
+| aardvark-dns | Rust | Authoritative DNS server for container name resolution. |
+| pasta, part of passt | C | User-mode networking; L2-to-L4 socket translation for rootless containers. |
+
+The key split is that rootful containers default to Netavark bridge networking
+with real kernel interfaces, while rootless containers commonly use pasta
+user-mode networking without needing host privileges.
+
+## How Netavark Works
+
+Netavark is invoked by Podman as an external binary. It reads a JSON network
+configuration from STDIN and executes one of three commands:
+
+- `netavark setup <netns-path>` creates interfaces, assigns IPs, and sets up
+  firewall rules for NAT and port-forwarding.
+- `netavark teardown <netns-path>` reverses setup and removes interfaces and
+  firewall rules.
+- `netavark create` takes a partial network config and completes it by
+  assigning subnets and gateways.
+
+For rootful bridge networking:
+
+1. Podman creates a network namespace for the container.
+2. Podman invokes `netavark setup` with the network config JSON.
+3. Netavark creates a bridge, such as `podman0`, if it does not exist. The
+   default subnet is `10.88.0.0/16`.
+4. Netavark creates a veth pair. One end goes into the container's netns and
+   the other attaches to the bridge.
+5. Netavark assigns an IP from the subnet to the container's veth interface.
+6. Netavark configures iptables or nftables rules for masquerade and port
+   mappings.
+7. Netavark starts aardvark-dns when DNS is enabled, listening on the bridge
+   gateway address.
+
+```text
+Host Kernel
+ |
+ +-- Bridge interface, such as "podman0"
+ |    |
+ |    +-- veth pair endpoint, host side, container 1
+ |    +-- veth pair endpoint, host side, container 2
+ |
+ +-- Host physical interface, such as eth0
+ |
+ +-- NAT, iptables or nftables rules managed by Netavark
+```
+
+Netavark also supports macvlan networks, where the container gets a
+sub-interface of a physical host NIC with its own MAC address, and external
+plugins via a documented JSON API.
+
+## How Pasta Works
+
+Unprivileged users cannot create network interfaces on the host. They cannot
+create veth pairs, bridges, or iptables rules. Netavark's bridge approach
+cannot work directly for rootless containers without an additional rootless
+networking layer.
+
+Pasta, part of the `passt` project, operates in userspace and translates
+between the container's L2 TAP interface and the host's L4 sockets. It requires
+no capabilities or privileges.
+ +```text +Container Network Namespace + | + +-- TAP device, such as "eth0" + | ^ + | | L2 frames, Ethernet + | v + +-- pasta process, userspace + | + | Translation: L2 frames <-> L4 sockets + | + v + Host Network Stack, native TCP/UDP/ICMP sockets +``` + +For an outbound TCP connection from a container: + +1. The application calls `connect()` to an external address. +2. The kernel routes the packet through the default gateway to the TAP device. +3. Pasta reads the raw Ethernet frame from the TAP file descriptor. +4. Pasta parses L2/L3/L4 headers and identifies the TCP SYN. +5. Pasta opens a native TCP socket on the host and calls `connect()` to the + same destination. +6. When the host socket connects, pasta reflects the SYN-ACK back through the + TAP as an L2 frame. +7. For ongoing data transfer, pasta translates between TAP frames and the host + socket, coordinating TCP windows and acknowledgments between the two sides. + +Pasta does not maintain per-connection packet buffers. It reflects observed +sending windows and ACKs directly between peers. This is a thinner translation +layer than a full TCP/IP stack. + +### Built-in Services + +Pasta includes minimal network services so the container stack can +auto-configure: + +| Service | Purpose | +|---|---| +| ARP proxy | Resolves the gateway address to the host's MAC address. | +| DHCP server | Hands out a single IPv4 address, usually matching the host's upstream interface. | +| NDP proxy | Handles IPv6 neighbor discovery and SLAAC prefix advertisement. | +| DHCPv6 server | Hands out a single IPv6 address, usually matching the host's upstream interface. | + +By default there is no NAT. Pasta copies the host's IP addresses into the +container namespace. + +### Local Connection Bypass + +For connections between the container and the host, pasta implements a local +bypass path: + +- Packets with a local destination skip L2 translation. +- TCP uses `splice(2)`. +- UDP uses `recvmmsg(2)` and `sendmmsg(2)`. + +### Port Forwarding + +By default, pasta uses auto-detection. It scans `/proc/net/tcp` and +`/proc/net/tcp6` periodically and automatically forwards ports that are bound +and listening. Port forwarding is configurable through pasta options. + +### Security Properties + +Pasta is designed for rootless use: + +- No dynamic memory allocation after startup. +- All capabilities dropped, except `CAP_NET_BIND_SERVICE` when granted. +- Restrictive seccomp profile. +- Detaches into its own user, mount, IPC, UTS, and PID namespaces. +- No external dependencies beyond libc. + +### Inter-Container Limitation + +Unlike bridge networking, pasta containers are isolated from each other by +default. No virtual bridge connects them. Communication requires port mappings +through the host, pods with a shared network namespace, or opting into rootless +Netavark bridge networking with `podman network create`. 
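+To make the port-forwarding auto-detection described above concrete, here is a
+minimal sketch of the kind of scan pasta performs. It assumes the standard
+hex-encoded layout of `/proc/net/tcp`; pasta's real implementation is C, also
+watches `/proc/net/tcp6`, and rescans periodically:
+
+```rust
+use std::fs;
+
+/// Collect ports that are bound and listening (state 0x0A in /proc/net/tcp).
+fn listening_tcp_ports() -> std::io::Result<Vec<u16>> {
+    let table = fs::read_to_string("/proc/net/tcp")?;
+    let mut ports = Vec::new();
+    for line in table.lines().skip(1) {
+        let fields: Vec<&str> = line.split_whitespace().collect();
+        // fields[1] is "local_address:port" in hex, fields[3] is the state.
+        if fields.len() > 3 && fields[3] == "0A" {
+            if let Some(hex_port) = fields[1].rsplit(':').next() {
+                if let Ok(port) = u16::from_str_radix(hex_port, 16) {
+                    ports.push(port);
+                }
+            }
+        }
+    }
+    Ok(ports)
+}
+
+fn main() -> std::io::Result<()> {
+    for port in listening_tcp_ports()? {
+        println!("would forward host traffic for port {port}");
+    }
+    Ok(())
+}
+```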
+
+## Three Nested Namespaces
+
+The Podman compute driver creates three layers of network isolation:
+
+```text
+Namespace 1: Host
+ |
+ pasta manages port forwarding, such as 127.0.0.1:<host-port>
+ gateway listens on its configured bind address and port
+ |
+Namespace 2: Rootless Podman network namespace, managed by pasta
+ |
+ Bridge "openshell", often 10.89.x.0/24
+ aardvark-dns for container name resolution
+ |
+ Container netns
+ supervisor, proxy, and relay client run here
+ |
+Namespace 3: Inner sandbox netns, created by supervisor
+ |
+ veth pair, such as 10.200.0.1 <-> 10.200.0.2
+ iptables forces ordinary traffic through proxy
+ user workload runs here
+```
+
+Pasta bridges namespace 1 and 2. The veth pair bridges namespace 2 and 3. The
+proxy at the boundary of namespace 2 and 3 enforces network policy.
+
+### Layer 1 Pasta
+
+At startup, the driver ensures a Podman bridge network exists:
+
+```rust
+client.ensure_network(&config.network_name).await?;
+```
+
+This creates a bridge network named `openshell` by default, with DNS enabled.
+In rootless mode, this bridge can exist inside a user namespace managed by
+pasta. The bridge IP range is not reliably routable from the host.
+
+```text
+Host
+ |
+ 127.0.0.1:<host-port>, pasta binds this on the host
+ |
+ pasta process, translates L4 sockets <-> L2 TAP frames
+ |
+ rootless network namespace
+ |
+ Bridge "openshell", such as 10.89.1.0/24
+ |
+ +-- 10.89.1.1, bridge gateway and aardvark-dns
+ |
+ +-- veth to container netns
+      |
+      10.89.1.2, container IP
+```
+
+### Layer 2 Container Networking
+
+The container spec configures:
+
+- `nsmode: "bridge"` to use the Podman bridge network.
+- `networks` to attach to the configured bridge, `openshell` by default.
+- `portmappings` with `host_port: 0`, `container_port: 2222`, and `protocol:
+  "tcp"` to publish the SSH compatibility port on an ephemeral host port.
+- `hostadd` entries for `host.containers.internal:host-gateway` and
+  `host.openshell.internal:host-gateway`.
+
+Pasta is not explicitly configured by the driver. The driver requests bridge
+mode and logs the network backend that Podman reports at startup.
+
+The `host.containers.internal` hostname is injected into `/etc/hosts` so the
+supervisor can reach the gateway on the host. If `OPENSHELL_GRPC_ENDPOINT` is
+empty, the driver auto-detects:
+
+```rust
+if config.grpc_endpoint.is_empty() {
+    let scheme = if config.tls_enabled() {
+        "https"
+    } else {
+        "http"
+    };
+    config.grpc_endpoint =
+        format!("{scheme}://host.containers.internal:{}", config.gateway_port);
+}
+```
+
+The bridge gateway IP is not a stable substitute in rootless mode because it
+can live inside the user namespace rather than on the host.
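+
+As a sketch of how these fields fit together in a libpod create request, the
+fragment below mirrors the bullet list above. It is illustrative only: the
+driver builds the real spec in `container.rs`, and the exact field spellings
+belong to the libpod API rather than this document.
+
+```rust
+use serde_json::json;
+
+// Networking fragment of a container-create spec, mirroring the list above.
+// "openshell" stands in for the configured network name.
+fn network_spec_fragment() -> serde_json::Value {
+    json!({
+        "netns": { "nsmode": "bridge" },
+        "networks": { "openshell": {} },
+        "portmappings": [
+            { "host_port": 0, "container_port": 2222, "protocol": "tcp" }
+        ],
+        "hostadd": [
+            "host.containers.internal:host-gateway",
+            "host.openshell.internal:host-gateway"
+        ]
+    })
+}
+```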
+
+### Layer 3 Inner Sandbox Network Namespace
+
+Inside the container, the supervisor creates another network namespace for the
+user workload:
+
+```text
+Container on the Podman bridge
+ |
+ Supervisor process, running in container's default netns
+ |
+ +-- Proxy listener at the inner namespace gateway address
+ |
+ +-- veth pair
+      |
+      +-- Inner network namespace
+           |
+           sandbox-side veth address
+           |
+           default route -> supervisor-side veth address
+           |
+           user code runs here
+           |
+           iptables rules:
+             ACCEPT -> proxy TCP
+             ACCEPT -> loopback
+             ACCEPT -> established/related
+             LOG    -> TCP SYN bypass attempts
+             REJECT -> TCP
+             LOG    -> UDP bypass attempts
+             REJECT -> UDP
+```
+
+The supervisor uses `nsenter --net=<netns-path>` rather than `ip netns exec` to
+avoid sysfs remount issues that arise under rootless Podman where real host
+`CAP_SYS_ADMIN` is unavailable.
+
+A tmpfs is mounted at `/run/netns` in the container spec so the supervisor can
+create named network namespaces. In rootless Podman this directory does not
+exist on the host, so a private tmpfs gives the supervisor its own writable
+`/run/netns` without needing host filesystem access.
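+
+The sketch below shows how rules of this shape could be applied from the
+supervisor's side of the veth pair using `nsenter --net=<netns-path>`. It is
+illustrative only: the proxy address `10.200.0.1` and port `3128` are taken
+from the examples in this document, and the real supervisor is not obliged to
+shell out like this.
+
+```rust
+use std::process::Command;
+
+// Reject-by-default egress inside the inner netns, mirroring the rule list
+// in the diagram above. All names and addresses are illustrative.
+fn apply_egress_policy(netns_path: &str) -> std::io::Result<()> {
+    let rules: &[&[&str]] = &[
+        // Only ordinary egress destination: the CONNECT proxy.
+        &["-A", "OUTPUT", "-p", "tcp", "-d", "10.200.0.1", "--dport", "3128", "-j", "ACCEPT"],
+        &["-A", "OUTPUT", "-o", "lo", "-j", "ACCEPT"],
+        &["-A", "OUTPUT", "-m", "conntrack", "--ctstate", "ESTABLISHED,RELATED", "-j", "ACCEPT"],
+        // Log bypass attempts, then reject everything else.
+        &["-A", "OUTPUT", "-p", "tcp", "--syn", "-j", "LOG", "--log-prefix", "tcp-bypass: "],
+        &["-A", "OUTPUT", "-p", "tcp", "-j", "REJECT"],
+        &["-A", "OUTPUT", "-p", "udp", "-j", "LOG", "--log-prefix", "udp-bypass: "],
+        &["-A", "OUTPUT", "-p", "udp", "-j", "REJECT"],
+    ];
+    for rule in rules {
+        let status = Command::new("nsenter")
+            .arg(format!("--net={netns_path}"))
+            .arg("iptables")
+            .args(*rule)
+            .status()?;
+        if !status.success() {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::Other,
+                "iptables invocation failed",
+            ));
+        }
+    }
+    Ok(())
+}
+```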
+
+## Complete Data Paths
+
+### SSH Session
+
+```text
+Client, openshell CLI
+ |
+ 1. gRPC: CreateSshSession -> gateway, returns token and connect_path
+ 2. HTTP CONNECT /connect/ssh to gateway
+    headers: x-sandbox-id, x-sandbox-token
+ |
+Gateway
+ |
+ 3. Looks up SupervisorSession for sandbox_id
+ 4. Sends RelayOpen{channel_id} over ConnectSupervisor bidi stream
+ |
+ gRPC traverses host -> pasta translation -> container bridge
+ |
+Supervisor inside container
+ |
+ 5. Receives RelayOpen, opens new RelayStream RPC back to gateway
+ 6. Sends RelayInit{channel_id} on the stream
+ 7. Connects to Unix socket /run/openshell/ssh.sock
+ 8. Bidirectional bridge: RelayStream <-> Unix socket
+ |
+SSH daemon inside container, Unix socket only
+ |
+ 9. Authenticates. Access is gated by the relay chain.
+ 10. Spawns shell process
+ 11. Shell enters inner netns via setns(fd, CLONE_NEWNET)
+ |
+User shell in sandbox netns
+```
+
+The SSH daemon listens on a Unix socket with restrictive permissions. The
+published TCP port mapping exists in the container spec for compatibility and
+health/debug paths. Normal SSH communication uses the gRPC reverse-connect
+relay pattern.
+
+### Outbound HTTP Request
+
+```text
+User code in inner netns
+ |
+ 1. curl https://api.example.com
+    HTTP_PROXY points at the local sandbox proxy
+ |
+ 2. TCP connect to proxy
+    allowed by iptables as the only ordinary egress destination
+ |
+ 3. HTTP CONNECT api.example.com:443
+ |
+Supervisor proxy in container netns
+ |
+ 4. Policy evaluation with process identity
+ 5. SSRF check
+ 6. Optional L7 TLS intercept and HTTP method/path inspection
+ |
+ 7. If allowed, TCP connect to api.example.com:443
+    from the container netns
+ |
+ 8. Through Podman bridge -> pasta -> host -> internet
+```
+
+### Supervisor gRPC Callback
+
+The Podman driver auto-detects the callback endpoint scheme based on whether
+TLS client certificates are configured. When the RPM's auto-generated PKI is in
+place, the endpoint is `https://host.containers.internal:8080` and the
+supervisor connects with mTLS. Without TLS configuration, it falls back to
+`http://host.containers.internal:8080`.
+
+```text
+Supervisor in container netns
+ |
+ 1. Connects to host.containers.internal:<gateway-port>
+    with mTLS when OPENSHELL_TLS_* paths are set
+ |
+ 2. Routed through container default gateway
+ |
+ 3. Pasta translates L2 frame -> host L4 socket
+    when rootless backend uses pasta
+ |
+ 4. Host TCP socket connects to gateway
+ |
+Gateway
+ |
+ 5. TLS handshake when enabled
+ 6. ConnectSupervisor bidirectional stream established
+ 7. Heartbeats at the interval accepted by the gateway
+ 8. Reconnects with exponential backoff on failure
+ 9. Same gRPC channel reused for RelayStream calls
+```
+
+The gateway binds to `0.0.0.0` by default in the RPM packaging. mTLS prevents
+unauthenticated access even though the gateway is reachable from the network.
+Client certificates are auto-generated by `init-pki.sh` on first start and
+bind-mounted into sandbox containers by the Podman driver.
+
+## Differences from the Kubernetes Driver
+
+| Aspect | Kubernetes | Podman, rootless pasta |
+|---|---|---|
+| Container or pod IP | Routable cluster-wide | Non-routable from the host in common rootless setups. |
+| Network reachability | Pod IPs reachable from gateway | Bridge not reliably routable from host; requires host aliases or published ports. |
+| Sandbox to gateway | Direct TCP to Kubernetes service or endpoint | `host.containers.internal` through bridge and rootless backend. |
+| SSH transport | Reverse gRPC relay | Reverse gRPC relay. |
+| Port publishing | Not needed for relay | Ephemeral host port remains in the container spec for compatibility and debug paths. |
+| TLS | mTLS via Kubernetes secrets | mTLS via mounted client files, RPM defaults, or explicit configuration. |
+| DNS | Kubernetes CoreDNS | Podman bridge DNS through aardvark-dns when DNS is enabled. |
+| Network policy | Kubernetes network policy for pod ingress plus supervisor policy | iptables inside inner sandbox netns plus supervisor policy. |
+| Supervisor delivery | Kubernetes driver managed pod image or template | OCI image volume mount. |
+| Secrets | Kubernetes Secret volume and env vars | Podman `secret_env` for handshake secret, plus mounted TLS files. |
+
+Both drivers use the same reverse gRPC relay for SSH transport. The most
+important Podman-specific difference is network reachability: in rootless
+Podman, the bridge network is not reliably routable from the host, so
+host-to-container and container-to-host communication must use host aliases,
+published ports, or the supervisor relay.
+
+## Port Assignments
+
+| Port | Component | Purpose |
+|---|---|---|
+| `8080` | Gateway | gRPC and HTTP multiplexed default server port. |
+| `2222` | Sandbox | Container port mapping default for the SSH compatibility port. |
+| `3128` | Sandbox proxy | HTTP CONNECT proxy inside the sandbox network model. |
+| `0` | Host | Ephemeral host port requested for the container SSH compatibility port. |
diff --git a/crates/openshell-driver-podman/README.md b/crates/openshell-driver-podman/README.md
index 1193416e1..d853bb5ea 100644
--- a/crates/openshell-driver-podman/README.md
+++ b/crates/openshell-driver-podman/README.md
@@ -1,74 +1,354 @@
 # openshell-driver-podman

-Podman-backed compute driver for rootless and single-machine OpenShell
-deployments.
+The Podman compute driver manages sandbox containers via the Podman REST API
+over a Unix socket. It targets single-machine and developer environments where
+rootless container isolation is preferred over a full Kubernetes cluster. The
+driver runs in-process within the gateway server and delegates all sandbox
+isolation enforcement to the `openshell-sandbox` supervisor binary, which is
+sideloaded into each container via an OCI image volume mount.
-The driver talks to the Podman libpod REST API over a Unix socket. It runs
-in-process with the gateway server and creates one sandbox container per
-sandbox. The `openshell-sandbox` supervisor inside the container still owns the
-actual agent isolation.
+For a rootless networking deep dive, see [NETWORKING.md](NETWORKING.md).
-## Runtime Model
+
+## Architecture
+
+The Podman driver communicates with the Podman daemon over a Unix socket and
+delegates sandbox isolation to the supervisor binary running inside each
+container.
+
+```mermaid
+graph TB
+    CLI["openshell CLI"] -->|gRPC| GW["Gateway Server<br/>(openshell-server)"]
+    GW -->|in-process| PD["PodmanComputeDriver"]
+    PD -->|HTTP/1.1<br/>Unix socket| PA["Podman API"]
+    PA -->|OCI runtime<br/>crun/runc| C["Sandbox Container"]
+    C -->|image volume<br/>read-only| SV["Supervisor Binary<br/>/opt/openshell/bin/openshell-sandbox"]
+    SV -->|creates| NS["Nested Network Namespace<br/>veth pair + proxy"]
+    SV -->|enforces| LL["Landlock + seccomp"]
+    SV -->|gRPC callback| GW
+```
+
+## Isolation Model
+
+The Podman driver provides the same protection layers as the other compute
+drivers. The driver itself does not implement isolation primitives directly. It
+configures the container so that the `openshell-sandbox` supervisor can enforce
+them at runtime.
+
+### Container Security Configuration
+
+The container spec in `container.rs` sets these security-critical fields:
+
+| Setting | Value | Rationale |
+|---|---|---|
+| `user` | `0:0` | The supervisor needs root inside the container for namespace creation, proxy setup, Landlock, seccomp, and filesystem preparation. |
+| `cap_drop` | Selected unneeded defaults | Podman's default capability set is already restricted. The driver drops capabilities the supervisor does not need. |
+| `cap_add` | `SYS_ADMIN`, `NET_ADMIN`, `SYS_PTRACE`, `SYSLOG`, `DAC_READ_SEARCH` | Grants supervisor-only capabilities required for namespace setup, process identity, and bypass diagnostics. |
+| `no_new_privileges` | `true` | Prevents privilege escalation after exec. |
+| `seccomp_profile_path` | `unconfined` | The supervisor installs its own policy-aware BPF filter. A container-level profile can block Landlock/seccomp syscalls during setup. |
+| `mounts` | Private tmpfs at `/run/netns` | Lets the supervisor create named network namespaces in rootless Podman. |
+
+The restricted agent child does not retain these supervisor privileges.
+
+### Capability Breakdown
+
+| Capability | Purpose |
+|---|---|
+| `SYS_ADMIN` | seccomp filter installation, namespace creation, and Landlock setup. |
+| `NET_ADMIN` | Network namespace veth setup, IP address assignment, routes, and iptables. |
+| `SYS_PTRACE` | Reading `/proc/<pid>/exe` and walking process ancestry for binary identity. |
+| `SYSLOG` | Reading `/dev/kmsg` for bypass-detection diagnostics. |
+| `DAC_READ_SEARCH` | Reading `/proc/<pid>/fd/` across UIDs so the proxy can resolve the binary responsible for a connection. |
+
+The driver intentionally keeps Podman's default `SETUID`, `SETGID`, `CHOWN`,
+and `FOWNER` capabilities because the supervisor needs them to drop privileges
+and prepare writable sandbox directories. It drops unneeded defaults such as
+`DAC_OVERRIDE`, `FSETID`, `KILL`, `NET_BIND_SERVICE`, `NET_RAW`, `SETFCAP`,
+`SETPCAP`, and `SYS_CHROOT`.
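+
+As a sketch, the two tables above map onto a create-spec fragment along these
+lines. It is illustrative only: the authoritative construction lives in
+`container.rs`, and the field spellings are assumptions modeled on the libpod
+spec.
+
+```rust
+use serde_json::json;
+
+// Security-related fragment of a container-create spec, mirroring the
+// Container Security Configuration and Capability Breakdown tables.
+fn security_spec_fragment() -> serde_json::Value {
+    json!({
+        "user": "0:0",
+        "cap_add": ["SYS_ADMIN", "NET_ADMIN", "SYS_PTRACE", "SYSLOG", "DAC_READ_SEARCH"],
+        "cap_drop": [
+            "DAC_OVERRIDE", "FSETID", "KILL", "NET_BIND_SERVICE",
+            "NET_RAW", "SETFCAP", "SETPCAP", "SYS_CHROOT"
+        ],
+        "no_new_privileges": true,
+        "seccomp_profile_path": "unconfined",
+        // Private tmpfs so the supervisor can create named netns entries.
+        "mounts": [
+            { "destination": "/run/netns", "type": "tmpfs", "source": "tmpfs" }
+        ]
+    })
+}
+```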
+
+## Supervisor Sideloading
+
+The supervisor binary is delivered to sandbox containers via Podman's OCI image
+volume mechanism, distinct from both the Kubernetes pod-volume approach and the
+VM's embedded guest bundle.
+
```mermaid
-flowchart LR
-    GW["Gateway"] -->|"in-process driver"| D["PodmanComputeDriver"]
-    D -->|"HTTP over Unix socket"| P["Podman API"]
-    P --> C["Sandbox container"]
-    C --> S["openshell-sandbox supervisor"]
-    S --> A["restricted agent child"]
+sequenceDiagram
+    participant D as PodmanComputeDriver
+    participant P as Podman API
+    participant C as Sandbox Container
+
+    D->>P: pull_image(supervisor, "missing")
+    D->>P: create_container(spec with image_volumes)
+    Note over P: Podman resolves image_volumes at<br/>libpod layer before OCI spec generation
+    P->>C: Mount supervisor image at /opt/openshell/bin (read-only)
+    D->>P: start_container
+    C->>C: entrypoint: /opt/openshell/bin/openshell-sandbox
+```
-The container is the runtime boundary. Inside it, the supervisor creates a
-nested network namespace, starts the policy proxy, applies Landlock/seccomp, and
-launches the agent child as an unprivileged user.
+The `supervisor` target in `deploy/docker/Dockerfile.images` copies the
+`openshell-sandbox` binary to `/openshell-sandbox` in the supervisor image.
+Mounting that image at `/opt/openshell/bin` makes the binary available as
+`/opt/openshell/bin/openshell-sandbox`.
+
+The container spec sets that binary as the entrypoint. This avoids relying on
+the sandbox image entrypoint or command, which might otherwise append the
+supervisor path as an argument to an image-provided shell.
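+
+The image-volume request behind this flow can be pictured as the fragment
+below. It is illustrative only: the field names under `image_volumes` are an
+assumption modeled on libpod's spec generator, not copied from the driver.
+
+```rust
+use serde_json::json;
+
+// Sideloading fragment: mount the supervisor image read-only and point the
+// entrypoint at the binary it carries.
+fn sideload_spec_fragment(supervisor_image: &str) -> serde_json::Value {
+    json!({
+        "image_volumes": [
+            {
+                "source": supervisor_image,
+                "destination": "/opt/openshell/bin",
+                "readwrite": false
+            }
+        ],
+        "entrypoint": ["/opt/openshell/bin/openshell-sandbox"]
+    })
+}
+```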
+
+## TLS
-## Supervisor Delivery
+When all three Podman TLS paths are set, the driver treats sandbox callbacks as
+mTLS callbacks:
-Podman uses an OCI image volume to mount the supervisor binary read-only at
-`/opt/openshell/bin`. The supervisor image is built from the `supervisor` target
-in `deploy/docker/Dockerfile.images`.
+- `OPENSHELL_PODMAN_TLS_CA`
+- `OPENSHELL_PODMAN_TLS_CERT`
+- `OPENSHELL_PODMAN_TLS_KEY`
-This keeps the supervisor outside the mutable sandbox image while avoiding a
-hostPath-style bind mount.
+The driver validates that the TLS paths are provided as a complete set. Partial
+configuration fails early instead of silently falling back to plaintext.
-## Rootless Adaptations
+When enabled, the driver:
-Rootless Podman has stricter capability behavior than Kubernetes. The container
-spec drops all capabilities and adds back only the supervisor capabilities it
-needs:
+1. Switches the auto-detected endpoint scheme from `http://` to `https://`.
+2. Bind-mounts the client cert files read-only into the container at
+   `/etc/openshell/tls/client/`.
+3. Sets `OPENSHELL_TLS_CA`, `OPENSHELL_TLS_CERT`, and `OPENSHELL_TLS_KEY` to
+   the container-side paths.
-- `SYS_ADMIN` for namespace and Landlock setup.
-- `NET_ADMIN` for nested network namespace routing.
-- `SYS_PTRACE` and `DAC_READ_SEARCH` for process identity inspection.
-- `SYSLOG` for bypass diagnostics.
-- `SETUID` and `SETGID` for dropping to the sandbox user.
+The supervisor reads these env vars and uses them to establish an mTLS
+connection back to the gateway. On SELinux systems, the bind mounts include
+Podman's shared relabel option so the container process can read the files.
-The restricted agent child loses these privileges before user code runs.
+The RPM packaging auto-generates a self-signed PKI on first start via
+`init-pki.sh`. Client certs are placed in the CLI auto-discovery directory
+(`~/.config/openshell/gateways/openshell/mtls/`) so the CLI connects with mTLS
+without manual configuration. See `deploy/rpm/CONFIGURATION.md` for the full
+RPM configuration reference.
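+
+A minimal sketch of the complete-set rule, with the struct and function names
+invented for illustration:
+
+```rust
+/// Host-side TLS material for sandbox mTLS callbacks (sketch).
+struct TlsPaths {
+    ca: Option<String>,
+    cert: Option<String>,
+    key: Option<String>,
+}
+
+/// Ok(true) = mTLS enabled, Ok(false) = plaintext, Err = partial config.
+fn validate_tls(paths: &TlsPaths) -> Result<bool, String> {
+    match (&paths.ca, &paths.cert, &paths.key) {
+        (Some(_), Some(_), Some(_)) => Ok(true),
+        (None, None, None) => Ok(false),
+        _ => Err(
+            "OPENSHELL_PODMAN_TLS_CA, OPENSHELL_PODMAN_TLS_CERT, and \
+             OPENSHELL_PODMAN_TLS_KEY must be set together"
+                .to_string(),
+        ),
+    }
+}
+```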

 ## Network Model

-The driver creates or reuses a Podman bridge network for container-to-host
-communication. The agent child does not use that bridge directly. The supervisor
-creates a nested namespace and routes agent egress through the local CONNECT
-proxy.
+Sandbox network isolation uses a two-layer approach: a Podman bridge network
+for container-to-host communication, and a nested network namespace created by
+the supervisor for sandbox process isolation.
+
+```mermaid
+graph TB
+    subgraph Host
+        GW["Gateway Server<br/>127.0.0.1:8080"]
+        PS["Podman Socket"]
+    end
+
+    subgraph Bridge["Podman Bridge Network (10.89.x.x)"]
+        subgraph Container["Sandbox Container"]
+            SV["Supervisor<br/>(root in user ns)"]
+            subgraph NestedNS["Nested Network Namespace"]
+                SP["Sandbox Process<br/>(sandbox user)"]
+                VE2["veth1: 10.200.0.2"]
+            end
+            VE1["veth0: 10.200.0.1<br/>(CONNECT proxy)"]
+            SV --- VE1
+            VE1 ---|veth pair| VE2
+        end
+    end
+
+    GW -.->|SSH via supervisor relay<br/>gRPC session| SV
+    SV -->|gRPC callback via<br/>host.containers.internal| GW
+    SP -->|all egress via proxy| VE1
+```
+
+Key points:
+
+- Bridge network: created by `client.ensure_network()` with DNS enabled.
+  Containers on the bridge can see each other at L3, but sandbox processes
+  cannot because they are isolated inside the nested netns.
+- Nested netns: the supervisor creates a private `NetworkNamespace` with a veth
+  pair. Sandbox processes enter this netns via `setns(fd, CLONE_NEWNET)` in the
+  `pre_exec` hook, forcing ordinary traffic through the CONNECT proxy; see the
+  sketch after this list.
+- Port publishing: the container spec still requests `host_port: 0` for the
+  configured SSH port. The gateway SSH tunnel uses the supervisor relay rather
+  than connecting directly to the published port.
+- Host gateway: `host.containers.internal:host-gateway` and
+  `host.openshell.internal:host-gateway` in `/etc/hosts` allow containers to
+  reach services on the gateway host.
+- nsenter: the supervisor uses `nsenter --net=<netns-path>` instead of
+  `ip netns exec` for namespace operations, avoiding the sysfs remount path
+  that fails in rootless containers.
+
+See [NETWORKING.md](NETWORKING.md) for the rootless Podman networking deep dive.
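+
+The `pre_exec` netns entry can be sketched as follows, assuming the `libc`
+crate; the function name is invented and the real hook lives in the supervisor:
+
+```rust
+use std::fs::File;
+use std::os::fd::AsRawFd;
+use std::os::unix::process::CommandExt;
+use std::process::{Child, Command};
+
+// Spawn a sandbox process that joins the nested netns before exec.
+fn spawn_in_sandbox_netns(netns_path: &str, program: &str) -> std::io::Result<Child> {
+    let ns = File::open(netns_path)?;
+    let mut cmd = Command::new(program);
+    // Safety: the hook runs between fork and exec and only calls setns.
+    unsafe {
+        cmd.pre_exec(move || {
+            if libc::setns(ns.as_raw_fd(), libc::CLONE_NEWNET) != 0 {
+                return Err(std::io::Error::last_os_error());
+            }
+            Ok(())
+        });
+    }
+    cmd.spawn()
+}
+```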
+
+## Supervisor Relay
+
+The Podman driver follows the same end-to-end contract as the Kubernetes and VM
+drivers for the in-container SSH relay: the socket path flows from gateway
+config to `PodmanComputeConfig`, into the sandbox environment, and into
+supervisor session registration on that path.
+
+1. `openshell-core` `Config::sandbox_ssh_socket_path` is copied into
+   `PodmanComputeConfig::sandbox_ssh_socket_path` when the gateway builds the
+   in-process driver.
+2. `build_env()` in `container.rs` sets `OPENSHELL_SSH_SOCKET_PATH` to that
+   value, alongside required vars such as `OPENSHELL_ENDPOINT` and
+   `OPENSHELL_SANDBOX_ID`. These driver-controlled entries overwrite template
+   environment variables to prevent spoofing.
+3. The supervisor reads `OPENSHELL_SSH_SOCKET_PATH` and uses it for the Unix
+   socket the gateway's SSH stack bridges to.
+
+The standalone `openshell-driver-podman` binary sets the same struct field from
+`OPENSHELL_SANDBOX_SSH_SOCKET_PATH`.
+
+## Credential Injection
+
+The SSH handshake secret is injected via Podman's `secret_env` API rather than
+a plaintext environment variable.
+
+| Credential | Mechanism | Visible in `inspect`? | Visible in `/proc/<pid>/environ`? |
+|---|---|---|---|
+| SSH handshake secret | Podman `secret_env`, created via secrets API and referenced by name | No | Yes, supervisor only, scrubbed from children |
+| Sandbox identity | Plaintext env var | Yes | Yes |
+| gRPC endpoint | Plaintext env var, override-protected | Yes | Yes |
+| Supervisor relay socket path | Plaintext env var, override-protected | Yes | Yes |
+
+The `build_env()` function inserts user-supplied variables first, then
+unconditionally overwrites all security-critical variables to prevent spoofing
+via sandbox templates:
+
+- `OPENSHELL_SANDBOX`
+- `OPENSHELL_SANDBOX_ID`
+- `OPENSHELL_ENDPOINT`
+- `OPENSHELL_SSH_SOCKET_PATH`
+- `OPENSHELL_SSH_HANDSHAKE_SKEW_SECS`
+- `OPENSHELL_CONTAINER_IMAGE`
+- `OPENSHELL_SANDBOX_COMMAND`
+
+The `Debug` implementation for `PodmanComputeConfig` redacts the handshake
+secret as `[REDACTED]`.
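+
+A sketch of that override ordering, with an invented helper name and
+placeholder values:
+
+```rust
+use std::collections::HashMap;
+
+// Template variables go in first; driver-controlled keys are written last so
+// they always win, matching the list above.
+fn build_env_sketch(
+    template_env: &HashMap<String, String>,
+    sandbox_id: &str,
+    endpoint: &str,
+) -> HashMap<String, String> {
+    let mut env = template_env.clone();
+    env.insert("OPENSHELL_SANDBOX".into(), "1".into());
+    env.insert("OPENSHELL_SANDBOX_ID".into(), sandbox_id.into());
+    env.insert("OPENSHELL_ENDPOINT".into(), endpoint.into());
+    env.insert("OPENSHELL_SSH_SOCKET_PATH".into(), "/run/openshell/ssh.sock".into());
+    // ...plus the remaining driver-controlled keys listed above.
+    env
+}
+```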
+ +## Sandbox Lifecycle + +### Creation Flow + +```mermaid +sequenceDiagram + participant GW as Gateway + participant D as PodmanComputeDriver + participant P as Podman API + + GW->>D: create_sandbox(DriverSandbox) + D->>D: validate name + id + D->>D: validated_container_name() + + D->>P: pull_image(supervisor, "missing") + D->>P: pull_image(sandbox_image, policy) + + D->>P: create_secret(handshake) + Note over D: On failure below, rollback secret + + D->>P: create_volume(workspace) + Note over D: On failure below, rollback volume + secret + + D->>P: create_container(spec) + alt Conflict (409) + D->>P: remove_volume + remove_secret + D-->>GW: AlreadyExists + end + Note over D: On failure below, rollback container + volume + secret + + D->>P: start_container + D-->>GW: Ok +``` + +Each step rolls back previously-created resources on failure. The Conflict path +cleans up the volume and secret because they are keyed by the new sandbox's ID, +not the conflicting container's ID. + +### Readiness and Health + +The container `healthconfig` marks the sandbox healthy when any of these +signals succeeds: + +- Legacy marker file `/var/run/openshell-ssh-ready`. +- `test -S` on the configured supervisor Unix socket path. +- The prior TCP check for a listener on the in-container SSH port. + +The Unix socket check allows relay-only readiness when the supervisor exposes +the socket without the old marker or published-port signal. + +### Deletion Flow + +1. Validate `sandbox_name` and stable `sandbox_id` from `DeleteSandboxRequest`. +2. Best-effort inspect cross-checks the container label when present, but + cleanup remains keyed by the request `sandbox_id`. +3. Best-effort stop, ignoring the stop result. +4. Force-remove the container. +5. Remove workspace volume derived from the request `sandbox_id`, warning on + failure and continuing. +6. Remove handshake secret derived from the request `sandbox_id`, warning on + failure and continuing. + +If the container is already gone during inspect or remove, the driver still +performs idempotent volume and secret cleanup using the request `sandbox_id` and +returns `Ok(false)` for the container-delete result. This prevents leaked +Podman resources after out-of-band container removal or label drift. + +## Configuration -`host.containers.internal` is used for callbacks to the host gateway. Rootless -networking may use pasta under the hood; avoid assumptions that require -container-to-container L2 reachability. +| Environment Variable | CLI Flag | Default | Description | +|---|---|---|---| +| `OPENSHELL_PODMAN_SOCKET` | `--podman-socket` | `$XDG_RUNTIME_DIR/podman/podman.sock` on Linux, `$HOME/.local/share/containers/podman/machine/podman.sock` on macOS | Podman API Unix socket path. | +| `OPENSHELL_SANDBOX_IMAGE` | `--sandbox-image` | From gateway config | Default OCI image for sandboxes. | +| `OPENSHELL_SANDBOX_IMAGE_PULL_POLICY` | `--sandbox-image-pull-policy` | `missing` | Pull policy: `always`, `missing`, `never`, or `newer`. | +| `OPENSHELL_GRPC_ENDPOINT` | `--grpc-endpoint` | Auto-detected via `host.containers.internal` | Gateway gRPC endpoint for sandbox callbacks. | +| `OPENSHELL_GATEWAY_PORT` | `--gateway-port` | `8080` | Gateway port used for endpoint auto-detection by the standalone binary. | +| `OPENSHELL_NETWORK_NAME` | `--network-name` | `openshell` | Podman bridge network name. | +| `OPENSHELL_SANDBOX_SSH_PORT` | `--sandbox-ssh-port` | `2222` | SSH compatibility port inside the container. 
+| `OPENSHELL_SSH_HANDSHAKE_SECRET` | `--ssh-handshake-secret` | Required standalone, gateway-generated in-process | Shared secret for the NSSH1 handshake. |
+| `OPENSHELL_SSH_HANDSHAKE_SKEW_SECS` | `--ssh-handshake-skew-secs` | `300` | Allowed timestamp skew for SSH handshake validation. |
+| `OPENSHELL_SANDBOX_SSH_SOCKET_PATH` | `--sandbox-ssh-socket-path` | `/run/openshell/ssh.sock` | Standalone driver only: supervisor Unix socket path in `PodmanComputeConfig`. In-gateway Podman uses server `config.sandbox_ssh_socket_path`. |
+| `OPENSHELL_STOP_TIMEOUT` | `--stop-timeout` | `10` | Container stop timeout in seconds. |
+| `OPENSHELL_SUPERVISOR_IMAGE` | `--supervisor-image` | `openshell/supervisor:latest` through the gateway, required standalone | OCI image containing the supervisor binary. |
+| `OPENSHELL_PODMAN_TLS_CA` | `--podman-tls-ca` | unset | Host path to the CA certificate mounted for sandbox mTLS. |
+| `OPENSHELL_PODMAN_TLS_CERT` | `--podman-tls-cert` | unset | Host path to the client certificate mounted for sandbox mTLS. |
+| `OPENSHELL_PODMAN_TLS_KEY` | `--podman-tls-key` | unset | Host path to the client private key mounted for sandbox mTLS. |
-## Secrets and Environment
+## Rootless-Specific Adaptations
-The SSH handshake secret is injected with Podman's `secret_env` API rather than
-as a plain inspectable environment value. Sandbox identity, callback endpoint,
-relay socket path, and command metadata are driver-controlled environment
-variables and must override template values.
+The Podman driver is designed for rootless operation. The following adaptations
+matter compared to cluster or rootful runtimes:
-When TLS is configured, the driver mounts the client bundle read-only and sets
-the standard `OPENSHELL_TLS_*` environment variables for the supervisor.
+1. subuid/subgid preflight check: `check_subuid_range()` in `driver.rs` warns
+   operators if `/etc/subuid` or `/etc/subgid` entries are missing for the
+   current user. This is not a hard error because some systems use LDAP or
+   other mechanisms.
+2. cgroups v2 requirement: the driver refuses to start if cgroups v1 is
+   detected. Rootless Podman requires the unified cgroup hierarchy.
+3. `nsenter` for namespace operations: `openshell-sandbox` uses
+   `nsenter --net=<netns-path>` instead of `ip netns exec` to avoid the sysfs
+   remount path that requires real `CAP_SYS_ADMIN` in the host user namespace.
+4. `DAC_READ_SEARCH` capability: required for the proxy to read
+   `/proc/<pid>/fd/` across UIDs within the user namespace.
+5. `SETUID` and `SETGID` capabilities: kept from Podman's default capability
+   set so `drop_privileges()` can call `setuid()` and `setgid()`.
+6. `host.containers.internal`: used instead of Docker's `host.docker.internal`
+   for container-to-host communication. The driver also injects the
+   OpenShell-owned `host.openshell.internal` alias.
+7. Ephemeral port publishing: the SSH compatibility port uses `host_port: 0`
+   because the bridge network IP is not reliably routable from the host in
+   rootless mode.
+8. tmpfs at `/run/netns`: a private tmpfs lets the supervisor create named
+   network namespaces via `ip netns add`.
-## GPU Support
+## Implementation References
-GPU sandboxes use CDI device injection when `spec.gpu` is true and NVIDIA CDI
-devices are available. The sandbox image must still include the user-space
-libraries required by the workload.
+- Gateway integration: `crates/openshell-server/src/compute/mod.rs`
+  (`new_podman` and `PodmanComputeDriver` wiring).
+- Server configuration: `crates/openshell-server/src/lib.rs` + (`ComputeDriverKind::Podman` builds `PodmanComputeConfig` including + `sandbox_ssh_socket_path` from gateway `Config`). +- Gateway relay path: `openshell-core` `Config::sandbox_ssh_socket_path` in + `crates/openshell-core/src/config.rs`. +- SSRF mitigation: `crates/openshell-core/src/net.rs`, + `crates/openshell-sandbox/src/proxy.rs`, and + `crates/openshell-server/src/grpc/policy.rs`. +- Sandbox supervisor: `crates/openshell-sandbox/src/` for Landlock, seccomp, + netns, proxy, and relay behavior shared by all drivers. +- Container engine abstraction: `tasks/scripts/container-engine.sh` for + build/deploy support across Docker and Podman. +- Supervisor image build: `deploy/docker/Dockerfile.images`. diff --git a/docs/reference/sandbox-compute-drivers.mdx b/docs/reference/sandbox-compute-drivers.mdx index 9f07c1e40..cc78b3b80 100644 --- a/docs/reference/sandbox-compute-drivers.mdx +++ b/docs/reference/sandbox-compute-drivers.mdx @@ -38,6 +38,8 @@ Common gateway options: The gateway talks to the Docker daemon to create sandbox containers. Docker is also required for local image builds from directories or Dockerfiles. +For maintainer-level implementation details, refer to the [Docker driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-docker/README.md). + | Option | Environment variable | Description | |---|---|---| | `--drivers docker` | `OPENSHELL_DRIVERS=docker` | Select the Docker compute driver. | @@ -54,6 +56,8 @@ For GPU-backed Docker sandboxes, configure Docker CDI before starting the gatewa The gateway talks to the Podman API socket. The Podman driver requires Podman 5.x, cgroups v2, rootless networking, and an active Podman user socket. +For maintainer-level implementation details, refer to the [Podman driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-podman/README.md) and [Podman networking notes](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-podman/NETWORKING.md). + | Option | Environment variable | Description | |---|---|---| | `--drivers podman` | `OPENSHELL_DRIVERS=podman` | Select the Podman compute driver. | @@ -69,6 +73,8 @@ MicroVM-backed sandboxes run inside VM-backed isolation instead of a container b The gateway uses the VM compute driver to create VM-backed sandboxes. MicroVM requires host virtualization support. It uses [libkrun](https://github.com/containers/libkrun) with Apple's [Hypervisor framework](https://developer.apple.com/documentation/hypervisor) on macOS, KVM on Linux, and [QEMU](https://www.qemu.org/) for GPU-backed sandboxes on Linux. +For maintainer-level implementation details, refer to the [VM driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-vm/README.md). + | Option | Environment variable | Description | |---|---|---| | `--drivers vm` | `OPENSHELL_DRIVERS=vm` | Select the VM compute driver. VM is never auto-detected. | @@ -85,6 +91,8 @@ Kubernetes-backed sandboxes run as pods in the configured sandbox namespace. Use Helm deployments set Kubernetes driver values through the chart. +For maintainer-level implementation details, refer to the [Kubernetes driver README](https://github.com/NVIDIA/OpenShell/blob/main/crates/openshell-driver-kubernetes/README.md). + | Gateway option | Environment variable | Helm value | Description | |---|---|---|---| | `--drivers kubernetes` | `OPENSHELL_DRIVERS=kubernetes` | Not applicable | Select the Kubernetes compute driver. |