Skip to content

Propagate CA bundle to sandbox trust store on init#2325

Open
sitole wants to merge 10 commits intomainfrom
feat/envd-ca-certificates
Open

Propagate CA bundle to sandbox trust store on init#2325
sitole wants to merge 10 commits intomainfrom
feat/envd-ca-certificates

Conversation

@sitole
Copy link
Copy Markdown
Member

@sitole sitole commented Apr 8, 2026

Problem

Sandboxes have no mechanism to receive a custom CA certificate and
install it into the system trust store at startup. This is required
for scenarios where outbound traffic needs to trust a certificate that
differs per sandbox.

Solution

The orchestrator sends a CA bundle (combined PEM string) to envd via
the existing /init call on sandbox create and resume. The orchestrator
plumbing is a no-op in the open-source version; the real certificate
injection is handled in the EE version.

On the envd side, CACertInstaller appends the received cert to
/etc/ssl/certs/ca-certificates.crt on the critical path, then
removes the previous cert in a background goroutine. Resume with the
same cert is a no-op (in-memory comparison). A state file at
/var/run/e2b/ca-cert.pem ensures the previous cert is correctly
removed even after a process restart.

At startup, BindMountCABundle copies the system bundle to tmpfs and
bind-mounts it back over the original path. All processes continue
using the standard path transparently; writes bypass the NBD-backed
filesystem entirely. CA cert injection is now essentially instant.

Testing

Unit tests in packages/envd/internal/host/cacerts_test.go cover
first install, resume hot path, cert rotation, concurrent resume,
and restart scenarios (same cert and different cert after OOM).

Measured append_duration with the bind-mount in place: ~0.04 ms
(down from ~1.8 ms warm NBD / ~460 ms cold GCS).

sitole added 3 commits April 7, 2026 14:19
EgressProxy gains a CACertificates() method so the orchestrator can
retrieve the per-proxy CA bundle and forward it to the sandbox on
startup and resume. NoopEgressProxy and tcpfirewall.Proxy return nil.

Factory now takes an EgressProxy and stores it; CreateSandbox and
ResumeSandbox copy CACertificates() into the Sandbox struct. The init
body sent to envd gains a CaCert field (combined PEM string built by
combineCACertificates) so envd can install the cert on each /init call.
Each sandbox may receive a CA certificate via the /init caBundle field.
CACertInstaller (packages/envd/internal/host) appends the certificate
to /etc/ssl/certs/ca-certificates.crt on the critical path (~2 ms),
then removes the previous cert in a background goroutine to keep the
bundle clean across rotations.

Two layers of state track what is currently installed:
- lastCACert (memory): makes resume a zero-I/O hit when the same cert
  arrives again after pause/resume.
- /var/run/e2b/ca-cert.pem (disk): survives OOM kills and restarts so
  the background goroutine can identify and remove the previous cert
  even when lastCACert is empty after a process restart.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 8, 2026

PR Summary

Medium Risk
Touches low-level VM startup behavior (bind mounts and rewriting the system CA bundle), so mistakes could break TLS verification or init-time stability despite the added tests and guarded/no-op paths.

Overview
This PR propagates a PEM CA bundle from the orchestrator to envd via the existing /init request (on create and resume) and installs it into the sandbox’s system trust store. On the envd side it introduces a CACertInstaller that appends the new bundle on the critical path, cleans up the previously installed cert asynchronously with a persisted state file for restart safety, and bind-mounts the system CA bundle onto a tmpfs-backed copy at startup to avoid slow writes; the orchestrator plumbing is updated to source the bundle from the EgressProxy interface and forward it through to /init, with accompanying unit tests.

Reviewed by Cursor Bugbot for commit 6f27f7c. Bugbot is set up for automated code reviews on this repo. Configure here.

@sitole sitole marked this pull request as draft April 8, 2026 10:21
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3ebbe78bc1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test review body - please ignore

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 tests/integration/internal/envd/generated.go:190-196 — The integration test generated client sends the wrong JSON field for CA certificates: it defines CaCertificates *[]CACertificate (json:"caCertificates") — an array of {cert, name} objects — while the actual envd API added CaBundle *string (json:"caBundle"). Any integration test that populates CaCertificates will have those certs silently ignored by envd, which reads caBundle and finds it absent. The fix is to regenerate tests/integration/internal/envd/generated.go from the current packages/envd/spec/envd.yaml.

    Extended reasoning...

    What the bug is and how it manifests

    This PR adds CA certificate propagation to the envd /init endpoint. The authoritative API schema (packages/envd/spec/envd.yaml) and the generated server code (packages/envd/internal/api/api.gen.go) correctly define a single string field: CaBundle *string with JSON tag "caBundle,omitempty". The orchestrator's own generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go) also correctly reflects CaBundle string with JSON tag "caBundle,omitempty".

    However, the integration test generated client (tests/integration/internal/envd/generated.go) defines a completely different schema: a new struct CACertificate with Cert string and Name string fields, and attaches them as CaCertificates *[]CACertificate (JSON tag "caCertificates,omitempty") to PostInitJSONBody. This is a stale artifact from an earlier design iteration where certificates were individual named objects rather than a single concatenated PEM bundle string.

    The specific code path that triggers it

    An integration test that wants to install a CA cert would set body.CaCertificates = &[]CACertificate{{Cert: pemData, Name: "my-ca"}} and call PostInit. The JSON marshalled over the wire would contain "caCertificates":[...] — a field envd never reads. The field envd actually reads ("caBundle") would be absent/null, so envd's if data.CaBundle != nil && *data.CaBundle != "" guard in SetData() would skip cert installation entirely.

    Why existing code does not prevent it

    No current integration tests exercise the CA cert path (confirmed by grepping test files — neither CaCertificates nor CaBundle is set in any test body). The Go type system cannot catch it because both structs compile independently; the discrepancy only surfaces at HTTP serialisation time when the wrong field name hits the wire.

    Impact

    Any future integration test for the CA cert feature written against this client would silently pass (envd returns 204 regardless) while the cert is never actually installed. This makes the CA cert feature effectively untestable via the integration client, directly undermining the purpose of this PR.

    How to fix it

    Re-run oapi-codegen against packages/envd/spec/envd.yaml targeting tests/integration/internal/envd/generated.go. This will remove the CACertificate struct and replace CaCertificates *[]CACertificate with CaBundle *string (json:"caBundle,omitempty"), matching the actual API.

    Step-by-step proof

    1. PR adds caBundle: {type: string} to packages/envd/spec/envd.yaml.
    2. packages/envd/internal/api/api.gen.go (regenerated correctly): CaBundle *string \json:"caBundle,omitempty"``
    3. envd SetData() handler reads: if data.CaBundle != nil && *data.CaBundle != ""
    4. tests/integration/internal/envd/generated.go (NOT regenerated): CaCertificates *[]CACertificate \json:"caCertificates,omitempty"``
    5. An integration test sets CaCertificates -> JSON on wire: {"caCertificates":[{"cert":"...","name":"..."}]}
    6. envd deserialises the body: CaBundle is nil (field absent); caCertificates is unknown and silently discarded by Go's JSON unmarshaller.
    7. data.CaBundle != nil is false -> cert installation skipped -> test passes but cert was never installed.

sitole added 2 commits April 8, 2026 13:59
c.lastCACert was read outside mu before being compared against the
incoming cert, while the write happens under mu. This is a data race
per Go's memory model; the race detector flags it on concurrent calls.
Move the comparison inside the lock.

When two rotations happen in quick succession (install A then install B),
two background goroutines are queued. If the goroutine for A runs after
the goroutine for B, it reads the state file — which B has already
updated to reflect the new cert — and interprets B's cert as the
previous one to remove, corrupting the bundle. Fix by bailing out of
the goroutine early when lastCACert no longer matches the cert it was
spawned for, meaning a newer install has taken over.

Also regenerate tests/integration/internal/envd/generated.go to align
the integration client with the current envd spec (caBundle string
instead of the stale caCertificates array).
The CA bundle (/etc/ssl/certs/ca-certificates.crt) lives on the
NBD-backed root filesystem. Measured syscall breakdown on the append:

  open:  1.83 ms warm (overlayfs copy-up + NBD round-trip)
         394 ms cold (blocks not yet in orchestrator's local GCS cache)
  write: 0.52 ms (one NBD round-trip regardless of data size)

At startup, BindMountCABundle() copies the bundle to /run/e2b/ (tmpfs)
and bind-mounts it back over the original path. All processes continue
using /etc/ssl/certs/ca-certificates.crt unchanged; the underlying
file is now in RAM. Measured result after the change: 0.037 ms total
(both cold and warm, no GCS or NBD involved).

Also make removeCertFromBundle write atomically via a temp file and
rename so the bundle is never empty from a concurrent reader's
perspective during the rewrite.
os.CreateTemp in the same directory as the bind-mounted bundle path
creates the temp file on the NBD-backed parent filesystem, while the
rename target lives on the tmpfs backing the bind mount. os.Rename
across different devices fails with EXDEV.

Pass filepath.Dir(statePath) as the temp directory instead. The state
file is in E2BRunDir (the same tmpfs as the bind mount source), so the
temp file and the rename target are always on the same device.

Also fix the permissions: os.CreateTemp creates with 0600. After the
atomic rename replaces the bundle, non-root processes could no longer
read it. Call Chmod(0o644) on the temp file before writing.
On restart the bind mount already exists, so CaBundlePath and
caBundleTmpfsPath are the same inode. The previous code opened
CaBundlePath for reading, then opened caBundleTmpfsPath with O_TRUNC —
truncating the same file — then tried to copy from the now-empty source.
The EBUSY guard came after the damage, leaving an empty CA bundle and
breaking all TLS connections in the sandbox after a restart.

Replace os.Open + io.Copy with os.ReadFile to load the full content into
memory before any write. os.WriteFile then truncates and rewrites the
same file, but the content is already safe in the buffer.
@sitole sitole marked this pull request as ready for review April 8, 2026 13:31
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1dd8a54d61

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6f27f7c. Configure here.

return fmt.Errorf("close temp file: %w", err)
}

if err := os.Rename(tmpPath, bundlePath); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename across mount points always fails with EXDEV

High Severity

The os.Rename in removeCertFromBundle fails with EXDEV because the temporary file is created on a tmpfs mount (/run/e2b), while the target file's parent directory (/etc/ssl/certs) is on a different filesystem. This prevents atomic replacement of the CA bundle, causing old certificates to accumulate.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6f27f7c. Configure here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is valid. I was able to reproduce it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not caught with current test suite because tests operate on the same partition.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But integration tests for cert injection and cert rotation would definitely make sense.

@sitole sitole requested a review from arkamar April 9, 2026 09:01
//
// Must be called once at startup, before any /init handler runs. No-op if the
// bind mount is already in place (safe to call after a process restart).
func BindMountCABundle() error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why is the bind-mount used here, but it could be problematic in user scenarios, where they will try to manage certs with common tools, e.g. update-ca-certificates. Those tools use mv inside, which will fail with following error:

Updating certificates in /etc/ssl/certs...
mv: cannot move '/etc/ssl/certs/ca-certificates.crt.new' to 'ca-certificates.crt': Device or resource busy

Are you sure we want to apply this potentially breaking change?

It also affects apt-get install --reinstall ca-certificates which internally uses update-ca-certificates.

Updating certificates in /etc/ssl/certs...
mv: cannot move '/etc/ssl/certs/ca-certificates.crt.new' to 'ca-certificates.crt': Device or resource busy
dpkg: error processing package ca-certificates (--configure):
 installed ca-certificates package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 ca-certificates
E: Sub-process /usr/bin/dpkg returned an error code (1)

I am bit worried here.

Copy link
Copy Markdown
Contributor

@arkamar arkamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a brief discussion I believe we should not bind-mount /etc/ssl/certs/ca-certificates.cert file only but copy of whole /etc/ssl/certs directory instead. This should make original workflow safe and fast, plus common tools should work as well.

Cert injection and rotation should be covered with integration tests, they would catch problem with rename() on different partitions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants