Propagate CA bundle to sandbox trust store on init#2325
Propagate CA bundle to sandbox trust store on init#2325
Conversation
EgressProxy gains a CACertificates() method so the orchestrator can retrieve the per-proxy CA bundle and forward it to the sandbox on startup and resume. NoopEgressProxy and tcpfirewall.Proxy return nil. Factory now takes an EgressProxy and stores it; CreateSandbox and ResumeSandbox copy CACertificates() into the Sandbox struct. The init body sent to envd gains a CaCert field (combined PEM string built by combineCACertificates) so envd can install the cert on each /init call.
Each sandbox may receive a CA certificate via the /init caBundle field. CACertInstaller (packages/envd/internal/host) appends the certificate to /etc/ssl/certs/ca-certificates.crt on the critical path (~2 ms), then removes the previous cert in a background goroutine to keep the bundle clean across rotations. Two layers of state track what is currently installed: - lastCACert (memory): makes resume a zero-I/O hit when the same cert arrives again after pause/resume. - /var/run/e2b/ca-cert.pem (disk): survives OOM kills and restarts so the background goroutine can identify and remove the previous cert even when lastCACert is empty after a process restart.
PR SummaryMedium Risk Overview Reviewed by Cursor Bugbot for commit 6f27f7c. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3ebbe78bc1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
tests/integration/internal/envd/generated.go:190-196— The integration test generated client sends the wrong JSON field for CA certificates: it definesCaCertificates *[]CACertificate(json:"caCertificates") — an array of {cert, name} objects — while the actual envd API addedCaBundle *string(json:"caBundle"). Any integration test that populatesCaCertificateswill have those certs silently ignored by envd, which readscaBundleand finds it absent. The fix is to regeneratetests/integration/internal/envd/generated.gofrom the currentpackages/envd/spec/envd.yaml.Extended reasoning...
What the bug is and how it manifests
This PR adds CA certificate propagation to the envd
/initendpoint. The authoritative API schema (packages/envd/spec/envd.yaml) and the generated server code (packages/envd/internal/api/api.gen.go) correctly define a single string field:CaBundle *stringwith JSON tag"caBundle,omitempty". The orchestrator's own generated client (packages/orchestrator/pkg/sandbox/envd/envd.gen.go) also correctly reflectsCaBundle stringwith JSON tag"caBundle,omitempty".However, the integration test generated client (
tests/integration/internal/envd/generated.go) defines a completely different schema: a new structCACertificatewithCert stringandName stringfields, and attaches them asCaCertificates *[]CACertificate(JSON tag"caCertificates,omitempty") toPostInitJSONBody. This is a stale artifact from an earlier design iteration where certificates were individual named objects rather than a single concatenated PEM bundle string.The specific code path that triggers it
An integration test that wants to install a CA cert would set
body.CaCertificates = &[]CACertificate{{Cert: pemData, Name: "my-ca"}}and callPostInit. The JSON marshalled over the wire would contain"caCertificates":[...]— a field envd never reads. The field envd actually reads ("caBundle") would be absent/null, so envd'sif data.CaBundle != nil && *data.CaBundle != ""guard inSetData()would skip cert installation entirely.Why existing code does not prevent it
No current integration tests exercise the CA cert path (confirmed by grepping test files — neither
CaCertificatesnorCaBundleis set in any test body). The Go type system cannot catch it because both structs compile independently; the discrepancy only surfaces at HTTP serialisation time when the wrong field name hits the wire.Impact
Any future integration test for the CA cert feature written against this client would silently pass (envd returns 204 regardless) while the cert is never actually installed. This makes the CA cert feature effectively untestable via the integration client, directly undermining the purpose of this PR.
How to fix it
Re-run
oapi-codegenagainstpackages/envd/spec/envd.yamltargetingtests/integration/internal/envd/generated.go. This will remove theCACertificatestruct and replaceCaCertificates *[]CACertificatewithCaBundle *string(json:"caBundle,omitempty"), matching the actual API.Step-by-step proof
- PR adds
caBundle: {type: string}topackages/envd/spec/envd.yaml. packages/envd/internal/api/api.gen.go(regenerated correctly):CaBundle *string \json:"caBundle,omitempty"``- envd
SetData()handler reads:if data.CaBundle != nil && *data.CaBundle != "" tests/integration/internal/envd/generated.go(NOT regenerated):CaCertificates *[]CACertificate \json:"caCertificates,omitempty"``- An integration test sets
CaCertificates-> JSON on wire:{"caCertificates":[{"cert":"...","name":"..."}]} - envd deserialises the body:
CaBundleis nil (field absent);caCertificatesis unknown and silently discarded by Go's JSON unmarshaller. data.CaBundle != nilis false -> cert installation skipped -> test passes but cert was never installed.
- PR adds
c.lastCACert was read outside mu before being compared against the incoming cert, while the write happens under mu. This is a data race per Go's memory model; the race detector flags it on concurrent calls. Move the comparison inside the lock. When two rotations happen in quick succession (install A then install B), two background goroutines are queued. If the goroutine for A runs after the goroutine for B, it reads the state file — which B has already updated to reflect the new cert — and interprets B's cert as the previous one to remove, corrupting the bundle. Fix by bailing out of the goroutine early when lastCACert no longer matches the cert it was spawned for, meaning a newer install has taken over. Also regenerate tests/integration/internal/envd/generated.go to align the integration client with the current envd spec (caBundle string instead of the stale caCertificates array).
The CA bundle (/etc/ssl/certs/ca-certificates.crt) lives on the
NBD-backed root filesystem. Measured syscall breakdown on the append:
open: 1.83 ms warm (overlayfs copy-up + NBD round-trip)
394 ms cold (blocks not yet in orchestrator's local GCS cache)
write: 0.52 ms (one NBD round-trip regardless of data size)
At startup, BindMountCABundle() copies the bundle to /run/e2b/ (tmpfs)
and bind-mounts it back over the original path. All processes continue
using /etc/ssl/certs/ca-certificates.crt unchanged; the underlying
file is now in RAM. Measured result after the change: 0.037 ms total
(both cold and warm, no GCS or NBD involved).
Also make removeCertFromBundle write atomically via a temp file and
rename so the bundle is never empty from a concurrent reader's
perspective during the rewrite.
os.CreateTemp in the same directory as the bind-mounted bundle path creates the temp file on the NBD-backed parent filesystem, while the rename target lives on the tmpfs backing the bind mount. os.Rename across different devices fails with EXDEV. Pass filepath.Dir(statePath) as the temp directory instead. The state file is in E2BRunDir (the same tmpfs as the bind mount source), so the temp file and the rename target are always on the same device. Also fix the permissions: os.CreateTemp creates with 0600. After the atomic rename replaces the bundle, non-root processes could no longer read it. Call Chmod(0o644) on the temp file before writing.
On restart the bind mount already exists, so CaBundlePath and caBundleTmpfsPath are the same inode. The previous code opened CaBundlePath for reading, then opened caBundleTmpfsPath with O_TRUNC — truncating the same file — then tried to copy from the now-empty source. The EBUSY guard came after the damage, leaving an empty CA bundle and breaking all TLS connections in the sandbox after a restart. Replace os.Open + io.Copy with os.ReadFile to load the full content into memory before any write. os.WriteFile then truncates and rewrites the same file, but the content is already safe in the buffer.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dd8a54d61
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 6f27f7c. Configure here.
| return fmt.Errorf("close temp file: %w", err) | ||
| } | ||
|
|
||
| if err := os.Rename(tmpPath, bundlePath); err != nil { |
There was a problem hiding this comment.
Rename across mount points always fails with EXDEV
High Severity
The os.Rename in removeCertFromBundle fails with EXDEV because the temporary file is created on a tmpfs mount (/run/e2b), while the target file's parent directory (/etc/ssl/certs) is on a different filesystem. This prevents atomic replacement of the CA bundle, causing old certificates to accumulate.
Reviewed by Cursor Bugbot for commit 6f27f7c. Configure here.
There was a problem hiding this comment.
This one is valid. I was able to reproduce it.
There was a problem hiding this comment.
It is not caught with current test suite because tests operate on the same partition.
There was a problem hiding this comment.
But integration tests for cert injection and cert rotation would definitely make sense.
| // | ||
| // Must be called once at startup, before any /init handler runs. No-op if the | ||
| // bind mount is already in place (safe to call after a process restart). | ||
| func BindMountCABundle() error { |
There was a problem hiding this comment.
I understand why is the bind-mount used here, but it could be problematic in user scenarios, where they will try to manage certs with common tools, e.g. update-ca-certificates. Those tools use mv inside, which will fail with following error:
Updating certificates in /etc/ssl/certs...
mv: cannot move '/etc/ssl/certs/ca-certificates.crt.new' to 'ca-certificates.crt': Device or resource busy
Are you sure we want to apply this potentially breaking change?
It also affects apt-get install --reinstall ca-certificates which internally uses update-ca-certificates.
Updating certificates in /etc/ssl/certs...
mv: cannot move '/etc/ssl/certs/ca-certificates.crt.new' to 'ca-certificates.crt': Device or resource busy
dpkg: error processing package ca-certificates (--configure):
installed ca-certificates package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
ca-certificates
E: Sub-process /usr/bin/dpkg returned an error code (1)
I am bit worried here.
arkamar
left a comment
There was a problem hiding this comment.
After a brief discussion I believe we should not bind-mount /etc/ssl/certs/ca-certificates.cert file only but copy of whole /etc/ssl/certs directory instead. This should make original workflow safe and fast, plus common tools should work as well.
Cert injection and rotation should be covered with integration tests, they would catch problem with rename() on different partitions.


Problem
Sandboxes have no mechanism to receive a custom CA certificate and
install it into the system trust store at startup. This is required
for scenarios where outbound traffic needs to trust a certificate that
differs per sandbox.
Solution
The orchestrator sends a CA bundle (combined PEM string) to envd via
the existing
/initcall on sandbox create and resume. The orchestratorplumbing is a no-op in the open-source version; the real certificate
injection is handled in the EE version.
On the envd side,
CACertInstallerappends the received cert to/etc/ssl/certs/ca-certificates.crton the critical path, thenremoves the previous cert in a background goroutine. Resume with the
same cert is a no-op (in-memory comparison). A state file at
/var/run/e2b/ca-cert.pemensures the previous cert is correctlyremoved even after a process restart.
At startup,
BindMountCABundlecopies the system bundle to tmpfs andbind-mounts it back over the original path. All processes continue
using the standard path transparently; writes bypass the NBD-backed
filesystem entirely. CA cert injection is now essentially instant.
Testing
Unit tests in
packages/envd/internal/host/cacerts_test.gocoverfirst install, resume hot path, cert rotation, concurrent resume,
and restart scenarios (same cert and different cert after OOM).
Measured
append_durationwith the bind-mount in place: ~0.04 ms(down from ~1.8 ms warm NBD / ~460 ms cold GCS).