feat(ingestion): verify events captured with the secret API key#62888
feat(ingestion): verify events captured with the secret API key#62888Gilbert09 wants to merge 10 commits into
Conversation
Capture enforces quota limits by the raw ingestion token, so a team sending events with its phs_ secret API token would bypass billing limits keyed only on the public api_token. Write every ingestion token (api_token, secret_api_token, secret_api_token_backup) to the quota-limit and quota-limiting-suspended zsets; change detection stays keyed on the always-present public token.
Events captured with a team's secret API token (phs_, primary or backup) now resolve to the team during ingestion instead of being dropped as invalid_token, and get a server-set $verified: true property. $verified is server-controlled: any client-supplied value on public-token events is stripped, so it cannot be forged. Applied in the shared resolve-team step so both the analytics and error-tracking pipelines (and the heatmap subpipeline) are covered. Also registers $verified in the taxonomy so it renders in the UI.
… token An SDK configured with the phs_ secret API token as its api_key (for verified event capture) would otherwise lose feature flags: the team metadata HyperCache and PG lookup are keyed by the public token only, so phs_ tokens landed in the negative cache and 401'd. Branch on the phs_ prefix and reuse the cached secret-token validation already used by /flag_definitions, resolving the team via its public token (or by id for legacy cache entries).
👀 Auto-assigned reviewersThese soft owners were skipped because they only have minor changes here. Nothing blocks merge, so self-assign if you'd like a look:
Soft owners come from |
|
Size Change: 0 B Total Size: 72.9 MB ℹ️ View Unchanged
|
|
Reviews (1): Last reviewed commit: "feat(flags): authenticate /flags request..." | Re-trigger Greptile |
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 1 · PR risk: 0/10 |
/flags now accepts a phs_ secret token as api_key, so the decoded request body can carry one. The body logger recorded the decoded body verbatim for opted-in teams, which would let a log reader recover the secret. Redact phs_-prefixed tokens before logging; public phc_ tokens are world-readable by design and left intact. Generated-By: PostHog Code Task-Id: c79724ba-e6b2-4100-ab76-9e05351b46bf
Generated-By: PostHog Code Task-Id: c79724ba-e6b2-4100-ab76-9e05351b46bf
Generated-By: PostHog Code Task-Id: c79724ba-e6b2-4100-ab76-9e05351b46bf
…date The MaxMind GeoLite2 DB now returns postal code 44192 (was 44199) for the test IP, breaking the inline snapshots in the geoip transformation tests. Refresh the expected values; same maintenance pattern as the previous GeoIP snapshot update. Unrelated to the secret-token feature, but the tests run in this PR's Node.js shard. Generated-By: PostHog Code Task-Id: c79724ba-e6b2-4100-ab76-9e05351b46bf
Generated-By: PostHog Code Task-Id: c79724ba-e6b2-4100-ab76-9e05351b46bf
The scheduled quota refresh already writes every ingestion token (public + secret) to Redis, but the manual limit/unlimit and token-change paths did not, so they could drift from the quota contract this PR introduces. - Route `Organization.limit_product_until_end_of_billing_cycle` / `unlimit_product` through `get_team_ingestion_tokens` so secret tokens are limited/unlimited alongside the public token. - Add `sync_team_quota_limited_tokens`, a shared helper that re-points a team's active quota limits onto its current ingestion tokens, and call it from the public token reset and the secret token rotation/backup-deletion paths. This closes the gap where a newly rotated token could ingest freely until the next scheduled refresh. Generated-By: PostHog Code Task-Id: a7bc377f-4621-42b8-846e-fdbec703e2b7
Generated-By: PostHog Code Task-Id: a7bc377f-4621-42b8-846e-fdbec703e2b7
|
Sorry, just noticed this. Will take a look from the feature flags side. |
haacked
left a comment
There was a problem hiding this comment.
Looks good from the feature flags side. Just some non-blocking suggestions.
| if (row.secret_api_token) { | ||
| resultRecord[row.secret_api_token] = team | ||
| } | ||
| if (row.secret_api_token_backup) { |
There was a problem hiding this comment.
question: This resolves teams by the backup secret token for ingestion, and Rust /flags resolves the backup too, but Django's get_team_from_cache_or_secret_api_token still matches only secret_api_token, not the backup. After rotate_secret_token_and_save promotes the old primary into the backup slot, that token keeps authenticating ingestion and /flags but stops working anywhere the Django resolver feeds (local flag eval, conversations API).
Is the backup deliberately unsupported on the Django path, the way session replay deliberately rejects phs_? If so, a one-line comment on that method would stop the next person filing it as a bug. If not, it's a follow-up.
| if (sentWithSecretToken) { | ||
| event.properties = { ...(event.properties ?? {}), $verified: true } | ||
| verifiedPropertyCounter.labels({ action: 'verified' }).inc() | ||
| } else if (event.properties && '$verified' in event.properties) { |
There was a problem hiding this comment.
suggestion: This strips a client-supplied $verified from public-token events, including teams that never generated a secret token, so the description's "byte-identical behaviour" for NULL-secret teams isn't quite accurate. $-prefixed keys are only a naming convention, not reserved (the SDKs pass them through), so a client could have sent $verified before this PR.
The real-world impact looks nil though: a sampled scan of prod (~13B events over 3 days) plus an unsampled 30-minute window both found zero events carrying $verified. So not a migration concern, just naming hygiene: worth a short changelog / docs note that $verified is now reserved and server-controlled so a future sender doesn't expect it to pass through.
| // body can carry one. Redact it before logging — a body-log reader must | ||
| // never be able to recover a secret token. | ||
| let decoded_body = String::from_utf8_lossy(truncated); | ||
| let request_body = redact_secret_tokens(&decoded_body); |
There was a problem hiding this comment.
suggestion: redact_secret_tokens has unit tests, but nothing sends a phs_ body through log_response and asserts the emitted line doesn't contain the raw token. This call is the only thing keeping a secret token out of the body log (the leak @veria-ai flagged); if it were ever dropped, every existing test still passes. log_response_emits_event_for_opted_in_team just above already feeds a phc_abc body and asserts it appears, so add the negative twin: a phs_ body, asserting the captured line is redacted and the raw token is absent.
| it('a secret-token lookup also warms the cache for the public token, backup token and id', async () => { | ||
| const newTeamId = await createTeam(postgres, organizationId, 'phc_warmcache', { | ||
| secret_api_token: 'phs_warmcache', | ||
| secret_api_token_backup: 'phs_warmcache_backup', |
There was a problem hiding this comment.
suggestion: The tests cover resolving by both secret columns and the cache warm-through, but none asserts that after a rotation the old phs_ token stops resolving. A stale-cache bug there would let a rotated-out secret keep authenticating until the TTL expires. Mirror the warm-cache test right here: rotate secret_api_token, re-fetch, and assert the old token returns null while the new one resolves.
| write_pipe = redis_client.pipeline() | ||
| has_writes = False | ||
| for index, zset_key in enumerate(zset_keys): | ||
| token_scores = raw_scores[index * len(known_tokens) : (index + 1) * len(known_tokens)] |
There was a problem hiding this comment.
suggestion: This pairs the pipeline results to zsets by index arithmetic: the build loop queues one zscore per (zset, token), and the read re-chunks the flat result with raw_scores[index * len(known_tokens) : (index + 1) * len(known_tokens)]. If anyone later queues another command on score_pipe, reorders the loops, or appends a token mid-list, the slice window shifts and existing_score silently reads another zset's results, with no error raised.
Computing the slices once next to the build loop keeps the two in lockstep:
raw_scores = score_pipe.execute()
scores_by_zset = {
zset_key: raw_scores[i * len(known_tokens) : (i + 1) * len(known_tokens)]
for i, zset_key in enumerate(zset_keys)
}
🔒 Security review notes (via Claude)I reviewed this PR against its core invariant — 1.
|
|
👋 hey folks
Looks like this isn't an immediate blocker to this PR (no capture change yet) but FYI: Just to confirm - at the moment no If we opt to move quota checks entirely downstream, we will ship those events (and future ones...sometimes this can really spike) into the topic just to drop them downstream. Not a hard blocker, but something to consider |
Problem
Server-side SDKs authenticate capture with the public project API key (
phc_), which is world-readable by design — so nothing distinguishes an event genuinely sent by trusted server infrastructure from one forged by anyone holding the public key. For workflows that act automatically on ingested events (e.g. error-tracking automations), we need a way to know an event came from a holder of a server-side secret.An earlier approach signed individual
$exceptionevents with per-team Ed25519 keys (#62750, #62751, posthog-python#657). This PR supersedes that with a much simpler transport-level mechanism: send events with the project's secret API token (phs_, already used for flags local evaluation and the conversations API), and every event ingested that way is marked verified.Changes
TeamManagernow also resolves teams bysecret_api_tokenandsecret_api_token_backup(so token rotation keeps working), and the shared resolve-team step sets a server-controlled$verified: trueproperty on events captured with a secret token. Any client-supplied$verifiedon public-token events is stripped, so the property cannot be forged. Covers the analytics, error-tracking, and heatmap pipelines; a counter metric tracks verified/stripped events.phs_capture would bypass billing limits./flagsrequests now accept aphs_token as theapi_key(reusing the existing cached secret-token validation used by/flag_definitions), so an SDK configured with the secret key keeps evaluating flags. Unknownphs_tokens 401 without polluting the public-token negative cache.$verifiedregistered as a boolean event property.No SDK changes are required — server SDKs send
api_keyverbatim, so settingapi_key="phs_..."works as-is. Capture already acceptsphs_tokens at the edge (format-only validation) and the token is not persisted to ClickHouse.Everything is inert until a team generates a secret token (existing rotate endpoint) and points an SDK at it; teams with
NULLsecret tokens see byte-identical behaviour. No migrations — unique indexes on both secret-token columns already exist.Notes for reviewers:
phc_andphs_as separate keys — accepted.phs_token too to fully block a team.phs_tokens (its team service is public-token-only; server SDKs don't do replay).How did you test this code?
nodejs: new parameterized unit tests forapplyVerifiedPropertyand the resolve-team step (secret/backup/forged/no-properties/null-secret cases), team-manager integration tests for secret-token resolution and cache warming, and two end-to-end ingestion tests (phs_capture lands in ClickHouse with$verified: true; forged$verifiedviaphc_is stripped). Fullevent-preprocessingand error-tracking pipeline suites pass.ee/billing: parameterized pytest coverage over no-secret / primary-only / primary+backup token sets forupdate_org_billing_quotas,update_all_orgs_billing_quotas, and the token helpers; fulltest_quota_limiting.pysuite passes (74 tests).rust/feature-flags: integration tests for/flagsauthenticated with the primary and backup secret token (200 + flags evaluated) and an unknownphs_token (401 with Django-compatible error body).Automatic notifications
Docs update