feat(multi-node): engine DaemonSet bundling + Caddy sticky LB + GPU/CPU pool placement #80

Open

aucahuasi wants to merge 13 commits into main from dev/distributed-streamgl-gpu
Conversation

@aucahuasi (Contributor) commented Apr 22, 2026

Summary

Enables Graphistry to run correctly across multiple GPU nodes from a single Helm release. All GPU-bound services (nginx, forge-etl-python, dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) are consolidated into a single engine DaemonSet pod per GPU node, fronted by Caddy as a sticky L7 load balancer. The prior multi-node model (leader + follower Helm releases per node) is retired in favour of a single-namespace deployment.

The architectural change

engine DaemonSet (one bundled pod per GPU node)

New templates/engine/engine-daemonset.yaml colocates nginx, fep (forge-etl-python), dask, and the streamgl-* services as sibling containers in one pod per GPU node. Tier-aware: 3 containers (nginx + fep + dask) at analytics, 6 containers (adds streamgl-{viz,sessions,gpu}) at viz/full. Intra-pod hostAliases pin the app-layer hostnames (streamgl-viz, streamgl-gpu, forge-etl-python, dask-cuda-worker) to 127.0.0.1, so every intra-stack HTTP hop is localhost. streamgl-gpu's PM2 localhost IPC continues to work unchanged because the gpu-router and all PM2-forked gpu-worker children are intra-pod.
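For orientation, the pod-spec shape this produces looks roughly like the sketch below (a hand-written fragment, not the rendered template; container images and ports are illustrative placeholders):

  # engine DaemonSet pod template (illustrative fragment)
  spec:
    hostAliases:
      - ip: "127.0.0.1"
        hostnames:
          - streamgl-viz
          - streamgl-gpu
          - forge-etl-python
          - dask-cuda-worker
    containers:
      - name: nginx                        # intra-pod :8080 Host-header dispatcher
        image: graphistry/nginx            # tags omitted
      - name: forge-etl-python
        image: graphistry/forge-etl-python # placeholder image name
      - name: dask-cuda-worker
        image: graphistry/dask-cuda-worker # placeholder image name
      # viz/full tier adds streamgl-viz, streamgl-sessions, streamgl-gpu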

Companion templates:

  • templates/engine/engine-nginx-cfg.yml -- supplementary nginx conf.d ConfigMap that adds an intra-pod :8080 Host-header dispatcher, loaded alongside the production default.conf.template.
  • templates/engine/engine-service.yaml -- engine-headless Service for the Caddy upstream pool (gated on tier.analytics) plus four shim Services (streamgl-viz, streamgl-sessions, streamgl-gpu, forge-etl-python, gated on tier.viz) so the production nginx FQDN-suffixed hostnames keep resolving through CoreDNS for any path that does not go through hostAliases.
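A headless Service of the shape engine-service.yaml describes would look roughly like this (a sketch; the selector label and port are assumptions, not copied from the chart):

  apiVersion: v1
  kind: Service
  metadata:
    name: engine-headless
  spec:
    clusterIP: None      # headless: DNS A lookups return the engine pod IPs directly
    selector:
      app: engine        # assumed label; the chart's real selector may differ
    ports:
      - name: http
        port: 8080       # the intra-pod nginx dispatcher port
        targetPort: 8080

The headless form matters because Caddy's dynamic a resolver (below) needs per-pod A records rather than a single Service VIP.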

This bundle supersedes the earlier streamgl-gpu router/spawner split (added in 511876d, retired in 699bc4f). Bundling keeps PM2 IPC working unchanged, makes every intra-stack hop localhost (lower latency, better data locality), and has the same blast radius as a router-split design because browser sessions are cookie-pinned across engine pods anyway. Per-node redundancy comes from running multiple engine pods (one per GPU node) under the headless Service.

The engine DaemonSet's forge-etl-python container now sets GRAPHISTRY_DASK_LOCAL_AFFINITY=1 so each fep submission's persist/compute carries a soft worker-affinity hint pinning it to the dask-cuda-worker on the same pod (matched by HOSTNAME prefix). This is a cross-repo dependency on graphistry/graphistry#3097, which adds the dask_affinity helper that reads this env var and produces the kwargs. With multiple GPU nodes registered to the scheduler, the hint eliminates the cross-node shuffle path that would otherwise ship ~256 MiB partitions over the cluster overlay (~2x ETL speedup measured at 4/92/512 MiB). The hint is soft (allow_other_workers=True) and decays to a no-op on miss, so single-node deployments are byte-identical to pre-change.
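Chart-side the wiring is a single env entry on the fep container; the app-side helper from graphistry/graphistry#3097 does the rest (sketch):

  containers:
    - name: forge-etl-python
      env:
        - name: GRAPHISTRY_DASK_LOCAL_AFFINITY
          value: "1"   # gates fep's dask_affinity helper, which emits
                       # workers=[<co-located dask-cuda-worker>] plus
                       # allow_other_workers=True on persist/compute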

Caddy as the sticky L7 load balancer

Caddy now load-balances browser sessions across engine-headless using Caddy's dynamic a (DNS A-record) upstream resolver for live pod-IP refresh (refresh interval 10s), with lb_policy cookie graphistry_sticky for session stickiness. The cookie is HMAC-signed with engine.cookieSecret; WebSocket upgrades to /graph/socket.io and /streamgl/* ride the same pin.

New operator-tunable knobs in values.yaml:

  • caddy.enabled -- toggle Caddy + caddy-ingress entirely. false lets operators front engine-headless with their own ingress controller (Pattern B); the operator owns TLS termination and cookie affinity in that case. The render gate on caddy-cfg.yml, caddy-deployment.yaml, and caddy-ingress.yml is now tier.analytics AND caddy.enabled.
  • caddy.tls.mode -- external | self | off. external trusts X-Forwarded-Proto: https from private_ranges so the sticky cookie keeps Secure+SameSite=None. self terminates from existingSecret or ACME. off is plain HTTP only.
  • caddy.lb.fallback -- first-time-assignment policy when no cookie is set yet (default round_robin).
  • caddy.service -- type / loadBalancerIP / nodePort / nodePortHttps / annotations / externalTrafficPolicy. Cloud, Tanzu, MetalLB, NodePort all selectable from values; per-platform annotation hints inline.
  • caddy.upstreamImage -- escape hatch to use caddy:2.10-alpine directly while the bundled wrapper image lags upstream PR caddyserver/caddy#6115 ("reverseproxy: cookie should be Secure and SameSite=None when TLS" -- the cookie-LB Secure+SameSite=None fix). Liveness probe switches to httpGet when set (no curl in the official image).
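Put together, a values override exercising these knobs might look like the following (illustrative defaults; only the key shapes are taken from the list above):

  caddy:
    enabled: true
    tls:
      mode: external          # external | self | off
    lb:
      fallback: round_robin   # first-assignment policy before the cookie exists
    service:
      type: LoadBalancer
      annotations: {}         # cloud / Tanzu / MetalLB hints go here
    # upstreamImage: caddy:2.10-alpine  # escape hatch while the wrapper lags caddy#6115
  engine:
    cookieSecret: change-me   # HMAC key for graphistry_sticky; rotate per the cluster guide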

Caddy Pod template gains a checksum/config annotation that hashes the rendered Caddyfile into the Pod spec, so any change to TLS mode / lb.fallback / cookieSecret triggers an automatic rollout on helm upgrade. Without this, ConfigMap-only changes left Caddy with stale parsed config in memory until something forced a pod bounce.
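This is the standard Helm checksum idiom; a sketch of the shape (the exact template path is assumed for illustration):

  # caddy-deployment.yaml, pod template metadata
  annotations:
    checksum/config: {{ include (print $.Template.BasePath "/caddy/caddy-cfg.yml") . | sha256sum }}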

Heterogeneous cluster placement: GPU/CPU pools + dedicated-tenant taints

Two new operator-tunable values let the chart land correctly on clusters that mix GPU and CPU node pools, or that use NodePool/managed taints to keep workloads off shared infra:

  • engine.nodeSelector (default {}) -- when non-empty, the engine DaemonSet uses this selector instead of global.nodeSelector (falls back to global when empty). Lets operators target the engine pod at GPU-labelled nodes (graphistry.io/role=gpu) while keeping the chart's CPU-side workloads (caddy, nexus, redis, dask-scheduler, notebook, pivot, gak-{private,public}) on a separate pool via global.nodeSelector (graphistry.io/role=cpu).
  • global.tolerations (default []) -- applied to every chart-rendered workload (11 templates: caddy, dask-scheduler, gak-private/public, http-tools netshoot+whoami, nexus, notebook, pivot, redis, engine DaemonSet). Lets operators opt the chart's pods into nodes carrying operator-defined taints: NVIDIA GPU Operator's nvidia.com/gpu=true:NoSchedule, GKE/EKS managed GPU node-pool taints (nvidia.com/gpu=present:NoSchedule), dedicated-tenant taints (dedicated=graphistry:NoSchedule), etc. Empty [] keeps current behaviour byte-identical. Tolerations are permissive (not directive), so adding a GPU-pool toleration to global.tolerations is harmless for non-GPU workloads -- they still land on whichever node global.nodeSelector admits them to.
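A mixed-pool values override combining the two axes might look like this (label and taint values follow the conventions named above):

  global:
    nodeSelector:
      graphistry.io/role: cpu      # CPU-side workloads: caddy, nexus, redis, ...
    tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule         # permissive: admits pods to GPU-tainted nodes
  engine:
    nodeSelector:
      graphistry.io/role: gpu      # engine DaemonSet pods land only on GPU nodes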

Bug fix: dcgm-exporter and http-tools (netshoot, whoami) were the only chart workloads that did not honour global.nodeSelector; now they do. Pre-fix these would leak onto non-GPU nodes in mixed-pool clusters even when the operator constrained the rest of the chart to GPU nodes.

PVC tier-gating dropped (correctness fix)

gak-private, gak-public, and uploads-files PersistentVolumeClaims are no longer gated on tier.full / tier.analytics. With Retain reclaim policy, gating made tier downgrades leave PVs Released with stale claimRef; later upgrades created new PVCs with fresh UIDs that no longer matched, and the new PVCs sat Pending indefinitely. Consuming Deployments stay tier-gated; only the storage object is unconditional.
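The operational counterpart is the volumeName rebind workflow exercised in the test plan: with Retain, re-attaching data to a reinstalled release means pinning the fresh PVC to the surviving PV (the PV's stale claimRef must also be cleared). A sketch, with placeholder names and sizes:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: uploads-files
  spec:
    accessModes: [ReadWriteMany]   # RWX for the NFS-backed example
    storageClassName: retain-sc
    volumeName: pvc-0a1b2c3d       # placeholder: the existing PV's name
    resources:
      requests:
        storage: 10Gi              # placeholder size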

Validation

End-to-end on a 2-node k3s cluster (node1 = k3s server with NFS server colocated, node2 = k3s agent / NFS client) with NFS RWX storage, tier viz, two engine pods (one per node), Caddy as ClusterIP fronted by k3s Traefik on node1.

  • ETL 5-call sequence at analytics tier: /readarrow, /upload, /preshape, /properties, /download all returned 200 on both nodes. Local-worker affinity hint (added in graphistry_master) was exercised: each fep submission's persist/compute kwargs named the dask-cuda-worker on its own node, verified by HOSTNAME-prefix match against scheduler worker info. Behavior decayed to a no-op when a fep call landed on a node without a registered local worker, so single-node deployments are byte-identical to pre-change.
  • Multi-node session test at viz tier: browsers opened sessions against the public Caddy endpoint on node1. Each browser pinned to one engine pod via the graphistry_sticky cookie across page reloads and WebSocket reconnects; sessions on node1 were undisturbed by traffic landing on node2, and vice versa. Verified sticky distribution by tab count vs cookie value. Long-idle sessions (60s+ inactivity) survived without disconnect.
  • Tier transitions: upgraded viz -> analytics -> viz against a release with non-empty PVCs. gak/uploads PVCs survived the round trip and re-bound to their existing PVs (PV/PVC UID stable, claimRef intact). Pre-fix this would have left the new PVCs Pending.
  • Caddyfile config rollout: edited only caddy.lb.fallback in values and re-ran helm upgrade. Caddy Pod rolled automatically because the rendered Caddyfile checksum changed; ConfigMap-only edits no longer require manual pod bounce.
  • Concurrent upload test (carried over from earlier validation in this branch): 1.77 GB of concurrent uploads across both ingress replicas (now both inside engine pods) with zero errors; the only render glitch observed on a 7th concurrent tab was the HTTP/1.1 6-connections-per-origin browser limit on plain-HTTP localhost (documented upstream, dissolves under HTTPS/HTTP/2), not chart or backend.
  • Telemetry stack on multi-node k3s (3 GPUs across 2 nodes): Pre-fix, Grafana's DCGM dashboard showed metrics for only 1 GPU because prometheus scraped the Service VIP and kube-proxy load-balanced to a single DaemonSet pod per scrape. Post-fix (kubernetes_sd_configs role: pod + relabel rules), all 3 GPUs visible with stable per-node labels. node-exporter validated equivalently. The new prometheus-rbac.yaml ServiceAccount + Role + RoleBinding satisfies prometheus's apiserver pod-discovery; automountServiceAccountToken: true is explicit on the prometheus pod for clusters that flip the namespace default.
  • Platform-tier nexus-proxy end-to-end (tier=platform, postgres + nexus + nexus-proxy slice): All five v1-to-v2 deprecation shims (/etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*) returned 410 Gone with the documented upgrade message. Live /api/v1/* routes (/datasets/, /files/, /organization/, /team/, /named-endpoint/, /my/user/entitlements/) returned 200 with real data after a Django session login at /accounts/login/. v2 ETL routes routed correctly through nginx (returned upstream errors at platform tier as expected — the GPU backends streamgl-gpu and forge-etl-python intentionally don't render at this tier). Tier transition platform -> analytics removed nexus-proxy and brought up the engine DaemonSet's nginx container; analytics -> platform reversed it cleanly.
  • Caddy stability: Two crash modes fixed. Single-line handle @grafana { reverse_proxy ... } block syntax tripped Caddy v2's parser ("Unexpected next token after '{' on same line"); expanded to multi-line form. The /caddy/health/ endpoint was being intercepted by the catch-all handle and proxied to engine-headless (no endpoints) even after respond wrote 200, because respond is non-terminal in Caddy v2; wrapping it in a terminal handle /caddy/health/ { respond 200 ... } block broke the resulting 30s SIGTERM-then-restart liveness loop. New templates/caddy/_helpers.tpl defines graphistry.caddy.healthHandle and graphistry.caddy.telemetryHandles, removing the 3-4× duplication of identical handle blocks across the tls.mode = external | self | off branches.

Telemetry as a properly-structured subchart

telemetry is now a Helm subchart consumed via dependencies: with condition: global.ENABLE_OPEN_TELEMETRY. Helm fully prunes the subchart's templates when the flag is off (was previously per-template {{- if }} gates inside the parent).
Telemetry-specific values move to a top-level telemetry: block in the parent's values.yaml (subchart-canonical layout); only cross-cutting values (OTLP endpoint, instance name, image-pull config, default scheduling, storage class) stay under global.*.

The parent's shared _helpers.tpl (graphistry.tier.* helpers) is extracted into a new graphistry-common library subchart so the telemetry subchart can reuse the same tier gating via a dependency entry.

New operator-tunable knobs:

  • telemetry.dcgmExporter.useExternal + telemetry.dcgmExporter.externalEndpoint (default false / "") — when true the chart skips its own dcgm-exporter DaemonSet+Service and points prometheus + otel-collector at an externally-managed endpoint.
    Use cases: GKE with NVIDIA GPU Operator (the bundled exporter image fails on Container-Optimized OS), or any cluster already running the GPU Operator's DCGM module (avoids two DaemonSets scraping the same GPUs). Format host:port, no scheme.
  • telemetry.nodeExporter.useExternal + telemetry.nodeExporter.externalEndpoint — same pattern for clusters already running kube-prometheus-stack's node-exporter; avoids two DaemonSets per node.
  • telemetry.prometheus.enableAdminAPI (default false) — when true passes --web.enable-admin-api to prometheus, enabling tsdb/delete_series, tsdb/clean_tombstones, snapshot, and shutdown. Mirrors kube-prometheus-stack's prometheusSpec.enableAdminAPI default. Off in production; flip in dev/test to drop stale series after scrape-config refactors.
  • telemetry.prometheus.retention (default 15d) — local TSDB retention.
  • telemetry.{prometheus,jaeger,grafana}.persistence.{enabled,size,storageClassName} — per-component PVC config for the otel-collector backend stack. Storage class falls back to global.storageClassNameOverride then retain-sc. Without persistence, Grafana dashboards/sessions and Jaeger traces are lost on every pod restart.
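A telemetry block exercising these knobs (sizes illustrative):

  telemetry:
    prometheus:
      retention: 15d
      enableAdminAPI: false
      persistence:
        enabled: true
        size: 20Gi               # illustrative
        storageClassName: ""     # falls back to global.storageClassNameOverride, then retain-sc
    dcgmExporter:
      useExternal: false
      externalEndpoint: ""       # host:port, no scheme, when useExternal: true
    nodeExporter:
      useExternal: false
      externalEndpoint: ""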

Topology + correctness changes:

  • Prometheus is now a single-replica Deployment with Recreate strategy + RWO PVC (was per-node DaemonSet, no global view, no retention). Multi-replica HA is out of scope; users who need it should add Thanos or remote_write to a managed backend in their overrides.
  • prometheus scrape config: dcgm-exporter and node-exporter jobs switched from Service-VIP static_configs to kubernetes_sd_configs with role: pod + relabel rules that set the node label from pod_node_name and __address__ from pod_ip (see the sketch after this list). Pre-fix prometheus scraped the Service VIP and kube-proxy load-balanced to one DaemonSet pod per scrape, silently dropping the other nodes' GPU/host metrics from dashboards. Post-fix every DaemonSet pod is a distinct scrape target with a stable per-node label.
  • New templates/prometheus-rbac.yaml: ServiceAccount + namespace-scoped Role + RoleBinding granting read-only pods / services / endpoints get / list / watch. Required by prometheus's kubernetes_sd_configs apiserver discovery. The prometheus pod sets automountServiceAccountToken: true explicitly — the K8s default is true, but locked-down clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) flip it to false at namespace level, which would silently break pod discovery.
  • otel-collector ConfigMap unified: previously otel-collector-cloud-configmap.yaml and otel-collector-configmap.yaml rendered separately based on OTEL_CLOUD_MODE; now one templated ConfigMap with cloud-mode/self-hosted branches inline. Cloud-mode credentials read from a pre-created Secret via secretKeyRef (never inlined into values.yaml).
  • checksum/config annotation on otel-collector / prometheus / grafana / jaeger pod templates — ConfigMap content changes now trigger a rollout on helm upgrade (was previously a no-op until manual pod delete).
  • Per-component telemetry Ingresses (grafana-ingress.yaml, jaeger-ingress.yaml, prometheus-ingress.yaml) removed; replaced by Caddy path routes (/grafana, /jaeger, /prometheus) on the parent chart's main Ingress.
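The pod-role discovery switch in the scrape-config bullet above looks roughly like the following (job name, match label, and port are assumptions; the relabel targets are the ones described):

  scrape_configs:
    - job_name: dcgm-exporter
      kubernetes_sd_configs:
        - role: pod
          namespaces:
            own_namespace: true
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_label_app]   # assumed match label
          regex: dcgm-exporter
          action: keep
        - source_labels: [__meta_kubernetes_pod_node_name]
          target_label: node                                 # stable per-node label
        - source_labels: [__meta_kubernetes_pod_ip]
          replacement: "${1}:9400"                           # dcgm-exporter's default metrics port
          target_label: __address__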

Platform-tier nexus-proxy (nginx fronting nexus, v1-to-v2 endpoint rewrites)

New templates/nexus-proxy/nexus-proxy-deployment.yaml + nexus-proxy-service.yaml: platform-tier-only nginx Deployment + Service that fronts nexus and provides the v1-to-v2 endpoint rewrites + deprecated-endpoint 410 shims baked into the graphistry/nginx image. Renders only when global.tier == "platform" (exact match, not >=); at analytics+ the engine DaemonSet's nginx container plays the same role with intra-pod localhost dispatch to the streamgl/forge backends, so the standalone Deployment would be redundant.

Why it exists: the v1-to-v2 rewrites Graphistry clients depend on (deprecation 410s on /etl, /api/check, /api/encrypt, /api/decrypt, /api/v1/etl/vgraph/*; live forwarding for /api/v1/{datasets,files,organization,team,named-endpoint,...}; parallel /api/v2/etl/... routes) live in the graphistry/nginx image's default.conf.template rendered by render_templates.sh. Pre-fix, tier=platform rendered postgres + nexus only, so a nexus-only deployment had no externally-reachable HTTP surface honouring the v1 paths.

The platform tier now ships a deployable slice of postgres + nexus + nexus-proxy with no transitive dependencies on analytics-tier services. The Deployment's only init container waits for the nexus Service alone (not redis / dask-scheduler / streamgl), so the slice is genuinely self-contained — useful for downstream products that need only the auth foundation (e.g. as the Nexus-only auth backend behind another product's chart).

In-cluster reachability (the actual ask for service-to-service integrations like Louie):

  • http://nexus-proxy.<ns>.svc.cluster.local:80 — both v1 and v2 paths
  • http://nexus.<ns>.svc.cluster.local:8000 — raw nexus, v2 only

External reachability is deliberately not chart-managed at platform tier (no Ingress, no Caddy). Operators bring their own L7 (Pulumi, Ansible, kustomize, manual Ingress), or use port-forward for dev/test:

kubectl -n <ns> port-forward --address 0.0.0.0 svc/nexus-proxy 8080:80

Reuses existing values and PVCs: NginxResources, global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}, postgres secret refs, and the local-media-mount + data-mount PVCs. The image's content-directory expectations for /streamgl, /pivot, and /upload paths are satisfied by emptyDir mounts — those routes return 404 at platform tier, which is the correct behaviour when the backends don't exist.

Complementary changes (details in CHANGELOG.md)

  • Storage-agnostic chart: PVC templates unified on global.storage.accessMode + global.storageClassNameOverride. Removed ENABLE_CLUSTER_MODE, IS_FOLLOWER, multiNode, clusterVolume, provisioner, REDIS_URL_NEXUS_FEP, longhornDashboard, and the hardcoded datamount-longhorn / postgres-longhorn / retain-sc-cluster StorageClass names. Longhorn becomes just another backend operators can point a SC at.
  • Dedicated retain-sc-postgres StorageClass for the postgres-cluster chart, per Crunchy PGO's documented pattern.
  • Dask Kubernetes Operator removed: dask-cluster.yml CRD, dask.operator toggle, operator sections from all platform READMEs and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries, ACR import script, dev-compose setup script.
  • OTEL / Redis Service hardening: otel-collector Service flipped to ClusterIP + internalTrafficPolicy: Local (DaemonSet collector model, per opentelemetry-operator#1401); Redis Service flipped to ClusterIP (was LoadBalancer, a latent security default).
  • charts/values-overrides/examples/cluster/ rewritten end-to-end (~+800 lines): legacy leader/follower multi-namespace example deleted. New single-namespace multi-node guide adds:
    • Architecture diagram of the engine-DaemonSet topology and Caddy sticky-LB ingress
    • cookieSecret rotation procedure (HMAC key for graphistry_sticky)
    • Two-axis scheduling model (nodeSelector for where-allowed vs tolerations for what-taints-accepted) with a worked 5-node A100 walkthrough
    • Cost-optimised variant (mixed GPU/CPU pools using engine.nodeSelector + global.tolerations)
    • Three approaches for pinning CPU singletons (caddy, nexus, redis, postgres) to a small CPU pool
    • End-to-end verification commands (pod placement, sticky-cookie distribution, ETL 5-call)
    • cluster/retain-sc-nfs.yaml for operators who prefer to apply the Retain StorageClass separately
  • k3s README updated: new "Postgres StorageClass" / "Graphistry StorageClass" subsections, two-SC install flow, same-namespace invariant note. PV-cleanup runbook reorganized into three explicit options (Graphistry-only / postgres-only / both) so operators don't accidentally wipe the wrong chart's data.
  • Chart bug fixes: dcgm-exporter DaemonSet and netshoot / whoami http-tools gained the missing nodeSelector blocks (they were the only workload templates that did not honour global.nodeSelector).

Chart version: graphistry-helm 0.4.3 -> 0.4.4.

Test plan

  • 2-node k3s + NFS RWX: helm install succeeds end-to-end
  • ETL 5-call sequence (analytics tier) succeeds on both nodes; local-worker affinity exercised and decays to no-op on miss
  • Multi-node viz sessions distributed across engine pods via graphistry_sticky cookie; reconnects stay pinned
  • Tier transitions (viz -> analytics -> viz) preserve PVC binding
  • Caddyfile config edits trigger automatic Pod rollout via checksum/config on helm upgrade
  • dcgm-exporter respects global.nodeSelector (prior bug: leaked onto non-GPU nodes)
  • helm uninstall + reinstall rebinds PVCs via volumeName workflow, preserves data
  • Concurrent multi-GB uploads across nodes, zero backend errors
  • Mixed GPU/CPU node-pool test: engine.nodeSelector=graphistry.io/role=gpu + CPU singletons pinned via global.nodeSelector=graphistry.io/role=cpu; verify engine pods land only on GPU nodes and caddy/nexus/redis/dask-scheduler land only on CPU nodes
  • global.tolerations validation against NVIDIA GPU Operator taint (nvidia.com/gpu=true:NoSchedule) on a managed cluster (GKE/EKS GPU node-pool)
  • Dedicated-tenant taint validation (dedicated=graphistry:NoSchedule) -- chart pods schedule onto tainted nodes only when the toleration is set
  • Telemetry stack on multi-node k3s: all GPUs and all nodes visible via per-node labels (post kubernetes_sd_configs switch)
  • Platform-tier slice (tier=platform): postgres + nexus + nexus-proxy renders standalone; v1-to-v2 deprecation 410s + authenticated /api/v1/* round-trips return real data; tier transitions to/from analytics are clean
  • Caddy stable on tls.mode = external | self | off paths; /caddy/health/ no longer trips the 30s SIGTERM liveness loop
  • Production-scale load
  • Longhorn RWX integration

Related

…emove Dask Operator

Replace dask-cuda-worker Deployment (with K8s-level replicas scaling) with a
DaemonSet matching the same pattern used by forge-etl-python and streamgl-gpu.
This aligns the Helm chart with the docker-compose GPU architecture where all
GPU services run one pod per node and scale workers internally via env vars
(DASK_NUM_WORKERS, DCW_CUDA_VISIBLE_DEVICES) rather than K8s replicas.

The previous Deployment model (dask.workers replicas) contradicted the app-level
multi-GPU configuration: multiple replicas on the same node would see the same
CUDA_VISIBLE_DEVICES and compete for GPU memory. The DaemonSet model ensures
one pod per GPU node with app-controlled worker processes and round-robin GPU
assignment, consistent with forge-etl-python and streamgl-gpu.

Remove the Dask Kubernetes Operator integration (dask.operator toggle,
DaskCluster CRD template, operator install docs, Argo CD app, ACR import,
chart-bundler gathering, dev-compose setup). The operator's pod-level scaling
model conflicts with Graphistry's app-level GPU management where services
control their own GPU assignment and worker counts via environment variables.

Remove forgeWorkers Helm value — FORGE_NUM_WORKERS is now controlled exclusively
via env vars (default: 4), matching how DASK_NUM_WORKERS and STREAMGL_NUM_WORKERS
already work. This gives operators a single, consistent interface for GPU worker
configuration across all services.

Templates changed:
- dask-cuda-worker-daemonset.yaml: rewritten as DaemonSet (was Deployment)
- dask-cluster.yml: deleted (DaskCluster operator mode)
- dask-scheduler-deployment.yaml: removed dask.operator guard
- forge-etl-python-daemonset.yaml: removed forgeWorkers Helm value reference

Values cleaned:
- Removed dask.workers, dask.operator, forgeWorkers from values.yaml
- Removed forgeWorkers from k3s example values

Docs/infra cleaned (23 files):
- Removed Dask Operator install/troubleshoot sections from all READMEs
  (k3s, gke, tanzu, cluster, troubleshooting.md)
- Removed dask-kubernetes-operator-docs.rst and index.rst reference
- Removed dask.workers and forgeWorkers from graphistry-helm-docs.rst
- Rewrote troubleshooting.md Dask architecture section to document
  DaemonSet model and app-level GPU/worker configuration
- Removed dask-operator-cd.yaml Argo CD app
- Removed dask operator from cd/repo/Chart.yaml, bundler.sh,
  helm-dev-setup-deploy.sh, ACR import script, docs Makefile

Tested: deployed on jorge7 GKE cluster, dask-cuda-worker DaemonSet pod
starts correctly, detects 2 GPUs (CUDA_VISIBLE_DEVICES=0,1), forge-etl-python
connects to dask-scheduler and initializes 4 workers with round-robin GPU
assignment (0,1,0,1).
…streamgl-gpu router split

Make the chart correctly deploy Graphistry across multiple GPU nodes
from a single Helm release. Three interlocking chart changes land the
core of the multi-node story; everything else in this commit is
complementary refactor cleanup.

1. streamgl-gpu split (router Deployment + spawner DaemonSet)
---------------------------------------------------------------
New `streamgl-gpu-deployment.yaml` deploys a cluster-wide router
(replicas=1) that holds the session-to-worker registry in memory and
proxies `/streamgl/*` WebSocket traffic to whichever worker owns a
session. New `streamgl-gpu-networkpolicy.yaml` restricts the router's
`/internal/*` endpoints (register, deregister, worker-event) to
in-cluster callers as defense-in-depth on top of the nginx-level
denial of `/streamgl/internal/`.

Workers stay per-node as a DaemonSet, but the in-pod spawner now
registers each worker with the remote router over HTTP rather than
relying on PM2's localhost event bus. That removes the PM2 IPC
limitation that blocked multi-node in every prior design. Workers on
bob are now discoverable and schedulable from jorge7 (and vice versa)
via `/internal/register`, and sessions fan out across all GPU nodes.

New Helm value `StreamglGpuWorkerResources` separates the DaemonSet
(GPU) resource block from `StreamglGpuResources` (router, no GPU),
since the two workloads have very different resource profiles. Worker
count and GPU visibility stay on env vars (`STREAMGL_NUM_WORKERS`,
`CUDA_VISIBLE_DEVICES`) consistent with the `DASK_NUM_WORKERS` and
`FORGE_NUM_WORKERS` pattern.

Corresponds to the app-layer PR graphistry/graphistry#3087.

2. nginx: Deployment -> DaemonSet
---------------------------------
nginx was a single-replica Deployment, which on a multi-node cluster
funnelled all external traffic through one node and every upload
through one `forge-etl-python` pod via the load-balanced Service.
That pinning broke the upload path on NFS because nginx on node A
wrote the request body to the RWX PVC and `forge-etl-python` on node
B read it as 0 bytes until node B's NFS client attribute cache
refreshed, well past `FORGE_MAX_FILE_WAIT_MS`. Uploads 500'd with
"Waited longer than 10000 ms for from_path ... to be populated".

Running nginx as a DaemonSet gives each GPU node its own ingress pod.
caddy still reverse-proxies to the `nginx` Service with the default
Cluster policy, so kube-proxy distributes external traffic across
every node's nginx replica. Each node's nginx then hands off to its
local `forge-etl-python` via the Local routing below, so writer and
reader share a single kubelet's NFS client. Load distribution is
automatic, matches the DaemonSet count, and requires no `replicas:`
tuning or HPA. The `rollingUpdate`/`maxSurge`/`Recreate` branch is
gone (DaemonSets only support `RollingUpdate`/`OnDelete`); rolling
updates now proceed one node at a time with `maxUnavailable: 1`.

3. forge-etl-python Service: internalTrafficPolicy: Local
---------------------------------------------------------
Pairs with the nginx DaemonSet above. Every Service call to
`forge-etl-python` routes to the DaemonSet pod on the caller's node,
so nginx's write to `uploads-files` PVC is read back through the same
kubelet's NFS client. The NFS cross-node coherence race goes away.

Trade-off: nginx on node A cannot fall back to `forge-etl-python` on
node B if node A's pod is unhealthy; DaemonSet per-node liveness
probes cover that failure mode. Single-node deployments are
unaffected (there is only one endpoint). This is the same pattern
already applied to `otel-collector` for the same data-locality reason
(upstream OpenTelemetry Operator guidance).
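The Service change is a single field; a sketch (port and selector label assumed, the label following the chart's kompose lineage mentioned later in this branch):

  apiVersion: v1
  kind: Service
  metadata:
    name: forge-etl-python
  spec:
    internalTrafficPolicy: Local    # callers reach only the endpoint on their own node
    selector:
      io.kompose.service: forge-etl-python
    ports:
      - port: 8080                  # placeholder port
        targetPort: 8080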

---

Validated end-to-end on a 2-node k3s cluster (jorge7 + bob) with NFS
RWX: 1.77 GB of concurrent uploads across both nginx replicas with
zero errors; 6 active WebSocket sessions distributed 3/3 across GPU
nodes; cross-tab node-move sync working across the streamgl-gpu
router; no retry amplification under sustained load.

Complementary changes in this commit (details in CHANGELOG.md):

- Storage-agnostic chart: PVC templates unified on
  `global.storage.accessMode` + `global.storageClassNameOverride`.
  Removed `ENABLE_CLUSTER_MODE`, `IS_FOLLOWER`, `multiNode`,
  `clusterVolume`, `provisioner`, `REDIS_URL_NEXUS_FEP`,
  `longhornDashboard`, and the hardcoded `datamount-longhorn` /
  `postgres-longhorn` / `retain-sc-cluster` StorageClass names.
  Longhorn becomes just another backend operators can point a SC at.
- Dedicated `retain-sc-postgres` StorageClass for the postgres-cluster
  chart, per Crunchy PGO's documented pattern.
- `LEADER_OTEL_EXPORTER_OTLP_ENDPOINT` removed; `otel-collector`
  Service flipped to `ClusterIP` + `internalTrafficPolicy: Local`;
  Redis Service flipped to `ClusterIP` (was LoadBalancer).
- `charts/values-overrides/examples/cluster/` replaced: legacy
  leader/follower multi-namespace example deleted, new single-
  namespace multi-node guide added with NFS as the documented default
  plus a `retain-sc-nfs.yaml` manifest for operators who prefer to
  apply the Retain StorageClass separately.
- k3s README: new "Postgres StorageClass" and "Graphistry
  StorageClass" subsections, two-SC install flow, same-namespace
  invariant note, Cleanup filter extended to cover both charts.
- Fixes: `dcgm-exporter` DaemonSet and `netshoot`/`whoami` http-tools
  gained the missing `nodeSelector` blocks (were the only templates
  that did not honour `global.nodeSelector`).
@aucahuasi self-assigned this Apr 22, 2026
@aucahuasi changed the title from "Dev/distributed streamgl gpu" to "feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and NFS-coherent forge-etl-python routing" Apr 22, 2026
@aucahuasi changed the title from "feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and NFS-coherent forge-etl-python routing" to "feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and distributed FS coherent forge-etl-python routing" Apr 22, 2026
… docs

Follow-up to 511876d, picking up cleanups and operator-facing docs
surfaced during multi-node validation on the k3s cluster.

- templates/dask: remove vestigial dask-cuda-worker Service (dead since
  66f1c63 "DKO HOTFIX" 2023-02-14); Dask workers register by pod IP +
  ephemeral port with the scheduler's in-memory registry, never through
  a Service. Inline comment documents why Dask bypasses Services by
  design, and why the dashboard stays bound to localhost.

- templates/forge-etl: init container wait switched from
  "service dask-cuda-worker" to "pod -lio.kompose.service=..." since
  the Service is gone; pod-label readiness is strictly stronger than
  Service-existence as an init gate (a Service can render with zero
  ready backends).

- templates/streamgl NetworkPolicy: inline DEV-MODE GAP and CNI
  REQUIREMENT caveats. Policy is gated on global.devMode == false and
  is inert at runtime on non-enforcing CNIs (vanilla flannel, stock
  AWS VPC CNI without the add-on); nginx L7 deny is the only remaining
  defense on those clusters.

- examples/cluster/README: operator-facing NetworkPolicy CNI section
  listing enforcers (kube-router in k3s >=1.25, Calico, Cilium, Antrea,
  Weave, GKE with --enable-network-policy) vs non-enforcers, with a
  pointer to test_1_networkpolicy.md for a 5-minute verification.

- examples/troubleshooting: document HTTP/1.1 6-connections-per-origin
  browser limit that hangs the 7th viz tab on plain-HTTP localhost; fix
  is HTTP/2 multiplex via Caddy "tls internal" (dev), mkcert for a
  named host, or automatic ACME (prod). Diagnosed by Manfred on hub.

- docs/configure-storageclass: new "StorageClass defaults per backend"
  table (9 backends x default volumeBindingMode x default reclaimPolicy
  x override-needed column) so operators know up front which knobs
  must be set at SC creation rather than discovering it from a pending
  PVC.
…vices into single bundled pod per node

Replaces the per-service Deployments/DaemonSets (nginx, forge-etl-python,
dask-cuda-worker, streamgl-gpu, streamgl-viz, streamgl-sessions) with a
single `engine` DaemonSet pod per GPU node that colocates all of them as
sibling containers. Referred to internally as "fatpod" during development;
final name is `engine` for both single-node and multi-node deployments at
analytics+ tier.

Why this design over the previous per-service / streamgl-gpu router split

  - Latency: every intra-stack HTTP hop (viz <-> gpu, fep <-> dask,
    nginx <-> any) is now localhost. The streaming hot path no longer
    crosses the CNI overlay on viz frames or ETL submissions.
  - Data locality: dask-cuda-worker, fep, and streamgl-gpu run on the
    same physical host as their consumers; partitioned cuDF/dask_cudf
    data lives next to compute. Pairs with the local-worker affinity
    hint added to fep ETL submissions in graphistry_master.
  - Resilience: equivalent to previous designs. Browser sessions are
    pinned across engine pods via Caddy's HMAC-signed `graphistry_sticky`
    cookie; node loss has the same blast radius as before.
  - Simplicity: streamgl-gpu's PM2 localhost IPC continues to work
    unchanged because gpu-router and PM2-forked gpu-worker children
    are intra-pod. No router/spawner HTTP split, no app-layer changes.

New templates
-------------

  templates/engine/engine-daemonset.yaml   pod definition, tier-aware:
                                           3 containers at `analytics`
                                           (nginx + fep + dask), 6 at
                                           `viz`/`full` (adds
                                           streamgl-{viz,sessions,gpu})
  templates/engine/engine-nginx-cfg.yml    supplementary :8080 Host-
                                           header dispatcher inside the
                                           pod, loaded as conf.d
                                           alongside the production
                                           default.conf
  templates/engine/engine-service.yaml     engine-headless for Caddy
                                           upstreams (gated tier.analytics)
                                           plus 4 shim Services
                                           (streamgl-viz, streamgl-sessions,
                                           streamgl-gpu, forge-etl-python,
                                           gated tier.viz) so the
                                           production nginx FQDN-suffixed
                                           hostnames still resolve

Removed templates (subsumed by engine)
--------------------------------------

  templates/nginx/nginx-deployment.yaml
  templates/nginx/nginx-log-exporter-configmap.yaml  (sidecar dropped)
  templates/forge-etl/forge-etl-python-daemonset.yaml
  templates/dask/dask-cuda-worker-daemonset.yaml
  templates/streamgl/streamgl-gpu-daemonset.yaml
  templates/streamgl/streamgl-gpu-deployment.yaml      (router-split design)
  templates/streamgl/streamgl-gpu-networkpolicy.yaml   (router-split design)
  templates/streamgl/streamgl-sessions-deployment.yaml
  templates/streamgl/streamgl-viz-deployment.yaml

Caddy as the L7 layer
---------------------

Caddy now load-balances browser sessions across engine-headless with
cookie stickiness (`graphistry_sticky`, HMAC-signed on
`engine.cookieSecret`) using Caddy's `dynamic a` resolver against the
headless Service for live pod-IP refresh.

New operator knobs in values.yaml:

  caddy.enabled        toggle Caddy + caddy-ingress entirely. `false`
                       lets operators front engine-headless with their
                       own ingress controller (Pattern B). The render
                       gate on caddy-cfg.yml, caddy-deployment.yaml,
                       and caddy-ingress.yml is now
                       `tier.analytics AND caddy.enabled`.
  caddy.tls.mode       `external` | `self` | `off`. external trusts
                       XFP=https from private_ranges so the sticky
                       cookie keeps Secure+SameSite=None. self
                       terminates from existingSecret or ACME. off is
                       plain HTTP only.
  caddy.lb.fallback    first-time-assignment policy when no cookie is
                       set (default round_robin).
  caddy.service        type / loadBalancerIP / nodePort /
                       nodePortHttps / annotations /
                       externalTrafficPolicy. Cloud, Tanzu, MetalLB,
                       NodePort all selectable from values.
  caddy.upstreamImage  escape hatch to use upstream `caddy:2.10-alpine`
                       directly while the bundled wrapper image lags
                       upstream PR caddyserver/caddy#6115. Liveness
                       switches to httpGet when set (no curl in the
                       official image).

Caddy Pod template gains a `checksum/config` annotation that hashes the
rendered Caddyfile ConfigMap into the Pod spec, so any change to TLS
mode / lb.fallback / cookieSecret triggers an automatic rollout on
`helm upgrade`. Without this, ConfigMap-only changes left Caddy with
stale parsed config in memory until something forced a pod bounce.

PVC tier-gating dropped (correctness fix)
-----------------------------------------

`gak-private`, `gak-public`, and `uploads-files` PersistentVolumeClaims
are no longer gated on tier.full / tier.analytics. With Retain reclaim
policy, gating made tier downgrades leave PVs Released with stale
claimRef; later upgrades created new PVCs with fresh UIDs that no
longer matched, and the new PVCs sat Pending indefinitely. Consuming
Deployments stay tier-gated; only the storage object is unconditional.

Other cleanup
-------------

  - Dask Kubernetes Operator removed: dask-cluster.yml CRD,
    `dask.operator` toggle, operator sections from all platform READMEs
    and Sphinx docs, ArgoCD app, CD subchart, chart bundler entries,
    ACR import script, dev-compose setup script.
  - Cluster mode (ENABLE_CLUSTER_MODE / IS_FOLLOWER) wiring stripped
    from all Deployment/DaemonSet templates. Legacy
    cluster/{cluster-storage,follower,global-common,leader}.yaml
    replaced by a single-namespace multi-node guide and
    cluster/retain-sc-nfs.yaml.
  - Redis Service: LoadBalancer -> ClusterIP (latent security default).
  - otel-collector Service: LoadBalancer -> ClusterIP +
    internalTrafficPolicy: Local (matches DaemonSet collector model;
    opentelemetry-operator#1401).
  - dcgm-exporter, netshoot, whoami: added missing `nodeSelector`
    blocks honouring global.nodeSelector.

values.yaml additions
---------------------

  - `engine` block: cookieSecret, uploadsScratchSizeLimit.
  - `caddy` sub-blocks: enabled, tls{mode, existingSecret, acmeEmail,
    domains}, lb{fallback}, service{type, loadBalancerIP, nodePort,
    nodePortHttps, annotations, externalTrafficPolicy}, upstreamImage.
  - `global.storage.accessMode` (default ReadWriteOnce).
  - TOPOLOGY note covering Pattern A (Caddy as L7) vs Pattern B
    (operator's ingress controller as L7) with explicit operator
    responsibilities under each pattern.

k3s example values
------------------

  - Exercises the new knobs end-to-end: caddy.enabled: true,
    caddy.tls.mode: "off", caddy.lb.fallback: round_robin,
    caddy.service.type: ClusterIP, caddy.upstreamImage:
    caddy:2.10-alpine, engine.cookieSecret,
    engine.uploadsScratchSizeLimit. Tier set to `viz`.
  - PV-cleanup runbook in the README reorganized into three explicit
    options (Graphistry-only / postgres-only / both) to prevent
    operators accidentally wiping the wrong chart's data.

Verification
------------

Two-node test bed: node1 (k3s server, NFS server colocated) and
node2 (k3s agent, NFS client). RWX storage backed by NFS, tier
`viz`, two engine pods (one per node) load-balanced by Caddy
(ClusterIP) fronted by k3s Traefik on node1.

  ETL 5-call sequence -- analytics tier
    /readarrow, /upload, /preshape, /properties, /download all returned
    200 on both nodes. Local-worker affinity hint exercised: each fep
    submission's persist/compute kwargs named the dask-cuda-worker on
    its own node (verified by HOSTNAME-prefix match against scheduler
    worker info). Behavior decayed to no-op when a fep landed on a node
    without a registered local worker.

  Session test -- viz tier, multi-node
    Browsers opened sessions against the public Caddy endpoint on
    node1. Each browser pinned to one engine pod via the
    `graphistry_sticky` cookie across page reloads and WebSocket
    reconnects; sessions on node1 were undisturbed by traffic
    landing on node2, and vice versa. Verified sticky distribution by
    tab count vs cookie value. Long-idle sessions (60s+ inactivity)
    survived without disconnect.

  Tier transitions
    Upgraded `viz -> analytics -> viz` against a release with
    non-empty PVCs. gak/uploads PVCs survived the round trip and
    re-bound to their existing PVs (PV/PVC UID stable, claimRef
    intact). Pre-fix this would have left the new PVCs Pending.

  Caddyfile config rollout
    Edited only `caddy.lb.fallback` in values and re-ran
    `helm upgrade`. Caddy Pod rolled automatically because the
    rendered Caddyfile checksum changed; ConfigMap-only edits no
    longer require manual pod bounce.
@aucahuasi changed the title from "feat(multi-node): streamgl-gpu router split, DaemonSet nginx, and distributed FS coherent forge-etl-python routing" to "feat(multi-node): engine DaemonSet bundling GPU services + Caddy sticky LB" Apr 26, 2026
…erations) + dask local-affinity wiring + cluster README rewrite

Three coherent changes shipping together to support heterogeneous K8s
clusters with mixed GPU/CPU node pools and dedicated-tenant taints.

1. Multi-node placement infrastructure
--------------------------------------

* charts/graphistry-helm/values.yaml:
  - New `engine.nodeSelector` (default `{}`). When set, the engine DaemonSet
    uses this selector instead of `global.nodeSelector` (falls back to
    global when empty). Lets operators target the engine pod at GPU-labelled
    nodes while keeping the chart's CPU-side workloads (caddy, nexus, redis,
    dask-scheduler, notebook, pivot, gak-{private,public}) on a separate
    pool.
  - New `global.tolerations` (default `[]`). Applied to every chart-
    rendered workload. Lets operators opt the chart's pods into nodes
    carrying operator-defined taints: NVIDIA GPU Operator's
    `nvidia.com/gpu=true:NoSchedule`, GKE/EKS managed GPU node-pool taints,
    dedicated-tenant taints (`dedicated=foo:NoSchedule`), etc. Empty `[]`
    keeps current behaviour byte-identical.
  - Inline docstrings explain the recommended `graphistry.io/role=gpu/cpu`
    labelling convention, the GPU-Operator and managed-pool taint patterns,
    and why tolerations are permissive (not directive) so adding a GPU-pool
    toleration to global.tolerations is harmless for non-GPU workloads.

* templates/{caddy,dask,engine,graph-app-kit,http-tools,nexus,notebook,
  pivot,redis}/*-deployment.yaml + engine-daemonset.yaml:
  Eleven templates now emit a `tolerations:` block when
  `.Values.global.tolerations` is non-empty (standard `{{- with ... }}`
  guard pattern). engine-daemonset.yaml additionally rewires nodeSelector
  to `(engine.nodeSelector | default global.nodeSelector)` so engine can
  override placement independently.
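A sketch of the two template fragments described (the with-guard and the override-with-fallback):

  # tolerations guard, emitted by all eleven templates
  {{- with .Values.global.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
  {{- end }}

  # engine-daemonset.yaml nodeSelector rewire (empty engine.nodeSelector falls back to global)
      nodeSelector: {{- (.Values.engine.nodeSelector | default .Values.global.nodeSelector) | toYaml | nindent 8 }}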

2. Cluster deployment guide rewrite
-----------------------------------

* charts/values-overrides/examples/cluster/README.md (+943 / -130):
  Near-rewrite of the multi-node deployment guide. New sections:

  - "Cluster architecture: engine DaemonSet and Caddy" -- bird's-eye view
    of the bundled engine pod, Caddy sticky LB across pods, why both
    pieces are necessary together.
  - `engine.cookieSecret` production hardening + rotation behaviour
    (existing browser sessions are dropped on secret change).
  - "Node placement: GPU vs CPU pools, dedicated tenants" -- the two
    scheduling axes (nodeSelector for where-allowed, tolerations for
    what-taints-accepted), the chart's scheduling values, recommended
    labelling convention.
  - Worked walkthrough: 5-node A100 cluster with a dedicated LLM tenant
    on node 5 (label + taint + toleration), step-by-step from cluster-
    admin labelling through values file authoring through the LLM team
    deploying their own workload separately.
  - Cost-optimised variant (separate GPU/CPU pools).
  - Three approaches for pinning CPU singletons to one specific node
    (label-one-node / pin-by-hostname / dedicated-pin-label), with
    tradeoffs and gotchas, plus a list of approaches not recommended.
  - Defensive matrix: what each safety net catches (selector-only vs
    selector+taint vs taint-without-toleration).
  - Common per-distro pre-baked taints reference.
  - Verification commands matrix to confirm placement after install.

  The existing NFS / `global.storage.accessMode: ReadWriteMany` Step 4
  content is preserved.

3. Dask local-affinity wiring
-----------------------------

* templates/engine/engine-daemonset.yaml (forge-etl-python container env
  block): set `GRAPHISTRY_DASK_LOCAL_AFFINITY=1`. Wires the chart-side
  half of the dask local-worker affinity feature shipping in
  graphistry_master PR #3097. fep's `server/util/dask_affinity.py:
  persist_kwargs()` reads this env var to gate per-submission
  `scheduler_info()` round-trips and pin `client.persist`/`compute` calls
  to the co-located dask-cuda-worker on the same node (~2x ETL speedup
  empirically validated at 4 MiB / 92 MiB / 512 MiB upload sizes). Hint
  is soft -- `allow_other_workers=True` is always set fep-side, so the
  scheduler still falls back to a remote worker when the local one is
  busy or down. Hardcoded "1" matches the precedent of other engine-
  DaemonSet-required constants like `OTEL_SERVICE_NAME` and `PORT`. If
  the env var is missing, fep's helper short-circuits and the deployment
  runs byte-identically to pre-change.
@aucahuasi changed the title from "feat(multi-node): engine DaemonSet bundling GPU services + Caddy sticky LB" to "feat(multi-node): engine DaemonSet bundling + Caddy sticky LB + GPU/CPU pool placement" Apr 26, 2026
…loss on mixed pools

Bug
---
The graph-app-kit-{public,private} Deployments and the Jupyter notebook
Deployment hardcoded `nodeSelector: global.nodeSelector` with no override
hook. On a mixed GPU/CPU node-pool deployment -- the recommended pattern
where global.nodeSelector pins the platform-tier majority (caddy, nexus,
redis, dask-scheduler) to a CPU pool -- this routed the GPU-bound gak
Streamlit views and the notebook RAPIDS kernel to CPU nodes. Pods came up
green but lost CUDA capability:

  - graph-app-kit (both public + private) runs the same RAPIDS-on-CUDA
    runtime as forge-etl-python (cudf, pygraphistry); the Streamlit
    dashboards drive graph layout against the GPU stack. compose pins
    both `runtime: nvidia` for this reason.
  - notebook ships a `Python 3.8 (RAPIDS)` ipykernel (cudf/cuml/cugraph)
    that fails to load CUDA when the kernel container has no GPU
    access. compose pins it `runtime: nvidia`.

The hardcoded selector blocked any per-workload override, so operators
running mixed pools could not pin gak/notebook to GPU nodes without
forking the chart.

Fix
---
Three templates now use the override-with-fallback Helm idiom:

  templates/graph-app-kit/graph-app-kit-public-deployment.yaml
  templates/graph-app-kit/graph-app-kit-private-deployment.yaml
  templates/notebook/notebook-deployment.yaml

      nodeSelector: {{- (.Values.X.nodeSelector | default .Values.global.nodeSelector)
                       | toYaml | nindent 8 }}

New Helm values (default `{}`):

  - graphAppKit.nodeSelector  -- applies to both gak Deployments
  - notebook.nodeSelector     -- applies to the notebook Deployment

Empty default falls back to global.nodeSelector, so single-pool
deployments are byte-identical to pre-fix. Mixed-pool operators set
these explicitly to a GPU-labelled pool; the values.yaml docstrings
spell out the pattern with a `graphistry.io/role: gpu` example.

Pivot is intentionally not changed: it is a Node.js HTTP-only service
that embeds the streamgl-viz iframe in browser-side pages. No GPU
bindings server-side, so it stays correctly tied to global.nodeSelector.

Docs
----
  - cluster/README.md: corrected the GPU-vs-CPU classification (notebook
    and gak were previously listed as CPU-only, which was wrong -- both
    are GPU-bound). Split the topology diagram into GPU-bound vs
    CPU-only boxes, added a per-workload breakdown table, and added a
    defensive matrix row describing the silent-runtime-regression
    failure mode this fix addresses.
  - troubleshooting.md: same misclassification fix in the multi-node
    placement guidance.
  - k3s/k3s_example_values.yaml: illustrative usage of the two new
    nodeSelector overrides for the mixed-pool example.
  - CHANGELOG.md: bug-fix entries for both new values.
Telemetry now ships as a properly-structured Helm subchart consumed via
dependencies + condition: global.ENABLE_OPEN_TELEMETRY, with a top-level
`telemetry:` block in the parent's values.yaml for subchart-specific knobs
and `global.*` reserved for cross-cutting values (OTLP endpoint, instance,
image-pull, scheduling, storage class). The shared `_helpers.tpl` is
extracted into a new `graphistry-common` library subchart so the telemetry
subchart can reuse `graphistry.tier.*` gating via a dependency entry.

Multi-node Prometheus scraping is fixed. The dcgm-exporter and
node-exporter jobs switched from Service-VIP `static_configs` to
`kubernetes_sd_configs` with `role: pod` and relabel rules that set the
`node` label from `pod_node_name` and `__address__` from `pod_ip`. With
the old static-target config, kube-proxy load-balanced to one DaemonSet
pod per scrape and silently dropped the other nodes' GPU/host metrics.
Validated on a 2-node 3-GPU k3s cluster: pre-fix 1 GPU visible, post-fix
all 3 with stable per-`node` labels. New `templates/prometheus-rbac.yaml`
(ServiceAccount + namespace-scoped Role + RoleBinding for read-only
`pods`/`services`/`endpoints`) covers apiserver discovery; the prometheus
pod sets `automountServiceAccountToken: true` explicitly so locked-down
clusters (some Tanzu/OpenShift profiles, hardened GKE namespaces) that
flip the default to false don't silently break discovery.

Telemetry-stack persistence is normalized. Prometheus is now a
single-replica Deployment with `Recreate` strategy + RWO PVC (was
per-node DaemonSet, no global view, no retention) with
`telemetry.prometheus.retention=15d` and a new
`telemetry.prometheus.enableAdminAPI` knob (default false; mirrors
kube-prometheus-stack's default; flip in dev to drop stale series after
scrape-config refactors). Grafana and Jaeger gain
`persistence.{enabled,size,storageClassName}` blocks matching prometheus.
The otel-collector cloud and self-hosted ConfigMaps are merged into one
templated ConfigMap with cloud-mode credentials sourced via
`secretKeyRef`. New `useExternal` + `externalEndpoint` knobs on
`telemetry.dcgmExporter` and `telemetry.nodeExporter` let operators
defer to NVIDIA GPU Operator's exporter (mandatory on GKE
Container-Optimized OS, where the bundled image fails) or
kube-prometheus-stack's node-exporter, avoiding double DaemonSets.
`checksum/config` annotations on otel-collector / prometheus / grafana /
jaeger pod templates make `helm upgrade` actually roll workloads when
their ConfigMaps change.

Caddy stability: two crash modes fixed. The single-line
`handle @grafana { reverse_proxy ... }` blocks tripped Caddy v2's parser
("Unexpected next token after '{' on same line") -- expanded to
multi-line form. The `/caddy/health/` endpoint was being routed by the
catch-all `handle` to `engine-headless` (no endpoints) after `respond`
wrote 200, because `respond` is non-terminal in Caddy v2; wrapped in a
terminal `handle /caddy/health/ { respond 200 ... }` to break the 30s
SIGTERM-then-restart liveness loop. New
`templates/caddy/_helpers.tpl` defines (`graphistry.caddy.healthHandle`,
`graphistry.caddy.telemetryHandles`) remove 3-4x duplication of identical
handle blocks across the `tls.mode = external | self | off` branches in
`caddy-cfg.yml`.

Per-component telemetry Ingresses (grafana/jaeger/prometheus) are removed
in favour of Caddy path routes on the parent chart's main Ingress;
NOTES.txt updates "Ingress paths" wording to "Caddy paths". The gke
README's "Fix DCGM GPU Metrics on GKE" section is rewritten to use
`telemetry.dcgmExporter.useExternal: true` instead of an out-of-band
`kubectl patch`. The k3s example values exercise the new caddy/engine
knobs and enable telemetry self-hosted by default.
Adds a platform-tier-only nginx Deployment + Service (`nexus-proxy`) that
fronts nexus and provides the v1-to-v2 endpoint rewrites and deprecated-
endpoint 410 shims that live in the `graphistry/nginx` image. Renders only
when `global.tier == "platform"` (exact match, not `>=`); at analytics+
the engine DaemonSet's nginx container plays the same role with intra-pod
localhost dispatch to streamgl/forge backends, so the standalone
Deployment would be redundant at higher tiers.

The platform tier now ships a deployable slice of postgres + nexus +
nexus-proxy with no transitive dependencies on analytics-tier services.
The Deployment's only init container waits for the nexus Service alone
(not redis / dask-scheduler / streamgl) so the slice is genuinely
self-contained. Reuses existing values and PVCs: `NginxResources`,
`global.{nodeSelector,tolerations,imagePullSecrets,restartPolicy}`,
postgres secret refs, and the `local-media-mount` + `data-mount` PVCs.
The image's content-directory expectations for `/streamgl`, `/pivot`,
and `/upload` paths are satisfied by emptyDir mounts -- those routes
return 404 at platform tier, which is the correct behaviour when the
backends don't exist.

End-to-end validation at `tier=platform`: the deprecation shims (`/etl`,
`/api/check`, `/api/encrypt`, `/api/decrypt`, `/api/v1/etl/vgraph/*`)
return 410 Gone with the documented upgrade message; live `/api/v1/*`
routes (`/datasets/`, `/files/`, `/organization/`, `/team/`,
`/named-endpoint/`) return 200 with real data after a Django session
login at `/accounts/login/`; v2 ETL routes (`/api/v2/etl/vgraph/`,
`/api/v2/etl/datasets/<id>/{gfql,kepler}/...`) route through nginx as
expected (upstream errors at platform tier because GPU backends are
absent by design). Switching back to `tier=analytics` correctly removes
the standalone nexus-proxy in favour of the engine DaemonSet's nginx
container.

NOTES.txt is now tier-aware: at platform it lists `nexus-proxy` in
DEPLOYED SERVICES, prints both in-cluster URLs and a
`port-forward --address 0.0.0.0` recipe for browser access, and skips
the ACCESS and TELEMETRY blocks (Caddy and the telemetry stack don't
render at this tier). At analytics+ both blocks render as before. The
tier descriptions are also updated to reflect the engine-DaemonSet
collapse from 0.4.4 -- the single DaemonSet with tier-conditional
sibling containers replaces the per-service Deployments listed in the
old text.

Documentation updated to match: the k3s README's Deployment Tiers
section now documents `nexus-proxy` in the platform-tier row and
rewrites analytics+ rows around the engine DaemonSet shape; the
"Services per tier" table replaces standalone
`nginx`/`forge-etl-python`/`dask-cuda-worker`/`streamgl-*` rows with
`engine` DaemonSet container rows showing the tier-conditional sibling
set. troubleshooting.md Section 9 "Accessing Graphistry" gains a new
"Tier matters for the access path" subsection covering the platform-
tier path (no Caddy/Ingress; `nexus-proxy` is the entry point;
port-forward + curl smoke tests for v1/v2 routing); the existing
Caddy/Ingress flow is scoped as "Verification (analytics+ tier)".

The k3s example values file's tier comment records the platform-tier
validation run and the port-forward recipe. The active value is
`tier: "analytics"` for normal usage.
Caddy now renders at every tier (was tier >= analytics) so platform-tier
deployments share the same TLS / Ingress / caddy.service.{type,...}
machinery as analytics+. Tier promotions no longer churn external-facing
config: the Service name, Ingress, port-forward target, and cert-manager
wiring are byte-identical from platform through full. The catch-all
reverse_proxy upstream switches by tier via a new
graphistry.caddy.upstreamHandle helper -- nexus-proxy at platform tier
(single replica, no sticky-cookie ceremony; mirrors main-branch's
caddy -> nginx:80 production shape, just retargeted at nexus-proxy which
carries the v1-to-v2 endpoint shims), dynamic engine-headless with HMAC
sticky-cookie LB at analytics+. The 3 inline catch-all blocks across the
external/self/off tlsMode branches collapse to one include each.

Adds ingress.enabled (default true) for operators whose external L7 is
at the Service layer rather than the K8s Ingress layer (Brad/Dell on
Tanzu NSX-T pointing an external LB at caddy.service.type=NodePort,
BBAI's Pulumi-managed LB, service mesh, or dev port-forward). Default
true preserves today's behavior so existing values overlays render an
unchanged Ingress. The Bitnami-style ingress.{className, hosts, tls,
annotations} keys are intentionally not added; equivalents already exist
under global.ingressClassName, global.domain, caddy.tls.{existingSecret,
acmeEmail, domains}, caddy.service.annotations, and
ingress.management.annotations. values.yaml now carries an explicit
Bitnami-shape -> chart-shape mapping table so operators coming from
other charts can find the right keys.

Drive-by fix to a pre-existing bug: ingress.management.annotations was
documented and exemplified with service.beta.kubernetes.io/* cloud-LB
annotations, but those are Service-resource annotations -- they're inert
on an Ingress. Replaced the comment + commented-out examples with
annotations that actually take effect on an Ingress (nginx-ingress
internal class, ALB scheme, GKE gce-internal, cert-manager cluster
issuer, body-size). Cloud-LB annotations belong on
caddy.service.annotations, where they're already correctly documented.

Verified on jorge7 k3s at platform tier: all 5 v1 deprecation shim paths
return byte-identical 410s through Caddy vs. direct nexus-proxy, /
returns 200, /accounts/login/ returns 200, Caddy /caddy/health/ returns
{"success": true}, Ingress is picked up by Traefik. helm template lints
clean across 4 tiers x 3 tls.mode combinations.
`make html` now copies the 10 canonical chart READMEs into
docs/source/ with relative-link rewriting, runs frigate via a Python
wrapper that bypasses its broken --no-deps CLI (frigate archived
2024-12), and renders the result. 7 stale hand-written .rst pages
deleted; 3 frigate-generated .rst files gitignored as derived
artifacts. index.rst restructured into 5 captioned sections;
10mins-to-k8s.rst rewritten as platform-agnostic 8-step skeleton
with placeholders.

Other docs cleanup: TROUBLESHOOTING.md moved to repo root,
CLUSTER.md into the chart dir; new postgres-cluster, cd, aks
READMEs; top-level README repository-structure HTML -> nested
markdown list. The net effect is a substantially cleaner docs organization.
@aucahuasi requested a review from albarralnunez April 28, 2026 09:00
…nt guide

Telemetry storage fixes
- Prometheus, Jaeger, and Grafana pod templates now set securityContext.fsGroup
  per workload (65534 / 10001 / 472, matching each upstream image's hardcoded
  UID); kubelet chowns the PVC mount on first attach. Pre-fix the three pods
  crashlooped with permission-denied on Longhorn (and any other block CSI
  driver) because the data directory inherited root:root 0755 ownership and
  the non-root container processes could not write to it. Configurable per
  workload via telemetry.<workload>.securityContext.fsGroup; set to {} to
  disable on NFS deployments that prefer chmod 0777 + no_root_squash.

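The resulting values shape (UIDs from the upstream images as listed above):

  telemetry:
    prometheus:
      securityContext:
        fsGroup: 65534   # nobody, matching the upstream prometheus image
    jaeger:
      securityContext:
        fsGroup: 10001
    grafana:
      securityContext:
        fsGroup: 472
    # set any securityContext block to {} to disable (e.g. NFS with no_root_squash)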

Longhorn deployment documentation (charts/graphistry-helm/CLUSTER.md)
- New end-to-end Longhorn 1.11.x section: architecture (control plane vs data
  plane, CSI vs iSCSI vs iscsid), when to choose Longhorn over NFS, per-node
  prerequisites (open-iscsi / iscsid running, ext4/XFS data path, shared mount
  propagation), Helm install with defaultReplicaCount=2, defaultDataPath, and
  persistence.defaultClass=false, Node CR multi-disk patches, retain-sc-longhorn
  StorageClass creation (RWO replicated block) plus optional RWX share-manager
  class, smoke test, and cleanup runbook.
- Top-level README volumeBindingMode table grew a Replicated-block row covering
  Longhorn RWO, Rook-Ceph RBD, OpenEBS Mayastor / cStor, and Portworx; the
  existing Longhorn entry now reads "Longhorn RWX (share-manager NFSv4.1
  re-export)" so the two modes are no longer conflated.

Docs style sweep across all six READMEs (top-level, postgres-cluster,
graphistry-helm, CLUSTER, telemetry, TROUBLESHOOTING): semantic chart-name +
section-name link text replaces file-path-as-link-text in prose; em-dashes,
double-dashes, and prose arrows replaced with semicolons, colons, parentheses,
and English connectives in narrative prose; tables, ASCII diagrams, and code
blocks keep their structural symbols. Five pre-existing broken inbound links
(../troubleshooting.md, ../cluster/README.md, ../k3s/, ../gke/, ../tanzu/)
repaired in-flight.
@aucahuasi marked this pull request as ready for review April 29, 2026

@albarralnunez left a comment
LGTM

