From 89b97fceada3c9fa1259e5fa5c31fee45592c005 Mon Sep 17 00:00:00 2001
From: MengjieLee <limengjiejj@hotmail.com>
Date: Mon, 22 Jun 2026 07:52:40 +0000
Subject: [PATCH] Add A100 validation documentation

Document a reproducible NVIDIA A100 smoke validation run and exclude local agent workspace artifacts from version control.
---
 .gitignore                                    |   7 +
 docs/.nav.yml                                 |   1 +
 .../a100_benchmark_notes.ipynb                | 301 ++++++++++++++++++
 .../nvidia-a100-validation/index.md           | 152 +++++++++
 4 files changed, 461 insertions(+)
 create mode 100644 docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb
 create mode 100644 docs/getting_started/nvidia-a100-validation/index.md

diff --git a/.gitignore b/.gitignore
index 442e699..fc72def 100644
--- a/.gitignore
+++ b/.gitignore
@@ -206,3 +206,10 @@ cython_debug/
 marimo/_static/
 marimo/_lsp/
 __marimo__/
+
+# Local personal workspace and agent artifacts
+/.agents/
+/.claude/
+/local_workspace/
+/ultragoal/
+/skills-lock.json
diff --git a/docs/.nav.yml b/docs/.nav.yml
index e9ebaf0..fcb8777 100644
--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@@ -8,6 +8,7 @@ nav:
       - getting_started/quickstart.md
       - getting_started/installation.md
       - getting_started/faq.md
+      - getting_started/nvidia-a100-validation/index.md
     - Operators:
       - operators/README.md
       - operators/fused-logp.md
diff --git a/docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb b/docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb
new file mode 100644
index 0000000..d6ec9a7
--- /dev/null
+++ b/docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb
@@ -0,0 +1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# A100 Smoke Benchmark Notes\n",
+    "\n",
+    "This notebook records a lightweight RL-Kernel benchmark validation run on one isolated NVIDIA A100 device. The goal is to make the benchmark process reproducible and to validate whether the generated reports accurately describe the backend that was measured.\n",
+    "\n",
+    "Scope:\n",
+    "\n",
+    "- Record visible GPU hardware and runtime state.\n",
+    "- Select exactly one idle A100 with `CUDA_VISIBLE_DEVICES`.\n",
+    "- Record physical GPU index and logical CUDA mapping.\n",
+    "- Run a minimal benchmark shape to validate the profiling path.\n",
+    "- Capture generated CSV/JSON reports.\n",
+    "- Compare benchmark status fields with runtime logs.\n",
+    "- Keep latency and throughput observations as smoke diagnostics only.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Benchmark Method\n",
+    "\n",
+    "This run uses a smoke-scale configuration. It is intended to validate benchmark wiring and report semantics, not to produce final performance claims.\n",
+    "\n",
+    "Method:\n",
+    "\n",
+    "- Select one idle physical A100 device with `CUDA_VISIBLE_DEVICES=0`.\n",
+    "- Confirm that PyTorch sees the selected physical GPU as logical CUDA device 0.\n",
+    "- Use a small tensor shape to keep the run quick and easy to repeat.\n",
+    "- Use `warmup=0` and `repeat=1` for initial validation.\n",
+    "- Treat latency and throughput values as diagnostic observations only.\n",
+    "- Use larger shapes and repeated runs before publishing performance numbers.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "id": "2dnihuovf6s",
+   "source": "from pathlib import Path\nimport os\n\ncwd = Path.cwd().resolve()\nrepo_root = next(\n    (path for path in (cwd, *cwd.parents) if (path / 'scripts/run_profile_suite.py').exists()),\n    None,\n)\nif repo_root is None:\n    raise FileNotFoundError(\n        'Could not locate the RL-Kernel repository root containing scripts/run_profile_suite.py. '\n        'Run this notebook from inside the cloned repository.'\n    )\n\nos.chdir(repo_root)\nprint('repo_root', repo_root)",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "1, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "2, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "3, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "4, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "5, NVIDIA A100 80GB PCIe, 0, 81920, 0\n",
+      "6, NVIDIA A100 80GB PCIe, 3, 81920, 0\n",
+      "7, NVIDIA A100 80GB PCIe, 0, 81920, 0\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Check visible GPU state before selecting a device.\n",
+    "!nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CUDA_VISIBLE_DEVICES 0\n",
+      "python 3.12.3 (main, Jan  8 2026, 11:30:50) [GCC 13.3.0]\n",
+      "torch 2.9.1+cu129\n",
+      "cuda_available True\n",
+      "cuda_version 12.9\n",
+      "hip_version None\n",
+      "device_count 1\n",
+      "logical_device 0 NVIDIA A100 80GB PCIe (8, 0)\n",
+      "rocminfo_path None\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Record the one-GPU CUDA mapping used for validation.\n",
+    "import os, shutil, sys\n",
+    "print('CUDA_VISIBLE_DEVICES', os.environ.get('CUDA_VISIBLE_DEVICES'))\n",
+    "print('python', sys.version)\n",
+    "try:\n",
+    "    import torch\n",
+    "    print('torch', torch.__version__)\n",
+    "    print('cuda_available', torch.cuda.is_available())\n",
+    "    print('cuda_version', torch.version.cuda)\n",
+    "    print('hip_version', getattr(torch.version, 'hip', None))\n",
+    "    print('device_count', torch.cuda.device_count())\n",
+    "    for i in range(torch.cuda.device_count()):\n",
+    "        print('logical_device', i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))\n",
+    "except Exception as exc:\n",
+    "    print('torch_error', repr(exc))\n",
+    "print('rocminfo_path', shutil.which('rocminfo'))\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Smoke Benchmark Command\n",
+    "\n",
+    "The command below uses a minimal workload to exercise the benchmark runner, backend registry, CSV writer, and JSON writer. It is useful for checking whether the profiler records the correct backend status before running larger benchmark suites.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "INFO 06-17 16:56:37 [RL-Kernel]: RL-Engine initialized with NVIDIA CUDA backend (Version: 12.9)\n",
+      "INFO 06-17 16:56:46 [RL-Kernel]: Profiling 'logp_native' | shape=(2,16,1024)\n",
+      "INFO 06-17 16:56:46 [RL-Kernel]: KernelRegistry initialized for cuda\n",
+      "INFO 06-17 16:56:47 [RL-Kernel]: Successfully linked to precompiled _C.fused_logp fallback kernel.\n",
+      "INFO 06-17 16:56:47 [RL-Kernel]: Profiling 'logp_fused' | shape=(2,16,1024)\n",
+      "INFO 06-17 16:56:47 [RL-Kernel]: Profiling 'sampling_native' | shape=(2,1024)\n",
+      "INFO 06-17 16:56:48 [RL-Kernel]: Performance report saved to reports/a100-smoke-stage1/perf_report_NVIDIA_A100_80GB_PCIe_20260617_165648.json\n",
+      "INFO 06-17 16:56:48 [RL-Kernel]: CSV metrics appended to reports/a100-smoke-stage1/perf_report_NVIDIA_A100_80GB_PCIe.csv\n",
+      "\n",
+      "========================================================================================================================\n",
+      "                                    RL-KERNEL AUTOMATED PERFORMANCE PROFILING SUITE                                     \n",
+      "GPU: NVIDIA A100 80GB PCIe | Arch: Ampere | Backend: cuda\n",
+      "========================================================================================================================\n",
+      "╒═════════════════╤═════════╤══════════╤═════════╤═══════════════╤══════════════╤══════════╤═════════════════╤══════════╕\n",
+      "│ Benchmark       │   Batch │   SeqLen │   Vocab │   Latency(ms) │ Tokens/sec   │   TFLOPS │   Peak VRAM(GB) │ Status   │\n",
+      "╞═════════════════╪═════════╪══════════╪═════════╪═══════════════╪══════════════╪══════════╪═════════════════╪══════════╡\n",
+      "│ logp_native     │       2 │       16 │    1024 │        597.3  │ 54           │        0 │               0 │ pass     │\n",
+      "├─────────────────┼─────────┼──────────┼─────────┼───────────────┼──────────────┼──────────┼─────────────────┼──────────┤\n",
+      "│ logp_fused      │       2 │       16 │    1024 │          3.69 │ 8,671        │        0 │               0 │ pass     │\n",
+      "├─────────────────┼─────────┼──────────┼─────────┼───────────────┼──────────────┼──────────┼─────────────────┼──────────┤\n",
+      "│ sampling_native │       2 │       16 │    1024 │       1076.81 │ 30           │        0 │               0 │ pass     │\n",
+      "╘═════════════════╧═════════╧══════════╧═════════╧═══════════════╧══════════════╧══════════╧═════════════════╧══════════╛\n",
+      "========================================================================================================================\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!CUDA_VISIBLE_DEVICES=0 python3 scripts/run_profile_suite.py \\\n",
+    "  --device cuda \\\n",
+    "  --dtype float32 \\\n",
+    "  --batch-sizes 2 \\\n",
+    "  --seq-lens 16 \\\n",
+    "  --vocab-sizes 1024 \\\n",
+    "  --workloads logp-native,logp-fused,sampling-native \\\n",
+    "  --warmup 0 \\\n",
+    "  --repeat 1 \\\n",
+    "  --output-dir reports/a100-smoke-stage1 \\\n",
+    "  --csv \\\n",
+    "  --json\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Equivalent command from the repository root:\n",
+    "\n",
+    "```bash\n",
+    "CUDA_VISIBLE_DEVICES=0 python3 scripts/run_profile_suite.py \\\n",
+    "  --device cuda \\\n",
+    "  --dtype float32 \\\n",
+    "  --batch-sizes 2 \\\n",
+    "  --seq-lens 16 \\\n",
+    "  --vocab-sizes 1024 \\\n",
+    "  --workloads logp-native,logp-fused,sampling-native \\\n",
+    "  --warmup 0 \\\n",
+    "  --repeat 1 \\\n",
+    "  --output-dir reports/a100-smoke-stage1 \\\n",
+    "  --csv \\\n",
+    "  --json\n",
+    "```\n",
+    "\n",
+    "Generated outputs include:\n",
+    "\n",
+    "- `perf_report_NVIDIA_A100_80GB_PCIe.csv`\n",
+    "- `perf_report_NVIDIA_A100_80GB_PCIe_20260617_165648.json`\n",
+    "\n",
+    "These generated report artifacts were used for inspection and are not final documentation artifacts.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "    benchmark_name               gpu_name gpu_architecture gpu_backend  gpu_compute_capability  gpu_device_index  batch_size  seq_len  vocab_size   latency_ms  tokens_per_sec  peak_vram_gb status  notes\n",
+       "0      logp_native  NVIDIA A100 80GB PCIe           Ampere        cuda                     8.0                 0           2       16        1024   597.303284       53.574124      0.000367   pass    NaN\n",
+       "1       logp_fused  NVIDIA A100 80GB PCIe           Ampere        cuda                     8.0                 0           2       16        1024     3.690496     8670.921274      0.000246   pass    NaN\n",
+       "2  sampling_native  NVIDIA A100 80GB PCIe           Ampere        cuda                     8.0                 0           2       16        1024  1076.806641       29.717499      0.000073   pass    NaN\n"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from pathlib import Path\n",
+    "import pandas as pd\n",
+    "\n",
+    "csv_path = Path('reports/a100-smoke-stage1/perf_report_NVIDIA_A100_80GB_PCIe.csv')\n",
+    "df = pd.read_csv(csv_path)\n",
+    "df[[\n",
+    "    'benchmark_name',\n",
+    "    'gpu_name',\n",
+    "    'gpu_architecture',\n",
+    "    'gpu_backend',\n",
+    "    'gpu_compute_capability',\n",
+    "    'gpu_device_index',\n",
+    "    'batch_size',\n",
+    "    'seq_len',\n",
+    "    'vocab_size',\n",
+    "    'latency_ms',\n",
+    "    'tokens_per_sec',\n",
+    "    'peak_vram_gb',\n",
+    "    'status',\n",
+    "    'notes',\n",
+    "]]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fznxipa9ag",
+   "metadata": {},
+   "source": [
+    "## Result Summary\n",
+    "\n",
+    "This smoke run completed successfully on physical GPU 0, exposed to PyTorch as logical CUDA device 0 through `CUDA_VISIBLE_DEVICES=0`. The machine-readable reports record all three requested workloads as `status = pass`.\n",
+    "\n",
+    "The `logp_fused` row should still be interpreted conservatively. The runtime log says `Successfully linked to precompiled _C.fused_logp fallback kernel.`, so this run validates only that the profiler can execute and report the available `_C.fused_logp` path in this environment. It is not a full production fused-kernel benchmark, and the single-iteration smoke timing is not a publishable performance number.\n",
+    "\n",
+    "The generated CSV/JSON reports were inspected for status fields and backend metadata. They should remain disposable validation artifacts unless a later stage explicitly promotes them.\n",
+    "\n",
+    "## Next Checks\n",
+    "\n",
+    "For stricter NVIDIA validation, build the extension and require fused dispatch explicitly:\n",
+    "\n",
+    "```bash\n",
+    "MAX_JOBS=2 python setup.py build_ext --inplace\n",
+    "python examples/grpo_single_gpu.py \\\n",
+    "  --device cuda \\\n",
+    "  --require-fused-logp \\\n",
+    "  --steps 2 \\\n",
+    "  --num-prompts 1 \\\n",
+    "  --samples-per-prompt 2 \\\n",
+    "  --prompt-len 2 \\\n",
+    "  --completion-len 3 \\\n",
+    "  --vocab-size 16 \\\n",
+    "  --hidden-dim 8\n",
+    "```\n",
+    "\n",
+    "Only report that strict path as validated after running it on the target hardware.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
\ No newline at end of file
diff --git a/docs/getting_started/nvidia-a100-validation/index.md b/docs/getting_started/nvidia-a100-validation/index.md
new file mode 100644
index 0000000..e41de01
--- /dev/null
+++ b/docs/getting_started/nvidia-a100-validation/index.md
@@ -0,0 +1,152 @@
+# NVIDIA A100 Environment Validation
+
+This page records one verified RL-Kernel onboarding run on an NVIDIA A100 cluster node.
+It is a concrete reference for CUDA users, not a general hardware support matrix.
+
+The accompanying notebook is available at [a100_benchmark_notes.ipynb](a100_benchmark_notes.ipynb).
+
+## Verified Environment
+
+| Item | Observed value |
+| --- | --- |
+| GPUs | 8 x NVIDIA A100 80GB PCIe |
+| NVIDIA driver | 535.230.02 |
+| `nvidia-smi` CUDA version | 12.2 |
+| Python | 3.12.3 |
+| PyTorch | 2.9.1+cu129 |
+| `torch.cuda.is_available()` | `True` |
+| PyTorch CUDA devices | 8 under full-node visibility; smoke validation isolated one physical A100 with `CUDA_VISIBLE_DEVICES=0`, so PyTorch reported `device_count 1` and exposed it as logical CUDA device 0 |
+| Device capability | 8.0 |
+| ROCm | unavailable: `rocminfo` not found, `torch.version.hip` is `None` |
+
+## Environment Probe Commands
+
+```bash
+nvidia-smi
+```
+
+```bash
+python - <<'PY'
+import sys
+print('python', sys.version)
+try:
+    import torch
+    print('torch', torch.__version__)
+    print('cuda_available', torch.cuda.is_available())
+    print('cuda_version', torch.version.cuda)
+    print('hip_version', getattr(torch.version, 'hip', None))
+    if torch.cuda.is_available():
+        print('device_count', torch.cuda.device_count())
+        for i in range(torch.cuda.device_count()):
+            print('device', i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
+except Exception as exc:
+    print('torch_error', repr(exc))
+PY
+```
+
+## Commands Run
+
+### A100 Profiler Smoke Command
+
+The notebook run used one isolated physical GPU (index 0), exposed to PyTorch as logical CUDA device 0:
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python3 scripts/run_profile_suite.py \
+  --device cuda \
+  --dtype float32 \
+  --batch-sizes 2 \
+  --seq-lens 16 \
+  --vocab-sizes 1024 \
+  --workloads logp-native,logp-fused,sampling-native \
+  --warmup 0 \
+  --repeat 1 \
+  --output-dir reports/a100-smoke-stage1 \
+  --csv \
+  --json
+```
+
+Result excerpt:
+
+```text
+GPU: NVIDIA A100 80GB PCIe | Arch: Ampere | Backend: cuda
+logp_native      batch=2 seq=16 vocab=1024 status=pass latency=597.30 ms
+logp_fused       batch=2 seq=16 vocab=1024 status=pass latency=3.69 ms
+sampling_native  batch=2 seq=16 vocab=1024 status=pass latency=1076.81 ms
+```
+
+The run log reported `Successfully linked to precompiled _C.fused_logp fallback kernel.` The `logp_fused` row is therefore a smoke validation of the available `_C.fused_logp` path in this environment, not a publishable full benchmark claim.
+
+### Dispatch Smoke Tests
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python -m pytest rl_engine/tests/test_dispatch.py -v
+```
+
+Result:
+
+```text
+3 passed in 14.02s
+```
+
+### Profiler Unit Tests
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python -m pytest tests/test_profiler.py -v
+```
+
+Result:
+
+```text
+9 passed in 10.67s
+```
+
+### Documentation Build
+
+```bash
+mkdocs build --strict -f mkdocs.yaml
+```
+
+Result:
+
+```text
+Documentation built successfully in strict mode.
+```
+
+The build emitted the pre-existing Material for MkDocs 2.0 warning and reported that
+`operators/grpo-loss.md` is not included in the nav configuration.
+
+## What This Validates
+
+- PyTorch can see and use CUDA on this A100 node.
+- RL-Kernel dispatch smoke tests pass in this environment.
+- RL-Kernel profiler unit tests pass in this environment.
+- The small A100 profiler smoke run completes with the requested `logp-native`, `logp-fused`, and `sampling-native` workloads.
+- The documentation site builds in strict mode.
+
+## What This Does Not Claim
+
+- This does not validate AMD ROCm; ROCm was unavailable on this node.
+- This does not validate H100, SM90, or TMA-specific fused LogP behavior; A100 is SM80.
+- This does not reproduce the full benchmark tables in the project README.
+- This does not claim that every CUDA, driver, or PyTorch combination is supported.
+- This does not validate strict fused mode with `--require-fused-logp`; run the strict command separately when that is the target.
+
+## Next Checks
+
+For stricter NVIDIA validation, build the extension and require fused dispatch explicitly:
+
+```bash
+MAX_JOBS=2 python setup.py build_ext --inplace
+python examples/grpo_single_gpu.py \
+  --device cuda \
+  --require-fused-logp \
+  --steps 2 \
+  --num-prompts 1 \
+  --samples-per-prompt 2 \
+  --prompt-len 2 \
+  --completion-len 3 \
+  --vocab-size 16 \
+  --hidden-dim 8
+```
+
+Only report that strict path as validated after running it on the target hardware.