Create-Inc · danielchen0 · May 17, 2026
diff --git a/.prettierignore b/.prettierignore
@@ -1,3 +1,4 @@
 dist/
 node_modules/
 package-lock.json
+paper/eval/results/
diff --git a/knip.json b/knip.json
@@ -1,3 +1,3 @@
 {
-  "ignore": ["dist/**"]
+  "ignore": ["dist/**", "paper/eval/results/**"]
 }
diff --git a/package.json b/package.json
@@ -21,6 +21,10 @@
     "lint:fix": "eslint --fix . && prettier --write .",
     "format": "prettier --write .",
     "format:check": "prettier --check .",
+    "eval:prompt-grid": "npm run build && tsx scripts/run-prompt-grid-eval.ts",
+    "eval:repair-loop": "npm run build && tsx scripts/run-repair-loop-eval.ts",
+    "paper:stats": "tsx scripts/paper-stats.ts",
+    "paper:tables": "tsx scripts/paper-stats.ts --eval paper/eval/artifacts/full-grid-2026-05-17/results.json --latex-out paper/generated/full-grid-tables.tex --repair-eval paper/eval/artifacts/repair-loop-2026-05-27/results.json --repair-latex-out paper/generated/repair-loop-tables.tex",
     "sync": "tsx scripts/sync.ts",
     "sync:check": "tsx scripts/sync.ts && git diff --exit-code -- src/rules/index.ts README.md"
   },

diff --git a/paper/.gitignore b/paper/.gitignore
@@ -0,0 +1,9 @@
+*.aux
+*.bbl
+*.blg
+*.fdb_latexmk
+*.fls
+*.log
+*.out
+*.pdf
+eval/results/
diff --git a/paper/Makefile b/paper/Makefile
@@ -0,0 +1,9 @@
+PDF=main.pdf
+
+.PHONY: all clean
+
+all:
+	latexmk -pdf -interaction=nonstopmode main.tex
+
+clean:
+	latexmk -C main.tex
diff --git a/paper/README.md b/paper/README.md
@@ -0,0 +1,124 @@
+# Laint Paper Draft
+
+This directory contains an initial arXiv-style paper draft for laint.
+
+## Current Shape
+
+The draft is intentionally framed as a research/tool paper, not a product announcement. The strongest publishable angle is:
+
+> Agent-oriented linting for generated JSX/TSX applications catches framework-specific web, mobile, and backend failures earlier than conventional build/type/runtime feedback.
+
+## Before Submission
+
+- Add real authors and affiliations.
+- Decide whether this targets arXiv only, a workshop, or both.
+- Run the prompt-to-code detector-quality evaluation described in `main.tex`.
+- Replace the evaluation-plan section with measured results.
+- Add citations to relevant program-repair and LLM-code-generation work.
+- Build the PDF from `main.tex` and inspect it before submission.
+
+## Version Pinning
+
+This draft pins its rule counts and reported benchmark artifacts to `main` commit
+`6a60a0295955ee6cc1d639c88955ea50722e3516` from 2026-05-14.
+
+For future papers or follow-up benchmark runs, record:
+
+- The exact `main` commit or benchmark tag used for the laint rule corpus.
+- The prompt suite version.
+- The model IDs and provider versions used for generation.
+- The run date and output directory.
+
+A future tag scheme such as `benchmark/agent-oriented-linting-2026-05` or
+`paper/agent-oriented-linting-v1` would make these runs easier to cite without
+depending on floating branch names.
+
+## Reproducing Paper Numbers
+
+Every numeric claim in the draft should be calculated either from repository
+source or from a checked-in benchmark artifact.
+
+Rule corpus counts, severity counts, platform counts, and the category table are
+calculated from `src/rules/*` metadata:
+
+```bash
+npm run paper:stats
+```
+
+The preliminary prompt-grid numbers in `main.tex` are calculated from the
+archived run artifact at `paper/eval/artifacts/initial-grid/results.json`:
+
+```bash
+npm run paper:stats -- --eval paper/eval/artifacts/initial-grid/results.json
+```
+
+There is also a larger raw grid artifact at
+`paper/eval/artifacts/full-grid-2026-05-17/results.json`:
+
+```bash
+npm run paper:stats -- --eval paper/eval/artifacts/full-grid-2026-05-17/results.json
+```
+
+The tables included by `main.tex` are generated from that artifact and from the archived repair-loop artifact:
+
+```bash
+npm run paper:tables
+```
+
+This rewrites `paper/generated/full-grid-tables.tex` and
+`paper/generated/repair-loop-tables.tex`, which are checked in so the paper
+source can build directly while still keeping the table values reproducible from
+the archived JSON artifacts.
+
+The full-grid run covers 6 prompts and 7 configured model aliases. Moonshot/Kimi failed
+all 6 generations due to provider authentication or network infrastructure errors,
+so the paper scopes model comparisons to the 6 working aliases and keeps the Kimi
+rows only as infrastructure-failure evidence. A scored Kimi comparison should
+rerun the grid after the Moonshot credential path is fixed.
+
+The repair-loop pilot uses the full-grid artifact as its baseline and is archived
+at `paper/eval/artifacts/repair-loop-2026-05-27/results.json`. The archived JSON
+artifact is the exact source for the current paper tables. The command below
+creates a fresh stochastic rerun with Doppler-provided model keys; it should not
+be expected to reproduce byte-identical outputs:
+
+```bash
+doppler run --project flux-worker --config dev -- npm run eval:repair-loop -- --max-turns 3 --out paper/eval/results/repair-loop-2026-05-27
+```
+
+The generated app files under `paper/eval/results/` remain ignored because they
+are working outputs. If a benchmark run contributes numbers to a paper, archive
+the corresponding `results.json` under `paper/eval/artifacts/<run-name>/` or
+attach it to a tagged release before citing the numbers. Archived artifacts
+should include top-level `metadata` with the run name, artifact date, source
+commit, runner script, prompt IDs, model aliases, model IDs, and token/turn
+limits.
+
+## Suggested Evaluation Data
+
+- A prompt suite covering web, mobile, and backend app-building tasks.
+- Generated JSX/TSX outputs from one or more LLMs.
+- Laint findings for each generated output.
+- Human labels for whether each finding is valid, invalid, or ambiguous.
+- Missed-defect labels for recall, when an independently reviewed corpus is available.
+- TypeScript, framework build, web preview, mobile simulator/device preview, and runtime outcomes.
+- Diagnostic-compliance outcomes after lint feedback: net finding reduction, rule-level resolved findings, newly introduced findings, turns to a lint-clean state, parse errors, and repair iteration counts.
+
+## Prompt Grid
+
+Run a small prompt-to-code grid with Doppler-provided model keys:
+
+```bash
+doppler run --project flux-worker --config dev -- npm run eval:prompt-grid
+```
+
+Useful options:
+
+```bash
+npm run eval:prompt-grid -- --limit 2
+npm run eval:prompt-grid -- --models openai-gpt-5.5,anthropic-sonnet-4.6,google-3.1-pro
+npm run eval:prompt-grid -- --out paper/eval/results/my-run
+```
+
+The runner writes raw generated files, `results.json`, `summary.md`, and `labels.todo.jsonl`
+under `paper/eval/results/`. That directory is intentionally ignored by git.
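
The README above asks archived artifacts to carry top-level `metadata` (run name, artifact date, source commit, runner script, prompt IDs, model aliases, model IDs, and token/turn limits). A minimal sketch of how that requirement could be checked before citing numbers follows. The field names (`runName`, `sourceCommit`, `limits`, and so on) and the helper `missingMetadataKeys` are hypothetical; the diff does not show the actual schema used by the runner scripts.

```typescript
// Hypothetical shape for the top-level `metadata` object.
// Field names are assumptions, not the schema used by the real runner.
interface ArtifactMetadata {
  runName: string;
  artifactDate: string;
  sourceCommit: string;
  runnerScript: string;
  promptIds: string[];
  modelAliases: string[];
  modelIds: string[];
  limits: { maxTokens?: number; maxTurns?: number };
}

const REQUIRED_KEYS: ReadonlyArray<keyof ArtifactMetadata> = [
  "runName",
  "artifactDate",
  "sourceCommit",
  "runnerScript",
  "promptIds",
  "modelAliases",
  "modelIds",
  "limits",
];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Returns the required metadata keys that are absent from a parsed results.json.
function missingMetadataKeys(artifact: unknown): string[] {
  const metadata = isRecord(artifact) ? artifact["metadata"] : undefined;
  if (!isRecord(metadata)) {
    return [...REQUIRED_KEYS];
  }
  return REQUIRED_KEYS.filter((key) => metadata[key] === undefined);
}

// Usage: an artifact with only two of the eight fields filled in.
const partialArtifact = {
  metadata: { runName: "repair-loop-2026-05-27", limits: { maxTurns: 3 } },
};
console.log(missingMetadataKeys(partialArtifact));
```

A check like this could run in CI against everything under `paper/eval/artifacts/`, so an artifact without provenance fields is caught before its numbers reach the paper.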
diff --git a/paper/eval/artifacts/full-grid-2026-05-17/results.json b/paper/eval/artifacts/full-grid-2026-05-17/results.json
diff --git a/paper/eval/artifacts/initial-grid/results.json b/paper/eval/artifacts/initial-grid/results.json
diff --git a/paper/eval/artifacts/repair-loop-2026-05-27/results.json b/paper/eval/artifacts/repair-loop-2026-05-27/results.json
diff --git a/paper/eval/prompts.json b/paper/eval/prompts.json
@@ -0,0 +1,44 @@
+[
+  {
+    "id": "taskflow-web",
+    "platform": "web",
+    "source": "refactor-bench",
+    "description": "React task management component with CRUD, search, filtering, modals, themes, and persistent UI preferences.",
+    "outputFile": "app/page.tsx"
+  },
+  {
+    "id": "chat-web",
+    "platform": "web",
+    "source": "refactor-bench",
+    "description": "Realtime chat application page with auth gate, message history, typing indicators, local draft persistence, and theme switching.",
+    "outputFile": "app/page.tsx"
+  },
+  {
+    "id": "event-planner-mobile",
+    "platform": "expo",
+    "source": "refactor-bench",
+    "description": "React Native event planning app screen with event browsing, RSVP management, calendar view, category filtering, search, location display, attendee lists, event creation modal, notifications, and user profiles.",
+    "outputFile": "src/screens/HomeScreen.tsx"
+  },
+  {
+    "id": "beauty-shop-mobile",
+    "platform": "expo",
+    "source": "refactor-bench",
+    "description": "Beauty and cosmetics shopping mobile app screen with wishlist, brand discovery, product categories, search, profile access, and bottom-tab navigation.",
+    "outputFile": "src/app/(tabs)/shop.tsx"
+  },
+  {
+    "id": "wallet-api-backend",
+    "platform": "backend",
+    "source": "custom",
+    "description": "Next.js route handler for wallet transfers with request validation, balance lookup, transaction creation, retry handling, and JSON responses.",
+    "outputFile": "app/api/wallet/transfer/route.ts"
+  },
+  {
+    "id": "insurance-reports-backend",
+    "platform": "backend",
+    "source": "refactor-bench",
+    "description": "Next.js route handler for insurance report aggregation with role-based access checks, filters, conversion-rate calculations, CSV export support, and error logging.",
+    "outputFile": "app/api/reports/route.ts"
+  }
+]
diff --git a/paper/generated/full-grid-tables.tex b/paper/generated/full-grid-tables.tex
@@ -0,0 +1,108 @@
+% Generated by npm run paper:tables.
+% Source artifact: paper/eval/artifacts/full-grid-2026-05-17/results.json
+
+\begin{table}[ht]
+  \centering
+  \begin{tabular}{lr}
+    \toprule
+    Metric & Value \\
+    \midrule
+    Prompts & 6 \\
+    Model aliases & 7 \\
+    Grid slots & 42 \\
+    Completed generations & 36 \\
+    Parse errors & 2 \\
+    Generation errors & 6 \\
+    Reported findings & 476 \\
+    \bottomrule
+  \end{tabular}
+  \caption{Expanded raw prompt-to-code benchmark run before detector-quality labeling.}
+  \label{tab:expanded-grid}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \small
+  \begin{tabular}{lrrrrr}
+    \toprule
+    Model alias & Linted runs & Parse errors & Gen. errors & Findings & Findings/linted \\
+    \midrule
+    \texttt{anthropic-sonnet-4.6} & 6 & 0 & 0 & 127 & 21.2 \\
+    \texttt{anthropic-opus-4.6} & 6 & 0 & 0 & 123 & 20.5 \\
+    \texttt{openai-gpt-5.5} & 6 & 0 & 0 & 78 & 13.0 \\
+    \texttt{openai-gpt-5.4} & 5 & 1 & 0 & 59 & 11.8 \\
+    \texttt{google-3.1-pro} & 6 & 0 & 0 & 47 & 7.8 \\
+    \texttt{google-2.5-flash} & 5 & 1 & 0 & 42 & 8.4 \\
+    \texttt{moonshot-kimi-k2.6} & 0 & 0 & 6 & 0 & $--$ \\
+    \bottomrule
+  \end{tabular}
+  \caption{Expanded-grid findings by model. Linted runs exclude generation failures and parse failures.}
+  \label{tab:expanded-by-model}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \small
+  \begin{tabular}{llrrrr}
+    \toprule
+    Prompt & Platform & Linted runs & Parse errors & Gen. errors & Findings \\
+    \midrule
+    \texttt{taskflow-web} & web & 6 & 0 & 1 & 125 \\
+    \texttt{chat-web} & web & 5 & 1 & 1 & 92 \\
+    \texttt{insurance-reports-backend} & backend & 6 & 0 & 1 & 84 \\
+    \texttt{event-planner-mobile} & expo & 6 & 0 & 1 & 77 \\
+    \texttt{beauty-shop-mobile} & expo & 6 & 0 & 1 & 61 \\
+    \texttt{wallet-api-backend} & backend & 5 & 1 & 1 & 37 \\
+    \bottomrule
+  \end{tabular}
+  \caption{Expanded-grid findings by prompt and target platform.}
+  \label{tab:expanded-by-prompt}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \scriptsize
+  \begin{tabular}{llrr}
+    \toprule
+    Rule & Category & Findings & Share \\
+    \midrule
+    \texttt{no-inline-styles} & Tailwind CSS & 102 & 21.4\% \\
+    \texttt{no-silent-skip} & Code Style & 87 & 18.3\% \\
+    \texttt{no-type-assertion} & Code Style & 64 & 13.4\% \\
+    \texttt{no-emoji-icons} & Code Style & 37 & 7.8\% \\
+    \texttt{no-optional-props} & Code Style & 29 & 6.1\% \\
+    \texttt{scrollview-horizontal-flexgrow} & React Native / Expo & 25 & 5.3\% \\
+    \texttt{prefer-named-params} & Code Style & 17 & 3.6\% \\
+    \texttt{no-safeareaview} & React Native / Expo & 14 & 2.9\% \\
+    \texttt{browser-api-in-useeffect} & React / JSX & 12 & 2.5\% \\
+    \texttt{no-stylesheet-create} & React Native / Expo & 12 & 2.5\% \\
+    \texttt{textinput-keyboard-avoiding} & React Native / Expo & 12 & 2.5\% \\
+    \texttt{catch-must-log-to-sentry} & Error Handling & 11 & 2.3\% \\
+    Other rules & -- & 54 & 11.3\% \\
+    \bottomrule
+  \end{tabular}
+  \caption{Most frequent expanded-grid reported findings by rule. The top twelve rules account for 422 of 476 raw findings (88.7\%).}
+  \label{tab:expanded-by-rule}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \scriptsize
+  \begingroup
+  \setlength{\tabcolsep}{3pt}
+  \begin{tabular}{lrrrrrrr}
+    \toprule
+    Prompt & GPT-5.5 & GPT-5.4 & Sonnet 4.6 & Opus 4.6 & G-3.1-Pro & G-2.5-Flash & Kimi K2.6 \\
+    \midrule
+    \texttt{taskflow-web} & 14 & 7 & 38 & 44 & 11 & 11 & G \\
+    \texttt{chat-web} & 14 & 17 & 37 & 11 & 13 & P & G \\
+    \texttt{event-planner-mobile} & 11 & 18 & 16 & 17 & 7 & 8 & G \\
+    \texttt{beauty-shop-mobile} & 11 & 10 & 8 & 16 & 7 & 9 & G \\
+    \texttt{wallet-api-backend} & 10 & P & 14 & 5 & 4 & 4 & G \\
+    \texttt{insurance-reports-backend} & 18 & 7 & 14 & 30 & 5 & 10 & G \\
+    \bottomrule
+  \end{tabular}
+  \endgroup
+  \caption{Run-level expanded-grid finding counts. P denotes a generated file that failed parsing; G denotes a generation failure.}
+  \label{tab:expanded-grid-matrix}
+\end{table}
diff --git a/paper/generated/repair-loop-tables.tex b/paper/generated/repair-loop-tables.tex
@@ -0,0 +1,66 @@
+% Generated by npm run paper:tables.
+% Source artifact: paper/eval/artifacts/repair-loop-2026-05-27/results.json
+
+\begin{table}[ht]
+  \centering
+  \begin{tabular}{lr}
+    \toprule
+    Metric & Value \\
+    \midrule
+    Baseline records & 42 \\
+    Skipped baseline generation errors & 6 \\
+    Attempted repairs & 36 \\
+    Maximum repair turns & 3 \\
+    Baseline reported findings & 476 \\
+    Final reported findings & 101 \\
+    Net finding reduction & 375 (78.8\%) \\
+    Rule-level findings resolved & 445 \\
+    Rule-level findings introduced & 70 \\
+    Baseline parse errors & 2 \\
+    Final parse errors & 1 \\
+    Clean after one turn & 7 \\
+    Clean after max turns & 18 \\
+    Repair generation errors & 0 \\
+    \bottomrule
+  \end{tabular}
+  \caption{Diagnostic-compliance repair-loop pilot over the expanded grid. Each attempted repair feeds laint diagnostics back to the same model for up to three turns.}
+  \label{tab:repair-summary}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \scriptsize
+  \begin{tabular}{lrrrrrr}
+    \toprule
+    Model & Initial & Final & Net red. & New & Clean final & Avg. turns \\
+    \midrule
+    Opus 4.6 & 123 & 11 & 91.1\% & 2 & 2/6 & 1.5 \\
+    GPT-5.5 & 78 & 8 & 89.7\% & 8 & 4/6 & 1.3 \\
+    Sonnet 4.6 & 127 & 63 & 50.4\% & 55 & 2/6 & 3.0 \\
+    GPT-5.4 & 59 & 6 & 89.8\% & 0 & 4/6 & 2.0 \\
+    G-3.1-Pro & 47 & 5 & 89.4\% & 2 & 4/6 & 1.5 \\
+    G-2.5-Flash & 42 & 8 & 81.0\% & 3 & 2/6 & 2.5 \\
+    \bottomrule
+  \end{tabular}
+  \caption{Diagnostic-compliance outcomes by model, excluding baseline generation failures. Net red. is net reported-finding reduction; New counts rule-level findings introduced during repair. Average turns is computed over runs that reached zero findings and no parse error.}
+  \label{tab:repair-by-model}
+\end{table}
+
+\begin{table}[ht]
+  \centering
+  \scriptsize
+  \begin{tabular}{llrrrr}
+    \toprule
+    Prompt & Platform & Initial & Final & Net red. & New \\
+    \midrule
+    \texttt{taskflow-web} & web & 125 & 4 & 96.8\% & 2 \\
+    \texttt{chat-web} & web & 92 & 5 & 94.6\% & 0 \\
+    \texttt{insurance-reports-backend} & backend & 84 & 5 & 94.0\% & 1 \\
+    \texttt{beauty-shop-mobile} & expo & 61 & 13 & 78.7\% & 6 \\
+    \texttt{wallet-api-backend} & backend & 37 & 4 & 89.2\% & 2 \\
+    \texttt{event-planner-mobile} & expo & 77 & 70 & 9.1\% & 59 \\
+    \bottomrule
+  \end{tabular}
+  \caption{Diagnostic-compliance outcomes by prompt and platform. New counts rule-level findings introduced during repair.}
+  \label{tab:repair-by-prompt}
+\end{table}
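
The repair-loop tables above are internally linked: the net reduction equals baseline minus final findings, and the resolved count equals the net reduction plus the findings introduced during repair. The sketch below re-derives those totals from the per-model rows as a sanity check. The numbers are copied from the tables; the type and function names are illustrative and are not taken from `scripts/paper-stats.ts`.

```typescript
// Per-model rows copied from the repair-by-model table.
interface RepairRow {
  label: string;
  initial: number;
  final: number;
  introduced: number;
}

const byModel: RepairRow[] = [
  { label: "Opus 4.6", initial: 123, final: 11, introduced: 2 },
  { label: "GPT-5.5", initial: 78, final: 8, introduced: 8 },
  { label: "Sonnet 4.6", initial: 127, final: 63, introduced: 55 },
  { label: "GPT-5.4", initial: 59, final: 6, introduced: 0 },
  { label: "G-3.1-Pro", initial: 47, final: 5, introduced: 2 },
  { label: "G-2.5-Flash", initial: 42, final: 8, introduced: 3 },
];

function sum(values: number[]): number {
  return values.reduce((total, value) => total + value, 0);
}

// Net reduction as a percentage of the baseline, rounded to one decimal place.
function netReductionPercent(initial: number, final: number): number {
  return Math.round(((initial - final) / initial) * 1000) / 10;
}

const totalInitial = sum(byModel.map((row) => row.initial));
const totalFinal = sum(byModel.map((row) => row.final));
const totalIntroduced = sum(byModel.map((row) => row.introduced));

// Resolved findings are implied: net reduction plus findings introduced during repair.
const impliedResolved = totalInitial - totalFinal + totalIntroduced;

console.log({
  totalInitial,
  totalFinal,
  totalIntroduced,
  impliedResolved,
  netReductionPercent: netReductionPercent(totalInitial, totalFinal),
});
```

With the rows above, the totals come out to 476 baseline findings, 101 final findings, 70 introduced, and 445 resolved, matching the summary table; the same identity holds for the per-prompt rows.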