Draft
27 commits
b8b6e69
docs: draft agent-oriented linting paper
danielchen0 May 17, 2026
3111272
docs: format paper readme
danielchen0 May 17, 2026
da9695f
docs: include mobile app framing in paper
danielchen0 May 17, 2026
4624ac9
docs: describe prompt-to-code eval
danielchen0 May 17, 2026
286679d
docs: add prompt grid eval harness
danielchen0 May 17, 2026
70afa7b
docs: add preliminary eval numbers
danielchen0 May 17, 2026
1be8c14
docs: frame laint as llm benchmark
danielchen0 May 17, 2026
fba83fd
docs: clarify laint benchmark framing
danielchen0 May 17, 2026
d3146e5
docs: describe benchmark behavioral signals
danielchen0 May 17, 2026
759ddfd
docs: add Arnav Surve as paper author
danielchen0 May 17, 2026
5be1551
docs: rename paper validity section to limitations
danielchen0 May 17, 2026
c72b770
docs: clarify local heuristic tradeoff
danielchen0 May 17, 2026
2504637
docs: clarify paper terminology
danielchen0 May 17, 2026
22fed69
docs: add recall to detector metrics
danielchen0 May 17, 2026
e162974
docs: add f-score detector metric
danielchen0 May 17, 2026
c74d8f0
docs: pin paper benchmark version
danielchen0 May 17, 2026
f9a403f
docs: clarify rule category source
danielchen0 May 17, 2026
fb7af44
docs: make paper numbers reproducible
danielchen0 May 17, 2026
182c08f
docs: archive full prompt grid artifact
danielchen0 May 17, 2026
0c5d616
docs: highlight edit-time repair loop
danielchen0 May 17, 2026
afa6a41
docs: add expanded grid data to paper
danielchen0 May 17, 2026
77b4992
docs: add generated result tables to paper
danielchen0 May 27, 2026
d0ca77f
docs: add repair loop pilot to paper
danielchen0 May 27, 2026
e7da143
docs: frame repair loop as diagnostic compliance
danielchen0 May 27, 2026
5e7eae4
docs: tighten benchmark pilot framing
danielchen0 May 27, 2026
eda7f36
docs: address paper review comments
danielchen0 May 29, 2026
bfc9ca5
docs: tighten benchmark reproducibility claims
danielchen0 May 29, 2026
1 change: 1 addition & 0 deletions .prettierignore
@@ -1,3 +1,4 @@
dist/
node_modules/
package-lock.json
paper/eval/results/
2 changes: 1 addition & 1 deletion knip.json
@@ -1,3 +1,3 @@
 {
-  "ignore": ["dist/**"]
+  "ignore": ["dist/**", "paper/eval/results/**"]
 }
4 changes: 4 additions & 0 deletions package.json
@@ -21,6 +21,10 @@
"lint:fix": "eslint --fix . && prettier --write .",
"format": "prettier --write .",
"format:check": "prettier --check .",
"eval:prompt-grid": "npm run build && tsx scripts/run-prompt-grid-eval.ts",
"eval:repair-loop": "npm run build && tsx scripts/run-repair-loop-eval.ts",
"paper:stats": "tsx scripts/paper-stats.ts",
"paper:tables": "tsx scripts/paper-stats.ts --eval paper/eval/artifacts/full-grid-2026-05-17/results.json --latex-out paper/generated/full-grid-tables.tex --repair-eval paper/eval/artifacts/repair-loop-2026-05-27/results.json --repair-latex-out paper/generated/repair-loop-tables.tex",
"sync": "tsx scripts/sync.ts",
"sync:check": "tsx scripts/sync.ts && git diff --exit-code -- src/rules/index.ts README.md"
},
9 changes: 9 additions & 0 deletions paper/.gitignore
@@ -0,0 +1,9 @@
*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.pdf
eval/results/
9 changes: 9 additions & 0 deletions paper/Makefile
@@ -0,0 +1,9 @@
PDF=main.pdf

.PHONY: all clean

all:
	latexmk -pdf -interaction=nonstopmode main.tex

clean:
	latexmk -C main.tex
124 changes: 124 additions & 0 deletions paper/README.md
@@ -0,0 +1,124 @@
# Laint Paper Draft

This directory contains an initial arXiv-style paper draft for laint.

## Current Shape

The draft is intentionally framed as a research/tool paper, not a product announcement. The strongest publishable angle is:

> Agent-oriented linting for generated JSX/TSX applications catches framework-specific web, mobile, and backend failures earlier than conventional build/type/runtime feedback.

## Before Submission

- Add real authors and affiliations.
- Decide whether this targets arXiv only, a workshop, or both.
- Run the prompt-to-code detector-quality evaluation described in `main.tex`.
- Replace the evaluation-plan section with measured results.
- Add citations to relevant program-repair and LLM-code-generation work.
- Build the PDF from `main.tex` and inspect it before submission.

## Version Pinning

This draft pins its rule counts and reported benchmark artifacts to `main` commit
`6a60a0295955ee6cc1d639c88955ea50722e3516` from 2026-05-14.

For future papers or follow-up benchmark runs, record:

- The exact `main` commit or benchmark tag used for the laint rule corpus.
- The prompt suite version.
- The model IDs and provider versions used for generation.
- The run date and output directory.

A future tag scheme such as `benchmark/agent-oriented-linting-2026-05` or
`paper/agent-oriented-linting-v1` would make these runs easier to cite without
depending on floating branch names.

## Reproducing Paper Numbers

Every numeric claim in the draft should be calculated either from repository
source or from a checked-in benchmark artifact.

Rule corpus counts, severity counts, platform counts, and the category table are
calculated from `src/rules/*` metadata:

```bash
npm run paper:stats
```

The preliminary prompt-grid numbers in `main.tex` are calculated from the
archived run artifact at `paper/eval/artifacts/initial-grid/results.json`:

```bash
npm run paper:stats -- --eval paper/eval/artifacts/initial-grid/results.json
```

There is also a larger raw grid artifact at
`paper/eval/artifacts/full-grid-2026-05-17/results.json`:

```bash
npm run paper:stats -- --eval paper/eval/artifacts/full-grid-2026-05-17/results.json
```

The expanded-grid tables included by `main.tex` are generated from that artifact:

```bash
npm run paper:tables
```

This rewrites `paper/generated/full-grid-tables.tex` and
`paper/generated/repair-loop-tables.tex`, which are checked in so the paper
source can build directly while still keeping the table values reproducible from
the archived JSON artifacts.

The full-grid raw run covers 6 prompts and 7 configured model aliases. Moonshot/Kimi
failed all 6 generations due to provider authentication or network infrastructure
errors, so the paper scopes model comparisons to the 6 working aliases and keeps the
Kimi rows only as infrastructure-failure evidence. A scored Kimi comparison requires
rerunning the grid after the Moonshot credential path is fixed.
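
The scoping rule above can be sketched in a few lines. This is a minimal illustration only: the record shape (`model`, `status`, `findings`) and the `summarizeByModel` helper are hypothetical and are not taken from `scripts/paper-stats.ts` or the actual `results.json` schema.

```typescript
// Hypothetical record shape; the real results.json schema may differ.
type RunStatus = "linted" | "parse-error" | "generation-error";

interface RunRecord {
  model: string;
  status: RunStatus;
  findings: number; // reported findings; 0 when the run was not linted
}

interface ModelSummary {
  model: string;
  lintedRuns: number;
  findings: number;
  findingsPerLinted: number | null; // null when no run was linted
  scored: boolean; // false when every generation failed
}

// Group runs by model alias, count findings over linted runs only, and flag
// aliases whose every run was a generation error as unscored.
function summarizeByModel(records: RunRecord[]): ModelSummary[] {
  const byModel = new Map<string, RunRecord[]>();
  for (const record of records) {
    const group = byModel.get(record.model) ?? [];
    group.push(record);
    byModel.set(record.model, group);
  }
  return Array.from(byModel.entries()).map(([model, runs]) => {
    const linted = runs.filter((run) => run.status === "linted");
    const findings = linted.reduce((sum, run) => sum + run.findings, 0);
    return {
      model,
      lintedRuns: linted.length,
      findings,
      findingsPerLinted: linted.length > 0 ? findings / linted.length : null,
      scored: runs.some((run) => run.status !== "generation-error"),
    };
  });
}
```

Under these assumptions, an alias with six generation errors yields `scored: false` and `findingsPerLinted: null`, so it can be kept in the raw tables while being excluded from model comparisons.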

The repair-loop pilot uses the full-grid artifact as its baseline and is archived
at `paper/eval/artifacts/repair-loop-2026-05-27/results.json`. The archived JSON
artifact is the exact source for the current paper tables. The command below
creates a fresh stochastic rerun with Doppler-provided model keys; it should not
be expected to reproduce byte-identical outputs:

```bash
doppler run --project flux-worker --config dev -- npm run eval:repair-loop -- --max-turns 3 --out paper/eval/results/repair-loop-2026-05-27
```

The generated app files under `paper/eval/results/` remain ignored because they
are working outputs. If a benchmark run contributes numbers to a paper, archive
the corresponding `results.json` under `paper/eval/artifacts/<run-name>/` or
attach it to a tagged release before citing the numbers. Archived artifacts
should include top-level `metadata` with the run name, artifact date, source
commit, runner script, prompt IDs, model aliases, model IDs, and token/turn
limits.
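
As an illustration of the metadata requirement above, a minimal sketch of a completeness check follows. The key names (`runName`, `artifactDate`, and so on) and the `missingMetadataKeys` helper are assumptions made for this example; the archived artifacts may spell these fields differently.

```typescript
// Key names are illustrative assumptions, not the archived artifact schema.
interface ArtifactMetadata {
  runName: string;
  artifactDate: string; // ISO date string
  sourceCommit: string;
  runnerScript: string;
  promptIds: string[];
  modelAliases: string[];
  modelIds: Record<string, string>; // alias -> provider model ID
  limits: { maxTokens?: number; maxTurns?: number };
}

const REQUIRED_METADATA_KEYS = [
  "runName",
  "artifactDate",
  "sourceCommit",
  "runnerScript",
  "promptIds",
  "modelAliases",
  "modelIds",
  "limits",
] as const;

// Return the required keys that are absent (undefined or null) in a parsed
// top-level `metadata` object.
function missingMetadataKeys(metadata: Record<string, unknown>): string[] {
  return REQUIRED_METADATA_KEYS.filter(
    (key) => metadata[key] === undefined || metadata[key] === null,
  );
}
```

A check like this could run before an artifact is archived, so that a `results.json` without a source commit or model IDs is caught before its numbers are cited.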

## Suggested Evaluation Data

- A prompt suite covering web, mobile, and backend app-building tasks.
- Generated JSX/TSX outputs from one or more LLMs.
- Laint findings for each generated output.
- Human labels for whether each finding is valid, invalid, or ambiguous.
- Missed-defect labels for recall, when an independently reviewed corpus is available.
- TypeScript, framework build, web preview, mobile simulator/device preview, and runtime outcomes.
- Diagnostic-compliance outcomes after lint feedback: net finding reduction, rule-level resolved findings, newly introduced findings, turns to a lint-clean state, parse errors, and repair iteration counts.
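
The diagnostic-compliance counts in the last item are related by simple arithmetic, sketched below. This assumes that final findings equal baseline findings minus resolved findings plus newly introduced findings, which matches the archived repair-loop pilot totals; the `summarizeRepair` helper and its types are hypothetical, and the actual runner may compute these values differently.

```typescript
// Hypothetical types for illustration; not the runner's actual data model.
interface RepairCounts {
  baselineFindings: number;
  resolved: number; // rule-level findings present at baseline, gone at the end
  introduced: number; // rule-level findings absent at baseline, present at the end
}

interface RepairSummary {
  finalFindings: number;
  netReduction: number;
  netReductionPct: number; // rounded to one decimal place
}

// Derive final findings and net reduction from baseline, resolved, and
// introduced counts. A zero baseline reports 0% to avoid dividing by zero.
function summarizeRepair(counts: RepairCounts): RepairSummary {
  const finalFindings =
    counts.baselineFindings - counts.resolved + counts.introduced;
  const netReduction = counts.baselineFindings - finalFindings;
  const pct =
    counts.baselineFindings === 0
      ? 0
      : (netReduction / counts.baselineFindings) * 100;
  return {
    finalFindings,
    netReduction,
    netReductionPct: Math.round(pct * 10) / 10,
  };
}

// Pilot totals: 476 baseline findings, 445 resolved, 70 introduced.
const pilot = summarizeRepair({
  baselineFindings: 476,
  resolved: 445,
  introduced: 70,
});
console.log(pilot); // { finalFindings: 101, netReduction: 375, netReductionPct: 78.8 }
```

Reporting resolved and introduced findings separately matters because net reduction alone hides churn: a repair loop that resolves many findings while introducing new ones can look similar to one that resolves fewer findings cleanly.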

## Prompt Grid

Run a small prompt-to-code grid with Doppler-provided model keys:

```bash
doppler run --project flux-worker --config dev -- npm run eval:prompt-grid
```

Useful options:

```bash
npm run eval:prompt-grid -- --limit 2
npm run eval:prompt-grid -- --models openai-gpt-5.5,anthropic-sonnet-4.6,google-3.1-pro
npm run eval:prompt-grid -- --out paper/eval/results/my-run
```

The runner writes raw generated files, `results.json`, `summary.md`, and `labels.todo.jsonl`
under `paper/eval/results/`. That directory is intentionally ignored by git.
4,724 changes: 4,724 additions & 0 deletions paper/eval/artifacts/full-grid-2026-05-17/results.json

Large diffs are not rendered by default.

2,302 changes: 2,302 additions & 0 deletions paper/eval/artifacts/initial-grid/results.json

Large diffs are not rendered by default.

11,617 changes: 11,617 additions & 0 deletions paper/eval/artifacts/repair-loop-2026-05-27/results.json

Large diffs are not rendered by default.

44 changes: 44 additions & 0 deletions paper/eval/prompts.json
@@ -0,0 +1,44 @@
[
{
"id": "taskflow-web",
"platform": "web",
"source": "refactor-bench",
"description": "React task management component with CRUD, search, filtering, modals, themes, and persistent UI preferences.",
"outputFile": "app/page.tsx"
},
{
"id": "chat-web",
"platform": "web",
"source": "refactor-bench",
"description": "Realtime chat application page with auth gate, message history, typing indicators, local draft persistence, and theme switching.",
"outputFile": "app/page.tsx"
},
{
"id": "event-planner-mobile",
"platform": "expo",
"source": "refactor-bench",
"description": "React Native event planning app screen with event browsing, RSVP management, calendar view, category filtering, search, location display, attendee lists, event creation modal, notifications, and user profiles.",
"outputFile": "src/screens/HomeScreen.tsx"
},
{
"id": "beauty-shop-mobile",
"platform": "expo",
"source": "refactor-bench",
"description": "Beauty and cosmetics shopping mobile app screen with wishlist, brand discovery, product categories, search, profile access, and bottom-tab navigation.",
"outputFile": "src/app/(tabs)/shop.tsx"
},
{
"id": "wallet-api-backend",
"platform": "backend",
"source": "custom",
"description": "Next.js route handler for wallet transfers with request validation, balance lookup, transaction creation, retry handling, and JSON responses.",
"outputFile": "app/api/wallet/transfer/route.ts"
},
{
"id": "insurance-reports-backend",
"platform": "backend",
"source": "refactor-bench",
"description": "Next.js route handler for insurance report aggregation with role-based access checks, filters, conversion-rate calculations, CSV export support, and error logging.",
"outputFile": "app/api/reports/route.ts"
}
]
108 changes: 108 additions & 0 deletions paper/generated/full-grid-tables.tex
@@ -0,0 +1,108 @@
% Generated by npm run paper:tables.
% Source artifact: paper/eval/artifacts/full-grid-2026-05-17/results.json

\begin{table}[ht]
\centering
\begin{tabular}{lr}
\toprule
Metric & Value \\
\midrule
Prompts & 6 \\
Model aliases & 7 \\
Grid slots & 42 \\
Completed generations & 36 \\
Parse errors & 2 \\
Generation errors & 6 \\
Reported findings & 476 \\
\bottomrule
\end{tabular}
\caption{Expanded raw prompt-to-code benchmark run before detector-quality labeling.}
\label{tab:expanded-grid}
\end{table}

\begin{table}[ht]
\centering
\small
\begin{tabular}{lrrrrr}
\toprule
Model alias & Linted runs & Parse errors & Gen. errors & Findings & Findings/linted \\
\midrule
\texttt{anthropic-sonnet-4.6} & 6 & 0 & 0 & 127 & 21.2 \\
\texttt{anthropic-opus-4.6} & 6 & 0 & 0 & 123 & 20.5 \\
\texttt{openai-gpt-5.5} & 6 & 0 & 0 & 78 & 13.0 \\
\texttt{openai-gpt-5.4} & 5 & 1 & 0 & 59 & 11.8 \\
\texttt{google-3.1-pro} & 6 & 0 & 0 & 47 & 7.8 \\
\texttt{google-2.5-flash} & 5 & 1 & 0 & 42 & 8.4 \\
\texttt{moonshot-kimi-k2.6} & 0 & 0 & 6 & 0 & $--$ \\
\bottomrule
\end{tabular}
\caption{Expanded-grid findings by model. Linted runs exclude generation failures and parse failures.}
\label{tab:expanded-by-model}
\end{table}

\begin{table}[ht]
\centering
\small
\begin{tabular}{llrrrr}
\toprule
Prompt & Platform & Linted runs & Parse errors & Gen. errors & Findings \\
\midrule
\texttt{taskflow-web} & web & 6 & 0 & 1 & 125 \\
\texttt{chat-web} & web & 5 & 1 & 1 & 92 \\
\texttt{insurance-reports-backend} & backend & 6 & 0 & 1 & 84 \\
\texttt{event-planner-mobile} & expo & 6 & 0 & 1 & 77 \\
\texttt{beauty-shop-mobile} & expo & 6 & 0 & 1 & 61 \\
\texttt{wallet-api-backend} & backend & 5 & 1 & 1 & 37 \\
\bottomrule
\end{tabular}
\caption{Expanded-grid findings by prompt and target platform.}
\label{tab:expanded-by-prompt}
\end{table}

\begin{table}[ht]
\centering
\scriptsize
\begin{tabular}{llrr}
\toprule
Rule & Category & Findings & Share \\
\midrule
\texttt{no-inline-styles} & Tailwind CSS & 102 & 21.4\% \\
\texttt{no-silent-skip} & Code Style & 87 & 18.3\% \\
\texttt{no-type-assertion} & Code Style & 64 & 13.4\% \\
\texttt{no-emoji-icons} & Code Style & 37 & 7.8\% \\
\texttt{no-optional-props} & Code Style & 29 & 6.1\% \\
\texttt{scrollview-horizontal-flexgrow} & React Native / Expo & 25 & 5.3\% \\
\texttt{prefer-named-params} & Code Style & 17 & 3.6\% \\
\texttt{no-safeareaview} & React Native / Expo & 14 & 2.9\% \\
\texttt{browser-api-in-useeffect} & React / JSX & 12 & 2.5\% \\
\texttt{no-stylesheet-create} & React Native / Expo & 12 & 2.5\% \\
\texttt{textinput-keyboard-avoiding} & React Native / Expo & 12 & 2.5\% \\
\texttt{catch-must-log-to-sentry} & Error Handling & 11 & 2.3\% \\
Other rules & -- & 54 & 11.3\% \\
\bottomrule
\end{tabular}
\caption{Most frequent expanded-grid reported findings by rule. The twelve rules listed account for 422 of the 476 raw findings (88.7\%).}
\label{tab:expanded-by-rule}
\end{table}

\begin{table}[ht]
\centering
\scriptsize
\begingroup
\setlength{\tabcolsep}{3pt}
\begin{tabular}{lrrrrrrr}
\toprule
Prompt & GPT-5.5 & GPT-5.4 & Sonnet 4.6 & Opus 4.6 & G-3.1-Pro & G-2.5-Flash & Kimi K2.6 \\
\midrule
\texttt{taskflow-web} & 14 & 7 & 38 & 44 & 11 & 11 & G \\
\texttt{chat-web} & 14 & 17 & 37 & 11 & 13 & P & G \\
\texttt{event-planner-mobile} & 11 & 18 & 16 & 17 & 7 & 8 & G \\
\texttt{beauty-shop-mobile} & 11 & 10 & 8 & 16 & 7 & 9 & G \\
\texttt{wallet-api-backend} & 10 & P & 14 & 5 & 4 & 4 & G \\
\texttt{insurance-reports-backend} & 18 & 7 & 14 & 30 & 5 & 10 & G \\
\bottomrule
\end{tabular}
\endgroup
\caption{Run-level expanded-grid finding counts. P denotes a generated file that failed parsing; G denotes a generation failure.}
\label{tab:expanded-grid-matrix}
\end{table}
66 changes: 66 additions & 0 deletions paper/generated/repair-loop-tables.tex
@@ -0,0 +1,66 @@
% Generated by npm run paper:tables.
% Source artifact: paper/eval/artifacts/repair-loop-2026-05-27/results.json

\begin{table}[ht]
\centering
\begin{tabular}{lr}
\toprule
Metric & Value \\
\midrule
Baseline records & 42 \\
Skipped baseline generation errors & 6 \\
Attempted repairs & 36 \\
Maximum repair turns & 3 \\
Baseline reported findings & 476 \\
Final reported findings & 101 \\
Net finding reduction & 375 (78.8\%) \\
Rule-level findings resolved & 445 \\
Rule-level findings introduced & 70 \\
Baseline parse errors & 2 \\
Final parse errors & 1 \\
Clean after one turn & 7 \\
Clean after max turns & 18 \\
Repair generation errors & 0 \\
\bottomrule
\end{tabular}
\caption{Diagnostic-compliance repair-loop pilot over the expanded grid. Each attempted repair feeds laint diagnostics back to the same model for up to three turns.}
\label{tab:repair-summary}
\end{table}

\begin{table}[ht]
\centering
\scriptsize
\begin{tabular}{lrrrrrr}
\toprule
Model & Initial & Final & Net red. & New & Clean final & Avg. turns \\
\midrule
Opus 4.6 & 123 & 11 & 91.1\% & 2 & 2/6 & 1.5 \\
GPT-5.5 & 78 & 8 & 89.7\% & 8 & 4/6 & 1.3 \\
Sonnet 4.6 & 127 & 63 & 50.4\% & 55 & 2/6 & 3.0 \\
GPT-5.4 & 59 & 6 & 89.8\% & 0 & 4/6 & 2.0 \\
G-3.1-Pro & 47 & 5 & 89.4\% & 2 & 4/6 & 1.5 \\
G-2.5-Flash & 42 & 8 & 81.0\% & 3 & 2/6 & 2.5 \\
\bottomrule
\end{tabular}
\caption{Diagnostic-compliance outcomes by model, excluding baseline generation failures. Net red. is net reported-finding reduction; New counts rule-level findings introduced during repair. Average turns is computed over runs that reached zero findings and no parse error.}
\label{tab:repair-by-model}
\end{table}

\begin{table}[ht]
\centering
\scriptsize
\begin{tabular}{llrrrr}
\toprule
Prompt & Platform & Initial & Final & Net red. & New \\
\midrule
\texttt{taskflow-web} & web & 125 & 4 & 96.8\% & 2 \\
\texttt{chat-web} & web & 92 & 5 & 94.6\% & 0 \\
\texttt{insurance-reports-backend} & backend & 84 & 5 & 94.0\% & 1 \\
\texttt{beauty-shop-mobile} & expo & 61 & 13 & 78.7\% & 6 \\
\texttt{wallet-api-backend} & backend & 37 & 4 & 89.2\% & 2 \\
\texttt{event-planner-mobile} & expo & 77 & 70 & 9.1\% & 59 \\
\bottomrule
\end{tabular}
\caption{Diagnostic-compliance outcomes by prompt and platform. New counts rule-level findings introduced during repair.}
\label{tab:repair-by-prompt}
\end{table}