Commit a526fbd

coadaflorin, Copilot, and jc-clark authored
Add incremental analysis documentation for the CodeQL CLI (#61237)
Co-authored-by: coadaflorin <coadaflorin@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Joe Clark <31087804+jc-clark@users.noreply.github.com>
1 parent abd1363 commit a526fbd

2 files changed

Lines changed: 342 additions & 0 deletions

Lines changed: 341 additions & 0 deletions
@@ -0,0 +1,341 @@
---
title: Using incremental analysis with the CodeQL CLI
shortTitle: Speed up PR scans
intro: 'Get faster {% data variables.product.prodname_codeql %} results on pull requests by analyzing only what changed. Incremental analysis can reduce scan times by up to 10x when you run the {% data variables.product.prodname_codeql_cli %} in your own CI/CD system.'
allowTitleToDifferFromFilename: true
product: '{% data reusables.gated-features.codeql %}'
versions:
  fpt: '*'
  ghes: '*'
  ghec: '*'
contentType: how-tos
category:
  - Customize vulnerability detection with CodeQL
---

## About incremental analysis

Full {% data variables.product.prodname_codeql %} scans on every pull request can be slow, especially in large codebases. If you run the {% data variables.product.prodname_codeql_cli %} in your own CI/CD system, incremental analysis gives you two ways to speed things up:

* **Diff-informed analysis** reports only alerts in lines you added or changed, so queries run faster and results are more relevant.
* **Overlay analysis** reuses a cached database from your default branch instead of building one from scratch, cutting database creation and query evaluation time dramatically.

You can use these features independently or together. For most teams analyzing pull requests in established codebases, we recommend using both: overlay analysis for fast database creation and query evaluation, and diff-informed analysis for focused, relevant results.

If you use {% data variables.product.prodname_code_scanning %} default setup or the `codeql-action` on {% data variables.product.prodname_dotcom %}, incremental analysis is already handled automatically. This article is for teams running the {% data variables.product.prodname_codeql_cli %} directly in their own CI/CD infrastructure.

## Prerequisites

Before setting up incremental analysis, make sure you meet the following requirements:

* **{% data variables.product.prodname_codeql_cli %} bundle version:** 2.21.0 or later for diff-informed analysis; 2.23.8 or later for overlay analysis. Some languages need a later version for overlay analysis; see [Minimum CLI bundle versions](#minimum-cli-bundle-versions).
* **Source root** must be inside a Git repository
* **Git version** 2.38.0 or later (required for overlay analysis, specifically the `--format` option used by `git ls-files`)
* **All files of interest** must be tracked by Git (not in `.gitignore`)
* **Git index** must accurately reflect the source tree being analyzed
* **Build mode:** Overlay analysis supports only `build-mode: none` (traced builds are not supported). Go is the exception: it works with overlay analysis even though it does not explicitly support `build-mode: none`.

## Choosing an approach

| Scenario | Diff-informed | Overlay |
|---|---|---|
| Default branch push | No (not a PR) | overlay-base mode |
| PR analysis (first time, no cache) | Yes | No (run full analysis) |
| PR analysis (with cached base) | Yes | overlay mode |
| Non-PR, non-default branch | No | No |

For complete working examples in various CI systems, see the [sample CodeQL pipeline configurations](https://github.com/advanced-security/sample-codeql-pipeline-config) repository.
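
The table above can be read as a small decision function. The following is a minimal sketch of that logic, not part of the {% data variables.product.prodname_codeql_cli %}: `Plan`, `choose_plan`, and the three boolean inputs are hypothetical names, and your CI system would need to supply the values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    diff_informed: bool  # whether to restrict alerts to the PR diff
    overlay: str         # "overlay-base", "overlay", or "none" (full analysis)


def choose_plan(is_pull_request: bool, is_default_branch: bool,
                has_cached_base: bool) -> Plan:
    """Map a CI run to the incremental modes listed in the scenario table."""
    if is_pull_request:
        # Diff-informed analysis always applies to PRs; overlay mode needs a cached base.
        return Plan(True, "overlay" if has_cached_base else "none")
    if is_default_branch:
        # Pushes to the default branch build the base that later PRs reuse.
        return Plan(False, "overlay-base")
    # Non-PR runs on other branches fall back to a normal full analysis.
    return Plan(False, "none")
```

A run that gets `overlay="none"` should skip the overlay flags entirely and create the database as usual.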

## Diff-informed analysis

Diff-informed analysis is an optimization for pull request analysis. Instead of reporting all alerts found in the codebase, it reports only alerts in lines that were added or modified in the pull request diff.

### Step 1: Identify the diff ranges

You need the added or modified line ranges from the pull request diff. The input can come from any source (`git diff`, your CI platform's API, or another mechanism).

For each changed file, produce a list of ranges with the following structure:

* `path`: Absolute file path (always use forward slashes)
* `startLine`: 1-based, inclusive start line
* `endLine`: 1-based, inclusive end line

For example, given this unified diff (generated by `git diff`):

```text
--- a/src/utils.ts
+++ b/src/utils.ts
@@ -2,7 +2,6 @@ import { helper } from './helper';
 
 function existing() {
   const x = 1;
-  const unused = 2;
   return x;
 }
 
@@ -14,6 +13,8 @@ function validate(input: string) {
 function process(input: string) {
   // validate
   if (!input) return;
+  const sanitized = input.trim();
+  console.log(sanitized);
   return input;
 }
 
@@ -23,5 +24,5 @@ function format(value: number) {
 
 function render(data: object) {
   const output = JSON.stringify(data);
-  return output;
+  return `<div>${output}</div>`;
 }
```

The resulting diff ranges for `src/utils.ts` would be:

* `["/path/to/repo/src/utils.ts", 16, 17]` (the two inserted lines in the second hunk)
* `["/path/to/repo/src/utils.ts", 27, 27]` (the modified line in the third hunk)

The first hunk contains only a deletion, so it produces no range. Note that ranges use the "to" (new file) line numbers, not the "from" (old file) numbers.

**Special cases:**

* **Binary files or very large diffs** (no patch content available): Use the sentinel range `{path, startLine: 0, endLine: 0}` to indicate "entire file."
* **Renamed files with no content changes**: Return an empty array (no ranges).
* **Truncated diffs**: If your diff source is incomplete for large pull requests (for example, an API that limits the number of changed files), you should skip diff-informed analysis and run full analysis for that run.

For a reference implementation of diff parsing, see [`getDiffRanges()`](https://github.com/github/codeql-action/blob/v4.36.0/src/diff-informed-analysis-utils.ts) in the `codeql-action` source code.
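
As an illustration of the range extraction described in this step, here is a minimal Python sketch that turns unified diff text into `(path, startLine, endLine)` tuples. It is not the `codeql-action` implementation: `diff_ranges` and `repo_root` are hypothetical names, and the sketch assumes plain `git diff` output with `a/` and `b/` prefixes. It does not handle the binary-file or truncated-diff special cases listed above.

```python
import re
from pathlib import PurePosixPath

# Captures the old line count, the new start line, and the new line count.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def diff_ranges(diff_text: str, repo_root: str) -> list[tuple[str, int, int]]:
    """Return (absolute_path, start_line, end_line) for each run of added lines."""
    ranges = []
    path = None                 # absolute path of the file being parsed
    old_left = new_left = 0     # lines still expected in the current hunk
    new_line = 0                # current line number in the new ("to") file
    run_start = None            # first line of the current run of added lines

    def close_run():
        nonlocal run_start
        if run_start is not None:
            ranges.append((path, run_start, new_line - 1))
            run_start = None

    for line in diff_text.splitlines():
        if old_left > 0 or new_left > 0:          # inside a hunk body
            if line.startswith("+"):
                if run_start is None:
                    run_start = new_line
                new_line += 1
                new_left -= 1
            elif line.startswith("-"):
                close_run()
                old_left -= 1
            elif not line.startswith("\\"):       # skip "\ No newline at end of file"
                close_run()                       # context line
                new_line += 1
                old_left -= 1
                new_left -= 1
            continue
        close_run()
        if line.startswith("+++ "):
            target = line[4:].split("\t")[0]
            path = (None if target == "/dev/null"
                    else str(PurePosixPath(repo_root) / target.removeprefix("b/")))
        else:
            header = HUNK_HEADER.match(line)
            if header:
                old_left = int(header.group(1) or 1)
                new_line = int(header.group(2))
                new_left = int(header.group(3) or 1)
    close_run()
    return ranges
```

Counting the lines announced in each hunk header, rather than looking for `+++` prefixes anywhere, avoids mistaking an added line that happens to start with `++` for a file header.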

### Step 2: Create a data extension pack

Create a temporary directory containing two files. This extension pack feeds into the `restrictAlertsTo` extensible predicate defined in the {% data variables.product.prodname_codeql %} standard library.

**`qlpack.yml`:**

```yaml
name: my-ci/pr-diff-range
version: 0.0.0
library: true
extensionTargets:
  codeql/util: '*' # Target the codeql/util pack where restrictAlertsTo is defined
dataExtensions:
  - pr-diff-range.yml
```

**`pr-diff-range.yml`:**

```yaml
extensions:
  - addsTo:
      pack: codeql/util
      extensible: restrictAlertsTo
      checkPresence: false # Don't error if the predicate doesn't exist in older CLI versions
    data:
      # Each row: [filePath, startLine, endLine]
      - ["/path/to/repo/src/utils.ts", 16, 17]
      - ["/path/to/repo/src/utils.ts", 27, 27]
```

Each data row is `[filePath, startLine, endLine]`. Line numbers are 1-based. The special case `startLine = 0, endLine = 0` denotes a whole-file match.

> [!IMPORTANT]
> If the diff has zero added or modified lines (for example, only deletions), you must still provide a non-empty data extension with a sentinel entry `["", 0, 0]`. An empty `data` section would leave the `restrictAlertsTo` predicate inactive, so all alerts would be produced, which is the opposite of the desired behavior.
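
The two files can be generated from the ranges computed in Step 1. The sketch below is one possible way to do it, assuming the YAML layout shown above. `render_data_extension` and `write_extension_pack` are hypothetical helper names, not {% data variables.product.prodname_codeql %} APIs. Each row is serialized with `json.dumps`, which produces a flow sequence that is also valid YAML.

```python
import json
from pathlib import Path

# Contents of qlpack.yml, matching the example pack above.
QLPACK = """\
name: my-ci/pr-diff-range
version: 0.0.0
library: true
extensionTargets:
  codeql/util: '*'
dataExtensions:
  - pr-diff-range.yml
"""


def render_data_extension(ranges) -> str:
    """Render pr-diff-range.yml from (path, start_line, end_line) tuples."""
    # With no added or modified lines, fall back to the sentinel row so that
    # restrictAlertsTo stays active and no alerts are reported.
    rows = list(ranges) or [("", 0, 0)]
    lines = [
        "extensions:",
        "  - addsTo:",
        "      pack: codeql/util",
        "      extensible: restrictAlertsTo",
        "      checkPresence: false",
        "    data:",
    ]
    lines += [f"      - {json.dumps(list(row))}" for row in rows]
    return "\n".join(lines) + "\n"


def write_extension_pack(pack_dir: str, ranges) -> None:
    """Write both pack files into pack_dir, creating the directory if needed."""
    pack = Path(pack_dir)
    pack.mkdir(parents=True, exist_ok=True)
    (pack / "qlpack.yml").write_text(QLPACK, encoding="utf-8")
    (pack / "pr-diff-range.yml").write_text(
        render_data_extension(ranges), encoding="utf-8")
```

The directory passed as `pack_dir` is the one you would later supply to `--additional-packs`.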

### Step 3: Pass the extension pack to the {% data variables.product.prodname_codeql_cli %}

When running queries, add the following flags to `codeql database run-queries`:

```shell
codeql database run-queries \
  --additional-packs=PATH_TO_EXTENSION_PACK \
  --extension-packs=my-ci/pr-diff-range \
  PATH_TO_DATABASE \
  QUERIES
```

* `--additional-packs` tells {% data variables.product.prodname_codeql %} where to find the pack on disk. For more information, see [AUTOTITLE](/code-security/reference/code-scanning/codeql/codeql-cli-manual/database-run-queries).
* `--extension-packs` tells {% data variables.product.prodname_codeql %} to load the named extension pack.

### Step 4: Exclude diagnostic queries

When using diff-informed analysis, you should exclude queries tagged with `exclude-from-incremental`. These diagnostic queries (for example, metrics or code coverage queries) do not produce alerts, so they provide no value in an incremental context but still consume resources.

You can add this filter to your code scanning configuration file:

```yaml
query-filters:
  - exclude:
      tags: exclude-from-incremental
```

Alternatively, create a query suite file (`.qls`) that excludes those queries:

```yaml
- description: Pull request queries for Java
- import: codeql-suites/java-code-scanning.qls
  from: codeql/java-queries
- exclude:
    tags contain: exclude-from-incremental
```

For more information, see [AUTOTITLE](/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning#filtering-the-queries-in-a-codeql-pack).

### Step 5: Filter the SARIF output

After {% data variables.product.prodname_codeql %} generates the SARIF file, you must filter the output on the CI side to remove results whose locations fall outside the diff ranges.

For each result in the SARIF, check whether any of its `locations` or `relatedLocations` intersect with a diff range for that file. A location intersects a range when `range.startLine <= location.endLine` and `location.startLine <= range.endLine`. The special case `range.startLine == range.endLine == 0` matches any location in the file. Make sure SARIF artifact locations are resolved to the same absolute path format used in the diff ranges before comparing.

The `restrictAlertsTo` predicate permits but does not guarantee that queries omit out-of-range alerts, so CI-side filtering is required for stable results.

For a reference implementation of SARIF filtering, see [`filterAlertsByDiffRange()`](https://github.com/github/codeql-action/blob/v4.36.0/src/upload-lib.ts) in the `codeql-action` source code.
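
The intersection rule above can be sketched in Python as follows. This is an illustrative sketch, not the `codeql-action` implementation, and it rests on assumptions: artifact URIs are either relative to the source root or absolute `file:` URIs with POSIX paths, a missing `endLine` defaults to `startLine`, and a location without a `startLine` is treated as not matching. The names `resolve_uri`, `location_in_diff`, and `filter_sarif` are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def resolve_uri(uri: str, source_root: str) -> str:
    """Resolve a SARIF artifact URI to the absolute path form used in diff ranges."""
    parsed = urlparse(uri)
    path = unquote(parsed.path)
    if parsed.scheme == "file" or path.startswith("/"):
        return path
    return str(PurePosixPath(source_root) / path)


def location_in_diff(location: dict, ranges_by_path: dict, source_root: str) -> bool:
    """Check one SARIF location against the diff ranges for its file."""
    physical = location.get("physicalLocation", {})
    uri = physical.get("artifactLocation", {}).get("uri")
    region = physical.get("region", {})
    start = region.get("startLine")
    if uri is None or start is None:
        return False  # assumption: locations without a line never match
    end = region.get("endLine", start)
    return any(
        (r_start == 0 and r_end == 0)              # whole-file sentinel
        or (r_start <= end and start <= r_end)     # overlapping line ranges
        for r_start, r_end in ranges_by_path.get(resolve_uri(uri, source_root), [])
    )


def filter_sarif(sarif: dict, ranges, source_root: str) -> dict:
    """Drop results with no location or related location inside the diff (in place)."""
    ranges_by_path = {}
    for path, start, end in ranges:
        ranges_by_path.setdefault(path, []).append((start, end))
    for run in sarif.get("runs", []):
        run["results"] = [
            result for result in run.get("results", [])
            if any(
                location_in_diff(loc, ranges_by_path, source_root)
                for loc in result.get("locations", []) + result.get("relatedLocations", [])
            )
        ]
    return sarif
```

Load the SARIF file with `json.load`, pass the parsed document through `filter_sarif`, and write it back before uploading.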

### Summary of CLI flags for diff-informed analysis

| CLI command | Flag | Purpose |
|---|---|---|
| `codeql database init` | `--codescanning-config=FILE` | Code scanning configuration file (for query filter) |
| `codeql database run-queries` | `--additional-packs=DIR` | Location of the extension pack |
| `codeql database run-queries` | `--extension-packs=my-ci/pr-diff-range` | Name of the extension pack to load |
| `codeql database interpret-results` | `--sarif-run-property=incrementalMode=diff-informed` | (Optional) Tag SARIF with diff-informed metadata |

## Overlay analysis

Overlay analysis speeds up {% data variables.product.prodname_codeql %} database creation and query evaluation for pull requests by building on top of a pre-existing "base" database:

1. **On the default branch:** Build an "overlay-base" database (a full database with cached intermediate results). Instead of the default branch, you can use any long-lived branch that pull requests target.
1. **On pull requests:** Download the cached overlay-base database, then create a lightweight "overlay" database that only processes the changed files.

### Overlay-base mode (default branch)

Run overlay-base mode on your default or long-lived target branch after each merge to create and cache a base database.

#### 1. Initialize the database with `--overlay-base`

```shell
codeql database init \
  --overlay-base \
  --db-cluster \
  PATH_TO_DATABASE \
  --source-root=PATH_TO_SOURCE \
  --language=LANGUAGE
```

The `--overlay-base` flag tells {% data variables.product.prodname_codeql %} to build a database that can serve as a base for future overlay analysis.

#### 2. Build and extract as normal

Run any build steps and extraction as you normally would for your project.

#### 3. Record file OIDs

After extraction completes, record the Git object IDs (OIDs) of all tracked files under the source root. This snapshot is used later to determine which files changed. Run the following command from your source root directory (`PATH_TO_SOURCE`):

```shell
cd PATH_TO_SOURCE && git ls-files --recurse-submodules --format='%(objectname)_%(path)'
```

Parse this output into a JSON map of `{ "relative/path": "git-oid" }` and store it alongside the database. The output includes files in Git submodules, which overlay analysis needs to accurately track all file changes between the base and the overlay.
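
A minimal sketch of that parsing step is shown below. `parse_ls_files` and `record_oids` are hypothetical helper names, and the sketch assumes Git prints paths unquoted; paths with unusual characters may be quoted depending on the `core.quotePath` setting. Because object IDs are hexadecimal, the first underscore on each line always separates the OID from the path.

```python
import json
import subprocess


def parse_ls_files(output: str) -> dict[str, str]:
    """Turn 'OID_path' lines into a {relative_path: oid} map."""
    oids = {}
    for line in output.splitlines():
        if not line:
            continue
        # Split on the first underscore only; paths may contain underscores.
        oid, _, path = line.partition("_")
        oids[path] = oid
    return oids


def record_oids(source_root: str, out_file: str) -> None:
    """Run the git ls-files command from this step and store the map as JSON."""
    output = subprocess.run(
        ["git", "ls-files", "--recurse-submodules",
         "--format=%(objectname)_%(path)"],
        cwd=source_root, check=True, capture_output=True, text=True,
    ).stdout
    with open(out_file, "w", encoding="utf-8") as handle:
        json.dump(parse_ls_files(output), handle, indent=2, sort_keys=True)
```

Where the OIDs file lives is up to you, as long as it is cached together with the database.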

#### 4. Run queries and preserve the cache

When running queries on an overlay-base database, do **not** pass `--expect-discarded-cache`. The cached intermediate results are what make pull request builds fast. Discarding them would force full re-evaluation on every PR.

#### 5. Clean up and cache the database

After analysis, clean up the database using the `overlay` cleanup level:

```shell
codeql database cleanup PATH_TO_DATABASE --cache-cleanup=overlay
```

The `overlay` cleanup level preserves more cached data than the default `clear` level. Overlay mode reuses this cached data for efficient query evaluation on pull requests, so discarding it would eliminate the performance benefit.

Then store the database (including the OIDs file) in your caching system for later retrieval by pull request builds.

### Overlay mode (pull requests)

Run overlay mode on pull request builds to create a lightweight database on top of the cached base. If no compatible overlay-base database is available in the cache (for example, on the first run or after a {% data variables.product.prodname_codeql_cli %} version upgrade), skip `--overlay-changes` and run a normal full analysis instead. Cache keys should include at least the {% data variables.product.prodname_codeql_cli %} version and language set to avoid incompatible base databases.
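
One way to build such a cache key is sketched below. The key format, the `codeql-overlay-base` prefix, and the function name are all hypothetical; the only requirement taken from this article is that the CLI version and the language set are part of the key.

```python
def overlay_base_cache_key(cli_version: str, languages,
                           prefix: str = "codeql-overlay-base") -> str:
    """Build a cache key that changes whenever the CLI version or language set changes."""
    # Sort and de-duplicate so the key does not depend on the order
    # in which languages are listed in the pipeline configuration.
    language_part = "_".join(sorted(set(languages)))
    return f"{prefix}-{cli_version}-{language_part}"
```

You may also want to append a commit SHA or timestamp so that newer bases are preferred, depending on how your caching system resolves keys.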

#### 1. Download the cached overlay-base database

Retrieve the most recent overlay-base database from your cache. The database should include the OIDs file recorded during overlay-base mode.

#### 2. Compute changed files

Compare the OIDs recorded in the base database with the current Git state. Run this command from the same source root directory (`PATH_TO_SOURCE`) used during overlay-base mode:

```shell
cd PATH_TO_SOURCE && git ls-files --recurse-submodules --format='%(objectname)_%(path)'
```

Compare the two maps to find files that were added, removed, or modified (different OID). Write the result as a JSON file:

```json
{
  "changes": ["src/modified-file.ts", "src/new-file.ts", "src/deleted-file.ts"]
}
```

The file paths must be relative to the source root.
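
The comparison can be sketched in a few lines of Python, assuming both OID maps have already been parsed into dictionaries of relative path to OID. `changed_files` and `overlay_changes_json` are hypothetical helper names.

```python
import json


def changed_files(base_oids: dict, current_oids: dict) -> list[str]:
    """Return paths that were added, removed, or modified between the two maps."""
    all_paths = base_oids.keys() | current_oids.keys()
    # A path missing from one side yields None, which never equals an OID,
    # so additions and deletions are caught by the same comparison.
    return sorted(p for p in all_paths if base_oids.get(p) != current_oids.get(p))


def overlay_changes_json(base_oids: dict, current_oids: dict) -> str:
    """Render the JSON document expected by --overlay-changes."""
    return json.dumps({"changes": changed_files(base_oids, current_oids)}, indent=2)
```

Write the returned string to the file you pass as `PATH_TO_OVERLAY_CHANGES_JSON` in the next step.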

#### 3. Initialize the database with `--overlay-changes`

Run `codeql database init` against the restored overlay-base database directory. The `PATH_TO_DATABASE` must point to the restored cached overlay-base database, not a new empty directory, because the command extends the existing base for the pull request analysis.

```shell
codeql database init \
  --overlay-changes=PATH_TO_OVERLAY_CHANGES_JSON \
  --db-cluster \
  PATH_TO_DATABASE \
  --source-root=PATH_TO_SOURCE \
  --language=LANGUAGE
```

> [!IMPORTANT]
> In overlay mode, do not pass `--overwrite` or `--force-overwrite`. You are building on top of the existing cached base database, not replacing it.

#### 4. Build, extract, and run queries as normal

Proceed with build, extraction, and query execution as normal. You can add the `--sarif-run-property` flag to your existing `codeql database interpret-results` command to tag the SARIF output with overlay metadata:

```shell
codeql database interpret-results \
  --format=sarif-latest \
  --output=results.sarif \
  --sarif-run-property=incrementalMode=overlay \
  PATH_TO_DATABASE \
  QUERIES_OR_SUITES
```

If both overlay and diff-informed analysis are active, use `incrementalMode=overlay,diff-informed`.

Alerts from incremental analysis appear in the pull request's code scanning results the same way as alerts from full scans. Any overlay-base database will work regardless of age, but fresher bases produce faster and more accurate results.

As with diff-informed analysis, exclude queries tagged `exclude-from-incremental` when using overlay mode. For details, see [Step 4: Exclude diagnostic queries](#step-4-exclude-diagnostic-queries).

### Summary of CLI flags for overlay analysis

| CLI command | Flag | Mode | Purpose |
|---|---|---|---|
| `codeql database init` | `--codescanning-config=FILE` | overlay | Code scanning configuration file (for query filter) |
| `codeql database init` | `--overlay-base` | overlay-base | Build a base database for future overlay use |
| `codeql database init` | `--overlay-changes=FILE` | overlay | Build overlay database using only changed files |
| `codeql database init` | _(no `--overwrite`)_ | overlay | Don't overwrite the cached base database |
| `codeql database run-queries` | _(no `--expect-discarded-cache`)_ | overlay-base | Preserve cached intermediate results |
| `codeql database cleanup` | `--cache-cleanup=overlay` | overlay-base | Use overlay-specific cleanup level |
| `codeql database interpret-results` | `--sarif-run-property=incrementalMode=overlay` | overlay | Tag SARIF with overlay metadata |

### Minimum CLI bundle versions

The base minimum version for overlay analysis is 2.23.8. Some languages require higher minimum versions:

| Language | Minimum {% data variables.product.prodname_codeql_cli %} bundle version |
|---|---|
| C/C++ | 2.25.0 |
| C# | 2.24.1 |
| Go | 2.24.2 |
| Java | 2.23.8 |
| JavaScript | 2.23.9 |
| Python | 2.23.9 |
| Ruby | 2.23.9 |
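
A pipeline can use this table to decide whether to attempt overlay analysis at all. The following sketch encodes the table as data. The dictionary keys are an assumption: they use the language identifiers typically passed to `--language` (`cpp`, `csharp`, and so on) rather than the display names in the table, and `overlay_supported` is a hypothetical helper.

```python
# Minimum CLI bundle versions for overlay analysis, copied from the table above.
OVERLAY_MIN_VERSION = {
    "cpp": (2, 25, 0),
    "csharp": (2, 24, 1),
    "go": (2, 24, 2),
    "java": (2, 23, 8),
    "javascript": (2, 23, 9),
    "python": (2, 23, 9),
    "ruby": (2, 23, 9),
}


def parse_version(version: str) -> tuple[int, ...]:
    """Convert '2.23.10' to (2, 23, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def overlay_supported(language: str, cli_version: str) -> bool:
    """Return True if the CLI version meets the overlay minimum for the language."""
    minimum = OVERLAY_MIN_VERSION.get(language)
    return minimum is not None and parse_version(cli_version) >= minimum
```

Comparing tuples of integers instead of strings matters here: as strings, `2.23.10` would sort before `2.23.9`.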

content/code-security/how-tos/find-and-fix-code-vulnerabilities/scan-from-the-command-line/index.md

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ children:
   - /testing-query-help-files
   - /download-databases
   - /check-out-source-code
+  - /incremental-analysis
   - /specifying-command-options-in-a-codeql-configuration-file
   - /creating-database-bundle-for-troubleshooting
 redirect_from:
