test: add PDF version and type extraction test fixtures by maximilien · Pull Request #66 · maximilien/weave-cli

maximilien · 2026-03-29T01:12:10Z

Summary

Addresses #8 — adds test fixtures and enables previously-skipped PDF extraction tests.

Changes

PDF version fixtures (1.3, 1.4, 1.7, 2.0): Generated via fpdf2 with version-patched headers to test extraction across PDF spec versions
PDF type fixtures: text-only, mixed (text+images), photo-heavy (8 images), scanned (image-only, no text layer)
Enabled tests: TestPDFVersionCompatibility, TestPDFTypeExtraction, TestPDFWithCMYKImages
Generator script: tests/fixtures/generate_test_pdfs.py for reproducible fixture regeneration

Test Results

All 3 test suites pass (7.2s total):

TestPDFVersionCompatibility: 5/5 subtests pass across PDF 1.3-2.0
TestPDFTypeExtraction: 5/5 subtests pass
TestPDFWithCMYKImages: passes with photo-heavy fixture

Notes

Image extraction from fpdf2-generated PDFs returns 0 images (different embedding than production tools). Tests use warning logs, not hard failures. Real-world PDF correctly extracts images.
Scanned PDF correctly extracts 2 images and generates fallback text.

Closes #8

Summary by CodeRabbit

Release Notes

Tests
- Added PDF test fixture generation supporting multiple document formats (text-only, mixed content, photo-heavy, and scanned documents)
- Added PDF version compatibility test fixtures for comprehensive version testing
- Enabled previously skipped PDF integration test cases

- Generate test PDFs for versions 1.3, 1.4, 1.7, and 2.0 - Add text-only, mixed, photo-heavy, and scanned PDF type fixtures - Enable previously-skipped TestPDFVersionCompatibility tests - Enable previously-skipped TestPDFTypeExtraction tests - Update TestPDFWithCMYKImages to use photo-heavy fixture - Include generate_test_pdfs.py for reproducible fixture generation All tests pass: version extraction works across PDF 1.3-2.0, text-only and scanned PDFs extract correctly, mixed content handles both text and image extraction. Closes #8

coderabbitai · 2026-03-29T01:12:53Z

📝 Walkthrough

Walkthrough

A new test fixture generator script creates multiple PDF files for testing PDF extraction across different versions and content types. Corresponding integration tests are updated to enable previously skipped test cases and run assertions against the newly generated fixtures.

Changes

Cohort / File(s)	Summary
PDF Fixture Generator `tests/fixtures/generate_test_pdfs.py`	New script with utilities to create versioned PDFs (by patching headers), text-only PDFs, mixed text+image PDFs, photo-heavy PDFs with multiple pages, and simulated scanned PDFs with large raster images. Includes helper functions for directory creation, standard content addition, and test image generation.
PDF Integration Tests `tests/pdf_integration_test.go`	Enabled previously skipped test cases in `TestPDFVersionCompatibility` and `TestPDFTypeExtraction` by removing skip reasons. Updated `TestPDFWithCMYKImages` to execute PDF extraction against `photo_heavy.pdf` fixture and assert extracted content, rather than unconditionally skipping.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 wiggles nose at PDFs
With versioned pages and images so bright,
Our fixtures now hop through the tests every night!
No more skips, just extracts with glee,
A PDF test suite, as good as can be! 📄✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and concisely describes the main change: adding PDF test fixtures for version and type extraction testing, which is the core purpose of the PR.
Linked Issues check	✅ Passed	The PR fully satisfies issue `#8` requirements: generates PDFs for versions 1.3, 1.4, 1.7, 2.0; creates fixtures for text-only, mixed, photo-heavy, and scanned content types; and enables previously-skipped tests to validate extraction behavior across these variants.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#8`: a PDF fixture generator script and updates to enable related tests. No unrelated code modifications are present.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch 8-pdf-version-extraction-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tests/pdf_integration_test.go (2)
276-349: ⚠️ Potential issue | 🟠 Major

Use minVersion to verify the fixture header before extraction.

Right now this matrix only proves that each file is readable. Because minVersion is never asserted, the test would still pass if all four “versioned” fixtures were accidentally generated with the same PDF header. Add a cheap %PDF-x.y prefix check so this actually guards the version-fixture contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/pdf_integration_test.go` around lines 276 - 349, Before calling
pdf.ExtractPDFContent, open the fixture file and read its header bytes and
assert it contains a PDF version prefix matching tc.minVersion; specifically,
for each test case (testCases / tc) read the first ~8 bytes of tc.pdfFile,
verify the header matches the pattern "%PDF-x.y" and that the x.y string equals
tc.minVersion (or fail the test with t.Fatalf/t.Errorf), then proceed to call
pdf.ExtractPDFContent; reference tc.minVersion, tc.pdfFile and
pdf.ExtractPDFContent to locate where to insert this cheap header check.
423-447: ⚠️ Potential issue | 🟠 Major

This still won’t catch image-extraction regressions.

These cases enable skipSmallImages=true with a 10240 byte cutoff, then downgrade missing images to a log line. For the synthetic fixtures, that means CI can stay green even when image extraction returns 0, which misses the regression signal this table is supposed to provide. Either disable the small-image filter for this matrix or encode best-effort cases explicitly, but expectedImages cases should fail when they miss their minimums.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/pdf_integration_test.go` around lines 423 - 447, The test currently
downgrades missing images to a log line which hides regressions; update the
image-verification so that when tc.expectedImages is true and len(imageData) <
tc.minImages the test fails (use t.Errorf or t.Fatalf) instead of t.Logf, and
for matrix entries that must reliably contain images either call
pdf.ExtractPDFContent with the small-image filter disabled (e.g. set the
image-size cutoff to 0 or the skipSmallImages flag to false) or add an explicit
best-effort flag on the test case to allow non-fatal expectations; change the
checks around imageData, tc.expectedImages and tc.minImages and adjust the
ExtractPDFContent call parameters accordingly to enforce failure on missing
expected images.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/fixtures/generate_test_pdfs.py`:
- Around line 9-15: BASE is hard-coded to /tmp and helper images are written to
fixed /tmp paths; change this to derive the fixture root from the script's
location (use the script file directory via __file__ /
pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE references
accordingly, and replace fixed helper PNG paths with per-run temporary files
(use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper images are
unique per invocation; update any functions/variables that reference BASE and
the helper image names (e.g., BASE, make_dirs, and the code sections that
create/write the helper PNGs) to use the derived fixture root and temp file
paths.

In `@tests/pdf_integration_test.go`:
- Around line 463-475: The test is not validating CMYK handling because
photo_heavy.pdf is RGB; update the test around ExtractPDFContent to either (A)
skip this CMYK-specific assertion until a real CMYK fixture exists by calling
t.Skip with a short reason, or (B) add a true CMYK fixture (e.g.,
tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from
generate_test_pdfs.py or a checked-in sample) and change fixturePath to that
CMYK file and assert on image extraction (check len(imageData) > 0 and no error)
to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and
fixturePath/photo_heavy.pdf when making the change.

---

Outside diff comments:
In `@tests/pdf_integration_test.go`:
- Around line 276-349: Before calling pdf.ExtractPDFContent, open the fixture
file and read its header bytes and assert it contains a PDF version prefix
matching tc.minVersion; specifically, for each test case (testCases / tc) read
the first ~8 bytes of tc.pdfFile, verify the header matches the pattern
"%PDF-x.y" and that the x.y string equals tc.minVersion (or fail the test with
t.Fatalf/t.Errorf), then proceed to call pdf.ExtractPDFContent; reference
tc.minVersion, tc.pdfFile and pdf.ExtractPDFContent to locate where to insert
this cheap header check.
- Around line 423-447: The test currently downgrades missing images to a log
line which hides regressions; update the image-verification so that when
tc.expectedImages is true and len(imageData) < tc.minImages the test fails (use
t.Errorf or t.Fatalf) instead of t.Logf, and for matrix entries that must
reliably contain images either call pdf.ExtractPDFContent with the small-image
filter disabled (e.g. set the image-size cutoff to 0 or the skipSmallImages flag
to false) or add an explicit best-effort flag on the test case to allow
non-fatal expectations; change the checks around imageData, tc.expectedImages
and tc.minImages and adjust the ExtractPDFContent call parameters accordingly to
enforce failure on missing expected images.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 578ae30b-0ff6-409c-b2fb-be11edce27dc

📥 Commits

Reviewing files that changed from the base of the PR and between 067b8fe and 5aa77f5.

⛔ Files ignored due to path filters (8)

tests/fixtures/pdf_types/mixed.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_types/photo_heavy.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_types/scanned.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_types/text_only.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_versions/pdf_1.3.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_versions/pdf_1.4.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_versions/pdf_1.7.pdf is excluded by !**/*.pdf
tests/fixtures/pdf_versions/pdf_2.0.pdf is excluded by !**/*.pdf

📒 Files selected for processing (2)

tests/fixtures/generate_test_pdfs.py
tests/pdf_integration_test.go

coderabbitai · 2026-03-29T01:19:10Z

tests/fixtures/generate_test_pdfs.py

+BASE = "/tmp/weave-cli/tests/fixtures"
+MC = dict(new_x="LMARGIN", new_y="NEXT")  # multi_cell kwargs
+
+
+def make_dirs():
+    os.makedirs(f"{BASE}/pdf_versions", exist_ok=True)
+    os.makedirs(f"{BASE}/pdf_types", exist_ok=True)


⚠️ Potential issue | 🟠 Major

Avoid hard-coded /tmp paths for both fixture output and scratch assets.

BASE points at one specific checkout location, and the helper images are also written to fixed /tmp/... names. Re-running this from any other clone writes outside the repo, and parallel/local reruns can trample each other’s intermediates. Since this script exists to reproduce the checked-in fixtures, derive the fixture root from the script location and allocate per-run temp files for the PNGs.

Also applies to: 115-116, 173-174, 203-204

🧰 Tools

🪛 Ruff (0.15.7)

[error] 9-9: Probable insecure usage of temporary file or directory: "/tmp/weave-cli/tests/fixtures"

(S108)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/fixtures/generate_test_pdfs.py` around lines 9 - 15, BASE is hard-coded to /tmp and helper images are written to fixed /tmp paths; change this to derive the fixture root from the script's location (use the script file directory via __file__ / pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE references accordingly, and replace fixed helper PNG paths with per-run temporary files (use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper images are unique per invocation; update any functions/variables that reference BASE and the helper image names (e.g., BASE, make_dirs, and the code sections that create/write the helper PNGs) to use the derived fixture root and temp file paths.

coderabbitai · 2026-03-29T01:19:10Z

tests/pdf_integration_test.go

+	// TODO: Add dedicated CMYK test PDF when available.
+	// For now, verify the photo_heavy fixture extracts without CMYK-related errors.
+	fixturePath := filepath.Join("fixtures", "pdf_types", "photo_heavy.pdf")
+	textData, imageData, err := pdf.ExtractPDFContent(fixturePath, 1000, true, 10240, 2000, false)
+	if err != nil {
+		t.Fatalf("ExtractPDFContent() error = %v", err)
+	}

-	// TODO: Add test with CMYK PDF when available
-	// Should verify:
-	// - CMYK images are extracted
-	// - Color space is preserved or converted correctly
-	// - No corruption in image data
+	t.Logf("Photo-heavy PDF: %d text chunks, %d images (CMYK placeholder test)", len(textData), len(imageData))
+
+	if len(textData) == 0 {
+		t.Error("Expected at least some text content from photo-heavy PDF")
+	}


⚠️ Potential issue | 🟠 Major

This is still not exercising CMYK handling.

photo_heavy.pdf is generated from RGB images in tests/fixtures/generate_test_pdfs.py, and this test only asserts that some text was extracted. A CMYK-specific decode regression can slip through here without failing CI. Either keep this skipped until a real CMYK fixture exists, or add a CMYK fixture and assert on image extraction for that path.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/pdf_integration_test.go` around lines 463 - 475, The test is not validating CMYK handling because photo_heavy.pdf is RGB; update the test around ExtractPDFContent to either (A) skip this CMYK-specific assertion until a real CMYK fixture exists by calling t.Skip with a short reason, or (B) add a true CMYK fixture (e.g., tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from generate_test_pdfs.py or a checked-in sample) and change fixturePath to that CMYK file and assert on image extraction (check len(imageData) > 0 and no error) to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and fixturePath/photo_heavy.pdf when making the change.

maximilien · 2026-03-29T01:48:17Z

tests should be in Golang if possible

maximilien added enhancement New feature or request test Testing related issues labels Mar 29, 2026

maximilien mentioned this pull request Mar 29, 2026

test: extract various PDFs from different years (versions) #8

Open

coderabbitai bot reviewed Mar 29, 2026

View reviewed changes

maximilien closed this Mar 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: add PDF version and type extraction test fixtures#66

test: add PDF version and type extraction test fixtures#66
maximilien wants to merge 1 commit intomainfrom
8-pdf-version-extraction-tests

maximilien commented Mar 29, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 29, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 29, 2026

Uh oh!

coderabbitai bot Mar 29, 2026

Uh oh!

maximilien commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maximilien commented Mar 29, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test Results

Notes

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

maximilien commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maximilien commented Mar 29, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 29, 2026 •

edited

Loading