Skip to content

test: add PDF version and type extraction test fixtures#66

Closed
maximilien wants to merge 1 commit intomainfrom
8-pdf-version-extraction-tests
Closed

test: add PDF version and type extraction test fixtures#66
maximilien wants to merge 1 commit intomainfrom
8-pdf-version-extraction-tests

Conversation

@maximilien
Copy link
Copy Markdown
Owner

@maximilien maximilien commented Mar 29, 2026

Summary

Addresses #8 — adds test fixtures and enables previously-skipped PDF extraction tests.

Changes

  • PDF version fixtures (1.3, 1.4, 1.7, 2.0): Generated via fpdf2 with version-patched headers to test extraction across PDF spec versions
  • PDF type fixtures: text-only, mixed (text+images), photo-heavy (8 images), scanned (image-only, no text layer)
  • Enabled tests: TestPDFVersionCompatibility, TestPDFTypeExtraction, TestPDFWithCMYKImages
  • Generator script: tests/fixtures/generate_test_pdfs.py for reproducible fixture regeneration

Test Results

All 3 test suites pass (7.2s total):

  • TestPDFVersionCompatibility: 5/5 subtests pass across PDF 1.3-2.0
  • TestPDFTypeExtraction: 5/5 subtests pass
  • TestPDFWithCMYKImages: passes with photo-heavy fixture

Notes

  • Image extraction from fpdf2-generated PDFs returns 0 images (different embedding than production tools). Tests use warning logs, not hard failures. Real-world PDF correctly extracts images.
  • Scanned PDF correctly extracts 2 images and generates fallback text.

Closes #8

Summary by CodeRabbit

Release Notes

  • Tests
    • Added PDF test fixture generation supporting multiple document formats (text-only, mixed content, photo-heavy, and scanned documents)
    • Added PDF version compatibility test fixtures for comprehensive version testing
    • Enabled previously skipped PDF integration test cases

- Generate test PDFs for versions 1.3, 1.4, 1.7, and 2.0
- Add text-only, mixed, photo-heavy, and scanned PDF type fixtures
- Enable previously-skipped TestPDFVersionCompatibility tests
- Enable previously-skipped TestPDFTypeExtraction tests
- Update TestPDFWithCMYKImages to use photo-heavy fixture
- Include generate_test_pdfs.py for reproducible fixture generation

All tests pass: version extraction works across PDF 1.3-2.0,
text-only and scanned PDFs extract correctly, mixed content
handles both text and image extraction.

Closes #8
@maximilien maximilien added enhancement New feature or request test Testing related issues labels Mar 29, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 29, 2026

📝 Walkthrough

Walkthrough

A new test fixture generator script creates multiple PDF files for testing PDF extraction across different versions and content types. Corresponding integration tests are updated to enable previously skipped test cases and run assertions against the newly generated fixtures.

Changes

Cohort / File(s) Summary
PDF Fixture Generator
tests/fixtures/generate_test_pdfs.py
New script with utilities to create versioned PDFs (by patching headers), text-only PDFs, mixed text+image PDFs, photo-heavy PDFs with multiple pages, and simulated scanned PDFs with large raster images. Includes helper functions for directory creation, standard content addition, and test image generation.
PDF Integration Tests
tests/pdf_integration_test.go
Enabled previously skipped test cases in TestPDFVersionCompatibility and TestPDFTypeExtraction by removing skip reasons. Updated TestPDFWithCMYKImages to execute PDF extraction against photo_heavy.pdf fixture and assert extracted content, rather than unconditionally skipping.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 wiggles nose at PDFs
With versioned pages and images so bright,
Our fixtures now hop through the tests every night!
No more skips, just extracts with glee,
A PDF test suite, as good as can be! 📄✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and concisely describes the main change: adding PDF test fixtures for version and type extraction testing, which is the core purpose of the PR.
Linked Issues check ✅ Passed The PR fully satisfies issue #8 requirements: generates PDFs for versions 1.3, 1.4, 1.7, 2.0; creates fixtures for text-only, mixed, photo-heavy, and scanned content types; and enables previously-skipped tests to validate extraction behavior across these variants.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #8: a PDF fixture generator script and updates to enable related tests. No unrelated code modifications are present.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 8-pdf-version-extraction-tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/pdf_integration_test.go (2)

276-349: ⚠️ Potential issue | 🟠 Major

Use minVersion to verify the fixture header before extraction.

Right now this matrix only proves that each file is readable. Because minVersion is never asserted, the test would still pass if all four “versioned” fixtures were accidentally generated with the same PDF header. Add a cheap %PDF-x.y prefix check so this actually guards the version-fixture contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/pdf_integration_test.go` around lines 276 - 349, Before calling
pdf.ExtractPDFContent, open the fixture file and read its header bytes and
assert it contains a PDF version prefix matching tc.minVersion; specifically,
for each test case (testCases / tc) read the first ~8 bytes of tc.pdfFile,
verify the header matches the pattern "%PDF-x.y" and that the x.y string equals
tc.minVersion (or fail the test with t.Fatalf/t.Errorf), then proceed to call
pdf.ExtractPDFContent; reference tc.minVersion, tc.pdfFile and
pdf.ExtractPDFContent to locate where to insert this cheap header check.

423-447: ⚠️ Potential issue | 🟠 Major

This still won’t catch image-extraction regressions.

These cases enable skipSmallImages=true with a 10240 byte cutoff, then downgrade missing images to a log line. For the synthetic fixtures, that means CI can stay green even when image extraction returns 0, which misses the regression signal this table is supposed to provide. Either disable the small-image filter for this matrix or encode best-effort cases explicitly, but expectedImages cases should fail when they miss their minimums.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/pdf_integration_test.go` around lines 423 - 447, The test currently
downgrades missing images to a log line which hides regressions; update the
image-verification so that when tc.expectedImages is true and len(imageData) <
tc.minImages the test fails (use t.Errorf or t.Fatalf) instead of t.Logf, and
for matrix entries that must reliably contain images either call
pdf.ExtractPDFContent with the small-image filter disabled (e.g. set the
image-size cutoff to 0 or the skipSmallImages flag to false) or add an explicit
best-effort flag on the test case to allow non-fatal expectations; change the
checks around imageData, tc.expectedImages and tc.minImages and adjust the
ExtractPDFContent call parameters accordingly to enforce failure on missing
expected images.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/fixtures/generate_test_pdfs.py`:
- Around line 9-15: BASE is hard-coded to /tmp and helper images are written to
fixed /tmp paths; change this to derive the fixture root from the script's
location (use the script file directory via __file__ /
pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE references
accordingly, and replace fixed helper PNG paths with per-run temporary files
(use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper images are
unique per invocation; update any functions/variables that reference BASE and
the helper image names (e.g., BASE, make_dirs, and the code sections that
create/write the helper PNGs) to use the derived fixture root and temp file
paths.

In `@tests/pdf_integration_test.go`:
- Around line 463-475: The test is not validating CMYK handling because
photo_heavy.pdf is RGB; update the test around ExtractPDFContent to either (A)
skip this CMYK-specific assertion until a real CMYK fixture exists by calling
t.Skip with a short reason, or (B) add a true CMYK fixture (e.g.,
tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from
generate_test_pdfs.py or a checked-in sample) and change fixturePath to that
CMYK file and assert on image extraction (check len(imageData) > 0 and no error)
to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and
fixturePath/photo_heavy.pdf when making the change.

---

Outside diff comments:
In `@tests/pdf_integration_test.go`:
- Around line 276-349: Before calling pdf.ExtractPDFContent, open the fixture
file and read its header bytes and assert it contains a PDF version prefix
matching tc.minVersion; specifically, for each test case (testCases / tc) read
the first ~8 bytes of tc.pdfFile, verify the header matches the pattern
"%PDF-x.y" and that the x.y string equals tc.minVersion (or fail the test with
t.Fatalf/t.Errorf), then proceed to call pdf.ExtractPDFContent; reference
tc.minVersion, tc.pdfFile and pdf.ExtractPDFContent to locate where to insert
this cheap header check.
- Around line 423-447: The test currently downgrades missing images to a log
line which hides regressions; update the image-verification so that when
tc.expectedImages is true and len(imageData) < tc.minImages the test fails (use
t.Errorf or t.Fatalf) instead of t.Logf, and for matrix entries that must
reliably contain images either call pdf.ExtractPDFContent with the small-image
filter disabled (e.g. set the image-size cutoff to 0 or the skipSmallImages flag
to false) or add an explicit best-effort flag on the test case to allow
non-fatal expectations; change the checks around imageData, tc.expectedImages
and tc.minImages and adjust the ExtractPDFContent call parameters accordingly to
enforce failure on missing expected images.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 578ae30b-0ff6-409c-b2fb-be11edce27dc

📥 Commits

Reviewing files that changed from the base of the PR and between 067b8fe and 5aa77f5.

⛔ Files ignored due to path filters (8)
  • tests/fixtures/pdf_types/mixed.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_types/photo_heavy.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_types/scanned.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_types/text_only.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_versions/pdf_1.3.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_versions/pdf_1.4.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_versions/pdf_1.7.pdf is excluded by !**/*.pdf
  • tests/fixtures/pdf_versions/pdf_2.0.pdf is excluded by !**/*.pdf
📒 Files selected for processing (2)
  • tests/fixtures/generate_test_pdfs.py
  • tests/pdf_integration_test.go

Comment on lines +9 to +15
BASE = "/tmp/weave-cli/tests/fixtures"
MC = dict(new_x="LMARGIN", new_y="NEXT") # multi_cell kwargs


def make_dirs():
os.makedirs(f"{BASE}/pdf_versions", exist_ok=True)
os.makedirs(f"{BASE}/pdf_types", exist_ok=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid hard-coded /tmp paths for both fixture output and scratch assets.

BASE points at one specific checkout location, and the helper images are also written to fixed /tmp/... names. Re-running this from any other clone writes outside the repo, and parallel/local reruns can trample each other’s intermediates. Since this script exists to reproduce the checked-in fixtures, derive the fixture root from the script location and allocate per-run temp files for the PNGs.

Also applies to: 115-116, 173-174, 203-204

🧰 Tools
🪛 Ruff (0.15.7)

[error] 9-9: Probable insecure usage of temporary file or directory: "/tmp/weave-cli/tests/fixtures"

(S108)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/generate_test_pdfs.py` around lines 9 - 15, BASE is hard-coded
to /tmp and helper images are written to fixed /tmp paths; change this to derive
the fixture root from the script's location (use the script file directory via
__file__ / pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE
references accordingly, and replace fixed helper PNG paths with per-run
temporary files (use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper
images are unique per invocation; update any functions/variables that reference
BASE and the helper image names (e.g., BASE, make_dirs, and the code sections
that create/write the helper PNGs) to use the derived fixture root and temp file
paths.

Comment on lines +463 to +475
// TODO: Add dedicated CMYK test PDF when available.
// For now, verify the photo_heavy fixture extracts without CMYK-related errors.
fixturePath := filepath.Join("fixtures", "pdf_types", "photo_heavy.pdf")
textData, imageData, err := pdf.ExtractPDFContent(fixturePath, 1000, true, 10240, 2000, false)
if err != nil {
t.Fatalf("ExtractPDFContent() error = %v", err)
}

// TODO: Add test with CMYK PDF when available
// Should verify:
// - CMYK images are extracted
// - Color space is preserved or converted correctly
// - No corruption in image data
t.Logf("Photo-heavy PDF: %d text chunks, %d images (CMYK placeholder test)", len(textData), len(imageData))

if len(textData) == 0 {
t.Error("Expected at least some text content from photo-heavy PDF")
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This is still not exercising CMYK handling.

photo_heavy.pdf is generated from RGB images in tests/fixtures/generate_test_pdfs.py, and this test only asserts that some text was extracted. A CMYK-specific decode regression can slip through here without failing CI. Either keep this skipped until a real CMYK fixture exists, or add a CMYK fixture and assert on image extraction for that path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/pdf_integration_test.go` around lines 463 - 475, The test is not
validating CMYK handling because photo_heavy.pdf is RGB; update the test around
ExtractPDFContent to either (A) skip this CMYK-specific assertion until a real
CMYK fixture exists by calling t.Skip with a short reason, or (B) add a true
CMYK fixture (e.g., tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from
generate_test_pdfs.py or a checked-in sample) and change fixturePath to that
CMYK file and assert on image extraction (check len(imageData) > 0 and no error)
to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and
fixturePath/photo_heavy.pdf when making the change.

@maximilien
Copy link
Copy Markdown
Owner Author

tests should be in Golang if possible

@maximilien maximilien closed this Mar 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request test Testing related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: extract various PDFs from different years (versions)

1 participant