test: add PDF version and type extraction test fixtures#66
test: add PDF version and type extraction test fixtures#66maximilien wants to merge 1 commit intomainfrom
Conversation
- Generate test PDFs for versions 1.3, 1.4, 1.7, and 2.0 - Add text-only, mixed, photo-heavy, and scanned PDF type fixtures - Enable previously-skipped TestPDFVersionCompatibility tests - Enable previously-skipped TestPDFTypeExtraction tests - Update TestPDFWithCMYKImages to use photo-heavy fixture - Include generate_test_pdfs.py for reproducible fixture generation All tests pass: version extraction works across PDF 1.3-2.0, text-only and scanned PDFs extract correctly, mixed content handles both text and image extraction. Closes #8
📝 WalkthroughWalkthroughA new test fixture generator script creates multiple PDF files for testing PDF extraction across different versions and content types. Corresponding integration tests are updated to enable previously skipped test cases and run assertions against the newly generated fixtures. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/pdf_integration_test.go (2)
276-349:⚠️ Potential issue | 🟠 MajorUse
minVersionto verify the fixture header before extraction.Right now this matrix only proves that each file is readable. Because
minVersionis never asserted, the test would still pass if all four “versioned” fixtures were accidentally generated with the same PDF header. Add a cheap%PDF-x.yprefix check so this actually guards the version-fixture contract.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/pdf_integration_test.go` around lines 276 - 349, Before calling pdf.ExtractPDFContent, open the fixture file and read its header bytes and assert it contains a PDF version prefix matching tc.minVersion; specifically, for each test case (testCases / tc) read the first ~8 bytes of tc.pdfFile, verify the header matches the pattern "%PDF-x.y" and that the x.y string equals tc.minVersion (or fail the test with t.Fatalf/t.Errorf), then proceed to call pdf.ExtractPDFContent; reference tc.minVersion, tc.pdfFile and pdf.ExtractPDFContent to locate where to insert this cheap header check.
423-447:⚠️ Potential issue | 🟠 MajorThis still won’t catch image-extraction regressions.
These cases enable
skipSmallImages=truewith a10240byte cutoff, then downgrade missing images to a log line. For the synthetic fixtures, that means CI can stay green even when image extraction returns0, which misses the regression signal this table is supposed to provide. Either disable the small-image filter for this matrix or encode best-effort cases explicitly, butexpectedImagescases should fail when they miss their minimums.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/pdf_integration_test.go` around lines 423 - 447, The test currently downgrades missing images to a log line which hides regressions; update the image-verification so that when tc.expectedImages is true and len(imageData) < tc.minImages the test fails (use t.Errorf or t.Fatalf) instead of t.Logf, and for matrix entries that must reliably contain images either call pdf.ExtractPDFContent with the small-image filter disabled (e.g. set the image-size cutoff to 0 or the skipSmallImages flag to false) or add an explicit best-effort flag on the test case to allow non-fatal expectations; change the checks around imageData, tc.expectedImages and tc.minImages and adjust the ExtractPDFContent call parameters accordingly to enforce failure on missing expected images.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/fixtures/generate_test_pdfs.py`:
- Around line 9-15: BASE is hard-coded to /tmp and helper images are written to
fixed /tmp paths; change this to derive the fixture root from the script's
location (use the script file directory via __file__ /
pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE references
accordingly, and replace fixed helper PNG paths with per-run temporary files
(use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper images are
unique per invocation; update any functions/variables that reference BASE and
the helper image names (e.g., BASE, make_dirs, and the code sections that
create/write the helper PNGs) to use the derived fixture root and temp file
paths.
In `@tests/pdf_integration_test.go`:
- Around line 463-475: The test is not validating CMYK handling because
photo_heavy.pdf is RGB; update the test around ExtractPDFContent to either (A)
skip this CMYK-specific assertion until a real CMYK fixture exists by calling
t.Skip with a short reason, or (B) add a true CMYK fixture (e.g.,
tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from
generate_test_pdfs.py or a checked-in sample) and change fixturePath to that
CMYK file and assert on image extraction (check len(imageData) > 0 and no error)
to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and
fixturePath/photo_heavy.pdf when making the change.
---
Outside diff comments:
In `@tests/pdf_integration_test.go`:
- Around line 276-349: Before calling pdf.ExtractPDFContent, open the fixture
file and read its header bytes and assert it contains a PDF version prefix
matching tc.minVersion; specifically, for each test case (testCases / tc) read
the first ~8 bytes of tc.pdfFile, verify the header matches the pattern
"%PDF-x.y" and that the x.y string equals tc.minVersion (or fail the test with
t.Fatalf/t.Errorf), then proceed to call pdf.ExtractPDFContent; reference
tc.minVersion, tc.pdfFile and pdf.ExtractPDFContent to locate where to insert
this cheap header check.
- Around line 423-447: The test currently downgrades missing images to a log
line which hides regressions; update the image-verification so that when
tc.expectedImages is true and len(imageData) < tc.minImages the test fails (use
t.Errorf or t.Fatalf) instead of t.Logf, and for matrix entries that must
reliably contain images either call pdf.ExtractPDFContent with the small-image
filter disabled (e.g. set the image-size cutoff to 0 or the skipSmallImages flag
to false) or add an explicit best-effort flag on the test case to allow
non-fatal expectations; change the checks around imageData, tc.expectedImages
and tc.minImages and adjust the ExtractPDFContent call parameters accordingly to
enforce failure on missing expected images.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 578ae30b-0ff6-409c-b2fb-be11edce27dc
⛔ Files ignored due to path filters (8)
tests/fixtures/pdf_types/mixed.pdfis excluded by!**/*.pdftests/fixtures/pdf_types/photo_heavy.pdfis excluded by!**/*.pdftests/fixtures/pdf_types/scanned.pdfis excluded by!**/*.pdftests/fixtures/pdf_types/text_only.pdfis excluded by!**/*.pdftests/fixtures/pdf_versions/pdf_1.3.pdfis excluded by!**/*.pdftests/fixtures/pdf_versions/pdf_1.4.pdfis excluded by!**/*.pdftests/fixtures/pdf_versions/pdf_1.7.pdfis excluded by!**/*.pdftests/fixtures/pdf_versions/pdf_2.0.pdfis excluded by!**/*.pdf
📒 Files selected for processing (2)
tests/fixtures/generate_test_pdfs.pytests/pdf_integration_test.go
| BASE = "/tmp/weave-cli/tests/fixtures" | ||
| MC = dict(new_x="LMARGIN", new_y="NEXT") # multi_cell kwargs | ||
|
|
||
|
|
||
| def make_dirs(): | ||
| os.makedirs(f"{BASE}/pdf_versions", exist_ok=True) | ||
| os.makedirs(f"{BASE}/pdf_types", exist_ok=True) |
There was a problem hiding this comment.
Avoid hard-coded /tmp paths for both fixture output and scratch assets.
BASE points at one specific checkout location, and the helper images are also written to fixed /tmp/... names. Re-running this from any other clone writes outside the repo, and parallel/local reruns can trample each other’s intermediates. Since this script exists to reproduce the checked-in fixtures, derive the fixture root from the script location and allocate per-run temp files for the PNGs.
Also applies to: 115-116, 173-174, 203-204
🧰 Tools
🪛 Ruff (0.15.7)
[error] 9-9: Probable insecure usage of temporary file or directory: "/tmp/weave-cli/tests/fixtures"
(S108)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/fixtures/generate_test_pdfs.py` around lines 9 - 15, BASE is hard-coded
to /tmp and helper images are written to fixed /tmp paths; change this to derive
the fixture root from the script's location (use the script file directory via
__file__ / pathlib.Path(__file__).resolve().parent) and update make_dirs/BASE
references accordingly, and replace fixed helper PNG paths with per-run
temporary files (use tempfile.NamedTemporaryFile or tempfile.mkstemp) so helper
images are unique per invocation; update any functions/variables that reference
BASE and the helper image names (e.g., BASE, make_dirs, and the code sections
that create/write the helper PNGs) to use the derived fixture root and temp file
paths.
| // TODO: Add dedicated CMYK test PDF when available. | ||
| // For now, verify the photo_heavy fixture extracts without CMYK-related errors. | ||
| fixturePath := filepath.Join("fixtures", "pdf_types", "photo_heavy.pdf") | ||
| textData, imageData, err := pdf.ExtractPDFContent(fixturePath, 1000, true, 10240, 2000, false) | ||
| if err != nil { | ||
| t.Fatalf("ExtractPDFContent() error = %v", err) | ||
| } | ||
|
|
||
| // TODO: Add test with CMYK PDF when available | ||
| // Should verify: | ||
| // - CMYK images are extracted | ||
| // - Color space is preserved or converted correctly | ||
| // - No corruption in image data | ||
| t.Logf("Photo-heavy PDF: %d text chunks, %d images (CMYK placeholder test)", len(textData), len(imageData)) | ||
|
|
||
| if len(textData) == 0 { | ||
| t.Error("Expected at least some text content from photo-heavy PDF") | ||
| } |
There was a problem hiding this comment.
This is still not exercising CMYK handling.
photo_heavy.pdf is generated from RGB images in tests/fixtures/generate_test_pdfs.py, and this test only asserts that some text was extracted. A CMYK-specific decode regression can slip through here without failing CI. Either keep this skipped until a real CMYK fixture exists, or add a CMYK fixture and assert on image extraction for that path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/pdf_integration_test.go` around lines 463 - 475, The test is not
validating CMYK handling because photo_heavy.pdf is RGB; update the test around
ExtractPDFContent to either (A) skip this CMYK-specific assertion until a real
CMYK fixture exists by calling t.Skip with a short reason, or (B) add a true
CMYK fixture (e.g., tests/fixtures/pdf_types/photo_heavy_cmyk.pdf produced from
generate_test_pdfs.py or a checked-in sample) and change fixturePath to that
CMYK file and assert on image extraction (check len(imageData) > 0 and no error)
to ensure CMYK decoding is exercised; reference the ExtractPDFContent call and
fixturePath/photo_heavy.pdf when making the change.
|
tests should be in Golang if possible |
Summary
Addresses #8 — adds test fixtures and enables previously-skipped PDF extraction tests.
Changes
Test Results
All 3 test suites pass (7.2s total):
Notes
Closes #8
Summary by CodeRabbit
Release Notes