fix(docx): repair .docx zips with case-mismatched local headers (#1812) #1861
Open
haosenwang1018 wants to merge 1 commit into
…ory in case only

Closes microsoft#1812

Some .docx producers (notably certain legal-document systems and some older Microsoft Word builds) emit zip files where the local file header name and the central-directory name differ in case only: for example, ``customXml/item2.xml`` in the central directory but ``customXML/item2.xml`` in the local file header. Most zip tools accept this. Python's ``zipfile`` strictly validates and raises ``BadZipFile`` mid-conversion, surfaced to the user as::

    DocxConverter threw BadZipFile with message: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.

Per APPNOTE the central directory is authoritative. Add ``_fix_zip_name_casing`` that scans local file headers, finds entries that match the central name when lower-cased and have the same byte length (always true for ASCII case-only mismatches), and rewrites the header bytes in-memory to match. The patch is byte-length preserving, so no offset recomputation is needed. Call this from the start of ``pre_process_docx`` so every code path that runs through it benefits.

If the central directory itself is unparseable we leave the stream alone so the caller surfaces the same error the unfixed code would have: no silent data loss.

Tests:

- ``test_setup_actually_reproduces_badzipfile`` is a guard that the test fixture really does trip ``BadZipFile``; if Python ever stops validating local-vs-central name parity, the regression below would pass for the wrong reason.
- ``test_fix_zip_name_casing_repairs_mismatched_local_header`` and ``test_pre_process_docx_accepts_case_mismatched_archive`` cover the fix directly.
- ``test_fix_zip_name_casing_passes_through_normal_archive`` is a regression guard against rewriting well-formed inputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Issue
Closes #1812
Root cause
Some .docx producers emit zip files where the local file header name and the central-directory name differ in case only (e.g. `customXml/item2.xml` centrally, `customXML/item2.xml` locally). Per APPNOTE the central directory is authoritative; most zip tools accept the mismatch, but Python's `zipfile` strictly validates and raises `BadZipFile` mid-conversion. The user sees:

    DocxConverter threw BadZipFile with message: File name in directory 'customXml/item2.xml' and header b'customXML/item2.xml' differ.

Fix
`_fix_zip_name_casing` scans local file headers, finds entries that match the central name when lower-cased and have the same byte length (always true for ASCII case-only mismatches), and rewrites the header bytes in-memory to match. Byte-length preserving → no offset recomputation needed. `pre_process_docx` calls it as the first step so every conversion path benefits.

If the central directory itself is unparseable we leave the input alone so the caller still sees the original error: no silent data loss.
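For readers who want to see the shape of the scan-and-patch step, here is a minimal sketch of the idea described above. It is not the PR's implementation: the function name `fix_name_casing_sketch` is hypothetical, it works on `bytes` rather than a stream, and it assumes the central directory can be read through `zipfile.ZipFile.infolist()` (which parses only the central directory and does not validate local headers until an entry is opened).

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"   # local file header signature
LOCAL_HEADER_SIZE = 30             # fixed-size part of a local file header
UTF8_FLAG = 0x800                  # general-purpose flag bit 11: name is UTF-8


def fix_name_casing_sketch(data: bytes) -> bytes:
    """Hypothetical sketch: patch local header names that differ from the
    central-directory name in ASCII case only. Returns the input unchanged
    if the central directory cannot be parsed."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            entries = [
                (info.header_offset, info.orig_filename, info.flag_bits)
                for info in zf.infolist()
            ]
    except zipfile.BadZipFile:
        # Out of scope: let the caller surface the original error.
        return data

    buf = bytearray(data)
    for offset, name, flags in entries:
        header = bytes(buf[offset:offset + LOCAL_HEADER_SIZE])
        if len(header) < LOCAL_HEADER_SIZE or header[:4] != LOCAL_HEADER_SIG:
            continue  # not a local header where we expected one; do not touch
        # The name length is a little-endian uint16 at byte 26 of the header.
        (name_len,) = struct.unpack_from("<H", header, 26)
        start = offset + LOCAL_HEADER_SIZE
        local = bytes(buf[start:start + name_len])
        central = name.encode("utf-8" if flags & UTF8_FLAG else "cp437")
        # Patch only true case-only differences of identical byte length,
        # so every offset in the archive stays valid.
        if (
            local != central
            and len(local) == len(central)
            and local.lower() == central.lower()
        ):
            buf[start:start + name_len] = central
    return bytes(buf)
```

Because `bytes.lower()` only folds ASCII letters, the predicate in this sketch cannot match names that differ in anything other than ASCII case, which mirrors the "not arbitrary same-length renames" constraint stated in this description.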
This is the same approach the issue body proposed; the implementation hardens the matching predicate (added `local.lower() == central.lower()` so we only patch true case-only differences, not arbitrary same-length renames) and treats `BadZipFile` on the central directory as out-of-scope.

Tests
- `test_setup_actually_reproduces_badzipfile`: guard that the test fixture really does trigger `BadZipFile` against unfixed `zipfile`. Without this, a future Python that loosens the validation would let the regression test below pass for the wrong reason.
- `test_fix_zip_name_casing_repairs_mismatched_local_header`: the helper rewrites a deliberately mismatched archive into one that opens cleanly.
- `test_pre_process_docx_accepts_case_mismatched_archive`: end-to-end: a mismatched .docx no longer crashes `pre_process_docx`, and the post-processed archive still contains both the body and the originally-mismatched entry.
- `test_fix_zip_name_casing_passes_through_normal_archive`: regression guard against unwanted rewrites of well-formed input.
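As an illustration of what the fixture-guard test checks, the sketch below builds a small archive, flips the case of one local header name, and confirms that reading it raises `BadZipFile`. The helper names `make_case_mismatched_zip` and `reproduces_badzipfile` are hypothetical and are not taken from the PR's test module; the fixture assumes the entry name does not also occur in the entry data, so the first occurrence in the byte stream is the local header.

```python
import io
import zipfile


def make_case_mismatched_zip(
    central_name: str = "customXml/item2.xml",
    local_name: str = "customXML/item2.xml",
) -> bytes:
    """Hypothetical fixture: a zip whose local header name differs from the
    central-directory name (pass identical names for a well-formed archive)."""
    assert len(central_name) == len(local_name), "patch must preserve length"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
        zf.writestr(central_name, "<item/>")
    data = buf.getvalue()
    # Local headers precede the central directory, so replacing only the
    # first occurrence rewrites the local header and leaves the central
    # directory entry untouched.
    return data.replace(central_name.encode("ascii"), local_name.encode("ascii"), 1)


def reproduces_badzipfile(data: bytes) -> bool:
    """Return True if reading any entry raises BadZipFile."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        try:
            for info in zf.infolist():
                zf.read(info)
        except zipfile.BadZipFile:
            return True
    return False
```

A guard along these lines would assert `reproduces_badzipfile(make_case_mismatched_zip())` before the repair is exercised, so the regression tests cannot pass merely because the fixture stopped being invalid.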