fix: handle UnicodeEncodeError on Windows terminals (closes #1802) #1856

Open
unn-Known1 wants to merge 1 commit into microsoft:main from unn-Known1:fix/unicode-encoding-error-1802

Conversation

@unn-Known1

Summary

Resolves issue #1802 by handling UnicodeEncodeError that occurs on Windows when the terminal's charmap codec (cp1252) cannot encode certain Unicode characters.

Changes

Modified __main__.py to wrap print() calls in try/except blocks to provide a graceful fallback using errors='replace' when the default encoding fails.
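
The fallback described above could look roughly like the following. This is a minimal sketch rather than the code in this PR: `safe_print` is a hypothetical helper name, and the actual changes live in `_handle_output()` and `_exit_with_error()`. It assumes the stream exposes an `encoding` attribute, as `sys.stdout` normally does.

```python
import io
import sys


def safe_print(text: str, stream=None) -> None:
    """Print text, replacing characters the stream's encoding cannot represent.

    Hypothetical helper for illustration; not part of markitdown.
    """
    stream = stream if stream is not None else sys.stdout
    try:
        print(text, file=stream)
    except UnicodeEncodeError:
        # Fall back to the stream's own encoding with errors='replace',
        # so unencodable characters are emitted as '?' instead of crashing.
        encoding = getattr(stream, "encoding", None) or "utf-8"
        fallback = text.encode(encoding, errors="replace").decode(encoding)
        print(fallback, file=stream)


def demo() -> bytes:
    """Simulate a cp1252 terminal with an in-memory stream."""
    raw = io.BytesIO()
    terminal = io.TextIOWrapper(raw, encoding="cp1252", errors="strict", newline="\n")
    safe_print("Hello 世界 🌍 αβγδ", stream=terminal)
    terminal.flush()
    return raw.getvalue()
```

With a strict cp1252 stream, a plain `print()` of the same string would raise `UnicodeEncodeError`; the sketch trades the unencodable characters for `?` so the rest of the output still reaches the terminal.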

Files Changed

  • packages/markitdown/src/markitdown/__main__.py

    • _handle_output(): Added UnicodeEncodeError handling for stdout output
    • _exit_with_error(): Added UnicodeEncodeError handling for error messages
  • packages/markitdown/tests/test_module_misc.py

    • Added test_unicode_encoding_in_cli test to verify the fix works with various Unicode characters

Test Coverage

The new test verifies Unicode handling with:

  • Chinese characters (世界, 你好)
  • Emojis (🌍)
  • Greek letters (αβγδ)

All existing tests continue to pass.
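
To see why these character groups trigger the failure, the check below encodes one sample from each group with cp1252. It is an illustrative sketch, not the `test_unicode_encoding_in_cli` test from this PR; the `SAMPLES` dictionary and the `encodable` helper are hypothetical names.

```python
# Sample strings taken from the character groups listed above.
SAMPLES = {
    "chinese": "你好 世界",
    "emoji": "🌍",
    "greek": "αβγδ",
}


def encodable(text: str, encoding: str = "cp1252") -> bool:
    """Return True if every character in text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# None of the samples can be represented in cp1252 under strict error handling.
strict_results = {name: encodable(text) for name, text in SAMPLES.items()}

# With errors='replace', encoding succeeds and each unencodable character becomes '?'.
replaced = {
    name: text.encode("cp1252", errors="replace").decode("cp1252")
    for name, text in SAMPLES.items()
}
```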

Resolves: microsoft#1802

Modified __main__.py to handle UnicodeEncodeError that occurs on Windows
when the terminal's charmap codec (cp1252) cannot encode certain Unicode
characters. The fix wraps print() calls in try/except blocks to provide
a graceful fallback using errors='replace' when the default encoding
fails.

Changes:
- _handle_output(): Added UnicodeEncodeError handling for stdout output
- _exit_with_error(): Added UnicodeEncodeError handling for error messages

Added test_unicode_encoding_in_cli test to verify the fix works with
various Unicode characters including Chinese, Japanese, emojis, and
Greek letters.
@unn-Known1
Author

@microsoft-github-policy-service agree

@kyledaviesdev

Hey there! I'm still quite new to Python's cross-platform terminal handling. I noticed you chose to fix the UnicodeEncodeError by modifying the stdout stream directly. Could you clarify why that method is preferred? Keep up the good work!
