Word PDF Lossless Export #3585

Merged
m417z merged 15 commits into ramensoftware:main from JoeYe-233:main
Mar 20, 2026

@JoeYe-233
Contributor

This pull request introduces a new Windhawk mod for Microsoft Word, aiming to solve the long-standing issue of image quality loss when exporting documents to PDF. The mod hooks into Word's internal graphics pipeline to prevent image downsampling and JPEG compression, ensuring lossless export of images. It includes robust dynamic symbol scanning and adapts to both 32-bit and 64-bit Office versions.

Key enhancements for PDF export quality:

Image Quality Improvements

  • Hooks DOCEXIMAGE::HrComputeSize to prevent downsampling by forcing the target image size to match the original and clearing the resample flag, ensuring pixel-perfect exports.
  • Hooks DOCEXIMAGE::HrCheckForLosslessOutput to intercept attempts to use JPEG compression and force Word to use lossless FLATE (Zlib) compression instead, bypassing hidden internal optimization.

Robustness and Compatibility

  • Implements dynamic symbol scanning using DbgHelp as a fallback when the official Windhawk API fails, increasing reliability across Office versions and architectures.
  • Supports both 32-bit and 64-bit Office by adapting memory offsets and calling conventions, ensuring broad compatibility.

Documentation and User Guidance

  • Provides comprehensive in-mod documentation and test results, including before/after image comparisons and guidance for troubleshooting and verifying lossless exports.

@JoeYe-233
Contributor Author

Hi, thanks for the review and the constructive feedback! I completely agree with standardizing the modding approach, but I'd like to clarify why the official WindhawkUtils::SYMBOL_HOOK API specifically fails for this target, which forced me into the fallback approach.

1. Why the official API fails on mso.dll:
The issue isn't with Windhawk's symbol loading itself, but rather how the DIA SDK (which HookSymbols relies on under the hood for exact string matching) interacts with Microsoft Office's stripped Public PDBs.

Microsoft heavily strips private type information from Office PDBs. Additionally, mso.dll extensively relies on ordinal-only exports to hide its C++ class implementations.

Because types like class Gdiplus::PointF const * are missing proper type definitions in the stripped PDB, the DIA SDK often truncates the undecorated name, replaces types with placeholders, or formats the mangled name in a way that makes exact string matching (via SymFromName or Windhawk's exact symbol hook) practically impossible.

I tested dozens of exact mangled/undecorated string variations with WindhawkUtils::SYMBOL_HOOK and all of them returned FALSE. (For detailed test results, please check [The Test Mod] section below.) Using SymEnumSymbols with strstr (fuzzy matching) is currently the only robust way to locate HrComputeSize and HrCheckForLosslessOutput in mso.dll because it bypasses DIA's strict string-matching requirements and successfully catches the partial mangled names left in the PDB.
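
The difference between the two lookup strategies described above can be sketched without DbgHelp. This is a minimal, self-contained illustration, not the mod's actual code: the Symbol struct and the findExact, findBySubstrings, and sampleSymbols helpers are hypothetical names, and the hard-coded list only stands in for the names a SymEnumSymbols callback would receive. The two decorated names and their offsets are the ones reported later in this thread.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for one entry reported by a symbol enumerator.
struct Symbol {
    std::string name;   // name as stored in the PDB (decorated here)
    std::uint32_t rva;  // offset of the function inside the module
};

// Exact lookup: succeeds only if the whole string is identical.
std::optional<std::uint32_t> findExact(const std::vector<Symbol>& symbols,
                                       std::string_view wanted) {
    for (const auto& s : symbols) {
        if (std::string_view(s.name) == wanted) return s.rva;
    }
    return std::nullopt;
}

// Fuzzy lookup: succeeds if every fragment occurs somewhere in the name,
// which is the idea behind scanning enumerated names with strstr.
std::optional<std::uint32_t> findBySubstrings(
    const std::vector<Symbol>& symbols,
    const std::vector<std::string_view>& fragments) {
    for (const auto& s : symbols) {
        bool allFound = true;
        for (std::string_view fragment : fragments) {
            if (std::string_view(s.name).find(fragment) == std::string_view::npos) {
                allFound = false;
                break;
            }
        }
        if (allFound) return s.rva;
    }
    return std::nullopt;
}

// Sample data: the two decorated names and offsets quoted in this thread.
std::vector<Symbol> sampleSymbols() {
    return {
        {"?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z", 0x00357576},
        {"?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z", 0x0035AAA6},
    };
}
```

With this data, findExact(sampleSymbols(), "DOCEXIMAGE::HrComputeSize") finds nothing, while findBySubstrings(sampleSymbols(), {"HrComputeSize", "DOCEXIMAGE"}) returns the HrComputeSize offset. The trade-off is that substring matching can also hit unintended symbols if the fragments are not specific enough.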

2. Regarding the hardcoded Windhawk internal path:
Hardcoding %ProgramData%\Windhawk\Engine\Symbols was a desperate workaround to feed the downloaded PDB into my manual DbgHelp fallback. Since relying on internal paths is brittle, is there a recommended way or a Windhawk API to get the cached PDB path for a module if I must use a custom DbgHelp scanner? If not, I can refactor it to rely on the standard Windows symbol server path (srv*...) for the fallback.

3. Hooking LoadLibraryExW:
This is a great suggestion. I will remove the Sleep() polling in the scout thread and implement a LoadLibraryExW hook to cleanly catch the loading of mso.dll.

I'd love to use the official HookSymbols API if possible, but mso.dll's symbol obfuscation makes it extremely stubborn. Let me know how you'd prefer me to handle the symbol resolution given these Office-specific PDB limitations, and I'll update the mod accordingly!


[The Test Mod]
To definitively answer how the official API fails, I wrote a dedicated test mod solely to evaluate WindhawkUtils::HookSymbols against mso.dll. Instead of relying on a single string format, this test feeds the API a massive "shotgun" array of every conceivable string variation for the target functions.

As you will see in the code below, the array includes the raw native mangled names, meticulously formatted undecorated names (with various spacing, struct/class, and modifier differences), bare function names, wildcard strings, and even the flawed signatures generated by IDA Pro. The goal was simple: if the symbol is resolvable by the official API in any format, this exhaustive list would catch at least one.

// ==WindhawkMod==
// @id            word-pdf-lossless-api-test-ultimate
// @name          Word PDF Lossless (API Ultimate Test)
// @description   The final absolute shotgun test for Windhawk SYMBOL_HOOK API.
// @version       1.0
// @author        Joe Ye
// @include       winword.exe
// @compilerOptions -lversion
// ==/WindhawkMod==

#include <windhawk_utils.h>
#include <windows.h>
#include <atomic>

// =============================================================
// Basic structure definitions
// =============================================================
namespace Gdiplus {
    struct PointF {
        float X;
        float Y;
    };
}

std::atomic<bool> g_bMsoHooked{false};

// =============================================================
// Hook proxy function definitions
// =============================================================
typedef HRESULT (__thiscall *HrComputeSize_t)(void* pThis, float* p1, const Gdiplus::PointF* p2);
HrComputeSize_t pOrig_HrComputeSize = nullptr;

HRESULT __fastcall Hook_HrComputeSize(void* pThis, void* edx_dummy, float* p1, const Gdiplus::PointF* p2) {
    Wh_Log(L"[Hook] DOCEXIMAGE::HrComputeSize intercepted successfully!");
    return pOrig_HrComputeSize(pThis, p1, p2); 
}

typedef HRESULT (__thiscall *HrCheckForLosslessOutput_t)(void* pThis, int arg1);
HrCheckForLosslessOutput_t pOrig_HrCheckForLosslessOutput = nullptr;

HRESULT __fastcall Hook_HrCheckForLosslessOutput(void* pThis, void* edx_dummy, int arg1) {
    Wh_Log(L"[Hook] DOCEXIMAGE::HrCheckForLosslessOutput intercepted successfully!");
    return pOrig_HrCheckForLosslessOutput(pThis, arg1);
}

// =============================================================
// Core logic: Windhawk official API ultimate test
// =============================================================
void ScanAndHookMso() {
    HMODULE hMso = GetModuleHandleW(L"mso.dll");
    if (!hMso || g_bMsoHooked.exchange(true)) return;

    Wh_Log(L"[Test Ultimate] Starting official API ultimate symbol matching test...");

    WindhawkUtils::SYMBOL_HOOK officialHook[] = {
        {
            {
                // 1. Bare function name (most extreme test)
                L"HrComputeSize",
                L"DOCEXIMAGE::HrComputeSize",
                
                // 2. With wildcards (testing if Windhawk supports wildcard expansion)
                L"*HrComputeSize*",
                L"*DOCEXIMAGE::HrComputeSize*",
                
                // 3. Native Mangled Name
                L"?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z",
                
                // 4. Standard exact undecorated series (various minor space and modifier differences)
                L"private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)",
                L"private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,struct Gdiplus::PointF const *)",
                L"long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)",
                L"long __thiscall DOCEXIMAGE::HrComputeSize(float *,struct Gdiplus::PointF const *)",
                L"DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)",
                L"DOCEXIMAGE::HrComputeSize(float *,struct Gdiplus::PointF const *)",
                L"DOCEXIMAGE::HrComputeSize(float*,class Gdiplus::PointF const*)", // No spaces version
                
                // 5. IDA hallucination version
                L"public: virtual long __cdecl DOCEXIMAGE::HrComputeSize(float *,struct Gdiplus::PointF const *)",
                L"long __cdecl DOCEXIMAGE::HrComputeSize(float *,struct Gdiplus::PointF const *)",

                L"private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)"
            },
            (void**)&pOrig_HrComputeSize,
            (void*)Hook_HrComputeSize,
            false // optional = false: the symbol is required, so HookSymbols fails if it cannot be resolved
        },
        {
            {
                // 1. Bare function name
                L"HrCheckForLosslessOutput",
                L"DOCEXIMAGE::HrCheckForLosslessOutput",
                
                // 2. With wildcards
                L"*HrCheckForLosslessOutput*",
                L"*DOCEXIMAGE::HrCheckForLosslessOutput*",
                
                // 3. Native Mangled Name
                L"?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z",
                
                // 4. Standard exact undecorated series
                L"protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                L"virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                L"long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                L"protected: long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                L"DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                
                // 5. IDA hallucination version
                L"int __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)",
                L"int DOCEXIMAGE::HrCheckForLosslessOutput(int)",

                L"protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const"
            },
            (void**)&pOrig_HrCheckForLosslessOutput,
            (void*)Hook_HrCheckForLosslessOutput,
            false
        }
    };

    bool bResult = WindhawkUtils::HookSymbols(hMso, officialHook, ARRAYSIZE(officialHook));
    
    if (bResult) {
        Wh_Log(L"[Test Ultimate] Official API returned TRUE! Shotgun coverage successful!");
    } else {
        Wh_Log(L"[Test Ultimate] Official API still returned FALSE! Proves that no matter how the name is written, it doesn't work.");
    }

    if (pOrig_HrComputeSize) {
        Wh_Log(L"[Test Ultimate] -> DOCEXIMAGE::HrComputeSize resolved successfully!");
    } else {
        Wh_Log(L"[Test Ultimate] -> DOCEXIMAGE::HrComputeSize resolution completely failed...");
    }

    if (pOrig_HrCheckForLosslessOutput) {
        Wh_Log(L"[Test Ultimate] -> DOCEXIMAGE::HrCheckForLosslessOutput resolved successfully!");
    } else {
        Wh_Log(L"[Test Ultimate] -> DOCEXIMAGE::HrCheckForLosslessOutput resolution completely failed...");
    }
}

DWORD WINAPI ScoutThread(LPVOID lpParam) {
    HMODULE hMso = nullptr;
    while (!hMso) {
        hMso = GetModuleHandleW(L"mso.dll");
        if (!hMso) Sleep(500);
    }
    Sleep(500); // Allow time for initialization
    ScanAndHookMso();
    return 0;
}

BOOL Wh_ModInit() {
    Wh_Log(L"Word PDF Lossless API Ultimate Test Loaded");
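    // The handle is not needed after the thread starts; close it so it does not leak.
    HANDLE hThread = CreateThread(nullptr, 0, ScoutThread, nullptr, 0, nullptr);
    if (hThread) CloseHandle(hThread);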
    return TRUE;
}

void Wh_ModUninit() {}

[Explanation of the Results]
Here is the execution log from running the test mod:

10:23:56.481 22376 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test-ultimate] [153:Wh_ModInit]: Word PDF Lossless API Ultimate Test Loaded
10:23:57.496 22376 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test-ultimate] [53:ScanAndHookMso]: [Test Ultimate] Starting official API ultimate symbol matching test...
10:23:59.357 22376 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test-ultimate] [125:ScanAndHookMso]: [Test Ultimate] Official API still returned FALSE! Proves that no matter how the name is written, it doesn't work.
10:23:59.357 22376 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test-ultimate] [131:ScanAndHookMso]: [Test Ultimate] -> DOCEXIMAGE::HrComputeSize resolution completely failed...
10:23:59.358 22376 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test-ultimate] [137:ScanAndHookMso]: [Test Ultimate] -> DOCEXIMAGE::HrCheckForLosslessOutput resolution completely failed...

As you can see, despite testing every possible naming permutation, HookSymbols universally returns FALSE.

Notice the ~1.9-second processing time between starting the test (57.496) and the failure log (59.357). This indicates that Windhawk's engine is successfully downloading and parsing the massive mso.dll PDB. However, the underlying parser simply cannot reconcile our strings with what remains in the stripped Public PDB. Because Microsoft strips private type info and obfuscates these internal classes, the strict string matching required by the official API hits a brick wall.


[Final Proof: Dumping the Exact Strings from DbgHelp]

To completely rule out the possibility that my string array was simply missing the "correct" format, I wrote a routine to hook into SymEnumSymbols and exported the exact strings that the Windows DbgHelp API itself sees and undecorates for these targets in mso.dll.

Here is the exact output from the engine when it hits the targets:

10:13:02.128 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [203:Wh_ModInit]: Word PDF Lossless API Test Loaded
10:13:02.635 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [124:ScanAndHookMso]: [Main] Phase 1: Attempting to call the official Windhawk API (triggers automatic PDB download)...
10:13:04.665 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [153:ScanAndHookMso]: [Main] Official API still failed. Initiating DbgHelp brute-force takeover...
10:13:04.665 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [172:ScanAndHookMso]: [Fallback] Found local PDB: C:\ProgramData\Windhawk\Engine\Symbols\MSO.pdb\D0D6FC43D4DD4146A1E13E586626204C2
10:13:04.735 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [184:ScanAndHookMso]: [Fallback] Starting exhaustive symbol scan for mso.dll...
10:13:05.849 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [75:SymEnumCallback]: ==================================================
10:13:05.849 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [76:SymEnumCallback]: [Symbol Dump] Captured DOCEXIMAGE::HrCheckForLosslessOutput !
10:13:05.850 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [77:SymEnumCallback]: -> DbgHelp Raw Name: ?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z
10:13:05.850 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [78:SymEnumCallback]: -> DbgHelp Fully Undecorated Name: protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const
10:13:05.850 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [79:SymEnumCallback]: ==================================================
10:13:06.027 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [64:SymEnumCallback]: ==================================================
10:13:06.027 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [65:SymEnumCallback]: [Symbol Dump] Captured DOCEXIMAGE::HrComputeSize !
10:13:06.027 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [66:SymEnumCallback]: -> DbgHelp Raw Name: ?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z
10:13:06.028 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [67:SymEnumCallback]: -> DbgHelp Fully Undecorated Name: private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)
10:13:06.028 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [68:SymEnumCallback]: ==================================================
10:13:07.619 20584 WINWORD.EXE  [WH] [local@word-pdf-lossless-api-test] [188:ScanAndHookMso]: [Fallback] Brute-force takeover surgery complete!

As we can see, the undecorated string output by the symbol engine matches exactly what I provided in my previous test array (L"private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)", L"protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const"). Yet, the official HookSymbols API still universally returns FALSE.

This undeniably confirms that the underlying DIA parser used by Windhawk's official API fails to reconcile these specific obfuscated/stripped entries within the Office Public PDB, despite being fed the mathematically perfect string. This is why the fuzzy substring matching (strstr) fallback is currently the only technically viable way to locate these specific functions without breaking the mod. It's not a preference to "reinvent the wheel," but an absolute necessity to bypass the strict signature-matching limitations when dealing with Microsoft Office binaries.

I am completely on board with refactoring the code to use a LoadLibraryExW hook for better architecture, but the custom DbgHelp strstr approach is a hard requirement for Office binaries due to this DIA SDK limitation.

@m417z
Member

m417z commented Mar 19, 2026

I only skimmed over your message, I'll read it carefully later, but have you tried:

@JoeYe-233
Contributor Author

JoeYe-233 commented Mar 19, 2026

Output of Windhawk Symbol Helper that contains the target function names:

[0035AAA6] ?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z
[00357576] ?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z

These are exactly the same as the "// 3. Native Mangled Name" entries used in the test mod above.

BTW, why not embed the symbol helper in the sidebar of the mod editor? There is plenty of space. I think most people will not notice this helpful tool.


Wh_FindFirstSymbol seems to work this time. Thanks for the reminder. I'll consider using this function if it works reliably.

@m417z
Member

m417z commented Mar 19, 2026

> Output of Windhawk Symbol Helper that contains the target function names:
>
> [0035AAA6] ?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z
> [00357576] ?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z

These are the decorated symbols. What are the undecorated symbols? Can you upload the dll?

If only decorated symbols can be used, have you tried using WindhawkUtils::HookSymbols with noUndecoratedSymbols?

> BTW, why not embed the symbol helper in the sidebar of the mod editor? There is plenty of space. I think most people will not notice this helpful tool.

That simply wasn't a priority. It could be nice, maybe one day. For now, I'll add it to the API comments.

@JoeYe-233
Contributor Author

Sure, here are the decorated and undecorated symbols:

[0035AAA6] ?HrComputeSize@DOCEXIMAGE@@AAEJPAMPBVPointF@Gdiplus@@@Z
[0035AAA6] private: long __thiscall DOCEXIMAGE::HrComputeSize(float *,class Gdiplus::PointF const *)

[00357576] ?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z
[00357576] protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const 

As I said before, they are also exactly the same as the strings in the test mod.

Here is the dll you asked for (along with the pdb file Windhawk downloaded in case you ever need it): https://drive.google.com/file/d/1vp9_Oe9W09GUd8sDPlf8qOr2jluq2Gss/view?usp=drive_link


WindhawkUtils::HookSymbols with noUndecoratedSymbols does seem to work. Thanks again for the technical insight, patience and guidance—this is definitely the optimal way to handle such stubborn Office PDBs. I'll further verify it on 64-bit Office tomorrow and push the updated version if appropriate. I have to go to sleep now.

BTW, I'd like to say this method is really, really niche: essentially no mod in the whole windhawk-mods repo has ever used it, or even contains noUndecoratedSymbols in its C++ code. It's kind of insane that MS forced us to take such extreme measures.

@m417z
Member

m417z commented Mar 19, 2026

Could it be that the issue is simply that protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const ends with a space? Have you tried that?


if (GetModuleHandleW(L"mso.dll")) {
    // If already loaded, start thread directly
    CreateThread(nullptr, 0, DelayedHookThread, nullptr, 0, nullptr);
Member

The returned handle isn't closed. And why is the thread needed at all?

@JoeYe-233
Contributor Author

JoeYe-233 commented Mar 20, 2026

> Could it be that the issue is simply that protected: virtual long __thiscall DOCEXIMAGE::HrCheckForLosslessOutput(int)const ends with a space? Have you tried that?

Interesting. Yes, adding the trailing space solved the problem. Still, I'm leaning toward the noUndecoratedSymbols approach, because it's theoretically faster and it keeps working if MS removes the trailing space someday.
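
The trailing-space pitfall is easy to reproduce in isolation. The sketch below only illustrates exact string comparison and is not a description of how Windhawk matches symbols internally. The two name constants are copied from this thread (the undecorated one includes the trailing space after "const"), while trimRight, matchesExactly, and matchesIgnoringTrailingSpace are hypothetical helpers.

```cpp
#include <string_view>

// Undecorated name as listed earlier in the thread.
// Note the trailing space after "const".
inline constexpr std::string_view kUndecoratedFromPdb =
    "protected: virtual long __thiscall "
    "DOCEXIMAGE::HrCheckForLosslessOutput(int)const ";

// Decorated name for the same function. It contains no whitespace,
// so there is no formatting detail to get wrong.
inline constexpr std::string_view kDecorated =
    "?HrCheckForLosslessOutput@DOCEXIMAGE@@MBEJH@Z";

// Hypothetical helper: strip trailing spaces and tabs.
constexpr std::string_view trimRight(std::string_view s) {
    while (!s.empty() && (s.back() == ' ' || s.back() == '\t')) {
        s.remove_suffix(1);
    }
    return s;
}

// Strict comparison: every character, including trailing whitespace, counts.
constexpr bool matchesExactly(std::string_view candidate) {
    return candidate == kUndecoratedFromPdb;
}

// Tolerant comparison: ignores trailing whitespace on both sides.
constexpr bool matchesIgnoringTrailingSpace(std::string_view candidate) {
    return trimRight(candidate) == trimRight(kUndecoratedFromPdb);
}
```

A candidate ending in "(int)const" fails matchesExactly and passes matchesIgnoringTrailingSpace, and the same candidate with one extra space at the end passes both. Matching on the decorated name avoids the question entirely, which is one argument for the noUndecoratedSymbols option.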


Thanks for the review!

Good catch on the handle leak. I totally missed CloseHandle(). I will wrap the CreateThread calls to immediately close the returned handle so the thread object cleans up properly.

As for why the thread is needed: mso.dll is massive, and downloading and parsing its PDB via WindhawkUtils::HookSymbols can take anywhere from several seconds to tens of seconds. If I run ScanAndHookMso() synchronously inside Wh_ModInit or the LoadLibraryExW hook callback, it blocks the calling thread and causes Microsoft Word to freeze during startup while it waits for the symbol resolution to finish. Offloading the symbol hooking to a background thread prevents this UI freeze and allows Word to launch smoothly.

I'll push a commit to fix the unclosed handles. Thanks for pointing that out.

@m417z m417z merged commit de2754e into ramensoftware:main Mar 20, 2026
3 checks passed