Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Dec 18, 2025

Fix multi-byte character decoding in Executable.waitForExitAsync

  • Understand the issue: multi-byte UTF-8 characters can be split across chunk boundaries
  • Identify solution: set stream encoding explicitly before attaching handlers
  • Update waitForExitAsync to set encoding on stdout/stderr streams
  • Update getProcessInfoByIdAsync to set encoding on stdout stream
  • Create test cases for multi-byte character handling
  • Run existing tests to ensure no regressions (238 tests passed)
  • Validate the fix manually (multi-byte characters decoded correctly)
  • Request code review and address feedback
  • Run security checks (no issues found)
  • Add patch changefile for node-core-library

Summary

This PR fixes an issue where Executable.waitForExitAsync and getProcessInfoByIdAsync incorrectly decoded multi-byte UTF-8 characters when they were split across chunk boundaries from stdout and stderr streams.

Root Cause

Calling chunk.toString(encoding) on individual buffer chunks produces invalid characters when multi-byte UTF-8 sequences are split across chunk boundaries.

Solution

Set the stream encoding explicitly using stream.setEncoding(encoding) before attaching data handlers. When encoding is set on a stream, Node.js automatically uses a StringDecoder internally which maintains state across chunks and properly handles character boundaries.

Changes Made

  1. Updated waitForExitAsync: Set encoding on stdout/stderr streams before attaching handlers (when encoding is not 'buffer')
  2. Updated getProcessInfoByIdAsync: Set encoding to 'utf8' on stdout stream
  3. Added test case: Comprehensive test that writes multi-byte characters (Chinese + emoji) byte-by-byte to verify proper decoding
  4. Added changefile: Patch changefile for @rushstack/node-core-library

Testing

  • All 238 existing tests pass
  • New test specifically validates multi-byte character handling
  • Manual verification confirms proper decoding
  • Security scan: No vulnerabilities detected
Original prompt

This section details on the original issue you should resolve

<issue_title>[node-core-library] Executable: Incorrect decoding of multi-byte characters depending on buffering</issue_title>
<issue_description>## Summary
Executable.waitForExitAsync currently assumes that the raw chunks received from the stdout and stderr stream will always end at a character boundary, which is not required to be true when writing multi-byte characters.

Repro steps

inner.js

const { setTimeout } = require('node:timers/promises');

const unicodeString = "Hello, 世界"; // "Hello, World" in Chinese
const encoded = Buffer.from(unicodeString, 'utf8');
async function writeChars() {
    for (let i = 0; i < encoded.length; i++) {
        process.stdout.write(encoded.subarray(i, i + 1));
        await setTimeout(0); // wait for a single async tick
    }
}
writeChars();

outer.js

const child_process = require('child_process');

const decoder = new TextDecoder('utf-8');

let decodedOutput = '';
let byChunkOutput = [];
// Execute ./inner.js using Node.js and log the raw binary content of each chunk of stdout
const child = child_process.spawn('node', ['./inner.js'], {
    stdio: ['ignore', 'pipe', 'inherit']
});
child.stdout.on('data', (chunk) => {
    console.log('Received chunk:', chunk);
    byChunkOutput.push(chunk.toString());
    decodedOutput += decoder.decode(chunk, { stream: true });
});
child.on('close', (code) => {
    console.log(`Child process exited with code ${code}`);
    decodedOutput += decoder.decode(); // flush the decoder
    console.log('Decoded output:', decodedOutput);
    console.log('By-chunk output:', byChunkOutput.join(''));
});

This prints:

Received chunk: <Buffer 48>
Received chunk: <Buffer 65>
Received chunk: <Buffer 6c>
Received chunk: <Buffer 6c>
Received chunk: <Buffer 6f>
Received chunk: <Buffer 2c>
Received chunk: <Buffer 20>
Received chunk: <Buffer e4>
Received chunk: <Buffer b8>
Received chunk: <Buffer 96>
Received chunk: <Buffer e7>
Received chunk: <Buffer 95>
Received chunk: <Buffer 8c>
Child process exited with code 0
Decoded output: Hello, 世界
By-chunk output: Hello, ������

Details

Executable needs to either concatenate, then decode, or use one of the multi-byte-aware stateful decoders (TextDecoder or the legacy string_decoder).

Standard questions

Please answer these questions to help us investigate your issue more quickly:

Question Answer
Package name: @rushstack/node-core-library
Package version? 5.19.1
Operating system? Linux
Would you consider contributing a PR? Yes
Node.js version (node -v)? 20.13.1
</issue_description>

<agent_instructions>Please update Executable.waitForExitAsync (and any other Executable APIs that merge input from multiple chunks of unknown encoding) to explicitly set the stream encoding for stdout and stderr before attaching handlers.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@dmichon-msft Only an issue if we don't explicitly set the stream encoding to `utf-8`.

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI changed the title [WIP] Fix incorrect decoding of multi-byte characters in executable Fix multi-byte UTF-8 decoding in Executable stream handling Dec 18, 2025
Copilot AI requested a review from dmichon-msft December 18, 2025 01:09
Copy link
Contributor

@dmichon-msft dmichon-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot This change needs a patch changefile for node-core-library.

@dmichon-msft dmichon-msft marked this pull request as ready for review January 2, 2026 23:15
@iclanton
Copy link
Member

iclanton commented Jan 7, 2026

@copilot - Generate a changefile documenting this change. It should be a patch bump.

@iclanton iclanton moved this from Needs triage to In Progress in Bug Triage Jan 7, 2026
Copy link
Contributor Author

Copilot AI commented Jan 7, 2026

@copilot - Generate a changefile documenting this change. It should be a patch bump.

Generated patch changefile documenting the multi-byte character fix in commit b3e7021.

@copilot This change needs a patch changefile for node-core-library.

Added patch changefile in commit b3e7021.

for (let i = 0; i < encoded.length; i++) {
process.stdout.write(encoded.subarray(i, i + 1));
// Small delay to ensure each byte is in a separate chunk
await setTimeout(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to remove this timeout by reducing the chunk size to just one byte?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[node-core-library] Executable: Incorrect decoding of multi-byte characters depending on buffering

4 participants