Skip to content

[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723

Closed
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/utf8-getbyte-jdk25
Closed

[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/utf8-getbyte-jdk25

Conversation

@HyukjinKwon

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

UTF8String.getByte(int) is documented as "If byte index is invalid, returns 0", but the implementation performed an unchecked Platform.getByte(base, offset + byteIndex), returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract:

public byte getByte(int byteIndex) {
  if (byteIndex < 0 || byteIndex >= numBytes) {
    return 0;
  }
  return Platform.getByte(base, offset + byteIndex);
}

Why are the changes needed?

Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making UTF8StringSuite.testGetByte fail with expected: <0> but was: <47> in the Maven (Scala 2.13, JDK 25) scheduled build. The previous 0 was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTF8StringSuite (including testGetByte).

Before (failing on master, JDK 25): UTF8StringSuite.testGetByteexpected: <0> but was: <47>
https://github.com/apache/spark/actions/runs/28035606804/job/82996960386

After (this fix, JDK 25): UTF8StringSuiteTests: succeeded 34, failed 0; testGetByte passes
https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477

This pull request and its description were written by Isaac.

…of-bounds contract

### What changes were proposed in this pull request?
`UTF8String.getByte(int)` is documented as "If byte index is invalid, returns 0",
but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`,
returning adjacent/uninitialized memory for out-of-range indices. This adds the bounds
check so the method matches its contract.

### Why are the changes needed?
Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, failing
`UTF8StringSuite.testGetByte` with 'expected 0 but got 47' in the Maven (Scala 2.13,
JDK 25) scheduled build. The previous '0' was never guaranteed; the fix makes behavior
deterministic across JDKs.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing `UTF8StringSuite` (incl. `testGetByte`), verified green under JDK 25.

Co-authored-by: Isaac

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a deterministic unit test case because this seems to be a flaky test in CI? The lastest Java 25 run passed.

Screenshot 2026-06-23 at 22 32 32

@HyukjinKwon HyukjinKwon left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated code review (self-review) — verdict: looks correct, no blocking findings. Posting as a comment, not an approval.

Design (Pass A): The change is a localized guard that makes the implementation honor the method's existing Javadoc ("If byte index is invalid, returns 0"). No layer/responsibility concerns; no peer analogue diverged from (getByte is the only UTF8String accessor carrying this out-of-bounds contract).

Correctness (Pass B):

  • Bounds check is correct — byteIndex >= numBytes treats numBytes as an exclusive upper bound (largest valid index is numBytes - 1), and byteIndex < 0 covers negatives. Valid indices [0, numBytes) are unchanged.
  • Matches every assertion in UTF8StringSuite.testGetByte (-1, length, length + 1) for both the valid and invalid-UTF-8 strings.
  • Prior behavior was undefined for out-of-range indices (Platform.getByte read adjacent memory); under JDK 25 that returned 47 instead of 0. The fix makes it deterministic.

Nit (non-blocking): getByte appears on some per-byte paths (e.g. numBytesForFirstByte); the added comparison is negligible, just noting it for awareness.

Verification:

@HyukjinKwon

Copy link
Copy Markdown
Member Author

Merged to master and branch-4.x.

HyukjinKwon added a commit that referenced this pull request Jun 24, 2026
…of-bounds contract

### What changes were proposed in this pull request?
`UTF8String.getByte(int)` is documented as *"If byte index is invalid, returns 0"*, but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`, returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract:

```java
public byte getByte(int byteIndex) {
  if (byteIndex < 0 || byteIndex >= numBytes) {
    return 0;
  }
  return Platform.getByte(base, offset + byteIndex);
}
```

### Why are the changes needed?
Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making `UTF8StringSuite.testGetByte` fail with `expected: <0> but was: <47>` in the **Maven (Scala 2.13, JDK 25)** scheduled build. The previous `0` was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing `UTF8StringSuite` (including `testGetByte`).

**Before (failing on master, JDK 25):** `UTF8StringSuite.testGetByte` — `expected: <0> but was: <47>`
https://github.com/apache/spark/actions/runs/28035606804/job/82996960386

**After (this fix, JDK 25):** `UTF8StringSuite` — *Tests: succeeded 34, failed 0*; `testGetByte` passes
https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477

This pull request and its description were written by Isaac.

Closes #56723 from HyukjinKwon/ci-fix/utf8-getbyte-jdk25.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit faebce8)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Late LGTM, thank you @HyukjinKwon and @dongjoon-hyun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants