[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract by HyukjinKwon · Pull Request #56723 · apache/spark

HyukjinKwon · 2026-06-24T05:23:15Z

What changes were proposed in this pull request?

UTF8String.getByte(int) is documented as "If byte index is invalid, returns 0", but the implementation performed an unchecked Platform.getByte(base, offset + byteIndex), returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract:

public byte getByte(int byteIndex) {
  if (byteIndex < 0 || byteIndex >= numBytes) {
    return 0;
  }
  return Platform.getByte(base, offset + byteIndex);
}

Why are the changes needed?

Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making UTF8StringSuite.testGetByte fail with expected: <0> but was: <47> in the Maven (Scala 2.13, JDK 25) scheduled build. The previous 0 was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTF8StringSuite (including testGetByte).

Before (failing on master, JDK 25): UTF8StringSuite.testGetByte — expected: <0> but was: <47>
https://github.com/apache/spark/actions/runs/28035606804/job/82996960386

After (this fix, JDK 25): UTF8StringSuite — Tests: succeeded 34, failed 0; testGetByte passes
https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477

This pull request and its description were written by Isaac.

…of-bounds contract ### What changes were proposed in this pull request? `UTF8String.getByte(int)` is documented as "If byte index is invalid, returns 0", but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`, returning adjacent/uninitialized memory for out-of-range indices. This adds the bounds check so the method matches its contract. ### Why are the changes needed? Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, failing `UTF8StringSuite.testGetByte` with 'expected 0 but got 47' in the Maven (Scala 2.13, JDK 25) scheduled build. The previous '0' was never guaranteed; the fix makes behavior deterministic across JDKs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing `UTF8StringSuite` (incl. `testGetByte`), verified green under JDK 25. Co-authored-by: Isaac

dongjoon-hyun

Could you add a deterministic unit test case because this seems to be a flaky test in CI? The lastest Java 25 run passed.

HyukjinKwon

Automated code review (self-review) — verdict: looks correct, no blocking findings. Posting as a comment, not an approval.

Design (Pass A): The change is a localized guard that makes the implementation honor the method's existing Javadoc ("If byte index is invalid, returns 0"). No layer/responsibility concerns; no peer analogue diverged from (getByte is the only UTF8String accessor carrying this out-of-bounds contract).

Correctness (Pass B):

Bounds check is correct — byteIndex >= numBytes treats numBytes as an exclusive upper bound (largest valid index is numBytes - 1), and byteIndex < 0 covers negatives. Valid indices [0, numBytes) are unchanged.
Matches every assertion in UTF8StringSuite.testGetByte (-1, length, length + 1) for both the valid and invalid-UTF-8 strings.
Prior behavior was undefined for out-of-range indices (Platform.getByte read adjacent memory); under JDK 25 that returned 47 instead of 0. The fix makes it deterministic.

Nit (non-blocking): getByte appears on some per-byte paths (e.g. numBytesForFirstByte); the added comparison is negligible, just noting it for awareness.

Verification:

Before (master, JDK 25, fail): https://github.com/apache/spark/actions/runs/28035606804/job/82996960386 — testGetByte expected 0 but got 47
After (this PR, JDK 25, pass): https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477 — UTF8StringSuite 34 passed

HyukjinKwon · 2026-06-24T10:40:54Z

Merged to master and branch-4.x.

…of-bounds contract ### What changes were proposed in this pull request? `UTF8String.getByte(int)` is documented as *"If byte index is invalid, returns 0"*, but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`, returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract: ```java public byte getByte(int byteIndex) { if (byteIndex < 0 || byteIndex >= numBytes) { return 0; } return Platform.getByte(base, offset + byteIndex); } ``` ### Why are the changes needed? Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making `UTF8StringSuite.testGetByte` fail with `expected: <0> but was: <47>` in the **Maven (Scala 2.13, JDK 25)** scheduled build. The previous `0` was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing `UTF8StringSuite` (including `testGetByte`). **Before (failing on master, JDK 25):** `UTF8StringSuite.testGetByte` — `expected: <0> but was: <47>` https://github.com/apache/spark/actions/runs/28035606804/job/82996960386 **After (this fix, JDK 25):** `UTF8StringSuite` — *Tests: succeeded 34, failed 0*; `testGetByte` passes https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477 This pull request and its description were written by Isaac. Closes #56723 from HyukjinKwon/ci-fix/utf8-getbyte-jdk25. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com> (cherry picked from commit faebce8) Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>

uros-b

Late LGTM, thank you @HyukjinKwon and @dongjoon-hyun!

dongjoon-hyun reviewed Jun 24, 2026

View reviewed changes

dongjoon-hyun approved these changes Jun 24, 2026

View reviewed changes

HyukjinKwon commented Jun 24, 2026

View reviewed changes

HyukjinKwon closed this in faebce8 Jun 24, 2026

uros-b reviewed Jun 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723

[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723
HyukjinKwon wants to merge 1 commit into
apache:masterfrom
HyukjinKwon:ci-fix/utf8-getbyte-jdk25

HyukjinKwon commented Jun 24, 2026

Uh oh!

dongjoon-hyun left a comment •

edited

Loading

Uh oh!

HyukjinKwon left a comment

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

uros-b left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

HyukjinKwon commented Jun 24, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

dongjoon-hyun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon left a comment

Choose a reason for hiding this comment

Uh oh!

HyukjinKwon commented Jun 24, 2026

Uh oh!

uros-b left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dongjoon-hyun left a comment •

edited

Loading