[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723
[SPARK-57653][CORE] Make UTF8String.getByte honor its documented out-of-bounds contract#56723HyukjinKwon wants to merge 1 commit into
Conversation
…of-bounds contract ### What changes were proposed in this pull request? `UTF8String.getByte(int)` is documented as "If byte index is invalid, returns 0", but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`, returning adjacent/uninitialized memory for out-of-range indices. This adds the bounds check so the method matches its contract. ### Why are the changes needed? Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, failing `UTF8StringSuite.testGetByte` with 'expected 0 but got 47' in the Maven (Scala 2.13, JDK 25) scheduled build. The previous '0' was never guaranteed; the fix makes behavior deterministic across JDKs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing `UTF8StringSuite` (incl. `testGetByte`), verified green under JDK 25. Co-authored-by: Isaac
HyukjinKwon
left a comment
There was a problem hiding this comment.
Automated code review (self-review) — verdict: looks correct, no blocking findings. Posting as a comment, not an approval.
Design (Pass A): The change is a localized guard that makes the implementation honor the method's existing Javadoc ("If byte index is invalid, returns 0"). No layer/responsibility concerns; no peer analogue diverged from (getByte is the only UTF8String accessor carrying this out-of-bounds contract).
Correctness (Pass B):
- Bounds check is correct —
byteIndex >= numBytestreatsnumBytesas an exclusive upper bound (largest valid index isnumBytes - 1), andbyteIndex < 0covers negatives. Valid indices[0, numBytes)are unchanged. - Matches every assertion in
UTF8StringSuite.testGetByte(-1,length,length + 1) for both the valid and invalid-UTF-8 strings. - Prior behavior was undefined for out-of-range indices (
Platform.getByteread adjacent memory); under JDK 25 that returned47instead of0. The fix makes it deterministic.
Nit (non-blocking): getByte appears on some per-byte paths (e.g. numBytesForFirstByte); the added comparison is negligible, just noting it for awareness.
Verification:
- Before (master, JDK 25, fail): https://github.com/apache/spark/actions/runs/28035606804/job/82996960386 —
testGetByteexpected 0 but got 47 - After (this PR, JDK 25, pass): https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477 —
UTF8StringSuite34 passed
|
Merged to master and branch-4.x. |
…of-bounds contract
### What changes were proposed in this pull request?
`UTF8String.getByte(int)` is documented as *"If byte index is invalid, returns 0"*, but the implementation performed an unchecked `Platform.getByte(base, offset + byteIndex)`, returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract:
```java
public byte getByte(int byteIndex) {
if (byteIndex < 0 || byteIndex >= numBytes) {
return 0;
}
return Platform.getByte(base, offset + byteIndex);
}
```
### Why are the changes needed?
Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making `UTF8StringSuite.testGetByte` fail with `expected: <0> but was: <47>` in the **Maven (Scala 2.13, JDK 25)** scheduled build. The previous `0` was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing `UTF8StringSuite` (including `testGetByte`).
**Before (failing on master, JDK 25):** `UTF8StringSuite.testGetByte` — `expected: <0> but was: <47>`
https://github.com/apache/spark/actions/runs/28035606804/job/82996960386
**After (this fix, JDK 25):** `UTF8StringSuite` — *Tests: succeeded 34, failed 0*; `testGetByte` passes
https://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477
This pull request and its description were written by Isaac.
Closes #56723 from HyukjinKwon/ci-fix/utf8-getbyte-jdk25.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
(cherry picked from commit faebce8)
Signed-off-by: Hyukjin Kwon <hyukjin.kwon@databricks.com>
uros-b
left a comment
There was a problem hiding this comment.
Late LGTM, thank you @HyukjinKwon and @dongjoon-hyun!

What changes were proposed in this pull request?
UTF8String.getByte(int)is documented as "If byte index is invalid, returns 0", but the implementation performed an uncheckedPlatform.getByte(base, offset + byteIndex), returning adjacent/uninitialized memory for out-of-range indices. This PR adds the bounds check so the method honors its documented contract:Why are the changes needed?
Under JDK 25 the out-of-bounds read returned non-zero adjacent memory, making
UTF8StringSuite.testGetBytefail withexpected: <0> but was: <47>in the Maven (Scala 2.13, JDK 25) scheduled build. The previous0was never guaranteed — it depended on memory layout. The fix makes the behavior deterministic across JDKs.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing
UTF8StringSuite(includingtestGetByte).Before (failing on master, JDK 25):
UTF8StringSuite.testGetByte—expected: <0> but was: <47>https://github.com/apache/spark/actions/runs/28035606804/job/82996960386
After (this fix, JDK 25):
UTF8StringSuite— Tests: succeeded 34, failed 0;testGetBytepasseshttps://github.com/HyukjinKwon/spark/actions/runs/28065895861/job/83089981477
This pull request and its description were written by Isaac.