Skip to content

fix: Avoid overlay panic on valid Unicode input, Postgres compatibility#22046

Open
neilconway wants to merge 1 commit intoapache:mainfrom
neilconway:neilc/fix-overlay-unicode
Open

fix: Avoid overlay panic on valid Unicode input, Postgres compatibility#22046
neilconway wants to merge 1 commit intoapache:mainfrom
neilconway:neilc/fix-overlay-unicode

Conversation

@neilconway
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

overlay mixed up character-based and byte-based indexing, which resulted in either panicking or returning incorrect results on multi-byte Unicode input. While looking at this, I also fixed a few places where overlay's behavior was inconsistent with Postgres and the SQL spec, refactored the code, and added a benchmark.

What changes are included in this PR?

  • Fix panic or wrong results on multibyte Unicode input
  • Align edge-case behavior with Postgres: start past end appends replacement after the full input, start <= 0 errors
  • Refactored the code to reduce redundancy and clarify variable names
  • Add SLT tests for Unicode input and behavior changes for Postgres compatibility
  • Add benchmark, in advance of subsequent performance optimization work

Are these changes tested?

Yes, and new tests added.

Are there any user-facing changes?

Yes. The behavior of overlay has changed for multibyte Unicode inputs, as well as the edge cases enumerated above.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels May 6, 2026
Copy link
Copy Markdown
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neilconway
Nice improvement overall. I just had one small performance suggestion around the four-argument overlay path.

if start_pos < 1 {
return exec_err!("negative substring length not allowed");
}
let string_char_len = string.chars().count() as i64;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small optimization idea: the four-argument path calls string.chars().count() to clamp len, but overlay_one ends up computing the character length again internally. Since overlay_one already handles lengths that extend past the end of the string, it might be simpler to pass len through directly or move the clamp logic into overlay_one so we avoid the extra Unicode scan per row.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

overlay behavior inconsistent with Postgres / SQL spec overlay panics on valid utf-8 input

2 participants