feat: fix NTILE distribution logic#22051
Conversation
| } else { | ||
| // base > 0 here: i >= large_rows is only reachable when remainder < n, | ||
| // which forces base >= 1 (otherwise large_rows would equal num_rows). | ||
| remainder + (i - large_rows) / base + 1 |
There was a problem hiding this comment.
Thank you ! perhaps a good idea to add tests for remainder = 0 ? Like say 10 rows distributed by 5
| let num_rows = num_rows as u64; | ||
| let mut vec: Vec<u64> = Vec::new(); | ||
| let n = u64::min(self.n, num_rows); | ||
| let n = self.n; |
There was a problem hiding this comment.
Should we also reject NTILE(0) directly ?
mbutrovich
left a comment
There was a problem hiding this comment.
Minor issue exposed by the new code. Thanks @comphead!
There was a problem hiding this comment.
I think we need a guard for the unsigned path to match the signed path below for n == 0. The new code will panic if n is 0 in the unsigned case.
|
Good thinking, @coderfender :) |
|
Thanks @mbutrovich @coderfender non valid inputs, like 0, negative and NULL are covered in NTILE create_udwf_window_expr which DF and also Comet calling.
But I'll add tests to assert it UPD: double checked, those tests already in place and passing. |
So you're saying if n <= 0 {
return exec_err!("NTILE requires a positive integer");
}is dead code then? We should remove it if it's not possible and replace with a |
The evaluator can be simplified, however it is the existing code, thanks for helping to make it cleaner. Addressed |
mbutrovich
left a comment
There was a problem hiding this comment.
Approved pending CI, thanks @comphead!
Which issue does this PR close?
Rationale for this change
Root cause — datafusion/functions-window/src/ntile.rs:170-176 used the linear-interpolation formula i * n / num_rows to assign bucket numbers. That formula spreads the larger buckets evenly through the partition, but the SQL standard requires the first num_rows mod n buckets to be the larger ones. For
NTILE(4) over 10 rows it produced bucket sizes 3,2,3,2 instead of 3,3,2,2
Fix — replaced the formula with the front-loaded distribution: compute base = num_rows / n and remainder = num_rows % n, treat the first remainder * (base+1) rows as the "large bucket" region (each bucket of size base+1), and the remainder as size-base buckets. The n > num_rows case (existing test
NTILE(9223377) over 3 rows) still works because base = 0 forces every row into the "large" arm where each bucket holds one row.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?