[SPARK-56769][SQL] Add fast path for date_trunc WEEK/MONTH/QUARTER/YEAR #55736

Open
Licht-T wants to merge 4 commits into apache:master from Licht-T:date-trunc-fastpath-phase3

Licht-T commented May 7, 2026

What changes were proposed in this pull request?

This PR extends the offset-arithmetic + DST-equality-guard fast path introduced in SPARK-56663 from MINUTE / HOUR / DAY to the date-level units WEEK / MONTH / QUARTER / YEAR.

The framework for offset-based truncation -- resolve offset once, apply, truncate in the local frame, re-apply, DST guard, fall back on DST-cross or arithmetic overflow -- is identical for every level above SECOND. Only the "truncate in local frame" step varies. This PR inlines SPARK-56663's truncToUnitFast together with the new date-level path directly into truncTimestamp, and keeps a single private truncTimestampSlow as a complete reference implementation that the fast path falls back to:

def truncTimestamp(micros: Long, level: Int, zoneId: ZoneId): Long = {
  // MICROSECOND / MILLISECOND / SECOND short-circuits (no zone work).
  // Offset arithmetic for every other level.
  // DST guard, fallback to truncTimestampSlow.
  ???  // body elided in this description
}

// Reference implementation covering every level; body elided in this description.
private def truncTimestampSlow(micros: Long, level: Int, zoneId: ZoneId): Long = ???

The local-frame truncation step is the only thing the fast path branches on:

  • MICROSECOND / MILLISECOND / SECOND - pure UTC floorMod (java.time.ZoneOffset values are whole seconds, so truncating to SECOND or finer gives the same result in every zone and needs no zone information).
  • MINUTE / HOUR / DAY - shifted-local floorMod against the unit micros.
  • WEEK / MONTH / QUARTER / YEAR - compute local epoch-day by integer division, run truncDate in the local-day frame, multiply back to local micros.
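
The date-level bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `LocalFrameTruncSketch`, `truncLocalDay`, and the `truncDay` parameter are hypothetical names, and the input is assumed to be UTC micros that have already been shifted by the zone offset.

```scala
import java.time.LocalDate

// Hypothetical sketch of the "truncate in the local frame" step for date-level units.
object LocalFrameTruncSketch {
  val MicrosPerDay: Long = 86400L * 1000000L

  // localMicros: microseconds since the epoch, already shifted into the local frame.
  // truncDay: the date-level truncation (e.g. first day of month) applied to the local date.
  def truncLocalDay(localMicros: Long)(truncDay: LocalDate => LocalDate): Long = {
    // floorDiv (not /) so that pre-1970 instants round toward the earlier day.
    val localEpochDay = Math.floorDiv(localMicros, MicrosPerDay)
    val truncated = truncDay(LocalDate.ofEpochDay(localEpochDay))
    // Back to local micros at midnight of the truncated day; throws on overflow.
    Math.multiplyExact(truncated.toEpochDay, MicrosPerDay)
  }
}
```

For instance, `LocalFrameTruncSketch.truncLocalDay(localMicros)(_.withDayOfMonth(1))` would give local midnight on the first day of the month.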

Everything else (offset resolve via rules.getOffset, addExact / subtractExact, DST guard via offset-equality at the candidate, slow-path fallback) is shared.
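
A rough, self-contained sketch of how the shared steps fit together, written against plain java.time rather than Spark's DateTimeUtils. The names `DateTruncFastPathSketch`, `truncFast`, and `truncSlow` are hypothetical, and the slow route here is only an assumed stand-in for truncTimestampSlow.

```scala
import java.time.{Instant, LocalDate, ZoneId}
import java.time.temporal.ChronoUnit

// Hypothetical sketch: offset arithmetic + DST guard + fallback for date-level units.
object DateTruncFastPathSketch {
  private val MicrosPerSecond = 1000000L
  private val MicrosPerDay = 86400L * MicrosPerSecond

  private def toInstant(micros: Long): Instant = Instant.EPOCH.plus(micros, ChronoUnit.MICROS)
  private def toMicros(i: Instant): Long = ChronoUnit.MICROS.between(Instant.EPOCH, i)

  // Reference route: let java.time resolve local midnight in the target zone.
  def truncSlow(micros: Long, zoneId: ZoneId)(truncDay: LocalDate => LocalDate): Long = {
    val localDate = toInstant(micros).atZone(zoneId).toLocalDate
    toMicros(truncDay(localDate).atStartOfDay(zoneId).toInstant)
  }

  def truncFast(micros: Long, zoneId: ZoneId)(truncDay: LocalDate => LocalDate): Long = {
    val rules = zoneId.getRules
    try {
      // 1. Resolve the offset once, at the original instant.
      val offset = rules.getOffset(toInstant(micros)).getTotalSeconds * MicrosPerSecond
      // 2. Shift into the local frame and truncate there.
      val localMicros = Math.addExact(micros, offset)
      val day = truncDay(LocalDate.ofEpochDay(Math.floorDiv(localMicros, MicrosPerDay)))
      // 3. Shift back using the same offset to get a candidate instant.
      val candidate = Math.subtractExact(Math.multiplyExact(day.toEpochDay, MicrosPerDay), offset)
      // 4. DST guard: trust the candidate only if the offset there is unchanged.
      val candidateOffset = rules.getOffset(toInstant(candidate)).getTotalSeconds * MicrosPerSecond
      if (candidateOffset == offset) candidate else truncSlow(micros, zoneId)(truncDay)
    } catch {
      // Arithmetic overflow also falls back to the reference route.
      case _: ArithmeticException => truncSlow(micros, zoneId)(truncDay)
    }
  }
}
```

Passing the truncation as a function keeps the sketch short; the PR itself branches on the integer `level` instead.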

The DST guard fires correctly for the new date-level cases. For example, YEAR truncation of a March instant after the spring-forward transition in America/Los_Angeles produces a candidate at Jan 1, which is in PST (UTC-8), while the original is in PDT (UTC-7). The offsets differ, so the path falls back to the slow microsToDays / daysToMicros route, which uses ZonedDateTime.resolveLocal to land on Jan 1 00:00 PST.
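
The offsets in this example can be checked with plain java.time. The sketch below picks 2015-03-20 as an arbitrary instant after that year's spring-forward (2015-03-08); the object and value names are illustrative only.

```scala
import java.time.{Instant, LocalDate, LocalDateTime, ZoneId, ZoneOffset}

// Illustrative check of the DST guard for YEAR truncation in America/Los_Angeles.
object DstGuardExample {
  val zone: ZoneId = ZoneId.of("America/Los_Angeles")
  private val rules = zone.getRules

  // Original instant: 2015-03-20 12:00 local time, which falls in PDT.
  val original: Instant = LocalDateTime.of(2015, 3, 20, 12, 0).atZone(zone).toInstant
  val originalOffset: ZoneOffset = rules.getOffset(original) // -07:00

  // Candidate: Jan 1 00:00 interpreted with the ORIGINAL offset, i.e. 2015-01-01T07:00:00Z.
  val candidate: Instant = LocalDate.of(2015, 1, 1).atStartOfDay().toInstant(originalOffset)
  val candidateOffset: ZoneOffset = rules.getOffset(candidate) // -08:00

  // The guard compares the two offsets; a mismatch means "fall back".
  val guardFires: Boolean = originalOffset != candidateOffset

  // What the zone-aware route resolves to: Jan 1 00:00 PST, i.e. 2015-01-01T08:00:00Z.
  val resolved: Instant = LocalDate.of(2015, 1, 1).atStartOfDay(zone).toInstant
}
```

The candidate is one hour too early, which is exactly the error the guard prevents from being returned.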

This PR also rewrites TRUNC_TO_QUARTER from IsoFields.DAY_OF_QUARTER (a TemporalField whose adjustment produces a fresh LocalDate) to a direct withMonth(firstMonthOfQuarter).withDayOfMonth(1) chain on the existing LocalDate. This saves one allocation plus the field-adjustment overhead per call.
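
A small sketch of the two quarter-truncation styles side by side, to show that they agree. The method names are hypothetical, and the `firstMonthOfQuarter` formula is an assumption about how that value might be computed, not a quote from the patch.

```scala
import java.time.LocalDate
import java.time.temporal.IsoFields

// Hypothetical comparison of the two ways to find the first day of a quarter.
object QuarterTruncSketch {
  // Field-based style: set the day-of-quarter to 1.
  def viaIsoField(d: LocalDate): LocalDate =
    d.`with`(IsoFields.DAY_OF_QUARTER, 1L)

  // Direct style: months 1-3 -> 1, 4-6 -> 4, 7-9 -> 7, 10-12 -> 10.
  def viaWithMonth(d: LocalDate): LocalDate = {
    val firstMonthOfQuarter = ((d.getMonthValue - 1) / 3) * 3 + 1
    d.withMonth(firstMonthOfQuarter).withDayOfMonth(1)
  }
}
```

Note that `withMonth` clamps the day to the last valid day of the target month (for example May 31 becomes April 30) before `withDayOfMonth(1)` resets it, so the chain is safe for month-end dates.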

truncTimestampSlow covers every level explicitly so it serves as a self-contained reference implementation - the fast path's correctness can be verified against it case-by-case.

Why are the changes needed?

SPARK-33404 (Nov 2020) routed every date_trunc level above SECOND through microsToInstant().atZone(zoneId).truncatedTo(unit) for correctness, at a cost of roughly 5.5× in throughput per the follow-up benchmark PR (#30338). SPARK-56663 recovered most of that for MINUTE / HOUR / DAY using the offset-arithmetic + DST-guard pattern. This PR extends the same recovery to WEEK / MONTH / QUARTER / YEAR, the levels that drive monthly and quarterly aggregations in analytics workloads.

DateTimeBenchmark Truncation results, wholestage on, ns/row on a 12th Gen Intel i7-1260P (master = pre-SPARK-56663):

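| Level   | master baseline (ns/row) | This PR (ns/row) | Speedup |
|---------|--------------------------|------------------|---------|
| WEEK    | 165.2                    | 78.2             | 2.11×   |
| MONTH   | 181.9                    | 92.2             | 1.97×   |
| MM      | 182.2                    | 92.5             | 1.97×   |
| MON     | 182.9                    | 92.7             | 1.97×   |
| QUARTER | 216.8                    | 108.8            | 1.99×   |
| YEAR    | 205.2                    | 96.7             | 2.12×   |
| YYYY    | 205.8                    | 96.9             | 2.12×   |
| YY      | 206.3                    | 96.0             | 2.15×   |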

Time-level units (SECOND / MINUTE / HOUR / DAY) and trunc(date, ...) are unchanged within noise; the hot path for those levels is byte-identical to SPARK-56663 after the unification.

Does this PR introduce any user-facing change?

No. The output of date_trunc is identical to master in all cases, including DST-spanning truncations (ensured by the offset-equality guard and slow-path fallback, and checked by the new tests). Only the internal implementation changes.

How was this patch tested?

  • DateTimeUtilsSuite - all 66 tests pass, including:
    • SPARK-33404: test truncTimestamp when time zone offset from UTC has a granularity of seconds, extended to also exercise WEEK / MONTH / QUARTER / YEAR with the 1769-10-17 LMT timestamp across every available zone (the existing loop already covered SECOND/MILLI/MICRO; SPARK-56663 added HOUR/DAY; this PR completes the matrix).
    • The existing truncTimestamp test, which loops WEEK / MONTH / QUARTER / YEAR for 2015 timestamps across every zone.
    • New test truncTimestamp date-level units across DST boundaries - covers YEAR / QUARTER truncation that crosses the LA spring-forward (DST guard fires, fallback path runs) and MONTH truncation entirely within DST (fast path stays).
  • DateExpressionsSuite - all tests pass (no changes to expression-level code, only the underlying DateTimeUtils helpers).
  • DateTimeBenchmark re-run via the GitHub Actions Run benchmarks workflow on this fork for JDK 17, 21, and 25; results committed back to the branch.

Was this patch authored or co-authored using generative AI tooling?

Yes, co-authored with Claude Code.
