
[Enhancement] Fuse REX_EXTRACT calls that share (field, pattern) to a single Matcher invocation #5499

@RyanL1997

Problem

PPL rex with N named capture groups runs the regex matcher N times per row, even though all N groups can be filled from a single Matcher.find() result. The cost is structural to the current lowering, not a bug.

CalciteRelNodeVisitor.innerRex (core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:378-413) parses the pattern, finds all (?<name>...) groups, and emits one REX_EXTRACT(field, pattern, name_i) UDF call per group. Each call independently runs the matcher:

// RexExtractFunction.executeExtraction, core/.../udf/RexExtractFunction.java:121-138
Pattern compiledPattern = RegexCommonUtils.getCompiledPattern(pattern);  // cached
Matcher matcher = compiledPattern.matcher(text);
if (matcher.find()) { return extractor.apply(matcher); }

Pattern compilation is cached globally (RegexCommonUtils.getCompiledPattern at lines 48-55), so compilation isn't the cost. The cost is matcher.find() running N times over the same text per row. No common-subexpression elimination (CSE) applies: each REX_EXTRACT call differs in its third argument (groupName), so Calcite treats the calls as independent expressions.
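To make the cost concrete, here is a minimal standalone sketch using only java.util.regex. The class name, the counter, and the sample log line are illustrative assumptions, not code from the repository; extractGroup only mirrors the shape of the per-group call quoted above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RexFindCountDemo {
    // Counts find() invocations; a stand-in for per-row matcher cost.
    static int findCalls = 0;

    // Same shape as the per-group extraction above: a fresh Matcher and one find() per call.
    static String extractGroup(Pattern pattern, String text, String groupName) {
        Matcher matcher = pattern.matcher(text);
        findCalls++;
        return matcher.find() ? matcher.group(groupName) : null;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("user=(?<user>\\w+) status=(?<status>\\d+) bytes=(?<bytes>\\d+)");
        String row = "user=alice status=200 bytes=512";

        // Per-group lowering: three independent calls, so three find() runs on one row.
        String user = extractGroup(p, row, "user");
        String status = extractGroup(p, row, "status");
        String bytes = extractGroup(p, row, "bytes");
        System.out.println(findCalls + " find() calls: " + user + ", " + status + ", " + bytes);

        // A single find() already exposes every named group of the match.
        Matcher m = p.matcher(row);
        if (m.find()) {
            System.out.println("1 find() call: " + m.group("user") + ", "
                + m.group("status") + ", " + m.group("bytes"));
        }
    }
}
```

Both paths print the same three values; only the number of find() runs differs.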

The same applies to multiple sequential rex commands on the same field: each rex emits its own REX_EXTRACT calls; no fusion happens across commands either.

Concrete impact — a typical access-log analytics query with four sequential rex commands on the same field:

source=<index> | where match(body, '<keyword>')
| rex field=body "field_a=(?<field_a>[^\s]+)"
| rex field=body "field_b=(?<field_b>\d+)"
| rex field=body "field_c=\"(?<field_c>[^\"]+)\""
| rex field=body "field_d=\"(?<field_d>[^\"]+)\""
| ...

…runs matcher.find() 4× per row. Combining them into a single multi-group rex doesn't help (still 4 UDF calls, just with a more expensive combined pattern). On high-volume log indices this is a meaningful per-row cost multiplier.

Proposed fix

Add a fused UDF REX_EXTRACT_ALL(field, pattern) returning a MAP<VARCHAR, VARCHAR> (or a struct row) containing all named groups from one matcher invocation. Modify innerRex so that, when the pattern has ≥2 named groups, it emits a single REX_EXTRACT_ALL call and projects each named group as MAP_GET(rex_result, "name_i"). For single-group patterns, keep the current direct call (no map overhead).
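The core of such a fused function could look roughly like the sketch below. It is plain Java, not the Calcite UDF wiring. The class and method names are hypothetical, the group-name scan is a naive assumption (it would misread an escaped `\(?<x>` or a group-like sequence inside a character class), and returning an empty map on no match is also an assumption rather than the established REX_EXTRACT behavior.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RexExtractAllSketch {
    // Naive scan for (?<name> openers. Requiring a letter after '<' skips
    // lookbehinds such as (?<= and (?<!.
    private static final Pattern NAMED_GROUP =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    // Returns named-group names in declaration order.
    static List<String> namedGroups(String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = NAMED_GROUP.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Hypothetical body for REX_EXTRACT_ALL(field, pattern): one find(), all groups.
    // Assumption: an empty map signals "no match", so every lookup yields null.
    static Map<String, String> extractAll(String text, String regex) {
        Map<String, String> result = new LinkedHashMap<>();
        if (text == null) {
            return result;
        }
        // The real code would presumably reuse the cached compiled Pattern.
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.find()) {
            for (String name : namedGroups(regex)) {
                // group(name) is null when the group did not participate in the match.
                result.put(name, m.group(name));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> groups =
            extractAll("user=alice status=200", "user=(?<user>\\w+) status=(?<status>\\d+)");
        // Each projected column then becomes a plain map lookup.
        System.out.println(groups.get("user") + " " + groups.get("status"));
    }
}
```

On JDK 20 or later, Pattern.namedGroups() could replace the hand-rolled scan.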

For the multi-rex-on-same-field case (the query above), add a Calcite HEP rule that fuses adjacent REX_EXTRACT / REX_EXTRACT_ALL calls with identical (field, pattern) operands across consecutive projections. That handles the case where the user wrote four separate rex commands rather than one combined one.

The visitor change alone fixes single-rex multi-group; the HEP rule extends the fix to the more common multi-rex pattern in real queries.
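The grouping step behind such a rule can be illustrated without Calcite. The sketch below is a plain-Java model only: ExtractCall and fuse are hypothetical names, and nothing here uses the actual RelOptRule or HEP planner APIs. It shows the (field, pattern) key deciding which calls collapse into one; calls whose patterns differ stay separate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RexFusionSketch {
    // Hypothetical stand-in for one REX_EXTRACT(field, pattern, group) call in a projection.
    static final class ExtractCall {
        final String field;
        final String pattern;
        final String group;

        ExtractCall(String field, String pattern, String group) {
            this.field = field;
            this.pattern = pattern;
            this.group = group;
        }
    }

    // Groups calls by (field, pattern). Each key stands for one fused call
    // that serves every group name collected under it.
    static Map<List<String>, List<String>> fuse(List<ExtractCall> calls) {
        Map<List<String>, List<String>> fused = new LinkedHashMap<>();
        for (ExtractCall c : calls) {
            fused.computeIfAbsent(List.of(c.field, c.pattern), k -> new ArrayList<>())
                 .add(c.group);
        }
        return fused;
    }

    public static void main(String[] args) {
        List<ExtractCall> calls = List.of(
            new ExtractCall("body", "a=(?<a>\\d+) b=(?<b>\\d+)", "a"),
            new ExtractCall("body", "a=(?<a>\\d+) b=(?<b>\\d+)", "b"),
            new ExtractCall("body", "c=(?<c>\\w+)", "c"));

        // Two keys remain: the shared pattern serves groups a and b, the other serves c.
        fuse(calls).forEach((key, groups) ->
            System.out.println(key.get(0) + " / " + key.get(1) + " -> " + groups));
    }
}
```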

Files to touch

  • New UDF: core/src/main/java/org/opensearch/sql/expression/function/udf/RexExtractAllFunction.java
  • Registration: core/src/main/java/org/opensearch/sql/expression/function/PPLFuncImpTable.java and BuiltinFunctionName.java
  • Visitor: core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java:391-413 (multi-group branch emits one call + map projections)
  • Optional HEP rule: core/src/main/java/org/opensearch/sql/calcite/plan/rule/RexExtractFusionRule.java, register in HEP_PROGRAM

Verification

  • CalcitePPLRexTest cases covering 1-group, 2-group, and 3-group patterns; verify the lowered plan contains 1 REX_EXTRACT_ALL (not N REX_EXTRACTs) for ≥2 groups.
  • Integration test on TEST_INDEX_BANK (or similar) with a multi-group pattern, asserting result equivalence with the current behavior.
  • Microbenchmark (or rough timing) on a match-filtered index showing per-row cost flat in N (the number of named groups) instead of linear.
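For the rough-timing option, a throwaway harness along these lines might be enough to see the trend. It is a sketch under stated assumptions: it runs outside the query engine, the pattern and row text are made up, and System.nanoTime timing is indicative only (a JMH benchmark would be the more rigorous choice).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RexTimingSketch {
    // Per-group path: a fresh Matcher and one find() per requested group.
    // The summed lengths act as a checksum so the work cannot be optimized away.
    static long perGroup(Pattern p, String text, String[] groups, int rows) {
        long checksum = 0;
        for (int i = 0; i < rows; i++) {
            for (String g : groups) {
                Matcher m = p.matcher(text);
                if (m.find()) {
                    checksum += m.group(g).length();
                }
            }
        }
        return checksum;
    }

    // Fused path: one find() per row, then every group is read from that match.
    static long fused(Pattern p, String text, String[] groups, int rows) {
        long checksum = 0;
        for (int i = 0; i < rows; i++) {
            Matcher m = p.matcher(text);
            if (m.find()) {
                for (String g : groups) {
                    checksum += m.group(g).length();
                }
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("a=(?<a>\\d+) b=(?<b>\\d+) c=(?<c>\\d+)");
        String[] groups = {"a", "b", "c"};
        String text = "a=1 b=22 c=333";
        int rows = 1_000_000;

        // Warm up both paths so the JIT has compiled them before timing.
        perGroup(p, text, groups, rows);
        fused(p, text, groups, rows);

        long t0 = System.nanoTime();
        long c1 = perGroup(p, text, groups, rows);
        long t1 = System.nanoTime();
        long c2 = fused(p, text, groups, rows);
        long t2 = System.nanoTime();

        System.out.printf("per-group: %d ms, fused: %d ms, checksums equal: %b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, c1 == c2);
    }
}
```

Repeating the run with patterns of 1, 2, and 3 groups would show whether the per-group time grows with N while the fused time stays roughly flat.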

Out of scope

  • The change doesn't alter public rex syntax or semantics — same input, same output, fewer matcher invocations.
  • Ingest-time extraction via grok/dissect is the broader perf recommendation for users but orthogonal to this code change.

Labels: PPL, performance
