Describe the problem
Profiling with async-profiler on native execution shows a significant amount of time spent in metric reporting. This happens on the per-batch hot path, so the cost is paid for every batch returned from DataFusion back to the JVM.
Where the cost comes from
`update_comet_metric` is called from `jni_api.rs`:
- Once for every batch returned to the JVM (`executePlan` fast path)
- Periodically during `ScanExec` polling (every ~100 polls, gated by `metrics_update_interval`)
- At stream end
For each call, the pipeline does the following:
Rust side (`native/core/src/execution/metrics/utils.rs`):
- `to_native_metric_node` recursively walks the `SparkPlan` tree.
- Per node: allocates a fresh `HashMap<String, i64>`, inserts each metric with `name.to_string()` (a per-metric allocation), and pushes the children into a new `Vec`.
- Serializes the resulting `NativeMetricNode` protobuf via prost's `encode_to_vec()`: allocates a `Vec<u8>`.
- `byte_array_from_slice` allocates a new Java `byte[]` and copies the protobuf bytes in.
- One JNI upcall to `set_all_from_bytes([B)V`.
JVM side (`CometMetricNode.scala`):
- `Metric.NativeMetricNode.parseFrom(bytes)`: full protobuf tree re-allocation.
- Recursive walk pairing the deserialized tree with the local `CometMetricNode` tree via `zip(children)`.
- Per metric: `metrics.get(metricName)` lookup on the node's `Map[String, SQLMetric]`, then `SQLMetric.set(v)`.
So every update cycle includes: a tree of per-node `HashMap` allocations, per-metric `String` allocations, protobuf serialization, a Java `byte[]` allocation and copy, protobuf deserialization into a fresh tree, and a per-metric hash-map lookup on the JVM side. None of this carries state across cycles, even though the tree shape and metric names are stable for the lifetime of the plan.
Impact
Visible in async-profiler flame graphs as a disproportionate share of native-thread CPU during query execution, especially for plans with many nodes and small-to-medium batch sizes (where update frequency is high relative to per-batch work).