
Conversation

@jtuglu1
Contributor

@jtuglu1 jtuglu1 commented Jan 22, 2026

Description

  1. Doing some more digging, I found another unfortunate data difference between native batch (on-cluster) and Hadoop batch ingestion. A multi-value string ["a","b",null] ingested with Hadoop is treated as ["a","b","null"], while native batch correctly ingests it as ["a","b",null]. This difference appears to be a bug in all Druid versions (including the latest). While it will not affect the current null handling migration, it will affect the future Hadoop -> native batch ingestion migration that will also need to take place. See the sketch after this list.
  2. Hadoop doesn't allow for all-null columns in segments; it simply excludes them from the segment. I've updated the Hadoop job to support running with druid.indexer.task.storeEmptyColumns=true, which allows us to store all-NULL columns (how native/streaming ingestion works today).
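
To make item 1 concrete, here is a minimal, self-contained sketch (plain Java, not Druid code) of the difference: mapping list elements through String::valueOf turns a real null into the four-character string "null", while a null-preserving mapping (analogous to what the Evals::asString change in this PR is meant to achieve) keeps it as an actual null.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MultiValueNullSketch
{
  public static void main(String[] args)
  {
    List<Object> input = Arrays.asList("a", "b", null);

    // Old Hadoop path: String::valueOf coerces null to the literal string "null".
    List<String> coerced = input.stream()
        .map(String::valueOf)
        .collect(Collectors.toList());
    // coerced is ["a", "b", "null"] -- the last element is a real 4-character string.

    // Null-preserving mapping, matching what native batch ingestion produces.
    List<String> preserved = input.stream()
        .map(v -> v == null ? null : v.toString())
        .collect(Collectors.toList());
    // preserved is ["a", "b", null] -- the last element is an actual null.

    System.out.println(coerced.get(2));            // the string "null"
    System.out.println(preserved.get(2));          // an actual null reference
    System.out.println(coerced.equals(preserved)); // false
  }
}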

Related to:

Release note

Fix Hadoop null value handling to match native batch and allow v10 segment creation.

BREAKING CHANGES
  1. Hadoop ingestion will now process multi-value string inputs like ["a","b",null] -> ["a","b",null] instead of ["a","b","null"], to match native batch ingestion.
  2. Hadoop ingestion will now keep columns with all NULL values by default, instead of excluding them from the segment.
  3. The useStringValueOfNullInLists parameter in RowBasedColumnSelectorFactory.java has been removed.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from d271c47 to b1313f3 on January 22, 2026 23:23
@jtuglu1 jtuglu1 requested a review from clintropolis January 22, 2026 23:23
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from 807fe75 to 788fd25, on January 23, 2026 00:54
@jtuglu1 jtuglu1 requested a review from maytasm January 23, 2026 01:27
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 788fd25 to 12073e9 on January 23, 2026 01:51
Contributor

@gianm gianm left a comment


Seems like a good change to me. This behavior was reverted in #15190, but the additional changes in this patch look like they would deal with the original problem. Have you run this through a real-world test to confirm that everything works as expected?

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@gianm Another thing I discovered while investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default), whereas native batch in the latest version does. This is the key difference that showed up in the segment diff. #12279. Do you know why this is?

For example, if you were to ingest a [null] multi-value string, native batch would correctly store it as null, whereas Hadoop would not even include the column in the segment, which causes headaches when switching from Hadoop to native batch.

@gianm
Contributor

gianm commented Jan 23, 2026

> @gianm Another thing I discovered while investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default), whereas native batch in the latest version does. This is the key difference that showed up in the segment diff. #12279. Do you know why this is?

There's a comment in IndexMergerV9 that says Hadoop indexing uses a constructor that doesn't support the storeEmptyColumns configuration yet. In that spot it's hard-coded to false. I suppose it would make more sense for that to be hard-coded to TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS, i.e., true. It would make even more sense for it to respect the actual configuration, which would mean switching to some logic like the TaskToolboxFactory uses:

config.buildV10()
    ? indexMergerV10Factory.create()
    : indexMergerV9Factory.create(
        task.getContextValue(Tasks.STORE_EMPTY_COLUMNS_KEY, config.isStoreEmptyColumns())
    )

Rather than injecting the IndexMergerV9 directly.
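
For concreteness, here is a hedged, self-contained sketch of that selection pattern. The class and type names below are illustrative stand-ins, not the actual Druid classes; the point is only the shape of the logic: choose the merger at task time, and let a per-task context value override the cluster-wide storeEmptyColumns default instead of injecting a pre-built IndexMergerV9 with the flag hard-coded to false.

import java.util.Map;

// Illustrative only: these types stand in for the real Druid factories/config.
public class MergerSelectionSketch
{
  interface Merger {}
  record V9Merger(boolean storeEmptyColumns) implements Merger {}
  record V10Merger() implements Merger {}

  static Merger selectMerger(Map<String, Object> taskContext, boolean buildV10, boolean defaultStoreEmptyColumns)
  {
    if (buildV10) {
      return new V10Merger();
    }
    // Per-task context override wins; otherwise fall back to the system default
    // (druid.indexer.task.storeEmptyColumns). "storeEmptyColumns" is assumed to
    // be the context key here; in Druid it lives in Tasks.STORE_EMPTY_COLUMNS_KEY.
    Object override = taskContext.get("storeEmptyColumns");
    boolean storeEmptyColumns = override != null ? (Boolean) override : defaultStoreEmptyColumns;
    return new V9Merger(storeEmptyColumns);
  }

  public static void main(String[] args)
  {
    Map<String, Object> defaultContext = Map.of();
    Map<String, Object> overrideContext = Map.of("storeEmptyColumns", false);
    System.out.println(selectMerger(defaultContext, false, true));   // V9Merger[storeEmptyColumns=true]
    System.out.println(selectMerger(overrideContext, false, true));  // V9Merger[storeEmptyColumns=false]
  }
}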

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@gianm thanks, that's what I thought. I guess I just wanted to make sure there wasn't any critical piece of null merging that was missing/incompatible with Hadoop for both V9/V10.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from e29849f to 9b5c25d, on January 23, 2026 18:08
@jtuglu1 jtuglu1 requested a review from gianm January 23, 2026 18:15
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 9b5c25d to cd6b5eb on January 23, 2026 18:32
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from cd6b5eb to 7f66e89 on January 23, 2026 21:07
  } else if (inputValue instanceof List) {
    // guava's toString function fails on null objects, so please do not use it
-   return ((List<?>) inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+   return ((List<?>) inputValue).stream().map(Evals::asString).collect(Collectors.toList());
Member

I think this is a good change because the old behavior was wack, but I'm still tracing through to try to determine the actual impacts of this change.

Besides the Hadoop impact, which you fix in this PR, this method seems like it will mostly impact callers of Row.getDimension as well as the toGroupKey method of this class, since it calls getDimension.

Luckily, there are a relatively small number of 'production' callers of these methods.

Row.getDimension:
[screenshot of Row.getDimension callers]

Rows.toGroupKey:
[screenshot of Rows.toGroupKey callers]

These look mostly related to partitioning. I think we need to determine whether the null -> 'null' coercion is important for these callers, and if so, do the coercion there. I'm uncertain currently but will keep trying to figure it out.
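
As a hedged illustration of why those callers matter (plain Java, not Druid code): once getDimension stops coercing null to the string "null", anything that compares or hashes the returned dimension values, for example hash-based partitioning built on a group key, sees different values for rows that contain real nulls.

import java.util.Arrays;
import java.util.List;

public class GroupKeyImpactSketch
{
  public static void main(String[] args)
  {
    List<String> oldDimValues = Arrays.asList("a", "b", "null"); // old coercion
    List<String> newDimValues = Arrays.asList("a", "b", null);   // nulls preserved

    // Element-wise comparison: the string "null" is not equal to an actual null.
    System.out.println(oldDimValues.equals(newDimValues)); // false

    // Hash codes differ as well, so hash-based partition keys built from these
    // values would route the same input row differently before and after.
    System.out.println(oldDimValues.hashCode() == newDimValues.hashCode()); // false
  }
}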

Contributor Author

@clintropolis I can scope this PR to Hadoop only by creating a separate implementation of the Rows.* methods, if that would help?

Member

It feels worth figuring this out, since it's quite odd that the old code was doing this at this layer, so I want to keep looking. It would probably be fine, though, if we can't figure it out.

Contributor Author

I've summarized the breaking changes in the release notes section.

Contributor Author

I've gone through the callers of these methods and added extra tests where I could. From what I can tell, things should work.

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

@gianm
Contributor

gianm commented Jan 24, 2026

> @clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

I do not know. It's the DimensionsSpec parameter, but I don't recall what happens when that isn't provided.

@jtuglu1
Contributor Author

jtuglu1 commented Jan 26, 2026

@gianm @clintropolis Looks like with this change there's a single-byte difference between native- and Hadoop-outputted segments when handling the multi-value string [null, "a", "b"]: Hadoop properly encodes the dictionary cardinality as 3, whereas native seems to incorrectly encode it as 2. It doesn't show up in queries, but I believe the difference is in VSizeColumnarMultiInts.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 1dd64b9 to 8086fcd on January 28, 2026 08:39
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from b7e8f58 to ace1b33 on January 29, 2026 20:39
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from ace1b33 to a511163 on January 29, 2026 20:44
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from a511163 to 60d72b9 on January 29, 2026 20:47
@jtuglu1 jtuglu1 requested a review from clintropolis January 29, 2026 23:19
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from ae7efba to 85106f8 on January 30, 2026 02:40
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 85106f8 to 066a6b4 on January 30, 2026 03:37
@jtuglu1 jtuglu1 merged commit e889201 into apache:master Jan 30, 2026
45 of 46 checks passed