
Conversation

@jtuglu1
Contributor

@jtuglu1 jtuglu1 commented Jan 22, 2026

Description

  1. Doing some more digging, I found another unfortunate data difference between native batch (on-cluster) and Hadoop batch ingestion. A multi-value string ["a","b",null] ingested with Hadoop is treated as ["a","b","null"], while native batch correctly ingests it as ["a","b",null]. This difference appears to be a bug in all Druid versions (including the latest). While it will not affect the current null handling migration, it will affect the future Hadoop -> native batch ingestion migration that will also need to take place. See the sketch after this list.
  2. Hadoop doesn't allow for all-null columns in segments; it simply excludes them from the segment. I've updated the Hadoop job to support running with druid.indexer.task.storeEmptyColumns=true, which allows us to store all-NULL columns (how native/streaming ingestion works today).
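
To make item 1 concrete, here is a minimal, self-contained sketch (plain Java, not Druid code) of the difference: mapping list elements through String::valueOf turns a real null into the four-character string "null", while a null-preserving mapping (analogous to what the Evals::asString change in this PR is meant to achieve) keeps it as an actual null.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MultiValueNullSketch
{
  public static void main(String[] args)
  {
    List<Object> input = Arrays.asList("a", "b", null);

    // Old Hadoop path: String::valueOf coerces null to the literal string "null".
    List<String> coerced = input.stream()
        .map(String::valueOf)
        .collect(Collectors.toList());
    // coerced is ["a", "b", "null"] -- the last element is a real 4-character string.

    // Null-preserving mapping, matching what native batch ingestion produces.
    List<String> preserved = input.stream()
        .map(v -> v == null ? null : v.toString())
        .collect(Collectors.toList());
    // preserved is ["a", "b", null] -- the last element is an actual null.

    System.out.println(coerced.get(2));            // the string "null"
    System.out.println(preserved.get(2));          // an actual null reference
    System.out.println(coerced.equals(preserved)); // false
  }
}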

Related to:

Release note

Fix Hadoop null value handling to match native batch and allow v10 segment creation.

BREAKING CHANGES
  1. Hadoop ingestion will now process multi-value string inputs like ["a","b",null] -> ["a","b",null] instead of ["a","b","null"], to match native batch ingestion.
  2. Hadoop ingestion will now keep columns with all NULL values by default, instead of excluding them from the segment.
  3. The useStringValueOfNullInLists parameter in RowBasedColumnSelectorFactory.java has been removed.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from d271c47 to b1313f3 on January 22, 2026 23:23
@jtuglu1 jtuglu1 requested a review from clintropolis January 22, 2026 23:23
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from 807fe75 to 788fd25, on January 23, 2026 00:54
@jtuglu1 jtuglu1 requested a review from maytasm January 23, 2026 01:27
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 788fd25 to 12073e9 on January 23, 2026 01:51
Contributor

@gianm gianm left a comment


Seems like a good change to me. This behavior was reverted in #15190, but the additional changes in this patch look like they would deal with the original problem. Have you run this through a real-world test to confirm that everything works as expected?

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@gianm Another thing I discovered while investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default), whereas native batch in the latest version does. This is the key difference that showed up in the segment diff. #12279. Do you know why this is?

For example, if you were to ingest a [null] multi-value string, native batch would correctly store it as null, whereas Hadoop would not even include the column in the segment, which causes headaches when switching from Hadoop to native batch.

@gianm
Contributor

gianm commented Jan 23, 2026

> @gianm Another thing I discovered while investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default), whereas native batch in the latest version does. This is the key difference that showed up in the segment diff. #12279. Do you know why this is?

There's a comment in IndexMergerV9 that says Hadoop indexing uses a constructor that doesn't support the storeEmptyColumns configuration yet. In that spot it's hard-coded to false. I suppose it would make more sense for that to be hard-coded to TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS, i.e., true. It would make even more sense for it to respect the actual configuration, which would mean switching to some logic like the TaskToolboxFactory uses:

config.buildV10()
    ? indexMergerV10Factory.create()
    : indexMergerV9Factory.create(
        task.getContextValue(Tasks.STORE_EMPTY_COLUMNS_KEY, config.isStoreEmptyColumns())
    )

Rather than injecting the IndexMergerV9 directly.
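
For concreteness, here is a hedged, self-contained sketch of that selection pattern. The class and type names below are illustrative stand-ins, not the actual Druid classes; the point is only the shape of the logic: choose the merger at task time, and let a per-task context value override the cluster-wide storeEmptyColumns default instead of injecting a pre-built IndexMergerV9 with the flag hard-coded to false.

import java.util.Map;

// Illustrative only: these types stand in for the real Druid factories/config.
public class MergerSelectionSketch
{
  interface Merger {}
  record V9Merger(boolean storeEmptyColumns) implements Merger {}
  record V10Merger() implements Merger {}

  static Merger selectMerger(Map<String, Object> taskContext, boolean buildV10, boolean defaultStoreEmptyColumns)
  {
    if (buildV10) {
      return new V10Merger();
    }
    // Per-task context override wins; otherwise fall back to the system default
    // (druid.indexer.task.storeEmptyColumns). "storeEmptyColumns" is assumed to
    // be the context key here; in Druid it lives in Tasks.STORE_EMPTY_COLUMNS_KEY.
    Object override = taskContext.get("storeEmptyColumns");
    boolean storeEmptyColumns = override != null ? (Boolean) override : defaultStoreEmptyColumns;
    return new V9Merger(storeEmptyColumns);
  }

  public static void main(String[] args)
  {
    Map<String, Object> defaultContext = Map.of();
    Map<String, Object> overrideContext = Map.of("storeEmptyColumns", false);
    System.out.println(selectMerger(defaultContext, false, true));   // V9Merger[storeEmptyColumns=true]
    System.out.println(selectMerger(overrideContext, false, true));  // V9Merger[storeEmptyColumns=false]
  }
}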

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@gianm thanks, that's what I thought. I guess I just wanted to make sure there wasn't any critical piece of null merging that was missing/incompatible with Hadoop for both V9/V10.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from e29849f to 9b5c25d, on January 23, 2026 18:08
@jtuglu1 jtuglu1 requested a review from gianm January 23, 2026 18:15
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 9b5c25d to cd6b5eb on January 23, 2026 18:32
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from cd6b5eb to 7f66e89 on January 23, 2026 21:07
  } else if (inputValue instanceof List) {
    // guava's toString function fails on null objects, so please do not use it
-   return ((List<?>) inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+   return ((List<?>) inputValue).stream().map(Evals::asString).collect(Collectors.toList());
Member

I think this is a good change because the old behavior was wack, but I'm still tracing through to try to determine the actual impacts of this change.

Besides the Hadoop impact, which you fix in this PR, this method seems like it will mostly impact callers of Row.getDimension as well as the toGroupKey method of this class, since it calls getDimension.

Luckily, there are a relatively small number of 'production' callers of these methods.

Row.getDimension:
[screenshot of Row.getDimension callers]

Rows.toGroupKey:
[screenshot of Rows.toGroupKey callers]

These look mostly related to partitioning. I think we need to determine whether the null -> 'null' coercion is important for these callers, and if so, do the coercion there. I'm uncertain currently but will keep trying to figure it out.
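
As a hedged illustration of why those callers matter (plain Java, not Druid code): once getDimension stops coercing null to the string "null", anything that compares or hashes the returned dimension values, for example hash-based partitioning built on a group key, sees different values for rows that contain real nulls.

import java.util.Arrays;
import java.util.List;

public class GroupKeyImpactSketch
{
  public static void main(String[] args)
  {
    List<String> oldDimValues = Arrays.asList("a", "b", "null"); // old coercion
    List<String> newDimValues = Arrays.asList("a", "b", null);   // nulls preserved

    // Element-wise comparison: the string "null" is not equal to an actual null.
    System.out.println(oldDimValues.equals(newDimValues)); // false

    // Hash codes differ as well, so hash-based partition keys built from these
    // values would route the same input row differently before and after.
    System.out.println(oldDimValues.hashCode() == newDimValues.hashCode()); // false
  }
}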

Contributor Author

@clintropolis I can scope this PR to Hadoop only by creating a separate implementation of the Rows.* methods, if that would help?

Member

It feels worth figuring this out, since it's quite odd that the old code was doing this at this layer, so I want to keep looking. It would probably be fine, though, if we can't figure it out.

Contributor Author

I've summarized the breaking changes in the release notes section.

Contributor Author

I've gone through the callers of these methods and added extra tests where I could. From what I can tell, things should work.

@jtuglu1
Contributor Author

jtuglu1 commented Jan 23, 2026

@clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

@gianm
Contributor

gianm commented Jan 24, 2026

> @clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

I do not know. It's the DimensionsSpec parameter, but I don't recall what happens when that isn't provided.

@jtuglu1
Contributor Author

jtuglu1 commented Jan 26, 2026

@gianm @clintropolis Looks like with this change there's a single-byte difference between native- and Hadoop-outputted segments when handling the multi-value string [null, "a", "b"]: Hadoop properly encodes the dictionary cardinality as 3, whereas native seems to incorrectly encode it as 2. It doesn't show up in queries, but I believe the difference is in VSizeColumnarMultiInts.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 1dd64b9 to 8086fcd on January 28, 2026 08:39
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from b7e8f58 to ace1b33 on January 29, 2026 20:39
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from ace1b33 to a511163 on January 29, 2026 20:44
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from a511163 to 60d72b9 on January 29, 2026 20:47
@jtuglu1 jtuglu1 requested a review from clintropolis January 29, 2026 23:19
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from ae7efba to 85106f8 on January 30, 2026 02:40
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 85106f8 to 066a6b4 on January 30, 2026 03:37
@jtuglu1 jtuglu1 merged commit e889201 into apache:master Jan 30, 2026
45 of 46 checks passed