Skip to content

chore: add validation for TransparentHugePage config#8015

Open
mxj220 wants to merge 8 commits intomainfrom
markibrahim/thp-config-validate
Open

chore: add validation for TransparentHugePage config#8015
mxj220 wants to merge 8 commits intomainfrom
markibrahim/thp-config-validate

Conversation

@mxj220
Copy link
Contributor

@mxj220 mxj220 commented Mar 4, 2026

What this PR does / why we need it:
Added validation for TransparentHugePage config to ensure that legal values are used.
Which issue(s) this PR fixes:

Fixes #

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds validation for CustomLinuxOSConfig Transparent Huge Pages (THP) settings to ensure only kernel-supported values are accepted before generating Linux node bootstrapping artifacts.

Changes:

  • Introduces constants for valid THP enabled and defrag values and adds (*CustomLinuxOSConfig).ValidateTHPConfig() validation.
  • Wires THP validation into the Linux path of GetNodeBootstrapping.
  • Adds unit tests for ValidateTHPConfig().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pkg/agent/datamodel/types.go Adds THP allowed-value constants and ValidateTHPConfig() to validate transparentHugePageEnabled/Defrag.
pkg/agent/datamodel/types_test.go Adds table-driven tests covering valid/invalid THP configurations.
pkg/agent/bakerapi.go Calls THP validation before Linux bootstrapping generation.

Copilot AI review requested due to automatic review settings March 4, 2026 22:09
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

The latest Buf updates on your PR. Results from workflow Buf CI / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMar 6, 2026, 11:05 PM

Copilot AI review requested due to automatic review settings March 5, 2026 17:18
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

"never": true,
}

if v := osConfig.GetTransparentHugepageSupport(); v != "" && !validEnabled[strings.ToLower(v)] {
Copy link
Contributor

@cameronmeissner cameronmeissner Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empty string check is redundant since validEnabled[strings.ToLower(v)] would resolve to false regardless if v is empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically, if I didn't skip on empty string, then an empty string would throw the warning instead of proceeding silently. May be unnecessarily concerning to a user who intentionally left the config blank

Copy link
Collaborator

@Devinwong Devinwong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

if config.AgentPoolProfile.IsWindows() {
validateAndSetWindowsNodeBootstrappingConfiguration(config)
} else {
config.AgentPoolProfile.CustomLinuxOSConfig.ValidateTHPConfig()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might rename this to ValidateAndDefaultTHPConfig(), since this defaults the respective settings to empty strings if the input is invalid

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this applies here and in aks-node-controller)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to this rename, but the empty strings are also just for best practices. The original code leaves the invalid string in the config but it doesn't do anything, which causes the default to happen anyway.

osConfig.TransparentHugepageSupport = ""
}
if v := osConfig.GetTransparentDefrag(); v != "" && !validDefrag[strings.ToLower(v)] {
log.Printf("WARNING: invalid transparent_defrag value %q, must be one of: always, defer, defer+madvise, madvise, never; resetting to default", v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure log is the correct mechanism by which we should be logging these messages out - @r2k1 @Devinwong can clarify maybe

// ValidateTHPConfig checks that TransparentHugepageSupport and TransparentDefrag
// contain valid kernel values when set. Invalid values are logged as warnings
// and reset to empty (kernel default).
func ValidateTHPConfig(cfg *aksnodeconfigv1.Configuration) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the actual intended behavior? if this is a customer-exposed setting, wouldn't we want RP to return a 4xx if they specify an invalid value, rather than falling back to empty for TransparentHugepageSupport and GetTransparentDefrag? for that to happen, this validation would need to happen within the RP rather than here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Tim:

"I'm in two minds about this. If we abort with an error then any user doing something strange in prod right now will suddenly have failing nodepool scale operations. Which could cause a sev2 or more depending on the Cx. But if we silently ignore the value, then customer's clusters might regress."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In its current state, our code writes the invalid user string to the file, i.e. "false." The user intended "false" to disable THP, but since it's the wrong key word, it does nothing, which causes the default behavior ("always") to occur.

By clearing the user string, I'm not changing the code's existing behavior, I'm just cleaning it up a bit.

Copilot AI review requested due to automatic review settings March 6, 2026 22:50
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants